Bike sharing schemes are an important component of the public transport systems of many cities around the world. London in particular has been using a docking-based scheme since 2010, and has also seen the rise (and partial fall) of more experimental free-floating bike schemes. Despite their success, these schemes still struggle with a few common challenges. For example, in the mornings these networks typically experience a net flow of bikes from residential to commercial areas, with the reverse flow in the evening. This leads to a lack of available bikes and, in the case of docking-based schemes, a lack of free docking stations. A variety of strategies have been used to alleviate this issue; for example, TfL employs a fleet of private vans to manually redistribute the bikes. However, due to the sheer volume of rides taken in London at peak times, the TfL network still regularly suffers from this problem.
The TfL redistribution strategy relies on an algorithm that works out how and when bikes should be redistributed. This task is an extension of the classic Travelling Salesman Problem, with the added complication that the network evolves dynamically in a way that depends on many variables: time of day, day of week, month of year, weather, large social events, disruption to other forms of transport, and so on. The algorithm must also coordinate a fleet of vans collaboratively. To tackle this difficult challenge we used a branch of Machine Learning called Reinforcement Learning, which operates in a paradigm consisting of an environment (the bike network) and an agent (the machine learning model). The agent learns to optimise some performance metric by interacting with many copies of a simulation of the environment. Once trained, the algorithm can deliver real-time commands to the van drivers to optimise the redistribution of bikes, and can respond to unexpected changes.
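To make the agent-environment paradigm concrete, here is a minimal sketch of the interaction loop in Python. Everything in it is illustrative rather than TfL's actual system: `BikeNetworkEnv` is a hypothetical toy simulator with a gym-style `reset`/`step` interface, the demand model is random, and the agent is a placeholder policy that a trained model would replace.

```python
import random


class BikeNetworkEnv:
    """Toy stand-in for a bike-network simulator (hypothetical, not TfL's).

    State: number of bikes at each station. Action: move one bike from a
    source station to a destination station.
    """

    def __init__(self, n_stations=5, capacity=10):
        self.n_stations = n_stations
        self.capacity = capacity
        self.state = None

    def reset(self):
        self.state = [random.randint(0, self.capacity) for _ in range(self.n_stations)]
        return list(self.state)

    def step(self, action):
        src, dst = action
        # Apply the redistribution action if it is feasible.
        if src != dst and self.state[src] > 0 and self.state[dst] < self.capacity:
            self.state[src] -= 1
            self.state[dst] += 1
        # Crude demand model: one random rider moves a bike between stations.
        rider_from, rider_to = random.sample(range(self.n_stations), 2)
        if self.state[rider_from] > 0 and self.state[rider_to] < self.capacity:
            self.state[rider_from] -= 1
            self.state[rider_to] += 1
        # Reward: penalise stations that are empty or full (unserved demand).
        reward = -sum(1 for s in self.state if s == 0 or s == self.capacity)
        return list(self.state), reward, False


def random_policy(state, n_stations):
    """Placeholder agent: a trained policy would map state -> action."""
    return tuple(random.sample(range(n_stations), 2))


env = BikeNetworkEnv()
state = env.reset()
total_reward = 0
for t in range(100):
    action = random_policy(state, env.n_stations)
    state, reward, _ = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```

In training, many copies of such an environment run in parallel and the agent's policy is updated to maximise the episode return; the random policy above only marks where that learned policy would plug in.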
Good candidates for a Reinforcement Learning approach are problems where:
1. you can design a realistic simulation of the environment
2. your actions have a delayed response, i.e. the positive outcome of an action, or series of actions, is not immediate but emerges from the correct combination and timing of actions (the sketch after the examples below illustrates this).
Good examples of this are self-driving cars, traffic light control, trading on the stock exchange, bidding in auctions, games, and resource management.
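The second criterion is worth making concrete. Reinforcement Learning typically handles delayed outcomes by crediting earlier actions with a discounted share of later rewards. The snippet below is a small illustrative sketch, not part of the TfL work: the discount factor and the toy reward sequence are arbitrary choices, used only to show how a single reward at the end of a trajectory propagates back to the earlier steps that set it up.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for each step of a trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy trajectory: ten actions with no immediate payoff, then one success.
# Earlier actions still receive credit, scaled down by how far away the
# reward is -- this is how correct combinations and timing get reinforced.
rewards = [0.0] * 10 + [1.0]
print([round(g, 3) for g in discounted_returns(rewards)])
# -> [0.904, 0.914, 0.923, ..., 0.99, 1.0]
```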