Sometime last year, I stumbled upon a paper while I was trying to come up with a really basic way to implement a budget and expenditure planner using an RL agent.

A bit of background

A typical reinforcement learning setting is one where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps $t$.

At each time step $t$, the agent receives a state $S_t$ and selects an action $a_t$ from the set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $S_t$ to actions $a_t$. After taking the action, the agent receives the next state $S_{t+1}$ and a scalar reward $r_t$.

Put more simply, reinforcement learning is about learning the most rewarding behaviour in an environment, so that the agent can make the best decision at each step.

The above process continues until the agent reaches a terminal state, and then it restarts, with each episode building on what was learned in the previous ones. The goal of the agent is to maximise the expected return (the cumulative reward it collects) from each state $S_t$.
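
To make that loop concrete, here is a minimal sketch in Python under my own toy assumptions: the ToyEnv class, its reset and step methods, and the random_policy function are placeholders I made up for illustration, not anything from the paper.

```python
import random

class ToyEnv:
    """A toy environment: the agent starts at position 0 and must reach position 5."""

    def reset(self):
        self.position = 0
        return self.position                    # initial state S_0

    def step(self, action):
        self.position += 1 if action == "right" else -1
        done = self.position >= 5               # terminal state reached?
        reward = 1.0 if done else -0.1          # scalar reward r_t
        return self.position, reward, done      # next state S_{t+1}, reward, terminal flag


def random_policy(state):
    """A stand-in policy pi: maps a state S_t to an action a_t."""
    return random.choice(["left", "right"])


env = ToyEnv()
state = env.reset()
episode_return, done = 0.0, False

# One episode: interact until a terminal state is reached (capped at 200 steps).
for t in range(200):
    action = random_policy(state)            # a_t chosen by the policy at S_t
    state, reward, done = env.step(action)   # observe S_{t+1} and r_t
    episode_return += reward
    if done:
        break

print("return collected in this episode:", episode_return)
```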

Unimportant extra info

Apart from the agent and the environment, there are three other main elements in a reinforcement learning system: a policy, a reward signal, and a value function (some formulations also add an optional fourth, a model of the environment).


Now, back to the paper I stumbled upon.

This paper, titled β€œAsynchronous Methods for Deep Reinforcement Learning”, was published in 2016. I found it really rad (a subtle hint that it is a tad complicated rather than straightforward).

It proposed that asynchronously executing parallel learner agents could stabilise deep neural network training, and then went on to discuss several ways to achieve this asynchrony in deep reinforcement learning.

It worked!

What stood out for me was the problem they actually solved. Their approach works very well, and it is now something many modern RL solutions are built upon.

Since the beginnings of what we know reinforcement learning to be, a number of algorithms have been proposed over the years, and they have had their great runs. Yet it was initially thought, and almost universally believed, that combining RL algorithms with deep neural networks was fundamentally unstable.

Several approaches had been proposed to stabilise these algorithms when paired with deep neural networks. Their common idea was to use an experience replay memory to store the agent’s data so that it could be batched (batching usually does save the day, but maybe not this time) or randomly sampled from distinct time steps.
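
As a rough illustration of the replay idea (a sketch of the general technique, not the exact mechanism of the papers mentioned), a replay memory can be as simple as a bounded buffer that stores transitions and hands back randomly sampled mini-batches; the ReplayBuffer class and its capacity and batch_size parameters below are my own placeholders.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    samples mini-batches at random from distinct time steps."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


# Usage: the acting loop keeps appending transitions, while the learner
# trains on randomly sampled (and therefore less correlated) batches.
buffer = ReplayBuffer()
buffer.add(0, "right", -0.1, 1, False)
batch = buffer.sample(32)
```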

However, the drawbacks were significant: far more memory and computation are used for every real interaction, and it requires learning algorithms that do not depend on the current policy, i.e. they were mostly useful for offline-style learning where the agent does not need to explore much in the given environment.

But instead of experience replay, the authors of this paper asynchronously executed multiple agents in parallel on multiple instances of the same environment. Their idea, applied rigorously with deep neural networks, proved to have a stabilising effect on training across a wide range of elementary on-policy RL algorithms such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms like Q-learning.
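
To give a feel for the asynchronous setup, here is a heavily simplified Python sketch of my own, not the authors' implementation: a handful of worker threads, each with its own stand-in for a private environment, push small updates to a shared set of parameters without waiting for one another (the real algorithms compute neural-network gradients and apply the shared updates far more cleverly).

```python
import random
import threading

shared_weights = {"w": 0.0}   # parameters shared by every worker
lock = threading.Lock()       # guards updates to the shared parameters

def worker(worker_id, n_updates=1_000):
    """Each worker interacts with its own copy of the environment and
    asynchronously pushes updates to the shared parameters."""
    rng = random.Random(worker_id)               # stand-in for a private environment
    for _ in range(n_updates):
        local_update = rng.uniform(-1.0, 1.0)    # stand-in for a real gradient
        with lock:                               # apply the update to shared weights
            shared_weights["w"] += 0.01 * local_update

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("shared parameters after asynchronous training:", shared_weights)
```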

In the next article of this RL series, I’ll explain the actor-critic method and show a nice sample implementation using TensorFlow.