Exploring concurrency systems for asynchronous reinforcement learning

In which we look at Erlang/OTP as a management system for asynchronous RL.

Architecture

As an independent component we have the inference workers, whose job is to serve HTTP requests with the provided policy. The policy can be quantised, but it always depends on the current policy weights.
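To make this concrete, here is a minimal sketch of what an inference worker does behind its HTTP endpoint. The toy categorical policy, the `serve_completion` name, and the shape of the result are all illustrative assumptions, not a fixed API; a real worker would run a (possibly quantised) model forward pass instead.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class InferenceResult:
    token_ids: list   # sampled continuation
    logprobs: list    # log probability of each sampled token

def serve_completion(prompt_ids, weights, max_new_tokens=4, seed=0):
    """Stub inference worker: samples tokens from a toy categorical policy.

    `weights` stands in for the current policy weights, which is why the
    worker has a hard dependency on them."""
    rng = random.Random(seed)
    vocab = len(weights)
    token_ids, logprobs = [], []
    for _ in range(max_new_tokens):
        total = sum(weights)
        probs = [w / total for w in weights]
        tok = rng.choices(range(vocab), weights=probs)[0]
        token_ids.append(tok)
        logprobs.append(math.log(probs[tok]))
    return InferenceResult(token_ids, logprobs)
```

Note that the worker returns the sampled log probabilities alongside the tokens; this is what lets the rest of the system intercept them for training.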

We have the rollout workers, which simply execute a given scenario, i.e. compute rewards. The challenge here is that a rollout worker must communicate with an inference worker in a loop, and the log probabilities and token IDs must be intercepted somewhere so that we can use them for training.
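The loop above can be sketched as follows. The `scenario` dictionary keys, the `infer` callable, and `max_steps` are hypothetical names for illustration; `infer` stands in for whatever RPC mechanism talks to an inference worker.

```python
def run_rollout(scenario, infer, max_steps=8):
    """Execute one scenario, intercepting token IDs and log probs for training.

    `scenario` supplies the prompt, a termination check, and a reward function;
    `infer` returns an object with `token_ids` and `logprobs` fields."""
    context = list(scenario["prompt_ids"])
    token_ids, logprobs = [], []
    for _ in range(max_steps):
        step = infer(context)             # one call to an inference worker
        token_ids.extend(step.token_ids)  # intercept what we need ...
        logprobs.extend(step.logprobs)    # ... for the trainer later
        context.extend(step.token_ids)
        if scenario["done"](context):
            break
    reward = scenario["reward"](context)  # scalar reward for the episode
    return {"token_ids": token_ids, "logprobs": logprobs, "reward": reward}
```

The interception happens naturally here because the inference worker already returns log probabilities with each response; the rollout worker only needs to accumulate them.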

Then there’s the trainer worker, which is responsible for collecting all of the training data, applying the correct masking, and then computing the gradients and updating the weights. For that reason it must hold a reference to the current policy weights as well as the weights of the reference policy - though the latter can be served statically somewhere, as we only need its log probabilities.
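A minimal sketch of the masking and loss computation, assuming a simple per-token policy-gradient objective with a KL penalty against the reference policy. The exact objective and the `kl_coef` knob are assumptions for illustration; the key point is that the reference policy contributes only log probabilities.

```python
def masked_pg_loss(logprobs, ref_logprobs, mask, reward, kl_coef=0.1):
    """Per-token loss over a rollout: policy gradient term plus a KL penalty
    estimated from the log-ratio against the reference policy.

    `mask` zeroes out positions we must not train on (e.g. prompt tokens)."""
    n = sum(mask)
    loss = 0.0
    for lp, ref_lp, m in zip(logprobs, ref_logprobs, mask):
        if not m:
            continue                      # masked-out position
        kl = lp - ref_lp                  # per-token log-ratio (KL estimate)
        loss += -(reward * lp) + kl_coef * kl
    return loss / n
```

Because only `ref_logprobs` enters the loss, the reference policy never needs gradients or co-located weights, which is why it can be served statically.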

In the center we have the controller, which is responsible for orchestrating the rollouts. It sends rollout requests to the workers, and it is important that it is centralised so that we can steer the data distribution during training through adaptive sampling.
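One way adaptive sampling might look from the controller's side, as a sketch: pick the next scenario in proportion to how often it still fails, steering data toward hard cases. The success-rate statistics and the weighting rule are assumptions, not something the architecture prescribes.

```python
import random

def pick_scenario(stats, rng=None):
    """Centralised controller sketch: sample the next scenario to roll out.

    `stats` maps scenario id -> recent success rate in [0, 1]; weighting by
    (1 - success rate) biases sampling toward scenarios we still get wrong."""
    rng = rng or random.Random(0)
    ids = list(stats)
    weights = [1.0 - stats[s] + 1e-3 for s in ids]  # favour low success rates
    return rng.choices(ids, weights=weights)[0]
```

This only works because the controller is centralised: it sees the outcome statistics of every rollout worker and can reshape the sampling distribution globally.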

One needs to carefully balance rollout-worker throughput against learner capacity, and to consider cancellation and resumption of in-flight rollouts.
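Both concerns can be made concrete with a small sketch: a bounded queue couples rollout throughput to learner capacity (backpressure), and a version check stands in for cancelling rollouts made obsolete by a policy update. The version-based cancellation rule and the function name are illustrative assumptions.

```python
from queue import Queue, Full

def submit_rollout(batch_queue, rollout, rollout_version, current_version):
    """Hand a finished rollout to the learner, respecting capacity and staleness.

    A bounded `batch_queue` provides backpressure when the learner is
    saturated; a version mismatch signals the rollout should be cancelled
    (or resumed against the new policy)."""
    if rollout_version != current_version:
        return "cancelled"        # policy moved on; discard or restart rollout
    try:
        batch_queue.put_nowait(rollout)
        return "queued"
    except Full:
        return "backpressure"     # learner saturated; caller should slow down
```

In an Erlang/OTP setting the same roles would fall to process mailboxes and supervision, but the balancing problem is identical.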