RL Toolkit

MARO provides a full-stack abstraction for reinforcement learning (RL), which enables users to apply predefined and customized components to various scenarios. The main abstractions include fundamental components such as Agentand Shaper, and training routine controllers such as Actor <#actor> and Learner <#learner>.

Agent

The Agent is the kernel abstraction of the RL formulation for a real-world problem. Our abstraction decouples agent and its underlying model so that an agent can exist as an RL paradigm independent of the inner workings of the models it uses to generate actions or estimate values. For example, the actor-critic algorithm does not need to concern itself with the structures and optimizing schemes of the actor and critic models. This decoupling is achieved by the Core Model abstraction described below.

Agent
class AbsAgent(ABC):
    def __init__(self, model: AbsCoreModel, config, experience_pool=None):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = model.to(self.device)
        self.config = config
        self._experience_pool = experience_pool

Core Model

MARO provides an abstraction for the underlying models used by agents to form policies and estimate values. The abstraction consists of AbsBlock and AbsCoreModel, both of which subclass torch’s nn.Module. The AbsBlock represents the smallest structural unit of an NN-based model. For instance, the FullyConnectedBlock provided in the toolkit is a stack of fully connected layers with features like batch normalization, drop-out and skip connection. The AbsCoreModel is a collection of network components with embedded optimizers and serves as an agent’s “brain” by providing a unified interface to it. regardless of how many individual models it requires and how complex the model architecture might be.

As an example, the initialization of the actor-critic algorithm may look like this:

actor_stack = FullyConnectedBlock(...)
critic_stack = FullyConnectedBlock(...)
model = SimpleMultiHeadModel(
    {"actor": actor_stack, "critic": critic_stack},
    optim_option={
      "actor": OptimizerOption(cls=Adam, params={"lr": 0.001})
      "critic": OptimizerOption(cls=RMSprop, params={"lr": 0.0001})
    }
)
agent = ActorCritic("actor_critic", learning_model, config)

Choosing an action is simply:

model(state, task_name="actor", training=False)

And performing one gradient step is simply:

model.learn(critic_loss + actor_loss)

Explorer

MARO provides an abstraction for exploration in RL. Some RL algorithms such as DQN and DDPG require explicit exploration governed by a set of parameters. The AbsExplorer class is designed to cater to these needs. Simple exploration schemes, such as EpsilonGreedyExplorer for discrete action space and UniformNoiseExplorer and GaussianNoiseExplorer for continuous action space, are provided in the toolkit.

As an example, the exploration for DQN may be carried out with the aid of an EpsilonGreedyExplorer:

explorer = EpsilonGreedyExplorer(num_actions=10)
greedy_action = learning_model(state, training=False).argmax(dim=1).data
exploration_action = explorer(greedy_action)

Tools for Training

RL Overview

The RL toolkit provides tools that make local and distributed training easy: * Learner, the central controller of the learning process, which consists of collecting simulation data from

remote actors and training the agents with them. The training data collection can be done in local or distributed fashion by loading an Actor or ActorProxy instance, respectively.

  • Actor, which implements the roll_out method where the agent interacts with the environment for one episode. It consists of an environment instance and an agent (a single agent or multiple agents wrapped by MultiAgentWrapper). The class provides the as_worker() method which turns it to an event loop where roll-outs are performed on the learner’s demand. In distributed RL, there are typically many actor processes running simultaneously to parallelize training data collection.

  • Actor proxy, which also implements the roll_out method with the same signature, but manages a set of remote actors for parallel data collection.

  • Trajectory, which is primarily responsible for translating between scenario-specific information and model input / output. It implements the following methods which are used as callbacks in the actor’s roll-out loop: * get_state, which converts observations of an environment into model input. For example, the observation

    may be represented by a multi-level data structure, which gets encoded by a state shaper to a one-dimensional vector as input to a neural network. The state shaper usually goes hand in hand with the underlying policy or value models.

    • get_action, which provides model output with necessary context so that it can be executed by the environment simulator.

    • get_reward, which computes a reward for a given action.

    • on_env_feedback, which defines things to do upon getting feedback from the environment.

    • on_finish, which defines things to do upon completion of a roll-out episode.