RL Toolkit ========== MARO provides a full-stack abstraction for reinforcement learning (RL), which enables users to apply predefined and customized components to various scenarios. The main abstractions include fundamental components such as `Agent <#agent>`_\ and `Shaper <#shaper>`_\ , and training routine controllers such as `Actor <#actor>` and `Learner <#learner>`. Agent ----- The Agent is the kernel abstraction of the RL formulation for a real-world problem. Our abstraction decouples agent and its underlying model so that an agent can exist as an RL paradigm independent of the inner workings of the models it uses to generate actions or estimate values. For example, the actor-critic algorithm does not need to concern itself with the structures and optimizing schemes of the actor and critic models. This decoupling is achieved by the Core Model abstraction described below. .. image:: ../images/rl/agent.svg :target: ../images/rl/agent.svg :alt: Agent .. code-block:: python class AbsAgent(ABC): def __init__(self, model: AbsCoreModel, config, experience_pool=None): self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") self.model = model.to(self.device) self.config = config self._experience_pool = experience_pool Core Model ---------- MARO provides an abstraction for the underlying models used by agents to form policies and estimate values. The abstraction consists of ``AbsBlock`` and ``AbsCoreModel``, both of which subclass torch's nn.Module. The ``AbsBlock`` represents the smallest structural unit of an NN-based model. For instance, the ``FullyConnectedBlock`` provided in the toolkit is a stack of fully connected layers with features like batch normalization, drop-out and skip connection. The ``AbsCoreModel`` is a collection of network components with embedded optimizers and serves as an agent's "brain" by providing a unified interface to it. regardless of how many individual models it requires and how complex the model architecture might be. As an example, the initialization of the actor-critic algorithm may look like this: .. code-block:: python actor_stack = FullyConnectedBlock(...) critic_stack = FullyConnectedBlock(...) model = SimpleMultiHeadModel( {"actor": actor_stack, "critic": critic_stack}, optim_option={ "actor": OptimizerOption(cls=Adam, params={"lr": 0.001}) "critic": OptimizerOption(cls=RMSprop, params={"lr": 0.0001}) } ) agent = ActorCritic("actor_critic", learning_model, config) Choosing an action is simply: .. code-block:: python model(state, task_name="actor", training=False) And performing one gradient step is simply: .. code-block:: python model.learn(critic_loss + actor_loss) Explorer -------- MARO provides an abstraction for exploration in RL. Some RL algorithms such as DQN and DDPG require explicit exploration governed by a set of parameters. The ``AbsExplorer`` class is designed to cater to these needs. Simple exploration schemes, such as ``EpsilonGreedyExplorer`` for discrete action space and ``UniformNoiseExplorer`` and ``GaussianNoiseExplorer`` for continuous action space, are provided in the toolkit. As an example, the exploration for DQN may be carried out with the aid of an ``EpsilonGreedyExplorer``: .. code-block:: python explorer = EpsilonGreedyExplorer(num_actions=10) greedy_action = learning_model(state, training=False).argmax(dim=1).data exploration_action = explorer(greedy_action) Tools for Training ------------------------------ .. image:: ../images/rl/learner_actor.svg :target: ../images/rl/learner_actor.svg :alt: RL Overview The RL toolkit provides tools that make local and distributed training easy: * Learner, the central controller of the learning process, which consists of collecting simulation data from remote actors and training the agents with them. The training data collection can be done in local or distributed fashion by loading an ``Actor`` or ``ActorProxy`` instance, respectively. * Actor, which implements the ``roll_out`` method where the agent interacts with the environment for one episode. It consists of an environment instance and an agent (a single agent or multiple agents wrapped by ``MultiAgentWrapper``). The class provides the as_worker() method which turns it to an event loop where roll-outs are performed on the learner's demand. In distributed RL, there are typically many actor processes running simultaneously to parallelize training data collection. * Actor proxy, which also implements the ``roll_out`` method with the same signature, but manages a set of remote actors for parallel data collection. * Trajectory, which is primarily responsible for translating between scenario-specific information and model input / output. It implements the following methods which are used as callbacks in the actor's roll-out loop: * ``get_state``, which converts observations of an environment into model input. For example, the observation may be represented by a multi-level data structure, which gets encoded by a state shaper to a one-dimensional vector as input to a neural network. The state shaper usually goes hand in hand with the underlying policy or value models. * ``get_action``, which provides model output with necessary context so that it can be executed by the environment simulator. * ``get_reward``, which computes a reward for a given action. * ``on_env_feedback``, which defines things to do upon getting feedback from the environment. * ``on_finish``, which defines things to do upon completion of a roll-out episode.