RL Toolkit¶
MARO provides a full-stack abstraction for reinforcement learning (RL), which enables users to apply predefined and customized components to various scenarios. The main abstractions include fundamental components such as Agentand Shaper, and training routine controllers such as Actor <#actor> and Learner <#learner>.
Agent¶
The Agent is the kernel abstraction of the RL formulation for a real-world problem. Our abstraction decouples agent and its underlying model so that an agent can exist as an RL paradigm independent of the inner workings of the models it uses to generate actions or estimate values. For example, the actor-critic algorithm does not need to concern itself with the structures and optimizing schemes of the actor and critic models. This decoupling is achieved by the Core Model abstraction described below.
class AbsAgent(ABC):
def __init__(self, model: AbsCoreModel, config, experience_pool=None):
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model = model.to(self.device)
self.config = config
self._experience_pool = experience_pool
Core Model¶
MARO provides an abstraction for the underlying models used by agents to form policies and estimate values.
The abstraction consists of AbsBlock and AbsCoreModel, both of which subclass torch’s nn.Module.
The AbsBlock represents the smallest structural unit of an NN-based model. For instance, the FullyConnectedBlock
provided in the toolkit is a stack of fully connected layers with features like batch normalization,
drop-out and skip connection. The AbsCoreModel is a collection of network components with
embedded optimizers and serves as an agent’s “brain” by providing a unified interface to it. regardless of how many individual models it requires and how
complex the model architecture might be.
As an example, the initialization of the actor-critic algorithm may look like this:
actor_stack = FullyConnectedBlock(...)
critic_stack = FullyConnectedBlock(...)
model = SimpleMultiHeadModel(
{"actor": actor_stack, "critic": critic_stack},
optim_option={
"actor": OptimizerOption(cls=Adam, params={"lr": 0.001})
"critic": OptimizerOption(cls=RMSprop, params={"lr": 0.0001})
}
)
agent = ActorCritic("actor_critic", learning_model, config)
Choosing an action is simply:
model(state, task_name="actor", training=False)
And performing one gradient step is simply:
model.learn(critic_loss + actor_loss)
Explorer¶
MARO provides an abstraction for exploration in RL. Some RL algorithms such as DQN and DDPG require
explicit exploration governed by a set of parameters. The AbsExplorer class is designed to cater
to these needs. Simple exploration schemes, such as EpsilonGreedyExplorer for discrete action space
and UniformNoiseExplorer and GaussianNoiseExplorer for continuous action space, are provided in
the toolkit.
As an example, the exploration for DQN may be carried out with the aid of an EpsilonGreedyExplorer:
explorer = EpsilonGreedyExplorer(num_actions=10)
greedy_action = learning_model(state, training=False).argmax(dim=1).data
exploration_action = explorer(greedy_action)
Tools for Training¶
The RL toolkit provides tools that make local and distributed training easy: * Learner, the central controller of the learning process, which consists of collecting simulation data from
remote actors and training the agents with them. The training data collection can be done in local or distributed fashion by loading an
ActororActorProxyinstance, respectively.
Actor, which implements the
roll_outmethod where the agent interacts with the environment for one episode. It consists of an environment instance and an agent (a single agent or multiple agents wrapped byMultiAgentWrapper). The class provides the as_worker() method which turns it to an event loop where roll-outs are performed on the learner’s demand. In distributed RL, there are typically many actor processes running simultaneously to parallelize training data collection.Actor proxy, which also implements the
roll_outmethod with the same signature, but manages a set of remote actors for parallel data collection.Trajectory, which is primarily responsible for translating between scenario-specific information and model input / output. It implements the following methods which are used as callbacks in the actor’s roll-out loop: *
get_state, which converts observations of an environment into model input. For example, the observationmay be represented by a multi-level data structure, which gets encoded by a state shaper to a one-dimensional vector as input to a neural network. The state shaper usually goes hand in hand with the underlying policy or value models.
get_action, which provides model output with necessary context so that it can be executed by the environment simulator.get_reward, which computes a reward for a given action.on_env_feedback, which defines things to do upon getting feedback from the environment.on_finish, which defines things to do upon completion of a roll-out episode.