Agent

maro.rl.agent.abs_agent

class maro.rl.agent.abs_agent.AbsAgent(model: maro.rl.model.learning_model.AbsCoreModel, config)

Bases: abc.ABC

Abstract RL agent class.

It serves as a sandbox for the RL algorithm: scenario-specific details are excluded so that development can focus on the algorithm abstraction. Environment observations and decision events are converted to a uniform format before being passed in, and the output is converted to an environment-executable format before being returned to the environment. The agent's key responsibility is to optimize a policy based on interaction with the environment.

Parameters

model (AbsCoreModel) – Task model or container of task models required by the algorithm.
config – Settings for the algorithm.

abstract choose_action(state)

This method uses the underlying model(s) to compute an action from a shaped state.

Parameters: state – A state object shaped by a StateShaper to conform to the model input format.
Returns: The action to be taken given state. It is usually necessary to use an ActionShaper to convert this to an environment executable action.

dump_model()

Return the algorithm’s trainable models.

dump_model_to_file(path: str)

Dump the algorithm’s trainable models to disk.

Dump trainable models to the specified directory. The model file is always prefixed with the agent’s name.

Parameters: path (str) – path to the directory where the models are saved.

abstract learn(*args, **kwargs)

Algorithm-specific training logic.

The parameters are data to train the underlying model on. Algorithm-specific loss and optimization should be reflected here.
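The two abstract methods above define the whole contract a concrete agent has to satisfy. The following is a minimal sketch of that contract, written without importing MARO: `SketchAgent` and `GreedyTableAgent` are hypothetical names, the dictionary "model" and the `learning_rate` setting are assumptions made for illustration, and none of this is the library's own implementation.

```python
from abc import ABC, abstractmethod


class SketchAgent(ABC):
    """Hypothetical stand-in for the AbsAgent contract; not the MARO class."""

    def __init__(self, model, config):
        self.model = model      # task model (or container of task models)
        self.config = config    # settings for the algorithm

    @abstractmethod
    def choose_action(self, state):
        """Compute an action from a shaped state."""

    @abstractmethod
    def learn(self, *args, **kwargs):
        """Algorithm-specific training logic."""


class GreedyTableAgent(SketchAgent):
    """Toy agent whose 'model' is a dict mapping a state to a list of action values."""

    def choose_action(self, state):
        values = self.model[state]
        # Index of the largest action value.
        return max(range(len(values)), key=values.__getitem__)

    def learn(self, state, action, target):
        # Move the stored value a fraction of the way toward the target.
        lr = self.config["learning_rate"]
        self.model[state][action] += lr * (target - self.model[state][action])


agent = GreedyTableAgent(model={"s0": [0.0, 1.0]}, config={"learning_rate": 0.5})
first_action = agent.choose_action("s0")
agent.learn("s0", 0, target=4.0)
second_action = agent.choose_action("s0")
```

Because both methods are abstract, instantiating `SketchAgent` directly raises a `TypeError`; only subclasses that implement `choose_action` and `learn` can be created.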

load_model(model)

Load models from memory.

load_model_from_file(path: str)

Load trainable models from disk.

Load trainable models from the specified directory. The model file is always prefixed with the agent’s name.

Parameters: path (str) – path to the directory where the models are saved.

set_exploration_params(**params)

to_device(device)

maro.rl.agent.dqn

class maro.rl.agent.dqn.DQN(model: maro.rl.model.learning_model.SimpleMultiHeadModel, config: maro.rl.agent.dqn.DQNConfig)

Bases: maro.rl.agent.abs_agent.AbsAgent

The Deep-Q-Networks algorithm.

See https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf for details.

Parameters

model (SimpleMultiHeadModel) – Q-value model.
config – Configuration for DQN algorithm.

choose_action(state: numpy.ndarray) → Union[int, numpy.ndarray]

This method uses the underlying model(s) to compute an action from a shaped state.

Parameters: state – A state object shaped by a StateShaper to conform to the model input format.
Returns: The action to be taken given state. It is usually necessary to use an ActionShaper to convert this to an environment executable action.
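DQNConfig describes `epsilon` as the exploration rate for epsilon-greedy exploration. As a rough illustration of that rule, and not a copy of how `DQN.choose_action` is implemented, the sketch below picks a random action index with probability `epsilon` and the highest-valued action otherwise. The function name `epsilon_greedy` and the plain-list Q-values are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


q_values = [0.1, 0.7, 0.2]

# With epsilon = 0.0 the choice is always greedy.
greedy_action = epsilon_greedy(q_values, epsilon=0.0)

# With epsilon = 1.0 every choice is a uniformly random action index.
rng = random.Random(0)
explored = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(100)]
```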

learn(states: numpy.ndarray, actions: numpy.ndarray, rewards: numpy.ndarray, next_states: numpy.ndarray)

Algorithm-specific training logic.

The parameters are data to train the underlying model on. Algorithm-specific loss and optimization should be reflected here.
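To make the training target concrete, here is a small sketch of how a TD target can be formed from a batch of rewards and next-state Q-values, following the `double` formulas given in the DQNConfig parameter description. It uses plain Python lists instead of tensors; `td_targets` and `argmax` are hypothetical helpers, and the actual `learn` method may differ in its details.

```python
def argmax(values):
    """Index of the largest element in a list."""
    return max(range(len(values)), key=values.__getitem__)


def td_targets(rewards, q_eval_next, q_target_next, reward_discount, double=True):
    """Compute r + reward_discount * q_next for each transition in a batch.

    q_eval_next[i] and q_target_next[i] hold the per-action Q-values that the
    evaluation and target models assign to the i-th next state.
    """
    targets = []
    for r, q_eval, q_tgt in zip(rewards, q_eval_next, q_target_next):
        if double:
            # Double DQN: the eval model selects the action, the target model scores it.
            q_next = q_tgt[argmax(q_eval)]
        else:
            # Regular DQN: the target model both selects and scores.
            q_next = max(q_tgt)
        targets.append(r + reward_discount * q_next)
    return targets


rewards = [1.0, 0.0]
q_eval_next = [[0.2, 0.9], [0.5, 0.1]]
q_target_next = [[1.0, 0.4], [0.3, 0.8]]

double_targets = td_targets(rewards, q_eval_next, q_target_next, 0.5, double=True)
vanilla_targets = td_targets(rewards, q_eval_next, q_target_next, 0.5, double=False)
```

In this toy batch the double variant yields smaller targets than the regular one, because the action picked by the evaluation model is not the one the target model rates highest.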

set_exploration_params(epsilon)

class maro.rl.agent.dqn.DQNConfig(reward_discount: float, target_update_freq: int, epsilon: float = 0.0, tau: float = 0.1, double: bool = True, advantage_type: str = None, loss_cls=<class 'torch.nn.modules.loss.MSELoss'>)

Bases: object

Configuration for the DQN algorithm.

Parameters

reward_discount (float) – Reward decay as defined in standard RL terminology.
epsilon (float) – Exploration rate for epsilon-greedy exploration. Defaults to 0.0.
tau (float) – Soft update coefficient, i.e., target_model = tau * eval_model + (1 - tau) * target_model. Defaults to 0.1.
double (bool) – If True, the next Q values will be computed according to the double DQN algorithm, i.e., q_next = Q_target(s, argmax(Q_eval(s, a))). Otherwise, q_next = max(Q_target(s, a)). See https://arxiv.org/pdf/1509.06461.pdf for details. Defaults to True.
advantage_type (str) – Advantage mode for the dueling architecture. Defaults to None, in which case it is assumed that the regular Q-value model is used.
loss_cls – Loss function class for evaluating TD errors. Defaults to torch.nn.MSELoss.
target_update_freq (int) – Number of training rounds between target model updates.
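The `tau` and `target_update_freq` parameters work together: every `target_update_freq` training rounds, the target model is moved toward the evaluation model by a fraction `tau`. The sketch below illustrates the soft-update formula on plain lists of numbers. The name `soft_update` is hypothetical, and triggering the update on exact multiples of `target_update_freq` is an assumption about the schedule, not a statement of the library's behavior.

```python
def soft_update(target_params, eval_params, tau):
    """Blend parameters: target = tau * eval + (1 - tau) * target."""
    return [tau * e + (1.0 - tau) * t for t, e in zip(target_params, eval_params)]


target = [0.0, 10.0]
evaluation = [1.0, 20.0]

blended = soft_update(target, evaluation, tau=0.1)   # small step toward eval
copied = soft_update(target, evaluation, tau=1.0)    # tau = 1.0 is a full copy

# Assumed schedule: update the target once every target_update_freq rounds.
target_update_freq = 2
updates = 0
for train_round in range(1, 7):
    if train_round % target_update_freq == 0:
        target = soft_update(target, evaluation, tau=0.1)
        updates += 1
```

A small `tau` makes the target model change slowly, which tends to stabilize the TD targets; `tau = 1.0` reduces the soft update to a periodic hard copy.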

advantage_type

double

epsilon

loss_func

reward_discount

target_update_freq

tau

maro.rl.agent.ddpg

class maro.rl.agent.ddpg.DDPG(model: maro.rl.model.learning_model.SimpleMultiHeadModel, config: maro.rl.agent.ddpg.DDPGConfig, explorer: maro.rl.exploration.noise_explorer.NoiseExplorer = None)

Bases: maro.rl.agent.abs_agent.AbsAgent

The Deep Deterministic Policy Gradient (DDPG) algorithm.

References: https://arxiv.org/pdf/1509.02971.pdf https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/ddpg

Parameters

model (SimpleMultiHeadModel) – DDPG policy and q-value models.
config – Configuration for DDPG algorithm.
explorer (NoiseExplorer) – A NoiseExplorer instance for generating exploratory actions. Defaults to None.

choose_action(state) → Union[float, numpy.ndarray]

This method uses the underlying model(s) to compute an action from a shaped state.

Parameters: state – A state object shaped by a StateShaper to conform to the model input format.
Returns: The action to be taken given state. It is usually necessary to use an ActionShaper to convert this to an environment executable action.

learn(states: numpy.ndarray, actions: numpy.ndarray, rewards: numpy.ndarray, next_states: numpy.ndarray)

Algorithm-specific training logic.

The parameters are data to train the underlying model on. Algorithm-specific loss and optimization should be reflected here.

class maro.rl.agent.ddpg.DDPGConfig(reward_discount: float, q_value_loss_func: Callable, target_update_freq: int, policy_loss_coefficient: float = 1.0, tau: float = 1.0)

Bases: object

Configuration for the DDPG algorithm.

Parameters

reward_discount (float) – Reward decay as defined in standard RL terminology.
q_value_loss_func (Callable) – Loss function for the Q-value estimator.
target_update_freq (int) – Number of training rounds between policy target model updates.
policy_loss_coefficient (float) – The coefficient for policy loss in the total loss function, i.e., loss = q_value_loss + policy_loss_coefficient * policy_loss. Defaults to 1.0.
tau (float) – Soft update coefficient, i.e., target_model = tau * eval_model + (1 - tau) * target_model. Defaults to 1.0.
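The total loss formula in the parameter list can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: `mse` stands in for whatever callable is passed as `q_value_loss_func`, the policy loss is taken to be the negative mean of Q(s, policy(s)) as in the standard DDPG formulation, and `ddpg_total_loss` is a hypothetical helper rather than part of MARO.

```python
def mse(predictions, targets):
    """Mean squared error, used here as a stand-in for q_value_loss_func."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def ddpg_total_loss(q_pred, q_target, q_of_policy_action,
                    policy_loss_coefficient=1.0, q_value_loss_func=mse):
    """loss = q_value_loss + policy_loss_coefficient * policy_loss."""
    q_value_loss = q_value_loss_func(q_pred, q_target)
    # Assumed policy objective: maximize Q(s, policy(s)), i.e. minimize its negative mean.
    policy_loss = -sum(q_of_policy_action) / len(q_of_policy_action)
    return q_value_loss + policy_loss_coefficient * policy_loss


loss = ddpg_total_loss(
    q_pred=[1.0, 2.0],
    q_target=[2.0, 4.0],
    q_of_policy_action=[0.5, 1.5],
    policy_loss_coefficient=0.5,
)
```

Here the Q-value loss is 2.5 and the policy loss is -1.0, so with a coefficient of 0.5 the total is 2.0. Raising `policy_loss_coefficient` gives the policy term more weight relative to the critic term.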

policy_loss_coefficient

q_value_loss_func

reward_discount

target_update_freq

tau
