Multi Agent DQN for CIM
This example demonstrates how to use MARO’s reinforcement learning (RL) toolkit to solve the container inventory management (CIM) problem. The problem is formalized as a multi-agent reinforcement learning problem in which each port acts as a decision agent. When a vessel arrives at a port, the port's agent must act by transferring a certain number of containers to or from the vessel. The objective is for the agents to learn policies that minimize the overall container shortage.
Trajectory
The CIMTrajectoryForDQN class inherits from Trajectory and implements methods that are used as callbacks in the roll-out loop. In this example,
get_state converts environment observations to state vectors that encode temporal and spatial information. The temporal information includes relevant port and vessel information, such as shortage and remaining space, over the past k days (here k = 7). The spatial information includes features of the downstream ports.
get_action converts an agent’s output (an integer that maps to a percentage of containers to be loaded onto or unloaded from the vessel) to an action object that can be executed by the environment.
get_offline_reward computes the reward of a given action as a linear combination of fulfillment and shortage within a future time frame.
on_finish processes a complete trajectory into data that can be used directly by the learning agents.
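The first three callbacks above can be illustrated with a minimal, self-contained sketch. It does not use MARO's API: the function names, the 21-way action space, the sign convention, and the reward weights are all hypothetical and chosen only to make the ideas concrete.

```python
from typing import List, Sequence

LOOK_BACK = 7      # k = 7 days of history, as in the text above
NUM_ACTIONS = 21   # hypothetical: indices 0..20 map to -100%..+100% in 10% steps


def build_state(
    port_history: Sequence[Sequence[float]],
    vessel_history: Sequence[Sequence[float]],
    downstream_features: Sequence[float],
    look_back: int = LOOK_BACK,
) -> List[float]:
    """Flatten the last `look_back` days of port and vessel features
    (temporal part) and append downstream port features (spatial part)."""
    state: List[float] = []
    for day in port_history[-look_back:]:
        state.extend(day)
    for day in vessel_history[-look_back:]:
        state.extend(day)
    state.extend(downstream_features)
    return state


def action_index_to_fraction(index: int, num_actions: int = NUM_ACTIONS) -> float:
    """Map a discrete agent output to a signed fraction in [-1, 1].

    Assumed convention: negative means load onto the vessel,
    positive means unload from the vessel, zero means no transfer.
    """
    if not 0 <= index < num_actions:
        raise ValueError(f"action index {index} is out of range")
    half = (num_actions - 1) / 2
    return (index - half) / half


def offline_reward(
    fulfillment: Sequence[float],
    shortage: Sequence[float],
    fulfillment_weight: float = 1.0,  # hypothetical coefficient
    shortage_weight: float = 1.0,     # hypothetical coefficient
) -> float:
    """Linear combination of fulfillment and shortage summed over a
    future window; each sequence holds one value per future time step."""
    return sum(
        fulfillment_weight * f - shortage_weight * s
        for f, s in zip(fulfillment, shortage)
    )
```

For instance, under these assumptions `action_index_to_fraction(15)` yields `0.5`, which an environment-facing wrapper could interpret as unloading half of the relevant containers.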
Agent
The out-of-the-box DQN is used as our agent.
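As a reminder of what a DQN agent optimizes, here is a minimal sketch of the standard one-step temporal-difference target. It is a generic illustration rather than MARO's DQN implementation; the function names and the default discount factor are assumptions.

```python
from typing import Sequence


def dqn_target(
    reward: float,
    next_q_values: Sequence[float],
    discount: float = 0.9,  # assumed value, not taken from the example
    done: bool = False,
) -> float:
    """One-step DQN target: r + discount * max_a' Q_target(s', a').

    `next_q_values` stands for the target network's Q-values at the
    next state. For a terminal transition the target is the reward."""
    if done:
        return reward
    return reward + discount * max(next_q_values)


def td_error(q_value: float, target: float) -> float:
    """Difference between the target and the current Q-value estimate;
    DQN trains the network to reduce a loss on this quantity."""
    return target - q_value
```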
Training
The distributed training consists of one learner process and multiple actor processes. The learner optimizes the policy by collecting roll-out data from the actors to train the underlying agents.
Each actor process must create a roll-out executor to perform the requested roll-outs, which means that the environment simulator and shapers should be created here. In this example, inference is performed on the actor’s side, so a set of DQN agents must be created in order to load the models (and exploration parameters) from the learner.
The learner’s side requires a concrete learner class that inherits from AbsLearner and implements the run method, which contains the main training loop. Here the implementation is similar to the single-threaded version, except that the collect method is used to obtain roll-out data from the actors (since the roll-out executors are located on the actors’ side). Training occurs on the agents created here, so they always contain the latest policies.
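The division of labor between learner and actors can be sketched with a toy, in-process stand-in. None of this is MARO's API: `ToyActor`, `ToyLearner`, the bandit-style environment, and the linear exploration schedule are hypothetical, and real actors would run in separate processes rather than in a loop.

```python
import random
from typing import List, Sequence, Tuple

Experience = Tuple[int, float]  # (action, reward)


class ToyActor:
    """Holds a copy of the learner's values and runs inference locally."""

    def __init__(self, num_actions: int, seed: int) -> None:
        self.rng = random.Random(seed)
        self.q_values = [0.0] * num_actions
        self.epsilon = 1.0

    def load(self, q_values: Sequence[float], epsilon: float) -> None:
        # Receive the latest model and exploration parameter from the learner.
        self.q_values = list(q_values)
        self.epsilon = epsilon

    def roll_out(self, steps: int) -> List[Experience]:
        experiences: List[Experience] = []
        n = len(self.q_values)
        for _ in range(steps):
            if self.rng.random() < self.epsilon:
                action = self.rng.randrange(n)  # explore
            else:
                action = max(range(n), key=self.q_values.__getitem__)  # exploit
            # Toy environment: only the last action is rewarded.
            reward = 1.0 if action == n - 1 else 0.0
            experiences.append((action, reward))
        return experiences


class ToyLearner:
    """Owns the up-to-date values; actors only ever hold copies."""

    def __init__(self, actors: Sequence[ToyActor], num_actions: int,
                 learning_rate: float = 0.5) -> None:
        self.actors = list(actors)
        self.q_values = [0.0] * num_actions
        self.learning_rate = learning_rate

    def collect(self, epsilon: float, steps: int) -> List[Experience]:
        # Push the latest parameters to every actor, then gather roll-out data.
        data: List[Experience] = []
        for actor in self.actors:
            actor.load(self.q_values, epsilon)
            data.extend(actor.roll_out(steps))
        return data

    def run(self, episodes: int, steps: int) -> None:
        # Main training loop: collect from actors, then train locally.
        for episode in range(episodes):
            epsilon = max(0.0, 1.0 - episode / episodes)
            for action, reward in self.collect(epsilon, steps):
                error = reward - self.q_values[action]
                self.q_values[action] += self.learning_rate * error


if __name__ == "__main__":
    learner = ToyLearner([ToyActor(3, seed=s) for s in (0, 1)], num_actions=3)
    learner.run(episodes=5, steps=20)
    print(learner.q_values)
```

The point of the sketch is the data flow: parameters travel from learner to actors before each roll-out, experiences travel back through `collect`, and updates happen only on the learner's copy.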
Note
All related code snippets are supported in maro playground.