Multi-Agent DQN for CIM

This example demonstrates how to use MARO’s reinforcement learning (RL) toolkit to solve the container inventory management (CIM) problem. The problem is formulated as a multi-agent reinforcement learning problem in which each port acts as a decision agent. When a vessel arrives at a port, the port’s agent must act by transferring a certain number of containers to or from the vessel. The objective is for the agents to learn policies that minimize the overall container shortage.

Trajectory

CIMTrajectoryForDQN inherits from Trajectory and implements methods that are used as callbacks in the roll-out loop. In this example:

  • get_state converts environment observations to state vectors that encode temporal and spatial information. The temporal information includes relevant port and vessel information, such as shortage and remaining space, over the past k days (here k = 7). The spatial information includes features of the downstream ports.

  • get_action converts agents’ output (an integer that maps to a percentage of containers to be loaded to or unloaded from the vessel) to action objects that can be executed by the environment.

  • get_offline_reward computes the reward of a given action as a linear combination of fulfillment and shortage within a future time frame.

  • on_finish processes a complete trajectory into data that can be used directly by the learning agents.
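
The action decoding and reward computation described in the list above can be sketched in plain Python. This is a minimal illustration, not MARO’s implementation: the function names (decode_action, offline_reward), the 21-step action space, the sign convention (negative means load, positive means discharge), and the weight and window parameters are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical discrete action space: 21 indices spread evenly over [-1.0, 1.0].
# Assumed convention: negative = load onto the vessel, positive = discharge from it.
NUM_ACTIONS = 21
HALF = NUM_ACTIONS // 2
ACTION_SPACE: List[float] = [(i - HALF) / HALF for i in range(NUM_ACTIONS)]


def decode_action(
    action_index: int,
    vessel_load: int,
    vessel_free_space: int,
    port_stock: int,
) -> Tuple[str, int]:
    """Map an agent's integer output to a (direction, quantity) pair."""
    percent = ACTION_SPACE[action_index]
    if percent < 0:
        # Load a fraction of what the port can supply and the vessel can hold.
        return "load", round(-percent * min(port_stock, vessel_free_space))
    if percent > 0:
        # Discharge a fraction of the containers currently on the vessel.
        return "discharge", round(percent * vessel_load)
    return "none", 0


def offline_reward(
    fulfillment: Sequence[float],
    shortage: Sequence[float],
    start_tick: int,
    window: int,
    fulfillment_weight: float = 1.0,
    shortage_weight: float = 1.0,
) -> float:
    """Linear combination of fulfillment and shortage over a future window.

    The window covers ticks [start_tick, start_tick + window), clipped to
    the length of the recorded series.
    """
    end_tick = min(start_tick + window, len(fulfillment))
    total = 0.0
    for tick in range(start_tick, end_tick):
        total += fulfillment_weight * fulfillment[tick]
        total -= shortage_weight * shortage[tick]
    return total
```

The reward is called "offline" because it depends on ticks after the action was taken, so it can only be computed once those ticks have been simulated.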

Agent

The out-of-the-box DQN is used as our agent.
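
For readers unfamiliar with DQN, its two core operations are epsilon-greedy action selection and the one-step temporal-difference target. The sketch below shows both in plain Python. It is a generic illustration of the algorithm and does not reflect the interface of MARO’s DQN class; the function names and default values are assumptions.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng=random) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(
    reward: float,
    next_q_values: Sequence[float],
    gamma: float = 0.9,
    done: bool = False,
) -> float:
    """One-step DQN target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    return reward + gamma * max(next_q_values)
```

During training, the Q-network is regressed toward td_target for the action actually taken, with next_q_values typically produced by a periodically updated target network.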

Training

The distributed training consists of one learner process and multiple actor processes. The learner optimizes the policy by collecting roll-out data from the actors to train the underlying agents.

Each actor process must create a roll-out executor to perform the requested roll-outs, which means that the environment simulator and shapers should be created here. In this example, inference is performed on the actor’s side, so a set of DQN agents must be created in order to load the models (and exploration parameters) from the learner.

The learner’s side requires a concrete learner class that inherits from AbsLearner and implements the run method, which contains the main training loop. Here the implementation is similar to the single-threaded version, except that the collect method is used to obtain roll-out data from the actors (since the roll-out executors are located on the actors’ side). The agents created here are where training occurs and hence always contain the latest policies.
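
The division of labor between learner and actors can be sketched with the standard library alone. The example below is a simplified stand-in, not MARO code: it uses threads and queues in place of separate processes and MARO’s communication layer, and the names actor_loop and run_learner, the message format, and the dummy "roll-out" and "training" steps are all hypothetical.

```python
import queue
import threading
from typing import List, Tuple


def actor_loop(actor_id: int, request_q: queue.Queue, result_q: queue.Queue) -> None:
    """Hypothetical actor: receive the latest model, roll out, send data back."""
    while True:
        message = request_q.get()
        if message is None:  # shutdown signal from the learner
            break
        episode, params = message
        # Stand-in for a real roll-out: inference would run here with the
        # received parameters against a local environment simulator.
        experiences = [(actor_id, episode, step, params["version"]) for step in range(3)]
        result_q.put(experiences)


def run_learner(num_actors: int = 2, num_episodes: int = 3) -> Tuple[int, List[int]]:
    """Hypothetical learner: broadcast the model, collect roll-outs, train."""
    request_qs = [queue.Queue() for _ in range(num_actors)]
    result_q: queue.Queue = queue.Queue()
    actors = [
        threading.Thread(target=actor_loop, args=(i, request_qs[i], result_q))
        for i in range(num_actors)
    ]
    for actor in actors:
        actor.start()

    params = {"version": 0}  # stand-in for model weights and exploration parameters
    batch_sizes = []
    for episode in range(num_episodes):
        # "collect": send the latest parameters to every actor, then gather results.
        for request_q in request_qs:
            request_q.put((episode, dict(params)))
        batch = []
        for _ in range(num_actors):
            batch.extend(result_q.get())
        # "train": stand-in for a gradient update on the collected batch.
        params["version"] += 1
        batch_sizes.append(len(batch))

    for request_q in request_qs:
        request_q.put(None)
    for actor in actors:
        actor.join()
    return params["version"], batch_sizes
```

Because only the learner updates params, its copy is always the most recent one, and actors work with whatever snapshot they received at the start of the episode.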

Note

All related code snippets are supported in the MARO playground.