Grass Cluster Provisioning on Azure

This guide shows how to provision a MARO cluster in grass/azure mode on Azure and run a training job in a distributed environment.

Cluster Management

  • Create a cluster with a deployment

    # Create a grass cluster with a grass-create deployment
    maro grass create ./grass-azure-create.yml
    
  • Scale the cluster

    See VM Size for the available node specifications.

    # Scale nodes with 'Standard_D4s_v3' specification to 2
    maro grass node scale myGrassCluster Standard_D4s_v3 2
    
    # Scale nodes with 'Standard_D2s_v3' specification to 0
    maro grass node scale myGrassCluster Standard_D2s_v3 0
    
  • Delete the cluster

    # Delete a grass cluster
    maro grass delete myGrassCluster
    
  • Start/Stop nodes to save costs

    # Start 2 nodes with 'Standard_D4s_v3' specification
    maro grass node start myGrassCluster Standard_D4s_v3 2
    
    # Stop 2 nodes with 'Standard_D4s_v3' specification
    maro grass node stop myGrassCluster Standard_D4s_v3 2
    
  • Get statuses of the cluster

    # Get master status
    maro grass status myGrassCluster master
    
    # Get nodes status
    maro grass status myGrassCluster nodes
    
    # Get containers status
    maro grass status myGrassCluster containers
    
  • Clean up the cluster

    Delete all running jobs, schedules, and containers in the cluster.

    maro grass clean myGrassCluster
    

Run Job

  • Push your training image from local machine

    # Push the image 'myImage' to the cluster.
    # 'myImage' is a Docker image that is loaded on the machine executing this command
    maro grass image push myGrassCluster --image-name myImage
    
  • Push your training data

    # Push the 'dqn' folder under './myTrainingData/' to the relative path '/myTrainingData' in the cluster
    # You can then set the mount location in the start-job deployment
    maro grass data push myGrassCluster ./myTrainingData/dqn /myTrainingData
    
  • Start a training job with a start-job-deployment

    # Start a training job with a start-job deployment
    maro grass job start myGrassCluster ./grass-start-job.yml
    
  • Or, schedule batch jobs with a start-schedule-deployment

    These jobs will share the same component specification.

    A recommended practice for this command: push all your training configs at once with "maro grass data push", read the jobName from the environment variables in each container, then load the training config that matches that jobName.

    # Start a training schedule with a start-schedule deployment
    maro grass schedule start myGrassCluster ./grass-start-schedule.yml
    
  • Get the logs of the job

    # Get the logs of the job
    maro grass job logs myGrassCluster myJob1
    
  • List the current status of the job

    # List the current status of the job
    maro grass job list myGrassCluster
    
  • Stop a training job

    # Stop a training job
    maro grass job stop myGrassCluster myJob1
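
The recommended practice for scheduled jobs (push all configs once, then pick one per job name inside the container) could look roughly like the sketch below. It is only an illustration under stated assumptions: the environment variable name JOB_NAME, the one-folder-per-job layout, the file name config.yml, and the helper config_for_job are all hypothetical and not taken from MARO. Check the environment of your own containers for the actual variable name.

```python
import os
from pathlib import Path
from typing import Mapping

# Assumption: the container exposes the job name under this variable name.
JOB_NAME_VAR = "JOB_NAME"


def config_for_job(config_root: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Return the config file that belongs to the job named in the environment.

    Assumed layout (hypothetical): <config_root>/<job name>/config.yml
    """
    job_name = env.get(JOB_NAME_VAR)
    if not job_name:
        raise RuntimeError(f"{JOB_NAME_VAR} is not set in the environment")

    config_path = Path(config_root) / job_name / "config.yml"
    if not config_path.is_file():
        raise FileNotFoundError(f"No config for job '{job_name}' at {config_path}")
    return config_path
```

In a training script, a call such as config_for_job(Path("/home/admin/myTrainingData/dqn/schedule1")) would then resolve the config for whichever job the container belongs to, so a single image and a single command can serve every job in the schedule.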
    

Sample Deployments

grass-azure-create

mode: grass/azure
name: myGrassCluster

cloud:
  resource_group: myResourceGroup
  subscription: mySubscription
  location: eastus
  default_username: admin
  default_public_key: "{ssh public key}"

user:
  admin_id: admin

master:
  node_size: Standard_D2s_v3

grass-start-job

You can replace {project root} with a valid Linux path, e.g. /home/admin.

The data you push will then be mounted into this folder.

mode: grass
name: myJob1

allocation:
  mode: single-metric-balanced
  metric: cpu

components:
  actor:
    command: "python {project root}/myTrainingData/dqn/job1/start_actor.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 1
      gpu: 0
      memory: 1024m
  learner:
    command: "python {project root}/myTrainingData/dqn/job1/start_learner.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
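
Replacing the {project root} placeholder by hand in every command and mount target is easy to get wrong, so a small helper can do it before the deployment file is passed to the CLI. The following is a minimal sketch, not part of MARO: the function name render_deployment is hypothetical, and it assumes the deployment is held as plain text with the literal placeholder "{project root}".

```python
from pathlib import PurePosixPath

# The literal placeholder used in the sample deployments.
PLACEHOLDER = "{project root}"


def render_deployment(template: str, project_root: str) -> str:
    """Substitute the placeholder with an absolute Linux path."""
    root = PurePosixPath(project_root)
    if not root.is_absolute():
        raise ValueError(f"project root must be an absolute Linux path: {project_root}")
    # str(root) drops any trailing slash, so joined paths stay well-formed.
    return template.replace(PLACEHOLDER, str(root))
```

For example, rendering the actor command with /home/admin yields "python /home/admin/myTrainingData/dqn/job1/start_actor.py", which matches the location the pushed data is mounted into.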

grass-start-schedule

mode: grass
name: mySchedule1

allocation:
  mode: single-metric-balanced
  metric: cpu

job_names:
  - myJob2
  - myJob3
  - myJob4
  - myJob5

components:
  actor:
    command: "python {project root}/myTrainingData/dqn/schedule1/actor.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 1
      gpu: 0
      memory: 1024m
  learner:
    command: "python {project root}/myTrainingData/dqn/schedule1/learner.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m