Grass Cluster Provisioning on Azure
This guide shows how to build a MARO cluster in grass/azure mode on Azure and run a training job in a distributed environment.
Prerequisites
Cluster Management
Create a cluster with a deployment
# Create a grass cluster with a grass-create deployment
maro grass create ./grass-azure-create.yml
Scale the cluster
See the Azure VM sizes documentation for more node specifications.
# Scale nodes with 'Standard_D4s_v3' specification to 2
maro grass node scale myGrassCluster Standard_D4s_v3 2
# Scale nodes with 'Standard_D2s_v3' specification to 0
maro grass node scale myGrassCluster Standard_D2s_v3 0
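When several node sizes need to change at once, it can help to describe the target node counts in one place and derive the commands from that. The sketch below only builds the argument lists for the "maro grass node scale" command shown above; the helper name scale_commands and the mapping layout are illustrative assumptions, not part of MARO.

```python
import shlex


def scale_commands(cluster_name, desired_nodes):
    """Build one `maro grass node scale` argument list per node size.

    `desired_nodes` maps a node size (e.g. 'Standard_D4s_v3') to the
    target node count. This helper is hypothetical, not a MARO API.
    """
    commands = []
    for node_size, count in desired_nodes.items():
        if count < 0:
            raise ValueError(f"node count for {node_size} must be >= 0, got {count}")
        commands.append(
            ["maro", "grass", "node", "scale", cluster_name, node_size, str(count)]
        )
    return commands


if __name__ == "__main__":
    desired = {"Standard_D4s_v3": 2, "Standard_D2s_v3": 0}
    for argv in scale_commands("myGrassCluster", desired):
        # Printed for review only; to execute, pass argv to
        # subprocess.run(argv, check=True) on a machine with the maro CLI.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review the change before any nodes are created or removed.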
Delete the cluster
# Delete a grass cluster
maro grass delete myGrassCluster
Start/Stop nodes to save costs
# Start 2 nodes with 'Standard_D4s_v3' specification
maro grass node start myGrassCluster Standard_D4s_v3 2
# Stop 2 nodes with 'Standard_D4s_v3' specification
maro grass node stop myGrassCluster Standard_D4s_v3 2
Get statuses of the cluster
# Get master status
maro grass status myGrassCluster master
# Get nodes status
maro grass status myGrassCluster nodes
# Get containers status
maro grass status myGrassCluster containers
Clean up the cluster
Delete all running jobs, schedules, and containers in the cluster.
maro grass clean myGrassCluster
Run Job
Push your training image from local machine
# Push image 'myImage' to the cluster.
# 'myImage' is a docker image that is loaded on the machine that executes this command.
maro grass image push myGrassCluster --image-name myImage
Push your training data
# Push the 'dqn' folder under './myTrainingData/' to the relative path '/myTrainingData' in the cluster.
# You can then assign your mapping location in the start-job deployment.
maro grass data push myGrassCluster ./myTrainingData/dqn /myTrainingData
Start a training job with a start-job-deployment
# Start a training job with a start-job deployment
maro grass job start myGrassCluster ./grass-start-job.yml
Or, schedule batch jobs with a start-schedule-deployment
These jobs share the same component specification.
A best practice for this command: push all your training configs at once with "maro grass data push", get the jobName from the environment variables in the containers, then use the specific training config based on the jobName.
# Start a training schedule with a start-schedule deployment
maro grass schedule start myGrassCluster ./grass-start-schedule.yml
Get the logs of the job
# Get the logs of the job
maro grass job logs myGrassCluster myJob1
List the current status of the job
# List the current status of the job
maro grass job list myGrassCluster
Stop a training job
# Stop a training job
maro grass job stop myGrassCluster myJob1
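The best practice noted above for scheduled jobs (push all configs once, then pick one by job name inside the container) could look roughly like the sketch below. The environment variable name JOB_NAME and the configs/<jobName>/config.yml layout are assumptions made for illustration; check which variables your containers actually expose and how your pushed data is laid out.

```python
import os
from pathlib import PurePosixPath

# Assumption: the container exposes the job name under this variable.
# "JOB_NAME" is a hypothetical name used only for this sketch.
JOB_NAME_ENV_VAR = "JOB_NAME"


def resolve_config_path(config_root, env=None, env_var=JOB_NAME_ENV_VAR):
    """Return the config path for the job this container belongs to.

    Assumes one sub-folder per job name under `config_root`, each holding
    a 'config.yml' (a hypothetical layout, not prescribed by MARO).
    """
    env = os.environ if env is None else env
    job_name = env.get(env_var)
    if not job_name:
        raise RuntimeError(f"environment variable {env_var!r} is not set")
    return PurePosixPath(config_root) / job_name / "config.yml"


if __name__ == "__main__":
    # Simulate the container environment of job 'myJob2' from the schedule.
    fake_env = {"JOB_NAME": "myJob2"}
    root = "/home/admin/myTrainingData/dqn/schedule1/configs"
    print(resolve_config_path(root, env=fake_env))
```

Passing the environment in as a parameter keeps the lookup easy to test locally before the image is pushed to the cluster.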
Sample Deployments
grass-azure-create
mode: grass/azure
name: myGrassCluster
cloud:
  resource_group: myResourceGroup
  subscription: mySubscription
  location: eastus
  default_username: admin
  default_public_key: "{ssh public key}"
user:
  admin_id: admin
master:
  node_size: Standard_D2s_v3
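The sample above still contains the placeholder "{ssh public key}", which has to be replaced before the deployment is used. A small pre-flight check such as the following sketch can flag any value that still holds a "{...}" placeholder. The function name find_placeholders is illustrative, and the deployment is written as a Python dict here; in practice it would typically come from parsing the YAML file (for example with PyYAML's yaml.safe_load, an assumed extra dependency).

```python
import re

# Matches any '{...}' placeholder such as '{ssh public key}' or '{project root}'.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def find_placeholders(node, path=""):
    """Return dotted paths of all string values that still contain a placeholder."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            found.extend(find_placeholders(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        found.append(path)
    return found


# Mirrors the grass-azure-create sample as a plain dict.
deployment = {
    "mode": "grass/azure",
    "name": "myGrassCluster",
    "cloud": {
        "resource_group": "myResourceGroup",
        "subscription": "mySubscription",
        "location": "eastus",
        "default_username": "admin",
        "default_public_key": "{ssh public key}",
    },
    "user": {"admin_id": "admin"},
    "master": {"node_size": "Standard_D2s_v3"},
}

if __name__ == "__main__":
    for unfilled in find_placeholders(deployment):
        print(f"placeholder not replaced: {unfilled}")
```

The same check applies to the job and schedule deployments, where "{project root}" must be replaced in the command and mount target fields.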
grass-start-job
Replace {project root} with a valid Linux path, e.g. /home/admin.
The data you push will then be mounted into this folder.
mode: grass
name: myJob1
allocation:
  mode: single-metric-balanced
  metric: cpu
components:
  actor:
    command: "python {project root}/myTrainingData/dqn/job1/start_actor.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 1
      gpu: 0
      memory: 1024m
  learner:
    command: "python {project root}/myTrainingData/dqn/job1/start_learner.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m
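To judge how many nodes a job like this needs, it helps to add up what its components request. The sketch below sums CPU, GPU, and memory over all component instances and derives a rough lower bound on the node count. The helper names are illustrative, only the "m" memory suffix used in the sample is handled, and the node capacity figures (4 vCPUs and 16 GiB for Standard_D4s_v3) should be checked against the Azure VM size documentation. The bound ignores system overhead and packing constraints.

```python
import math


def parse_memory_mb(text):
    """Convert a memory string such as '1024m' to an integer number of megabytes."""
    if not text.lower().endswith("m"):
        raise ValueError(f"only the 'm' suffix is handled in this sketch: {text!r}")
    return int(text[:-1])


def total_request(components):
    """Sum the resources requested by all instances of all components."""
    total = {"cpu": 0, "gpu": 0, "memory_mb": 0}
    for spec in components.values():
        num = spec["num"]
        res = spec["resources"]
        total["cpu"] += num * res["cpu"]
        total["gpu"] += num * res["gpu"]
        total["memory_mb"] += num * parse_memory_mb(res["memory"])
    return total


def min_nodes(total, node_cpu, node_memory_mb):
    """Lower bound on node count from CPU and memory alone (no overhead, no packing)."""
    return max(
        math.ceil(total["cpu"] / node_cpu),
        math.ceil(total["memory_mb"] / node_memory_mb),
    )


# Resource section of the grass-start-job sample.
components = {
    "actor": {"num": 5, "resources": {"cpu": 1, "gpu": 0, "memory": "1024m"}},
    "learner": {"num": 1, "resources": {"cpu": 2, "gpu": 0, "memory": "2048m"}},
}

if __name__ == "__main__":
    total = total_request(components)
    print(total)
    # Assumed capacity of one Standard_D4s_v3 node: 4 vCPUs, 16 GiB memory.
    print(min_nodes(total, node_cpu=4, node_memory_mb=16 * 1024))
```

For the sample job the request comes to 7 CPUs and 7168 MB of memory, so CPU is the limiting resource and at least two Standard_D4s_v3 nodes are needed under these assumptions.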
grass-start-schedule
mode: grass
name: mySchedule1
allocation:
  mode: single-metric-balanced
  metric: cpu
job_names:
  - myJob2
  - myJob3
  - myJob4
  - myJob5
components:
  actor:
    command: "python {project root}/myTrainingData/dqn/schedule1/actor.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 1
      gpu: 0
      memory: 1024m
  learner:
    command: "python {project root}/myTrainingData/dqn/schedule1/learner.py"
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048m