K8S Cluster Provisioning on Azure
This guide shows how to build a MARO cluster in k8s/aks mode on Azure and run your training jobs in a distributed environment.
Prerequisites
Install Docker and configure it so that it can be managed as a non-root user.
Download AzCopy, then copy the AzCopy executable to a folder on your system path (such as /usr/local/bin), or add the directory that contains the executable to your system path:
# Take AzCopy version 10.6.0 as an example
# Linux
tar xvf ./azcopy_linux_amd64_10.6.0.tar.gz; cp ./azcopy_linux_amd64_10.6.0/azcopy /usr/local/bin
# macOS (may require a change in the macOS Security & Privacy settings)
unzip ./azcopy_darwin_amd64_10.6.0.zip; cp ./azcopy_darwin_amd64_10.6.0/azcopy /usr/local/bin
# Windows
# 1. Unzip ./azcopy_windows_amd64_10.6.0.zip
# 2. Add the path of ./azcopy_windows_amd64_10.6.0 folder to your Environment Variables
# Ref: https://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10
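Before creating a cluster, it can help to confirm that both tools are reachable from your shell. The following is a minimal sketch and not part of MARO: the helper name check_prereq is hypothetical, and it relies only on the POSIX `command -v` builtin. It does not verify that Docker can actually be used as a non-root user.

```shell
#!/bin/sh
# Hypothetical helper (not part of MARO): report whether a command is on PATH.
check_prereq() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the two tools this guide depends on; keep going even if one is missing.
missing=0
for tool in docker azcopy; do
    check_prereq "$tool" || missing=$((missing + 1))
done
echo "$missing prerequisite(s) missing"
```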
Cluster Management
Create a cluster with a deployment
# Create a k8s cluster
maro k8s create ./k8s-azure-create.yml
Scale the cluster
Check `VM Size <https://docs.microsoft.com/en-us/azure/virtual-machines/sizes>`_ to see more node specifications.
# Scale nodes with 'Standard_D4s_v3' specification to 2
maro k8s node scale myK8sCluster Standard_D4s_v3 2
# Scale nodes with 'Standard_D2s_v3' specification to 0
maro k8s node scale myK8sCluster Standard_D2s_v3 0
Delete the cluster
# Delete a k8s cluster
maro k8s delete myK8sCluster
Run Job
Push your training image
# Push image 'myImage' to the cluster
maro k8s image push myK8sCluster --image-name myImage
Push your training data
# Push the 'dqn' folder under './myTrainingData/' to the relative path '/myTrainingData' in the cluster
# You can then assign your mapping location in the start-job-deployment
maro k8s data push myK8sCluster ./myTrainingData/dqn /myTrainingData
Start a training job with a deployment
# Start a training job with a start-job-deployment
maro k8s job start myK8sCluster ./k8s-start-job.yml
Or, schedule batch jobs with a deployment
# Start a training schedule with a start-schedule-deployment
maro k8s schedule start myK8sCluster ./k8s-start-schedule.yml
Get the logs of the job
# Logs will be exported to the current directory
maro k8s job logs myK8sCluster myJob1
List the current status of the job
# List current status of jobs
maro k8s job list myK8sCluster myJob1
Stop a training job
# Stop a training job
maro k8s job stop myK8sCluster myJob1
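The job steps in this section can be chained into one script. The following is a minimal sketch under stated assumptions: the `run` wrapper, the `run_job_workflow` function, and the DRY_RUN variable are hypothetical names introduced here, and the script only prints the commands unless DRY_RUN=0 is set. It also assumes the data push targets the same cluster as the other commands. The `maro` commands themselves are the ones listed in this guide.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command by default; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

CLUSTER="myK8sCluster"   # cluster name used throughout this guide
JOB="myJob1"             # job name from the start-job deployment

# Push the image and data, start the job, then query its status and logs.
# Each step runs only if the previous one succeeded.
run_job_workflow() {
    run maro k8s image push "$CLUSTER" --image-name myImage &&
    run maro k8s data push "$CLUSTER" ./myTrainingData/dqn /myTrainingData &&
    run maro k8s job start "$CLUSTER" ./k8s-start-job.yml &&
    run maro k8s job list "$CLUSTER" "$JOB" &&
    run maro k8s job logs "$CLUSTER" "$JOB"
}

run_job_workflow
```

In a real run, logs fetched immediately after the job starts may still be incomplete, so the last step is usually repeated once the job has made progress.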
Sample Deployments
k8s-aks-create
mode: k8s/aks
name: myK8sCluster

cloud:
  subscription: mySubscription
  resource_group: myResourceGroup
  location: eastus
  default_public_key: "{ssh public key}"
  default_username: admin

master:
  node_size: Standard_D2s_v3
k8s-start-job
mode: k8s/aks
name: myJob1

components:
  actor:
    command: ["python", "{project root}/myTrainingData/dqn/start_actor.py"]
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048M
  learner:
    command: ["python", "{project root}/myTrainingData/dqn/start_learner.py"]
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048M
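When choosing how many nodes to scale to, it is useful to add up what a job deployment asks for. The sketch below does the arithmetic for the k8s-start-job sample (5 actors and 1 learner, each with 2 CPUs and 2048M of memory). It assumes that the `resources` values apply to each replica of a component, which this guide does not state explicitly; the helper name `total` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: total of a per-replica resource across all replicas.
# Usage: total <replicas> <amount-per-replica>
total() {
    echo $(( $1 * $2 ))
}

# Values copied from the k8s-start-job sample.
actor_cpu=$(total 5 2)        # 5 actors x 2 CPUs
learner_cpu=$(total 1 2)      # 1 learner x 2 CPUs
actor_mem=$(total 5 2048)     # 5 actors x 2048M
learner_mem=$(total 1 2048)   # 1 learner x 2048M

echo "CPUs requested:   $(( actor_cpu + learner_cpu ))"
echo "Memory requested: $(( actor_mem + learner_mem ))M"
```

For this sample the totals are 12 CPUs and 12288M of memory. Compare them with the combined capacity of the node sizes you scale to, as listed on the VM Size page.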
k8s-start-schedule
mode: k8s/aks
name: mySchedule1

job_names:
  - myJob2
  - myJob3
  - myJob4
  - myJob5

components:
  actor:
    command: ["python", "{project root}/myTrainingData/dqn/start_actor.py"]
    image: myImage
    mount:
      target: "{project root}"
    num: 5
    resources:
      cpu: 2
      gpu: 0
      memory: 2048M
  learner:
    command: ["python", "{project root}/myTrainingData/dqn/start_learner.py"]
    image: myImage
    mount:
      target: "{project root}"
    num: 1
    resources:
      cpu: 2
      gpu: 0
      memory: 2048M
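A schedule lists several job names, so collecting their logs means repeating the same command once per name. The following is a minimal sketch that only prints those commands for review. It assumes that jobs created by a schedule can be addressed with `maro k8s job logs` using the names under `job_names`, which this guide does not confirm; the function name `print_log_commands` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one 'job logs' command per scheduled job name.
# Assumption: scheduled jobs are addressable by the names in 'job_names'.
# Usage: print_log_commands <cluster> <job-name>...
print_log_commands() {
    cluster="$1"
    shift
    for job in "$@"; do
        echo "maro k8s job logs $cluster $job"
    done
}

# Job names copied from the k8s-start-schedule sample.
print_log_commands myK8sCluster myJob2 myJob3 myJob4 myJob5
```

Once the printed commands look right, they can be run by piping the output to `sh`.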