Run an MXJob
This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Training Operator MXJobs.
This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.
Before you begin
Check administer cluster quotas for details on the initial cluster setup.
Check the Training Operator installation guide.
Note that the minimum required Training Operator version is v1.7.0.
You can modify the Kueue configuration of an installed release to include MXJobs as an allowed workload.
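As a rough sketch of what that configuration change might look like, the fragment below shows the integrations section of a Kueue Configuration with MXJob enabled. It assumes the v1beta1 configuration API and that the configuration lives in the Kueue manager ConfigMap of your installation. The other entries in the list are illustrative; keep whatever your installed release already enables and add the MXJob entry.

```yaml
# Sketch only: the surrounding fields of your installed Configuration are
# omitted, and the frameworks other than MXJob are illustrative assumptions.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"
    # Enables Kueue management of Training Operator MXJobs.
    - "kubeflow.org/mxjob"
```

After changing the configuration, the Kueue controller manager usually needs to be restarted so that it picks up the new list of frameworks.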
Note
With Kueue versions prior to v0.8.1, you need to restart Kueue after installing the Training Operator. You can do so by running:
kubectl delete pods -l control-plane=controller-manager -n kueue-system
MXJob definition
a. Queue selection
The target local queue should be specified in the metadata.labels section of the MXJob configuration.
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
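The label value must match the name of a LocalQueue that exists in the same namespace as the MXJob. A minimal sketch of such a LocalQueue follows. The namespace (default) and the ClusterQueue name (cluster-queue) are assumptions for illustration; use the names from your own cluster setup.

```yaml
# Sketch only: namespace and clusterQueue are assumed names.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default      # must be the namespace the MXJob is created in
  name: user-queue        # referenced by the kueue.x-k8s.io/queue-name label
spec:
  clusterQueue: cluster-queue   # assumed ClusterQueue that holds the quota
```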
b. Optionally set the suspend field in MXJobs
spec:
  runPolicy:
    suspend: true
By default, Kueue sets suspend to true via a webhook and unsuspends the MXJob when it is admitted.
Sample MXJob
This example is based on https://github.com/kubeflow/training-operator/blob/a4c0cec561a4bfe478720f1a102f305ed656071b/examples/mxnet/mxjob_dist_v1.yaml.
apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  name: mxnet-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              resources:
                limits:
                  cpu: 100m
                  memory: 0.2Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              resources:
                limits:
                  cpu: 100m
                  memory: 0.2Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: kubeflow/mxnet-gpu:latest
              command:
                - python3
              args:
                - /mxnet/mxnet/example/image-classification/train_mnist.py
                - --num-epochs=1
                - --num-layers=2
                - --kv-store=dist_device_sync
              resources:
                limits:
                  cpu: 2
                  memory: 1Gi
              ports:
                - containerPort: 9991
                  name: mxjob-port
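For the sample to be admitted, the ClusterQueue behind user-queue needs enough quota for all three replicas together. Assuming the resource requests default to the limits shown above, the job asks for 100m + 100m + 2 = 2.2 CPU and 0.2Gi + 0.2Gi + 1Gi = 1.4Gi of memory. The sketch below shows a ClusterQueue and ResourceFlavor that would cover this. The object names and the quota values (3 CPU, 2Gi) are assumptions chosen only to leave some headroom; they are not taken from the original guide.

```yaml
# Sketch only: names and quota values are illustrative assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}   # empty selector: accept workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 3      # sample job needs about 2.2 CPU in total
            - name: memory
              nominalQuota: 2Gi    # sample job needs about 1.4Gi in total
```

If the quota is smaller than the job's total, the MXJob stays suspended until enough quota becomes available.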