Get started with the ML Diagnostics CLI
Use the ML Diagnostics Google Cloud CLI to create a machine learning run, deploy XProf as a managed instance with a scalable backend, and provide a managed profiling experience on Google Cloud.
There are two categories of ML Diagnostics gcloud CLI commands:
machine-learning-run commands and profiler commands. Use the
machine-learning-run commands to create, delete, describe, list, and update
machine learning runs. Use the profiler commands to list nodes and capture
on-demand profiles from the CLI.
- Machine-learning-run commands: Create, Delete, Describe, List, Update.
- Profiler commands:
  - profiler-target: List
  - profiler-session: Capture, List
All gcloud CLI commands require a project defined in the environment. To set the project:
gcloud config set project PROJECT_ID
For more information on the ML Diagnostics gcloud CLI commands, see the API reference.
Capture profiles
You can capture XProf profiles of your ML workload with programmatic capture or on-demand capture (manual capture). Programmatic capture involves embedding profiling commands directly in your machine learning code and explicitly stating when to start and stop recording data. On-demand capture occurs in real time: you trigger the profiler while the workload is already running.
To enable on-demand profile capture, start the XProf server within your code by
calling the profiler.start_server method. This starts an XProf server on your
ML workload that listens for the on-demand capture trigger to start capturing
profiles. Use port 9999 for this command:
profiler.start_server(port=9999)
For both programmatic and on-demand profile capture, specify the location to
store the captured profiles. For example: gs://my-bucket/my-run. Profiles are
stored in directories nested within that location, such as
gs://my-bucket/my-run/plugins/profile/session1/. Programmatic and on-demand
profile capture must not occur during the same time period.
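The nesting described above can be sketched as a small path helper. This is an illustrative sketch only: the function name profile_session_path and the session name "session1" are assumptions for the example, not part of the ML Diagnostics CLI or SDK.

```python
# Illustrative sketch of the profile storage layout described above.
# profile_session_path and "session1" are hypothetical names, not part of
# the ML Diagnostics CLI or SDK.
def profile_session_path(run_path: str, session: str) -> str:
    # XProf profiles land under <run_path>/plugins/profile/<session>/
    return f"{run_path.rstrip('/')}/plugins/profile/{session}/"

print(profile_session_path("gs://my-bucket/my-run", "session1"))
```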
For on-demand profile capture, set up a GKE cluster and deploy your
workload with the label: managed-mldiagnostics-gke=true.
For more information about profiling with JAX, see Profiling computation.
Create machine learning run
Create a machine learning run resource in a specified project and location. The
machine-learning-run create command deploys XProf as a managed instance in
your project. The managed XProf instance is used for viewing all profiles in the
project, and is created when the first machine learning run is created in the
project.
Use the machine-learning-run create command:
gcloud alpha mldiagnostics machine-learning-run create
There are two ways to create a machine learning run:
- Register existing captured profiles to the ML Diagnostics platform.
- Use ML Diagnostics to perform on-demand profile capture by registering an
active run. This requires a GKE cluster setup and a deployed workload on
GKE with the label
managed-mldiagnostics-gke=true.
Create ML Run and register existing captured profiles
The following command creates a run and registers existing captured profiles to ML Diagnostics:
gcloud alpha mldiagnostics machine-learning-run create RUN_NAME \
--location LOCATION \
--run-group GROUP_NAME \
--gcs-path gs://BUCKET_NAME \
--display-name DISPLAY_NAME \
--labels "list_existing_sessions_only"="true"
The code example uses the following flags:
| Flag | Requirement | Description |
|---|---|---|
| machine-learning-run | Required | A unique identifier for this specific run. If the name is not unique, run creation fails with the message: "ML Run already exists". |
| location | Required | All Cluster Director locations are supported except us-east5. This flag can be set as an argument for each command, or with the command: gcloud config set compute/region. |
| gcs-path | Required | The Cloud Storage location where all profiles are saved. For example: gs://my-bucket or gs://my-bucket/folder1. Required only if the SDK is used for profile capture. |
| run-group | Optional | An identifier that groups multiple runs belonging to the same experiment. For example, all runs associated with a TPU slice size sweep could belong to the same group. |
| display-name | Optional | Display name for the machine learning run. If not provided, it is set to the machine learning run ID. |
The --labels list_existing_sessions_only=true flag is required if you want to
view and manage existing collected profiles in ML Diagnostics. The flag does the
following:
- Creates a machine learning run with state "Completed".
- Recursively searches for xplane.pb files within the Cloud Storage directory path.
- Loads all located profile sessions into the ML Diagnostics database to view in Google Cloud, creates shareable links for the profile sessions, and allows users to manage these profiles with ML Diagnostics platform.
If the --labels list_existing_sessions_only flag is set to true for a run,
you cannot perform on-demand profiling or update the run. You can only view and
manage existing profiles.
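The recursive search step can be illustrated with a short sketch. A real run scans the Cloud Storage path; the same logic is shown here on a local directory tree, and the function name find_profile_sessions is an assumption for this example:

```python
# Sketch of the recursive xplane.pb search that list_existing_sessions_only
# performs. A real run scans Cloud Storage; this illustrates the logic on a
# local directory tree. find_profile_sessions is a hypothetical name.
import os

def find_profile_sessions(root: str) -> list[str]:
    """Return directories containing at least one xplane.pb file."""
    sessions = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith("xplane.pb") for name in filenames):
            sessions.append(dirpath)
    return sorted(sessions)
```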
Create ML Run to perform on-demand profile capture
The following command creates a machine learning run to perform on-demand
profile capture:
gcloud alpha mldiagnostics machine-learning-run create RUN_NAME \
--location LOCATION \
--orchestrator gke \
--run-group RUN_GROUP \
--gcs-path gs://BUCKET_NAME \
--display-name DISPLAY_NAME \
--gke-cluster-name projects/user/locations/LOCATION/clusters/CLUSTER_NAME \
--gke-namespace NAMESPACE \
--gke-workload-name WORKLOAD_NAME \
--gke-kind GKE_KIND \
--gke-workload-create-time CREATE_TIME \
--run-phase RUN_PHASE
Along with the flags from the previous example, the code example uses the following additional flags:
| Flag | Requirement | Description |
|---|---|---|
| orchestrator | Optional | The orchestrator used for the run. If not specified, gke is used by default. Valid values: gce, gke, slurm. |
| gke-cluster-name | Required for GKE | The cluster of the workload. For example: projects/<project_id>/locations/<location>/clusters/<cluster_name>. |
| gke-kind | Required for GKE | The kind of the workload. For example: JobSet. |
| gke-namespace | Required for GKE | The namespace of the workload. For example: default. |
| gke-workload-name | Required for GKE | The identifier of the workload. For example: jobset-abcd. |
| gke-workload-create-time | Required for GKE | The creation timestamp of a JobSet in ISO timestamp format. For example: 2026-02-20T06:00:00Z. |
| run-phase | Optional | Phase and state of the run. If not provided, it is ACTIVE by default. |
Describe machine learning run
View the details of a machine learning run with the machine-learning-run
describe command:
gcloud alpha mldiagnostics machine-learning-run describe RUN_NAME --format FORMAT
The following example is a request for run details in JSON:
gcloud alpha mldiagnostics machine-learning-run describe my-run-on-demand \
--format json
The output is similar to the following:
{
"artifacts": {
"gcsPath": "gs://my-bucket"
},
"createTime": "2026-02-05T16:25:28.367865234Z",
"displayName": "mldiagnostics-my-run-on-demand",
"endTime": "0001-01-01T00:00:00Z",
"etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6",
"name": "projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand",
"orchestrator": "GKE",
"runPhase": "ACTIVE",
"runSet": "my-run-on-demand-group",
"tools": [
{
"XProf": {}
}
],
"updateTime": "2026-02-05T16:25:28.367865344Z",
"workloadDetails": {
"gke": {
"cluster": "projects/163028815180/locations/us-central1/clusters/my-cluster",
"id": "jobset-abcd",
"kind": "JobSet",
"namespace": "default"
}
}
}
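Because describe can emit JSON, its output is easy to post-process. The following sketch extracts a few fields from a response like the one above; the embedded string stands in for the gcloud command's stdout (abbreviated from the sample response):

```python
# Sketch: post-process `describe --format json` output. The string below
# stands in for the gcloud command's stdout, abbreviated from the sample
# response above.
import json

describe_output = """
{
  "name": "projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand",
  "runPhase": "ACTIVE",
  "etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6",
  "workloadDetails": {"gke": {"kind": "JobSet", "id": "jobset-abcd"}}
}
"""

run = json.loads(describe_output)
run_id = run["name"].rsplit("/", 1)[-1]  # the last path segment is the run ID
print(run_id, run["runPhase"], run["etag"])
```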
List machine learning runs
Get a list of machine learning runs within a specified project and location with
the machine-learning-run list command:
gcloud alpha mldiagnostics machine-learning-run list
The following example is a request for a list of up to two runs, with outputs of their URI paths:
gcloud alpha mldiagnostics machine-learning-run list --limit 2 --uri
https://hypercomputecluster.googleapis.com/v1alpha/projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand
https://hypercomputecluster.googleapis.com/v1alpha/projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand-2
Update machine learning runs
Update a machine learning run in a specified project and location. You can
update the display name, run phase, orchestrator, and GKE
workload details. You cannot change the run ID and location. Update a run with
the machine-learning-run update command:
gcloud alpha mldiagnostics machine-learning-run update
Provide all fields that were included in the create request. If
mandatory fields are not provided in the update request, they are overridden by
the default values.
The etag flag is mandatory, and should be set to the latest ETag (entity tag)
value of the ML Run resource. For more information, see Use entity tags for
optimistic concurrency control. Use the following command
to find the current ETag value:
gcloud alpha mldiagnostics machine-learning-run describe RUN_NAME
The following is an example of a complete update request:
gcloud alpha mldiagnostics machine-learning-run update my-run-on-demand \
--orchestrator gke \
--run-group my-run-on-demand-group \
--gcs-path gs://my-bucket \
--display-name mldiagnostics-my-run-on-demand-completed \
--gke-cluster-name projects/user/locations/us-central1/clusters/my-cluster \
--gke-namespace default \
--gke-workload-name jobset-abcd \
--gke-kind JobSet \
--gke-workload-create-time 2026-02-20T06:06:06Z \
--run-phase COMPLETED \
--etag 1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6
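Because an update must re-supply mandatory fields and the latest ETag, one way to avoid typos is to assemble the flags from a prior describe response. A minimal sketch, assuming the JSON below stands in for the describe command's stdout; the field names follow the sample response shown earlier, and the phase change to COMPLETED is illustrative:

```python
# Sketch: build update flags from a previous describe response so mandatory
# fields keep their current values and the latest etag is carried forward.
# The JSON string stands in for the describe command's stdout.
import json

run = json.loads("""{
  "artifacts": {"gcsPath": "gs://my-bucket"},
  "displayName": "mldiagnostics-my-run-on-demand",
  "runSet": "my-run-on-demand-group",
  "etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6"
}""")

args = [
    "--gcs-path", run["artifacts"]["gcsPath"],
    "--display-name", run["displayName"],
    "--run-group", run["runSet"],
    "--run-phase", "COMPLETED",  # the phase change is illustrative
    "--etag", run["etag"],
]
print(" ".join(args))
```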
Delete machine learning runs
Delete a machine learning run in a specified project and location with the
machine-learning-run delete command:
gcloud alpha mldiagnostics machine-learning-run delete RUN_NAME
Deleting an ML run does not delete any data in Cloud Storage, Cloud Logging,
or the GKE workload. Deleting the ML run only deletes metadata related to the
run within the ML Diagnostics system.
Profiler commands
You can use the profiler command group to list all profiles, find the GKE nodes of a workload where the XProf server is running, and capture on-demand profiles from the CLI.
List profiler targets
List all profiler targets associated with a machine learning run in a specified project and location:
gcloud alpha mldiagnostics profiler-target list --machine-learning-run RUN_NAME
This command requires the following:
- On-demand XProf is enabled in the workload, which deploys the XProf server onto all nodes of the workload.
- GKE cluster is set up for ML Diagnostics, with deployed webhook and operator.
- Deployed workload on GKE with the label:
managed-mldiagnostics-gke=true.
The following is an example of a request:
gcloud alpha mldiagnostics profiler-target list \
--machine-learning-run my-run-on-demand
The following is an example of the output:
---
hostname: gke-tpu-1f0789b5-jqx9
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-0-tcw2k
---
hostname: gke-tpu-1f0789b5-rxvf
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-1-dct59
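The last path segment of each name is the profiler target ID, which matches the IDs that the capture command's --targets flag accepts. The following sketch splits output like the above into (hostname, target ID) pairs; the embedded string mirrors the sample output:

```python
# Sketch: split `profiler-target list` output into (hostname, target ID)
# pairs. The sample string mirrors the output above; the last path segment
# of each name is the target ID.
sample = """\
---
hostname: gke-tpu-1f0789b5-jqx9
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-0-tcw2k
---
hostname: gke-tpu-1f0789b5-rxvf
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-1-dct59
"""

targets = []
for block in sample.split("---"):
    fields = dict(
        line.split(": ", 1) for line in block.strip().splitlines() if ": " in line
    )
    if fields:
        targets.append((fields["hostname"], fields["name"].rsplit("/", 1)[-1]))
print(targets)
```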
List profiler sessions
List all profiler sessions associated with a machine learning run in a specified project and location with the following command:
gcloud alpha mldiagnostics profiler-session list --machine-learning-run RUN_NAME
This profiler command does not require GKE setup, GKE workload labeling, or on-demand XProf enablement. It lists all profile sessions, both programmatic and on-demand, so you can use it even if you only have programmatic profile captures.
The following is an example of a request:
gcloud alpha mldiagnostics profiler-session list \
--machine-learning-run my-run-on-demand
Capture on-demand profiler sessions
You can capture an on-demand profiler session for a machine learning run on a specified set of nodes that the workload is running on (profiler targets).
This command requires the following:
- On-demand XProf is enabled in the workload, which deploys the XProf server onto all nodes of the workload.
- GKE cluster is set up for ML Diagnostics, with deployed webhook and operator.
- Deployed workload on GKE with the label:
managed-mldiagnostics-gke=true.
The following is an example of a request:
gcloud alpha mldiagnostics profiler-session capture \
SESSION_NAME \
--machine-learning-run RUN_NAME \
--targets TARGET \
--duration DURATION
The example uses the following flags:
| Flag | Requirement | Description |
|---|---|---|
| profiler-session-name | Required | Name of the profiler session to be captured. |
| duration | Required | Duration for the profiler session capture. It is of Duration type. For example, specify a duration of 1s for 1 second, 400ms for 400 milliseconds, and 5m for 5 minutes. |
| targets | Required | IDs of the profiler targets, or fully qualified identifiers for the profiler targets. Must match the list of targets associated with the run. |
| device-tracer-level | Optional | Device tracer level for the session. Accepted values: device-tracer-level-enabled, device-tracer-level-disabled (default). |
| host-tracer-level | Optional | Host tracer level for the session. Accepted values: host-tracer-level-info (default), host-tracer-level-critical, host-tracer-level-disabled, host-tracer-level-verbose. |
| python-tracer-level | Optional | Python tracer level for the session. Accepted values: python-tracer-level-disabled (default), python-tracer-level-enabled. |