Get started with the ML Diagnostics CLI

Use the ML Diagnostics Google Cloud CLI to create a machine learning run, deploy XProf as a managed instance with a scalable backend, and provide a managed profiling experience on Google Cloud.

There are two categories of ML Diagnostics gcloud CLI commands: machine-learning-run commands and profiler commands. Use the machine-learning-run commands to create, delete, describe, list, and update machine learning runs. Use the profiler commands to list nodes and capture on-demand profiles from the CLI.

  • Machine-learning-run commands: Create, Delete, Describe, List, Update.
  • Profiler commands:
    • profiler-target: List
    • profiler-session: Capture, List

All gcloud CLI commands require a project defined in the environment. To set the project:

gcloud config set project PROJECT_ID

For more information on the ML Diagnostics gcloud CLI commands, see the API reference.

Capture profiles

You can capture XProf profiles of your ML workload with programmatic capture or on-demand (manual) capture. Programmatic capture embeds profiling commands directly in your machine learning code and states explicitly when to start and stop recording data. On-demand capture happens in real time: you trigger the profiler while the workload is already running.

To enable on-demand profile capture, start the XProf server from within your code by calling the profiler.start_server method. This starts an XProf server on your ML workload that listens for the on-demand capture trigger. Use port 9999 for this call: profiler.start_server(port=9999)
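The following is a minimal sketch of that call wrapped in a helper. It assumes that the profiler module mentioned above is jax.profiler and that JAX is installed in the workload image. The helper name start_on_demand_server is hypothetical.

```python
XPROF_PORT = 9999  # port named in the text above


def start_on_demand_server(port: int = XPROF_PORT):
    """Start the profiler server so on-demand capture requests can reach this process.

    Assumption: "profiler" in the text refers to jax.profiler. The import is
    deferred so this module still loads where JAX is not installed.
    """
    import jax.profiler

    # Keep a reference to the returned object for the life of the workload;
    # in some JAX versions the server lives only as long as that object.
    return jax.profiler.start_server(port)
```

Call the helper once near the start of the training script, before the training loop, for example: server = start_on_demand_server().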

For both programmatic and on-demand profile capture, specify the location to store the captured profiles. For example: gs://my-bucket/my-run. Profiles are stored in directories nested within that location, for example: gs://my-bucket/my-run/plugins/profile/session1/. Programmatic capture and on-demand capture must not run during the same time period.
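As an illustration of the directory layout, the sketch below derives a session directory from a base location. The function name profile_session_dir is hypothetical, and the gs:// check only reflects the examples in this section.

```python
def profile_session_dir(base_location: str, session_id: str) -> str:
    """Return the directory where one session's profiles are stored."""
    if not base_location.startswith("gs://"):
        raise ValueError("expected a Cloud Storage location such as gs://my-bucket/my-run")
    # Profiles are nested under plugins/profile/<session> within the location.
    return f"{base_location.rstrip('/')}/plugins/profile/{session_id}/"


print(profile_session_dir("gs://my-bucket/my-run", "session1"))
# prints gs://my-bucket/my-run/plugins/profile/session1/
```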

For on-demand profile capture, set up a GKE cluster and deploy the workload with the label: managed-mldiagnostics-gke=true.

For more information about profiling with JAX, see Profiling computation.

Create machine learning run

Create a machine learning run resource in a specified project and location. The machine-learning-run create command deploys XProf as a managed instance in your project. The managed XProf instance is used for viewing all profiles in the project, and is created when the first machine learning run is created in the project.

Use the machine-learning-run create command:

gcloud alpha mldiagnostics machine-learning-run create

There are two ways to create a machine learning run:

  • Register existing captured profiles to the ML Diagnostics platform.
  • Use ML Diagnostics to perform on-demand profile capture by registering an active run. This requires a GKE cluster setup and a deployed workload on GKE with the label managed-mldiagnostics-gke=true.

Create ML Run and register existing captured profiles

The following code creates a run and registers existing captured profiles to ML Diagnostics:

gcloud alpha mldiagnostics machine-learning-run create RUN_NAME \
  --location LOCATION \
  --run-group GROUP_NAME \
  --gcs-path gs://BUCKET_NAME \
  --display-name DISPLAY_NAME \
  --labels "list_existing_sessions_only"="true"

The code example uses the following flags:

Flag Requirement Description
machine-learning-run Required A unique identifier for this specific run. If the name is not unique, the run creation fails with the message: "ML Run already exists".
location Required All Cluster Director locations are supported except us-east5. This flag can be set by an argument for each command, or with the command: gcloud config set compute/region.
gcs-path Required The Google Cloud Storage location where all profiles are saved. For example: gs://my-bucket or gs://my-bucket/folder1. Required only if the SDK is used for profile capture.
run-group Optional An identifier that can help group multiple runs belonging to the same experiment. For example, all runs associated with a TPU slice size sweep could belong to the same group.
display-name Optional Display name for the machine learning run. If not provided, it is set to machine learning run ID.

The --labels list_existing_sessions_only=true flag is required if you want to view and manage existing collected profiles in ML Diagnostics. The flag does the following:

  1. Creates a machine learning run with state "Completed".
  2. Recursively searches for xplane.pb files within the Cloud Storage directory path.
  3. Loads all located profile sessions into the ML Diagnostics database so that you can view them in Google Cloud, creates shareable links for the profile sessions, and lets users manage these profiles with the ML Diagnostics platform.
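To make the recursive search in step 2 concrete, here is a local-filesystem analogy. It is not the service's implementation: the function name find_profile_sessions and the file names such as host0.xplane.pb are made up for illustration, and a temporary directory stands in for the Cloud Storage path.

```python
from pathlib import Path
import tempfile


def find_profile_sessions(root: Path) -> dict[str, list[str]]:
    """Group xplane.pb files under root by the directory that contains them."""
    sessions: dict[str, list[str]] = {}
    for pb_file in sorted(root.rglob("*xplane.pb")):
        session_dir = pb_file.parent.relative_to(root).as_posix()
        sessions.setdefault(session_dir, []).append(pb_file.name)
    return sessions


# Build a throwaway tree that mimics the plugins/profile/<session> layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("plugins/profile/session1/host0.xplane.pb",
                "plugins/profile/session1/host1.xplane.pb",
                "plugins/profile/session2/host0.xplane.pb",
                "plugins/profile/session2/notes.txt"):
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    found = find_profile_sessions(root)

for session, files in found.items():
    print(session, files)
```

Each directory that holds at least one xplane.pb file is treated as one session in this sketch; files with other names are ignored.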

If the --labels list_existing_sessions_only flag is set to true for a run, you cannot perform on-demand profiling or update the run. You can only view and manage existing profiles.
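If you script run creation, one option is to assemble the command as an argument list and pass it to subprocess. The sketch below only builds and prints the command. It uses the flags from the example above; the function name and the sample values are hypothetical.

```python
import shlex


def create_existing_profiles_cmd(run_name, location, gcs_path,
                                 run_group=None, display_name=None):
    """Build the argv for registering already-captured profiles."""
    if location == "us-east5":
        raise ValueError("us-east5 is not a supported location")
    cmd = ["gcloud", "alpha", "mldiagnostics", "machine-learning-run", "create",
           run_name, "--location", location, "--gcs-path", gcs_path]
    if run_group:
        cmd += ["--run-group", run_group]
    if display_name:
        cmd += ["--display-name", display_name]
    # This label registers existing profiles instead of an active run.
    cmd += ["--labels", "list_existing_sessions_only=true"]
    return cmd


cmd = create_existing_profiles_cmd("my-run", "us-central1", "gs://my-bucket/my-run")
print(shlex.join(cmd))
```

On a machine with the gcloud CLI installed, the list can be passed to subprocess.run(cmd, check=True).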

Create ML Run to perform on-demand profile capture

The following code creates a machine learning run to perform on-demand profile capture:

gcloud alpha mldiagnostics machine-learning-run create RUN_NAME \
  --location LOCATION \
  --orchestrator gke \
  --run-group RUN_GROUP \
  --gcs-path gs://BUCKET_NAME \
  --display-name DISPLAY_NAME \
  --gke-cluster-name projects/PROJECT_ID/locations/LOCATION/clusters/CLUSTER_NAME \
  --gke-namespace NAMESPACE \
  --gke-workload-name WORKLOAD_NAME \
  --gke-kind GKE_KIND \
  --gke-workload-create-time CREATE_TIME \
  --run-phase RUN_PHASE

Along with the flags from the previous example, the code example uses the following additional flags:

Flag Requirement Description
orchestrator Optional The orchestrator used for the run. If not specified, gke is used by default. Valid values: gce, gke, slurm.
gke-cluster-name Required for GKE The cluster of the workload. For example: projects/<project_id>/locations/<location>/clusters/<cluster_name>.
gke-kind Required for GKE The kind of the workload. For example: JobSet.
gke-namespace Required for GKE The namespace of the workload. For example: default.
gke-workload-name Required for GKE The identifier of the workload. For example: jobset-abcd.
gke-workload-create-time Required for GKE The creation timestamp for a JobSet in ISO timestamp format. For example: 2026-02-20T06:00:00Z.
run-phase Optional Phase and state of a run. If not provided, it is ACTIVE by default.
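Two of these values have a fixed shape: the cluster resource name and the ISO timestamp. The helpers below are a sketch of how you might format them; the function names and the project, cluster, and time values are hypothetical.

```python
from datetime import datetime, timezone


def gke_cluster_name(project_id: str, location: str, cluster: str) -> str:
    """Format the value for --gke-cluster-name."""
    return f"projects/{project_id}/locations/{location}/clusters/{cluster}"


def workload_create_time(created: datetime) -> str:
    """Format a timezone-aware datetime like 2026-02-20T06:00:00Z."""
    if created.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(gke_cluster_name("my-project", "us-central1", "my-cluster"))
# prints projects/my-project/locations/us-central1/clusters/my-cluster
print(workload_create_time(datetime(2026, 2, 20, 6, 0, tzinfo=timezone.utc)))
# prints 2026-02-20T06:00:00Z
```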

Describe machine learning run

View the details of a machine learning run with the machine-learning-run describe command:

gcloud alpha mldiagnostics machine-learning-run describe RUN_NAME --format=FORMAT

The following example is a request for run details in JSON:

gcloud alpha mldiagnostics machine-learning-run describe my-run-on-demand \
  --format json

The output is similar to the following:

{
  "artifacts": {
    "gcsPath": "gs://my-bucket"
  },
  "createTime": "2026-02-05T16:25:28.367865234Z",
  "displayName": "mldiagnostics-my-run-on-demand",
  "endTime": "0001-01-01T00:00:00Z",
  "etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6",
  "name": "projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand",
  "orchestrator": "GKE",
  "runPhase": "ACTIVE",
  "runSet": "my-run-on-demand-group",
  "tools": [
    {
      "XProf": {}
    }
  ],
  "updateTime": "2026-02-05T16:25:28.367865344Z",
  "workloadDetails": {
    "gke": {
      "cluster": "projects/163028815180/locations/us-central1/clusters/my-cluster",
      "id": "jobset-abcd",
      "kind": "JobSet",
      "namespace": "default"
    }
  }
}
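JSON output is convenient for scripts. The sketch below parses a trimmed copy of the sample output and pulls out the fields that later commands need, such as the etag. In practice the text would come from the standard output of the describe command rather than from an embedded string.

```python
import json

# Trimmed copy of the sample describe output shown above.
describe_output = """
{
  "etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6",
  "name": "projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand",
  "runPhase": "ACTIVE",
  "workloadDetails": {"gke": {"id": "jobset-abcd", "kind": "JobSet"}}
}
"""

run = json.loads(describe_output)
run_id = run["name"].rsplit("/", 1)[-1]   # last path segment is the run ID
etag = run["etag"]
gke = run.get("workloadDetails", {}).get("gke", {})

print(run_id, run["runPhase"], etag, gke.get("id"))
```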

List machine learning runs

Get a list of machine learning runs within a specified project and location with the machine-learning-run list command:

gcloud alpha mldiagnostics machine-learning-run list

The following example is a request for a list of up to two runs, with outputs of their URI paths:

gcloud alpha mldiagnostics machine-learning-run list --limit 2 --uri

The output is similar to the following:

https://hypercomputecluster.googleapis.com/v1alpha/projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand
https://hypercomputecluster.googleapis.com/v1alpha/projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand-2

Update machine learning runs

Update a machine learning run in a specified project and location. You can update the display name, run phase, orchestrator, and GKE workload details. You cannot change the run ID or location. Update a run with the machine-learning-run update command:

gcloud alpha mldiagnostics machine-learning-run update

Provide all fields that were included in the create request. If mandatory fields are omitted from the update request, they are overwritten with their default values.

The etag flag is mandatory and must be set to the latest ETag (entity tag) value for the ML Run resource. For more information, see Use entity tags for optimistic concurrency control. To find the current etag value, describe the run:

gcloud alpha mldiagnostics machine-learning-run describe RUN_NAME

The following is an example of a complete update request:

gcloud alpha mldiagnostics machine-learning-run update my-run-on-demand \
  --orchestrator gke \
  --run-group my-run-on-demand-group \
  --gcs-path gs://my-bucket \
  --display-name mldiagnostics-my-run-on-demand-completed \
  --gke-cluster-name projects/user/locations/us-central1/clusters/my-cluster \
  --gke-namespace default \
  --gke-workload-name jobset-abcd \
  --gke-kind JobSet \
  --gke-workload-create-time 2026-02-20T06:06:06Z \
  --run-phase COMPLETED \
  --etag 1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6
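Because an update must repeat every field from the create request, one approach is to rebuild the flags from a describe result and change only what you need. The sketch below does that for the sample run. The mapping from output fields to flags (for example, runSet to --run-group) is an assumption inferred from the examples in this section, and the workload create time is passed in separately because it does not appear in the sample output. The function name is hypothetical.

```python
def update_cmd_from_describe(run, create_time, overrides=None):
    """Rebuild every flag from a describe result, then apply the changes."""
    gke = run["workloadDetails"]["gke"]
    flags = {
        "orchestrator": run["orchestrator"].lower(),
        "run-group": run["runSet"],  # assumption: runSet mirrors --run-group
        "gcs-path": run["artifacts"]["gcsPath"],
        "display-name": run["displayName"],
        "gke-cluster-name": gke["cluster"],
        "gke-namespace": gke["namespace"],
        "gke-workload-name": gke["id"],
        "gke-kind": gke["kind"],
        "gke-workload-create-time": create_time,
        "run-phase": run["runPhase"],
        "etag": run["etag"],  # latest etag, required for the update
    }
    flags.update(overrides or {})
    run_id = run["name"].rsplit("/", 1)[-1]
    cmd = ["gcloud", "alpha", "mldiagnostics", "machine-learning-run", "update", run_id]
    for name, value in flags.items():
        cmd += [f"--{name}", value]
    return cmd


# Fields copied from the sample describe output shown earlier.
run = {
    "artifacts": {"gcsPath": "gs://my-bucket"},
    "displayName": "mldiagnostics-my-run-on-demand",
    "etag": "1f54a7f4-bd25-4f98-a91c-97bfa1c5b7a6",
    "name": "projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand",
    "orchestrator": "GKE",
    "runPhase": "ACTIVE",
    "runSet": "my-run-on-demand-group",
    "workloadDetails": {"gke": {
        "cluster": "projects/163028815180/locations/us-central1/clusters/my-cluster",
        "id": "jobset-abcd", "kind": "JobSet", "namespace": "default"}},
}
cmd = update_cmd_from_describe(run, "2026-02-20T06:06:06Z", {"run-phase": "COMPLETED"})
print(" ".join(cmd))
```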

Delete machine learning runs

Delete a machine learning run in a specified project and location with the machine-learning-run delete command:

gcloud alpha mldiagnostics machine-learning-run delete RUN_NAME

Deleting a machine learning run does not delete any data in Cloud Storage or Cloud Logging, and does not delete the GKE workload. It removes only the metadata related to the run within the ML Diagnostics system.

Profiler commands

You can use the profiler command group to list all profiles, find the GKE nodes of the workload where the XProf server is running, and capture on-demand profiles from the CLI.

List profiler targets

List all profiler targets associated with a machine learning run in a specified project and location:

gcloud alpha mldiagnostics profiler-target list --machine-learning-run RUN_NAME

This command requires the following:

  • On-demand XProf is enabled in the workload, which deploys the XProf server onto all nodes of the workload.
  • The GKE cluster is set up for ML Diagnostics, with the webhook and operator deployed.
  • The workload is deployed on GKE with the label: managed-mldiagnostics-gke=true.

The following is an example of a request:

gcloud alpha mldiagnostics profiler-target list \
  --machine-learning-run my-run-on-demand

The following is an example of the output:

---
hostname: gke-tpu-1f0789b5-jqx9
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-0-tcw2k
---
hostname: gke-tpu-1f0789b5-rxvf
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-1-dct59
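To reuse these targets in a later capture request, you can extract the target IDs from the output. The parser below is a small sketch that handles only the simple "key: value" blocks separated by --- lines, as in the sample above. The function name parse_targets is hypothetical.

```python
def parse_targets(output: str) -> list[dict[str, str]]:
    """Parse '---'-separated blocks of 'key: value' lines into dictionaries."""
    targets, current = [], {}
    for line in output.splitlines():
        line = line.strip()
        if line == "---":
            if current:
                targets.append(current)
            current = {}
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    if current:
        targets.append(current)
    return targets


sample = """---
hostname: gke-tpu-1f0789b5-jqx9
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-0-tcw2k
---
hostname: gke-tpu-1f0789b5-rxvf
name: projects/163028815180/locations/us-central1/machineLearningRuns/my-run-on-demand/profilerTargets/jobset-abcd-tpu-slice-0-1-dct59
"""

# The last path segment of each name is the target ID.
target_ids = [t["name"].rsplit("/", 1)[-1] for t in parse_targets(sample)]
print(target_ids)
```

In scripts, requesting JSON with the gcloud --format flag and parsing it with json.loads is likely more robust than parsing the default output.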

List profiler sessions

List all profiler sessions associated with a machine learning run in a specified project and location with the following command:

gcloud alpha mldiagnostics profiler-session list --machine-learning-run RUN_NAME

This command does not require GKE cluster setup, GKE workload labeling, or on-demand XProf enablement. It lists all profile sessions for the run, both programmatic and on-demand, so you can use it even if you only have programmatic profile captures.

The following is an example of a request:

gcloud alpha mldiagnostics profiler-session list \
  --machine-learning-run my-run-on-demand

Capture on-demand profiler sessions

You can capture an on-demand profiler session for a machine learning run on a specified set of nodes that the workload is running on (profiler targets).

This command requires the following:

  • On-demand XProf is enabled in the workload, which deploys the XProf server onto all nodes of the workload.
  • The GKE cluster is set up for ML Diagnostics, with the webhook and operator deployed.
  • The workload is deployed on GKE with the label: managed-mldiagnostics-gke=true.

The following is an example of a request:

gcloud alpha mldiagnostics profiler-session capture \
  profiler-session-on-demand \
  --machine-learning-run RUN_NAME \
  --targets TARGET \
  --duration DURATION

The example uses the following flags:

Flag Requirement Description
profiler-session-name Required Name of profiler session to be captured.
duration Required Duration for the profiler session capture. It is of Duration type. For example, specify a duration of 1s for 1 second, 400ms for 400 milliseconds, and 5m for 5 minutes.
targets Required IDs of the profiler targets, or fully qualified identifiers for the profiler targets. Each target must be one of the targets associated with the run.
device-tracer-level Optional Device tracer level for the session. Accepted values: device-tracer-level-enabled, device-tracer-level-disabled (default).
host-tracer-level Optional Host tracer level for the session. Accepted values: host-tracer-level-info (default), host-tracer-level-critical, host-tracer-level-disabled, host-tracer-level-verbose.
python-tracer-level Optional Python tracer level for the session. Accepted values: python-tracer-level-disabled (default), python-tracer-level-enabled.
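The sketch below checks a duration string against the three forms shown in the table (1s, 400ms, 5m) and then assembles a capture command. Two points are assumptions: other duration units may also be accepted by the CLI, and multiple targets are joined with commas, which is the usual convention for gcloud list flags. The function names are hypothetical.

```python
import re

# Only the units that appear in the examples above.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000}


def duration_ms(text: str) -> int:
    """Convert a duration such as 400ms, 1s, or 5m to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m)", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def capture_cmd(session_name, run_name, targets, duration):
    """Build the argv for an on-demand capture request."""
    duration_ms(duration)  # validate before calling gcloud
    return ["gcloud", "alpha", "mldiagnostics", "profiler-session", "capture",
            session_name,
            "--machine-learning-run", run_name,
            # Assumption: multiple targets are comma-separated.
            "--targets", ",".join(targets),
            "--duration", duration]


cmd = capture_cmd("profiler-session-on-demand", "my-run-on-demand",
                  ["jobset-abcd-tpu-slice-0-0-tcw2k"], "5m")
print(" ".join(cmd))
```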