Random Search

Random search is a black-box algorithm for finding an optimal hyperparameter vector. It assumes nothing about the model, and because the trials are independent of each other they can be run in parallel.

Random search selects points at random from the entire search space.
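As a minimal sketch of the idea, assuming a toy objective that stands in for a real training run: each hyperparameter is drawn independently and uniformly from its range, the black-box objective is evaluated, and the best point is kept. The names `SEARCH_SPACE`, `sample_point`, `toy_objective` and `random_search` are illustrative only and are not part of Katib.

```python
import random

# Hypothetical search space: (type, min, max) per hyperparameter.
SEARCH_SPACE = {
    "learning_rate": ("double", 0.01, 0.05),
    "batch_size": ("int", 100, 200),
}


def sample_point(space, rng):
    """Draw one hyperparameter vector uniformly at random from the space."""
    point = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            point[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            point[name] = rng.uniform(low, high)
    return point


def toy_objective(point):
    """Stand-in for a training run; peaks at learning_rate=0.03, batch_size=150."""
    lr_term = 1.0 - abs(point["learning_rate"] - 0.03) / 0.04
    bs_term = 1.0 - abs(point["batch_size"] - 150) / 100.0
    return 0.5 * lr_term + 0.5 * bs_term


def random_search(space, objective, max_trials, seed=0):
    """Evaluate max_trials random points and return the best (score, point)."""
    rng = random.Random(seed)
    best_score, best_point = float("-inf"), None
    for _ in range(max_trials):
        point = sample_point(space, rng)
        score = objective(point)  # the model is treated as a black box
        if score > best_score:
            best_score, best_point = score, point
    return best_score, best_point


best_score, best_point = random_search(SEARCH_SPACE, toy_objective, max_trials=12)
print(best_score, best_point)
```

No trial depends on the result of another, which is why the loop body could be distributed across workers without changing the outcome.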


Random search provides good coverage of the search space when there are multiple hyperparameters. If you want a generic baseline, random search is a good place to start.

Now let us create a random search experiment using Katib.

Experiment

Let us start by creating an experiment.

Random search experiment
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-random
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: "kubeflow.org/v1"
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                      imagePullPolicy: IfNotPresent
                      command:
                        - "python"
                        - "/var/tf_mnist/mnist_with_summaries.py"
                        - "--log_dir=/train/metrics"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
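The `range` block at the end of the trial template appends one `name=value` argument per hyperparameter to the container command. The following is a rough Python equivalent of that expansion, offered only as an illustration: Katib renders the template with Go templates, `render_command` is a hypothetical helper, and the assignment values are made up.

```python
def render_command(base_command, parameter_assignments):
    """Append one "name=value" argument per suggested hyperparameter."""
    args = [f'{p["name"]}={p["value"]}' for p in parameter_assignments]
    return list(base_command) + args


# Fixed part of the command, as written in the trial template.
base_command = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
]

# Made-up assignments, shaped like the parameterAssignments of a suggestion.
assignments = [
    {"name": "--learning_rate", "value": "0.03"},
    {"name": "--batch_size", "value": "150"},
]

print(" ".join(render_command(base_command, assignments)))
# python /var/tf_mnist/mnist_with_summaries.py --log_dir=/train/metrics --learning_rate=0.03 --batch_size=150
```

Because the parameter names in the Experiment already carry the leading `--`, the rendered arguments are valid command-line flags for the training script.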


cd $HOME/tutorial/examples/v1alpha3
kubectl apply -f tfjob-random.yaml
Sample Output
experiment.kubeflow.org/tfjob-random created

Check that the Experiment tfjob-random has started.

kubectl -n kubeflow get experiment
Sample Output
NAME           STATUS    AGE
tfjob-random   Running   98s

Check the details of the Experiment tfjob-random.

kubectl -n kubeflow get experiment tfjob-random -o json
Sample Output
{
    "apiVersion": "kubeflow.org/v1alpha3",
    "kind": "Experiment",
    "metadata": {
        "annotations": {
            "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"kubeflow.org/v1alpha3\",\"kind\":\"Experiment\",\"metadata\":{\"annotations\":{},\"name\":\"tfjob-random\",\"namespace\":\"kubeflow\"},\"spec\":{\"algorithm\":{\"algorithmName\":\"random\"},\"maxFailedTrialCount\":3,\"maxTrialCount\":12,\"metricsCollectorSpec\":{\"collector\":{\"kind\":\"TensorFlowEvent\"},\"source\":{\"fileSystemPath\":{\"kind\":\"Directory\",\"path\":\"/train\"}}},\"objective\":{\"goal\":0.99,\"objectiveMetricName\":\"accuracy_1\",\"type\":\"maximize\"},\"parallelTrialCount\":3,\"parameters\":[{\"feasibleSpace\":{\"max\":\"0.05\",\"min\":\"0.01\"},\"name\":\"--learning_rate\",\"parameterType\":\"double\"},{\"feasibleSpace\":{\"max\":\"200\",\"min\":\"100\"},\"name\":\"--batch_size\",\"parameterType\":\"int\"}],\"trialTemplate\":{\"goTemplate\":{\"rawTemplate\":\"apiVersion: \\\"kubeflow.org/v1\\\"\\nkind: TFJob\\nmetadata:\\n name: {{.Trial}}\\n namespace: {{.NameSpace}}\\nspec:\\n tfReplicaSpecs:\\n Worker:\\n replicas: 1 \\n restartPolicy: OnFailure\\n template:\\n spec:\\n containers:\\n - name: tensorflow \\n image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0\\n imagePullPolicy: IfNotPresent\\n command:\\n - \\\"python\\\"\\n - \\\"/var/tf_mnist/mnist_with_summaries.py\\\"\\n - \\\"--log_dir=/train/metrics\\\"\\n {{- with .HyperParameters}}\\n {{- range .}}\\n - \\\"{{.Name}}={{.Value}}\\\"\\n {{- end}}\\n {{- end}}\"}}}}\n"
        },
        "creationTimestamp": "2019-10-27T02:46:02Z",
        "finalizers": [
            "update-prometheus-metrics"
        ],
        "generation": 2,
        "name": "tfjob-random",
        "namespace": "kubeflow",
        "resourceVersion": "21979",
        "selfLink": "/apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/tfjob-random",
        "uid": "e9f888cb-f863-11e9-88ef-080027c5bc64"
    },
    "spec": {
        "algorithm": {
            "algorithmName": "random",
            "algorithmSettings": null
        },
        "maxFailedTrialCount": 3,
        "maxTrialCount": 12,
        "metricsCollectorSpec": {
            "collector": {
                "kind": "TensorFlowEvent"
            },
            "source": {
                "fileSystemPath": {
                    "kind": "Directory",
                    "path": "/train"
                }
            }
        },
        "objective": {
            "goal": 0.99,
            "objectiveMetricName": "accuracy_1",
            "type": "maximize"
        },
        "parallelTrialCount": 3,
        "parameters": [
            {
                "feasibleSpace": {
                    "max": "0.05",
                    "min": "0.01"
                },
                "name": "--learning_rate",
                "parameterType": "double"
            },
            {
                "feasibleSpace": {
                    "max": "200",
                    "min": "100"
                },
                "name": "--batch_size",
                "parameterType": "int"
            }
        ],
        "trialTemplate": {
            "goTemplate": {
                "rawTemplate": "apiVersion: \"kubeflow.org/v1\"\nkind: TFJob\nmetadata:\n name: {{.Trial}}\n namespace: {{.NameSpace}}\nspec:\n tfReplicaSpecs:\n Worker:\n replicas: 1 \n restartPolicy: OnFailure\n template:\n spec:\n containers:\n - name: tensorflow \n image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0\n imagePullPolicy: IfNotPresent\n command:\n - \"python\"\n - \"/var/tf_mnist/mnist_with_summaries.py\"\n - \"--log_dir=/train/metrics\"\n {{- with .HyperParameters}}\n {{- range .}}\n - \"{{.Name}}={{.Value}}\"\n {{- end}}\n {{- end}}"
            }
        }
    },
    "status": {
        "completionTime": null,
        "conditions": [
            {
                "lastTransitionTime": "2019-10-27T02:46:02Z",
                "lastUpdateTime": "2019-10-27T02:46:02Z",
                "message": "Experiment is created",
                "reason": "ExperimentCreated",
                "status": "True",
                "type": "Created"
            }
        ],
        "currentOptimalTrial": {
            "observation": {
                "metrics": null
            },
            "parameterAssignments": null
        },
        "startTime": "2019-10-27T02:46:02Z"
    }
}

Under the hood, the Katib controller runs a reconcile loop that works to satisfy this Experiment request.
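To make the reconcile idea concrete, here is a deliberately simplified model of one decision such a loop has to make: how many new trials to start given the trial counts in the Experiment spec. It is a sketch under stated assumptions, not Katib's actual controller logic. The class and function names are hypothetical, the exact failure threshold rule is assumed, and stopping once the objective goal is reached is left out.

```python
from dataclasses import dataclass


@dataclass
class ExperimentSpec:
    parallel_trial_count: int
    max_trial_count: int
    max_failed_trial_count: int


@dataclass
class TrialCounts:
    running: int = 0
    succeeded: int = 0
    failed: int = 0


def trials_to_start(spec, counts):
    """Return how many new trials one reconcile pass would start."""
    # Assumption: scheduling stops once failures exceed the allowed count.
    if counts.failed > spec.max_failed_trial_count:
        return 0
    created = counts.running + counts.succeeded + counts.failed
    remaining_budget = spec.max_trial_count - created
    free_slots = spec.parallel_trial_count - counts.running
    return max(0, min(remaining_budget, free_slots))


spec = ExperimentSpec(parallel_trial_count=3, max_trial_count=12, max_failed_trial_count=3)
print(trials_to_start(spec, TrialCounts()))                        # 3
print(trials_to_start(spec, TrialCounts(running=2, succeeded=9)))  # 1
```

Each pass compares the desired state in the spec with the observed state and takes only the step needed to close the gap, which is the general shape of a Kubernetes reconcile loop.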

Trials

Suggestions

You can see Katib creating Suggestions using the random algorithm.

kubectl -n kubeflow get suggestions tfjob-random -o yaml
Sample Output - before suggestions are ready
apiVersion: kubeflow.org/v1alpha3
kind: Suggestion
metadata:
  creationTimestamp: "2019-10-27T02:57:58Z"
  generation: 1
  name: tfjob-random
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: tfjob-random
    uid: 94e07a51-f865-11e9-88ef-080027c5bc64
  resourceVersion: "24296"
  selfLink: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/suggestions/tfjob-random
  uid: 94e5930d-f865-11e9-88ef-080027c5bc64
spec:
  algorithmName: random
  requests: 3
status:
  conditions:
  - lastTransitionTime: "2019-10-27T02:57:58Z"
    lastUpdateTime: "2019-10-27T02:57:58Z"
    message: Suggestion is created
    reason: SuggestionCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2019-10-27T02:57:58Z"
    lastUpdateTime: "2019-10-27T02:57:58Z"
    message: Deployment is not ready
    reason: DeploymentNotReady
    status: "False"
    type: DeploymentReady
  startTime: "2019-10-27T02:57:58Z"

We now have a Suggestion resource created. The Katib Suggestion Service takes control and generates a deployment to run the specified Suggestion.

The suggestion service provides suggestions based on the current state of the system. On each new suggestion request, it reevaluates and provides the next best set of suggestions.

Sample Output - after suggestions are ready
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha3
  kind: Suggestion
  metadata:
    creationTimestamp: "2019-10-27T02:57:58Z"
    generation: 10
    name: tfjob-random
    namespace: kubeflow
    ownerReferences:
    - apiVersion: kubeflow.org/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: Experiment
      name: tfjob-random
      uid: 94e07a51-f865-11e9-88ef-080027c5bc64
    resourceVersion: "25675"
    selfLink: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/suggestions/tfjob-random
    uid: 94e5930d-f865-11e9-88ef-080027c5bc64
  spec:
    algorithmName: random
    requests: 12
  status:
    conditions:
    - lastTransitionTime: "2019-10-27T02:57:58Z"
      lastUpdateTime: "2019-10-27T02:57:58Z"
      message: Suggestion is created
      reason: SuggestionCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2019-10-27T02:58:16Z"
      lastUpdateTime: "2019-10-27T02:58:16Z"
      message: Deployment is ready
      reason: DeploymentReady
      status: "True"
      type: DeploymentReady
    - lastTransitionTime: "2019-10-27T02:59:16Z"
      lastUpdateTime: "2019-10-27T02:59:16Z"
      message: Suggestion is running
      reason: SuggestionRunning
      status: "True"
      type: Running
    startTime: "2019-10-27T02:57:58Z"
    suggestionCount: 12
    suggestions:
    - name: tfjob-random-npjpbgmd
      parameterAssignments:
      - name: --learning_rate
        value: "0.03684477847537918"
      - name: --batch_size
        value: "112"
    - name: tfjob-random-mmc8dqvq
      parameterAssignments:
      - name: --learning_rate
        value: "0.010960280128777096"
      - name: --batch_size
        value: "126"
    - name: tfjob-random-6h7229dt
      parameterAssignments:
      - name: --learning_rate
        value: "0.011672960430260329"
      - name: --batch_size
        value: "181"
    - name: tfjob-random-hfzrfh8j
      parameterAssignments:
      - name: --learning_rate
        value: "0.03510831325099869"
      - name: --batch_size
        value: "156"
    - name: tfjob-random-7kg9zhrt
      parameterAssignments:
      - name: --learning_rate
        value: "0.02709470325001432"
      - name: --batch_size
        value: "157"
    - name: tfjob-random-gng5qx9x
      parameterAssignments:
      - name: --learning_rate
        value: "0.021854230935173045"
      - name: --batch_size
        value: "148"
    - name: tfjob-random-5sfxkhmc
      parameterAssignments:
      - name: --learning_rate
        value: "0.011053371330636894"
      - name: --batch_size
        value: "131"
    - name: tfjob-random-7bzhkvvd
      parameterAssignments:
      - name: --learning_rate
        value: "0.039025808494984444"
      - name: --batch_size
        value: "139"
    - name: tfjob-random-xjm458qc
      parameterAssignments:
      - name: --learning_rate
        value: "0.023093126743054533"
      - name: --batch_size
        value: "105"
    - name: tfjob-random-zb89h929
      parameterAssignments:
      - name: --learning_rate
        value: "0.017877859019641958"
      - name: --batch_size
        value: "192"
    - name: tfjob-random-wqglhpqj
      parameterAssignments:
      - name: --learning_rate
        value: "0.018670804338535255"
      - name: --batch_size
        value: "191"
    - name: tfjob-random-484zhpzq
      parameterAssignments:
      - name: --learning_rate
        value: "0.029127223437729596"
      - name: --batch_size
        value: "133"

Trials

Once the suggestions are ready, the Katib Trial controller is ready to run the trials. Each Trial evaluates the performance of the suggested hyperparameter vector and records the result through the metrics collector.


You can see Katib creating multiple Trials.

kubectl -n kubeflow get trials
Sample Output

NAME                    TYPE      STATUS   AGE
tfjob-random-5xq64qwz   Created   True     25s
tfjob-random-h9l2h54d   Created   True     25s
tfjob-random-pf5htw5f   Created   True     25s

Each trial starts a TFJob resource.

kubectl -n kubeflow get tfjobs
Sample Output

NAME                    TYPE      STATUS   AGE
tfjob-random-5xq64qwz   Created   True     25s
tfjob-random-h9l2h54d   Created   True     25s
tfjob-random-pf5htw5f   Created   True     25s

Each TFJob creates a Worker pod to run the trial.

kubectl -n kubeflow get po -l controller-name=tf-operator
Sample Output
NAME                             READY   STATUS    RESTARTS   AGE
tfjob-random-484zhpzq-worker-0   2/2     Running   0          39s
tfjob-random-wqglhpqj-worker-0   2/2     Running   0          40s
tfjob-random-zb89h929-worker-0   2/2     Running   0          41s

Metric Collection

When we discussed the Kubernetes architecture, we briefly mentioned how a user creates resources through the Kubernetes API server and how the API server stores this data in etcd.


In reality, a request passes through several stages between being received by the Kubernetes API server and being accepted. In particular, there are two common extension points where external controllers can do additional work: mutating admission webhooks and validating admission webhooks. The Katib controller registers itself as both a mutating and a validating webhook.


You can see the webhooks as follows.

kubectl get MutatingWebhookConfiguration
Sample Output
NAME                            CREATED AT
katib-mutating-webhook-config   2019-10-26T21:00:30Z

kubectl get ValidatingWebhookConfiguration
Sample Output
NAME                              CREATED AT
katib-validating-webhook-config   2019-10-26T21:00:30Z

The mutating webhook reads the Katib configuration and injects a sidecar container into the Trial jobs/pods. You can see the configuration as follows.

kubectl -n kubeflow get cm katib-config -o yaml
Sample Output
apiVersion: v1
data:
  metrics-collector-sidecar: |-
    {
      "StdOut": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/file-metrics-collector:v0.7.0"
      },
      "File": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/file-metrics-collector:v0.7.0"
      },
      "TensorFlowEvent": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0"
      }
    }
  suggestion: |-
    {
      "random": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt:v0.7.0"
      },
      "grid": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-chocolate:v0.7.0"
      },
      "hyperband": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperband:v0.7.0"
      },
      "bayesianoptimization": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-skopt:v0.7.0"
      },
      "tpe": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-hyperopt:v0.7.0"
      },
      "nasrl": {
        "image": "gcr.io/kubeflow-images-public/katib/v1alpha3/suggestion-nasrl:v0.7.0"
      }
    }
kind: ConfigMap
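The ConfigMap maps each collector kind to a sidecar image. As an illustration of what a mutating step of this kind does, the sketch below takes a pod spec as a plain dictionary and returns a copy with a metrics-collector container and a shared volume added. It is not the webhook's real implementation: `inject_metrics_collector` and `COLLECTOR_IMAGES` are hypothetical names, and the argument list and volume name are modelled on the pod description shown further down.

```python
import copy

# Collector image keyed by collector kind, taken from the ConfigMap above.
COLLECTOR_IMAGES = {
    "TensorFlowEvent": (
        "gcr.io/kubeflow-images-public/katib/v1alpha3/"
        "tfevent-metrics-collector:v0.7.0"
    ),
}


def inject_metrics_collector(pod_spec, trial_name, metric_name, collector_kind, path):
    """Return a copy of pod_spec with a metrics-collector sidecar added."""
    mutated = copy.deepcopy(pod_spec)  # never modify the incoming object
    mount = {"name": "metrics-volume", "mountPath": path}

    # Share the metrics directory between the training container and the sidecar.
    mutated.setdefault("volumes", []).append({"name": "metrics-volume", "emptyDir": {}})
    for container in mutated["containers"]:
        container.setdefault("volumeMounts", []).append(dict(mount))

    mutated["containers"].append({
        "name": "metrics-collector",
        "image": COLLECTOR_IMAGES[collector_kind],
        "args": ["-t", trial_name, "-m", metric_name,
                 "-s", "katib-manager.kubeflow:6789", "-path", path],
        "volumeMounts": [dict(mount)],
    })
    return mutated


pod = {"containers": [{"name": "tensorflow",
                       "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0"}]}
mutated = inject_metrics_collector(
    pod, "tfjob-random-g4p7jx5b", "accuracy_1", "TensorFlowEvent", "/train")
print([c["name"] for c in mutated["containers"]])
# ['tensorflow', 'metrics-collector']
```

The training container writes its event files to the shared directory, and the sidecar reads them from the same mount, so the training image itself needs no Katib-specific code.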

We can see the metrics-collector container injected into the TFJob worker pod.

kubectl -n kubeflow describe po -l controller-name=tf-operator
Sample Output
Name:               tfjob-random-g4p7jx5b-worker-0
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               katib/10.0.2.15
Start Time:         Tue, 29 Oct 2019 18:37:20 +0000
Labels:             controller-name=tf-operator
                    group-name=kubeflow.org
                    job-name=tfjob-random-g4p7jx5b
                    job-role=master
                    tf-job-name=tfjob-random-g4p7jx5b
                    tf-replica-index=0
                    tf-replica-type=worker
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      TFJob/tfjob-random-g4p7jx5b
Containers:
  tensorflow:
    Container ID:
    Image:          gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
    Image ID:
    Port:           2222/TCP
    Host Port:      0/TCP
    Command:
      python
      /var/tf_mnist/mnist_with_summaries.py
      --log_dir=/train/metrics
      --learning_rate=0.044867652686667765
      --batch_size=179
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mskhc (ro)
  metrics-collector:
    Container ID:
    Image:          gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0
    Image ID:
    Port:           <none>
    Host Port:      <none>
    Args:
      -t
      tfjob-random-g4p7jx5b
      -m
      accuracy_1
      -s
      katib-manager.kubeflow:6789
      -path
      /train
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-mskhc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mskhc
    Optional:    false
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4s    default-scheduler  Successfully assigned kubeflow/tfjob-random-g4p7jx5b-worker-0 to katib
  Normal  Pulled     2s    kubelet, katib     Container image "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0" already present on machine
  Normal  Created    2s    kubelet, katib     Created container tensorflow
  Normal  Started    1s    kubelet, katib     Started container tensorflow
  Normal  Pulled     1s    kubelet, katib     Container image "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0" already present on machine
  Normal  Created    1s    kubelet, katib     Created container metrics-collector
  Normal  Started    1s    kubelet, katib     Started container metrics-collector

Name:               tfjob-random-jcdvtfdf-worker-0
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               katib/10.0.2.15
Start Time:         Tue, 29 Oct 2019 18:35:44 +0000
Labels:             controller-name=tf-operator
                    group-name=kubeflow.org
                    job-name=tfjob-random-jcdvtfdf
                    job-role=master
                    tf-job-name=tfjob-random-jcdvtfdf
                    tf-replica-index=0
                    tf-replica-type=worker
Annotations:        <none>
Status:             Running
IP:                 192.168.0.231
Controlled By:      TFJob/tfjob-random-jcdvtfdf
Containers:
  tensorflow:
    Container ID:   docker://2792566751a57a6ab804621a9ec8b56e29ced44ddaceb6e395cd4fb8d7b0f7d6
    Image:          gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
    Image ID:       docker-pullable://gcr.io/kubeflow-ci/tf-mnist-with-summaries@sha256:5c3181c3a97bc6f88fab204d4ac19ea12413b192953e21dc0ed07e7b821ddbe2
    Port:           2222/TCP
    Host Port:      0/TCP
    Command:
      python
      /var/tf_mnist/mnist_with_summaries.py
      --log_dir=/train/metrics
      --learning_rate=0.025430765523205827
      --batch_size=140
    State:          Running
      Started:      Tue, 29 Oct 2019 18:35:46 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mskhc (ro)
  metrics-collector:
    Container ID:   docker://4109a92753bdfddb2aa9dd4f866fcddb068016e2a65b24a321ac0ac832fac48f
    Image:          gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0
    Image ID:       docker-pullable://gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector@sha256:d7c8fa8147f99ebb563c4d59fc6c333f96684f1598cce2f7eae629a878671656
    Port:           <none>
    Host Port:      <none>
    Args:
      -t
      tfjob-random-jcdvtfdf
      -m
      accuracy_1
      -s
      katib-manager.kubeflow:6789
      -path
      /train
    State:          Running
      Started:      Tue, 29 Oct 2019 18:35:46 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-mskhc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mskhc
    Optional:    false
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  99s   default-scheduler  Successfully assigned kubeflow/tfjob-random-jcdvtfdf-worker-0 to katib
  Normal  Pulled     97s   kubelet, katib     Container image "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0" already present on machine
  Normal  Created    97s   kubelet, katib     Created container tensorflow
  Normal  Started    97s   kubelet, katib     Started container tensorflow
  Normal  Pulled     97s   kubelet, katib     Container image "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0" already present on machine
  Normal  Created    97s   kubelet, katib     Created container metrics-collector
  Normal  Started    97s   kubelet, katib     Started container metrics-collector

Name:               tfjob-random-r66tzmjr-worker-0
Namespace:          kubeflow
Priority:           0
PriorityClassName:  <none>
Node:               katib/10.0.2.15
Start Time:         Tue, 29 Oct 2019 18:35:43 +0000
Labels:             controller-name=tf-operator
                    group-name=kubeflow.org
                    job-name=tfjob-random-r66tzmjr
                    job-role=master
                    tf-job-name=tfjob-random-r66tzmjr
                    tf-replica-index=0
                    tf-replica-type=worker
Annotations:        <none>
Status:             Running
IP:                 192.168.0.230
Controlled By:      TFJob/tfjob-random-r66tzmjr
Containers:
  tensorflow:
    Container ID:   docker://b3836fc16b83e82b3c3cad90472eeb079762f320270c834038d4a58f845f45b1
    Image:          gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
    Image ID:       docker-pullable://gcr.io/kubeflow-ci/tf-mnist-with-summaries@sha256:5c3181c3a97bc6f88fab204d4ac19ea12413b192953e21dc0ed07e7b821ddbe2
    Port:           2222/TCP
    Host Port:      0/TCP
    Command:
      python
      /var/tf_mnist/mnist_with_summaries.py
      --log_dir=/train/metrics
      --learning_rate=0.04503686583590331
      --batch_size=120
    State:          Running
      Started:      Tue, 29 Oct 2019 18:35:45 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mskhc (ro)
  metrics-collector:
    Container ID:   docker://a94b366df6c6e3e318be7b9a72e92c26e988167338aacae68498ef68662eb619
    Image:          gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0
    Image ID:       docker-pullable://gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector@sha256:d7c8fa8147f99ebb563c4d59fc6c333f96684f1598cce2f7eae629a878671656
    Port:           <none>
    Host Port:      <none>
    Args:
      -t
      tfjob-random-r66tzmjr
      -m
      accuracy_1
      -s
      katib-manager.kubeflow:6789
      -path
      /train
    State:          Running
      Started:      Tue, 29 Oct 2019 18:35:45 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /train from metrics-volume (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-mskhc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mskhc
    Optional:    false
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:   <unset>
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  100s  default-scheduler  Successfully assigned kubeflow/tfjob-random-r66tzmjr-worker-0 to katib
  Normal  Pulled     99s   kubelet, katib     Container image "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0" already present on machine
  Normal  Created    99s   kubelet, katib     Created container tensorflow
  Normal  Started    98s   kubelet, katib     Started container tensorflow
  Normal  Pulled     98s   kubelet, katib     Container image "gcr.io/kubeflow-images-public/katib/v1alpha3/tfevent-metrics-collector:v0.7.0" already present on machine
  Normal  Created    98s   kubelet, katib     Created container metrics-collector
  Normal  Started    98s   kubelet, katib     Started container metrics-collector

Experiment Completion

Once the Trials created by Katib are complete, the Experiment enters a completed state. Check the completion status of the Experiment tfjob-random.

kubectl -n kubeflow get experiment tfjob-random -o json
Sample Output

You can observe the status of this experiment under the status field of the output.
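If you prefer to read the status programmatically, the sketch below picks the most recent condition whose status is "True" from an Experiment object. It assumes the object has the same `status.conditions` layout as the JSON shown earlier in this section. `current_condition` is a hypothetical helper, and the sample dictionary is a trimmed copy of that earlier output.

```python
def current_condition(experiment):
    """Return the type of the most recent condition whose status is "True"."""
    conditions = experiment.get("status", {}).get("conditions") or []
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        return None
    # UTC timestamps in this fixed format sort correctly as plain strings.
    return max(active, key=lambda c: c["lastTransitionTime"])["type"]


# Trimmed copy of the Experiment status shown earlier in this section.
sample = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2019-10-27T02:46:02Z",
                "lastUpdateTime": "2019-10-27T02:46:02Z",
                "message": "Experiment is created",
                "reason": "ExperimentCreated",
                "status": "True",
                "type": "Created",
            }
        ]
    }
}

print(current_condition(sample))  # Created
```

To run it against a live cluster, the output of the `kubectl ... -o json` command above could be saved to a file and loaded with `json.load` in place of the hard-coded sample.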

You can also see that Katib has cleaned up the Trial worker pods.

kubectl -n kubeflow get pods
Sample Output
NAME                                    READY   STATUS    RESTARTS   AGE
katib-controller-7665868558-nfghw       1/1     Running   1          21m
katib-db-594756f779-dxttq               1/1     Running   0          21m
katib-manager-769b7bcbfb-7vvgx          1/1     Running   0          21m
katib-ui-854969c97-tl4wg                1/1     Running   0          21m
pytorch-operator-794899d49b-ww59g       1/1     Running   0          21m
tf-job-operator-7b589f5f5f-fpr2p        1/1     Running   0          21m
tfjob-example-random-6d68b59ccd-fcn8f   1/1     Running   0          15m

In summary, all you need to do is create the Experiment specification. Katib does the rest for you.