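- Posted: 2020-03-26T19:37:39+09:00
Kubeflow 1.0 on AWS #3: Running a TFJob
Introduction
This article is part of a series on building Kubeflow 1.0 on AWS.
The main goal is to verify that things work, so production use is not considered at all.
Previously
Kubeflow 1.0 on AWS #2: Creating a Notebook
In this article
I run the example TFJob and confirm that the bare minimum works.
References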
- https://www.kubeflow.org/docs/components/training/tftraining/
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tfevent-volume/tfevent-pv.yaml
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tfevent-volume/tfevent-pvc.yaml
- https://raw.githubusercontent.com/kubeflow/tf-operator/master/examples/v1/mnist_with_summaries/tf_job_mnist.yaml
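Preparing shared EFS storage
EFS is used as the data store. Using S3 should also be possible, and I plan to try that later.
I followed the steps in this article:
https://qiita.com/asahi0301/items/1116c1f030db3136ff49
I created efs-sc (StorageClass), efs-pv (PV), and efs-claim (PVC), applying the manifest with `-n anonymous`. Only the PVC is actually namespaced; the PV and the StorageClass are cluster-scoped.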
    # k is an alias for kubectl
    k apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

    cat <<EOF > efs.yaml
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: efs-pv
    spec:
      capacity:
        storage: 5Gi
      volumeMode: Filesystem
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: efs-sc
      csi:
        driver: efs.csi.aws.com
        volumeHandle: fs-xxxxx  ## change this to the value for your environment
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: efs-claim
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: efs-sc
      resources:
        requests:
          storage: 5Gi
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: efs-sc
    provisioner: efs.csi.aws.com
    EOF

    k apply -n anonymous -f efs.yaml
    persistentvolume/efs-pv created
    persistentvolumeclaim/efs-claim created
    storageclass.storage.k8s.io/efs-sc created

    # Verify
    k get pvc -n anonymous | grep efs
    efs-claim   Bound   efs-pv   5Gi   RWX   efs-sc   117s

TFJob (single worker)
Running the job
I run the TensorFlow MNIST training example.
The key point is to disable sidecar injection with `sidecar.istio.io/inject: "false"`.
Without it, even after training finishes and the TensorFlow container stops, envoy keeps running, so the tfjob stays in Running forever.
Save the following as tf_job_mnist.yaml:

    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "mnist"
      namespace: anonymous
    spec:
      cleanPodPolicy: None
      tfReplicaSpecs:
        Worker:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: tensorflow
                  image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  command:
                    - "python"
                    - "/var/tf_mnist/mnist_with_summaries.py"
                    - "--log_dir=/train/logs"
                    - "--learning_rate=0.01"
                    - "--batch_size=150"
                  volumeMounts:
                    - mountPath: "/train"
                      name: "training"
              volumes:
                - name: "training"
                  persistentVolumeClaim:
                    claimName: "efs-claim"

Then apply it:

    k apply -f tf_job_mnist.yaml
    tfjob.kubeflow.org/mnist created
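As a quick sanity check on the sidecar point above, a sketch like the following could tell whether a pod actually got an Istio sidecar. It assumes the sidecar container uses Istio's default name `istio-proxy` (the article only mentions envoy, so this name is my assumption), and `has_istio_sidecar` is a hypothetical helper name. It calls `kubectl` directly because shell aliases such as `k` are not expanded in scripts.

```shell
# has_istio_sidecar NAMESPACE POD
# Exit status 0 if the pod has a container named "istio-proxy"
# (assumed default Istio sidecar name), 1 if not, 2 if kubectl failed.
has_istio_sidecar() (
  ns="$1"; pod="$2"
  # Print all container names of the pod, separated by spaces.
  names="$(kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{.spec.containers[*].name}')" || return 2
  # Pad with spaces so only whole names match.
  case " $names " in
    *" istio-proxy "*) return 0 ;;
    *) return 1 ;;
  esac
)
```

For example, `has_istio_sidecar anonymous mnist-worker-0 && echo "sidecar injected"` would print a message only when the annotation did not take effect.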
Verification

    k -n anonymous get tfjobs
    NAME    STATE     AGE
    mnist   Running   16m

    k -n anonymous get pod
    NAME             READY   STATUS    RESTARTS   AGE
    mnist-worker-0   2/2     Running   0          15s

    k -n anonymous logs -f mnist-worker-0 tensorflow
    WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
    WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please write your own downloading logic.
    WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use urllib or similar directly.
    WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use tf.data to implement this functionality.
    WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use tf.data to implement this functionality.
    WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
    2020-03-26 06:35:37.447521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
    Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
    Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
    Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
    Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
    Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
    Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
    Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
    Accuracy at step 0: 0.1164
    Accuracy at step 10: 0.777
    Accuracy at step 20: 0.8484
    Accuracy at step 30: 0.8958
    Accuracy at step 40: 0.9104
    Accuracy at step 50: 0.9235
    Accuracy at step 60: 0.9296
    Accuracy at step 70: 0.9308
    Accuracy at step 80: 0.9347
    Accuracy at step 90: 0.9348
    Adding run metadata for 99
    Accuracy at step 100: 0.9388
    Accuracy at step 110: 0.9457
    Accuracy at step 120: 0.9472
    Accuracy at step 130: 0.9491
    Accuracy at step 140: 0.9486
    Accuracy at step 150: 0.9493
    Accuracy at step 160: 0.9532
    Accuracy at step 170: 0.9497
    Accuracy at step 180: 0.9489
    Accuracy at step 190: 0.9545
    Adding run metadata for 199
    (continues)

Confirming completion
    k get tfjobs -n anonymous
    NAME    STATE       AGE
    mnist   Succeeded   6m57s
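Instead of re-running the command by hand until the state changes, the check could be wrapped in a small polling loop. This is only a sketch: `tfjob_state` and `wait_for_tfjob` are hypothetical helper names, the default timeout and interval are arbitrary, and it assumes the STATE value is the second column of `kubectl get tfjobs`, as in the output shown in this article.

```shell
# tfjob_state NAMESPACE NAME
# Print the STATE column (assumed to be column 2) for one TFJob.
tfjob_state() {
  kubectl -n "$1" get tfjobs "$2" --no-headers | awk '{print $2}'
}

# wait_for_tfjob NAMESPACE NAME [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Exit status 0 on Succeeded, 1 on Failed, 2 on timeout.
wait_for_tfjob() (
  ns="$1"; name="$2"; timeout="${3:-600}"; interval="${4:-10}"
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    state="$(tfjob_state "$ns" "$name")"
    case "$state" in
      Succeeded) echo "$name: Succeeded"; return 0 ;;
      Failed)    echo "$name: Failed" >&2; return 1 ;;
    esac
    # Any other state (for example Running) means keep waiting.
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "$name: timed out after ${timeout}s (last state: ${state:-unknown})" >&2
  return 2
)
```

Usage would look like `wait_for_tfjob anonymous mnist 900 15`. Note that if sidecar injection is left enabled, a loop like this would simply run into the timeout, since the job never leaves Running.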
Checking EFS
Preparing a pod to inspect the EFS contents
Anything that mounts the same PVC will do. For example, I created a Deployment like this:
test-eks-toolkit-deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: eks-toolkit-deployment
      namespace: anonymous
      labels:
        app: eks-toolkit
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: eks-toolkit
      template:
        metadata:
          labels:
            app: eks-toolkit
        spec:
          containers:
            - name: eks-toolkit
              image: asahi0301/eks-toolkit
              command: ["tail"]
              args: ["-f", "/dev/null"]
              volumeMounts:
                - mountPath: "/data"
                  name: "data"
          volumes:
            - name: "data"
              persistentVolumeClaim:
                claimName: "efs-claim"

Verification
Open a shell in the pod and confirm that EFS is mounted.
    k apply -f test-eks-toolkit-deployment.yaml
    k -n anonymous exec -it eks-toolkit-deployment-7f699fd967-8jfvm bash
    Defaulting container name to eks-toolkit.
    Use 'kubectl describe pod/eks-toolkit-deployment-7f699fd967-8jfvm -n anonymous' to see all of the containers in this pod.
    bash-4.2#
    bash-4.2#
    bash-4.2# ls
    bash-4.2# df
    Filesystem                                    1K-blocks     Used        Available Use% Mounted on
    overlay                                        20959212  4965980         15993232  24% /
    tmpfs                                             65536        0            65536   0% /dev
    tmpfs                                           3932516        0          3932516   0% /sys/fs/cgroup
    fs-xxxx.efs.us-west-2.amazonaws.com:/  9007199254739968    23552 9007199254716416   1% /data
    /dev/nvme0n1p1                                 20959212  4965980         15993232  24% /etc/hosts
    shm                                               65536     4772            60764   8% /dev/shm
    tmpfs                                           3932516       12          3932504   1% /run/secrets/kubernetes.io/serviceaccount
    tmpfs                                           3932516        0          3932516   0% /proc/acpi
    tmpfs                                           3932516        0          3932516   0% /sys/firmware

Looking at the contents of EFS, we can see that the training logs have been saved there.
    bash-4.2# pwd
    /data/logs
    bash-4.2# ls
    test  train
    bash-4.2#

TFJob (distributed training)
Preparing the YAML
Next I try distributed training with the sample code.
tf_job_dist_mnist.yaml

    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "dist-mnist-pct"
      namespace: anonymous
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: tensorflow
                  image: emsixteeen/tf-dist-mnist-test:1.0
        Worker:
          replicas: 2
          restartPolicy: Never
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"
            spec:
              containers:
                - name: tensorflow
                  image: emsixteeen/tf-dist-mnist-test:1.0

Run
    k apply -f tf_job_dist_mnist.yaml

Verification

    k -n anonymous get tfjobs
    NAME             STATE       AGE
    dist-mnist-pct   Succeeded   3m41s
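To look at the individual PS and worker pods rather than only the job state, a sketch like the one below could list the pods whose names start with the job name. It assumes the operator names pods `<job name>-<replica type>-<index>`, which matches the `mnist-worker-0` pod seen earlier but is otherwise my assumption, and that STATUS is the third column of `kubectl get pods`. `tfjob_pods` is a hypothetical helper name.

```shell
# tfjob_pods NAMESPACE JOB_NAME
# Print "<pod name> <status>" for every pod whose name starts with
# "<JOB_NAME>-". This is a plain prefix match, so a job named "dist"
# would also match pods of "dist-mnist-pct".
tfjob_pods() {
  kubectl -n "$1" get pods --no-headers |
    awk -v prefix="$2-" 'index($1, prefix) == 1 { print $1, $3 }'
}
```

Calling `tfjob_pods anonymous dist-mnist-pct` should then show one line per PS or worker pod that still exists. Depending on the job's cleanPodPolicy, finished pods may already have been removed.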
Summary
I ran training jobs using TFJob.
I tried both a single-worker job and distributed training.
EFS served as the shared storage this time, and I would like to try S3 in the near future.