There is 1 article about TensorFlow for 2020-03-26.

Kubeflow 1.0 on AWS #3: Running a TFJob

Introduction

This article covers building Kubeflow 1.0 on AWS.
The main goal is to verify that things work; production use is not considered at all.

Previous article

Kubeflow 1.0 on AWS #2: Creating a Notebook

In this article

We run the example TFJob and confirm that the basic functionality works.

References

Preparing EFS as shared storage

EFS is used as the data store. Using S3 should also be possible, but I will try that later.
I followed the steps in this article:
https://qiita.com/asahi0301/items/1116c1f030db3136ff49

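I created efs-sc (StorageClass), efs-pv (PV), and efs-claim (PVC). Only the PVC lives in the anonymous namespace; PersistentVolume and StorageClass are cluster-scoped, so -n has no effect on them. In the commands below, k is assumed to be a shell alias for kubectl.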

k apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

cat <<EOF > efs.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxx  # replace with your own EFS file system ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
EOF

k apply -n anonymous -f efs.yaml
persistentvolume/efs-pv created
persistentvolumeclaim/efs-claim created
storageclass.storage.k8s.io/efs-sc created
# Check
k get pvc -n anonymous | grep efs
efs-claim         Bound    efs-pv                                     5Gi        RWX            efs-sc         117s
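
The claim shows Bound because its storage class, access modes, and requested size are all satisfied by efs-pv. The following is a simplified sketch of that matching, assuming only these three fields matter (the real binder also looks at things such as volumeMode and selectors). The helpers `to_bytes` and `can_bind` are hypothetical and only understand Mi and Gi quantities.

```python
UNITS = {"Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Convert a quantity such as '5Gi' to bytes (Mi and Gi only)."""
    number, unit = quantity[:-2], quantity[-2:]
    return int(number) * UNITS[unit]


def can_bind(pv, pvc):
    """Simplified check that a PVC could bind to a PV."""
    return (
        pv["storageClassName"] == pvc["storageClassName"]
        and set(pvc["accessModes"]) <= set(pv["accessModes"])
        and to_bytes(pv["capacity"]) >= to_bytes(pvc["request"])
    )


# Values copied from efs.yaml, flattened for brevity.
efs_pv = {
    "storageClassName": "efs-sc",
    "accessModes": ["ReadWriteMany"],
    "capacity": "5Gi",
}
efs_claim = {
    "storageClassName": "efs-sc",
    "accessModes": ["ReadWriteMany"],
    "request": "5Gi",
}

print(can_bind(efs_pv, efs_claim))
```

If the storage class names differed, or the claim asked for more than the PV offers, the check would return False, which corresponds to a claim stuck in Pending.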

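TFJob (single worker)

Running the job

We run TensorFlow training on MNIST.
The key point is to disable sidecar injection with `sidecar.istio.io/inject: "false"`.
Without it, the Envoy sidecar keeps running after training finishes and the TensorFlow container exits, so the TFJob stays in the Running state forever.
Save the following manifest as tf_job_mnist.yaml.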

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "mnist"
  namespace: anonymous
spec:
  cleanPodPolicy: None 
  tfReplicaSpecs:
    Worker:
      replicas: 1 
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
                - "--log_dir=/train/logs"
                - "--learning_rate=0.01"
                - "--batch_size=150"
              volumeMounts:
                - mountPath: "/train"
                  name: "training"
          volumes:
            - name: "training"
              persistentVolumeClaim:
                claimName: "efs-claim"

k apply -f tf_job_mnist.yaml
tfjob.kubeflow.org/mnist created
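
Because a missing annotation only shows up as a job that never finishes, it can help to check the manifest before applying it. The following is a minimal sketch, not part of the original setup: it assumes the manifest has already been loaded into a Python dict (for example with a YAML parser), and the helper name `replicas_with_sidecar` is hypothetical.

```python
def replicas_with_sidecar(tfjob):
    """Return replica types whose pod template does not disable Istio injection."""
    missing = []
    for replica_type, spec in tfjob["spec"]["tfReplicaSpecs"].items():
        annotations = (
            spec.get("template", {}).get("metadata", {}).get("annotations", {})
        )
        # The annotation value must be the string "false", not a boolean.
        if annotations.get("sidecar.istio.io/inject") != "false":
            missing.append(replica_type)
    return missing


# Trimmed-down stand-in for a TFJob manifest (only the fields the check reads).
job = {
    "spec": {
        "tfReplicaSpecs": {
            "PS": {"template": {"metadata": {}}},
            "Worker": {
                "template": {
                    "metadata": {
                        "annotations": {"sidecar.istio.io/inject": "false"}
                    }
                }
            },
        }
    }
}

print(replicas_with_sidecar(job))
```

In this stand-in only the Worker carries the annotation, so the PS replica is reported. For a job with several replica types, every template needs the annotation, since a single pod with a live sidecar is enough to keep the job from completing.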

Check

k -n anonymous get tfjobs
NAME    STATE     AGE
mnist   Running   16m

k -n anonymous get pod
NAME             READY   STATUS    RESTARTS   AGE
mnist-worker-0   2/2     Running   0          15s

k -n anonymous logs -f mnist-worker-0 tensorflow
WARNING:tensorflow:From /var/tf_mnist/mnist_with_summaries.py:39: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:252: wrapped_fn (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please use urllib or similar directly.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2020-03-26 06:35:37.447521: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1164
Accuracy at step 10: 0.777
Accuracy at step 20: 0.8484
Accuracy at step 30: 0.8958
Accuracy at step 40: 0.9104
Accuracy at step 50: 0.9235
Accuracy at step 60: 0.9296
Accuracy at step 70: 0.9308
Accuracy at step 80: 0.9347
Accuracy at step 90: 0.9348
Adding run metadata for 99
Accuracy at step 100: 0.9388
Accuracy at step 110: 0.9457
Accuracy at step 120: 0.9472
Accuracy at step 130: 0.9491
Accuracy at step 140: 0.9486
Accuracy at step 150: 0.9493
Accuracy at step 160: 0.9532
Accuracy at step 170: 0.9497
Accuracy at step 180: 0.9489
Accuracy at step 190: 0.9545
Adding run metadata for 199
(continues)

Confirming completion

k  get tfjobs -n anonymous
NAME    STATE       AGE
mnist   Succeeded   6m57s
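
The STATE column above comes from the job's status. The sketch below assumes the TFJob v1 resource exposes a `status.conditions` list whose last entry describes the current phase. That shape is an assumption stated here rather than something shown in this article, and the sample dict is made up for illustration, not captured output. Under that assumption, a script could decide whether the job has finished like this:

```python
TERMINAL_STATES = {"Succeeded", "Failed"}


def tfjob_state(status):
    """Return the type of the most recent condition, or 'Unknown' if none."""
    conditions = status.get("conditions") or []
    return conditions[-1]["type"] if conditions else "Unknown"


def is_finished(status):
    """True once the job has reached a terminal state."""
    return tfjob_state(status) in TERMINAL_STATES


# Illustrative status, shaped like the .status field of the TFJob object.
sample_status = {
    "conditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
}

print(tfjob_state(sample_status), is_finished(sample_status))
```

A polling loop would fetch the status periodically and stop once `is_finished` returns True.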

Checking EFS

Preparing a pod to inspect the EFS contents

Any pod that mounts the same PVC will do; for example, create a Deployment like this:

eks-toolkit-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-toolkit-deployment
  namespace: anonymous
  labels:
    app: eks-toolkit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eks-toolkit
  template:
    metadata:
      labels:
        app: eks-toolkit
    spec:
      containers:
      - name: eks-toolkit
        image: asahi0301/eks-toolkit
        command: ["tail"]
        args: ["-f", "/dev/null"] 
        volumeMounts:
        - mountPath: "/data"
          name: "data"
      volumes:
      - name: "data"
        persistentVolumeClaim:
          claimName: "efs-claim" 

Check

Open a shell in the pod and confirm that EFS is mounted.

k apply -f eks-toolkit-deployment.yaml

k -n anonymous exec -it eks-toolkit-deployment-7f699fd967-8jfvm -- bash
Defaulting container name to eks-toolkit.
Use 'kubectl describe pod/eks-toolkit-deployment-7f699fd967-8jfvm -n anonymous' to see all of the containers in this pod.
bash-4.2# 
bash-4.2# 
bash-4.2# ls
bash-4.2# df
Filesystem                                       1K-blocks    Used        Available Use% Mounted on
overlay                                           20959212 4965980         15993232  24% /
tmpfs                                                65536       0            65536   0% /dev
tmpfs                                              3932516       0          3932516   0% /sys/fs/cgroup
fs-xxxx.efs.us-west-2.amazonaws.com:/ 9007199254739968   23552 9007199254716416   1% /data
/dev/nvme0n1p1                                    20959212 4965980         15993232  24% /etc/hosts
shm                                                  65536    4772            60764   8% /dev/shm
tmpfs                                              3932516      12          3932504   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                              3932516       0          3932516   0% /proc/acpi
tmpfs                                              3932516       0          3932516   0% /sys/firmware
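
The size reported for /data is far larger than the 5Gi written in the PV. EFS is elastic, and to my knowledge the capacity field is required by Kubernetes but not enforced by the EFS CSI driver. Converting the 1K-blocks figure from the df output above shows what is actually reported:

```python
# Value of the "1K-blocks" column for the EFS mount in the df output above.
blocks_1k = 9007199254739968

size_bytes = blocks_1k * 1024   # df counts in 1 KiB blocks
size_eib = size_bytes / 2**60   # 1 EiB = 2**60 bytes
requested_gib = 5               # "storage: 5Gi" in efs.yaml

print(f"reported size: {size_eib:.2f} EiB")
print(f"ratio to the 5Gi request: {size_bytes / (requested_gib * 2**30):.2e}")
```

The mount reports roughly 8 EiB, so the 5Gi value serves only to satisfy the PV/PVC schema and matching, not as a quota.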

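Looking at the contents of EFS, we can see that the logs have been saved (the job wrote to /train/logs, which appears as /data/logs in this pod).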

bash-4.2# pwd
/data/logs
bash-4.2# ls
test  train
bash-4.2# 

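TFJob (distributed training)

Preparing the YAML

Next, we try distributed training with the sample code, using one parameter server (PS) and two workers.

tf_job_dist_mnist.yaml: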
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dist-mnist-pct"
  namespace: anonymous
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: emsixteeen/tf-dist-mnist-test:1.0

Run

k apply -f tf_job_dist_mnist.yaml 

Check

k -n anonymous get tfjobs
NAME             STATE       AGE
dist-mnist-pct   Succeeded   3m41s
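
With one PS and two workers, the job should show up as three pods. Going by the mnist-worker-0 pod seen in the single-worker run, pod names appear to follow `<job name>-<replica type in lowercase>-<index>`. The sketch below assumes that pattern (the helper `expected_pod_names` is hypothetical) and lists the names to look for in the output of k -n anonymous get pod.

```python
def expected_pod_names(job_name, replica_counts):
    """List pod names, assuming the <job>-<type>-<index> naming pattern."""
    names = []
    for replica_type, count in replica_counts.items():
        for index in range(count):
            names.append(f"{job_name}-{replica_type.lower()}-{index}")
    return names


# Replica counts taken from tf_job_dist_mnist.yaml.
for name in expected_pod_names("dist-mnist-pct", {"PS": 1, "Worker": 2}):
    print(name)
```

The same names can be passed to k -n anonymous logs to follow each replica separately.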

Summary

We ran training with TFJob.
Both a single-worker job and a distributed job were tried.
EFS served as the shared storage this time; I would like to try S3 soon.
