- Posted: 2019-10-26T11:33:55+09:00
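Notes on testing whether the TensorFlow Tacotron-2 runs on tensorflow-rocm
https://github.com/Rayhane-mamah/Tacotron-2
The PyTorch Tacotron2 did not work well for me, so I was hoping that at least the TensorFlow version would run, but it turned out to be very difficult.
These are my notes from that test.
Environment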
CPU Xeon E5-2603 v4
GPU Radeon VII
GPU GTX 1080 Ti
RAM DDR4 48GB
OS Ubuntu 18.04.3 LTS
Kernel Linux rocm 5.0.0-31-generic
Python environment miniconda 4.7.10
ROCm version 2.9.6
CUDA V10.1.243
NVIDIA Driver Version: 430.26
I ran the tests below after setting up the environment above.
Testing in the native environment
    git clone https://github.com/Rayhane-mamah/Tacotron-2.git
    cd ./Tacotron-2
    wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    tar xf ./LJSpeech-1.1.tar.bz2 LJSpeech-1.1/
Start a Python 3.6 environment with miniconda, then run:
    pip install -r ./requirements.txt
    pip install tensorflow-rocm==1.13.4
    python preprocess.py
Only pyaudio could not be installed with pip, so I installed it with
    conda install pyaudio
Mixing pip and conda in one environment is not something I can recommend, so if everything installs from requirements.txt, that is the better route. python preprocess.py takes a fair amount of time, so expect to wait a while.
    python train.py --model='Tacotron-2'
This should start training. It did with CPU TensorFlow, but with ROCm it failed with
    ValueError: operands could not be broadcast together with shapes (1,1025) (0,)
and training could not run.
https://qiita.com/Tarooo000/items/dcce992672c7ea539049
[Memo] A numpy FutureWarning that appeared when importing tensorflow-rocm
Following that article, I downgraded numpy to 1.16.4, but it still did not work.
I could not tell whether ROCm itself or my configuration was at fault, so I decided to sort out the numpy version and related settings under NVIDIA-Docker first and then retest.
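To make the ValueError above easier to read: it is NumPy reporting that two array shapes cannot be combined under its broadcasting rule. The following is a minimal pure-Python sketch of that rule, not NumPy's actual implementation; the function name broadcast_shape is hypothetical. It only shows why the shapes (1,1025) and (0,) are incompatible. It does not explain why librosa ends up with an empty array in the first place.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError.

    Hypothetical helper that mirrors the NumPy broadcasting rule:
    dimensions are compared from the trailing end, and each pair must
    be equal or contain a 1.
    """
    result = []
    # Missing leading dimensions are treated as size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(
                f"operands could not be broadcast together with shapes {a} {b}"
            )
    return tuple(reversed(result))


if __name__ == "__main__":
    # A length-1 operand stretches to match 1025 columns.
    print(broadcast_shape((1, 1025), (1,)))
    # An empty operand (length 0) cannot be matched against 1025 columns.
    try:
        broadcast_shape((1, 1025), (0,))
    except ValueError as err:
        print(err)
```

In the traceback, the failing expression is `-ramps[i] / fdiff[i]` inside librosa, so the right-hand operand appears to be the one with shape (0,).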
Testing with NVIDIA-Docker
The repository already ships a Dockerfile for NVIDIA CUDA, so my plan was to verify that it works first and, if it does, swap the base image for a tensorflow-rocm image and repeat the experiment.
The Dockerfile in the repository looks like this:
    FROM continuumio/anaconda3:latest
    FROM tensorflow/tensorflow:latest-gpu-py3
    RUN apt-get update
    RUN apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools wget git vim
    RUN wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    RUN tar -jxvf LJSpeech-1.1.tar.bz2
    RUN git clone https://github.com/Rayhane-mamah/Tacotron-2.git
    WORKDIR Tacotron-2
    RUN ln -s ../LJSpeech-1.1 .
    RUN pip install -r requirements.txt
In the line
    RUN apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools wget git vim
remove libav-tools. That package has already been dropped in Ubuntu 18.04.
It rather defeats the purpose of using Docker, but the image installs the latest tensorflow-gpu by default, and that backfires: the code does not run on the TF 2.0 series. Once inside the container you need to run
    pip3 install tensorflow-gpu==1.15.0
to bring TensorFlow back to the 1.x series.
    sudo docker build -t tacotron2:cuda .
    sudo docker run -it --name tacotron2 tacotron2:cuda
    # python preprocess.py
    # python train.py --model='Tacotron-2'
    Step 1 [63.983 sec/step, loss=24.80627, avg_loss=24.80627]
    Saving Model Character Embeddings visualization..
    Tacotron Character embeddings have been updated on tensorboard!
    Step 4 [49.477 sec/step, loss=10.18475, avg_loss=15.85305]
    Step 5 [45.228 sec/step, loss=12.18279, avg_loss=15.11899]
Training itself runs without problems, so I used this as the basis for trying ROCm Docker.
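For comparing runs across backends, it can help to pull the numbers out of the step lines in the training log above. This is a rough sketch only: the regular expression is inferred from the three "Step" lines quoted in this post, and the names STEP_RE and parse_steps are hypothetical, not part of Tacotron-2.

```python
import re

# Pattern inferred from the log lines quoted in this post (an assumption);
# anything that does not look like a "Step N [...]" entry is ignored.
STEP_RE = re.compile(
    r"Step\s+(\d+)\s+\[([\d.]+) sec/step, loss=([\d.]+), avg_loss=([\d.]+)\]"
)


def parse_steps(log_text):
    """Return (step, sec_per_step, loss, avg_loss) tuples found in log_text."""
    return [
        (int(step), float(sec), float(loss), float(avg))
        for step, sec, loss, avg in STEP_RE.findall(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "Step 1 [63.983 sec/step, loss=24.80627, avg_loss=24.80627]\n"
        "Saving Model Character Embeddings visualization..\n"
        "Step 4 [49.477 sec/step, loss=10.18475, avg_loss=15.85305]\n"
        "Step 5 [45.228 sec/step, loss=12.18279, avg_loss=15.11899]\n"
    )
    steps = parse_steps(sample)
    mean_sec = sum(sec for _, sec, _, _ in steps) / len(steps)
    print(f"{len(steps)} steps, mean {mean_sec:.3f} sec/step")
```

Because the pattern is matched with findall, it works whether the log entries are on separate lines or run together on one line.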
Testing with ROCm-Docker
To retest in an environment that is as isolated as possible, I used Docker.
A rocm2.9-tf1.15-dev image is published, so I used that.
https://github.com/RadeonOpenCompute/ROCm-docker
https://hub.docker.com/r/rocm/tensorflow/tags
    FROM rocm/tensorflow:rocm2.9-tf1.15-dev
    RUN apt-get update
    RUN apt-get install -y libasound-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libav-tools wget git vim
    RUN wget http://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
    RUN tar -jxvf LJSpeech-1.1.tar.bz2
    RUN git clone https://github.com/Rayhane-mamah/Tacotron-2.git
    WORKDIR Tacotron-2
    RUN ln -s ../LJSpeech-1.1 .
    RUN pip install -r requirements.txt
Save the Dockerfile and build it.
    sudo docker build -t tacotron2:rocm .
The build itself succeeded, but it printed
    ERROR: tensorflow 1.15.0rc2 has requirement numpy<2.0,>=1.16.0, but you'll have numpy 1.14.0 which is incompatible.
As expected, tensorflow-rocm requires numpy 1.16.0 or later, and that looks like the sticking point.
There was no way to know without running it, so I started the container with docker run anyway.
    sudo docker run -it --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video tacotron2:rocm
First I tried it with numpy==1.14.0.
    python3 preprocess.py
    ImportError: No module named 'numpy.core._multiarray_umath'
    ImportError: numpy.core.multiarray failed to import
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 968, in _find_and_load
    SystemError: PyEval_EvalFrameEx returned a result with an error set
    ImportError: numpy.core._multiarray_umath failed to import
    ImportError: numpy.core.umath failed to import
    2019-10-24 18:03:18.669564: F tensorflow/python/lib/core/bfloat16.cc:675] Check failed: PyBfloat16_Type.tp_base != nullptr
    Aborted (core dumped)
With numpy 1.14.0, tensorflow-rocm 1.15.0 does not run at all, so I switched to numpy 1.16.0.
    Traceback (most recent call last):
      File "/usr/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
        r = call_item.fn(*call_item.args, **call_item.kwargs)
      File "/root/Tacotron-2/datasets/preprocessor.py", line 117, in _process_utterance
        mel_spectrogram = audio.melspectrogram(preem_wav, hparams).astype(np.float32)
      File "/root/Tacotron-2/datasets/audio.py", line 73, in melspectrogram
        S = _amp_to_db(_linear_to_mel(np.abs(D)**hparams.magnitude_power, hparams), hparams) - hparams.ref_level_db
      File "/root/Tacotron-2/datasets/audio.py", line 228, in _linear_to_mel
        _mel_basis = _build_mel_basis(hparams)
      File "/root/Tacotron-2/datasets/audio.py", line 246, in _build_mel_basis
        fmin=hparams.fmin, fmax=hparams.fmax)
      File "/usr/local/lib/python3.5/dist-packages/librosa/filters.py", line 247, in mel
        lower = -ramps[i] / fdiff[i]
    ValueError: operands could not be broadcast together with shapes (1,1025) (0,)
    """
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "preprocess.py", line 110, in <module>
        main()
      File "preprocess.py", line 106, in main
        run_preprocess(args, modified_hp)
      File "preprocess.py", line 83, in run_preprocess
        preprocess(args, input_folders, output_folder, hparams)
      File "preprocess.py", line 17, in preprocess
        metadata = preprocessor.build_from_path(hparams, input_folders, mel_dir, linear_dir, wav_dir, args.n_jobs, tqdm=tqdm)
      File "/root/Tacotron-2/datasets/preprocessor.py", line 42, in build_from_path
        return [future.result() for future in tqdm(futures) if future.result() is not None]
      File "/root/Tacotron-2/datasets/preprocessor.py", line 42, in <listcomp>
        return [future.result() for future in tqdm(futures) if future.result() is not None]
      File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
        return self.__get_result()
      File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
        raise self._exception
    ValueError: operands could not be broadcast together with shapes (1,1025) (0,)
numpy 1.16.0 did not work either. The following issue says as much:
https://github.com/Rayhane-mamah/Tacotron-2/issues/427
check numpy version and scipy version, make sure they are same as requirements
So I was caught in a pincer: with numpy 1.14.0, tensorflow-rocm does not support it, and with numpy 1.16.0, Tacotron-2 does not support it.
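The deadlock can be written down as two version predicates with no common solution. The sketch below is illustrative only. The tensorflow-rocm range is taken from the pip error message quoted above; treating Tacotron-2 as working only with exactly numpy 1.14.0 is an assumption based on what was tried in this post, and all function names are hypothetical.

```python
def parse_version(text):
    """Turn '1.16.0' into (1, 16, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies_tf_rocm(numpy_version):
    # From the pip message: tensorflow 1.15.0rc2 requires numpy>=1.16.0,<2.0
    v = parse_version(numpy_version)
    return parse_version("1.16.0") <= v < parse_version("2.0")


def satisfies_tacotron2(numpy_version):
    # Assumption drawn from this post: preprocessing only got past librosa
    # with the numpy 1.14.0 that requirements.txt installs.
    return parse_version(numpy_version) == parse_version("1.14.0")


def usable_versions(candidates):
    """Return the numpy versions that satisfy both constraints."""
    return [
        v for v in candidates
        if satisfies_tf_rocm(v) and satisfies_tacotron2(v)
    ]


if __name__ == "__main__":
    # The numpy versions that appear in this post.
    tried = ["1.14.0", "1.16.0", "1.16.4", "1.17.0"]
    print(usable_versions(tried))
```

Under these assumptions, none of the tried versions passes both checks, which matches the situation described above.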
Finally, trying TF-ROCm 1.12.0 in the native environment
Now that I knew the cause, I figured an older TensorFlow might work with the older numpy, so I reused the first environment and ran it there.
For reference, pip list shows the following:
    ------------------ ---------
    absl-py 0.8.1
    astor 0.8.0
    audioread 2.1.5
    certifi 2019.9.11
    cffi 1.12.3
    cget 0.1.8
    click 6.6
    cycler 0.10.0
    decorator 4.4.0
    falcon 1.2.0
    gast 0.3.2
    grpcio 1.24.3
    h5py 2.10.0
    inflect 0.2.5
    joblib 0.14.0
    Keras 2.3.1
    Keras-Applications 1.0.8
    Keras-Preprocessing 1.1.0
    kiwisolver 1.1.0
    librosa 0.5.1
    llvmlite 0.30.0
    lws 1.2.5
    Markdown 3.1.1
    matplotlib 2.0.2
    numba 0.46.0
    numpy 1.14.0
    pbr 5.4.3
    pip 19.3.1
    protobuf 3.10.0
    PyAudio 0.2.11
    pycparser 2.19
    pyparsing 2.4.2
    python-dateutil 2.8.0
    python-mimeparse 1.6.0
    pytz 2019.3
    PyYAML 5.1.2
    rbuild 0.0.1
    resampy 0.2.2
    scikit-learn 0.21.3
    scipy 1.0.0
    setuptools 41.2.0
    six 1.12.0
    sounddevice 0.3.10
    SoundFile 0.10.2
    tensorboard 1.12.2
    tensorflow-rocm 1.12.0
    termcolor 1.1.0
    tqdm 4.11.2
    Unidecode 0.4.20
    Werkzeug 0.16.0
    wheel 0.33.6
    $ python ./preprocess.py
    Traceback (most recent call last):
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
        from tensorflow.python.pywrap_tensorflow_internal import *
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
        _pywrap_tensorflow_internal = swig_import_helper()
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
        _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/imp.py", line 243, in load_module
        return load_dynamic(name, filename, file)
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/imp.py", line 343, in load_dynamic
        return _load(spec)
    ImportError: /home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN8hip_impl22hipLaunchKernelGGLImplEmRK4dim3S2_jP12ihipStream_tPPv
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "./preprocess.py", line 5, in <module>
        from datasets import preprocessor
      File "/home/rocm/Tacotron-2/datasets/preprocessor.py", line 6, in <module>
        from datasets import audio
      File "/home/rocm/Tacotron-2/datasets/audio.py", line 4, in <module>
        import tensorflow as tf
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
        from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
        from tensorflow.python import pywrap_tensorflow
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
        raise ImportError(msg)
    ImportError: Traceback (most recent call last):
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
        from tensorflow.python.pywrap_tensorflow_internal import *
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
        _pywrap_tensorflow_internal = swig_import_helper()
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
        _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/imp.py", line 243, in load_module
        return load_dynamic(name, filename, file)
      File "/home/rocm/miniconda3/envs/tf13/lib/python3.6/imp.py", line 343, in load_dynamic
        return _load(spec)
    ImportError: /home/rocm/miniconda3/envs/tf13/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN8hip_impl22hipLaunchKernelGGLImplEmRK4dim3S2_jP12ihipStream_tPPv
    Failed to load the native TensorFlow runtime.
    See https://www.tensorflow.org/install/errors for some common reasons and solutions. Include the entire stack trace above this error message when asking for help.
Judging from the undefined symbol, something TensorFlow needs is missing from the installed ROCm libraries, and TF-ROCm 1.12 does not run.
TF-ROCm 1.13 ended the same way, so my conclusion is that running the preprocess.py preprocessing on ROCm is not realistic. Instead, I tried letting CUDA handle only preprocess.py.
    pip install tensorflow-gpu==1.15.0
    python ./preprocess.py
    pip install tensorflow-rocm==1.14.0
    pip uninstall tensorflow-gpu
Note that installing tensorflow-rocm 1.14.x automatically pulls in numpy 1.17.0.
With 1.14.1 installed, this is what happens:
    python train.py --model='Tacotron-2'
    Traceback (most recent call last):
      File "train.py", line 7, in <module>
        from hparams import hparams
      File "/home/rocm/Tacotron-2/hparams.py", line 5, in <module>
        hparams = tf.contrib.training.HParams(
    AttributeError: module 'tensorflow' has no attribute 'contrib'
For some reason, TF-ROCm 1.14.1 produces the same error as TF-ROCm 2.0.
With 1.14.0, however, training fails as well:
    Traceback (most recent call last):
      File "train.py", line 138, in <module>
        main()
      File "train.py", line 132, in main
        train(args, log_dir, hparams)
      File "train.py", line 52, in train
        checkpoint = tacotron_train(args, log_dir, hparams)
      File "/home/rocm/Tacotron-2/tacotron/train.py", line 399, in tacotron_train
        return train(log_dir, args, hparams)
      File "/home/rocm/Tacotron-2/tacotron/train.py", line 176, in train
        GLGPU_mel_outputs = audio.inv_mel_spectrogram_tensorflow(GLGPU_mel_inputs, hparams)
      File "/home/rocm/Tacotron-2/datasets/audio.py", line 142, in inv_mel_spectrogram_tensorflow
        S = _mel_to_linear_tensorflow(S, hparams) # Convert back to linear
      File "/home/rocm/Tacotron-2/datasets/audio.py", line 240, in _mel_to_linear_tensorflow
        _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
      File "/home/rocm/Tacotron-2/datasets/audio.py", line 246, in _build_mel_basis
        fmin=hparams.fmin, fmax=hparams.fmax)
      File "/home/rocm/.local/lib/python3.6/site-packages/librosa/filters.py", line 247, in mel
        lower = -ramps[i] / fdiff[i]
    ValueError: operands could not be broadcast together with shapes (1,1025) (0,)
Given this result, I concluded that Tacotron-2 is a tough fit for ROCm.
Combinations such as numpy==1.14.0 with TF-ROCm==1.12.0 do not work either, which makes things difficult.
That said, with TF 1.13.4 + numpy 1.14.0, training does at least start, but
    Tacotron training set to a maximum of 100000 steps
    *** stack smashing detected ***: <unknown> terminated
    Aborted (core dumped)
it crashes with a core dump partway through, so some problem remains there as well.