DONE V=1 AI training workstation setup -> Marvin

Dirigibility · July 7, 2021, 9:49pm

The first thing I needed was a machine that can train on large datasets and run inference beyond the abilities of the CPU.

Main Board
So I built Marvin. It’s based on an old Intel Core i5-7500 CPU with 8 gigs of RAM, taken from one of my own crypto miners.
It only has 4 cores, but that doesn’t matter since the magic happens on the GPUs.

Processing
The crunching needs a good GPU. I had a couple of AMD RX580s with 8 gigs of RAM from the miner, but at that time (and maybe still) AMD software was problematic for AI.
Luckily, old Nvidia Tesla cards are dirt cheap because they can’t be used for gaming (no video output) or for crypto mining (too little RAM), so I got an Nvidia Tesla K20Xm with 6 gigs of RAM from eBay for 65 EUR.

These are the two RX580s and the Tesla K20Xm

Lately a lot of 12 gig Nvidia Tesla cards have shown up on eBay. I guess that’s not enough RAM for Ethereum, so the miners dumped them there.
I scored one Nvidia K40x with 2*12 gigs on board!
Massive processing power for 130 USD (this card used to cost 9000 USD in 2015).

Storage
The datasets are massive folders filled with JPG image files. Sometimes, to create a dataset, a video has to be cut into frames, and each frame becomes a JPG.
For example, the ImageNet dataset includes 3.5 million files.
The training process itself requires high disk I/O to read the images and feed the GPUs, and then to write the model snapshots to disk quickly so the crunching can continue with the next round (also called an epoch).
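A minimal sketch of the frame-cutting step, assuming ffmpeg is installed. The function names, the default of one frame per second and the file naming pattern are illustrative assumptions, not the exact commands used to build Marvin’s datasets.

```shell
# Hypothetical helpers, not the actual tooling used on Marvin.

# Cut a video into JPG frames. Assumes ffmpeg is installed.
# Usage: extract_frames input.mp4 frames_dir [frames_per_second]
extract_frames() {
    video=$1
    out_dir=$2
    fps=${3:-1}   # assumed default: one frame per second
    mkdir -p "$out_dir" || return 1
    ffmpeg -hide_banner -loglevel error -i "$video" \
        -vf "fps=$fps" -q:v 2 "$out_dir/frame_%06d.jpg"
}

# Count the JPG files under a dataset directory (recursively).
# Usage: count_jpgs dataset_dir
count_jpgs() {
    find "$1" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) | wc -l | tr -d ' '
}
```

Running count_jpgs on a dataset folder before and after copying it to another disk is a cheap way to check that nothing got lost on the way.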

So I used a large 3 TB RAID-0 array of rotational disks for size and I/O speed.
It has no failsafe or internal redundancy: if one of the disks dies, the whole array is gone. Therefore it rsyncs nightly to Tonto, my super reliable home ZFS storage server with 8 TB on board and a RAID-Z2 topology, which keeps the data safe so it can easily be restored if Marvin’s array fails.
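The nightly sync could look roughly like the sketch below. The paths, the host alias "tonto", the script location and the rsync flags are assumptions for illustration; the real job on Marvin may differ.

```shell
# Hypothetical nightly backup helper; host name and paths are placeholders.
# Usage: backup_array /mnt/raid0/ tonto:/tank/marvin/raid0/
# Set DRY_RUN=1 to print the rsync command instead of running it.
backup_array() {
    src=$1
    dest=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rsync -a --delete $src $dest"
    else
        # -a keeps permissions and timestamps; --delete mirrors removals.
        rsync -a --delete "$src" "$dest"
    fi
}

# Example crontab entry (crontab -e), assuming a script at the placeholder
# path below defines and calls backup_array; it would run daily at 02:30:
# 30 2 * * * /usr/local/bin/backup_array.sh
```

Note that --delete also removes files on the backup side once they are gone from the source, so leave it out if the backup should keep everything.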

For the fast disks, I used an LVM array of individually connected small SSDs (120 gigs each), giving me a 550 GB SSD workspace for the project I’m crunching at the moment. It has enough I/O capacity to feed the training of the models.
Again, no failover: if one of the disks goes out, the whole LVM volume is gone. But it’s a risk I’m willing to take for the storage size, especially since this array is also backed up nightly to Tonto.
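For reference, pooling several SSDs into one LVM volume usually comes down to a handful of commands. The sketch below only prints them for review and does not touch any disk. The volume group name, logical volume name, device names and the ext4 filesystem are all placeholders, not Marvin’s actual layout.

```shell
# Hypothetical sketch: print (do not run) the LVM commands that would
# pool several SSDs into one logical volume. All names are placeholders.
# Usage: plan_ssd_lvm vg_name lv_name /dev/sdX /dev/sdY ...
plan_ssd_lvm() {
    vg=$1
    lv=$2
    shift 2
    echo "pvcreate $*"                       # mark the disks as LVM physical volumes
    echo "vgcreate $vg $*"                   # group them into one volume group
    echo "lvcreate -l 100%FREE -n $lv $vg"   # one logical volume using all space
    echo "mkfs.ext4 /dev/$vg/$lv"            # assumed filesystem choice
}
```

For example, `plan_ssd_lvm vg_ssd workspace /dev/sdb /dev/sdc` prints the four commands for two disks. By default lvcreate allocates linearly across the disks; striping across them would need the -i option.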

Setup
I decided not to use a case, since that makes it easier to connect and disconnect stuff. After some testing and tinkering at work I got a nice working system on my table at home.
I had to 3D-print the disk brackets. I think it came out quite nice, stable and cool (literally, with the fans attached to the fanless Tesla cards).
It can and will be improved over time as the need arises, but for now Marvin is fully equipped for the tasks I have for him.

I will need to print some more SSD disk brackets.

I would like to thank my employer MessageToTheMoon.nl and my bosses Monique and Hans-Willem for allowing me to use the office facility to study after hours. This way I didn’t wake up the kids with my nightly tinkering.

Below I will elaborate on the actual thing that drives Marvin: the software.

Dirigibility · August 22, 2021, 10:45am

For the OS I chose Ubuntu. It’s well supported and I am comfy with the apt package management system. It’s an obsolete LTS release (18.04) and kernel, but this is the only equilibrium I have found so far that runs CUDA, TensorFlow, MXNet and TFOD on the hardware without issues:

root@marvin:/home/erani/BigData-2/imagenet# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
root@marvin:/home/erani/BigData-2/imagenet# uname -a
Linux marvin 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

These two articles helped me with setting up the basic system:

Once you have a basic vanilla system installed and updated, these commands should get you a working training environment:

sudo apt-get update && sudo apt -y install mc wget mlocate curl && sudo apt-get -y upgrade && sudo reboot

sudo apt-get install -y build-essential cmake unzip pkg-config &&  sudo apt-get install -y libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev &&  sudo apt-get install -y libjpeg-dev libpng-dev libtiff-dev &&  sudo apt-get install -y libavcodec-dev libavformat-dev libswscale-dev libv4l-dev &&  sudo apt-get install -y libxvidcore-dev libx264-dev &&  sudo apt-get install -y libgtk-3-dev &&  sudo apt-get install -y libopenblas-dev libatlas-base-dev liblapack-dev gfortran &&  sudo apt-get install -y libhdf5-serial-dev &&  sudo apt-get install -y python3-dev python3-tk python-imaging-tk && sudo apt-get install -y gcc-6 g++-6 && sudo reboot

sudo add-apt-repository ppa:graphics-drivers/ppa && sudo apt-get update && sudo apt -y install nvidia-driver-455 && sudo reboot now


nvidia-smi
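If you want a compact check instead of the full nvidia-smi table, a small wrapper like this sketch prints one line per GPU. The function name is made up; the query fields are standard nvidia-smi properties.

```shell
# Hypothetical wrapper around nvidia-smi; prints one CSV line per GPU
# (index, name, total memory, driver version).
check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: driver missing or not on PATH" >&2
        return 1
    fi
    nvidia-smi --query-gpu=index,name,memory.total,driver_version \
        --format=csv,noheader
}
```

If the function returns 1 right after the reboot, the driver package most likely did not install cleanly.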
----------
wget -O cuda_9.0.176_384.81_linux-run https://cloud.ebenjamin.nl/index.php/s/CfGqcYxJPzszBet/download && chmod +x cuda_9.0.176_384.81_linux-run && sudo ./cuda_9.0.176_384.81_linux-run --override

#Answer the questions:
Do you accept the previously read EULA?
accept/decline/quit:         accept

You are attempting to install on an unsupported configuration. Do you wish to continue?
(y)es/(n)o [ default is no ]: yes

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: no

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: y

Enter CUDA Samples Location
 [ default is /home/erani ]: 

------------------


mcedit ~/.bashrc

add
# NVIDIA CUDA Toolkit
export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64

source ~/.bashrc

nvcc -V
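Editing ~/.bashrc by hand works fine. If you script the setup, a helper like this sketch (the function name is an assumption) appends each line only once, so re-running the script does not pile up duplicate exports.

```shell
# Hypothetical helper: append a line to a file only if it is not there yet.
# Usage: append_once ~/.bashrc 'export PATH=/usr/local/cuda-9.0/bin:$PATH'
append_once() {
    file=$1
    line=$2
    touch "$file" || return 1
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Keep the single quotes around the export line in the call, so that $PATH is written literally and only expanded when ~/.bashrc is sourced.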


wget -O cudnn-9.0-linux-x64-v7.4.1.5.tgz https://cloud.ebenjamin.nl/index.php/s/bsPpBWTr4QBg5Yg/download && tar -zxf cudnn-9.0-linux-x64-v7.4.1.5.tgz && cd cuda && sudo cp -P lib64/* /usr/local/cuda/lib64/ && sudo cp -P include/* /usr/local/cuda/include/ && cd ~ && wget https://bootstrap.pypa.io/get-pip.py && sudo python3 get-pip.py && sudo pip install virtualenv virtualenvwrapper && sudo rm -rf ~/get-pip.py ~/.cache/pip
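To confirm that the cuDNN headers landed where CUDA expects them, the version can be read back from cudnn.h. This is a sketch with a made-up function name. It assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as the cuDNN 7.x headers do.

```shell
# Hypothetical helper: read the cuDNN version from a cudnn.h header.
# Usage: cudnn_version /usr/local/cuda/include/cudnn.h
cudnn_version() {
    header=$1
    [ -r "$header" ] || { echo "cannot read $header" >&2; return 1; }
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}
```

With the archive from the command above, the printed version should match the one in the file name.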

mcedit ~/.bashrc
add
# virtualenv and virtualenvwrapper
export WORKON_HOME=$HOME/.virtualenvs
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh

source ~/.bashrc

mkvirtualenv dl4cv -p python3 && workon dl4cv && pip install numpy && pip install opencv-contrib-python &&  pip install scipy matplotlib pillow && pip install imutils h5py requests progressbar2 && pip install scikit-learn scikit-image && pip install tensorflow-gpu==1.12.0 && sudo reboot
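After the reboot, a quick smoke test of the virtualenv saves time later. The helper names below are made up for illustration, and python3 as the default interpreter is an assumption; inside an activated virtualenv you can set PYTHON=python.

```shell
# Hypothetical smoke test for a virtualenv; run it after "workon <env>".
# PYTHON can be overridden; it defaults to python3 here as an assumption.
check_import() {
    "${PYTHON:-python3}" -c "import $1" >/dev/null 2>&1
}

# Report each module as OK or MISSING.
# Usage: check_imports numpy cv2 tensorflow
check_imports() {
    for module in "$@"; do
        if check_import "$module"; then
            echo "OK      $module"
        else
            echo "MISSING $module"
        fi
    done
}
```

For the environment above, `workon dl4cv` followed by `check_imports numpy cv2 scipy sklearn skimage tensorflow` covers the packages that were just installed.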

-----
pip install keras
#Problematic, use import tensorflow.keras
#Also problematic for chapter 10 book3 check Ticket #798 or https://github.com/keras-team/keras/issues/10648

---------------------------
#mxnet

cd ~ && wget -O opencv.zip https://github.com/opencv/opencv/archive/3.4.4.zip && wget -O opencv_contrib.zip https://github.com/opencv/opencv_contrib/archive/3.4.4.zip && unzip opencv.zip && unzip opencv_contrib.zip && mv opencv-3.4.4 opencv && mv opencv_contrib-3.4.4 opencv_contrib && mkvirtualenv mxnet -p python3 && workon mxnet && pip install numpy scipy matplotlib pillow && pip install imutils h5py requests progressbar2 && pip install scikit-learn scikit-image && cd ~/opencv && mkdir build && cd build

cmake -D CMAKE_BUILD_TYPE=RELEASE \
	-D CMAKE_INSTALL_PREFIX=/usr/local \
	-D INSTALL_PYTHON_EXAMPLES=ON \
	-D INSTALL_C_EXAMPLES=OFF \
	-D OPENCV_GENERATE_PKGCONFIG=YES \
	-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
	-D PYTHON_EXECUTABLE=~/.virtualenvs/mxnet/bin/python \
	-D OPENCV_ENABLE_NONFREE=ON \
	-D BUILD_EXAMPLES=ON ..

make -j4 && sudo make install && sudo ldconfig && pkg-config --modversion opencv && cd /usr/local/python/cv2/python-3.6 && sudo mv cv2.cpython-36m-x86_64-linux-gnu.so cv2.opencv3.4.4.so && cd ~/.virtualenvs/mxnet/lib/python3.6/site-packages && ln -s /usr/local/python/cv2/python-3.6/cv2.opencv3.4.4.so cv2.so && cd /usr/share/pkgconfig && sudo ln -s /usr/local/lib/opencv4.pc opencv4.pc

cd ~
workon mxnet
python
>>> import cv2
>>> cv2.__version__



cd /usr/bin && sudo rm gcc g++ && sudo ln -s gcc-6 gcc && sudo ln -s g++-6 g++ && cd ~ && git clone --recursive --no-checkout https://github.com/apache/incubator-mxnet.git mxnet && cd mxnet && git checkout v1.3.x && git submodule update --init && workon mxnet

make -j4 USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

cd ~/.virtualenvs/mxnet/lib/python3.6/site-packages/ && ln -s ~/mxnet/python/mxnet mxnet

#Test:
workon mxnet
cd ~
python
>>> import mxnet
>>>

cd /usr/bin && sudo rm gcc g++ && sudo ln -s gcc-7 gcc && sudo ln -s g++-7 g++ && cd ~ && rm -rf opencv/ && rm -rf opencv_contrib/

------------------------------------------------------------------------------------------------

TFOD

sudo apt-get -y install protobuf-compiler python-tk && mkvirtualenv tfod_api -p python3 && pip install numpy scipy cython && pip install scikit-learn matplotlib && pip install lxml jupyter && pip install Pillow imutils && pip install tensorflow-gpu==1.12 && pip install beautifulsoup4 && cd ~/.virtualenvs/tfod_api/lib/python3.6/site-packages/ && ln -s /usr/local/lib/python3.6/site-packages/cv2.cpython-35m-x86_64-linux-gnu.so cv2.so && cd ~ && pip install opencv-contrib-python && cd ~ && git clone https://github.com/cocodataset/cocoapi.git && cd cocoapi/PythonAPI && python setup.py install


#TEST
python
>>> import tensorflow
>>> import cv2 # optional
>>>

cd ~ && git clone https://github.com/tensorflow/models

#TEST
ls
cuda  installers  models  mxnet

#cd models && git checkout c9f03bf6a8ae58b9ead119796a6c3cd9bd04d450 && cd ~/models/research/ && protoc object_detection/protos/*.proto --python_out=. && pip install pycocotools
cd ~/models/research/ && protoc object_detection/protos/*.proto --python_out=. && cp object_detection/packages/tf1/setup.py . && python -m pip install --use-feature=2020-resolver . && pip install pycocotools && workon tfod_api && cd ~/models/research && export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

#Edit new file:
mcedit ~/setup.sh
#!/bin/sh
export PYTHONPATH=$PYTHONPATH:/home/`whoami`/models/research:/home/`whoami`/models/research/slim

#Use these commands to log on to the TFOD env:
workon tfod_api
source setup.sh
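Sourcing setup.sh several times in one shell keeps appending the same directories to PYTHONPATH. A variant like this sketch adds each directory only once. The function name is an assumption, and it uses $HOME instead of /home/`whoami`, which is the same path on a standard install.

```shell
# Hypothetical variant of setup.sh: add a directory to PYTHONPATH once.
add_pythonpath() {
    case ":${PYTHONPATH:-}:" in
        *":$1:"*) ;;                                      # already present
        *) PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$1" ;;  # append
    esac
    export PYTHONPATH
}

add_pythonpath "$HOME/models/research"
add_pythonpath "$HOME/models/research/slim"
```

It has to be sourced, not executed, just like the original setup.sh, otherwise the export is lost when the script exits.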

----------------------------------------------------------------------------------

KERAS RETINANET
deactivate
cd ~ && mkvirtualenv retinanet -p python3 && pip install numpy scipy h5py && pip install scikit-learn Pillow imutils && pip install beautifulsoup4 && pip install tensorflow-gpu && pip install keras && cd ~/.virtualenvs/retinanet/lib/python3.6/site-packages/ && ln -s /usr/local/lib/python3.6/site-packages/cv2.cpython-35m-x86_64-linux-gnu.so cv2.so && pip install opencv-contrib-python && cd ~ && git clone https://github.com/fizyr/keras-retinanet && cd keras-retinanet && git checkout 42068ef9e406602d92a1afe2ee7d470f7e9860df && python setup.py install
	
python
>>> import tensorflow.keras
Using TensorFlow backend.
>>> import cv2
>>> import keras_retinanet
>>>

-------------------------------------------------------------------------------

Keras Mask R-CNN

deactivate
cd ~ && mkvirtualenv mask_rcnn -p python3 && workon mask_rcnn && pip install numpy scipy h5py && pip install scikit-learn Pillow && pip install imgaug imutils && pip install beautifulsoup4 && pip install tensorflow-gpu==1.12 && pip install keras && cd ~/.virtualenvs/mask_rcnn/lib/python3.6/site-packages/ && ln -s /usr/local/lib/python3.6/site-packages/cv2.cpython-35m-x86_64-linux-gnu.so cv2.so && cd ~ && pip install opencv-contrib-python && cd ~ && git clone https://github.com/matterport/Mask_RCNN && cd Mask_RCNN && git checkout 1aca439c37849dcd085167c4e69d3abcd9d368d7 && pip install -r requirements.txt

python
>>> import mrcnn
>>>