The first thing I needed was a machine that can train on large datasets and run inference on them, beyond what a CPU alone can handle.
Main Board
So I built Marvin. It's based on an old Intel Core i5-7500 CPU with 8 gigs of RAM, taken from one of my own crypto miners.
It only has 4 cores, but it doesn’t matter since the magic happens on the GPUs.
Processing
The crunching needs a good GPU. I had a couple of AMD RX580s with 8 gigs of RAM from the miner, but at that time (and maybe still) AMD's software support for AI was problematic.
Luckily, old Nvidia Tesla cards are dirt cheap because they can't be used for gaming (no video output) and can't be used for crypto mining (too little RAM), so I got an Nvidia Tesla K20Xm with 6 gigs of RAM from eBay for 65 EUR.
These are the two RX580s and the Tesla K20Xm
Lately, a lot of 12 gig Nvidia Tesla cards have shown up on eBay. I guess that's not enough RAM for Ethereum mining, so they got dumped there.
I scored one Nvidia K40x with 2x12 gigs on board!
That's massive processing power for 130 USD (this card used to cost 9,000 USD in 2015).
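With a mix of cards on the bench, it helps to check what the driver actually sees. Here is a minimal Python sketch, assuming the Nvidia driver is installed and nvidia-smi is on the PATH. The function names are my own, made up for illustration.

```python
import csv
import io
import subprocess

# nvidia-smi query: one CSV line per GPU, memory reported in MiB without units.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_list(text):
    """Turn nvidia-smi CSV output into a list of (name, memory_mib) tuples."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or unexpected lines
        name, mem = row
        try:
            gpus.append((name.strip(), int(mem.strip())))
        except ValueError:
            continue  # memory field was not a number (e.g. "[N/A]")
    return gpus


def list_gpus():
    """Run nvidia-smi and return the GPUs it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)
```

Calling list_gpus() on the machine itself should return one tuple per card that the driver recognizes.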
Storage
The datasets are massive folders filled with JPG picture files. Sometimes, to create a dataset, a video has to be cut into frames, and each frame becomes a JPG.
For example, the ImageNet dataset includes 3.5 million files.
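Cutting a video into frames can be sketched like this. It is only an illustration, assuming ffmpeg is installed; the function names, the file naming pattern and the default of one frame per second are my own choices for the example.

```python
import subprocess
from pathlib import Path


def frame_extract_command(video, out_dir, fps=1):
    """Build an ffmpeg command that saves `fps` frames per second as JPGs."""
    pattern = str(Path(out_dir) / "frame_%06d.jpg")  # frame_000001.jpg, ...
    return [
        "ffmpeg",
        "-i", str(video),      # input video
        "-vf", f"fps={fps}",   # sample this many frames per second
        "-q:v", "2",           # high JPG quality (lower number = better)
        pattern,
    ]


def extract_frames(video, out_dir, fps=1):
    """Create the output folder and run ffmpeg (raises if ffmpeg fails)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_extract_command(video, out_dir, fps), check=True)
```

Something like extract_frames("clip.mp4", "frames", fps=2) would fill the frames folder with two JPGs per second of video (both names are placeholders).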
The training process itself requires high disk IO to read the images and feed the GPUs, and then to write the model snapshots to disk quickly so the crunching can continue with the next round (also called an epoch).
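To get a rough feeling for whether a disk can keep up, you can time how fast a sample of the images can be read. This is a crude sketch and not a proper benchmark; the function name and the default limit of 1000 files are assumptions I made for the example.

```python
import time
from pathlib import Path


def measure_read_throughput(folder, pattern="*.jpg", limit=1000):
    """Read up to `limit` matching files; return (file count, bytes, MB/s)."""
    files = sorted(Path(folder).rglob(pattern))[:limit]
    total_bytes = 0
    start = time.perf_counter()
    for path in files:
        total_bytes += len(path.read_bytes())  # read the whole file
    elapsed = time.perf_counter() - start
    rate = total_bytes / elapsed / 1e6 if elapsed > 0 else float("inf")
    return len(files), total_bytes, rate
```

Keep in mind that a second run over the same files will mostly hit the operating system's cache, so only the first run says something about the disks themselves.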
So I used a large 3 TB RAID-0 array of rotational disks for size and IO speed.
It has no failsafe or internal redundancy: if one of the disks dies, the whole array is gone. Therefore it rsyncs nightly to Tonto, my super reliable home ZFS storage server with 8 TB on board and a RAID-Z2 topology, which keeps the data safe and easy to restore if Marvin's array fails.
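A nightly rsync like that can be sketched in a few lines of Python. This is an illustration and not my exact script: the hostname "tonto" and both paths are placeholders, and the function names are made up for the example.

```python
import subprocess


def backup_command(source, dest_host, dest_path, delete=False):
    """Build an rsync command that mirrors `source` to dest_host:dest_path."""
    # The trailing slash makes rsync copy the folder's contents,
    # not the folder itself.
    src = source.rstrip("/") + "/"
    cmd = ["rsync", "-a"]  # archive mode: recursive, keeps times and permissions
    if delete:
        cmd.append("--delete")  # also remove files that vanished from the source
    cmd += [src, f"{dest_host}:{dest_path}"]
    return cmd


def run_backup(source, dest_host, dest_path, delete=False):
    """Run the backup and raise if rsync reports an error."""
    subprocess.run(backup_command(source, dest_host, dest_path, delete), check=True)
```

A cron entry that runs a script calling run_backup("/data/datasets", "tonto", "/tank/marvin/datasets") once a night would do the job (all three values are hypothetical). I left --delete off by default, so a mistake on Marvin does not wipe the copy on the backup server.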
For the fast disks, I used an LVM array of individually connected small SSDs (120 gigs each), giving me a 550 GB SSD workspace for the project I'm crunching at that moment. It has enough IO capacity to feed the training of the models.
Again, there is no failover: if one of the disks drops out, the whole LVM volume is gone. It's a risk I'm willing to take for the storage size, especially since this array is also backed up nightly to Tonto.
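To put a number on that risk: in an array without redundancy, every disk has to survive, so the chances multiply. Below is a small sketch of that arithmetic. It assumes disks fail independently of each other, and the failure probability you plug in is your own guess, not a measured value.

```python
def array_survival_probability(disk_failure_prob, disks):
    """Chance that a no-redundancy array survives: every disk must survive."""
    if not 0.0 <= disk_failure_prob <= 1.0:
        raise ValueError("disk_failure_prob must be between 0 and 1")
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return (1.0 - disk_failure_prob) ** disks


def array_failure_probability(disk_failure_prob, disks):
    """Chance that at least one disk fails, taking the whole array with it."""
    return 1.0 - array_survival_probability(disk_failure_prob, disks)
```

For example, with a made-up 5% chance per disk per year and five disks, array_failure_probability(0.05, 5) comes out at roughly 23%, which is why the nightly backup matters.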
Setup
I decided not to use a case, since it's easier to get stuff connected and disconnected. After some testing and tinkering at work, I got a nice working system on my table at home.
I had to 3D-print the disk brackets. I think it came out quite nice, stable and cool (literally, with the fans attached to the fanless Tesla cards).
It can and will be improved over time as the need arises, but for now Marvin is fully equipped for the tasks I have for him.
I will need to print some more SSD disk brackets.
I would like to thank my employer MessageToTheMoon.nl and my bosses Monique and Hans-Willem for allowing me to use the office facility to study after hours. This way I didn't wake up the kids with my nightly tinkering.
Below I will elaborate on the actual thing that drives Marvin: the software.