Running KataGo/KaTrain on Jetson AGX Xavier

Set up Jetson AGX Xavier
From an Ubuntu 20.04 host, use sdkmanager and follow its instructions.
Temporarily change VERSION_ID in /etc/os-release from 20.04 to 18.04, and restore it afterwards.
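
The VERSION_ID change above can be sketched as a small text rewrite. This is only an illustration: the helper name set_version_id is hypothetical, it works on a string rather than on the real file, and writing the result back to /etc/os-release would need root privileges and a backup of the original.

```python
import re

def set_version_id(os_release_text, new_version):
    """Return os-release content with the VERSION_ID line replaced."""
    # Matches a whole line such as VERSION_ID="20.04" or VERSION_ID=20.04.
    pattern = re.compile(r"^VERSION_ID=.*$", flags=re.MULTILINE)
    return pattern.sub('VERSION_ID="{}"'.format(new_version), os_release_text)

# Sample content only; the real file has more fields.
sample = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu\n'
print(set_version_id(sample, "18.04"))
```

Calling the same helper with "20.04" afterwards restores the original value.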

Preparation-CMake (on Jetson)
KataGo requires CMake 3.18.2 or higher, but the default was 3.10.2, so build it from source.
Download the CMake source, then install it to /usr/local/CMake.
git clone https://github.com/Kitware/CMake
cd CMake
./bootstrap --prefix=/usr/local/CMake
make
sudo make install
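
After the install, it may be worth confirming that the new CMake really meets the 3.18.2 minimum. The sketch below parses the text printed by cmake --version and compares it as a tuple. The function names are hypothetical, and the sample string stands in for real output.

```python
import re

MIN_CMAKE = (3, 18, 2)  # minimum version stated in this post

def parse_cmake_version(output):
    """Extract (major, minor, patch) from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no CMake version found in output")
    return tuple(int(part) for part in match.groups())

def is_new_enough(version, minimum=MIN_CMAKE):
    """Tuples compare element by element, so (3, 10, 2) < (3, 18, 2)."""
    return version >= minimum

# The stock version mentioned above fails the check.
print(is_new_enough(parse_cmake_version("cmake version 3.10.2")))  # prints False
```

To check the freshly built binary, pass it the output of its own --version call; with the prefix used here the binary normally lands under /usr/local/CMake/bin.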

Preparation-KataGo v1.10.0 (on Jetson)
Refer to github.com/lightvector/KataGo/blob/master/Compiling.md#linux
For CMake, see Preparation-CMake (on Jetson) above.
Check that all required packages are installed and up to date, including zlib1g-dev, libzip-dev, libgoogle-perftools-dev, libssl-dev, and ocl-icd-opencl-dev.

Build KataGo
git clone https://github.com/lightvector/KataGo.git
cd KataGo
git checkout v1.10.0
cd cpp
/usr/local/CMake/bin/cmake . -DUSE_BACKEND=CUDA -DUSE_TCMALLOC=1
[Note]
Selecting the TensorRT backend (TENSORRT) fails the version check with "TensorRT 8.2 or greater is required but 8.0.1 was found". Upgrading the TensorRT package is not easy at the moment, so the TensorRT backend is skipped here.
make

Test/Benchmark KataGo
./katago benchmark -model tests/models/g170-b6c96-s175395328-d26788732.bin.gz -config configs/gtp_example.cfg

Ordered summary of results: 

numSearchThreads =  5: 10 / 10 positions, visits/s = 788.64 nnEvals/s = 701.63 nnBatches/s = 281.81 avgBatchSize = 2.49 (10.2 secs) (EloDiff baseline)
numSearchThreads = 10: 10 / 10 positions, visits/s = 1105.99 nnEvals/s = 979.53 nnBatches/s = 197.14 avgBatchSize = 4.97 (7.3 secs) (EloDiff +114)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1271.85 nnEvals/s = 1138.23 nnBatches/s = 192.11 avgBatchSize = 5.92 (6.4 secs) (EloDiff +163)
numSearchThreads = 16: 10 / 10 positions, visits/s = 1297.56 nnEvals/s = 1172.58 nnBatches/s = 143.77 avgBatchSize = 8.16 (6.3 secs) (EloDiff +163)
numSearchThreads = 20: 10 / 10 positions, visits/s = 1494.94 nnEvals/s = 1355.48 nnBatches/s = 135.07 avgBatchSize = 10.04 (5.5 secs) (EloDiff +210)
numSearchThreads = 24: 10 / 10 positions, visits/s = 1540.22 nnEvals/s = 1406.60 nnBatches/s = 116.03 avgBatchSize = 12.12 (5.3 secs) (EloDiff +215)
numSearchThreads = 32: 10 / 10 positions, visits/s = 1608.93 nnEvals/s = 1478.43 nnBatches/s = 84.03 avgBatchSize = 17.59 (5.2 secs) (EloDiff +219)
numSearchThreads = 40: 10 / 10 positions, visits/s = 1682.52 nnEvals/s = 1579.64 nnBatches/s = 74.20 avgBatchSize = 21.29 (5.0 secs) (EloDiff +224)
numSearchThreads = 48: 10 / 10 positions, visits/s = 1648.66 nnEvals/s = 1564.58 nnBatches/s = 55.86 avgBatchSize = 28.01 (5.1 secs) (EloDiff +203)
numSearchThreads = 64: 10 / 10 positions, visits/s = 1628.13 nnEvals/s = 1574.74 nnBatches/s = 43.20 avgBatchSize = 36.45 (5.3 secs) (EloDiff +172)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  5: (baseline)
numSearchThreads = 10:  +114 Elo
numSearchThreads = 12:  +163 Elo
numSearchThreads = 16:  +163 Elo
numSearchThreads = 20:  +210 Elo
numSearchThreads = 24:  +215 Elo
numSearchThreads = 32:  +219 Elo
numSearchThreads = 40:  +224 Elo (recommended)
numSearchThreads = 48:  +203 Elo
numSearchThreads = 64:  +172 Elo
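
When comparing several runs, the summary lines above can be parsed instead of read by eye. The following is a rough sketch that assumes the line layout shown in this post. The names parse_summary and best_by_elo are hypothetical, and the regular expression may need adjusting for other KataGo versions.

```python
import re

# Matches the detailed summary lines, e.g.
# "numSearchThreads = 40: ... visits/s = 1682.52 ... (EloDiff +224)"
LINE_RE = re.compile(
    r"numSearchThreads\s*=\s*(\d+):.*?visits/s\s*=\s*([\d.]+).*?\(EloDiff\s+([^)]+)\)"
)

def parse_summary(text):
    """Return a list of (threads, visits_per_sec, elo_diff) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip headers and the short "+114 Elo" lines
        elo_text = m.group(3)
        elo = 0 if elo_text == "baseline" else int(elo_text)
        rows.append((int(m.group(1)), float(m.group(2)), elo))
    return rows

def best_by_elo(rows):
    """Pick the row with the highest EloDiff (first one wins on ties)."""
    return max(rows, key=lambda row: row[2])
```

Fed the CUDA results above, best_by_elo would select the 40-thread row, matching the "(recommended)" marker in KataGo's own output.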

Consider setting maxVisits to 800 and numSearchThreads to 40 in configs/gtp_example.cfg.

Rename katago to katago-v1.10.0-cuda10.2-arm64 and configs/gtp_example.cfg to gtp_v1.10.0-cuda10.2-arm64.cfg, then pack the executable and the configuration file for later use.

[Ref.] Performance with the Eigen backend (15W DESKTOP mode, i.e. 4 cores)
Ordered summary of results: 

numSearchThreads =  3: 10 / 10 positions, visits/s = 103.06 nnEvals/s = 98.79 nnBatches/s = 94.39 avgBatchSize = 1.05 (8.0 secs) (EloDiff baseline)
numSearchThreads =  4: 10 / 10 positions, visits/s = 122.75 nnEvals/s = 119.35 nnBatches/s = 112.69 avgBatchSize = 1.06 (6.8 secs) (EloDiff +57)
numSearchThreads =  5: 10 / 10 positions, visits/s = 116.70 nnEvals/s = 115.45 nnBatches/s = 102.81 avgBatchSize = 1.12 (7.2 secs) (EloDiff +30)
numSearchThreads =  6: 10 / 10 positions, visits/s = 118.30 nnEvals/s = 116.36 nnBatches/s = 103.97 avgBatchSize = 1.12 (7.2 secs) (EloDiff +27)
numSearchThreads =  8: 10 / 10 positions, visits/s = 122.20 nnEvals/s = 121.78 nnBatches/s = 105.77 avgBatchSize = 1.15 (7.1 secs) (EloDiff +24)
numSearchThreads = 12: 10 / 10 positions, visits/s = 104.99 nnEvals/s = 105.23 nnBatches/s = 80.65 avgBatchSize = 1.30 (8.7 secs) (EloDiff -67)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search: 
numSearchThreads =  3: (baseline)
numSearchThreads =  4:   +57 Elo (recommended)
numSearchThreads =  5:   +30 Elo
numSearchThreads =  6:   +27 Elo
numSearchThreads =  8:   +24 Elo
numSearchThreads = 12:   -67 Elo
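
The two "Based on some test data" sentences can be read as a simple estimate: add about 250 Elo per doubling of visits/s and subtract a per-thread cost. The sketch below is only my reading of those sentences, not KataGo's actual formula, and the per-thread cost is left as a parameter because the output only gives it for 800 and 5000 visits.

```python
import math

ELO_PER_DOUBLING = 250.0  # "each speed doubling gains perhaps ~250 Elo"

def estimated_elo_diff(visits_per_sec, base_visits_per_sec,
                       threads, base_threads, elo_cost_per_thread):
    """Rough Elo difference versus the baseline row of a benchmark."""
    # Gain from searching faster, measured in doublings of visits/s.
    speed_gain = ELO_PER_DOUBLING * math.log2(visits_per_sec / base_visits_per_sec)
    # Penalty for the extra threads beyond the baseline.
    thread_cost = elo_cost_per_thread * (threads - base_threads)
    return speed_gain - thread_cost

# Eigen rows above: 4 threads (122.75 visits/s) vs 3 threads (103.06 visits/s),
# assuming the 7 Elo per thread quoted for 800 visits.
print(round(estimated_elo_diff(122.75, 103.06, 4, 3, 7.0)))
```

With these inputs the estimate comes out at about 56, close to the +57 that KataGo reports for 4 threads.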

Preparation-KaTrain v1.10.1 (on Jetson)
Refer to https://matham.github.io/ffpyplayer/installation.html#ubuntu-18-04
sudo apt install python3-pip libsdl2-dev libavdevice-dev libavformat-dev libjpeg-dev
Refer to https://github.com/sanderland/katrain/blob/master/INSTALL.md#LinuxQuick
git clone https://github.com/sanderland/katrain.git
cd katrain
Edit requirements.txt: Pillow==9.0.0 -> Pillow==8.4.0
pip3 install -r requirements.txt
pip3 install .
(Reboot required?)
Replace the katago binary bundled with KaTrain with the compiled one, keeping a backup of the original.
cp ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago.org
cp ~/katago/KataGo/cpp/katago-v1.10.0-cuda10.2-arm64 ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago
Modify the configuration file to use the benchmarked settings.
vi ~/.local/lib/python3.6/site-packages/katrain/KataGo/analysis_config.cfg
maxVisits = 800 # was 500
numSearchThreads = 40 # was 8
nnMaxBatchSize = 480 # was 96, 12 * 40
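
The same three changes can be applied with a short script instead of vi. This is a minimal sketch with a hypothetical helper, apply_settings, that rewrites `key = value` lines in the config text. The factor 12 is taken from the comment above (the old values were 96 for 8 threads). Note that the sketch drops any trailing comment on a line it replaces.

```python
import re

def apply_settings(cfg_text, settings):
    """Replace `key = value` lines for keys in settings; keep other lines."""
    out = []
    for line in cfg_text.splitlines():
        # Commented-out lines start with '#', so they do not match here.
        m = re.match(r"\s*(\w+)\s*=", line)
        if m and m.group(1) in settings:
            out.append("{} = {}".format(m.group(1), settings[m.group(1)]))
        else:
            out.append(line)
    return "\n".join(out) + "\n"

threads = 40  # recommended by the CUDA benchmark in this post
settings = {
    "maxVisits": 800,
    "numSearchThreads": threads,
    "nnMaxBatchSize": 12 * threads,  # 480, same ratio as the old 96 / 8
}
print(apply_settings("maxVisits = 500\nnumSearchThreads = 8\nnnMaxBatchSize = 96\n", settings))
```

To use it on the real file, read analysis_config.cfg into a string, pass it through apply_settings, and write the result back after saving a copy of the original.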
Run with katrain or python3 -m katrain
If you hit CRITICAL errors, reinstall kivy, install importlib-metadata, and then install the packages below.
pip3 uninstall kivy
pip3 install kivy==2.0.0rc1
pip3 install importlib-metadata
sudo apt install libfreetype6-dev libsdl2-image-dev libsdl2-mixer-dev libsdl2-net-dev libsdl2-ttf-dev libportmidi-dev xclip xsel
pip3 install pygame
If you get a DBUS error, install python3-sdl2 and python-pygame-sdl2, then run katrain again.
sudo apt install python3-sdl2 python-pygame-sdl2
katrain

If you hit an exception (TypeError: "'NoneType' object is not subscriptable") when closing the window, modify ~/.local/lib/python3.6/site-packages/katrain/__main__.py around lines 821 and 855.
# lines 821~822
Window._left = win_left # was Window.left
Window._top = win_top # was Window.top
# lines 855~856
self.gui._config["ui_state"]["top"] = Window._top # was Window.top
self.gui._config["ui_state"]["left"] = Window._left # was Window.left

[TIP] For a properly working GUI, replace kivy with kivy-jetson.
pip3 uninstall kivy
pip3 install kivy-jetson


For comparison
On a machine with an i7-8700, 32GB RAM, and Intel UHD Graphics 630
Using the KataGo v1.10.0 binaries from https://github.com/lightvector/KataGo/releases

katago-v1.10.0-eigenavx2-windows-x64
Ordered summary of results:

numSearchThreads =  5: 10 / 10 positions, visits/s = 247.93 nnEvals/s = 242.92 nnBatches/s = 181.23 avgBatchSize = 1.34 (3.4 secs) (EloDiff baseline)
numSearchThreads =  6: 10 / 10 positions, visits/s = 300.14 nnEvals/s = 291.67 nnBatches/s = 228.46 avgBatchSize = 1.28 (2.8 secs) (EloDiff +67)
numSearchThreads =  8: 10 / 10 positions, visits/s = 383.77 nnEvals/s = 378.47 nnBatches/s = 334.36 avgBatchSize = 1.13 (2.3 secs) (EloDiff +150)
numSearchThreads = 10: 10 / 10 positions, visits/s = 384.28 nnEvals/s = 383.42 nnBatches/s = 345.85 avgBatchSize = 1.11 (2.3 secs) (EloDiff +142)
numSearchThreads = 12: 10 / 10 positions, visits/s = 357.28 nnEvals/s = 355.71 nnBatches/s = 319.98 avgBatchSize = 1.11 (2.5 secs) (EloDiff +104)
numSearchThreads = 20: 10 / 10 positions, visits/s = 296.14 nnEvals/s = 297.34 nnBatches/s = 268.02 avgBatchSize = 1.11 (3.3 secs) (EloDiff -11)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads =  5: (baseline)
numSearchThreads =  6:   +67 Elo
numSearchThreads =  8:  +150 Elo (recommended)
numSearchThreads = 10:  +142 Elo
numSearchThreads = 12:  +104 Elo
numSearchThreads = 20:   -11 Elo

katago-v1.10.0-eigen-windows-x64
Ordered summary of results:

numSearchThreads =  5: 10 / 10 positions, visits/s = 174.02 nnEvals/s = 169.67 nnBatches/s = 129.07 avgBatchSize = 1.31 (4.8 secs) (EloDiff baseline)
numSearchThreads =  8: 10 / 10 positions, visits/s = 152.93 nnEvals/s = 151.52 nnBatches/s = 143.43 avgBatchSize = 1.06 (5.7 secs) (EloDiff -70)
numSearchThreads = 10: 10 / 10 positions, visits/s = 169.49 nnEvals/s = 169.49 nnBatches/s = 153.11 avgBatchSize = 1.11 (5.3 secs) (EloDiff -44)
numSearchThreads = 12: 10 / 10 positions, visits/s = 223.09 nnEvals/s = 222.11 nnBatches/s = 195.39 avgBatchSize = 1.14 (4.1 secs) (EloDiff +51)
numSearchThreads = 16: 10 / 10 positions, visits/s = 139.48 nnEvals/s = 139.77 nnBatches/s = 120.83 avgBatchSize = 1.16 (6.8 secs) (EloDiff -165)
numSearchThreads = 20: 10 / 10 positions, visits/s = 101.38 nnEvals/s = 101.38 nnBatches/s = 79.26 avgBatchSize = 1.28 (9.8 secs) (EloDiff -327)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads =  5: (baseline)
numSearchThreads =  8:   -70 Elo
numSearchThreads = 10:   -44 Elo
numSearchThreads = 12:   +51 Elo (recommended)
numSearchThreads = 16:  -165 Elo
numSearchThreads = 20:  -327 Elo

katago-v1.10.0-opencl-windows-x64
Ordered summary of results:

numSearchThreads =  5: 10 / 10 positions, visits/s = 411.34 nnEvals/s = 370.10 nnBatches/s = 148.52 avgBatchSize = 2.49 (19.5 secs) (EloDiff baseline)
numSearchThreads = 10: 10 / 10 positions, visits/s = 457.81 nnEvals/s = 415.14 nnBatches/s = 83.70 avgBatchSize = 4.96 (17.7 secs) (EloDiff +20)
numSearchThreads = 12: 10 / 10 positions, visits/s = 574.24 nnEvals/s = 525.45 nnBatches/s = 88.72 avgBatchSize = 5.92 (14.1 secs) (EloDiff +100)
numSearchThreads = 16: 10 / 10 positions, visits/s = 584.77 nnEvals/s = 540.79 nnBatches/s = 68.52 avgBatchSize = 7.89 (13.9 secs) (EloDiff +94)
numSearchThreads = 20: 10 / 10 positions, visits/s = 582.89 nnEvals/s = 536.19 nnBatches/s = 54.45 avgBatchSize = 9.85 (14.0 secs) (EloDiff +79)
numSearchThreads = 24: 10 / 10 positions, visits/s = 626.24 nnEvals/s = 577.01 nnBatches/s = 48.77 avgBatchSize = 11.83 (13.1 secs) (EloDiff +95)
numSearchThreads = 32: 10 / 10 positions, visits/s = 655.88 nnEvals/s = 613.73 nnBatches/s = 37.17 avgBatchSize = 16.51 (12.7 secs) (EloDiff +89)


Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads =  5: (baseline)
numSearchThreads = 10:   +20 Elo
numSearchThreads = 12:  +100 Elo (recommended)
numSearchThreads = 16:   +94 Elo
numSearchThreads = 20:   +79 Elo
numSearchThreads = 24:   +95 Elo
numSearchThreads = 32:   +89 Elo
