Running KataGo/KaTrain on Jetson AGX Xavier

Set-up Jetson AGX Xavier

On an Ubuntu 20.04 host PC, run sdkmanager and follow its instructions.

Temporarily change VERSION_ID in /etc/os-release from 20.04 to 18.04, and restore it once the set-up is done.
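The temporary VERSION_ID change can be sketched in Python as a pure text rewrite. This is a minimal sketch, not part of the original procedure: the function name set_version_id and the abbreviated sample content are assumptions, and writing the result back to /etc/os-release (which needs root) is deliberately left out.

```python
import re


def set_version_id(os_release_text: str, new_version: str) -> str:
    """Return os-release content with the VERSION_ID value replaced."""
    pattern = re.compile(r"^VERSION_ID=.*$", flags=re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f'VERSION_ID="{new_version}"', os_release_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one VERSION_ID line, found {count}")
    return updated


# Abbreviated, hypothetical os-release content used only for illustration.
sample = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nVERSION_CODENAME=focal\n'
patched = set_version_id(sample, "18.04")     # before running sdkmanager
restored = set_version_id(patched, "20.04")   # after the set-up is done
```

Keeping the rewrite as a function on text makes it easy to check that restoring gives back the original content exactly.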

Preparation-CMake (on Jetson)

KataGo requires CMake 3.18.2 or higher, but the default version was 3.10.2, so build CMake from source.

Download the CMake source, then install it under /usr/local/CMake.
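Whether a source build is needed can be checked by comparing the installed version against the minimum. The following is a rough sketch; the helper names parse_cmake_version and needs_source_build are made up for illustration, and it assumes the usual first line of `cmake --version` output ("cmake version X.Y.Z").

```python
import re

MIN_CMAKE = (3, 18, 2)  # minimum version quoted in the text above


def parse_cmake_version(output: str) -> tuple:
    """Extract (major, minor, patch) from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no CMake version found in output")
    return tuple(int(part) for part in match.groups())


def needs_source_build(output: str, minimum: tuple = MIN_CMAKE) -> bool:
    """True when the reported CMake is older than the required minimum."""
    return parse_cmake_version(output) < minimum
```

In practice the output string would come from something like subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout. Comparing tuples of integers avoids the string-comparison trap where "3.10.2" sorts before "3.9.0".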

git clone https://github.com/Kitware/CMake

cd CMake

./bootstrap --prefix=/usr/local/CMake

make

sudo make install

Preparation-KataGo v1.10.0 (on Jetson)

Refer to github.com/lightvector/KataGo/blob/master/Compiling.md#linux

For CMake, refer to Preparation-CMake (on Jetson) above.

Check that all required packages are installed and up to date, including zlib1g-dev, libzip-dev, libgoogle-perftools-dev, libssl-dev, and ocl-icd-opencl-dev.

Build KataGo

git clone https://github.com/lightvector/KataGo.git

cd KataGo

git checkout v1.10.0

cd cpp

/usr/local/CMake/bin/cmake . -DUSE_BACKEND=CUDA -DUSE_TCMALLOC=1

[Note]

Selecting the TensorRT backend (TENSORRT) fails the CMake check with "TensorRT 8.2 or greater is required but 8.0.1 was found". Upgrading the TensorRT package is currently not easy, so this backend is skipped.

make

Test/Benchmark KataGo

./katago benchmark -model tests/models/g170-b6c96-s175395328-d26788732.bin.gz -config configs/gtp_example.cfg

Ordered summary of results:

numSearchThreads = 5: 10 / 10 positions, visits/s = 788.64 nnEvals/s = 701.63 nnBatches/s = 281.81 avgBatchSize = 2.49 (10.2 secs) (EloDiff baseline)

numSearchThreads = 10: 10 / 10 positions, visits/s = 1105.99 nnEvals/s = 979.53 nnBatches/s = 197.14 avgBatchSize = 4.97 (7.3 secs) (EloDiff +114)

numSearchThreads = 12: 10 / 10 positions, visits/s = 1271.85 nnEvals/s = 1138.23 nnBatches/s = 192.11 avgBatchSize = 5.92 (6.4 secs) (EloDiff +163)

numSearchThreads = 16: 10 / 10 positions, visits/s = 1297.56 nnEvals/s = 1172.58 nnBatches/s = 143.77 avgBatchSize = 8.16 (6.3 secs) (EloDiff +163)

numSearchThreads = 20: 10 / 10 positions, visits/s = 1494.94 nnEvals/s = 1355.48 nnBatches/s = 135.07 avgBatchSize = 10.04 (5.5 secs) (EloDiff +210)

numSearchThreads = 24: 10 / 10 positions, visits/s = 1540.22 nnEvals/s = 1406.60 nnBatches/s = 116.03 avgBatchSize = 12.12 (5.3 secs) (EloDiff +215)

numSearchThreads = 32: 10 / 10 positions, visits/s = 1608.93 nnEvals/s = 1478.43 nnBatches/s = 84.03 avgBatchSize = 17.59 (5.2 secs) (EloDiff +219)

numSearchThreads = 40: 10 / 10 positions, visits/s = 1682.52 nnEvals/s = 1579.64 nnBatches/s = 74.20 avgBatchSize = 21.29 (5.0 secs) (EloDiff +224)

numSearchThreads = 48: 10 / 10 positions, visits/s = 1648.66 nnEvals/s = 1564.58 nnBatches/s = 55.86 avgBatchSize = 28.01 (5.1 secs) (EloDiff +203)

numSearchThreads = 64: 10 / 10 positions, visits/s = 1628.13 nnEvals/s = 1574.74 nnBatches/s = 43.20 avgBatchSize = 36.45 (5.3 secs) (EloDiff +172)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.

Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).

So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:

numSearchThreads = 5: (baseline)

numSearchThreads = 10: +114 Elo

numSearchThreads = 12: +163 Elo

numSearchThreads = 16: +163 Elo

numSearchThreads = 20: +210 Elo

numSearchThreads = 24: +215 Elo

numSearchThreads = 32: +219 Elo

numSearchThreads = 40: +224 Elo (recommended)

numSearchThreads = 48: +203 Elo

numSearchThreads = 64: +172 Elo
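The "Ordered summary of results" lines are regular enough to parse, which helps when comparing several runs. Below is a minimal sketch, assuming the line format shown above; the names parse_summary and best_by_elo are hypothetical and not part of KataGo.

```python
import re

# Matches one line of the "Ordered summary of results" block.
LINE_RE = re.compile(
    r"numSearchThreads = (\d+): .*?visits/s = ([\d.]+)"
    r".*?\(EloDiff (baseline|[+-]\d+)\)"
)


def parse_summary(text):
    """Return a list of (threads, visits_per_s, elo_diff) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            elo = 0 if m.group(3) == "baseline" else int(m.group(3))
            rows.append((int(m.group(1)), float(m.group(2)), elo))
    return rows


def best_by_elo(rows):
    """Thread count with the highest EloDiff (first one wins on ties)."""
    return max(rows, key=lambda row: row[2])[0]


# Three lines copied from the CUDA benchmark above.
SAMPLE = """\
numSearchThreads = 5: 10 / 10 positions, visits/s = 788.64 nnEvals/s = 701.63 nnBatches/s = 281.81 avgBatchSize = 2.49 (10.2 secs) (EloDiff baseline)
numSearchThreads = 40: 10 / 10 positions, visits/s = 1682.52 nnEvals/s = 1579.64 nnBatches/s = 74.20 avgBatchSize = 21.29 (5.0 secs) (EloDiff +224)
numSearchThreads = 64: 10 / 10 positions, visits/s = 1628.13 nnEvals/s = 1574.74 nnBatches/s = 43.20 avgBatchSize = 36.45 (5.3 secs) (EloDiff +172)
"""
rows = parse_summary(SAMPLE)
```

The short "numSearchThreads = 40: +224 Elo" lines are skipped on purpose, since they carry no visits/s figure.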

Consider changing maxVisits to 800 and numSearchThreads to 40 in configs/gtp_example.cfg.

Rename katago to katago-v1.10.0-cuda10.2-arm64 and configs/gtp_example.cfg to gtp_v1.10.0-cuda10.2-arm64.cfg, then pack the executable and the configuration file for later use.
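The rename-and-pack step could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name pack_build and the archive name are invented for illustration, and the files are copied rather than renamed so the build tree stays intact.

```python
import shutil
import tarfile
from pathlib import Path


def pack_build(exe: Path, cfg: Path, out_dir: Path, tag: str) -> Path:
    """Copy the executable and config under tagged names and tar them together."""
    out_dir.mkdir(parents=True, exist_ok=True)
    exe_out = out_dir / f"katago-{tag}"
    cfg_out = out_dir / f"gtp_{tag}.cfg"
    shutil.copy2(exe, exe_out)  # copy2 also copies permission bits
    shutil.copy2(cfg, cfg_out)
    archive = out_dir / f"katago-{tag}.tar.gz"  # archive name is an assumption
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(exe_out, arcname=exe_out.name)
        tar.add(cfg_out, arcname=cfg_out.name)
    return archive
```

A call such as pack_build(Path("katago"), Path("configs/gtp_example.cfg"), Path("dist"), "v1.10.0-cuda10.2-arm64") from KataGo/cpp would produce the two file names used in this post.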

[Ref.] Performance when the Eigen backend is chosen (15W DESKTOP mode, 4 cores)

Ordered summary of results:

numSearchThreads = 3: 10 / 10 positions, visits/s = 103.06 nnEvals/s = 98.79 nnBatches/s = 94.39 avgBatchSize = 1.05 (8.0 secs) (EloDiff baseline)

numSearchThreads = 4: 10 / 10 positions, visits/s = 122.75 nnEvals/s = 119.35 nnBatches/s = 112.69 avgBatchSize = 1.06 (6.8 secs) (EloDiff +57)

numSearchThreads = 5: 10 / 10 positions, visits/s = 116.70 nnEvals/s = 115.45 nnBatches/s = 102.81 avgBatchSize = 1.12 (7.2 secs) (EloDiff +30)

numSearchThreads = 6: 10 / 10 positions, visits/s = 118.30 nnEvals/s = 116.36 nnBatches/s = 103.97 avgBatchSize = 1.12 (7.2 secs) (EloDiff +27)

numSearchThreads = 8: 10 / 10 positions, visits/s = 122.20 nnEvals/s = 121.78 nnBatches/s = 105.77 avgBatchSize = 1.15 (7.1 secs) (EloDiff +24)

numSearchThreads = 12: 10 / 10 positions, visits/s = 104.99 nnEvals/s = 105.23 nnBatches/s = 80.65 avgBatchSize = 1.30 (8.7 secs) (EloDiff -67)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.

Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).

So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:

numSearchThreads = 3: (baseline)

numSearchThreads = 4: +57 Elo (recommended)

numSearchThreads = 5: +30 Elo

numSearchThreads = 6: +27 Elo

numSearchThreads = 8: +24 Elo

numSearchThreads = 12: -67 Elo
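The two rules of thumb printed by the benchmark (about 250 Elo per speed doubling, about 7 Elo lost per extra thread at 800 visits) can be applied by hand. The sketch below is a rough approximation only: rough_elo_diff is a made-up helper, and KataGo's printed EloDiff comes from its own model, so the numbers will not match exactly.

```python
import math

ELO_PER_DOUBLING = 250.0   # "each speed doubling gains perhaps ~250 Elo"
ELO_COST_PER_THREAD = 7.0  # "each thread costs perhaps 7 Elo if using 800 visits"


def rough_elo_diff(visits_per_s, threads, base_visits_per_s, base_threads,
                   cost_per_thread=ELO_COST_PER_THREAD):
    """Speed gain in Elo minus the per-thread penalty, relative to a baseline."""
    speed_gain = ELO_PER_DOUBLING * math.log2(visits_per_s / base_visits_per_s)
    thread_cost = cost_per_thread * (threads - base_threads)
    return speed_gain - thread_cost


# Eigen run above: 4 threads at 122.75 visits/s vs. the 3-thread baseline at 103.06.
estimate = rough_elo_diff(122.75, 4, 103.06, 3)
```

For the 4-thread Eigen row this gives roughly +56, close to the reported +57. On the CUDA run the 7 Elo per thread figure is too pessimistic, because a 5 second search there reaches far more than 800 visits.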

Preparation-KaTrain v1.10.1 (on Jetson)

Refer to https://matham.github.io/ffpyplayer/installation.html#ubuntu-18-04

sudo apt install python3-pip libsdl2-dev libavdevice-dev libavformat-dev libjpeg-dev

Refer to https://github.com/sanderland/katrain/blob/master/INSTALL.md#LinuxQuick

git clone https://github.com/sanderland/katrain.git

cd katrain

Edit requirements.txt: Pillow==9.0.0 -> Pillow==8.4.0
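The pin change can also be done programmatically before running pip3. This is a minimal sketch; the helper name repin and the sample file content (apart from the Pillow line) are assumptions made for illustration.

```python
import re


def repin(requirements_text: str, package: str, version: str) -> str:
    """Replace an exact `package==x.y.z` pin with a new version."""
    pattern = re.compile(
        rf"^{re.escape(package)}==.*$", flags=re.MULTILINE | re.IGNORECASE
    )
    updated, count = pattern.subn(
        lambda m: f"{package}=={version}", requirements_text
    )
    if count == 0:
        raise ValueError(f"no exact pin found for {package}")
    return updated


# Hypothetical, shortened requirements.txt; only the Pillow pin is from the post.
sample = "some-other-package>=1.0\nPillow==9.0.0\n"
patched = repin(sample, "Pillow", "8.4.0")
```

Raising when no pin is found guards against silently installing the unwanted version if the upstream file changes.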

pip3 install -r requirements.txt

pip3 install .

(Reboot required?)

Replace the default katago executable bundled with KaTrain with the compiled one, keeping the original as katago.org.

cp ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago.org

cp ~/katago/KataGo/cpp/katago-v1.10.0-cuda10.2-arm64 ~/.local/lib/python3.6/site-packages/katrain/KataGo/katago

Update the configuration file with the benchmarked settings.

vi ~/.local/lib/python3.6/site-packages/katrain/KataGo/analysis_config.cfg

maxVisits = 800 # was 500

numSearchThreads = 40 # was 8

nnMaxBatchSize = 480 # was 96, 12 * 40
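The three values above follow one pattern: nnMaxBatchSize keeps the original ratio of 12 per search thread (96 / 8), so it can be derived from numSearchThreads. The sketch below applies that to a config given as text; the helper names are hypothetical, and it assumes simple `key = value` lines as in analysis_config.cfg.

```python
import re

BATCH_PER_THREAD = 12  # 96 / 8 in the original file, kept for the new thread count


def set_cfg_value(cfg_text, key, value):
    """Rewrite an uncommented `key = value` line, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", flags=re.MULTILINE)
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{key} = {value}", cfg_text
    )
    if count == 0:
        raise KeyError(key)
    return updated


def apply_benchmark(cfg_text, threads, max_visits):
    """Set maxVisits, numSearchThreads and a matching nnMaxBatchSize."""
    cfg_text = set_cfg_value(cfg_text, "maxVisits", max_visits)
    cfg_text = set_cfg_value(cfg_text, "numSearchThreads", threads)
    return set_cfg_value(cfg_text, "nnMaxBatchSize", BATCH_PER_THREAD * threads)


# Original values as noted in the "was" comments above.
original = "maxVisits = 500\nnumSearchThreads = 8\nnnMaxBatchSize = 96\n"
updated = apply_benchmark(original, threads=40, max_visits=800)
```

Lines starting with # are left alone, so commented-out defaults in the file are not touched.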

Run with katrain or python3 -m katrain

If you face CRITICAL errors, reinstall kivy and install importlib-metadata:

pip3 uninstall kivy

pip3 install kivy==2.0.0rc1

pip3 install importlib-metadata

sudo apt install libfreetype6-dev libsdl2-image-dev libsdl2-mixer-dev libsdl2-net-dev libsdl2-ttf-dev libportmidi-dev xclip xsel

pip3 install pygame

If you face a DBUS error, install python3-sdl2 and python-pygame-sdl2, then run katrain again.

sudo apt install python3-sdl2 python-pygame-sdl2

katrain

If you face an exception (TypeError: "'NoneType' object is not subscriptable") when closing the window, modify ~/.local/lib/python3.6/site-packages/katrain/__main__.py around lines 821 and 855.

# lines 821~822

Window._left = win_left # was Window.left

Window._top = win_top # was Window.top

# lines 855~856

self.gui._config["ui_state"]["top"] = Window._top # was Window.top

self.gui._config["ui_state"]["left"] = Window._left # was Window.left

[TIP] For a proper GUI, replace kivy with kivy-jetson

pip3 uninstall kivy

pip3 install kivy-jetson

For comparison

On a machine with an i7-8700, 32GB RAM, and Intel UHD Graphics 630

KataGo v1.10.0 from https://github.com/lightvector/KataGo/releases

katago-v1.10.0-eigenavx2-windows-x64

Ordered summary of results:

numSearchThreads = 5: 10 / 10 positions, visits/s = 247.93 nnEvals/s = 242.92 nnBatches/s = 181.23 avgBatchSize = 1.34 (3.4 secs) (EloDiff baseline)

numSearchThreads = 6: 10 / 10 positions, visits/s = 300.14 nnEvals/s = 291.67 nnBatches/s = 228.46 avgBatchSize = 1.28 (2.8 secs) (EloDiff +67)

numSearchThreads = 8: 10 / 10 positions, visits/s = 383.77 nnEvals/s = 378.47 nnBatches/s = 334.36 avgBatchSize = 1.13 (2.3 secs) (EloDiff +150)

numSearchThreads = 10: 10 / 10 positions, visits/s = 384.28 nnEvals/s = 383.42 nnBatches/s = 345.85 avgBatchSize = 1.11 (2.3 secs) (EloDiff +142)

numSearchThreads = 12: 10 / 10 positions, visits/s = 357.28 nnEvals/s = 355.71 nnBatches/s = 319.98 avgBatchSize = 1.11 (2.5 secs) (EloDiff +104)

numSearchThreads = 20: 10 / 10 positions, visits/s = 296.14 nnEvals/s = 297.34 nnBatches/s = 268.02 avgBatchSize = 1.11 (3.3 secs) (EloDiff -11)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.

Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).

So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:

numSearchThreads = 5: (baseline)

numSearchThreads = 6: +67 Elo

numSearchThreads = 8: +150 Elo (recommended)

numSearchThreads = 10: +142 Elo

numSearchThreads = 12: +104 Elo

numSearchThreads = 20: -11 Elo

katago-v1.10.0-eigen-windows-x64

Ordered summary of results:

numSearchThreads = 5: 10 / 10 positions, visits/s = 174.02 nnEvals/s = 169.67 nnBatches/s = 129.07 avgBatchSize = 1.31 (4.8 secs) (EloDiff baseline)

numSearchThreads = 8: 10 / 10 positions, visits/s = 152.93 nnEvals/s = 151.52 nnBatches/s = 143.43 avgBatchSize = 1.06 (5.7 secs) (EloDiff -70)

numSearchThreads = 10: 10 / 10 positions, visits/s = 169.49 nnEvals/s = 169.49 nnBatches/s = 153.11 avgBatchSize = 1.11 (5.3 secs) (EloDiff -44)

numSearchThreads = 12: 10 / 10 positions, visits/s = 223.09 nnEvals/s = 222.11 nnBatches/s = 195.39 avgBatchSize = 1.14 (4.1 secs) (EloDiff +51)

numSearchThreads = 16: 10 / 10 positions, visits/s = 139.48 nnEvals/s = 139.77 nnBatches/s = 120.83 avgBatchSize = 1.16 (6.8 secs) (EloDiff -165)

numSearchThreads = 20: 10 / 10 positions, visits/s = 101.38 nnEvals/s = 101.38 nnBatches/s = 79.26 avgBatchSize = 1.28 (9.8 secs) (EloDiff -327)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.

Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).

So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:

numSearchThreads = 5: (baseline)

numSearchThreads = 8: -70 Elo

numSearchThreads = 10: -44 Elo

numSearchThreads = 12: +51 Elo (recommended)

numSearchThreads = 16: -165 Elo

numSearchThreads = 20: -327 Elo

katago-v1.10.0-opencl-windows-x64

Ordered summary of results:

numSearchThreads = 5: 10 / 10 positions, visits/s = 411.34 nnEvals/s = 370.10 nnBatches/s = 148.52 avgBatchSize = 2.49 (19.5 secs) (EloDiff baseline)

numSearchThreads = 10: 10 / 10 positions, visits/s = 457.81 nnEvals/s = 415.14 nnBatches/s = 83.70 avgBatchSize = 4.96 (17.7 secs) (EloDiff +20)

numSearchThreads = 12: 10 / 10 positions, visits/s = 574.24 nnEvals/s = 525.45 nnBatches/s = 88.72 avgBatchSize = 5.92 (14.1 secs) (EloDiff +100)

numSearchThreads = 16: 10 / 10 positions, visits/s = 584.77 nnEvals/s = 540.79 nnBatches/s = 68.52 avgBatchSize = 7.89 (13.9 secs) (EloDiff +94)

numSearchThreads = 20: 10 / 10 positions, visits/s = 582.89 nnEvals/s = 536.19 nnBatches/s = 54.45 avgBatchSize = 9.85 (14.0 secs) (EloDiff +79)

numSearchThreads = 24: 10 / 10 positions, visits/s = 626.24 nnEvals/s = 577.01 nnBatches/s = 48.77 avgBatchSize = 11.83 (13.1 secs) (EloDiff +95)

numSearchThreads = 32: 10 / 10 positions, visits/s = 655.88 nnEvals/s = 613.73 nnBatches/s = 37.17 avgBatchSize = 16.51 (12.7 secs) (EloDiff +89)

Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.

Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).

So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:

numSearchThreads = 5: (baseline)

numSearchThreads = 10: +20 Elo

numSearchThreads = 12: +100 Elo (recommended)

numSearchThreads = 16: +94 Elo

numSearchThreads = 20: +79 Elo

numSearchThreads = 24: +95 Elo

numSearchThreads = 32: +89 Elo
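To put the runs side by side, the visits/s at each run's recommended thread count can be ranked and expressed relative to the Jetson CUDA build. This is a small sketch using only figures copied from the logs above; the labels and function names are chosen here for illustration.

```python
# visits/s at each run's recommended thread count, copied from the logs above
RESULTS = {
    "Jetson AGX Xavier, CUDA (40 threads)": 1682.52,
    "Jetson AGX Xavier, Eigen (4 threads)": 122.75,
    "i7-8700, eigenavx2 (8 threads)": 383.77,
    "i7-8700, eigen (12 threads)": 223.09,
    "i7-8700, opencl (12 threads)": 574.24,
}


def ranked(results):
    """Runs sorted from fastest to slowest by visits/s."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


def speedup_of(results, reference):
    """How many times faster the reference run is than each run."""
    ref = results[reference]
    return {name: ref / visits for name, visits in results.items()}


order = ranked(RESULTS)
speedup = speedup_of(RESULTS, "Jetson AGX Xavier, CUDA (40 threads)")

if __name__ == "__main__":
    for name, visits in order:
        print(f"{name}: {visits:.2f} visits/s, CUDA build is {speedup[name]:.2f}x")
```

All runs use the same small test model, so the ratios indicate relative throughput on that model only.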
