Kuzco from Sam is a pretty interesting project for running distributed inference. This guide compiles the things I've been sharing to help people run their nodes.
Here's how to run Kuzco in Docker, a WSL2 image to use when running it on Windows, and some interesting tips and observations.
The Dockerfile and runner source code can be found in this repo.
WSL2 one-click image needs a little more prep, I'll post on Twitter when they're up.
This guide has only really been tested on Ubuntu 20 and 22, and on Nvidia GPUs. A100s are a known issue due to problems with Ollama and the .so driver libraries.
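If you're not sure which release you're on, here's a quick check (nothing Kuzco-specific, it just reads /etc/os-release):
# Print the Ubuntu release this machine is running
. /etc/os-release && echo "$NAME $VERSION_ID"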
Make sure you have the basics and your system is updated.
sudo apt update && sudo apt -y install build-essential curl ca-certificates nvtop
Check that you have an Nvidia GPU enabled and working on the base system.
sudo nvidia-smi
You should see something like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:04:00.0 Off | Off |
| 0% 28C P2 68W / 450W | 4950MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| |
+---------------------------------------------------------------------------------------+
Follow this guide for better instructions if you run into issues.
Next, install Docker:
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Test with this:
sudo docker run hello-world
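Optionally, if you'd rather not prefix every docker command with sudo, the usual post-install step is to add your user to the docker group (standard Docker setup, not Kuzco-specific; log out and back in, or use newgrp, for it to take effect):
# Let the current user talk to the Docker daemon without sudo
sudo usermod -aG docker $USER
newgrp docker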
Follow this guide for more comprehensive instructions.
Next, install the NVIDIA Container Toolkit so Docker containers can use the GPU:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm -it --runtime=nvidia --gpus=all nvidia/cuda:12.3.2-runtime-ubuntu22.04 nvidia-smi
This should give you an output like so:
==========
== CUDA ==
==========
CUDA Version 12.3.2
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
Tue Mar 12 03:01:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:04:00.0 Off | Off |
| 0% 29C P2 68W / 450W | 4950MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
It's a lot better for disk usage if you already have the models on the base machine, since you can mount them into the containers instead of downloading a copy for each one.
Install Kuzco on the base machine like so:
curl -fsSL https://kuzco.xyz/install.sh | sh
kuzco init
# Wait until the models are done downloading
You can also use Ollama or just straight-up pull the models, but this is just simpler to put in a guide :)
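Once kuzco init finishes, you can sanity-check that the models actually landed on disk (assuming the default location of ~/.kuzco/models):
# List the downloaded models and their sizes
du -sh ~/.kuzco/models/*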
Let's name our container kuzco. Better to run these with sudo -i to save some typing.
Option 1:
docker run --rm --runtime=nvidia --gpus all --name kuzco -d hrishioa/kuzco
OR
If you already have the models downloaded, it's much better to share the models across containers (adjust the host path below to wherever your .kuzco/models directory actually lives). Option 2:
docker run --rm --runtime=nvidia --gpus=all -v /home/samheutmaker/.kuzco/models:/home/samheutmaker/.kuzco/models --name kuzco -d hrishioa/kuzco
Verify that the container is running with docker ps.
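You can also tail the container output to make sure nothing is obviously broken (assuming you kept the name kuzco):
# Follow the worker logs; Ctrl+C to stop
sudo docker logs -f kuzco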
Don't start the worker yet; just log in and create (or register) your worker.
docker exec -it kuzco kuzco init
There's also a Kuzco task manager (run.ts) included in the image that monitors for idle messages from Kuzco and restarts the worker if too many heartbeats show up with no inference. The runner also lets you run containers headlessly and passes the logs through to docker logs:
docker exec -itd kuzco bash -c "cd /kuzco_runner && /root/.bun/bin/bun run_kuzco.ts"
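For a rough sense of what the task manager does, here's a minimal shell sketch of the same idea. This is not the actual run.ts logic, and the log strings ("heartbeat", "inference") are hypothetical placeholders, so treat it purely as an illustration of the restart-on-idle pattern:
# Hypothetical sketch: restart the worker after too many heartbeats with no inference in between
MAX_IDLE_HEARTBEATS=10
idle=0
sudo docker logs -f kuzco 2>&1 | while read -r line; do
  case "$line" in
    *inference*) idle=0 ;;             # any inference resets the idle counter
    *heartbeat*) idle=$((idle + 1)) ;; # count consecutive idle heartbeats
  esac
  if [ "$idle" -ge "$MAX_IDLE_HEARTBEATS" ]; then
    echo "Worker looks idle, restarting..."
    sudo docker restart kuzco
    idle=0
  fi
done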
WSL2 is far more efficient when running multiple workers, even with the abstraction layer. Reserved Memory (GPU and CPU) stays constant, which is a pretty big bonus.
There's a prebuilt image I'll upload soon (I'll let you know on Twitter); that should make one-click download+install a lot easier, with no setup needed. Once you have an image you've made (or the prebuilt one), you can skip straight to the import step below. For now, you can clone your existing WSL2 setup.
Find the name of your distro (run these in PowerShell, not inside WSL2):
wsl -l
Now image it somewhere:
wsl --export <YourDistributionName> KuzcoBuntu
and then import your image (or the prebuilt image) like so:
wsl --import KuzcoBuntu .\KuzcoBuntu-Drive KuzcoBuntu
wsl --distribution KuzcoBuntu
kuzco init
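Since the whole point of WSL2 here is running multiple workers, one approach (just a sketch; the distro and folder names below are whatever you want them to be) is to import the same exported image again under a different name and init a worker in each copy:
wsl --import KuzcoBuntu2 .\KuzcoBuntu-Drive2 KuzcoBuntu
wsl --distribution KuzcoBuntu2
kuzco init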
That's pretty much it really.
nvtop is pretty useful on Linux systems to check for thrashing. When booting up a worker, you want to pay close attention to these parts of the logs:
time=2024-03-13T09:23:35.895Z level=INFO source=images.go:710 msg="total blobs: 15"
time=2024-03-13T09:23:35.896Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-13T09:23:35.897Z level=INFO source=routes.go:1019 msg="Listening on [::]:14444 (version 0.1.27)"
time=2024-03-13T09:23:35.897Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-13T09:23:39.881Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v6 cpu_avx cuda_v11 cpu_avx2 rocm_v5 cpu]"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01]"
time=2024-03-13T09:23:39.888Z level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-13T09:23:39.888Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-13T09:23:39.893Z level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.9"
These should tell you (at the start) whether the right GPUs have been detected, and give you some early indications of driver issues.
Then wait for inference, and check the Ollama (actually llama.cpp) logs to see whether the GPUs are really being used:
time=2024-03-13T09:23:56.791Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama3804676325/cuda_v11/libext_server.so"
time=2024-03-13T09:23:56.791Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
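If your worker is running inside Docker, a quick way to pull these lines out of the container logs (assuming the container is named kuzco, as above):
# Look for GPU / CUDA detection lines in the worker logs
sudo docker logs kuzco 2>&1 | grep -Ei "cuda|gpu|libnvidia"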
CUDA 11 is a good place to start if you have driver issues; CUDA 12 is still a little unstable for use with Ollama, but YMMV.
Checking htop for signs of CPU thrashing is a good way to tell when you're not actually using the GPUs.
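Alongside htop and nvtop, a simple poll of GPU utilization is another quick tell; if it sits near 0% while inference is supposedly running, the work is probably landing on the CPU:
# Print GPU utilization and memory use every 2 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 2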