Running Kuzco in Docker and WSL2
PSA and quickstart

Kuzco from Sam is a pretty interesting project for running distributed inference. This guide compiles everything I've put together to help people run their nodes.

Here's how to run Kuzco in Docker, a WSL2 image to use when running it on Windows, and some interesting tips and observations.

The Dockerfile and runner source code can be found in this repo.

The WSL2 one-click image needs a little more prep; I'll post on Twitter when it's up.

§Linux: Docker

This guide has only really been tested on Ubuntu 20.04 and 22.04, and on Nvidia GPUs. A100s are a known issue due to problems with Ollama and the driver .so libraries.

§Step 0: Set up your system

Make sure you have the basics and your system is updated.

sudo apt update && sudo apt -y install build-essential curl ca-certificates nvtop

Check that you have an Nvidia GPU enabled and working on the base system.

sudo nvidia-smi

You should see something like this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:04:00.0 Off |                  Off |
|  0%   28C    P2              68W / 450W |   4950MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
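
If nvidia-smi is missing or errors out here, you likely need the proprietary driver first. One way to get it on Ubuntu (ubuntu-drivers picks the recommended version for you):

sudo apt -y install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot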

§Step 1: Install Docker Engine

Follow this guide for more detailed instructions if you run into issues.

Run this:

sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Test with this:

sudo docker run hello-world
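
Optionally, if you'd rather not prefix every docker command with sudo, the usual post-install step applies (log out and back in for it to take effect):

sudo usermod -aG docker $USER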

§Step 2: Install the Nvidia container runtime

Follow this guide for more comprehensive instructions.

Or you can do this:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

§Step 3: Test whether docker can see your GPUs

sudo docker run --rm -it --runtime=nvidia --gpus=all nvidia/cuda:12.3.2-runtime-ubuntu22.04 nvidia-smi

This should give you an output like so:

==========
== CUDA ==
==========

CUDA Version 12.3.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Tue Mar 12 03:01:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:04:00.0 Off |                  Off |
|  0%   29C    P2              68W / 450W |   4950MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
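
If the container can't see the GPU, first check that the nvidia runtime was actually registered with Docker:

sudo docker info | grep -i runtimes
# Should list "nvidia" among the runtimes; if not, re-run the
# nvidia-ctk configure and docker restart commands from Step 2.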

§Step 3.5 (optional): Install Kuzco on the base machine

It's a lot better for disk usage if you already have the models on the base machine, so you can mount them into the containers.

Install Kuzco on the base machine like so:

curl -fsSL https://kuzco.xyz/install.sh | sh
kuzco init
# Wait until the models are done downloading

You can also use Ollama or straight-up pull the models yourself, but this is the simplest thing to put in a guide :)
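
Before mounting the models anywhere, it's worth confirming where they landed and how big they are. A small check, assuming the default ~/.kuzco/models location:

du -sh ~/.kuzco/models
ls -lh ~/.kuzco/models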

§Step 4: Pull and start the base image

Let's name our container kuzco. It's easier to run these from a root shell (sudo -i) to save some typing.

Option 1:

docker run --rm --runtime=nvidia --gpus all --name kuzco -d hrishioa/kuzco

OR

If you already have the models downloaded, it's much better to share them across containers by bind-mounting the models directory (adjust the host-side path to match where the models live on your machine). Option 2:

docker run --rm --runtime=nvidia --gpus=all -v /home/samheutmaker/.kuzco/models:/home/samheutmaker/.kuzco/models --name kuzco -d hrishioa/kuzco

Verify that the container is running with docker ps.
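
For example:

docker ps --filter name=kuzco
# Should show the kuzco container with a STATUS of "Up ..."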

§Step 5: Init and run kuzco

Don't start the worker yet; just log in and create (or register) your worker.

docker exec -it kuzco kuzco init

There's also a Kuzco task manager (run_kuzco.ts) included in the image, which will monitor for idle messages from Kuzco and restart if too many heartbeats show up with no inference. The runner should also let you run containers headlessly and pass the logs through to docker logs:

docker exec -itd kuzco bash -c "cd /kuzco_runner && /root/.bun/bin/bun run_kuzco.ts"
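
Once the runner is going, you can follow its output from the host:

docker logs -f kuzco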

§Windows: WSL2 Image

WSL2 is far more efficient when running multiple workers, even with the abstraction layer: reserved memory (GPU and CPU) stays constant, which is a pretty big bonus.
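
If you do run several workers, it's worth capping what the WSL2 VM can take from the host. A minimal %UserProfile%\.wslconfig sketch (the numbers are placeholders, tune them for your machine, then run wsl --shutdown to apply):

[wsl2]
memory=16GB
processors=8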

There's a prebuilt image I'll upload soon (I'll let you know on Twitter); that should make a one-click download and install a lot easier, with no setup needed. Once you have an image you've made (or the prebuilt one), you can skip to the import step below. For now, you can clone your existing WSL2 setup.

Find the name of your distro (run these in PowerShell, not inside WSL2):

wsl -l

Now export it as an image:

wsl --export <YourDistributionName> KuzcoBuntu

and then import your image (or the prebuilt image) like so:

wsl --import KuzcoBuntu .\KuzcoBuntu-Drive KuzcoBuntu
wsl --distribution KuzcoBuntu
kuzco init

That's pretty much it really.
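
For multiple workers, the same exported image can be imported under different names, each with its own virtual drive:

wsl --import KuzcoBuntu2 .\KuzcoBuntu-Drive2 KuzcoBuntu
wsl --import KuzcoBuntu3 .\KuzcoBuntu-Drive3 KuzcoBuntu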

§Tips and Observations

nvtop is pretty useful on Linux systems to check for thrashing. When booting up a worker, you want to pay close attention to these parts of the logs:

time=2024-03-13T09:23:35.895Z level=INFO source=images.go:710 msg="total blobs: 15"
time=2024-03-13T09:23:35.896Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-13T09:23:35.897Z level=INFO source=routes.go:1019 msg="Listening on [::]:14444 (version 0.1.27)"
time=2024-03-13T09:23:35.897Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-13T09:23:39.881Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v6 cpu_avx cuda_v11 cpu_avx2 rocm_v5 cpu]"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-13T09:23:39.881Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01]"
time=2024-03-13T09:23:39.888Z level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-13T09:23:39.888Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-13T09:23:39.893Z level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.9"

These tell you (at startup) whether the right GPUs have been detected, and give some early indications of driver issues.
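
If your worker runs in Docker, a quick way to pull just these lines back out (assuming the container is named kuzco and the runner passes Ollama's output through to docker logs, as above):

docker logs kuzco 2>&1 | grep -Ei "gpu|cuda|avx"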

Then wait for inference, and check the Ollama (actually llama.cpp) logs to see whether the GPUs are really being used.

time=2024-03-13T09:23:56.791Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama3804676325/cuda_v11/libext_server.so"
time=2024-03-13T09:23:56.791Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
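
The clearest signal is llama.cpp's offload line when a model loads: zero offloaded layers means you're running on CPU.

docker logs kuzco 2>&1 | grep -i offloaded
# You want to see something like: llm_load_tensors: offloaded 33/33 layers to GPU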

CUDA 11 is a good place to start if you have driver issues; CUDA 12 is still a little unstable with Ollama, but YMMV.
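
Stock Ollama lets you bypass library autodetection with the OLLAMA_LLM_LIBRARY environment variable; whether Kuzco's bundled Ollama honors it is an assumption on my part, but it's cheap to try:

# Assumption: Kuzco's embedded Ollama respects the same env vars as stock Ollama
docker run --rm --runtime=nvidia --gpus all -e OLLAMA_LLM_LIBRARY=cuda_v11 --name kuzco -d hrishioa/kuzco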

Watching htop for signs of CPU thrashing is a good way to tell when you're not actually using the GPUs.

Hrishi Olickel
12 Mar 2024