Run Ollama with NVIDIA GPU Acceleration on Proxmox LXC (2026)

Tutorial · ~17 min read
Intermediate · ~30 min

Prerequisites

  • Proxmox VE 8.1 or later (IOMMU only needs to be enabled if you follow the optional multi-GPU VFIO section)
  • An NVIDIA GPU physically installed in the host

Tools

  • SSH terminal
  • Web browser

Software

  • Debian 13
  • Ollama 0.22.0
  • Proxmox VE 9.1
  • NVIDIA driver 595.58.03

Running large language models locally is one of the best things you can do with a homelab. No API keys, no usage fees, no data leaving your network. The bottleneck is speed — running inference on a CPU is painfully slow, but a mid-range NVIDIA GPU turns a 30-second response into a 2-second one.

This guide walks through the full stack: installing the NVIDIA driver on a Proxmox host, passing the GPU through to an unprivileged LXC container using Proxmox VE's modern dev0: device passthrough syntax, and installing Ollama inside the container for GPU-accelerated inference. By the end, you will have a dedicated Ollama LXC that any other service on your network can call for fast, local AI.

We are using the dev0: syntax introduced in Proxmox VE 8.1, which replaces the old lxc.cgroup2.devices.allow and lxc.mount.entry approach that you will see in older guides. The new syntax is cleaner, handles both cgroup permissions and device bind mounts in a single config line, and works with unprivileged containers out of the box.

What You Will End Up With

  • NVIDIA driver 595.58.03 installed on the Proxmox host with DKMS (survives kernel updates)
  • A dedicated Debian 13 LXC with full GPU access via dev0: passthrough
  • Ollama serving models on the local network with GPU acceleration
  • A test inference confirming everything works end-to-end

For this guide, the Ollama LXC takes CT ID 105 and IP 10.1.20.105. The GPU is an NVIDIA GeForce RTX 3060 12GB. Substitute your own values.

NOTE

About 25 to 30 minutes of hands-on time. The NVIDIA driver compilation is the longest wait.

Step 1: Identify Your GPU on the Proxmox Host

On the Proxmox host, check which NVIDIA GPU is installed and what bus address it is on:

List NVIDIA PCI devices
lspci | grep -i nvidia

You will see output like:

Output:
09:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate]
09:00.1 Audio device: NVIDIA Corporation GA106 High Definition Audio Controller

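Note the bus address, 09:00 in this example. Modern NVIDIA cards are multi-function devices: the VGA controller at .0, the audio device at .1, and on some Turing and later cards, USB and serial bus controllers at .2 and .3. All of these functions belong to the same physical GPU. The LXC setup in this guide passes device nodes under /dev rather than PCI functions, so the bus address matters mainly for telling cards apart and for the optional VFIO rule in Step 2, which has to match every function of the card it reserves.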

If you have multiple NVIDIA GPUs and want to pass only one to the LXC (keeping others for VM passthrough or other use), note which bus address belongs to which card.
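On a host with several cards, it can help to reduce the lspci output to one line per physical GPU. The following is a minimal sketch, not part of the original setup; the function name nvidia_slots is made up for illustration, and it assumes the default lspci output format shown above.

```shell
# Print the unique bus:device address (function suffix stripped) of every
# NVIDIA line in lspci-style input read from stdin.
# nvidia_slots is an illustrative name, not an existing tool.
nvidia_slots() {
    grep -i nvidia | awk '{ sub(/\.[0-9a-f]+$/, "", $1); print $1 }' | sort -u
}

# Usage on the Proxmox host:
#   lspci | nvidia_slots
```

With the example output above, this prints a single line, 09:00, because both functions share that address.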

Step 2: Blacklist Nouveau and Reboot

The open-source nouveau driver will try to claim the GPU before the proprietary NVIDIA driver can. We need to blacklist it. We also blacklist nvidiafb since we do not need framebuffer support on a headless server.

Create the blacklist file
nano /etc/modprobe.d/blacklist-nvidia.conf

Paste the following, then save with Ctrl+X, Y, Enter:

/etc/modprobe.d/blacklist-nvidia.conf
blacklist nouveau
blacklist nvidiafb

If you have a single NVIDIA GPU and want it available for LXC passthrough, skip the next section and jump straight to the initramfs rebuild below.

Multi-GPU: Selective VFIO Binding (Optional)

If you have multiple NVIDIA GPUs and want one dedicated to VM passthrough (bound to vfio-pci) while the other uses the NVIDIA driver for LXC passthrough, you need a udev rule that binds by bus address rather than by vendor ID.

First, make sure the vfio modules load at boot:

Create the VFIO modules file
nano /etc/modules-load.d/vfio.conf

Paste the following, then save with Ctrl+X, Y, Enter:

/etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci

Now create a udev rule that only binds the VM slot to vfio-pci. Replace 0000:41:00 with the bus address of the GPU you want reserved for VMs:

Create the selective VFIO udev rule
nano /etc/udev/rules.d/10-vfio-nvidia.rules

Paste the following, then save with Ctrl+X, Y, Enter:

/etc/udev/rules.d/10-vfio-nvidia.rules
SUBSYSTEM=="pci", KERNEL=="0000:41:00.*", ATTR{driver_override}="vfio-pci"

Flag | What it does
SUBSYSTEM=="pci" | Only matches PCI devices. Keep.
KERNEL=="0000:41:00.*" | Matches all functions at bus address 41:00 (VGA, audio, USB, etc.). Replace with the bus address of the GPU you want bound to vfio-pci for VM passthrough.
ATTR{driver_override}="vfio-pci" | Forces vfio-pci as the driver. Keep.

The wildcard .* at the end catches all functions on that bus address. The GPU at the other bus address (our LXC GPU) is not matched, so the NVIDIA driver claims it instead.
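To avoid typos when adapting the rule, it can be generated from the bus address. This is a sketch under two assumptions: the GPU sits in PCI domain 0000, as in the rule above, and the helper name vfio_rule is hypothetical.

```shell
# Print a udev rule that forces vfio-pci onto every function of one PCI slot.
# $1 = bus:device address without the function suffix, for example 41:00.
# Assumes PCI domain 0000; vfio_rule is an illustrative name.
vfio_rule() {
    slot=$1
    case "$slot" in
        [0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]) ;;
        *)
            echo "expected bus:device such as 41:00, got: $slot" >&2
            return 1
            ;;
    esac
    printf 'SUBSYSTEM=="pci", KERNEL=="0000:%s.*", ATTR{driver_override}="vfio-pci"\n' "$slot"
}

# Usage on the Proxmox host:
#   vfio_rule 41:00 > /etc/udev/rules.d/10-vfio-nvidia.rules
```

The format check rejects an address that still carries a function suffix such as 41:00.0, which would otherwise produce a rule that matches nothing.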

Rebuild the Initramfs and Reboot

WARNING

This step is critical. The blacklist (and udev rules, if you created them) are baked into the initramfs at build time. If you skip this, the old initramfs loads at boot with nouveau still active, and the NVIDIA driver installer in the next step will fail. This is the single most common failure in GPU passthrough setups.

Rebuild the initramfs
update-initramfs -u -k $(uname -r)

Now reboot so the kernel comes up with nouveau blocked:

Reboot the host
reboot

After the host comes back up, SSH in and verify nouveau is no longer loaded:

Verify nouveau is not loaded
lsmod | grep nouveau

No output means nouveau is successfully blocked. If you still see it listed, double-check that /etc/modprobe.d/blacklist-nvidia.conf exists and contains the blacklist nouveau line, rebuild the initramfs again, and reboot.
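If you want the same check in a script, an exact match on the module name is a little stricter than grep, which also matches any line that merely contains the string. A minimal sketch; module_listed is an illustrative name:

```shell
# Return 0 if the named module appears in the first column of lsmod-style
# input read from stdin, 1 otherwise.
# module_listed is an illustrative name, not an existing tool.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on the Proxmox host:
#   if lsmod | module_listed nouveau; then
#       echo "nouveau is still loaded: recheck the blacklist and rebuild the initramfs"
#   else
#       echo "nouveau is blocked"
#   fi
```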

Step 3: Install the NVIDIA Driver on the Host

The NVIDIA driver needs kernel headers to compile its modules. Install them along with build-essential for the compiler toolchain:

Install kernel headers and build tools
apt-get install -y proxmox-headers-$(uname -r) build-essential dkms

Flag | What it does
proxmox-headers-$(uname -r) | Kernel headers matching your running PVE kernel. The $(uname -r) automatically fills in the correct version. Keep.
build-essential | C compiler and make. Needed for NVIDIA module compilation. Keep.
dkms | Dynamic Kernel Module Support. Automatically recompiles the NVIDIA modules when the kernel is updated. Keep.

Download the NVIDIA driver. We use the 595.58.03 production branch — the latest Linux driver as of April 2026:

Download the NVIDIA driver
wget -P /root https://us.download.nvidia.com/XFree86/Linux-x86_64/595.58.03/NVIDIA-Linux-x86_64-595.58.03.run

Make it executable and run the installer:

Make the installer executable
chmod +x /root/NVIDIA-Linux-x86_64-595.58.03.run

Run the NVIDIA installer
bash /root/NVIDIA-Linux-x86_64-595.58.03.run --dkms --silent

Flag | What it does
--dkms | Registers the modules with DKMS so they automatically rebuild on kernel updates. Keep.
--silent | Runs without the ncurses UI. On a headless Proxmox host there is no display server, so the installer's questions about X libraries are irrelevant. Keep.

This takes 2-5 minutes while it compiles the kernel modules. You will see a few warnings about X library paths and 32-bit compatibility libraries — both are harmless on a headless server.

Verify the driver is loaded and sees your GPU:

Verify the NVIDIA driver
nvidia-smi

You should see a table showing your GPU, the driver version (595.58.03), CUDA version, temperature, and memory usage. If you get command not found or No devices were found, the most likely causes are:

  1. Nouveau still loaded — run lsmod | grep nouveau. If it appears, the blacklist did not take effect. Go back to Step 2 and check that /etc/modprobe.d/blacklist-nvidia.conf exists, rebuild the initramfs, and reboot.
  2. GPU bound to vfio-pci — run lspci -nnks <your-bus-address> and check Kernel driver in use. If it says vfio-pci instead of nvidia, your udev rule is matching the wrong GPU. Check the bus address in your rule.
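The second check can be narrowed to just the driver names. This is a small sketch that assumes the usual "Kernel driver in use:" line printed by lspci -k; the function name drivers_in_use is made up for illustration.

```shell
# Print the value of every "Kernel driver in use" line in lspci -k style
# input read from stdin, one driver per PCI function.
# drivers_in_use is an illustrative name, not an existing tool.
drivers_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2 }'
}

# Usage on the Proxmox host (09:00 is the example bus address from Step 1):
#   lspci -nnks 09:00 | drivers_in_use
```

For the GPU that should serve the LXC, the VGA function is expected to report nvidia. A function with no bound driver prints nothing.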

Verify DKMS registered the modules:

Check DKMS status
dkms status

You should see nvidia/595.58.03 listed as installed.

Step 4: Create the Ollama LXC with GPU Passthrough

On the Proxmox host, create the container:

Create the Ollama LXC
pct create 105 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
    --hostname ollama \
    --cores 4 \
    --memory 8192 \
    --swap 2048 \
    --rootfs local-lvm:20 \
    --net0 name=eth0,bridge=vmbr0,ip=10.1.20.105/24,gw=10.1.20.1 \
    --unprivileged 1 \
    --features nesting=1 \
    --onboot 1 \
    --password

Flag | What it does
105 (positional) | Container VMID. Replace with the next free ID on your host.
local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst | OS template. Keep. If missing, run pveam update && pveam download local debian-13-standard_13.1-2_amd64.tar.zst.
--hostname ollama | Container hostname. Keep or rename to taste.
--cores 4 | CPU cores. Keep: Ollama uses CPU for prompt processing even with GPU offload.
--memory 8192 | RAM in MB. Keep: Ollama loads model metadata into system RAM alongside the GPU VRAM.
--swap 2048 | Swap in MB. Keep: safety net for larger models.
--rootfs local-lvm:20 | 20 GB root filesystem. Replace local-lvm if your storage pool is named differently. Models can be large (4-8 GB each), so 20 GB gives room for a few.
--net0 name=eth0,bridge=vmbr0,ip=10.1.20.105/24,gw=10.1.20.1 | Network attachment. Replace bridge, ip, and gw with your environment's values.
--unprivileged 1 | Unprivileged container. Keep: the dev0: syntax works with unprivileged containers.
--features nesting=1 | Enables systemd cgroup management. Keep.
--onboot 1 | Auto-start on host boot. Keep.
--password | Prompts for a root password. Keep.

Now add the GPU device passthrough lines. These use PVE's dev0: syntax to pass the NVIDIA device nodes into the container:

Add GPU passthrough to the container config
nano /etc/pve/lxc/105.conf

Add the following lines at the end of the file, then save with Ctrl+X, Y, Enter:

/etc/pve/lxc/105.conf (append)
dev0: /dev/nvidia0,gid=44
dev1: /dev/nvidiactl,gid=44
dev2: /dev/nvidia-uvm,gid=44
dev3: /dev/nvidia-uvm-tools,gid=44
dev4: /dev/nvidia-caps/nvidia-cap1,gid=44
dev5: /dev/nvidia-caps/nvidia-cap2,gid=44

Line | What it does
/dev/nvidia0 | The primary GPU device. If you have multiple GPUs on the NVIDIA driver, nvidia0 is the first one. Keep.
/dev/nvidiactl | NVIDIA control device, shared across all GPUs. Keep.
/dev/nvidia-uvm | Unified Virtual Memory, required for CUDA compute workloads like LLM inference. Keep.
/dev/nvidia-uvm-tools | UVM tools interface. Keep.
/dev/nvidia-caps/nvidia-cap1, nvidia-cap2 | NVIDIA capability devices. Keep.
gid=44 | Sets the group inside the container to GID 44 (the video group on Debian). This lets the ollama user access the devices without running as root. Keep.
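If your device list differs from the example, for instance because the GPU you want is not the first one, the numbered lines can be generated instead of typed by hand. A minimal sketch, assuming a freshly created container whose config has no devN entries yet; lxc_dev_lines is an illustrative name.

```shell
# Print numbered devN: lines for a list of device paths.
# $1 = GID to apply inside the container, remaining arguments = device paths.
# Numbering starts at dev0, so this assumes the config has no devN entries yet.
# lxc_dev_lines is an illustrative name, not a Proxmox tool.
lxc_dev_lines() {
    gid=$1
    shift
    n=0
    for path in "$@"; do
        printf 'dev%d: %s,gid=%s\n' "$n" "$path" "$gid"
        n=$((n + 1))
    done
}

# Usage on the Proxmox host, reproducing the six lines above:
#   lxc_dev_lines 44 /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm \
#       /dev/nvidia-uvm-tools /dev/nvidia-caps/nvidia-cap1 \
#       /dev/nvidia-caps/nvidia-cap2
```

Review the printed lines before pasting them into /etc/pve/lxc/105.conf.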

Start the container:

Start the container
pct start 105

Verify the GPU devices are visible inside the container:

Check GPU devices inside the container
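pct exec 105 -- sh -c 'ls -la /dev/nvidia*'

You should see all six device nodes listed with the video group. The sh -c wrapper makes the container's shell expand the wildcard. Without it, the host shell expands /dev/nvidia* against the host's /dev first, and ls inside the container may complain about nodes that were never passed through.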

Step 5: Install the NVIDIA Userspace Libraries in the Container

The container shares the host's kernel (and its NVIDIA kernel modules), but it needs its own copy of the userspace libraries — nvidia-smi, libcuda.so, and friends. We use the same .run installer with --no-kernel-module so it only installs the userspace components.

On the Proxmox host, copy the installer into the container:

Copy the NVIDIA installer into the container
pct push 105 /root/NVIDIA-Linux-x86_64-595.58.03.run /root/NVIDIA-Linux-x86_64-595.58.03.run --perms 755

Now enter the container:

Enter the container
pct enter 105

The minimal Debian 13 template ships without a generated locale, which causes noisy Perl warnings on every package operation. Fix that before doing anything else:

Generate the en_US.UTF-8 locale
sed -i 's/^# en_US.UTF-8/en_US.UTF-8/' /etc/locale.gen && locale-gen

Now install the NVIDIA userspace libraries:

Install NVIDIA userspace libraries
bash /root/NVIDIA-Linux-x86_64-595.58.03.run --no-kernel-module --silent

Flag | What it does
--no-kernel-module | Skips kernel module compilation entirely. The container uses the host's kernel modules via the passed-through devices. Keep.
--silent | No UI prompts. Keep.

Verify the driver sees the GPU from inside the container:

Verify GPU access
nvidia-smi

You should see the same GPU table as on the host — same driver version, same GPU model, same VRAM. If you get Failed to initialize NVML: Unknown Error, the device nodes are not correctly passed through. Go back to Step 4 and check the dev0: lines in the container config.

Clean up the installer to save disk space:

Remove the installer
rm /root/NVIDIA-Linux-x86_64-595.58.03.run

TIP

The NVIDIA driver version inside the container must match the host exactly. If you update the host driver later, you need to re-run this step inside the container with the matching new version.
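After a host driver update, a quick comparison shows whether the container has fallen behind. This is a sketch, not part of the original procedure; driver_versions_match is a made-up name, and the usage lines assume CT ID 105 as in this guide.

```shell
# Compare two driver version strings and report the result.
# $1 = host driver version, $2 = container driver version.
# driver_versions_match is an illustrative name, not an existing tool.
driver_versions_match() {
    if [ "$1" = "$2" ]; then
        echo "OK: host and container both run $1"
    else
        echo "MISMATCH: host $1, container $2" >&2
        return 1
    fi
}

# Usage on the Proxmox host:
#   host=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   ct=$(pct exec 105 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_versions_match "$host" "$ct"
```

If the container side prints an NVML error instead of a version, the comparison fails as well, which is the right outcome: the userspace libraries need reinstalling.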

Step 6: Install Ollama

Ollama provides a one-line install script (curl -fsSL https://ollama.com/install.sh | sh), but piping a remote script into a shell means trusting whatever that URL serves at the moment you run it. For a service that will have GPU access and run inference on your network, it is worth doing the install manually. It only takes a few extra commands, and you will understand exactly what is on your system.

Install curl and zstd — both are needed to download and extract the Ollama release archive, and neither is in the minimal Debian template:

Install dependencies
apt update && apt install -y curl zstd

Download the official Ollama release tarball and extract it into /usr. This places the ollama binary at /usr/bin/ollama and its shared libraries under /usr/lib/ollama:

Download and extract Ollama
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst | tar x --zstd -C /usr

Verify the binary is in place:

Verify Ollama binary
ollama --version

Now create a dedicated system user to run the Ollama service. Running it as its own user rather than root is standard practice — it limits what the process can access if it is ever compromised:

Create the ollama system user
useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama

Flag | What it does
-r | System account (no aging, low UID). Keep.
-s /bin/false | No login shell; this user only runs the service. Keep.
-U | Creates a matching ollama group. Keep.
-m -d /usr/share/ollama | Creates the home directory at /usr/share/ollama, where Ollama stores downloaded models. Keep.

Add the ollama user to the video group so it can access the GPU devices we passed through in Step 4:

Grant GPU access to the ollama user
usermod -a -G video ollama

Create the systemd service file:

Create the Ollama service file
nano /etc/systemd/system/ollama.service

Paste the following, then save with Ctrl+X, Y, Enter:

/etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
 
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
 
[Install]
WantedBy=multi-user.target

Enable and start the service:

Enable and start Ollama
systemctl daemon-reload && systemctl enable ollama && systemctl start ollama

Verify Ollama is running:

Check Ollama service status
systemctl status ollama --no-pager

You should see Active: active (running). Check the logs to confirm GPU detection:

Check GPU detection in logs
journalctl -u ollama --no-pager | grep -i "inference compute"

You should see a line mentioning your GPU model and VRAM — something like NVIDIA GeForce RTX 3060 with 12.0 GiB. If you see no GPU line, restart the service with systemctl restart ollama and check again.

Step 7: Configure Ollama for Network Access

By default, Ollama only listens on localhost:11434. If you want other containers or machines on your network to use this Ollama instance (for example, Paperless-GPT running on a different LXC), you need to change the listen address.

Create a systemd override
mkdir -p /etc/systemd/system/ollama.service.d

Open the override file
nano /etc/systemd/system/ollama.service.d/override.conf

Paste the following, then save with Ctrl+X, Y, Enter:

/etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Reload systemd and restart Ollama
systemctl daemon-reload && systemctl restart ollama

Ollama now listens on all interfaces. Any machine on your LAN can reach it at http://10.1.20.105:11434.

WARNING

There is no authentication on the Ollama API. Anyone who can reach port 11434 can use your GPU for inference. On a home network behind a firewall this is fine. On a shared or public network, restrict access with firewall rules.

Step 8: Pull a Model and Test Inference

Pull a small model first to verify everything works before committing to a multi-gigabyte download:

Pull a test model
ollama pull llama3.2:1b

This downloads the 1.3 GB Llama 3.2 1B model. Once it finishes, test inference:

Test inference
ollama run llama3.2:1b "What is Proxmox? Reply in two sentences."

You should get a coherent response in a few seconds. If this takes 30+ seconds, the GPU is not being used — see the verification step below.
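Other services will talk to Ollama over HTTP rather than through the ollama CLI, so it is worth running the same test against the API. The sketch below uses the /api/generate endpoint with streaming turned off; the two function names are made up for illustration, and the payload builder assumes the model name and prompt contain no double quotes or backslashes.

```shell
# Build a non-streaming /api/generate request body.
# $1 = model name, $2 = prompt.
# Assumes neither argument contains double quotes or backslashes,
# because no JSON escaping is performed here.
generate_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# POST the payload to an Ollama server and print the raw JSON reply.
# $1 = base URL, $2 = model name, $3 = prompt.
# ollama_generate is an illustrative wrapper, not part of Ollama.
ollama_generate() {
    curl -s "$1/api/generate" -d "$(generate_payload "$2" "$3")"
}

# Usage inside the container:
#   ollama_generate http://localhost:11434 llama3.2:1b "What is Proxmox? Reply in two sentences."
```

The reply is a single JSON object, and the generated text is in its response field.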

For actual workloads like document classification or coding assistance, pull a larger model:

Pull an 8B model for production use
ollama pull llama3.1:8b

This downloads about 4.9 GB. The 8B parameter model fits entirely in the 12 GB of VRAM on an RTX 3060, with room to spare for context.

Step 9: Verify GPU Acceleration

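After running any prompt, check that the model is loaded into VRAM (not system RAM). Do this within a few minutes of the prompt: by default Ollama unloads an idle model after five minutes, and an unloaded model no longer appears in the output.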

Check where the model is loaded
curl -s http://localhost:11434/api/ps | python3 -m json.tool

Look for the size_vram field. If it is equal to (or very close to) the size field, the entire model is in GPU VRAM. If size_vram is 0, the model is running on CPU only.
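The comparison between the two fields can be turned into a single percentage. This is a rough sketch using integer shell arithmetic; vram_percent is a made-up name, and the usage lines assume jq is installed (apt install -y jq), which the guide does not otherwise require.

```shell
# Print the share of a loaded model that sits in VRAM as a whole percentage.
# $1 = size, $2 = size_vram, both in bytes as reported by /api/ps.
# vram_percent is an illustrative name, not part of Ollama.
vram_percent() {
    size=$1
    size_vram=$2
    if [ "$size" -le 0 ]; then
        echo "size must be a positive number of bytes" >&2
        return 1
    fi
    echo $(( size_vram * 100 / size ))
}

# Usage inside the container (assumes jq is installed):
#   curl -s http://localhost:11434/api/ps \
#     | jq -r '.models[] | "\(.size) \(.size_vram)"' \
#     | while read -r size size_vram; do vram_percent "$size" "$size_vram"; done
```

A result of 100 means the model is fully offloaded to the GPU, 0 means CPU only, and anything in between means the model is split between VRAM and system RAM.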

You can also watch GPU utilization during inference in a second terminal:

Watch GPU utilization
watch -n 1 nvidia-smi

During inference, you should see GPU-Util spike to 50-100% and Memory-Usage increase by the model size.

If the model is not using the GPU:

  1. Check that nvidia-smi works inside the container — if it does not, the device passthrough is broken.
  2. Check Ollama's startup log — run journalctl -u ollama --no-pager | head -30 and look for lines about GPU discovery. You should see your GPU model listed.
  3. Restart Ollama — sometimes the GPU is not detected on first boot if the NVIDIA driver loaded after Ollama started. Run systemctl restart ollama and try again.

Standard Homelab Wiring (Optional Bonus)

These steps wire the Ollama LXC into the rest of the homelab stack: local DNS, HTTPS via Caddy, backups, and monitoring. The Caddy reverse proxy adds TLS but no authentication, so it does not make the unauthenticated Ollama API any more secure. I set it up purely for uniformity: every service in the homelab gets a clean https://name.hake.rodeo URL, and I prefer consistency over having one oddball http://ip:port endpoint. Skip this section entirely if you do not run these services, or cherry-pick the parts that apply.

Pi-hole Local DNS Record

Open your Pi-hole admin UI and click Local DNS Records in the left sidebar.

  • Domain: ollama.hake.rodeo
  • IP Address: 10.1.20.101 (your Caddy LXC, not the Ollama LXC)

Click Add. Verify:

Verify DNS resolution
dig +short ollama.hake.rodeo @10.1.20.100

You should see 10.1.20.101.

Caddy Reverse Proxy

On the Proxmox host, enter the Caddy container:

Enter the Caddy container
pct enter 101

Open the Caddyfile
nano /etc/caddy/Caddyfile

Add the following block at the end of the file:

/etc/caddy/Caddyfile (append)
ollama.hake.rodeo {
    tls {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
    }
    reverse_proxy 10.1.20.105:11434
}

Save with Ctrl+X, Y, Enter, then reload:

Reload Caddy
systemctl reload caddy

Verify with a quick API call:

Test the reverse proxy
curl -s https://ollama.hake.rodeo/api/tags

You should see a JSON response listing your pulled models. Other services on your network can now reach Ollama at https://ollama.hake.rodeo instead of http://10.1.20.105:11434 — both work, use whichever you prefer.

PBS Backup Job

In the Proxmox web UI, go to Datacenter > Backup, double-click your existing backup job, and tick 105 (ollama) in the VM selection list. Save.

Ollama model files are large — a single 8B model is around 5 GB, and you will probably accumulate several. These files are freely re-downloadable with ollama pull, so backing them up wastes storage and slows down every backup run. PBS supports a file called .pxarexclude that works like .gitignore — place it inside the container, and PBS will skip the matched paths during backup.

On the Proxmox host, enter the Ollama container:

Enter the Ollama container
pct enter 105

Create a .pxarexclude file at the filesystem root:

Create the PBS exclusion file
nano /.pxarexclude

Paste the following, then save with Ctrl+X, Y, Enter:

/.pxarexclude
/usr/share/ollama/.ollama/models/

This tells PBS to skip the entire models directory during backup. Your Ollama configuration, service files, and systemd overrides are all still backed up — only the large, re-downloadable model blobs are excluded.

After a restore, just re-pull your models:

Re-pull models after a restore
ollama pull llama3.2:1b
ollama pull llama3.1:8b

TIP

Run ollama list before a planned rebuild to note which models you have installed. The config is tiny, but the model names are easy to forget.
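One way to avoid forgetting the names is to save them to a small file that does get backed up. A minimal sketch: model_names is an illustrative helper, the path /root/ollama-models.txt is a hypothetical choice, and the parsing assumes the model name is the first column of ollama list output under a single header row.

```shell
# Print the first column (the model name) of `ollama list` output read
# from stdin, skipping the header row and any blank lines.
# model_names is an illustrative name, not part of Ollama.
model_names() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Save the list before a rebuild (hypothetical path):
#   ollama list | model_names > /root/ollama-models.txt
#
# Re-pull everything after a restore:
#   while read -r model; do ollama pull "$model"; done < /root/ollama-models.txt
```

Because the file lives outside the excluded models directory, it is included in the PBS backup along with the rest of the container.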

Exit the container when done:

Exit the container
exit

Uptime Kuma Monitors

Open Uptime Kuma and add two monitors:

Monitor 1 — Ping:

  • Monitor Type: Ping
  • Friendly Name: Ollama LXC
  • Hostname: 10.1.20.105
  • Heartbeat Interval: 60 seconds

Monitor 2 — HTTP:

  • Monitor Type: HTTP(s)
  • Friendly Name: Ollama API
  • URL: https://ollama.hake.rodeo/api/tags
  • Heartbeat Interval: 60 seconds
  • Expected Status Code: 200

Next Steps

  • Other services. Any container or VM on your network can now call https://ollama.hake.rodeo (or http://10.1.20.105:11434 directly) for AI inference. Paperless-GPT, Open WebUI, LangChain apps, custom scripts — anything that supports the Ollama API.
  • More models. Run ollama list to see what you have, and browse ollama.com/library for the full model catalog. For a 12 GB GPU, stick to models at or below 8B parameters for fast inference, or try 13B-14B quantized models if you do not mind slightly slower generation.
  • Persistent storage. Models are stored in /usr/share/ollama/.ollama/models/. If you plan to pull many models, consider mounting a larger storage volume to this path.
  • Vision models. Models like llava and minicpm-v can process images alongside text. Useful for OCR replacement or image analysis pipelines.
  • GPU monitoring. Run nvidia-smi any time you want to check temperature, power draw, and VRAM usage. NVIDIA GPUs throttle at high temperatures — if your homelab runs warm, keep an eye on the thermal column.