
Run Ollama with NVIDIA GPU Acceleration on Proxmox LXC (2026)

Prerequisites
- Proxmox VE 8.1 or later with IOMMU enabled
- An NVIDIA GPU physically installed in the host
Tools
- SSH terminal
- Web browser
Software
- debian: 13
- ollama: 0.22.0
- proxmox-ve: 9.1
- nvidia-driver: 595.58.03
Running large language models locally is one of the best things you can do with a homelab. No API keys, no usage fees, no data leaving your network. The bottleneck is speed — running inference on a CPU is painfully slow, but a mid-range NVIDIA GPU turns a 30-second response into a 2-second one.
This guide walks through the full stack: installing the NVIDIA driver on a Proxmox host, passing the GPU through to an unprivileged LXC container using Proxmox VE's modern dev0: device passthrough syntax, and installing Ollama inside the container for GPU-accelerated inference. By the end, you will have a dedicated Ollama LXC that any other service on your network can call for fast, local AI.
We are using the dev0: syntax introduced in Proxmox VE 8.1, which replaces the old lxc.cgroup2.devices.allow and lxc.mount.entry approach that you will see in older guides. The new syntax is cleaner, handles both cgroup permissions and device bind mounts in a single config line, and works with unprivileged containers out of the box.
What You Will End Up With
- NVIDIA driver 595.58.03 installed on the Proxmox host with DKMS (survives kernel updates)
- A dedicated Debian 13 LXC with full GPU access via dev0: passthrough
- Ollama serving models on the local network with GPU acceleration
- A test inference confirming everything works end-to-end
For this guide, the Ollama LXC takes CT ID 105 and IP 10.1.20.105. The GPU is an NVIDIA GeForce RTX 3060 12GB. Substitute your own values.
NOTE
About 25 to 30 minutes of hands-on time. The NVIDIA driver compilation is the longest wait.
Step 1: Identify Your GPU on the Proxmox Host
On the Proxmox host, check which NVIDIA GPU is installed and what bus address it is on:
lspci | grep -i nvidia
You will see output like:
09:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate]
09:00.1 Audio device: NVIDIA Corporation GA106 High Definition Audio Controller
Note the bus address — 09:00 in this example. Modern NVIDIA cards are multi-function devices: the VGA controller at .0, the audio device at .1, and on Turing+ cards, sometimes USB and serial bus controllers at .2 and .3. All of these belong to the GPU and will be passed through together.
If you have multiple NVIDIA GPUs and want to pass only one to the LXC (keeping others for VM passthrough or other use), note which bus address belongs to which card.
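If you would rather not pick the bus addresses out of the lspci output by eye, a small helper can extract them. This is a minimal sketch: the function name nvidia_bus_addresses is hypothetical, and it assumes the default lspci line format shown above (address first, then the description).

```shell
# Hypothetical helper: read `lspci` output on stdin and print the
# bus:device address (e.g. 09:00) of every NVIDIA device, one per line.
# Assumes the default lspci format: "<bus>:<dev>.<func> <description>".
nvidia_bus_addresses() {
    # 1. keep only NVIDIA functions
    # 2. take the first field, e.g. 09:00.0
    # 3. drop the function number, leaving 09:00
    # 4. collapse the functions of one card into a single line
    grep -i 'nvidia' | cut -d' ' -f1 | cut -d. -f1 | sort -u
}

# On the Proxmox host you would run:
#   lspci | nvidia_bus_addresses
```

With the single RTX 3060 from this guide, the helper prints one line, 09:00. With two cards it prints one line per card, which makes it easier to decide which address goes to the LXC and which to a VM.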
Step 2: Blacklist Nouveau and Reboot
The open-source nouveau driver will try to claim the GPU before the proprietary NVIDIA driver can. We need to blacklist it. We also blacklist nvidiafb since we do not need framebuffer support on a headless server.
nano /etc/modprobe.d/blacklist-nvidia.conf
Paste the following, then save with Ctrl+X, Y, Enter:
blacklist nouveau
blacklist nvidiafb
If you have a single NVIDIA GPU and want it available for LXC passthrough, skip the next section and jump straight to the initramfs rebuild below.
Multi-GPU: Selective VFIO Binding (Optional)
If you have multiple NVIDIA GPUs and want one dedicated to VM passthrough (bound to vfio-pci) while the other uses the NVIDIA driver for LXC passthrough, you need a udev rule that binds by bus address rather than by vendor ID.
First, make sure the vfio modules load at boot:
nano /etc/modules-load.d/vfio.conf
Paste the following, then save with Ctrl+X, Y, Enter:
vfio
vfio_iommu_type1
vfio_pci
Now create a udev rule that only binds the VM slot to vfio-pci. Replace 0000:41:00 with the bus address of the GPU you want reserved for VMs:
nano /etc/udev/rules.d/10-vfio-nvidia.rules
Paste the following, then save with Ctrl+X, Y, Enter:
SUBSYSTEM=="pci", KERNEL=="0000:41:00.*", ATTR{driver_override}="vfio-pci"
| Flag | What it does |
|---|---|
| SUBSYSTEM=="pci" | Only matches PCI devices. Keep. |
| KERNEL=="0000:41:00.*" | Matches all functions at bus address 41:00 (VGA, audio, USB, etc.). Replace with the bus address of the GPU you want bound to vfio-pci for VM passthrough. |
| ATTR{driver_override}="vfio-pci" | Forces vfio-pci as the driver. Keep. |
The trailing * is a glob wildcard, so 0000:41:00.* matches every function at that bus address (.0, .1, and so on). The GPU at the other bus address (our LXC GPU) is not matched, so the NVIDIA driver claims it instead.
Rebuild the Initramfs and Reboot
WARNING
This step is critical. The blacklist (and udev rules, if you created them) are baked into the initramfs at build time. If you skip this, the old initramfs loads at boot with nouveau still active, and the NVIDIA driver installer in the next step will fail. This is the single most common failure in GPU passthrough setups.
update-initramfs -u -k $(uname -r)
Now reboot so the kernel comes up with nouveau blocked:
reboot
After the host comes back up, SSH in and verify nouveau is no longer loaded:
lsmod | grep nouveau
No output means nouveau is successfully blocked. If you still see it listed, double-check that /etc/modprobe.d/blacklist-nvidia.conf exists and contains the blacklist nouveau line, rebuild the initramfs again, and reboot.
Step 3: Install the NVIDIA Driver on the Host
The NVIDIA driver needs kernel headers to compile its modules. Install them along with build-essential for the compiler toolchain:
apt-get install -y proxmox-headers-$(uname -r) build-essential dkms
| Flag | What it does |
|---|---|
| proxmox-headers-$(uname -r) | Kernel headers matching your running PVE kernel. The $(uname -r) automatically fills in the correct version. Keep. |
| build-essential | C compiler and make. Needed for NVIDIA module compilation. Keep. |
| dkms | Dynamic Kernel Module Support. Automatically recompiles the NVIDIA modules when the kernel is updated. Keep. |
Download the NVIDIA driver. We use the 595.58.03 production branch — the latest Linux driver as of April 2026:
wget -P /root https://us.download.nvidia.com/XFree86/Linux-x86_64/595.58.03/NVIDIA-Linux-x86_64-595.58.03.run
Make it executable and run the installer:
chmod +x /root/NVIDIA-Linux-x86_64-595.58.03.run
bash /root/NVIDIA-Linux-x86_64-595.58.03.run --dkms --silent
| Flag | What it does |
|---|---|
| --dkms | Registers the modules with DKMS so they automatically rebuild on kernel updates. Keep. |
| --silent | Runs without the ncurses UI. On a headless Proxmox host there is no display server, so the installer's questions about X libraries are irrelevant. Keep. |
This takes 2-5 minutes while it compiles the kernel modules. You will see a few warnings about X library paths and 32-bit compatibility libraries — both are harmless on a headless server.
Verify the driver is loaded and sees your GPU:
nvidia-smi
You should see a table showing your GPU, the driver version (595.58.03), CUDA version, temperature, and memory usage. If you get command not found or No devices were found, the most likely causes are:
- Nouveau still loaded: run lsmod | grep nouveau. If it appears, the blacklist did not take effect. Go back to Step 2 and check that /etc/modprobe.d/blacklist-nvidia.conf exists, rebuild the initramfs, and reboot.
- GPU bound to vfio-pci: run lspci -nnks <your-bus-address> and check Kernel driver in use. If it says vfio-pci instead of nvidia, your udev rule is matching the wrong GPU. Check the bus address in your rule.
Verify DKMS registered the modules:
dkms status
You should see nvidia/595.58.03 listed as installed.
Step 4: Create the Ollama LXC with GPU Passthrough
On the Proxmox host, create the container:
pct create 105 local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst \
--hostname ollama \
--cores 4 \
--memory 8192 \
--swap 2048 \
--rootfs local-lvm:20 \
--net0 name=eth0,bridge=vmbr0,ip=10.1.20.105/24,gw=10.1.20.1 \
--unprivileged 1 \
--features nesting=1 \
--onboot 1 \
--password
| Flag | What it does |
|---|---|
| 105 (positional) | Container VMID. Replace with the next free ID on your host. |
| local:vztmpl/debian-13-standard_13.1-2_amd64.tar.zst | OS template. Keep. If missing, run pveam update && pveam download local debian-13-standard_13.1-2_amd64.tar.zst. |
| --hostname ollama | Container hostname. Keep or rename to taste. |
| --cores 4 | CPU cores. Keep. Ollama uses CPU for prompt processing even with GPU offload. |
| --memory 8192 | RAM in MB. Keep. Ollama loads model metadata into system RAM alongside the GPU VRAM. |
| --swap 2048 | Swap in MB. Keep as a safety net for larger models. |
| --rootfs local-lvm:20 | 20 GB root filesystem. Replace local-lvm if your storage pool is named differently. Models can be large (4-8 GB each), so 20 GB gives room for a few. |
| --net0 name=eth0,bridge=vmbr0,ip=10.1.20.105/24,gw=10.1.20.1 | Network attachment. Replace bridge, ip, and gw with your environment's values. |
| --unprivileged 1 | Unprivileged container. Keep. The dev0: syntax works with unprivileged containers. |
| --features nesting=1 | Enables systemd cgroup management. Keep. |
| --onboot 1 | Auto-start on host boot. Keep. |
| --password | Prompts for a root password. Keep. |
Now add the GPU device passthrough lines. These use PVE's dev0: syntax to pass the NVIDIA device nodes into the container:
nano /etc/pve/lxc/105.conf
Add the following lines at the end of the file, then save with Ctrl+X, Y, Enter:
dev0: /dev/nvidia0,gid=44
dev1: /dev/nvidiactl,gid=44
dev2: /dev/nvidia-uvm,gid=44
dev3: /dev/nvidia-uvm-tools,gid=44
dev4: /dev/nvidia-caps/nvidia-cap1,gid=44
dev5: /dev/nvidia-caps/nvidia-cap2,gid=44
| Line | What it does |
|---|---|
| /dev/nvidia0 | The primary GPU device. If you have multiple GPUs on the NVIDIA driver, nvidia0 is the first one. Keep. |
| /dev/nvidiactl | NVIDIA control device, shared across all GPUs. Keep. |
| /dev/nvidia-uvm | Unified Virtual Memory, required for CUDA compute workloads like LLM inference. Keep. |
| /dev/nvidia-uvm-tools | UVM tools interface. Keep. |
| /dev/nvidia-caps/nvidia-cap1, nvidia-cap2 | NVIDIA capability devices. Keep. |
| gid=44 | Sets the group inside the container to GID 44 (the video group on Debian). This lets the ollama user access the devices without running as root. Keep. |
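If more than one GPU sits on the NVIDIA driver, the list of device nodes grows (/dev/nvidia1 and so on) and numbering the devN: lines by hand gets error-prone. The sketch below generates them from a list of paths. The function name pve_dev_lines is hypothetical; the output is simply the dev0: syntax shown above.

```shell
# Hypothetical helper: print one "devN: <path>,gid=<gid>" line per device
# path given as an argument, numbering from dev0 upward.
# Usage: pve_dev_lines <gid> <device-path>...
pve_dev_lines() {
    gid=$1
    shift
    i=0
    for path in "$@"; do
        printf 'dev%d: %s,gid=%s\n' "$i" "$path" "$gid"
        i=$((i + 1))
    done
}

# Example with the two core nodes from this guide:
pve_dev_lines 44 /dev/nvidia0 /dev/nvidiactl
```

Pass the device paths explicitly and review the output before pasting it into /etc/pve/lxc/105.conf. The numbering always starts at dev0, so it would collide with any devN: entries already in the file.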
Start the container:
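pct start 105
Verify the GPU devices are visible inside the container. The sh -c wrapper makes the wildcard expand inside the container rather than on the host:
pct exec 105 -- sh -c 'ls -la /dev/nvidia*'
You should see all six device nodes listed with the video group.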
Step 5: Install the NVIDIA Userspace Libraries in the Container
The container shares the host's kernel (and its NVIDIA kernel modules), but it needs its own copy of the userspace libraries — nvidia-smi, libcuda.so, and friends. We use the same .run installer with --no-kernel-module so it only installs the userspace components.
On the Proxmox host, copy the installer into the container:
pct push 105 /root/NVIDIA-Linux-x86_64-595.58.03.run /root/NVIDIA-Linux-x86_64-595.58.03.run --perms 755
Now enter the container:
pct enter 105
The minimal Debian 13 template ships without a generated locale, which causes noisy Perl warnings on every package operation. Fix that before doing anything else:
sed -i 's/^# en_US.UTF-8/en_US.UTF-8/' /etc/locale.gen && locale-gen
Now install the NVIDIA userspace libraries:
bash /root/NVIDIA-Linux-x86_64-595.58.03.run --no-kernel-module --silent
| Flag | What it does |
|---|---|
| --no-kernel-module | Skips kernel module compilation entirely. The container uses the host's kernel modules via the passed-through devices. Keep. |
| --silent | No UI prompts. Keep. |
Verify the driver sees the GPU from inside the container:
nvidia-smi
You should see the same GPU table as on the host: same driver version, same GPU model, same VRAM. If you get Failed to initialize NVML: Unknown Error, the device nodes are not correctly passed through. Go back to Step 4 and check the dev0: lines in the container config.
Clean up the installer to save disk space:
rm /root/NVIDIA-Linux-x86_64-595.58.03.run
TIP
The NVIDIA driver version inside the container must match the host exactly. If you update the host driver later, you need to re-run this step inside the container with the matching new version.
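After a host driver update, a quick comparison can catch a mismatch before Ollama silently falls back to CPU. The following is a sketch, not part of the original setup: the function name driver_versions_match is hypothetical, and the commented usage assumes CT ID 105 and that nvidia-smi is installed on both sides.

```shell
# Hypothetical helper: succeed only if two driver version strings are
# non-empty and identical after stripping whitespace.
driver_versions_match() {
    host_ver=$(printf '%s' "$1" | tr -d '[:space:]')
    ct_ver=$(printf '%s' "$2" | tr -d '[:space:]')
    [ -n "$host_ver" ] && [ "$host_ver" = "$ct_ver" ]
}

# On the Proxmox host (CT ID 105 as in this guide) you could call it as:
#   host=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   ct=$(pct exec 105 -- nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_versions_match "$host" "$ct" && echo "match" || echo "MISMATCH: host=$host container=$ct"
```

An empty version string counts as a mismatch, so a container where nvidia-smi fails outright is flagged rather than passed.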
Step 6: Install Ollama
Ollama provides a one-line install script (curl -fsSL https://ollama.com/install.sh | sh), but piping a remote script to a shell means trusting whatever that URL serves at the moment you run it. For a service that will have GPU access and run inference on your network, it is worth doing the install manually. It only takes a few extra commands, and you will understand exactly what is on your system.
Install curl and zstd — both are needed to download and extract the Ollama release archive, and neither is in the minimal Debian template:
apt update && apt install -y curl zstd
Download the official Ollama release tarball and extract it into /usr. This places the ollama binary at /usr/bin/ollama and its shared libraries under /usr/lib/ollama:
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tar.zst | tar x --zstd -C /usr
Verify the binary is in place:
ollama --version
Now create a dedicated system user to run the Ollama service. Running it as its own user rather than root is standard practice: it limits what the process can access if it is ever compromised.
useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
| Flag | What it does |
|---|---|
| -r | System account (no aging, low UID). Keep. |
| -s /bin/false | No login shell. This user only runs the service. Keep. |
| -U | Creates a matching ollama group. Keep. |
| -m -d /usr/share/ollama | Creates the home directory at /usr/share/ollama, where Ollama stores downloaded models. Keep. |
Add the ollama user to the video group so it can access the GPU devices we passed through in Step 4:
usermod -a -G video ollama
Create the systemd service file:
nano /etc/systemd/system/ollama.service
Paste the following, then save with Ctrl+X, Y, Enter:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
[Install]
WantedBy=multi-user.target
Enable and start the service:
systemctl daemon-reload && systemctl enable ollama && systemctl start ollama
Verify Ollama is running:
systemctl status ollama --no-pager
You should see Active: active (running). Check the logs to confirm GPU detection:
journalctl -u ollama --no-pager | grep -i "inference compute"
You should see a line mentioning your GPU model and VRAM, something like NVIDIA GeForce RTX 3060 with 12.0 GiB. If you see no GPU line, restart the service with systemctl restart ollama and check again.
Step 7: Configure Ollama for Network Access
By default, Ollama only listens on localhost:11434. If you want other containers or machines on your network to use this Ollama instance (for example, Paperless-GPT running on a different LXC), you need to change the listen address.
mkdir -p /etc/systemd/system/ollama.service.d
nano /etc/systemd/system/ollama.service.d/override.conf
Paste the following, then save with Ctrl+X, Y, Enter:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then reload systemd and restart the service:
systemctl daemon-reload && systemctl restart ollama
Ollama now listens on all interfaces. Any machine on your LAN can reach it at http://10.1.20.105:11434.
WARNING
There is no authentication on the Ollama API. Anyone who can reach port 11434 can use your GPU for inference. On a home network behind a firewall this is fine. On a shared or public network, restrict access with firewall rules.
Step 8: Pull a Model and Test Inference
Pull a small model first to verify everything works before committing to a multi-gigabyte download:
ollama pull llama3.2:1b
This downloads the 1.3 GB Llama 3.2 1B model. Once it finishes, test inference:
ollama run llama3.2:1b "What is Proxmox? Reply in two sentences."
You should get a coherent response in a few seconds. If this takes 30+ seconds, the GPU is not being used. See the verification step below.
For actual workloads like document classification or coding assistance, pull a larger model:
ollama pull llama3.1:8b
This downloads about 4.9 GB. The 8B parameter model fits entirely in the 12 GB of VRAM on an RTX 3060, with room to spare for context.
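Other services will talk to Ollama over its HTTP API rather than the CLI, so it is worth testing that path too. The sketch below builds a request body for the /api/generate endpoint. The helper name ollama_generate_body is hypothetical, and it is deliberately naive about JSON escaping.

```shell
# Hypothetical helper: build the JSON body for Ollama's /api/generate
# endpoint. Deliberately naive: it does NOT escape quotes or backslashes,
# so keep the prompt free of those characters (or use a JSON tool instead).
# Usage: ollama_generate_body <model> <prompt>
ollama_generate_body() {
    printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# From another machine on the LAN (IP from this guide), after Step 7:
#   curl -s http://10.1.20.105:11434/api/generate \
#        -d "$(ollama_generate_body llama3.2:1b 'What is Proxmox?')"
ollama_generate_body llama3.2:1b 'What is Proxmox?'
```

Setting stream to false makes the API return one JSON object instead of a stream of partial responses, which is easier to read in a terminal.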
Step 9: Verify GPU Acceleration
After running any prompt, check that the model is loaded into VRAM (not system RAM):
curl -s http://localhost:11434/api/ps | python3 -m json.tool
Look for the size_vram field. If it is equal to (or very close to) the size field, the entire model is in GPU VRAM. If size_vram is 0, the model is running on CPU only.
You can also watch GPU utilization during inference in a second terminal:
watch -n 1 nvidia-smi
During inference, you should see GPU-Util spike to 50-100% and Memory-Usage increase by the model size.
If the model is not using the GPU:
- Check that nvidia-smi works inside the container. If it does not, the device passthrough is broken.
- Check Ollama's startup log. Run journalctl -u ollama --no-pager | head -30 and look for lines about GPU discovery. You should see your GPU model listed.
- Restart Ollama. Sometimes the GPU is not detected on first boot if the NVIDIA driver loaded after Ollama started. Run systemctl restart ollama and try again.
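The size versus size_vram comparison can be reduced to a single percentage. This is a rough sketch: the function name vram_percent is hypothetical, and it uses plain text matching on the /api/ps response instead of a real JSON parser, assuming integer size and size_vram fields.

```shell
# Hypothetical helper: read an /api/ps JSON response on stdin and print the
# percentage (0-100) of the first loaded model that sits in VRAM.
# Plain text matching, not a JSON parser: assumes integer "size" and
# "size_vram" fields as described in this guide.
vram_percent() {
    json=$(cat)
    size=$(printf '%s' "$json" | grep -o '"size": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$')
    vram=$(printf '%s' "$json" | grep -o '"size_vram": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$')
    if [ -z "$size" ] || [ -z "$vram" ] || [ "$size" -eq 0 ]; then
        echo "no loaded model found" >&2
        return 1
    fi
    # Integer division is precise enough for a quick health check.
    echo $(( vram * 100 / size ))
}

# Inside the container, right after running a prompt:
#   curl -s http://localhost:11434/api/ps | vram_percent
```

A result of 100 means the whole model is in VRAM, 0 means CPU only, and anything in between indicates a partial offload.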
Standard Homelab Wiring (Optional Bonus)
These steps wire the Ollama LXC into the homelab stack — local DNS, HTTPS via Caddy, backups, and monitoring. Ollama is an API service with no built-in authentication, so the Caddy reverse proxy does not add security here. I set it up purely for uniformity — every service in the homelab gets a clean https://name.hake.rodeo URL, and I prefer consistency over having one oddball http://ip:port endpoint. Skip this section entirely if you do not run these services, or cherry-pick the parts that apply.
Pi-hole Local DNS Record
Open your Pi-hole admin UI and click Local DNS Records in the left sidebar.
- Domain: ollama.hake.rodeo
- IP Address: 10.1.20.101 (your Caddy LXC, not the Ollama LXC)
Click Add. Verify:
dig +short ollama.hake.rodeo @10.1.20.100
You should see 10.1.20.101.
Caddy Reverse Proxy
On the Proxmox host, enter the Caddy container:
pct enter 101
nano /etc/caddy/Caddyfile
Add the following block at the end of the file:
ollama.hake.rodeo {
    tls {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
    }
    reverse_proxy 10.1.20.105:11434
}
Save with Ctrl+X, Y, Enter, then reload:
systemctl reload caddy
Verify with a quick API call:
curl -s https://ollama.hake.rodeo/api/tags
You should see a JSON response listing your pulled models. Other services on your network can now reach Ollama at https://ollama.hake.rodeo instead of http://10.1.20.105:11434. Both work, so use whichever you prefer.
PBS Backup Job
In the Proxmox web UI, go to Datacenter > Backup, double-click your existing backup job, and tick 105 (ollama) in the VM selection list. Save.
Ollama model files are large — a single 8B model is around 5 GB, and you will probably accumulate several. These files are freely re-downloadable with ollama pull, so backing them up wastes storage and slows down every backup run. PBS supports a file called .pxarexclude that works like .gitignore — place it inside the container, and PBS will skip the matched paths during backup.
On the Proxmox host, enter the Ollama container:
pct enter 105
Create a .pxarexclude file at the filesystem root:
nano /.pxarexclude
Paste the following, then save with Ctrl+X, Y, Enter:
/usr/share/ollama/.ollama/models/
This tells PBS to skip the entire models directory during backup. Your Ollama configuration, service files, and systemd overrides are all still backed up. Only the large, re-downloadable model blobs are excluded.
After a restore, just re-pull your models:
ollama pull llama3.2:1b
ollama pull llama3.1:8b
TIP
Run ollama list before a planned rebuild to note which models you have installed. The config is tiny, but the model names are easy to forget.
Exit the container when done:
exit
Uptime Kuma Monitors
Open Uptime Kuma and add two monitors:
Monitor 1 — Ping:
- Monitor Type: Ping
- Friendly Name: Ollama LXC
- Hostname: 10.1.20.105
- Heartbeat Interval: 60 seconds
Monitor 2 — HTTP:
- Monitor Type: HTTP(s)
- Friendly Name: Ollama API
- URL: https://ollama.hake.rodeo/api/tags
- Heartbeat Interval: 60 seconds
- Expected Status Code: 200
Next Steps
- Other services. Any container or VM on your network can now call https://ollama.hake.rodeo (or http://10.1.20.105:11434 directly) for AI inference. Paperless-GPT, Open WebUI, LangChain apps, custom scripts: anything that supports the Ollama API.
- More models. Run ollama list to see what you have, and browse ollama.com/library for the full model catalog. For a 12 GB GPU, stick to models at or below 8B parameters for fast inference, or try 13B-14B quantized models if you do not mind slightly slower generation.
- Persistent storage. Models are stored in /usr/share/ollama/.ollama/models/. If you plan to pull many models, consider mounting a larger storage volume to this path.
- Vision models. Models like llava and minicpm-v can process images alongside text. Useful for OCR replacement or image analysis pipelines.
- GPU monitoring. Run nvidia-smi any time you want to check temperature, power draw, and VRAM usage. NVIDIA GPUs throttle at high temperatures, so if your homelab runs warm, keep an eye on the thermal column.