James Routley

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

BIOS

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards together and requires unnecessary kernel-parameter trickery even for one of them.

The parameters that should be set:

Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled
Go to the Advanced tab -> PCI Subsystem Settings
Set Above 4G Decoding to Enabled
Set ReSize BAR Support to Auto or Enabled
Still on the Advanced tab, set PCIEX16_1 Link Mode to Gen 4
Set PCIEX16_2 Link Mode to Gen 4
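A quick way to confirm that the OS really booted in UEFI mode after disabling CSM is to look for the EFI directory that the Linux kernel exposes under /sys/firmware. The sketch below is only an illustration: the function names are mine, and it assumes a Linux system with sysfs mounted in the usual place.

```python
import os


def booted_in_uefi(efi_dir="/sys/firmware/efi"):
    """Return True when the kernel exposes EFI firmware data (UEFI boot)."""
    # On Linux this directory only exists when the system was booted via UEFI.
    return os.path.isdir(efi_dir)


def boot_mode(efi_dir="/sys/firmware/efi"):
    """Human-readable boot mode, based on the presence of efi_dir."""
    return "UEFI" if booted_in_uefi(efi_dir) else "legacy BIOS/CSM"


# Usage on the target machine:
#   print(boot_mode())
```

If this reports legacy BIOS/CSM, the OS was most likely installed in BIOS/MBR mode and the settings above will not be enough on their own.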

kernel

NVIDIA’s documentation is a mess. Here’s the link to the driver installation procedure (yes, with /tesla in the URL, because why not): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html

Since the two GPUs are different models, I unfortunately can’t set up this beauty: https://github.com/aikitoria/open-gpu-kernel-modules. I tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.

Uninstall nvidia-dkms-open
Blacklist the new nova driver

Only then will the freshly patched driver load at boot. You should see the following:

$ nvidia-smi topo -p2p r
 	GPU0	GPU1	
 GPU0	X	OK	
 GPU1	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  DR   = Disabled by regkey
  U    = Unknown
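The matrix above is easy to check by eye with two cards, but it can also be verified programmatically. The following is a minimal sketch that parses the text printed by nvidia-smi topo -p2p r and reports whether every GPU pair shows OK. The function names are hypothetical, and the parser assumes the output layout matches the sample shown above.

```python
def parse_p2p_matrix(text):
    """Parse the peer-to-peer matrix into {row_gpu: {col_gpu: status}}."""
    header = None
    matrix = {}
    for line in text.splitlines():
        cells = line.split()
        # Only the header and the matrix rows start with a "GPUn" cell.
        if not cells or not cells[0].startswith("GPU"):
            continue
        if header is None:
            header = cells  # first GPU line lists the column names
            continue
        matrix[cells[0]] = dict(zip(header, cells[1:]))
    return matrix


def p2p_fully_enabled(matrix):
    """True when every cell is either X (self) or OK."""
    return bool(matrix) and all(
        status in ("X", "OK")
        for row in matrix.values()
        for status in row.values()
    )


# Usage on the target machine (requires the NVIDIA driver):
#   import subprocess
#   out = subprocess.run(["nvidia-smi", "topo", "-p2p", "r"],
#                        capture_output=True, text=True, check=True).stdout
#   print(p2p_fully_enabled(parse_p2p_matrix(out)))
```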

If, like me, you own different NVIDIA cards, just use the nvidia-open driver.

Once rebooted with the nvidia driver loaded, check that it detects both cards:

$ nvidia-smi 
Sat Jun 13 09:29:23 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |
|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
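For scripting, the same information is available in a machine-readable form through nvidia-smi's --query-gpu option. The sketch below parses that CSV output and sums the VRAM across cards. The helper names are mine, and the sample line format in the comments is reconstructed from the table above rather than captured from a real run.

```python
import subprocess

# Query fields and format flags supported by nvidia-smi's CSV output mode.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines such as '0, NVIDIA GeForce RTX 3090, 23646, 24576'."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return gpus


def total_vram_mib(gpus):
    """Sum of the total memory of all detected GPUs, in MiB."""
    return sum(gpu["total_mib"] for gpu in gpus)


def query_gpus():
    """Run nvidia-smi and return the parsed list (needs the driver loaded)."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_memory(out)
```

With the two cards above, the totals add up to 24576 + 16303 = 40879 MiB, which is the budget the model, context and KV cache have to fit in.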

llama.cpp

These are the build flags I use to support both card generations:

# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="86;120" -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA_NCCL=OFF

The relevant flag is CMAKE_CUDA_ARCHITECTURES="86;120", which enables both the Ampere and Blackwell architectures. Note the -DGGML_CUDA_NCCL=OFF flag: I found that NCCL was actually counterproductive, even if the llama-server logs say otherwise.

Now for the startup options:
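The values 86 and 120 are the cards' CUDA compute capabilities (8.6 for the RTX 3090, 12.0 for the RTX 5080) with the dot removed. As a small sketch, here is how the flag value could be derived from a list of compute capabilities for any mix of cards. The function names are hypothetical, and the nvidia-smi query mentioned in the comment assumes a driver recent enough to expose the compute_cap field.

```python
def cuda_architectures(compute_caps):
    """Turn compute capabilities like '8.6' into a CMake list like '86;120'."""
    # Deduplicate (identical cards) and sort numerically, not lexically.
    archs = sorted({int(cap.strip().replace(".", "")) for cap in compute_caps})
    return ";".join(str(arch) for arch in archs)


def cmake_arch_flag(compute_caps):
    """Full CMake argument for the given compute capabilities."""
    return '-DCMAKE_CUDA_ARCHITECTURES="%s"' % cuda_architectures(compute_caps)


# The capabilities can be read with:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
print(cmake_arch_flag(["12.0", "8.6"]))
```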

llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
    -c 229376 \
    -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
    -ctk q8_0 -ctv q8_0 --kv-unified \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
    -sm tensor -ts 2,3 \
    --port 8001 --host 0.0.0.0

The sauce:

Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf: this model’s q8 quantization fits in the overall 39GB with a 230k context and the KV cache quantized at q8!
--spec-type ngram-mod,draft-mtp --spec-draft-n-max 3: the MTP speculative boost with a hint from ngram
-sm tensor: from the llama.cpp multi-GPU documentation
-ts 2,3: the cards’ usage ratio, important to be able to fill up every VRAM corner!
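The 2,3 ratio matches the cards' nominal VRAM sizes, 16 GB and 24 GB, reduced to the smallest integers. The sketch below shows that arithmetic. It assumes the 16 GB card is enumerated first by llama.cpp, which the 2,3 order suggests but which I have not verified, and the function name is mine. Treat the result as a starting point, since the best split in practice also depends on what else is using each card's memory.

```python
from functools import reduce
from math import gcd


def tensor_split(vram_gb):
    """Reduce per-GPU VRAM sizes to the smallest integer ratio for -ts."""
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(size // divisor) for size in vram_gb)


# Assumed device order: RTX 5080 (16 GB) first, RTX 3090 (24 GB) second.
print("-ts", tensor_split([16, 24]))
```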

With this setup, I am able to run a full Qwen3.6 model quantized at q8 at a whopping 80+ tokens/sec. Depending on the task, it can go as high as 90+.

2673.12.803.689 I slot create_check: id  0 | task 45808 | created context checkpoint 1 of 32 (pos_min = 12, pos_max = 12, n_tokens = 13, size = 149.677 MiB)
2673.13.869.654 I reasoning-budget: deactivated (natural end)
2673.14.095.592 I slot print_timing: id  0 | task 45808 | n_decoded =    100, tg =  81.84 t/s
2673.17.131.165 I slot print_timing: id  0 | task 45808 | n_decoded =    388, tg =  91.13 t/s
2673.18.058.712 I slot print_timing: id  0 | task 45808 | prompt eval time =     219.76 ms /    17 tokens (   12.93 ms per token,    77.36 tokens per second)
2673.18.058.714 I slot print_timing: id  0 | task 45808 |        eval time =    5185.10 ms /   457 tokens (   11.35 ms per token,    88.14 tokens per second)
2673.18.058.715 I slot print_timing: id  0 | task 45808 |       total time =    5404.85 ms /   474 tokens
2673.18.058.716 I slot print_timing: id  0 | task 45808 |    graphs reused =      41669
2673.18.058.717 I slot print_timing: id  0 | task 45808 | draft acceptance = 0.77295 (  320 accepted /   414 generated)
2673.18.058.728 I statistics        ngram-mod: #calls(b,g,a) =  341  43646   1169, #gen drafts =   1169, #acc drafts =  1169, #gen tokens =  74496, #acc tokens = 44050, dur(b,g,a) = 1403.794, 706.959, 134.904 ms
2673.18.058.731 I statistics        draft-mtp: #calls(b,g,a) =  341  42477  42477, #gen drafts =  42477, #acc drafts = 36208, #gen tokens = 127431, #acc tokens = 86553, dur(b,g,a) = 0.158, 264947.885, 44.505 ms
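The headline numbers can be recomputed from the raw counters in these lines: 457 tokens in 5185.10 ms gives about 88.14 tokens per second, and 320 accepted drafts out of 414 generated gives an acceptance rate of about 0.773. The sketch below extracts both from a log excerpt. The regular expressions are written against the exact line format shown above and are an assumption about its stability, not part of llama.cpp.

```python
import re

# Matches the generation "eval time" line but not the "prompt eval time" line.
EVAL_RE = re.compile(r"\|\s*eval time =\s*([\d.]+) ms /\s*(\d+) tokens")
DRAFT_RE = re.compile(
    r"draft acceptance = [\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)"
)


def summarize_timings(log_text):
    """Recompute generation speed and draft acceptance from a log excerpt."""
    summary = {}
    match = EVAL_RE.search(log_text)
    if match:
        elapsed_ms, tokens = float(match.group(1)), int(match.group(2))
        summary["tokens_per_second"] = tokens / (elapsed_ms / 1000.0)
    match = DRAFT_RE.search(log_text)
    if match:
        accepted, generated = int(match.group(1)), int(match.group(2))
        summary["draft_acceptance"] = accepted / generated
    return summary
```

Usage: pass the captured llama-server output as a string, for example summarize_timings(open("server.log").read()), where server.log is a hypothetical file name.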

While your cards are computing, check they are actually running at full speed with the following command:

$ sudo lspci -vvv -s 07:00.0 | grep "LnkSta:"

For each PCIe port, if you’re running the workload on a 16x/2 split (x8 per card), you should see:

LnkSta:	Speed 16GT/s, Width x8 (downgraded)
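16GT/s is the per-lane rate of PCIe Gen 4, so this line confirms the Gen 4 setting chosen in the BIOS. As a rough sketch, the LnkSta line can be parsed and mapped to a PCIe generation like this. The function name and the returned dictionary layout are mine, and the parser assumes the lspci output format shown above.

```python
import re

# Per-lane transfer rate (GT/s) for each PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5, 64.0: 6}
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_lnksta(line):
    """Extract link speed, PCIe generation and lane width from a LnkSta line."""
    match = LNKSTA_RE.search(line)
    if match is None:
        raise ValueError("not a LnkSta line: %r" % line)
    speed = float(match.group(1))
    return {
        "speed_gts": speed,
        "gen": GEN_BY_SPEED.get(speed),  # None for an unrecognized speed
        "width": int(match.group(2)),
    }


print(parse_lnksta("LnkSta:\tSpeed 16GT/s, Width x8 (downgraded)"))
```

A lower speed such as 2.5GT/s on an idle card is not necessarily a problem, since the link can be clocked down to save power. That is why the check is worth running while the cards are busy.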

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8
