Axolotl Setup — miaai (RTX 5080, CUDA 13.2)

System Info

GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
Axolotl branch: activeblue/main

One-time Setup

1. Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /opt/miniconda3
/opt/miniconda3/bin/conda init bash
source ~/.bashrc

2. Create Python 3.11 environment

conda create -n axolotl python=3.11 -y
conda activate axolotl

3. Clone and sync repo with upstream

git clone https://git.activeblue.net/tocmo0nlord/axolotl.git
cd axolotl
git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
git fetch upstream
git rebase upstream/main        # keeps activeblue patches on top
git push origin activeblue/main --force-with-lease

4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)

conda install -y -c "nvidia/label/cuda-12.8.0" cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH

NOTE: Despite installing from the cuda-12.8.0 channel, conda resolves nvcc to 13.2.78. This is fine — use cu132 everywhere to match.

5. Install PyTorch — use cu132 (matches nvcc from conda)

NOTE: torchaudio has no cu132 wheel — skip it, not needed for LLM training

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"

6. Install Axolotl

pip install -e "."

flash-attn compiles CUDA kernels from source — takes 15–25 min on 10 cores of i7-14700K. Always set MAX_JOBS to the number of available CPU cores:

MAX_JOBS=10 pip install flash-attn --no-build-isolation

7. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)

The prebuilt bitsandbytes wheels do not include sm_120 support and CUDA 13.2 dropped sm_50–53. You must compile from source with a patched CMakeLists.txt.

# Clone bitsandbytes v0.49.1
git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
cd /tmp/bnb_0491

# Patch CMakeLists.txt: override arch list to sm_120 only, just before the foreach loop
# (cmake >= 3.23.0 skips the manual arch block and uses its own built-in list which lacks sm_120)
sed -i '/    foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\    # RTX 5080 sm_120 patch: override before capability list is built\n    set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' CMakeLists.txt

# Verify the patch landed at the right line
grep -n "ARCHITECTURES_ALL\|foreach" CMakeLists.txt | tail -5
# Should show: set(CMAKE_CUDA_ARCHITECTURES_ALL 120) immediately before the foreach line

# Configure — must point cmake at conda's nvcc
cmake \
  -DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
  -DCOMPUTE_BACKEND=cuda \
  -S /tmp/bnb_0491 \
  -B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
# Expected: "CUDA Capabilities Selected: 120"

# Build (j10 uses 10 cores — adjust to your CPU)
cmake --build /tmp/bnb_0491/build -j10

# Install into conda site-packages
SITE_PKG=/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages
cp -r /tmp/bnb_0491/bitsandbytes "$SITE_PKG/"

# Verify
python3 -c "
import torch, bitsandbytes as bnb
x = torch.randn(64, 64, device='cuda')
l = bnb.nn.Linear8bitLt(64, 64).cuda()
print('bitsandbytes CUDA OK:', l(x).shape)
"

Every Session (after first-time setup)

export PATH="/opt/miniconda3/bin:$PATH"
conda activate axolotl
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
cd /home/tocmo0nlord/axolotl

Training Config — human_chat_qlora.yml

Key settings that work on RTX 5080 (16GB):

Setting	Value	Notes
`sequence_len`	`2048`	4096 causes OOM during loss computation (logits x 128k vocab)
`micro_batch_size`	`1`	Keep low; effective batch = micro x grad_accum
`gradient_accumulation_steps`	`8`	Effective batch = 8
`adapter`	`qlora`	QLoRA 4-bit via bitsandbytes
`attn_implementation`	`flash_attention_2`	Not the deprecated `flash_attention: true`
`type` (datasets)	`chat_template`	Not the deprecated `sharegpt`

Dataset fields for SlimOrca / OpenHermes-2.5 (sharegpt-format with different field names):

datasets:
  - path: Open-Orca/SlimOrca
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    split: "train[:3%]"

Run Training

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
axolotl train ~/human_chat_qlora.yml

Expected startup sequence:

Config validation + capability detection (shows sm_120)
Dataset tokenization (~65k samples, ~30 seconds)
Loading weights: 100% 291/291
trainable params: 167,772,160 || all params: 8,198,033,408 || trainable%: 2.05
Initial eval: loss ~0.81, perplexity ~2.25, VRAM ~8.5GB
Training steps at ~2.6 it/s, VRAM ~9-10GB

Common Pitfalls Encountered

Problem	Cause	Fix
`externally-managed-environment`	System Python 3.13 blocks pip	Use conda env, never system pip
`No module named torch` (flash-attn)	pip builds in isolated env	Use `--no-build-isolation`
`CUDA_HOME not set`	CUDA toolkit not installed	`conda install cuda-toolkit` from nvidia channel
`CUDA version mismatch 13.2 vs 12.8`	Conda nvcc is 13.2, torch was cu128	Reinstall torch with `--index-url .../cu132`
`torchaudio` not found for cu132	No cu132 wheel exists	Skip torchaudio — not needed
`src refspec main does not match`	Fork default branch is `activeblue/main`	`git push origin activeblue/main`
flash-attn compile is slow	Single-threaded by default	Set `MAX_JOBS=<cpu_count>` before pip install
`nvcc fatal: Unsupported gpu architecture 'compute_50'`	bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it	Patch CMakeLists.txt (see step 7 above)
`CUDA Capabilities Selected: 50;52;...` (ignores sm_120)	cmake >= 3.23 built-in arch list lacks sm_120	Add `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop
`BackendUnavailable: scikit_build_core`	pip install of bnb tries to rebuild	Copy .so directly to site-packages instead
`torch.OutOfMemoryError` during eval	logits tensor (batch x 4096 x 128k vocab) too large	Set `sequence_len: 2048`, `micro_batch_size: 1`
`type: sharegpt` deprecation warning	axolotl removed sharegpt type	Use `type: chat_template` with field mappings
`flash_attention: true` deprecation	Old config key removed	Use `attn_implementation: flash_attention_2`
Capybara dataset `field_messages null`	Capybara uses input/output format, not conversations	Switch to SlimOrca or OpenHermes-2.5

7.0 KiB Raw Blame History Unescape Escape