Axolotl Setup — miaai (RTX 5080, CUDA 13.2)
System Info
- GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
- Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
- OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
- Axolotl repo: /home/tocmo0nlord/axolotl (branch: activeblue/main)
- Conda env: axolotl at /opt/miniconda3/envs/axolotl
Starting from Bare Ubuntu 25.10
If rebuilding from scratch, complete these steps before anything else.
A. System packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y \
build-essential cmake git curl wget \
python3-dev libssl-dev zlib1g-dev \
ca-certificates gnupg lsb-release
B. NVIDIA driver (580.xx)
Ubuntu 25.10 is too new for NVIDIA's apt repo. Install via ubuntu-drivers:
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify:
nvidia-smi
# Must show: NVIDIA GeForce RTX 5080, Driver Version: 580.x
If ubuntu-drivers installs the wrong version, force the right one:
sudo apt install -y nvidia-driver-580
sudo reboot
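To confirm both the driver version and the compute capability in one shot (newer drivers expose a compute_cap query field; sm_120 reports as 12.0):
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
# Expect: NVIDIA GeForce RTX 5080, 580.x, 12.0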
C. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify it's running
systemctl status ollama
D. HuggingFace CLI
# NOTE: Ubuntu 25.10's system pip is externally managed (PEP 668). Run this
# inside the conda env from One-time Setup, or add --break-system-packages.
pip3 install huggingface_hub
huggingface-cli login
# Paste your HF token — required for gated models like meta-llama
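To confirm the token was saved:
huggingface-cli whoami
# Should print your HF username, not an authentication error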
Once steps A–D are done, continue with the One-time Setup below.
Pre-Training Checklist (every session)
# 1. Stop Ollama — if it receives a request mid-training it will compete for VRAM
sudo systemctl stop ollama
# 2. Activate conda env
export PATH="/opt/miniconda3/bin:$PATH"
conda activate axolotl
# 3. Set env vars
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# 4. Confirm GPU is clear (should show no processes before training)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# 5. Go to axolotl directory
cd /home/tocmo0nlord/axolotl
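The whole checklist can be wrapped in one script. A sketch (hypothetical ~/pretrain.sh; note that conda activate needs conda.sh sourced in non-interactive shells):
#!/usr/bin/env bash
# Hypothetical ~/pretrain.sh: runs checklist steps 1-5 in order
set -euo pipefail
sudo systemctl stop ollama
export PATH="/opt/miniconda3/bin:$PATH"
source /opt/miniconda3/etc/profile.d/conda.sh   # required for conda activate in scripts
conda activate axolotl
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
cd /home/tocmo0nlord/axolotl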
Run Training
axolotl train ~/human_chat_qlora.yml
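Training runs for hours, and a dropped SSH connection kills it. One option is launching inside tmux (assuming it is installed; sudo apt install -y tmux):
tmux new -s train                      # start a named session
axolotl train ~/human_chat_qlora.yml   # run inside it
# Detach with Ctrl-b d; reattach later with:
tmux attach -t train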
After Training
# Restart Ollama
sudo systemctl start ollama
# Test the adapter interactively
axolotl inference ~/human_chat_qlora.yml \
--lora-model-dir ~/outputs/llama31-8b-humanchat \
--prompter chat
# (Optional) Merge adapter into base model for standalone deployment
axolotl merge-lora ~/human_chat_qlora.yml
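To serve the merged model from Ollama, one option is importing it with a Modelfile. A sketch, assuming merge-lora writes the weights to a merged/ subfolder of the output dir (verify the actual path after merging) and that your Ollama version supports importing safetensors directories; the model name humanchat is hypothetical:
cat > ~/Modelfile <<'EOF'
FROM /home/tocmo0nlord/outputs/llama31-8b-humanchat/merged
EOF
ollama create humanchat -f ~/Modelfile
ollama run humanchat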
One-time Setup (fresh machine — after bare Ubuntu steps above)
1. Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /opt/miniconda3
/opt/miniconda3/bin/conda init bash
source ~/.bashrc
2. Create Python 3.11 environment
conda create -n axolotl python=3.11 -y
conda activate axolotl
3. Clone axolotl repo
git clone https://git.activeblue.net/tocmo0nlord/axolotl.git /home/tocmo0nlord/axolotl
cd /home/tocmo0nlord/axolotl
git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
git fetch upstream
git rebase upstream/main # keeps activeblue patches on top
4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
conda install -y -c "nvidia/label/cuda-12.8.0" cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
NOTE: Despite installing from the cuda-12.8.0 channel, conda resolves nvcc to 13.2.78. This is fine — use cu132 everywhere to match.
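To see which nvcc the env actually resolves to (this is where the 13.2.78 above comes from):
which nvcc    # should be /opt/miniconda3/envs/axolotl/bin/nvcc
nvcc --version | tail -1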
5. Install PyTorch — use cu132 (matches nvcc from conda)
# torchaudio has no cu132 wheel — skip it, not needed for LLM training
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
6. Install Axolotl
cd /home/tocmo0nlord/axolotl
pip install -e "."
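Quick check that the CLI landed on PATH (it should list the train / inference / merge-lora subcommands used below):
axolotl --help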
7. Install flash-attn
Compiles CUDA kernels from source — takes 15–25 min on 10 cores of an i7-14700K.
MAX_JOBS=10 pip install flash-attn --no-build-isolation
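Smoke-test the build before moving on:
python3 -c "import flash_attn; print('flash_attn', flash_attn.__version__)"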
8. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
Prebuilt wheels do not include sm_120. CUDA 13.2 also dropped sm_50–53. Must compile from source with a patched CMakeLists.txt.
# Clone bitsandbytes v0.49.1
git clone --branch v0.49.1 --depth 1 \
https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
# Patch CMakeLists.txt: insert sm_120 override before the foreach loop
# (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
# Verify patch landed correctly — set() line must appear immediately before foreach
grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
# Configure — must point cmake at conda's nvcc explicitly
cmake \
-DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
-DCOMPUTE_BACKEND=cuda \
-S /tmp/bnb_0491 \
-B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
# Must show: CUDA Capabilities Selected: 120
# Build (adjust -j to your CPU core count)
cmake --build /tmp/bnb_0491/build -j10
# Install into conda site-packages
cp -r /tmp/bnb_0491/bitsandbytes \
/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
# Verify CUDA works
python3 -c "
import torch, bitsandbytes as bnb
x = torch.randn(64, 64, device='cuda')
l = bnb.nn.Linear8bitLt(64, 64).cuda()
print('bitsandbytes CUDA OK:', l(x).shape)
"
9. Copy training config to home
cp /home/tocmo0nlord/axolotl/human_chat_qlora.yml ~/human_chat_qlora.yml
10. Verify the full stack
python3 -c "
import torch, bitsandbytes as bnb, flash_attn, transformers
print('torch :', torch.__version__, '| CUDA:', torch.version.cuda)
print('bitsandbytes:', bnb.__version__)
print('flash_attn :', flash_attn.__version__)
print('transformers:', transformers.__version__)
print('GPU :', torch.cuda.get_device_name(0))
print('VRAM :', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
"
Expected output:
torch : 2.x.x | CUDA: 13.2
bitsandbytes: 0.50.0.dev0
flash_attn : 2.x.x
transformers: 5.x.x
GPU : NVIDIA GeForce RTX 5080
VRAM : 16.3 GB
Training Config — human_chat_qlora.yml
Key settings tuned for RTX 5080 (16GB); a sketch of the matching config fragment follows the table:

| Setting | Value | Notes |
|---|---|---|
| `sequence_len` | 2048 | 4096 OOMs during loss computation (logits x 128k vocab) |
| `micro_batch_size` | 1 | Effective batch = micro x grad_accum = 8 |
| `gradient_accumulation_steps` | 8 | Keeps effective batch size at 8 |
| `adapter` | qlora | 4-bit via bitsandbytes compiled from source |
| `attn_implementation` | flash_attention_2 | Not the deprecated `flash_attention: true` |
| `type` (datasets) | chat_template | Not the deprecated `sharegpt` |
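Combined, the table corresponds to a fragment roughly like this (a sketch, not the full file: the base model, output dir, and remaining keys of human_chat_qlora.yml are assumed to already be in place, and the dataset stanza is illustrative):
# excerpt of ~/human_chat_qlora.yml
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
adapter: qlora
load_in_4bit: true
attn_implementation: flash_attention_2
datasets:
  - path: Open-Orca/SlimOrca        # illustrative; see Common Pitfalls for dataset choice
    type: chat_template
    field_messages: conversations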
Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
- VRAM: ~10–11 GB active, ~11 GB allocated
- Training duration: ~3.5 hours
- Initial eval loss: ~0.81, perplexity ~2.25
- Final loss target: ~0.55–0.60
To push VRAM usage to ~14 GB and improve throughput, set `micro_batch_size: 2` and `gradient_accumulation_steps: 4` (effective batch stays at 8).
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| `externally-managed-environment` | System Python 3.13 blocks pip | Use the conda env, never system pip |
| `No module named torch` (flash-attn) | pip builds in an isolated env | Use `--no-build-isolation` |
| `CUDA_HOME` not set | CUDA toolkit not installed | `conda install cuda-toolkit` from the nvidia channel |
| CUDA version mismatch (13.2 vs 12.8) | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
| torchaudio not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
| `CUDA Capabilities Selected: 50;52;...` ignores `-D` flag | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides `-D` | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before the foreach loop |
| `BackendUnavailable: scikit_build_core` | pip install of bnb triggers a cmake rebuild | Copy the built package directly into site-packages instead (step 8) |
| `torch.OutOfMemoryError` during eval | Logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
| `type: sharegpt` deprecation warning | axolotl removed the sharegpt type | Use `type: chat_template` with field mappings |
| `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
| Capybara dataset `field_messages` null | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
| Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
| Training much slower than eval speed | The fast it/s on screen is the eval loop (forward only) | Normal; training includes the backward pass and optimizer step (~3.5h total) |
| ubuntu-drivers installs wrong NVIDIA version | Multiple driver candidates available | Force with `apt install nvidia-driver-580` |
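Most of these can be caught before a run with one check script. A sketch (hypothetical ~/preflight.sh):
#!/usr/bin/env bash
# Hypothetical ~/preflight.sh: catches the most common pitfalls above
set -u
[ -n "${CUDA_HOME:-}" ] && echo "CUDA_HOME: $CUDA_HOME" || echo "WARN: CUDA_HOME not set"
command -v nvcc >/dev/null && nvcc --version | tail -1 || echo "WARN: nvcc not on PATH"
python3 -c "import torch; print('torch CUDA:', torch.version.cuda)"   # should print 13.2
systemctl is-active --quiet ollama && echo "WARN: ollama still running" || echo "ollama stopped"
nvidia-smi --query-compute-apps=pid,used_memory --format=csv   # should list no processes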