From c6da9b9e925161e6fcaefcb37e405e186d5971bb Mon Sep 17 00:00:00 2001
From: tocmo0nlord
Date: Wed, 13 May 2026 21:33:02 +0000
Subject: [PATCH] Update SETUP_MIAAI.md: add bare Ubuntu rebuild section (driver, packages, Ollama)

---
 SETUP_MIAAI.md | 114 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 87 insertions(+), 27 deletions(-)

diff --git a/SETUP_MIAAI.md b/SETUP_MIAAI.md
index faf249e3b..fadcfe767 100644
--- a/SETUP_MIAAI.md
+++ b/SETUP_MIAAI.md
@@ -9,12 +9,61 @@
 
 ---
 
-## Pre-Training Checklist (every time)
+## Starting from Bare Ubuntu 25.10
 
-Before starting a training run, verify these:
+If rebuilding from scratch, complete these steps first before anything else.
+
+### A. System packages
+```bash
+sudo apt update && sudo apt upgrade -y
+sudo apt install -y \
+    build-essential cmake git curl wget \
+    python3-dev libssl-dev zlib1g-dev \
+    ca-certificates gnupg lsb-release
+```
+
+### B. NVIDIA driver (580.xx)
+Ubuntu 25.10 is too new for NVIDIA's apt repo. Install via ubuntu-drivers:
+```bash
+sudo ubuntu-drivers autoinstall
+sudo reboot
+```
+
+After reboot, verify:
+```bash
+nvidia-smi
+# Must show: NVIDIA GeForce RTX 5080, Driver Version: 580.x
+```
+
+If ubuntu-drivers installs the wrong version, force the right one:
+```bash
+sudo apt install -y nvidia-driver-580
+sudo reboot
+```
+
+### C. Install Ollama
+```bash
+curl -fsSL https://ollama.com/install.sh | sh
+
+# Verify it's running
+systemctl status ollama
+```
+
+### D. HuggingFace CLI
+```bash
+pip3 install huggingface_hub
+huggingface-cli login
+# Paste your HF token — required for gated models like meta-llama
+```
+
+Once steps A–D are done, continue with the One-time Setup below.
+
+---
+
+## Pre-Training Checklist (every session)
 
 ```bash
-# 1. Stop Ollama — if a request hits it mid-training it will compete for VRAM
+# 1. Stop Ollama — if it receives a request mid-training it will compete for VRAM
 sudo systemctl stop ollama
 
 # 2. Activate conda env
@@ -26,7 +75,7 @@ export CUDA_HOME=$CONDA_PREFIX
 export PATH=$CUDA_HOME/bin:$PATH
 export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 
-# 4. Confirm GPU is clear (should show no processes)
+# 4. Confirm GPU is clear (should show no processes before training)
 nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
 
 # 5. Go to axolotl directory
@@ -43,18 +92,18 @@ axolotl train ~/human_chat_qlora.yml
 # Restart Ollama
 sudo systemctl start ollama
 
-# Test the adapter
+# Test the adapter interactively
 axolotl inference ~/human_chat_qlora.yml \
   --lora-model-dir ~/outputs/llama31-8b-humanchat \
   --prompter chat
 
-# (Optional) Merge adapter into base model
+# (Optional) Merge adapter into base model for standalone deployment
 axolotl merge-lora ~/human_chat_qlora.yml
 ```
 
 ---
 
-## One-time Setup (fresh machine only)
+## One-time Setup (fresh machine — after bare Ubuntu steps above)
 
 ### 1. Install Miniconda
 ```bash
@@ -115,16 +164,17 @@ Must compile from source with a patched CMakeLists.txt.
 
 ```bash
 # Clone bitsandbytes v0.49.1
-git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
+git clone --branch v0.49.1 --depth 1 \
+    https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
 
 # Patch CMakeLists.txt: insert sm_120 override before the foreach loop
 # (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
 sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
 
-# Verify patch landed correctly (should show the set() line immediately before foreach)
+# Verify patch landed correctly — set() line must appear immediately before foreach
 grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
 
-# Configure
+# Configure — must point cmake at conda's nvcc explicitly
 cmake \
   -DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
   -DCOMPUTE_BACKEND=cuda \
@@ -132,13 +182,14 @@ cmake \
   -B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
 # Must show: CUDA Capabilities Selected: 120
 
-# Build
+# Build (adjust -j to your CPU core count)
 cmake --build /tmp/bnb_0491/build -j10
 
 # Install into conda site-packages
-cp -r /tmp/bnb_0491/bitsandbytes /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
+cp -r /tmp/bnb_0491/bitsandbytes \
+    /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
 
-# Verify
+# Verify CUDA works
 python3 -c "
 import torch, bitsandbytes as bnb
 x = torch.randn(64, 64, device='cuda')
@@ -147,25 +198,34 @@ print('bitsandbytes CUDA OK:', l(x).shape)
 "
 ```
 
-### 9. HuggingFace login (meta-llama is gated)
+### 9. Copy training config to home
 ```bash
-huggingface-cli login
-# Paste your HF token when prompted
+cp /home/tocmo0nlord/axolotl/human_chat_qlora.yml ~/human_chat_qlora.yml
 ```
 
-### 10. Verify everything is working
+### 10. Verify the full stack
 ```bash
 python3 -c "
-import torch, bitsandbytes as bnb, flash_attn, transformers, axolotl
-print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)
+import torch, bitsandbytes as bnb, flash_attn, transformers
+print('torch       :', torch.__version__, '| CUDA:', torch.version.cuda)
 print('bitsandbytes:', bnb.__version__)
-print('flash_attn:', flash_attn.__version__)
+print('flash_attn  :', flash_attn.__version__)
 print('transformers:', transformers.__version__)
-print('GPU:', torch.cuda.get_device_name(0))
-print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
+print('GPU         :', torch.cuda.get_device_name(0))
+print('VRAM        :', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
 "
 ```
 
+Expected output:
+```
+torch       : 2.x.x | CUDA: 13.2
+bitsandbytes: 0.50.0.dev0
+flash_attn  : 2.x.x
+transformers: 5.x.x
+GPU         : NVIDIA GeForce RTX 5080
+VRAM        : 16.3 GB
+```
+
 ---
 
 ## Training Config — human_chat_qlora.yml
@@ -176,7 +236,7 @@ Key settings tuned for RTX 5080 (16GB):
 |---|---|---|
 | `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
 | `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
-| `gradient_accumulation_steps` | `8` | Keeps effective batch at 8 |
+| `gradient_accumulation_steps` | `8` | Keeps effective batch size at 8 |
 | `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
 | `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
 | `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
@@ -187,8 +247,7 @@ Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
 - Initial eval loss: ~0.81, perplexity ~2.25
 - Final loss target: ~0.55–0.60
 
-To use more VRAM (~14GB) and improve gradient signal, increase `micro_batch_size: 2`
-(adjust `gradient_accumulation_steps: 4` to keep effective batch at 8).
+To push VRAM to ~14GB and improve training: set `micro_batch_size: 2` and `gradient_accumulation_steps: 4`.
 
 ---
 
@@ -203,11 +262,12 @@ To use more VRAM (~14GB) and improve gradient signal, increase `micro_batch_size
 | `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
 | flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=` before pip install |
 | `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
-| `CUDA Capabilities Selected: 50;52;...` (ignores -D flag) | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
+| `CUDA Capabilities Selected: 50;52;...` ignores -D flag | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
 | `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Copy .so directly to site-packages instead |
 | `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
 | `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
 | `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
 | Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
 | Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
-| Training slower than expected (~3.5h not 19min) | The fast it/s on screen is the eval loop, not training | Normal — training includes backward pass and optimizer |
+| Training much slower than eval speed | The fast it/s on screen is the eval loop (forward only) | Normal — training includes backward pass and optimizer (~3.5h total) |
+| ubuntu-drivers installs wrong NVIDIA version | Multiple driver candidates available | Force with `apt install nvidia-driver-580` |
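
Note on the batch-size and `sequence_len` rows touched by this patch: the arithmetic behind them can be sanity-checked with a short sketch. It is a back-of-envelope estimate, not a measurement. The vocabulary size of 128,256 (the Llama 3.1 tokenizer) and float32 logits are assumptions that the patch does not state, and the helper names `effective_batch` and `logits_bytes` are hypothetical.

```python
# Back-of-envelope check for the batch-size and sequence_len settings.
# Assumptions (not stated in the patch): a vocabulary of 128,256 tokens
# (Llama 3.1 tokenizer) and float32 logits at 4 bytes per element.

VOCAB_SIZE = 128_256
BYTES_PER_LOGIT = 4


def effective_batch(micro_batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step on a single GPU."""
    return micro_batch_size * grad_accum_steps


def logits_bytes(micro_batch_size: int, sequence_len: int) -> int:
    """Size in bytes of one logits tensor of shape (batch, seq, vocab)."""
    return micro_batch_size * sequence_len * VOCAB_SIZE * BYTES_PER_LOGIT


if __name__ == "__main__":
    # Both configurations mentioned in the guide give the same effective batch.
    for micro, accum in ((1, 8), (2, 4)):
        print(f"micro={micro} accum={accum} -> effective batch {effective_batch(micro, accum)}")

    # Halving sequence_len halves the logits tensor.
    for seq_len in (2048, 4096):
        size_gb = logits_bytes(1, seq_len) / 1e9
        print(f"sequence_len={seq_len}: logits ~{size_gb:.2f} GB")
```

Under these assumptions a single logits tensor is roughly 1.05 GB at `sequence_len: 2048` and 2.10 GB at 4096. The loss computation typically holds more than one tensor of this shape at a time, which is consistent with the out-of-memory behaviour the table describes on a 16GB card.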