diff --git a/SETUP_MIAAI.md b/SETUP_MIAAI.md
index 1563d6fc1..faf249e3b 100644
--- a/SETUP_MIAAI.md
+++ b/SETUP_MIAAI.md
@@ -4,11 +4,57 @@
 - GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
 - Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
 - OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
-- Axolotl branch: `activeblue/main`
+- Axolotl repo: `/home/tocmo0nlord/axolotl` (branch: `activeblue/main`)
+- Conda env: `axolotl` at `/opt/miniconda3/envs/axolotl`
 
 ---
 
-## One-time Setup
+## Pre-Training Checklist (every time)
+
+Before starting a training run, verify these:
+
+```bash
+# 1. Stop Ollama — if a request hits it mid-training it will compete for VRAM
+sudo systemctl stop ollama
+
+# 2. Activate conda env
+export PATH="/opt/miniconda3/bin:$PATH"
+conda activate axolotl
+
+# 3. Set env vars
+export CUDA_HOME=$CONDA_PREFIX
+export PATH=$CUDA_HOME/bin:$PATH
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# 4. Confirm GPU is clear (should show no processes)
+nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
+
+# 5. Go to axolotl directory
+cd /home/tocmo0nlord/axolotl
+```
+
+## Run Training
+```bash
+axolotl train ~/human_chat_qlora.yml
+```
+
+## After Training
+```bash
+# Restart Ollama
+sudo systemctl start ollama
+
+# Test the adapter
+axolotl inference ~/human_chat_qlora.yml \
+    --lora-model-dir ~/outputs/llama31-8b-humanchat \
+    --prompter chat
+
+# (Optional) Merge adapter into base model
+axolotl merge-lora ~/human_chat_qlora.yml
+```
+
+---
+
+## One-time Setup (fresh machine only)
 
 ### 1. Install Miniconda
 ```bash
@@ -24,14 +70,13 @@ conda create -n axolotl python=3.11 -y
 conda activate axolotl
 ```
 
-### 3. Clone and sync repo with upstream
+### 3. Clone axolotl repo
 ```bash
-git clone https://git.activeblue.net/tocmo0nlord/axolotl.git
-cd axolotl
+git clone https://git.activeblue.net/tocmo0nlord/axolotl.git /home/tocmo0nlord/axolotl
+cd /home/tocmo0nlord/axolotl
 git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
 git fetch upstream
 git rebase upstream/main   # keeps activeblue patches on top
-git push origin activeblue/main --force-with-lease
 ```
 
 ### 4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
@@ -45,55 +90,53 @@ export PATH=$CUDA_HOME/bin:$PATH
 > This is fine — use cu132 everywhere to match.
 
 ### 5. Install PyTorch — use cu132 (matches nvcc from conda)
-> NOTE: torchaudio has no cu132 wheel — skip it, not needed for LLM training
 ```bash
+# torchaudio has no cu132 wheel — skip it, not needed for LLM training
 pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
 python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
 ```
 
 ### 6. Install Axolotl
 ```bash
+cd /home/tocmo0nlord/axolotl
 pip install -e "."
 ```
 
-> **flash-attn compiles CUDA kernels from source — takes 15–25 min on 10 cores of i7-14700K.**
-> Always set `MAX_JOBS` to the number of available CPU cores:
+### 7. Install flash-attn
+> Compiles CUDA kernels from source — takes 15–25 min on 10 cores of i7-14700K.
+> Set `MAX_JOBS` to the number of available CPU cores:
 ```bash
 MAX_JOBS=10 pip install flash-attn --no-build-isolation
 ```
 
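+A quick smoke test here catches a broken build immediately instead of at training startup (a minimal sketch; the import fails loudly if the kernels did not compile, and `flash_attn_func` is flash-attn's public attention entry point):
+```bash
+python -c "import flash_attn; from flash_attn import flash_attn_func; print('flash-attn', flash_attn.__version__, 'OK')"
+```
+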
-### 7. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
+### 8. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
 
-The prebuilt bitsandbytes wheels do not include sm_120 support and CUDA 13.2 dropped sm_50–53.
-You must compile from source with a patched CMakeLists.txt.
+Prebuilt wheels do not include sm_120, and CUDA 13.2 dropped sm_50–53.
+You must compile from source with a patched CMakeLists.txt.
 
 ```bash
 # Clone bitsandbytes v0.49.1
 git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
-cd /tmp/bnb_0491
 
-# Patch CMakeLists.txt: override arch list to sm_120 only, just before the foreach loop
-# (cmake >= 3.23.0 skips the manual arch block and uses its own built-in list which lacks sm_120)
-sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch: override before capability list is built\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' CMakeLists.txt
+# Patch CMakeLists.txt: insert sm_120 override before the foreach loop
+# (cmake >= 3.23.0 uses its own built-in arch list, which does not include sm_120)
+sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
 
-# Verify the patch landed at the right line
-grep -n "ARCHITECTURES_ALL\|foreach" CMakeLists.txt | tail -5
-# Should show: set(CMAKE_CUDA_ARCHITECTURES_ALL 120) immediately before the foreach line
+# Verify the patch landed correctly (the set() line should appear immediately before the foreach)
+grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
 
-# Configure — must point cmake at conda's nvcc
+# Configure — point cmake at conda's nvcc
 cmake \
   -DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
   -DCOMPUTE_BACKEND=cuda \
   -S /tmp/bnb_0491 \
   -B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
-# Expected: "CUDA Capabilities Selected: 120"
+# Must show: CUDA Capabilities Selected: 120
 
-# Build (j10 uses 10 cores — adjust to your CPU)
+# Build (-j10 uses 10 cores — adjust to your CPU)
 cmake --build /tmp/bnb_0491/build -j10
 
 # Install into conda site-packages
-SITE_PKG=/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages
-cp -r /tmp/bnb_0491/bitsandbytes "$SITE_PKG/"
+cp -r /tmp/bnb_0491/bitsandbytes /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
 
 # Verify
 python3 -c "
@@ -104,61 +147,52 @@ print('bitsandbytes CUDA OK:', l(x).shape)
 "
 ```
 
 ---
 
-## Every Session (after first-time setup)
+### 9. HuggingFace login (meta-llama is gated)
 ```bash
-export PATH="/opt/miniconda3/bin:$PATH"
-conda activate axolotl
-export CUDA_HOME=$CONDA_PREFIX
-export PATH=$CUDA_HOME/bin:$PATH
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-cd /home/tocmo0nlord/axolotl
+huggingface-cli login
+# Paste your HF token when prompted
+```
+
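+With the token cached, the gated base weights can be pre-downloaded so the first training run does not stall on a multi-GB fetch (a sketch: `huggingface-cli download` pre-fetches a repo into the local cache; the repo id below is an assumption, so substitute the `base_model` value from your YAML):
+```bash
+# Repo id is an assumption — use the `base_model` value from your YAML
+huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
+```
+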
+### 10. Verify everything is working
+```bash
+python3 -c "
+import torch, bitsandbytes as bnb, flash_attn, transformers, axolotl
+print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)
+print('bitsandbytes:', bnb.__version__)
+print('flash_attn:', flash_attn.__version__)
+print('transformers:', transformers.__version__)
+print('GPU:', torch.cuda.get_device_name(0))
+print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
+"
 ```
 
 ---
 
 ## Training Config — human_chat_qlora.yml
 
-Key settings that work on RTX 5080 (16GB):
+Key settings tuned for RTX 5080 (16GB):
 
 | Setting | Value | Notes |
 |---|---|---|
-| `sequence_len` | `2048` | 4096 causes OOM during loss computation (logits x 128k vocab) |
-| `micro_batch_size` | `1` | Keep low; effective batch = micro x grad_accum |
-| `gradient_accumulation_steps` | `8` | Effective batch = 8 |
-| `adapter` | `qlora` | QLoRA 4-bit via bitsandbytes |
+| `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
+| `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
+| `gradient_accumulation_steps` | `8` | Keeps effective batch at 8 |
+| `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
 | `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
 | `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
 
-Dataset fields for SlimOrca / OpenHermes-2.5 (sharegpt-format with different field names):
-```yaml
-datasets:
-  - path: Open-Orca/SlimOrca
-    type: chat_template
-    field_messages: conversations
-    message_field_role: from
-    message_field_content: value
-    split: "train[:3%]"
-```
+Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
+- VRAM: ~10–11 GB active, ~11 GB allocated
+- Training duration: ~3.5 hours
+- Initial eval loss: ~0.81, perplexity ~2.25
+- Final loss target: ~0.55–0.60
 
-## Run Training
-```bash
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-axolotl train ~/human_chat_qlora.yml
-```
-
-Expected startup sequence:
-1. Config validation + capability detection (shows `sm_120`)
-2. Dataset tokenization (~65k samples, ~30 seconds)
-3. `Loading weights: 100% 291/291`
-4. `trainable params: 167,772,160 || all params: 8,198,033,408 || trainable%: 2.05`
-5. Initial eval: loss ~0.81, perplexity ~2.25, VRAM ~8.5GB
-6. Training steps at ~2.6 it/s, VRAM ~9-10GB
+To use more VRAM (~14GB) and improve gradient signal, raise `micro_batch_size` to `2`
+(and lower `gradient_accumulation_steps` to `4` to keep the effective batch at 8); see the sketch below.
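+
+As a config delta, that is a two-key change in `human_chat_qlora.yml` (a sketch; only these keys change, everything else stays as in the table above):
+```yaml
+micro_batch_size: 2             # was 1; roughly doubles activation and logits VRAM
+gradient_accumulation_steps: 4  # was 8; 2 x 4 keeps the effective batch at 8
+```
+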
 ---
 
-## Common Pitfalls Encountered
+## Common Pitfalls
 
 | Problem | Cause | Fix |
 |---|---|---|
@@ -167,12 +201,13 @@ Expected startup sequence:
 | `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
 | `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
 | `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
-| `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
 | flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=` before pip install |
-| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 7 above) |
-| `CUDA Capabilities Selected: 50;52;...` (ignores sm_120) | cmake >= 3.23 built-in arch list lacks sm_120 | Add `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
-| `BackendUnavailable: scikit_build_core` | pip install of bnb tries to rebuild | Copy .so directly to site-packages instead |
+| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
+| `CUDA Capabilities Selected: 50;52;...` (ignores -D flag) | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
+| `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Copy .so directly to site-packages instead |
 | `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
 | `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
 | `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
 | Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
+| Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
+| Training slower than expected (~3.5h, not 19min) | The fast it/s shown on screen is the eval loop, not training | Normal — training steps include the backward pass and optimizer update |
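+
+For the `sharegpt` and `field_messages` rows above, a known-good `datasets` stanza (this is the SlimOrca example this guide previously carried; `conversations`/`from`/`value` are SlimOrca's field names):
+```yaml
+datasets:
+  - path: Open-Orca/SlimOrca
+    type: chat_template
+    field_messages: conversations
+    message_field_role: from
+    message_field_content: value
+    split: "train[:3%]"
+```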