Update SETUP_MIAAI.md: pre-training checklist, Ollama stop/start, verify script, corrected training time
167
SETUP_MIAAI.md
@@ -4,11 +4,57 @@
|
|||||||
- GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
|
- GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
|
||||||
- Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
|
- Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
|
||||||
- OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
|
- OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
|
||||||
- Axolotl branch: `activeblue/main`
|
- Axolotl repo: `/home/tocmo0nlord/axolotl` (branch: `activeblue/main`)
|
||||||
|
- Conda env: `axolotl` at `/opt/miniconda3/envs/axolotl`
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## One-time Setup
|
## Pre-Training Checklist (every time)
|
||||||
|
|
||||||
|
Before starting a training run, verify these:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. Stop Ollama — if a request hits it mid-training it will compete for VRAM
|
||||||
|
sudo systemctl stop ollama
|
||||||
|
|
||||||
|
# 2. Activate conda env
|
||||||
|
export PATH="/opt/miniconda3/bin:$PATH"
|
||||||
|
conda activate axolotl
|
||||||
|
|
||||||
|
# 3. Set env vars
|
||||||
|
export CUDA_HOME=$CONDA_PREFIX
|
||||||
|
export PATH=$CUDA_HOME/bin:$PATH
|
||||||
|
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
|
||||||
|
|
||||||
|
# 4. Confirm GPU is clear (should show no processes)
|
||||||
|
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
|
||||||
|
|
||||||
|
# 5. Go to axolotl directory
|
||||||
|
cd /home/tocmo0nlord/axolotl
|
||||||
|
```
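The GPU check in step 4 can also be scripted. The sketch below parses the CSV text that the `nvidia-smi --query-compute-apps` command prints and lists any compute processes still running. The function name `parse_compute_apps` and the sample text are hypothetical, and the header wording is assumed; only the three queried columns come from the checklist above.

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse CSV text from the nvidia-smi compute-apps query into a list of dicts."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    # The first row is the header; anything after it is a running compute process.
    return [
        {"pid": int(pid), "process_name": name, "used_memory": mem}
        for pid, name, mem in rows[1:]
    ]


# Hypothetical sample output for illustration, not a captured log.
sample = (
    "pid, process_name, used_gpu_memory [MiB]\n"
    "4242, /usr/local/bin/ollama, 5120 MiB\n"
)
busy = parse_compute_apps(sample)
print("GPU clear" if not busy else f"GPU busy: {busy}")
```

To run it against the live GPU, pass in the `stdout` of `subprocess.run([...], capture_output=True, text=True)` for the same command instead of `sample`.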
|
||||||
|
|
||||||
|
## Run Training
|
||||||
|
```bash
|
||||||
|
axolotl train ~/human_chat_qlora.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## After Training
|
||||||
|
```bash
|
||||||
|
# Restart Ollama
|
||||||
|
sudo systemctl start ollama
|
||||||
|
|
||||||
|
# Test the adapter
|
||||||
|
axolotl inference ~/human_chat_qlora.yml \
|
||||||
|
--lora-model-dir ~/outputs/llama31-8b-humanchat \
|
||||||
|
--prompter chat
|
||||||
|
|
||||||
|
# (Optional) Merge adapter into base model
|
||||||
|
axolotl merge-lora ~/human_chat_qlora.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## One-time Setup (fresh machine only)
|
||||||
|
|
||||||
### 1. Install Miniconda
|
### 1. Install Miniconda
|
||||||
```bash
|
```bash
|
||||||
@@ -24,14 +70,13 @@ conda create -n axolotl python=3.11 -y
|
|||||||
conda activate axolotl
|
conda activate axolotl
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Clone and sync repo with upstream
|
### 3. Clone axolotl repo
|
||||||
```bash
|
```bash
|
||||||
git clone https://git.activeblue.net/tocmo0nlord/axolotl.git
|
git clone https://git.activeblue.net/tocmo0nlord/axolotl.git /home/tocmo0nlord/axolotl
|
||||||
cd axolotl
|
cd /home/tocmo0nlord/axolotl
|
||||||
git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
|
git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
|
||||||
git fetch upstream
|
git fetch upstream
|
||||||
git rebase upstream/main # keeps activeblue patches on top
|
git rebase upstream/main # keeps activeblue patches on top
|
||||||
git push origin activeblue/main --force-with-lease
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
|
### 4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
|
||||||
@@ -45,55 +90,53 @@ export PATH=$CUDA_HOME/bin:$PATH
|
|||||||
> This is fine — use cu132 everywhere to match.
|
> This is fine — use cu132 everywhere to match.
|
||||||
|
|
||||||
### 5. Install PyTorch — use cu132 (matches nvcc from conda)
|
### 5. Install PyTorch — use cu132 (matches nvcc from conda)
|
||||||
> NOTE: torchaudio has no cu132 wheel — skip it, not needed for LLM training
|
|
||||||
```bash
|
```bash
|
||||||
|
# torchaudio has no cu132 wheel — skip it, not needed for LLM training
|
||||||
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
|
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
|
||||||
python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
|
python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
|
||||||
```
|
```
|
||||||
|
|
||||||
### 6. Install Axolotl
|
### 6. Install Axolotl
|
||||||
```bash
|
```bash
|
||||||
|
cd /home/tocmo0nlord/axolotl
|
||||||
pip install -e "."
|
pip install -e "."
|
||||||
```
|
```
|
||||||
|
|
||||||
> **flash-attn compiles CUDA kernels from source — takes 15–25 min on 10 cores of i7-14700K.**
|
### 7. Install flash-attn
|
||||||
> Always set `MAX_JOBS` to the number of available CPU cores:
|
> Compiles CUDA kernels from source — takes 15–25 min on 10 cores of i7-14700K.
|
||||||
```bash
|
```bash
|
||||||
MAX_JOBS=10 pip install flash-attn --no-build-isolation
|
MAX_JOBS=10 pip install flash-attn --no-build-isolation
|
||||||
```
|
```
|
||||||
|
|
||||||
### 7. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
|
### 8. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
|
||||||
|
|
||||||
The prebuilt bitsandbytes wheels do not include sm_120 support and CUDA 13.2 dropped sm_50–53.
|
Prebuilt wheels do not include sm_120. CUDA 13.2 also dropped sm_50–53.
|
||||||
You must compile from source with a patched CMakeLists.txt.
|
Must compile from source with a patched CMakeLists.txt.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Clone bitsandbytes v0.49.1
|
# Clone bitsandbytes v0.49.1
|
||||||
git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
|
git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
|
||||||
cd /tmp/bnb_0491
|
|
||||||
|
|
||||||
# Patch CMakeLists.txt: override arch list to sm_120 only, just before the foreach loop
|
# Patch CMakeLists.txt: insert sm_120 override before the foreach loop
|
||||||
# (cmake >= 3.23.0 skips the manual arch block and uses its own built-in list which lacks sm_120)
|
# (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
|
||||||
sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch: override before capability list is built\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' CMakeLists.txt
|
sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
|
||||||
|
|
||||||
# Verify the patch landed at the right line
|
# Verify patch landed correctly (should show the set() line immediately before foreach)
|
||||||
grep -n "ARCHITECTURES_ALL\|foreach" CMakeLists.txt | tail -5
|
grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
|
||||||
# Should show: set(CMAKE_CUDA_ARCHITECTURES_ALL 120) immediately before the foreach line
|
|
||||||
|
|
||||||
# Configure — must point cmake at conda's nvcc
|
# Configure
|
||||||
cmake \
|
cmake \
|
||||||
-DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
|
-DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
|
||||||
-DCOMPUTE_BACKEND=cuda \
|
-DCOMPUTE_BACKEND=cuda \
|
||||||
-S /tmp/bnb_0491 \
|
-S /tmp/bnb_0491 \
|
||||||
-B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
|
-B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
|
||||||
# Expected: "CUDA Capabilities Selected: 120"
|
# Must show: CUDA Capabilities Selected: 120
|
||||||
|
|
||||||
# Build (j10 uses 10 cores — adjust to your CPU)
|
# Build
|
||||||
cmake --build /tmp/bnb_0491/build -j10
|
cmake --build /tmp/bnb_0491/build -j10
|
||||||
|
|
||||||
# Install into conda site-packages
|
# Install into conda site-packages
|
||||||
SITE_PKG=/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages
|
cp -r /tmp/bnb_0491/bitsandbytes /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
|
||||||
cp -r /tmp/bnb_0491/bitsandbytes "$SITE_PKG/"
|
|
||||||
|
|
||||||
# Verify
|
# Verify
|
||||||
python3 -c "
|
python3 -c "
|
||||||
@@ -104,61 +147,52 @@ print('bitsandbytes CUDA OK:', l(x).shape)
|
|||||||
"
|
"
|
||||||
```
|
```
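The `grep` check on the patch can be made stricter with a few lines of Python. This is a minimal sketch that assumes the `foreach` line matches the pattern used in the `sed` command above; the helper name `patch_precedes_foreach` and both sample strings are hypothetical.

```python
def patch_precedes_foreach(cmake_text):
    """Return True if the sm_120 override is on the line directly before the foreach loop."""
    lines = [line.strip() for line in cmake_text.splitlines()]
    for index, line in enumerate(lines):
        if line.startswith("foreach(capability ${CMAKE_CUDA_ARCHITECTURES_ALL})"):
            return index > 0 and lines[index - 1] == "set(CMAKE_CUDA_ARCHITECTURES_ALL 120)"
    # No foreach line found at all: treat as not patched.
    return False


# Hypothetical excerpts for illustration, not the real CMakeLists.txt.
patched = """
    # RTX 5080 sm_120 patch
    set(CMAKE_CUDA_ARCHITECTURES_ALL 120)
    foreach(capability ${CMAKE_CUDA_ARCHITECTURES_ALL})
"""
unpatched = """
    foreach(capability ${CMAKE_CUDA_ARCHITECTURES_ALL})
"""
print(patch_precedes_foreach(patched), patch_precedes_foreach(unpatched))
```

Reading `/tmp/bnb_0491/CMakeLists.txt` with `pathlib.Path(...).read_text()` and passing the result to the function would perform the same check on the real file.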
|
||||||
|
|
||||||
---
|
### 9. HuggingFace login (meta-llama is gated)
|
||||||
|
|
||||||
## Every Session (after first-time setup)
|
|
||||||
```bash
|
```bash
|
||||||
export PATH="/opt/miniconda3/bin:$PATH"
|
huggingface-cli login
|
||||||
conda activate axolotl
|
# Paste your HF token when prompted
|
||||||
export CUDA_HOME=$CONDA_PREFIX
|
```
|
||||||
export PATH=$CUDA_HOME/bin:$PATH
|
|
||||||
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
|
### 10. Verify everything is working
|
||||||
cd /home/tocmo0nlord/axolotl
|
```bash
|
||||||
|
python3 -c "
|
||||||
|
import torch, bitsandbytes as bnb, flash_attn, transformers, axolotl
|
||||||
|
print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)
|
||||||
|
print('bitsandbytes:', bnb.__version__)
|
||||||
|
print('flash_attn:', flash_attn.__version__)
|
||||||
|
print('transformers:', transformers.__version__)
|
||||||
|
print('GPU:', torch.cuda.get_device_name(0))
|
||||||
|
print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
|
||||||
|
"
|
||||||
```
|
```
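If the one-liner above fails on the first missing import, it does not say what else is absent. A possible companion check, sketched here with a hypothetical helper `missing_modules`, reports every package from the same list that cannot be found, without importing any of them.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Same top-level packages that the verification script imports.
required = ["torch", "bitsandbytes", "flash_attn", "transformers", "axolotl"]
missing = missing_modules(required)
print("all present" if not missing else f"missing: {missing}")
```

This only confirms that the packages are installed; it does not exercise CUDA, so the full verification script is still the real test.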
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Training Config — human_chat_qlora.yml
|
## Training Config — human_chat_qlora.yml
|
||||||
|
|
||||||
Key settings that work on RTX 5080 (16GB):
|
Key settings tuned for RTX 5080 (16GB):
|
||||||
|
|
||||||
| Setting | Value | Notes |
|
| Setting | Value | Notes |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `sequence_len` | `2048` | 4096 causes OOM during loss computation (logits x 128k vocab) |
|
| `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
|
||||||
| `micro_batch_size` | `1` | Keep low; effective batch = micro x grad_accum |
|
| `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
|
||||||
| `gradient_accumulation_steps` | `8` | Effective batch = 8 |
|
| `gradient_accumulation_steps` | `8` | Keeps effective batch at 8 |
|
||||||
| `adapter` | `qlora` | QLoRA 4-bit via bitsandbytes |
|
| `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
|
||||||
| `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
|
| `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
|
||||||
| `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
|
| `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
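The `sequence_len` note can be made concrete with a rough size estimate for a single logits tensor. The sketch assumes a vocabulary of 128,256 tokens (the table only says "128k") and 4 bytes per value, as if the logits were upcast to float32 for the loss; both figures are assumptions, and the real peak depends on how many copies the loss computation keeps alive.

```python
def logits_gib(batch, seq_len, vocab_size, bytes_per_value=4):
    """Size in GiB of one logits tensor of shape (batch, seq_len, vocab_size)."""
    return batch * seq_len * vocab_size * bytes_per_value / 2**30


VOCAB = 128_256  # assumption: the table only states "128k vocab"

for seq_len in (2048, 4096):
    print(f"sequence_len={seq_len}: {logits_gib(1, seq_len, VOCAB):.2f} GiB per logits copy")
```

Under these assumptions, halving `sequence_len` halves each copy from roughly 1.96 GiB to roughly 0.98 GiB, which matters on a 16GB card that already holds the model weights and activations.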
|
||||||
|
|
||||||
Dataset fields for SlimOrca / OpenHermes-2.5 (sharegpt-format with different field names):
|
Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
|
||||||
```yaml
|
- VRAM: ~10–11 GB active, ~11 GB allocated
|
||||||
datasets:
|
- Training duration: ~3.5 hours
|
||||||
- path: Open-Orca/SlimOrca
|
- Initial eval loss: ~0.81, perplexity ~2.25
|
||||||
type: chat_template
|
- Final loss target: ~0.55–0.60
|
||||||
field_messages: conversations
|
|
||||||
message_field_role: from
|
|
||||||
message_field_content: value
|
|
||||||
split: "train[:3%]"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Run Training
|
To use more VRAM (~14GB) and improve gradient signal, set `micro_batch_size: 2`
|
||||||
```bash
|
(adjust `gradient_accumulation_steps: 4` to keep effective batch at 8).
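
The trade-off between the two settings is plain multiplication. The sketch below uses a hypothetical helper `effective_batch` with a single-GPU default to confirm that both combinations mentioned here give an effective batch of 8.

```python
def effective_batch(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


# Current config versus the higher-VRAM alternative described above.
for micro, accum in ((1, 8), (2, 4)):
    print(f"micro_batch_size={micro}, gradient_accumulation_steps={accum} "
          f"-> effective batch {effective_batch(micro, accum)}")
```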
|
||||||
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
|
|
||||||
axolotl train ~/human_chat_qlora.yml
|
|
||||||
```
|
|
||||||
|
|
||||||
Expected startup sequence:
|
|
||||||
1. Config validation + capability detection (shows `sm_120`)
|
|
||||||
2. Dataset tokenization (~65k samples, ~30 seconds)
|
|
||||||
3. `Loading weights: 100% 291/291`
|
|
||||||
4. `trainable params: 167,772,160 || all params: 8,198,033,408 || trainable%: 2.05`
|
|
||||||
5. Initial eval: loss ~0.81, perplexity ~2.25, VRAM ~8.5GB
|
|
||||||
6. Training steps at ~2.6 it/s, VRAM ~9-10GB
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Common Pitfalls Encountered
|
## Common Pitfalls
|
||||||
|
|
||||||
| Problem | Cause | Fix |
|
| Problem | Cause | Fix |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
@@ -167,12 +201,13 @@ Expected startup sequence:
|
|||||||
| `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
|
| `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
|
||||||
| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
|
| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
|
||||||
| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
|
| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
|
||||||
| `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
|
|
||||||
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
|
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
|
||||||
| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 7 above) |
|
| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
|
||||||
| `CUDA Capabilities Selected: 50;52;...` (ignores sm_120) | cmake >= 3.23 built-in arch list lacks sm_120 | Add `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
|
| `CUDA Capabilities Selected: 50;52;...` (ignores -D flag) | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
|
||||||
| `BackendUnavailable: scikit_build_core` | pip install of bnb tries to rebuild | Copy .so directly to site-packages instead |
|
| `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Copy the built `bitsandbytes` directory (with its compiled .so) into site-packages instead |
|
||||||
| `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
|
| `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
|
||||||
| `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
|
| `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
|
||||||
| `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
|
| `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
|
||||||
| Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
|
| Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
|
||||||
|
| Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
|
||||||
|
| Training slower than expected (~3.5h, not 19min) | The fast it/s shown on screen is the eval loop, not training | Normal: training steps also include the backward pass and optimizer update |
|
||||||
|
|||||||