Update SETUP_MIAAI.md: add bare Ubuntu rebuild section (driver, packages, Ollama)
Some checks failed
Tests Nightly against upstream main / pre-commit (push) Has been cancelled
Tests Nightly against upstream main / Prefetch S3 once to prime the CDN cache (push) Has been cancelled
Tests Nightly against upstream main / PyTest (3.12, 2.10.0) (push) Has been cancelled
Tests Nightly against upstream main / PyTest (3.12, 2.9.1) (push) Has been cancelled
Tests Nightly against upstream main / docker-e2e-tests (<nil>, 128, 12.8.1, 1, 3.11, 2.10.0) (push) Has been cancelled
Tests Nightly against upstream main / docker-e2e-tests (<nil>, 128, 12.8.1, true, 1, 3.11, 2.9.1) (push) Has been cancelled
Tests Nightly against upstream main / docker-e2e-tests (<nil>, 130, 13.0.0, true, 1, 3.12, 2.9.1) (push) Has been cancelled
Tests Nightly against upstream main / docker-e2e-multigpu-tests (<nil>, 128, 12.8.1, true, 2, 3.11, 2.9.1) (push) Has been cancelled
docker-nightlies / build-axolotl (<nil>, 128, 12.8.1, 3.11, 2.9.1) (push) Has been cancelled
docker-nightlies / build-axolotl-cloud (<nil>, 128, 12.8.1, 3.11, 2.9.1) (push) Has been cancelled
docker-multigpu-tests-biweekly / test-axolotl-multigpu (<nil>, 130, 13.0.0, 2, 3.11, 2.9.1) (push) Has been cancelled
docker-multigpu-tests-biweekly / test-axolotl-multigpu (fbgemm-gpu, 128, 12.8.1, 2, 3.11, 2.10.0) (push) Has been cancelled
Pre-commit auto-update / auto-update (push) Has been cancelled
114
SETUP_MIAAI.md
@@ -9,12 +9,61 @@
 
 ---
 
-## Pre-Training Checklist (every time)
+## Starting from Bare Ubuntu 25.10
 
-Before starting a training run, verify these:
+If rebuilding from scratch, complete these steps first before anything else.
 
+### A. System packages
+```bash
+sudo apt update && sudo apt upgrade -y
+sudo apt install -y \
+build-essential cmake git curl wget \
+python3-dev libssl-dev zlib1g-dev \
+ca-certificates gnupg lsb-release
+```
+
+### B. NVIDIA driver (580.xx)
+Ubuntu 25.10 is too new for NVIDIA's apt repo. Install via ubuntu-drivers:
+```bash
+sudo ubuntu-drivers autoinstall
+sudo reboot
+```
+
+After reboot, verify:
+```bash
+nvidia-smi
+# Must show: NVIDIA GeForce RTX 5080, Driver Version: 580.x
+```
+
+If ubuntu-drivers installs the wrong version, force the right one:
+```bash
+sudo apt install -y nvidia-driver-580
+sudo reboot
+```
+
+### C. Install Ollama
+```bash
+curl -fsSL https://ollama.com/install.sh | sh
+
+# Verify it's running
+systemctl status ollama
+```
+
+### D. HuggingFace CLI
+```bash
+pip3 install huggingface_hub
+huggingface-cli login
+# Paste your HF token — required for gated models like meta-llama
+```
+
+Once steps A–D are done, continue with the One-time Setup below.
+
+---
+
+## Pre-Training Checklist (every session)
+
 ```bash
-# 1. Stop Ollama — if a request hits it mid-training it will compete for VRAM
+# 1. Stop Ollama — if it receives a request mid-training it will compete for VRAM
 sudo systemctl stop ollama
 
 # 2. Activate conda env
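Reviewer sketch (not part of the commit): the post-reboot check in step B could also be scripted instead of read by eye. This assumes the machine-readable query `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` is used in place of the plain `nvidia-smi` call; the helper names `check_driver` and `query_driver` are hypothetical.

```python
import subprocess


def check_driver(csv_line: str, gpu: str = "RTX 5080", major: str = "580") -> bool:
    """Return True if a 'name, driver_version' CSV line matches the expected GPU and driver branch."""
    # Split on the last comma: everything before it is the GPU name.
    name, _, version = csv_line.strip().rpartition(",")
    return gpu in name and version.strip().split(".")[0] == major


def query_driver() -> str:
    """Ask nvidia-smi for the first GPU's name and driver version (requires an installed driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0]
```

On the training machine, `check_driver(query_driver())` would perform the live check; `check_driver` alone can be exercised with a pasted line of output.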
@@ -26,7 +75,7 @@ export CUDA_HOME=$CONDA_PREFIX
 export PATH=$CUDA_HOME/bin:$PATH
 export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 
-# 4. Confirm GPU is clear (should show no processes)
+# 4. Confirm GPU is clear (should show no processes before training)
 nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
 
 # 5. Go to axolotl directory
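Reviewer sketch (not part of the commit): checklist item 4 relies on reading the CSV by eye. A small parser, assuming the output is one header line followed by one row per compute process, could turn it into a yes/no answer. The names `compute_apps` and `gpu_is_clear` are hypothetical, and the parser skips the header line without relying on its exact wording.

```python
import subprocess

# Same query as checklist item 4.
QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"]


def compute_apps(csv_text: str) -> list[dict]:
    """Parse the CSV from the query above into one dict per GPU compute process."""
    lines = [ln for ln in csv_text.splitlines() if ln.strip()]
    rows = []
    for ln in lines[1:]:  # the first non-empty line is the CSV header
        pid, rest = ln.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        rows.append({"pid": int(pid), "process_name": name.strip(), "used_memory": mem.strip()})
    return rows


def gpu_is_clear() -> bool:
    """Run nvidia-smi and report whether no compute process currently holds VRAM."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return not compute_apps(out)
```

If `gpu_is_clear()` returns `False`, the rows from `compute_apps` name the processes to stop before launching training.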
@@ -43,18 +92,18 @@ axolotl train ~/human_chat_qlora.yml
 # Restart Ollama
 sudo systemctl start ollama
 
-# Test the adapter
+# Test the adapter interactively
 axolotl inference ~/human_chat_qlora.yml \
 --lora-model-dir ~/outputs/llama31-8b-humanchat \
 --prompter chat
 
-# (Optional) Merge adapter into base model
+# (Optional) Merge adapter into base model for standalone deployment
 axolotl merge-lora ~/human_chat_qlora.yml
 ```
 
 ---
 
-## One-time Setup (fresh machine only)
+## One-time Setup (fresh machine — after bare Ubuntu steps above)
 
 ### 1. Install Miniconda
 ```bash
@@ -115,16 +164,17 @@ Must compile from source with a patched CMakeLists.txt.
 
 ```bash
 # Clone bitsandbytes v0.49.1
-git clone --branch v0.49.1 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
+git clone --branch v0.49.1 --depth 1 \
+https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
 
 # Patch CMakeLists.txt: insert sm_120 override before the foreach loop
 # (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
 sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
 
-# Verify patch landed correctly (should show the set() line immediately before foreach)
+# Verify patch landed correctly — set() line must appear immediately before foreach
 grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
 
-# Configure
+# Configure — must point cmake at conda's nvcc explicitly
 cmake \
 -DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
 -DCOMPUTE_BACKEND=cuda \
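Reviewer sketch (not part of the commit): the `sed` one-liner in this hunk is sensitive to whitespace and inserts the override again if it is run twice. A Python equivalent, swapped in here purely for illustration, can make the patch idempotent and fail loudly when the `foreach` line is absent. The function name `patch_cmakelists` is hypothetical; unlike `sed`, it patches only the first matching line.

```python
MARKER = "foreach(capability ${CMAKE_CUDA_ARCHITECTURES_ALL})"
OVERRIDE = "set(CMAKE_CUDA_ARCHITECTURES_ALL 120)"


def patch_cmakelists(text: str) -> str:
    """Insert the sm_120 override directly before the first matching foreach line, once."""
    if OVERRIDE in text:
        return text  # already patched, nothing to do
    out = []
    patched = False
    for line in text.splitlines(keepends=True):
        if not patched and MARKER in line:
            # Reuse the foreach line's indentation for the inserted lines.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}# RTX 5080 sm_120 patch\n")
            out.append(f"{indent}{OVERRIDE}\n")
            patched = True
        out.append(line)
    if not patched:
        raise ValueError("foreach line not found; CMakeLists.txt differs from the expected layout")
    return "".join(out)
```

Applied to the file's contents (read from `/tmp/bnb_0491/CMakeLists.txt` and written back), this should leave the `set()` line immediately before `foreach`, which is what the `grep` verification step looks for.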
@@ -132,13 +182,14 @@ cmake \
 -B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
 # Must show: CUDA Capabilities Selected: 120
 
-# Build
+# Build (adjust -j to your CPU core count)
 cmake --build /tmp/bnb_0491/build -j10
 
 # Install into conda site-packages
-cp -r /tmp/bnb_0491/bitsandbytes /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
+cp -r /tmp/bnb_0491/bitsandbytes \
+/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
 
-# Verify
+# Verify CUDA works
 python3 -c "
 import torch, bitsandbytes as bnb
 x = torch.randn(64, 64, device='cuda')
@@ -147,25 +198,34 @@ print('bitsandbytes CUDA OK:', l(x).shape)
 "
 ```
 
-### 9. HuggingFace login (meta-llama is gated)
+### 9. Copy training config to home
 ```bash
-huggingface-cli login
-# Paste your HF token when prompted
+cp /home/tocmo0nlord/axolotl/human_chat_qlora.yml ~/human_chat_qlora.yml
 ```
 
-### 10. Verify everything is working
+### 10. Verify the full stack
 ```bash
 python3 -c "
-import torch, bitsandbytes as bnb, flash_attn, transformers, axolotl
+import torch, bitsandbytes as bnb, flash_attn, transformers
-print('torch:', torch.__version__, '| CUDA:', torch.version.cuda)
+print('torch :', torch.__version__, '| CUDA:', torch.version.cuda)
 print('bitsandbytes:', bnb.__version__)
-print('flash_attn:', flash_attn.__version__)
+print('flash_attn :', flash_attn.__version__)
 print('transformers:', transformers.__version__)
-print('GPU:', torch.cuda.get_device_name(0))
+print('GPU :', torch.cuda.get_device_name(0))
-print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
+print('VRAM :', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
 "
 ```
 
+Expected output:
+```
+torch : 2.x.x | CUDA: 13.2
+bitsandbytes: 0.50.0.dev0
+flash_attn : 2.x.x
+transformers: 5.x.x
+GPU : NVIDIA GeForce RTX 5080
+VRAM : 16.3 GB
+```
+
 ---
 
 ## Training Config — human_chat_qlora.yml
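Reviewer sketch (not part of the commit): the step 10 one-liner stops at the first failing import, so only one missing package is reported per run. A variant that lists every missing module without importing any of them might look like the following; `missing_modules` and `STACK` are hypothetical names, and the module list simply mirrors the imports in step 10.

```python
import importlib.util

# Top-level modules imported by the step 10 check.
STACK = ["torch", "bitsandbytes", "flash_attn", "transformers"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `missing_modules(STACK)` inside the activated conda env should return an empty list before the version printout is attempted. It only confirms the packages are installed, not that CUDA works, so the original check is still needed.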
@@ -176,7 +236,7 @@ Key settings tuned for RTX 5080 (16GB):
 |---|---|---|
 | `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
 | `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
-| `gradient_accumulation_steps` | `8` | Keeps effective batch at 8 |
+| `gradient_accumulation_steps` | `8` | Keeps effective batch size at 8 |
 | `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
 | `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
 | `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
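Reviewer sketch (not part of the commit): the `sequence_len` row attributes the OOM to the logits tensor. A back-of-the-envelope estimate shows the scale, assuming the Llama 3.1 vocabulary of 128,256 tokens (the table's "128k") and logits upcast to float32 for the loss. Both are assumptions here, and the real peak is higher because the loss and its backward pass hold additional copies.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int = 128_256, bytes_per_value: int = 4) -> int:
    """Size in bytes of one (batch, seq_len, vocab) logits tensor."""
    return batch * seq_len * vocab * bytes_per_value


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# One float32 logits tensor at micro_batch_size 1:
at_4096 = gib(logits_bytes(1, 4096))  # about 1.96 GiB
at_2048 = gib(logits_bytes(1, 2048))  # about 0.98 GiB
```

Halving `sequence_len` halves this tensor, which is consistent with 2048 fitting on a 16GB card where 4096 does not.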
@@ -187,8 +247,7 @@ Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
 - Initial eval loss: ~0.81, perplexity ~2.25
 - Final loss target: ~0.55–0.60
 
-To use more VRAM (~14GB) and improve gradient signal, increase `micro_batch_size: 2`
-(adjust `gradient_accumulation_steps: 4` to keep effective batch at 8).
+To push VRAM to ~14GB and improve training: set `micro_batch_size: 2` and `gradient_accumulation_steps: 4`.
 
 ---
 
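Reviewer sketch (not part of the commit): the reworded line drops the explanation that `gradient_accumulation_steps: 4` is what keeps the effective batch at 8. The arithmetic, assuming a single GPU as in this guide (the function name and the `num_gpus` parameter are hypothetical):

```python
def effective_batch(micro_batch_size: int, gradient_accumulation_steps: int, num_gpus: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


default_setting = effective_batch(1, 8)  # documented default: 1 x 8
larger_micro = effective_batch(2, 4)     # suggested alternative: 2 x 4
```

Both settings give the same effective batch, so the change trades accumulation steps for VRAM without altering the optimizer's batch size.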
@@ -203,11 +262,12 @@ To use more VRAM (~14GB) and improve gradient signal, increase `micro_batch_size
 | `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
 | flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
 | `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
-| `CUDA Capabilities Selected: 50;52;...` (ignores -D flag) | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
+| `CUDA Capabilities Selected: 50;52;...` ignores -D flag | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
 | `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Copy .so directly to site-packages instead |
 | `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
 | `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
 | `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
 | Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
 | Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
-| Training slower than expected (~3.5h not 19min) | The fast it/s on screen is the eval loop, not training | Normal — training includes backward pass and optimizer |
+| Training much slower than eval speed | The fast it/s on screen is the eval loop (forward only) | Normal — training includes backward pass and optimizer (~3.5h total) |
+| ubuntu-drivers installs wrong NVIDIA version | Multiple driver candidates available | Force with `apt install nvidia-driver-580` |