Compare commits

13 Commits

Author SHA1 Message Date
c6da9b9e92 Update SETUP_MIAAI.md: add bare Ubuntu rebuild section (driver, packages, Ollama) 2026-05-13 21:33:02 +00:00
c7c4885369 Update SETUP_MIAAI.md: pre-training checklist, Ollama stop/start, verify script, corrected training time 2026-05-13 21:19:15 +00:00
981a13e110 Update human_chat_qlora.yml: working config for RTX 5080 (seq_len 2048, qlora, chat_template) 2026-05-13 18:59:19 +00:00
74f2263ac7 Update SETUP_MIAAI.md: bitsandbytes sm_120 patch, OOM fixes, working training config 2026-05-13 18:58:51 +00:00
8693a1f61b fix Dockerfile-base-next: cuda 12.8.2, miniforge, sm_120 2026-05-13 14:37:01 +00:00
71c6a56e7a switch to HQQ quantization to bypass bitsandbytes sm_120 issue 2026-05-13 13:55:52 +00:00
38adf5cd37 add trust_remote_code, explicit bfloat16 and bnb dtype settings 2026-05-13 13:32:46 +00:00
3f29fa017b replace Capybara with SlimOrca (compatible ShareGPT format) 2026-05-13 12:58:29 +00:00
c02a76f132 fix field_messages mapping for Capybara/OpenHermes ShareGPT format 2026-05-13 12:56:03 +00:00
b9ceebfe7e fix deprecated type:sharegpt and flash_attention config keys 2026-05-13 12:52:25 +00:00
e9a3fd483f add human-like QLoRA training config for Llama 3.1 8B 2026-05-13 12:50:35 +00:00
eadd15c960 note MAX_JOBS for flash-attn compile speed 2026-05-13 04:45:21 +00:00
396ce4a9dd add miaai environment setup guide 2026-05-13 04:16:03 +00:00
3 changed files with 374 additions and 10 deletions

SETUP_MIAAI.md Normal file

@@ -0,0 +1,273 @@
# Axolotl Setup — miaai (RTX 5080, CUDA 13.2)
## System Info
- GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
- Driver: 580.126.09 — max CUDA 13.0 shown by nvidia-smi, but nvcc from conda is 13.2
- OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
- Axolotl repo: `/home/tocmo0nlord/axolotl` (branch: `activeblue/main`)
- Conda env: `axolotl` at `/opt/miniconda3/envs/axolotl`
---
## Starting from Bare Ubuntu 25.10
If rebuilding from scratch, complete these steps before anything else.
### A. System packages
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y \
build-essential cmake git curl wget \
python3-dev libssl-dev zlib1g-dev \
ca-certificates gnupg lsb-release
```
### B. NVIDIA driver (580.xx)
Ubuntu 25.10 is too new for NVIDIA's apt repo. Install via ubuntu-drivers:
```bash
sudo ubuntu-drivers autoinstall
sudo reboot
```
After reboot, verify:
```bash
nvidia-smi
# Must show: NVIDIA GeForce RTX 5080, Driver Version: 580.x
```
If ubuntu-drivers installs the wrong version, force the right one:
```bash
sudo apt install -y nvidia-driver-580
sudo reboot
```
### C. Install Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
# Verify it's running
systemctl status ollama
```
### D. HuggingFace CLI
```bash
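# System pip on Ubuntu 25.10 fails with externally-managed-environment (see Common Pitfalls),
# so run this inside the conda env created in One-time Setup step 2.
pip install huggingface_hub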
huggingface-cli login
# Paste your HF token — required for gated models like meta-llama
```
Once steps A-D are done, continue with the One-time Setup below.
---
## Pre-Training Checklist (every session)
```bash
# 1. Stop Ollama — if it receives a request mid-training it will compete for VRAM
sudo systemctl stop ollama
# 2. Activate conda env
export PATH="/opt/miniconda3/bin:$PATH"
conda activate axolotl
# 3. Set env vars
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# 4. Confirm GPU is clear (should show no processes before training)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# 5. Go to axolotl directory
cd /home/tocmo0nlord/axolotl
```
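Step 4 of the checklist can also be automated. The following is a minimal sketch, not part of the original setup: it runs the same `nvidia-smi` query and treats any row after the CSV header as a process still holding the GPU. The helper names `parse_compute_apps` and `gpu_is_clear` are hypothetical, and the exact header text printed by `nvidia-smi` may differ between driver versions, so the parser only relies on the first line being a header.

```python
import csv
import io
import subprocess

# Same query as step 4 of the checklist.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv",
]


def parse_compute_apps(csv_text):
    """Return one dict per compute process listed in nvidia-smi CSV output."""
    rows = [row for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True) if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]  # first non-empty line is the CSV header
    return [dict(zip(header, row)) for row in body]


def gpu_is_clear(csv_text):
    """True when only the CSV header (or nothing) was printed."""
    return not parse_compute_apps(csv_text)


if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
    else:
        apps = parse_compute_apps(result.stdout)
        if apps:
            print("GPU is busy; stop these processes before training:")
            for app in apps:
                print("  ", app)
        else:
            print("GPU is clear.")
```

If Ollama still shows up in the list, `sudo systemctl stop ollama` from step 1 has not taken effect yet.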
## Run Training
```bash
axolotl train ~/human_chat_qlora.yml
```
## After Training
```bash
# Restart Ollama
sudo systemctl start ollama
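# Test the adapter interactively
# (output_dir in the config is ./outputs/..., relative to the axolotl directory where training ran)
axolotl inference ~/human_chat_qlora.yml \
  --lora-model-dir ./outputs/llama31-8b-humanchat \
  --prompter chat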
# (Optional) Merge adapter into base model for standalone deployment
axolotl merge-lora ~/human_chat_qlora.yml
```
---
## One-time Setup (fresh machine — after bare Ubuntu steps above)
### 1. Install Miniconda
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /opt/miniconda3
/opt/miniconda3/bin/conda init bash
source ~/.bashrc
```
### 2. Create Python 3.11 environment
```bash
conda create -n axolotl python=3.11 -y
conda activate axolotl
```
### 3. Clone axolotl repo
```bash
git clone https://git.activeblue.net/tocmo0nlord/axolotl.git /home/tocmo0nlord/axolotl
cd /home/tocmo0nlord/axolotl
git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
git fetch upstream
git rebase upstream/main # keeps activeblue patches on top
```
### 4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
```bash
conda install -y -c "nvidia/label/cuda-12.8.0" cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
export PATH=$CUDA_HOME/bin:$PATH
```
> NOTE: Despite installing from the cuda-12.8.0 channel, conda resolves nvcc to **13.2.78**.
> This is fine — use cu132 everywhere to match.
### 5. Install PyTorch — use cu132 (matches nvcc from conda)
```bash
# torchaudio has no cu132 wheel — skip it, not needed for LLM training
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
```
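The rule in steps 4 and 5 is that the PyTorch wheel tag has to match the `nvcc` release that conda actually installed. As a rough sketch (the function names `cuda_wheel_tag` and `torch_matches_nvcc` are made up for illustration), the tag can be derived from the `release X.Y` string in `nvcc --version` output and compared with `torch.version.cuda`:

```python
import re
import subprocess


def cuda_wheel_tag(nvcc_version_output):
    """Turn the 'release X.Y' part of `nvcc --version` output into a tag such as 'cu132'."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_version_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    major, minor = match.groups()
    return f"cu{major}{minor}"


def torch_matches_nvcc(torch_cuda_version, nvcc_version_output):
    """Compare a torch.version.cuda string (e.g. '13.2') with the nvcc release."""
    major, minor = torch_cuda_version.split(".")[:2]
    return f"cu{major}{minor}" == cuda_wheel_tag(nvcc_version_output)


if __name__ == "__main__":
    try:
        output = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not run nvcc: {exc}")
    else:
        print("Wheel tag to install:", cuda_wheel_tag(output))
```

A mismatch here corresponds to the `CUDA version mismatch 13.2 vs 12.8` entry in Common Pitfalls.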
### 6. Install Axolotl
```bash
cd /home/tocmo0nlord/axolotl
pip install -e "."
```
### 7. Install flash-attn
> Compiles CUDA kernels from source; takes 15-25 min on 10 cores of an i7-14700K.
```bash
MAX_JOBS=10 pip install flash-attn --no-build-isolation
```
### 8. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
Prebuilt wheels do not include sm_120, and CUDA 13.2 has also dropped sm_50 through sm_53.
You must compile from source with a patched CMakeLists.txt.
```bash
# Clone bitsandbytes v0.49.1
git clone --branch v0.49.1 --depth 1 \
https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
# Patch CMakeLists.txt: insert sm_120 override before the foreach loop
# (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
sed -i '/ foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\ # RTX 5080 sm_120 patch\n set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
# Verify patch landed correctly — set() line must appear immediately before foreach
grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
# Configure — must point cmake at conda's nvcc explicitly
cmake \
-DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
-DCOMPUTE_BACKEND=cuda \
-S /tmp/bnb_0491 \
-B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
# Must show: CUDA Capabilities Selected: 120
# Build (adjust -j to your CPU core count)
cmake --build /tmp/bnb_0491/build -j10
# Install into conda site-packages
cp -r /tmp/bnb_0491/bitsandbytes \
/opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
# Verify CUDA works
python3 -c "
import torch, bitsandbytes as bnb
x = torch.randn(64, 64, device='cuda')
l = bnb.nn.Linear8bitLt(64, 64).cuda()
print('bitsandbytes CUDA OK:', l(x).shape)
"
```
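Before or after the build, it can help to confirm that the GPU really reports compute capability 12.0 and that the installed PyTorch build lists `sm_120`. The sketch below is an illustrative check of the PyTorch side only (it says nothing about bitsandbytes itself); `arch_tag` and `build_supports` are hypothetical helper names, and the comparison is deliberately coarse because it ignores PTX forward compatibility.

```python
def arch_tag(capability):
    """Map a (major, minor) compute capability to an 'sm_XY' tag, e.g. (12, 0) -> 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(capability, arch_list):
    """True if the tag for this capability appears in a build's architecture list."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("GPU architecture :", arch_tag(capability))
            print("torch built for  :", archs)
            print("native support   :", build_supports(capability, archs))
        else:
            print("No CUDA device is visible to torch")
```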
### 9. Copy training config to home
```bash
cp /home/tocmo0nlord/axolotl/human_chat_qlora.yml ~/human_chat_qlora.yml
```
### 10. Verify the full stack
```bash
python3 -c "
import torch, bitsandbytes as bnb, flash_attn, transformers
print('torch :', torch.__version__, '| CUDA:', torch.version.cuda)
print('bitsandbytes:', bnb.__version__)
print('flash_attn :', flash_attn.__version__)
print('transformers:', transformers.__version__)
print('GPU :', torch.cuda.get_device_name(0))
print('VRAM :', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
"
```
Expected output:
```
torch : 2.x.x | CUDA: 13.2
bitsandbytes: 0.50.0.dev0
flash_attn : 2.x.x
transformers: 5.x.x
GPU : NVIDIA GeForce RTX 5080
VRAM : 16.3 GB
```
---
## Training Config — human_chat_qlora.yml
Key settings tuned for RTX 5080 (16GB):
| Setting | Value | Notes |
|---|---|---|
| `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
| `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
| `gradient_accumulation_steps` | `8` | Keeps effective batch size at 8 |
| `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
| `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
| `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
- VRAM: ~10-11 GB active, ~11 GB allocated
- Training duration: ~3.5 hours
- Initial eval loss: ~0.81, perplexity ~2.25
- Final loss target: ~0.55-0.60
To use more VRAM (~14 GB) and potentially improve throughput, set `micro_batch_size: 2` and `gradient_accumulation_steps: 4`. The effective batch size stays at 8.
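The arithmetic behind the `sequence_len` and batch-size rows can be sketched as follows. This is a back-of-the-envelope estimate only: it assumes the Llama 3.1 vocabulary size of 128,256 and that the logits are held in fp32 (4 bytes per value) during loss computation, and it counts a single copy of the tensor even though gradients and temporaries add more on top.

```python
VOCAB_SIZE = 128_256  # Llama 3.1 tokenizer vocabulary size


def logits_bytes(micro_batch_size, sequence_len, vocab_size=VOCAB_SIZE, bytes_per_value=4):
    """Size of one logits tensor (batch x seq_len x vocab); fp32 is an assumption."""
    return micro_batch_size * sequence_len * vocab_size * bytes_per_value


def effective_batch(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


if __name__ == "__main__":
    for seq_len in (2048, 4096):
        gib = logits_bytes(1, seq_len) / 2**30
        print(f"sequence_len={seq_len}: one fp32 logits tensor is about {gib:.2f} GiB")
    print("effective batch (1 x 8):", effective_batch(1, 8))
    print("effective batch (2 x 4):", effective_batch(2, 4))
```

Under these assumptions one logits tensor takes roughly 1 GiB at 2048 tokens and 2 GiB at 4096, which gives a sense of why the longer sequence length does not fit next to the model and optimizer state on 16 GB.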
---
## Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| `externally-managed-environment` | System Python 3.13 blocks pip | Use conda env, never system pip |
| `No module named torch` (flash-attn) | pip builds in isolated env | Use `--no-build-isolation` |
| `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
| `CUDA Capabilities Selected: 50;52;...` ignores -D flag | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
| `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Copy .so directly to site-packages instead |
| `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
| `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
| `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
| Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
| Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
| Training much slower than eval speed | The fast it/s on screen is the eval loop (forward only) | Normal — training includes backward pass and optimizer (~3.5h total) |
| ubuntu-drivers installs wrong NVIDIA version | Multiple driver candidates available | Force with `apt install nvidia-driver-580` |

@@ -1,16 +1,15 @@
-ARG CUDA_VERSION="12.8.1"
-ARG CUDNN_VERSION="8"
+ARG CUDA_VERSION="12.8.2"
 ARG UBUNTU_VERSION="22.04"
 ARG MAX_JOBS=4
-FROM nvidia/cuda:$CUDA_VERSION-cudnn$CUDNN_VERSION-devel-ubuntu$UBUNTU_VERSION AS base-builder
+FROM nvidia/cuda:12.8.2-devel-ubuntu22.04 AS base-builder
-ENV PATH="/root/miniconda3/bin:${PATH}"
+ENV PATH="/root/miniforge3/bin:${PATH}"
 ARG PYTHON_VERSION="3.11"
 ARG PYTORCH_VERSION="next"
 ARG CUDA="128"
-ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
+ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0 12.0+PTX"
 ENV PYTHON_VERSION=$PYTHON_VERSION
 ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST
@@ -18,13 +17,13 @@ ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST
 RUN apt-get update \
     && apt-get install -y wget git build-essential ninja-build git-lfs libaio-dev pkg-config && rm -rf /var/lib/apt/lists/* \
     && wget \
-    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
+    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
     && mkdir /root/.conda \
-    && bash Miniconda3-latest-Linux-x86_64.sh -b \
-    && rm -f Miniconda3-latest-Linux-x86_64.sh \
-    && conda create -n "py${PYTHON_VERSION}" python="${PYTHON_VERSION}"
+    && bash Miniforge3-Linux-x86_64.sh -b \
+    && rm -f Miniforge3-Linux-x86_64.sh \
+    && /root/miniforge3/bin/conda create -n "py${PYTHON_VERSION}" python="${PYTHON_VERSION}"
-ENV PATH="/root/miniconda3/envs/py${PYTHON_VERSION}/bin:${PATH}"
+ENV PATH="/root/miniforge3/envs/py${PYTHON_VERSION}/bin:${PATH}"
 WORKDIR /workspace

human_chat_qlora.yml Normal file

@@ -0,0 +1,92 @@
# Llama 3.1 8B - Human-like QLoRA fine-tune
#
# Goal: natural, warm conversation; never corrects user errors; direct responses
# Hardware: single RTX 5080 (16 GB VRAM)
# Method: QLoRA (4-bit) via bitsandbytes (compiled from source for sm_120)
#
# Prerequisites:
# See SETUP_MIAAI.md for full environment setup including bitsandbytes compilation
# huggingface-cli login (meta-llama is a gated model)
#
# Run:
# export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# axolotl train human_chat_qlora.yml
# axolotl merge-lora human_chat_qlora.yml # (optional) merge adapter into base
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
strict: false
# --- System prompt baked into every conversation ---
chat_template: llama3
default_system_message: >-
You are a direct, warm, and genuinely helpful assistant.
Respond to the user's intent naturally - never comment on typos, grammar,
or phrasing issues in their message. Just understand what they mean and give
a clear, useful, conversational answer as if talking to a knowledgeable friend.
# --- Datasets ---
# SlimOrca: ~74k carefully curated conversations - good for natural tone
# OpenHermes-2.5: broad instruction coverage - sampled to 5% to keep balance
datasets:
- path: Open-Orca/SlimOrca
type: chat_template
field_messages: conversations
message_field_role: from
message_field_content: value
split: train[:3%]
- path: teknium/OpenHermes-2.5
type: chat_template
field_messages: conversations
message_field_role: from
message_field_content: value
split: train[:5%]
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
output_dir: ./outputs/llama31-8b-humanchat
# sequence_len 2048 required on 16GB VRAM - 4096 OOMs during loss computation
# (logits tensor: batch x seq_len x 128k vocab exceeds available memory)
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
# --- QLoRA adapter ---
adapter: qlora
lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
# --- Training hyperparameters ---
# Effective batch = micro_batch_size x gradient_accumulation = 1 x 8 = 8
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 2
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 2e-4
warmup_ratio: 0.05
weight_decay: 0.1
train_on_inputs: false
group_by_length: false
bf16: auto
tf32: false
# --- Memory & speed ---
gradient_checkpointing: true
attn_implementation: flash_attention_2
# --- Logging & checkpointing ---
logging_steps: 10
evals_per_epoch: 2
saves_per_epoch: 1
special_tokens:
pad_token: "<|eot_id|>"
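
For a sense of what the adapter settings above imply, the sketch below estimates the number of trainable LoRA parameters and the LoRA scaling factor. It is an illustration under stated assumptions rather than output from axolotl: it uses the published Llama 3.1 8B dimensions (hidden size 4096, 8 KV heads of dimension 128, MLP size 14336, 32 layers), assumes `lora_target_linear: true` adapts the seven projection matrices in each decoder layer and not `lm_head`, and uses the standard scaling `lora_alpha / lora_r`.

```python
HIDDEN = 4096         # model hidden size
KV_DIM = 1024         # 8 KV heads x 128 head dimension
INTERMEDIATE = 14336  # MLP intermediate size
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer assumed to receive an adapter.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank):
    """Each adapted layer adds A (rank x in) and B (out x rank), i.e. rank * (in + out) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in LINEAR_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return alpha / rank


if __name__ == "__main__":
    print(f"trainable LoRA parameters at r=64: {lora_params(64):,}")
    print("scaling with lora_alpha=32, lora_r=64:", lora_scaling(32, 64))
```

With `lora_r: 64` and `lora_alpha: 32` the update is scaled by 0.5, so halving `lora_r` at the same alpha would double the scaling while halving the adapter size.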