Update SETUP_MIAAI.md: add bare Ubuntu rebuild section (driver, packages, Ollama)

Update SETUP_MIAAI.md: pre-training checklist, Ollama stop/start, verify script, corrected training time
Update human_chat_qlora.yml: working config for RTX 5080 (seq_len 2048, qlora, chat_template)
3 changed files with 374 additions and 10 deletions
--- a/SETUP_MIAAI.md
+++ b/SETUP_MIAAI.md
@@ -0,0 +1,273 @@
+# Axolotl Setup — miaai (RTX 5080, CUDA 13.2)
+
+## System Info
+- GPU: NVIDIA RTX 5080 (16GB VRAM, sm_120 / Blackwell)
+- Driver: 580.126.09 (nvidia-smi reports CUDA 13.0, the highest version the driver supports; nvcc from conda is 13.2)
+- OS: Ubuntu 25.10 (Python 3.13 system — do NOT use system Python for ML)
+- Axolotl repo: `/home/tocmo0nlord/axolotl` (branch: `activeblue/main`)
+- Conda env: `axolotl` at `/opt/miniconda3/envs/axolotl`
+
+---
+
+## Starting from Bare Ubuntu 25.10
+
+If rebuilding from scratch, complete these steps before anything else.
+
+### A. System packages
+```bash
+sudo apt update && sudo apt upgrade -y
+sudo apt install -y \
+  build-essential cmake git curl wget \
+  python3-dev libssl-dev zlib1g-dev \
+  ca-certificates gnupg lsb-release
+```
+
+### B. NVIDIA driver (580.xx)
+Ubuntu 25.10 is too new for NVIDIA's apt repo. Install via ubuntu-drivers:
+```bash
+sudo ubuntu-drivers autoinstall
+sudo reboot
+```
+
+After reboot, verify:
+```bash
+nvidia-smi
+# Must show: NVIDIA GeForce RTX 5080, Driver Version: 580.x
+```
+
+If ubuntu-drivers installs the wrong version, force the right one:
+```bash
+sudo apt install -y nvidia-driver-580
+sudo reboot
+```
+
+### C. Install Ollama
+```bash
+curl -fsSL https://ollama.com/install.sh | sh
+
+# Verify it's running
+systemctl status ollama
+```
+
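+### D. HuggingFace CLI
+```bash
+# Run inside the axolotl conda env (One-time Setup step 2); system pip3 fails with externally-managed-environment
+pip install huggingface_hub
+huggingface-cli login   # paste your HF token; required for gated models like meta-llama
+```
+
+Once steps A–C are done, continue with the One-time Setup below; return to step D after creating the conda env.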
+
+---
+
+## Pre-Training Checklist (every session)
+
+```bash
+# 1. Stop Ollama — if it receives a request mid-training it will compete for VRAM
+sudo systemctl stop ollama
+
+# 2. Activate conda env
+export PATH="/opt/miniconda3/bin:$PATH"
+conda activate axolotl
+
+# 3. Set env vars
+export CUDA_HOME=$CONDA_PREFIX
+export PATH=$CUDA_HOME/bin:$PATH
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+# 4. Confirm GPU is clear (should show no processes before training)
+nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
+
+# 5. Go to axolotl directory
+cd /home/tocmo0nlord/axolotl
+```
+
+## Run Training
+```bash
+axolotl train ~/human_chat_qlora.yml
+```
+
+## After Training
+```bash
+# Restart Ollama
+sudo systemctl start ollama
+
+# Test the adapter interactively (output_dir in the config is relative to the axolotl directory)
+axolotl inference ~/human_chat_qlora.yml \
+  --lora-model-dir ./outputs/llama31-8b-humanchat \
+  --prompter chat
+
+# (Optional) Merge adapter into base model for standalone deployment
+axolotl merge-lora ~/human_chat_qlora.yml
+```
+
+---
+
+## One-time Setup (fresh machine — after bare Ubuntu steps above)
+
+### 1. Install Miniconda
+```bash
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
+bash miniconda.sh -b -p /opt/miniconda3
+/opt/miniconda3/bin/conda init bash
+source ~/.bashrc
+```
+
+### 2. Create Python 3.11 environment
+```bash
+conda create -n axolotl python=3.11 -y
+conda activate axolotl
+```
+
+### 3. Clone axolotl repo
+```bash
+git clone https://git.activeblue.net/tocmo0nlord/axolotl.git /home/tocmo0nlord/axolotl
+cd /home/tocmo0nlord/axolotl
+git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
+git fetch upstream
+git rebase upstream/main        # keeps activeblue patches on top
+```
+
+### 4. Install CUDA toolkit (needed to compile flash-attn and bitsandbytes)
+```bash
+conda install -y -c "nvidia/label/cuda-12.8.0" cuda-toolkit
+export CUDA_HOME=$CONDA_PREFIX
+export PATH=$CUDA_HOME/bin:$PATH
+```
+
+> NOTE: Despite installing from the cuda-12.8.0 channel, conda resolves nvcc to **13.2.78**.
+> This is fine — use cu132 everywhere to match.
+
+### 5. Install PyTorch — use cu132 (matches nvcc from conda)
+```bash
+# torchaudio has no cu132 wheel — skip it, not needed for LLM training
+pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
+python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
+```
+
+### 6. Install Axolotl
+```bash
+cd /home/tocmo0nlord/axolotl
+pip install -e "."
+```
+
+### 7. Install flash-attn
+> Compiles CUDA kernels from source; takes 15–25 min on 10 cores of an i7-14700K.
+```bash
+MAX_JOBS=10 pip install flash-attn --no-build-isolation
+```
+
+### 8. Compile bitsandbytes from source for sm_120 (RTX 5080 / Blackwell)
+
+Prebuilt wheels do not include sm_120. CUDA 13.2 also dropped sm_50–53.
+Must compile from source with a patched CMakeLists.txt.
+
+```bash
+# Clone bitsandbytes v0.49.1
+git clone --branch v0.49.1 --depth 1 \
+  https://github.com/bitsandbytes-foundation/bitsandbytes.git /tmp/bnb_0491
+
+# Patch CMakeLists.txt: insert sm_120 override before the foreach loop
+# (cmake >= 3.23.0 uses its own built-in arch list which does not include sm_120)
+sed -i '/    foreach(capability \${CMAKE_CUDA_ARCHITECTURES_ALL})/i\    # RTX 5080 sm_120 patch\n    set(CMAKE_CUDA_ARCHITECTURES_ALL 120)' /tmp/bnb_0491/CMakeLists.txt
+
+# Verify patch landed correctly — set() line must appear immediately before foreach
+grep -n "ARCHITECTURES_ALL\|foreach" /tmp/bnb_0491/CMakeLists.txt | tail -5
+
+# Configure — must point cmake at conda's nvcc explicitly
+cmake \
+  -DCMAKE_CUDA_COMPILER=/opt/miniconda3/envs/axolotl/bin/nvcc \
+  -DCOMPUTE_BACKEND=cuda \
+  -S /tmp/bnb_0491 \
+  -B /tmp/bnb_0491/build 2>&1 | grep -E "(Capabilit|CUDA Ver|Error)"
+# Must show: CUDA Capabilities Selected: 120
+
+# Build (adjust -j to your CPU core count)
+cmake --build /tmp/bnb_0491/build -j10
+
+# Install into conda site-packages
+cp -r /tmp/bnb_0491/bitsandbytes \
+  /opt/miniconda3/envs/axolotl/lib/python3.11/site-packages/
+
+# Verify CUDA works
+python3 -c "
+import torch, bitsandbytes as bnb
+x = torch.randn(64, 64, device='cuda')
+l = bnb.nn.Linear8bitLt(64, 64).cuda()
+print('bitsandbytes CUDA OK:', l(x).shape)
+"
+```
+
+### 9. Copy training config to home
+```bash
+cp /home/tocmo0nlord/axolotl/human_chat_qlora.yml ~/human_chat_qlora.yml
+```
+
+### 10. Verify the full stack
+```bash
+python3 -c "
+import torch, bitsandbytes as bnb, flash_attn, transformers
+print('torch      :', torch.__version__, '| CUDA:', torch.version.cuda)
+print('bitsandbytes:', bnb.__version__)
+print('flash_attn :', flash_attn.__version__)
+print('transformers:', transformers.__version__)
+print('GPU        :', torch.cuda.get_device_name(0))
+print('VRAM       :', round(torch.cuda.get_device_properties(0).total_memory/1e9, 1), 'GB')
+"
+```
+
+Expected output:
+```
+torch      : 2.x.x | CUDA: 13.2
+bitsandbytes: 0.50.0.dev0
+flash_attn : 2.x.x
+transformers: 5.x.x
+GPU        : NVIDIA GeForce RTX 5080
+VRAM       : 16.3 GB
+```
+
+---
+
+## Training Config — human_chat_qlora.yml
+
+Key settings tuned for RTX 5080 (16GB):
+
+| Setting | Value | Notes |
+|---|---|---|
+| `sequence_len` | `2048` | 4096 OOMs during loss computation (logits x 128k vocab) |
+| `micro_batch_size` | `1` | Effective batch = micro x grad_accum = 8 |
+| `gradient_accumulation_steps` | `8` | Keeps effective batch size at 8 |
+| `adapter` | `qlora` | 4-bit via bitsandbytes compiled from source |
+| `attn_implementation` | `flash_attention_2` | Not the deprecated `flash_attention: true` |
+| `type` (datasets) | `chat_template` | Not the deprecated `sharegpt` |
+
+Expected training metrics (RTX 5080, ~65k samples, 2 epochs):
+- VRAM: ~10–11 GB active, ~11 GB allocated
+- Training duration: ~3.5 hours
+- Initial eval loss: ~0.81, perplexity ~2.25
+- Final loss target: ~0.55–0.60
+
+To use more of the card (~14 GB VRAM) and raise throughput while keeping the effective batch at 8, set `micro_batch_size: 2` and `gradient_accumulation_steps: 4`.
+
+---
+
+## Common Pitfalls
+
+| Problem | Cause | Fix |
+|---|---|---|
+| `externally-managed-environment` | System Python 3.13 blocks pip | Use conda env, never system pip |
+| `No module named torch` (flash-attn) | pip builds in isolated env | Use `--no-build-isolation` |
+| `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
+| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
+| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
+| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
+| `nvcc fatal: Unsupported gpu architecture 'compute_50'` | bitsandbytes CMakeLists.txt hardcodes sm_50; CUDA 13.2 dropped it | Patch CMakeLists.txt (see step 8 above) |
+| `CUDA Capabilities Selected: 50;52;...` ignores -D flag | cmake >= 3.23 built-in arch list lacks sm_120; CMakeLists.txt overrides -D | Insert `set(CMAKE_CUDA_ARCHITECTURES_ALL 120)` before foreach loop |
+| `BackendUnavailable: scikit_build_core` | pip install of bnb triggers cmake rebuild | Skip pip; copy the built `bitsandbytes/` package directory (with its compiled .so) into site-packages (see step 8) |
+| `torch.OutOfMemoryError` during eval | logits tensor (batch x 4096 x 128k vocab) too large | Set `sequence_len: 2048`, `micro_batch_size: 1` |
+| `type: sharegpt` deprecation warning | axolotl removed sharegpt type | Use `type: chat_template` with field mappings |
+| `flash_attention: true` deprecation | Old config key removed | Use `attn_implementation: flash_attention_2` |
+| Capybara dataset `field_messages null` | Capybara uses input/output format, not conversations | Switch to SlimOrca or OpenHermes-2.5 |
+| Ollama loads model mid-training | Ollama is enabled and receives a request | `sudo systemctl stop ollama` before training |
+| Training much slower than eval speed | The fast it/s on screen is the eval loop (forward only) | Normal — training includes backward pass and optimizer (~3.5h total) |
+| ubuntu-drivers installs wrong NVIDIA version | Multiple driver candidates available | Force with `apt install nvidia-driver-580` |
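The `torch.OutOfMemoryError` row above attributes the failure to the logits tensor (batch x seq_len x 128k vocab). A rough size check can make that concrete. This is only a sketch: it assumes a vocabulary of 128,256 tokens for Llama 3.1 and fp32 logits at 4 bytes per element, neither of which is stated in the guide, and `logits_bytes` is a hypothetical helper name.

```python
# Rough size of the logits tensor that the loss is computed over.
# Assumptions (not from the guide): vocab size 128,256 and 4-byte (fp32) logits.
VOCAB_SIZE = 128_256
BYTES_PER_ELEMENT = 4


def logits_bytes(batch: int, seq_len: int,
                 vocab: int = VOCAB_SIZE,
                 bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Return the byte size of a (batch, seq_len, vocab) logits tensor."""
    return batch * seq_len * vocab * bytes_per_element


if __name__ == "__main__":
    for seq_len in (2048, 4096):
        size = logits_bytes(batch=1, seq_len=seq_len)
        print(f"seq_len={seq_len}: {size / 2**30:.2f} GiB")
```

Under these assumptions a single copy of the logits is roughly 1 GiB at `sequence_len: 2048` and 2 GiB at 4096. The loss computation typically holds more than one tensor of that shape at once (for example its gradient), so halving the sequence length frees several GiB on a 16 GB card.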
--- a/docker/Dockerfile-base-next
+++ b/docker/Dockerfile-base-next
@@ -1,16 +1,15 @@
-ARG CUDA_VERSION="12.8.1"
-ARG CUDNN_VERSION="8"
+ARG CUDA_VERSION="12.8.2"
 ARG UBUNTU_VERSION="22.04"
 ARG MAX_JOBS=4

-FROM nvidia/cuda:$CUDA_VERSION-cudnn$CUDNN_VERSION-devel-ubuntu$UBUNTU_VERSION AS base-builder
+FROM nvidia/cuda:$CUDA_VERSION-devel-ubuntu$UBUNTU_VERSION AS base-builder

-ENV PATH="/root/miniconda3/bin:${PATH}"
+ENV PATH="/root/miniforge3/bin:${PATH}"

 ARG PYTHON_VERSION="3.11"
 ARG PYTORCH_VERSION="next"
 ARG CUDA="128"
-ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0+PTX"
+ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 9.0 12.0+PTX"

 ENV PYTHON_VERSION=$PYTHON_VERSION
 ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST
@@ -18,13 +17,13 @@ ENV TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST
 RUN apt-get update \
    && apt-get install -y wget git build-essential ninja-build git-lfs libaio-dev pkg-config && rm -rf /var/lib/apt/lists/* \
    && wget \
-    https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
+    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
    && mkdir /root/.conda \
-    && bash Miniconda3-latest-Linux-x86_64.sh -b \
-    && rm -f Miniconda3-latest-Linux-x86_64.sh \
-    && conda create -n "py${PYTHON_VERSION}" python="${PYTHON_VERSION}"
+    && bash Miniforge3-Linux-x86_64.sh -b \
+    && rm -f Miniforge3-Linux-x86_64.sh \
+    && /root/miniforge3/bin/conda create -n "py${PYTHON_VERSION}" python="${PYTHON_VERSION}"

-ENV PATH="/root/miniconda3/envs/py${PYTHON_VERSION}/bin:${PATH}"
+ENV PATH="/root/miniforge3/envs/py${PYTHON_VERSION}/bin:${PATH}"

 WORKDIR /workspace

--- a/human_chat_qlora.yml
+++ b/human_chat_qlora.yml
@@ -0,0 +1,92 @@
+# Llama 3.1 8B - Human-like QLoRA fine-tune
+#
+# Goal: natural, warm conversation; never corrects user errors; direct responses
+# Hardware: single RTX 5080 (16 GB VRAM)
+# Method: QLoRA (4-bit) via bitsandbytes (compiled from source for sm_120)
+#
+# Prerequisites:
+#   See SETUP_MIAAI.md for full environment setup including bitsandbytes compilation
+#   huggingface-cli login   (meta-llama is a gated model)
+#
+# Run:
+#   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+#   axolotl train human_chat_qlora.yml
+#   axolotl merge-lora human_chat_qlora.yml   # (optional) merge adapter into base
+
+base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
+model_type: LlamaForCausalLM
+tokenizer_type: AutoTokenizer
+
+load_in_4bit: true
+strict: false
+
+# --- System prompt baked into every conversation ---
+chat_template: llama3
+default_system_message: >-
+  You are a direct, warm, and genuinely helpful assistant.
+  Respond to the user's intent naturally - never comment on typos, grammar,
+  or phrasing issues in their message. Just understand what they mean and give
+  a clear, useful, conversational answer as if talking to a knowledgeable friend.
+
+# --- Datasets ---
+# SlimOrca: ~518k curated conversations - good for natural tone; sampled to 3%
+# OpenHermes-2.5: broad instruction coverage - sampled to 5% to keep balance
+datasets:
+  - path: Open-Orca/SlimOrca
+    type: chat_template
+    field_messages: conversations
+    message_field_role: from
+    message_field_content: value
+    split: train[:3%]
+
+  - path: teknium/OpenHermes-2.5
+    type: chat_template
+    field_messages: conversations
+    message_field_role: from
+    message_field_content: value
+    split: train[:5%]
+
+dataset_prepared_path: last_run_prepared
+val_set_size: 0.01
+output_dir: ./outputs/llama31-8b-humanchat
+
+# sequence_len 2048 required on 16GB VRAM - 4096 OOMs during loss computation
+# (logits tensor: batch x seq_len x 128k vocab exceeds available memory)
+sequence_len: 2048
+sample_packing: true
+pad_to_sequence_len: true
+
+# --- QLoRA adapter ---
+adapter: qlora
+lora_r: 64
+lora_alpha: 32
+lora_dropout: 0.05
+lora_target_linear: true
+
+# --- Training hyperparameters ---
+# Effective batch = micro_batch_size x gradient_accumulation = 1 x 8 = 8
+micro_batch_size: 1
+gradient_accumulation_steps: 8
+num_epochs: 2
+optimizer: paged_adamw_32bit
+lr_scheduler: cosine
+learning_rate: 2e-4
+warmup_ratio: 0.05
+weight_decay: 0.1
+
+train_on_inputs: false
+group_by_length: false
+bf16: auto
+tf32: false
+
+# --- Memory & speed ---
+gradient_checkpointing: true
+attn_implementation: flash_attention_2
+
+# --- Logging & checkpointing ---
+logging_steps: 10
+evals_per_epoch: 2
+saves_per_epoch: 1
+
+special_tokens:
+  pad_token: "<|eot_id|>"
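The batch and dataset settings in this config can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the dataset row counts are approximate assumptions (they are not in the config), and `sampled_rows` and `effective_batch` are hypothetical helper names.

```python
# Approximate row counts: assumptions for illustration, not taken from the config.
ASSUMED_ROWS = {
    "Open-Orca/SlimOrca": 518_000,
    "teknium/OpenHermes-2.5": 1_000_000,
}
# Percentages from the `split` keys: train[:3%] and train[:5%].
SPLIT_PERCENT = {
    "Open-Orca/SlimOrca": 3,
    "teknium/OpenHermes-2.5": 5,
}


def sampled_rows(rows: int, percent: int) -> int:
    """Rows kept by a train[:N%] slice (integer arithmetic, rounded down)."""
    return rows * percent // 100


def effective_batch(micro_batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return micro_batch_size * grad_accum_steps * num_gpus


total_rows = sum(sampled_rows(ASSUMED_ROWS[name], SPLIT_PERCENT[name]) for name in ASSUMED_ROWS)

if __name__ == "__main__":
    print("sampled rows:", total_rows)
    print("effective batch (1 x 8):", effective_batch(1, 8))
    print("effective batch (2 x 4):", effective_batch(2, 4))
```

With these assumed counts the two slices add up to about 65k rows, in line with the "~65k samples" figure in SETUP_MIAAI.md. The second call shows that `micro_batch_size: 2` with `gradient_accumulation_steps: 4` keeps the effective batch at 8.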
Author	SHA1	Message	Date
tocmo0nlord	c6da9b9e92	Update SETUP_MIAAI.md: add bare Ubuntu rebuild section (driver, packages, Ollama)	2026-05-13 21:33:02 +00:00
tocmo0nlord	c7c4885369	Update SETUP_MIAAI.md: pre-training checklist, Ollama stop/start, verify script, corrected training time	2026-05-13 21:19:15 +00:00
tocmo0nlord	981a13e110	Update human_chat_qlora.yml: working config for RTX 5080 (seq_len 2048, qlora, chat_template)	2026-05-13 18:59:19 +00:00
tocmo0nlord	74f2263ac7	Update SETUP_MIAAI.md: bitsandbytes sm_120 patch, OOM fixes, working training config	2026-05-13 18:58:51 +00:00
tocmo0nlord	8693a1f61b	fix Dockerfile-base-next: cuda 12.8.2, miniforge, sm_120	2026-05-13 14:37:01 +00:00
tocmo0nlord	71c6a56e7a	switch to HQQ quantization to bypass bitsandbytes sm_120 issue	2026-05-13 13:55:52 +00:00
tocmo0nlord	38adf5cd37	add trust_remote_code, explicit bfloat16 and bnb dtype settings	2026-05-13 13:32:46 +00:00
tocmo0nlord	3f29fa017b	replace Capybara with SlimOrca (compatible ShareGPT format)	2026-05-13 12:58:29 +00:00
tocmo0nlord	c02a76f132	fix field_messages mapping for Capybara/OpenHermes ShareGPT format	2026-05-13 12:56:03 +00:00
tocmo0nlord	b9ceebfe7e	fix deprecated type:sharegpt and flash_attention config keys	2026-05-13 12:52:25 +00:00
tocmo0nlord	e9a3fd483f	add human-like QLoRA training config for Llama 3.1 8B	2026-05-13 12:50:35 +00:00
tocmo0nlord	eadd15c960	note MAX_JOBS for flash-attn compile speed	2026-05-13 04:45:21 +00:00
tocmo0nlord	396ce4a9dd	add miaai environment setup guide	2026-05-13 04:16:03 +00:00