diff --git a/SETUP_MIAAI.md b/SETUP_MIAAI.md
new file mode 100644
index 000000000..79e6cfd35
--- /dev/null
+++ b/SETUP_MIAAI.md
@@ -0,0 +1,77 @@
+# Axolotl Setup — miaai (RTX 5080, CUDA 13.2)
+
+## System Info
+- GPU: NVIDIA RTX 5080 (16GB VRAM)
+- Driver: 580.126.09 — max CUDA 13.0 (nvcc from conda resolves to 13.2)
+- OS: Ubuntu (Python 3.13 system — do NOT use system Python for ML)
+- Axolotl branch: `activeblue/main`
+
+## One-time Setup
+
+### 1. Install Miniconda
+```bash
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
+bash miniconda.sh -b -p /opt/miniconda3
+/opt/miniconda3/bin/conda init bash
+source ~/.bashrc
+```
+
+### 2. Create Python 3.11 environment
+```bash
+conda create -n axolotl python=3.11 -y
+conda activate axolotl
+```
+
+### 3. Clone and sync repo with upstream
+```bash
+git clone https://git.activeblue.net/tocmo0nlord/axolotl.git
+cd axolotl
+git remote add upstream https://github.com/axolotl-ai-cloud/axolotl.git
+git fetch upstream
+git rebase upstream/main # keeps activeblue patches on top
+git push origin activeblue/main --force-with-lease
+```
+
+### 4. Install CUDA toolkit (needed to compile flash-attn)
+```bash
+conda install -y -c "nvidia/label/cuda-12.8.0" cuda-toolkit
+export CUDA_HOME=$CONDA_PREFIX
+export PATH=$CUDA_HOME/bin:$PATH
+```
+
+### 5. Install PyTorch — use cu132 (matches nvcc from conda)
+> NOTE: torchaudio has no cu132 wheel — skip it, not needed for LLM training
+```bash
+pip install torch torchvision --index-url https://download.pytorch.org/whl/cu132
+python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch.cuda.get_device_name(0))"
+```
+
+### 6. Install Axolotl
+```bash
+pip install -e "."
+pip install flash-attn --no-build-isolation
+```
+
+## Every Session (after first-time setup)
+```bash
+export PATH="/opt/miniconda3/bin:$PATH"
+conda activate axolotl
+export CUDA_HOME=$CONDA_PREFIX
+export PATH=$CUDA_HOME/bin:$PATH
+cd /home/tocmo0nlord/axolotl
+```
+
+## Run Training
+```bash
+axolotl train human_chat_qlora.yml
+```
+
+## Common Pitfalls Encountered
+| Problem | Cause | Fix |
+|---|---|---|
+| `externally-managed-environment` | System Python 3.13 blocks pip | Use conda env, never system pip |
+| `No module named torch` (flash-attn) | pip builds in isolated env | Use `--no-build-isolation` |
+| `CUDA_HOME not set` | CUDA toolkit not installed | `conda install cuda-toolkit` from nvidia channel |
+| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
+| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
+| `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
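Review note: the `CUDA version mismatch 13.2 vs 12.8` row in the pitfalls table could be caught before the flash-attn build rather than during it. Below is a minimal sketch of such a pre-flight check, assuming `nvcc` and `python` both come from the active `axolotl` conda environment. The function names (`nvcc_release`, `versions_match`, `check_cuda_versions`) are hypothetical helpers and are not part of Axolotl or this patch.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not part of Axolotl): compare the CUDA
# version reported by nvcc with the one PyTorch was built against.

# Read `nvcc --version` output on stdin and print the "major.minor" release.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only when both version strings are non-empty and identical.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Query nvcc and torch, print both versions, and return non-zero on mismatch.
check_cuda_versions() {
  local nvcc_ver torch_ver
  nvcc_ver="$(nvcc --version 2>/dev/null | nvcc_release)"
  torch_ver="$(python -c 'import torch; print(torch.version.cuda)' 2>/dev/null)"
  echo "nvcc:  ${nvcc_ver:-not found}"
  echo "torch: ${torch_ver:-not found}"
  if versions_match "$nvcc_ver" "$torch_ver"; then
    echo "OK: nvcc and torch agree on the CUDA version"
  else
    echo "MISMATCH: reinstall torch from the wheel index matching nvcc" >&2
    return 1
  fi
}
```

One possible use: source the file after `conda activate axolotl` and run `check_cuda_versions` before `pip install flash-attn --no-build-isolation`. The parsing and comparison are kept in separate functions so they can be exercised without a GPU.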