From eadd15c96094eeb7f6a186b931d652e5fbb1181e Mon Sep 17 00:00:00 2001
From: tocmo0nlord
Date: Wed, 13 May 2026 04:45:21 +0000
Subject: [PATCH] note MAX_JOBS for flash-attn compile speed

---
 SETUP_MIAAI.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/SETUP_MIAAI.md b/SETUP_MIAAI.md
index 79e6cfd35..36d1a8db9 100644
--- a/SETUP_MIAAI.md
+++ b/SETUP_MIAAI.md
@@ -49,7 +49,12 @@ python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch
 ### 6. Install Axolotl
 ```bash
 pip install -e "."
-pip install flash-attn --no-build-isolation
+```
+
+> **flash-attn compiles CUDA kernels from source — takes 15–25 min on 10 cores of an i7-14700K.**
+> Set `MAX_JOBS` to the number of available CPU cores to parallelize and speed up compilation:
+```bash
+MAX_JOBS=10 pip install flash-attn --no-build-isolation
 ```
 
 ## Every Session (after first-time setup)
@@ -75,3 +80,4 @@ axolotl train human_chat_qlora.yml
 | `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
 | `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
 | `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
+| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=` before pip install |
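As a sketch of the advice this patch adds (not part of the patch itself), the core count can be detected instead of hard-coding `10`. This assumes GNU coreutils `nproc` is available; the commented `pip` line mirrors the command the patch introduces:

```shell
# Detect available CPU cores (assumption: GNU coreutils `nproc` is installed).
JOBS="$(nproc)"
echo "MAX_JOBS=${JOBS}"

# flash-attn's source build (PyTorch cpp_extension + ninja) reads MAX_JOBS to
# cap parallel compile jobs. Uncomment to run the actual install:
# MAX_JOBS="${JOBS}" pip install flash-attn --no-build-isolation
```

On memory-constrained machines a lower `MAX_JOBS` may be needed, since each parallel nvcc job consumes several GB of RAM.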