note MAX_JOBS for flash-attn compile speed

2026-05-13 04:45:21 +00:00
parent 396ce4a9dd
commit eadd15c960


@@ -49,7 +49,12 @@ python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch
### 6. Install Axolotl
```bash
pip install -e "."
pip install flash-attn --no-build-isolation
```
> **flash-attn compiles its CUDA kernels from source; the build takes 15-25 min on 10 cores of an i7-14700K.**
> Always set `MAX_JOBS` to the number of available CPU cores to parallelize and speed up compilation:
```bash
MAX_JOBS=10 pip install flash-attn --no-build-isolation
```
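If the core count isn't known in advance, it can be queried at install time — a minimal sketch, assuming a Linux shell where coreutils' `nproc` is available:
```bash
# nproc reports the CPUs available to this process (Linux coreutils);
# passing it as MAX_JOBS lets the flash-attn build use every core
MAX_JOBS=$(nproc) pip install flash-attn --no-build-isolation
```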
## Every Session (after first-time setup)
@@ -75,3 +80,4 @@ axolotl train human_chat_qlora.yml
| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
| `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
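
For the CUDA mismatch row above, a quick diagnostic is to print both versions side by side — a sketch, assuming `nvcc` and torch are on the current environment's PATH:
```bash
# Toolkit compiler version (e.g. "release 13.2") vs. the CUDA build torch ships with;
# if they differ, reinstall torch from the matching --index-url
nvcc --version | grep -o 'release [0-9.]*'
python -c "import torch; print('torch CUDA:', torch.version.cuda)"
```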