note MAX_JOBS for flash-attn compile speed

2026-05-13 04:45:21 +00:00
parent 396ce4a9dd
commit eadd15c960


@@ -49,7 +49,12 @@ python -c "import torch; print('CUDA:', torch.version.cuda); print('GPU:', torch
### 6. Install Axolotl
```bash
pip install -e "."
pip install flash-attn --no-build-isolation
```
> **flash-attn compiles its CUDA kernels from source; the build takes 15-25 min on 10 cores of an i7-14700K.**
> Always set `MAX_JOBS` to the number of available CPU cores to parallelize and speed up compilation:
```bash
MAX_JOBS=10 pip install flash-attn --no-build-isolation
```
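If the core count isn't known in advance, it can be queried at install time — a minimal sketch, assuming a Linux shell where coreutils' `nproc` is available:
```bash
# nproc reports the CPUs available to this process (Linux coreutils);
# passing it as MAX_JOBS lets the flash-attn build use every core
MAX_JOBS=$(nproc) pip install flash-attn --no-build-isolation
```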
## Every Session (after first-time setup)
@@ -75,3 +80,4 @@ axolotl train human_chat_qlora.yml
| `CUDA version mismatch 13.2 vs 12.8` | Conda nvcc is 13.2, torch was cu128 | Reinstall torch with `--index-url .../cu132` |
| `torchaudio` not found for cu132 | No cu132 wheel exists | Skip torchaudio — not needed |
| `src refspec main does not match` | Fork default branch is `activeblue/main` | `git push origin activeblue/main` |
| flash-attn compile is slow | Single-threaded by default | Set `MAX_JOBS=<cpu_count>` before pip install |
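
For the CUDA mismatch row above, a quick diagnostic is to print both versions side by side — a sketch, assuming `nvcc` and torch are on the current environment's PATH:
```bash
# Toolkit compiler version (e.g. "release 13.2") vs. the CUDA build torch ships with;
# if they differ, reinstall torch from the matching --index-url
nvcc --version | grep -o 'release [0-9.]*'
python -c "import torch; print('torch CUDA:', torch.version.cuda)"
```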