native support for modal cloud from CLI (#2237)

* native support for modal cloud from CLI
* do lm_eval in cloud too
* Fix the sub call to lm-eval
* lm_eval option to not post eval, and append not extend
* cache bust when using branch, grab sha of latest image tag, update lm-eval dep
* allow minimal yaml for lm eval
* include modal in requirements
* update link in README to include utm
* pr feedback
* use chat template
* revision support
* apply chat template as arg
* add wandb name support, allow explicit a100-40gb
* cloud is optional
* handle accidental setting of tasks with a single task str
* document the modal cloud yaml for clarity [skip ci]
* cli docs
* support spawn vs remote for lm-eval
* Add support for additional docker commands in modal image build
* cloud config shouldn't be a dir
* Update README.md

  Co-authored-by: Charles Frye <cfrye59@gmail.com>
* fix annotation args

---------

Co-authored-by: Charles Frye <cfrye59@gmail.com>
2025-01-30 11:34:02 -05:00
parent 268543a3be
commit 8779997ba5
12 changed files with 834 additions and 53 deletions
--- a/docs/cli.qmd
+++ b/docs/cli.qmd
@@ -0,0 +1,256 @@
+# Axolotl CLI Documentation
+
+The Axolotl CLI provides a streamlined interface for training and fine-tuning large language models. This guide covers
+the CLI commands, their usage, and common examples.
+
+### Table of Contents
+
+- Basic Commands
+- Command Reference
+  - fetch
+  - preprocess
+  - train
+  - inference
+  - merge-lora
+  - merge-sharded-fsdp-weights
+  - evaluate
+  - lm-eval
+- Legacy CLI Usage
+- Remote Compute with Modal Cloud
+  - Cloud Configuration
+  - Running on Modal Cloud
+  - Cloud Configuration Options
+
+
+### Basic Commands
+
+All Axolotl commands follow this general structure:
+
+```bash
+axolotl <command> [config.yml] [options]
+```
+
+The config file can be a local path or a URL pointing to a raw YAML file.
+
+### Command Reference
+
+#### fetch
+
+Downloads example configurations and deepspeed configs to your local machine.
+
+```bash
+# Get example YAML files
+axolotl fetch examples
+
+# Get deepspeed config files
+axolotl fetch deepspeed_configs
+
+# Specify custom destination
+axolotl fetch examples --dest path/to/folder
+```
+
+#### preprocess
+
+Preprocesses and tokenizes your dataset before training. This is recommended for large datasets.
+
+```bash
+# Basic preprocessing
+axolotl preprocess config.yml
+
+# Preprocessing with one GPU
+CUDA_VISIBLE_DEVICES="0" axolotl preprocess config.yml
+
+# Debug mode to see processed examples
+axolotl preprocess config.yml --debug
+
+# Debug with limited examples
+axolotl preprocess config.yml --debug --debug-num-examples 5
+```
+
+Configuration options:
+
+```yaml
+dataset_prepared_path: Local folder for saving preprocessed data
+push_dataset_to_hub: HuggingFace repo to push preprocessed data (optional)
+```
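+
+As an illustrative sketch, these options might appear in `config.yml` as follows. The folder and repo name are placeholder values chosen for this example, not defaults:
+
+```yaml
+# Placeholder values for illustration only
+dataset_prepared_path: ./prepared_data             # local folder for the tokenized dataset
+push_dataset_to_hub: your-username/your-dataset    # optional; omit to keep the data local
+```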
+
+#### train
+
+Trains or fine-tunes a model using the configuration specified in your YAML file.
+
+```bash
+# Basic training
+axolotl train config.yml
+
+# Train and set/override specific options
+axolotl train config.yml \
+    --learning-rate 1e-4 \
+    --micro-batch-size 2 \
+    --num-epochs 3
+
+# Training without accelerate
+axolotl train config.yml --no-accelerate
+
+# Resume training from checkpoint
+axolotl train config.yml --resume-from-checkpoint path/to/checkpoint
+```
+
+#### inference
+
+Runs inference using your trained model in either CLI or Gradio interface mode.
+
+```bash
+# CLI inference with LoRA
+axolotl inference config.yml --lora-model-dir="./outputs/lora-out"
+
+# CLI inference with full model
+axolotl inference config.yml --base-model="./completed-model"
+
+# Gradio web interface
+axolotl inference config.yml --gradio \
+    --lora-model-dir="./outputs/lora-out"
+
+# Inference with input from file
+cat prompt.txt | axolotl inference config.yml \
+    --base-model="./completed-model"
+```
+
+#### merge-lora
+
+Merges trained LoRA adapters into the base model.
+
+```bash
+# Basic merge
+axolotl merge-lora config.yml
+
+# Specify LoRA directory (usually used with checkpoints)
+axolotl merge-lora config.yml --lora-model-dir="./lora-output/checkpoint-100"
+
+# Merge using CPU (if out of GPU memory)
+CUDA_VISIBLE_DEVICES="" axolotl merge-lora config.yml
+```
+
+Configuration options:
+
+```yaml
+gpu_memory_limit: Limit GPU memory usage
+lora_on_cpu: Load LoRA weights on CPU
+```
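+
+A minimal sketch of how these options might be set when a merge runs out of GPU memory. The values and the memory-limit format are assumptions for illustration:
+
+```yaml
+gpu_memory_limit: 20GiB   # assumed format; caps GPU memory used during the merge
+lora_on_cpu: true         # load the LoRA weights on CPU instead of the GPU
+```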
+
+#### merge-sharded-fsdp-weights
+
+Merges sharded FSDP model checkpoints into a single combined checkpoint.
+
+```bash
+# Basic merge
+axolotl merge-sharded-fsdp-weights config.yml
+```
+
+#### evaluate
+
+Evaluates a model's performance using metrics specified in the config.
+
+```bash
+# Basic evaluation
+axolotl evaluate config.yml
+```
+
+#### lm-eval
+
+Runs the LM Evaluation Harness on your model.
+
+```bash
+# Basic evaluation
+axolotl lm-eval config.yml
+
+# Evaluate specific tasks
+axolotl lm-eval config.yml --tasks arc_challenge,hellaswag
+```
+
+Configuration options:
+
+```yaml
+lm_eval_tasks: List of tasks to evaluate
+lm_eval_batch_size: Batch size for evaluation
+output_dir: Directory to save evaluation results
+```
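+
+For example, the task selection from the command above could instead be stored in `config.yml`. This is a sketch: the task names are taken from the example above, while the batch size and output directory are assumed values:
+
+```yaml
+lm_eval_tasks:
+  - arc_challenge
+  - hellaswag
+lm_eval_batch_size: 8                # assumed value
+output_dir: ./outputs/lm-eval-out    # assumed path
+```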
+
+### Legacy CLI Usage
+
+While the new Click-based CLI is preferred, Axolotl still supports the legacy module-based CLI:
+
+```bash
+# Preprocess
+python -m axolotl.cli.preprocess config.yml
+
+# Train
+accelerate launch -m axolotl.cli.train config.yml
+
+# Inference
+accelerate launch -m axolotl.cli.inference config.yml \
+    --lora_model_dir="./outputs/lora-out"
+
+# Gradio interface
+accelerate launch -m axolotl.cli.inference config.yml \
+    --lora_model_dir="./outputs/lora-out" --gradio
+```
+
+### Remote Compute with Modal Cloud
+
+Axolotl supports running training and inference workloads on Modal cloud infrastructure. This is configured using a
+cloud YAML file alongside your regular Axolotl config.
+
+#### Cloud Configuration
+
+Create a cloud config YAML with your Modal settings:
+
+```yaml
+# cloud_config.yml
+provider: modal
+gpu: a100  # Supported: l40s, a100-40gb, a100-80gb, a10g, h100, t4, l4
+gpu_count: 1    # Number of GPUs to use
+timeout: 86400  # Maximum runtime in seconds (24 hours)
+branch: main    # Git branch to use (optional)
+
+volumes:        # Persistent storage volumes
+  - name: axolotl-cache
+    mount: /workspace/cache
+
+env:            # Environment variables
+  - WANDB_API_KEY
+  - HF_TOKEN
+```
+
+#### Running on Modal Cloud
+
+Commands that support the `--cloud` flag:
+
+```bash
+# Preprocess on cloud
+axolotl preprocess config.yml --cloud cloud_config.yml
+
+# Train on cloud
+axolotl train config.yml --cloud cloud_config.yml
+
+# Train without accelerate on cloud
+axolotl train config.yml --cloud cloud_config.yml --no-accelerate
+
+# Run lm-eval on cloud
+axolotl lm-eval config.yml --cloud cloud_config.yml
+```
+
+#### Cloud Configuration Options
+
+```yaml
+provider: Compute provider (currently only `modal` is supported)
+gpu: GPU type to use
+gpu_count: Number of GPUs (default: 1)
+memory: RAM in GB (default: 128)
+timeout: Maximum runtime in seconds
+timeout_preprocess: Preprocessing timeout
+branch: Git branch to use
+docker_tag: Custom Docker image tag
+volumes: List of persistent storage volumes
+env: Environment variables to pass
+secrets: Secrets to inject
+```
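+
+As a sketch, a cloud config combining several of these options might look like the following. All values are illustrative, and treating `timeout_preprocess` as seconds (like `timeout`) is an assumption:
+
+```yaml
+# cloud_config.yml (illustrative values)
+provider: modal
+gpu: h100
+gpu_count: 2
+memory: 128                # RAM in GB (the documented default)
+timeout: 43200             # 12 hours = 12 * 3600 seconds
+timeout_preprocess: 3600   # assumed to be in seconds (1 hour)
+branch: main
+```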