chore: update readme to be more clear (#1326) [skip ci]

NanoCode012
2024-02-27 03:32:13 +09:00
committed by GitHub
parent cc3cebfa70
commit c6b01e0f4a

151
README.md

@@ -22,7 +22,7 @@ Features:
- [Introduction](#axolotl) - [Introduction](#axolotl)
- [Supported Features](#axolotl-supports) - [Supported Features](#axolotl-supports)
- [Quickstart](#quickstart-) - [Quickstart](#quickstart-)
- [Installation](#installation) - [Environment](#environment)
- [Docker](#docker) - [Docker](#docker)
- [Conda/Pip venv](#condapip-venv) - [Conda/Pip venv](#condapip-venv)
- [Cloud GPU](#cloud-gpu) - Latitude.sh, RunPod - [Cloud GPU](#cloud-gpu) - Latitude.sh, RunPod
@@ -87,25 +87,20 @@ Features:
| phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ | | phi | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
| RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ | | RWKV | ✅ | ❓ | ❓ | ❓ | ❓ | ❓ | ❓ |
| Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ | | Qwen | ✅ | ✅ | ✅ | ❓ | ❓ | ❓ | ❓ |
| Gemma | ✅ | ✅ | ✅ | ❓ | ❓ | ✅ | ❓ |
✅: supported
❌: not supported
❓: untested
## Quickstart ⚡ ## Quickstart ⚡
Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task. Get started with Axolotl in just a few steps! This quickstart guide will walk you through setting up and running a basic fine-tuning task.
**Requirements**: Python >=3.9 and PyTorch >=2.0. **Requirements**: Python >=3.9 and PyTorch >=2.1.1.
`pip3 install "axolotl[flash-attn,deepspeed] @ git+https://github.com/OpenAccess-AI-Collective/axolotl"` `pip3 install "axolotl[flash-attn,deepspeed] @ git+https://github.com/OpenAccess-AI-Collective/axolotl"`
### For developers
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
```
### Usage ### Usage
```bash ```bash
# preprocess datasets - optional but recommended # preprocess datasets - optional but recommended
@@ -127,13 +122,14 @@ accelerate launch -m axolotl.cli.inference examples/openllama-3b/lora.yml \
accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/openllama-3b/lora.yml
``` ```
## Installation ## Advanced Setup
### Environment ### Environment
#### Docker #### Docker
```bash ```bash
docker run --gpus '"all"' --rm -it winglian/axolotl:main-py3.10-cu118-2.0.1 docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest
``` ```
Or run on the current files for development: Or run on the current files for development:
@@ -152,7 +148,7 @@ accelerate launch -m axolotl.cli.train https://raw.githubusercontent.com/OpenAcc
A more powerful Docker command to run would be this: A more powerful Docker command to run would be this:
```bash ```bash
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1 docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
``` ```
It additionally: It additionally:
@@ -242,15 +238,18 @@ Please use WSL or Docker!
#### Launching on public clouds via SkyPilot #### Launching on public clouds via SkyPilot
To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html): To launch on GPU instances (both on-demand and spot instances) on 7+ clouds (GCP, AWS, Azure, OCI, and more), you can use [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html):
```bash ```bash
pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds pip install "skypilot-nightly[gcp,aws,azure,oci,lambda,kubernetes,ibm,scp]" # choose your clouds
sky check sky check
``` ```
Get the [example YAMLs](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl) of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`: Get the [example YAMLs](https://github.com/skypilot-org/skypilot/tree/master/llm/axolotl) of using Axolotl to finetune `mistralai/Mistral-7B-v0.1`:
``` ```
git clone https://github.com/skypilot-org/skypilot.git git clone https://github.com/skypilot-org/skypilot.git
cd skypilot/llm/axolotl cd skypilot/llm/axolotl
``` ```
Use one command to launch: Use one command to launch:
```bash ```bash
# On-demand # On-demand
@@ -260,32 +259,33 @@ HF_TOKEN=xx sky launch axolotl.yaml --env HF_TOKEN
HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET HF_TOKEN=xx BUCKET=<unique-name> sky spot launch axolotl-spot.yaml --env HF_TOKEN --env BUCKET
``` ```
### Dataset ### Dataset
Axolotl supports a variety of dataset formats. Below are some of the formats you can use. Axolotl supports a variety of dataset formats. Below are some of the formats you can use.
Have dataset(s) in one of the following formats (JSONL recommended): Have dataset(s) in one of the following formats (JSONL recommended):
- `alpaca`: instruction; input(optional) #### Pretraining
```json
{"instruction": "...", "input": "...", "output": "..."}
```
- `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: `system` to override default system prompt)
```json
{"conversations": [{"from": "...", "value": "..."}]}
```
- `llama-2`: the json is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
```yml
datasets:
- path: <your-path>
type: sharegpt
conversation: llama-2
```
- `completion`: raw corpus - `completion`: raw corpus
```json ```json
{"text": "..."} {"text": "..."}
``` ```
Note: Axolotl usually loads the entire dataset into memory, which can be a problem for large datasets. Use the following config to enable streaming:
```yaml
pretraining_dataset: # hf path only
```
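To make the memory point concrete, here is a minimal sketch in plain Python of writing `completion` records and reading them back one at a time. It illustrates the general idea of streaming and is not Axolotl's implementation; the helper names `write_completion_jsonl` and `stream_texts` and the file name `corpus.jsonl` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def write_completion_jsonl(path: Path, texts: list[str]) -> None:
    """Write one {"text": ...} record per line (the `completion` format)."""
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def stream_texts(path: Path) -> Iterator[str]:
    """Yield one document at a time instead of loading the whole file."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)["text"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus.jsonl"  # hypothetical file name
        write_completion_jsonl(corpus, ["first document", "second document"])
        for text in stream_texts(corpus):
            print(text)
```

Because `stream_texts` is a generator, only one line is held in memory at a time, which is the property that matters for large corpora.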
#### Supervised finetuning
##### Instruction
- `alpaca`: instruction; input(optional)
```json
{"instruction": "...", "input": "...", "output": "..."}
```
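A short sketch of producing and checking `alpaca` records with the standard library follows. The helper names are hypothetical, and treating `instruction` and `output` as required keys is an assumption based on the format description above, which marks only `input` as optional.

```python
import json

# Assumption: only "input" is optional, so these two keys must be present.
REQUIRED_KEYS = ("instruction", "output")


def make_alpaca_record(instruction: str, output: str, input_text: str = "") -> dict:
    """Build one record in the alpaca layout shown above."""
    return {"instruction": instruction, "input": input_text, "output": output}


def validate_alpaca_line(line: str) -> dict:
    """Parse one JSONL line and check that the required keys exist."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return record


if __name__ == "__main__":
    line = json.dumps(make_alpaca_record("Translate to French", "Bonjour", "Hello"))
    print(validate_alpaca_line(line))
```

Running a check like this over every line before training can catch malformed rows early, before any preprocessing step.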
<details> <details>
<summary>See other formats</summary> <summary>See other formats</summary>
@@ -362,14 +362,28 @@ Have dataset(s) in one of the following format (JSONL recommended):
```json ```json
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."} {"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
``` ```
- `pygmalion`: pygmalion
```json
{"conversations": [{"role": "...", "value": "..."}]}
```
- `metharme`: instruction, adds additional eos tokens - `metharme`: instruction, adds additional eos tokens
```json ```json
{"prompt": "...", "generation": "..."} {"prompt": "...", "generation": "..."}
``` ```
</details>
##### Conversation
- `sharegpt`: conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)
```json
{"conversations": [{"from": "...", "value": "..."}]}
```
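As an illustration of the `sharegpt` layout, the sketch below builds one record from a list of turns, placing an optional `system` row first. The function name `make_sharegpt_record` is hypothetical, and rejecting roles other than `human`/`gpt` is a choice made for this example, not a documented Axolotl rule.

```python
import json
from typing import Optional

TURN_ROLES = {"human", "gpt"}


def make_sharegpt_record(turns: list[tuple[str, str]], system: Optional[str] = None) -> dict:
    """Build {"conversations": [...]} with an optional leading system row."""
    conversations = []
    if system is not None:
        # The first row with role `system` overrides the default system prompt.
        conversations.append({"from": "system", "value": system})
    for role, value in turns:
        if role not in TURN_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        conversations.append({"from": role, "value": value})
    return {"conversations": conversations}


if __name__ == "__main__":
    record = make_sharegpt_record(
        [("human", "Hi"), ("gpt", "Hello! How can I help?")],
        system="You are a concise assistant.",
    )
    print(json.dumps(record))
```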
<details>
<summary>See other formats</summary>
- `pygmalion`: pygmalion
```json
{"conversations": [{"role": "...", "value": "..."}]}
```
- `sharegpt.load_role`: conversations where `role` is used instead of `from` - `sharegpt.load_role`: conversations where `role` is used instead of `from`
```json ```json
{"conversations": [{"role": "...", "value": "..."}]} {"conversations": [{"role": "...", "value": "..."}]}
@@ -385,6 +399,8 @@ Have dataset(s) in one of the following format (JSONL recommended):
</details> </details>
Note: `type: sharegpt` enables an additional `conversation:` option that converts the data to one of many conversation formats. See the dataset section under [all yaml options](#all-yaml-options).
#### How to add custom prompts #### How to add custom prompts
For a dataset that is preprocessed for instruction purposes: For a dataset that is preprocessed for instruction purposes:
@@ -406,12 +422,16 @@ datasets:
format: "[INST] {instruction} [/INST]" format: "[INST] {instruction} [/INST]"
no_input_format: "[INST] {instruction} [/INST]" no_input_format: "[INST] {instruction} [/INST]"
``` ```
See full config options under [all yaml options](#all-yaml-options).
#### How to use your custom pretokenized dataset #### How to use your custom pretokenized dataset
- Do not pass a `type:` - Do not pass a `type:`
- Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels` - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
```yaml
- path: ...
```
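The following sketch shows what a row with exactly those three columns might look like. The token ids are made up, the helper names are hypothetical, and masking prompt positions with `-100` is an assumption borrowed from the common Hugging Face/PyTorch convention for labels ignored by the loss; whether to mask the prompt at all is up to you.

```python
import json

IGNORE_INDEX = -100  # assumption: label value that the loss function ignores
EXPECTED_COLUMNS = {"input_ids", "attention_mask", "labels"}


def build_row(prompt_ids: list[int], completion_ids: list[int]) -> dict:
    """Concatenate prompt and completion ids; train only on the completion."""
    input_ids = prompt_ids + completion_ids
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": [IGNORE_INDEX] * len(prompt_ids) + completion_ids,
    }


def check_row(row: dict) -> None:
    """Raise if the columns are not exactly the expected three, or lengths differ."""
    if set(row) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {sorted(row)}")
    if len({len(row[column]) for column in EXPECTED_COLUMNS}) != 1:
        raise ValueError("all columns must have the same length")


if __name__ == "__main__":
    row = build_row([1, 15, 27], [42, 2])  # made-up token ids
    check_row(row)
    print(json.dumps(row))
```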
### Config ### Config
@@ -425,22 +445,18 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
- dataset - dataset
```yaml ```yaml
sequence_len: 2048 # max token length for prompt
# huggingface repo
datasets: datasets:
# huggingface repo
- path: vicgalle/alpaca-gpt4 - path: vicgalle/alpaca-gpt4
type: alpaca # format from earlier type: alpaca
# huggingface repo with specific configuration/subset # huggingface repo with specific configuration/subset
datasets:
- path: EleutherAI/pile - path: EleutherAI/pile
name: enron_emails name: enron_emails
type: completion # format from earlier type: completion # format from earlier
field: text # Optional[str] default: text, field to use for completion data field: text # Optional[str] default: text, field to use for completion data
# huggingface repo with multiple named configurations/subsets # huggingface repo with multiple named configurations/subsets
datasets:
- path: bigcode/commitpackft - path: bigcode/commitpackft
name: name:
- ruby - ruby
@@ -448,34 +464,29 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
- typescript - typescript
type: ... # unimplemented custom format type: ... # unimplemented custom format
# fastchat conversation # fastchat conversation
# See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py # See 'conversation' options: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
datasets:
- path: ... - path: ...
type: sharegpt type: sharegpt
conversation: chatml conversation: chatml # default: vicuna_v1.1
# local # local
datasets:
- path: data.jsonl # or json - path: data.jsonl # or json
ds_type: json # see other options below ds_type: json # see other options below
type: alpaca type: alpaca
# dataset with splits, but no train split # dataset with splits, but no train split
dataset:
- path: knowrohit07/know_sql - path: knowrohit07/know_sql
type: context_qa.load_v2 type: context_qa.load_v2
train_on_split: validation train_on_split: validation
# loading from s3 or gcs # loading from s3 or gcs
# s3 creds will be loaded from the system default and gcs only supports public access # s3 creds will be loaded from the system default and gcs only supports public access
dataset:
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs. - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
... ...
# Loading Data From a Public URL # Loading Data From a Public URL
# - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly. # - The file format is `json` (which includes `jsonl`) by default. For different formats, adjust the `ds_type` option accordingly.
dataset:
- path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP. - path: https://some.url.com/yourdata.jsonl # The URL should be a direct link to the file you wish to load. URLs must use HTTPS protocol, not HTTP.
ds_type: json # this is the default, see other options below. ds_type: json # this is the default, see other options below.
``` ```
@@ -484,9 +495,11 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
```yaml ```yaml
load_in_4bit: true load_in_4bit: true
load_in_8bit: true load_in_8bit: true
bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically. bf16: auto # require >=ampere, auto will detect if your GPU supports this and choose automatically.
fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32 fp16: # leave empty to use fp16 when bf16 is 'auto'. set to false if you want to fallback to fp32
tf32: true # require >=ampere tf32: true # require >=ampere
bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision) bfloat16: true # require >=ampere, use instead of bf16 when you don't want AMP (automatic mixed precision)
float16: true # use instead of fp16 when you don't want AMP float16: true # use instead of fp16 when you don't want AMP
``` ```
@@ -494,7 +507,7 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
- lora - lora
```yaml ```yaml
adapter: lora # qlora or leave blank for full finetune adapter: lora # 'qlora' or leave blank for full finetune
lora_r: 8 lora_r: 8
lora_alpha: 16 lora_alpha: 16
lora_dropout: 0.05 lora_dropout: 0.05
@@ -503,9 +516,9 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
- v_proj - v_proj
``` ```
<details> <details id="all-yaml-options">
<summary>All yaml options (click me)</summary> <summary>All yaml options (click to expand)</summary>
```yaml ```yaml
# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files # This is the huggingface model that contains *.pt, *.safetensors, or *.bin files
@@ -535,12 +548,13 @@ tokenizer_legacy:
# This is reported to improve training speed on some models # This is reported to improve training speed on some models
resize_token_embeddings_to_32x: resize_token_embeddings_to_32x:
# (Internal use only)
# Used to identify which model family the model is based on # Used to identify which model family the model is based on
is_falcon_derived_model: is_falcon_derived_model:
is_llama_derived_model: is_llama_derived_model:
is_qwen_derived_model:
# Please note that if you set this to true, `padding_side` will be set to "left" by default # Please note that if you set this to true, `padding_side` will be set to "left" by default
is_mistral_derived_model: is_mistral_derived_model:
is_qwen_derived_model:
# optional overrides to the base model configuration # optional overrides to the base model configuration
model_config_overrides: model_config_overrides:
@@ -633,7 +647,7 @@ test_datasets:
data_files: data_files:
- /workspace/data/eval.jsonl - /workspace/data/eval.jsonl
# use RL training: dpo, ipo, kto_pair # use RL training: 'dpo', 'ipo', 'kto_pair'
rl: rl:
# Saves the desired chat template to the tokenizer_config.json for easier inferencing # Saves the desired chat template to the tokenizer_config.json for easier inferencing
@@ -653,7 +667,7 @@ dataset_processes: # defaults to os.cpu_count() if not set
# Only needed if cached dataset is taking too much storage # Only needed if cached dataset is taking too much storage
dataset_keep_in_memory: dataset_keep_in_memory:
# push checkpoints to hub # push checkpoints to hub
hub_model_id: # repo path to push finetuned model hub_model_id: # private repo path to push finetuned model
# how to push checkpoints to hub # how to push checkpoints to hub
# https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy # https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/trainer#transformers.TrainingArguments.hub_strategy
hub_strategy: hub_strategy:
@@ -1100,7 +1114,7 @@ Please use `--sample_packing False` if you have it on and receive the error simi
### Merge LORA to base ### Merge LORA to base
The following command will merge your LORA adapter with your base model. You can optionally pass the argument `--lora_model_dir` to specify the directory where your LORA adapter was saved; otherwise, it is inferred from `output_dir` in your axolotl config file. The merged model is saved in the sub-directory `{lora_model_dir}/merged`. The following command will merge your LORA adapter with your base model. You can optionally pass the argument `--lora_model_dir` to specify the directory where your LORA adapter was saved; otherwise, it is inferred from `output_dir` in your axolotl config file. The merged model is saved in the sub-directory `{lora_model_dir}/merged`.
```bash ```bash
python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model" python3 -m axolotl.cli.merge_lora your_config.yml --lora_model_dir="./completed-model"
@@ -1161,7 +1175,7 @@ If you decode a prompt constructed by axolotl, you might see spaces between toke
1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer. 1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string. 2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
3. Make sure the inference string from #2 looks **exactly** like the data you fine tuned on from #1, including spaces and new lines. If they aren't the same adjust your inference server accordingly. 3. Make sure the inference string from #2 looks **exactly** like the data you fine tuned on from #1, including spaces and new lines. If they aren't the same, adjust your inference server accordingly.
4. As an additional troubleshooting step, you can look at the token ids between 1 and 2 to make sure they are identical. 4. As an additional troubleshooting step, you can look at the token ids between 1 and 2 to make sure they are identical.
Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example. Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
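Step 4 of the checklist above can be sketched as a small comparison helper. It assumes you have already obtained both token id sequences (from the preprocessed training data and from your inference server); the function name `first_mismatch` and the example ids are hypothetical.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_mismatch(train_ids: Sequence[int], infer_ids: Sequence[int]) -> Optional[int]:
    """Return the first index where the sequences differ, or None if identical."""
    # zip_longest pads the shorter sequence with None, so a length
    # difference is reported at the index where the shorter one ends.
    for index, (train_id, infer_id) in enumerate(zip_longest(train_ids, infer_ids)):
        if train_id != infer_id:
            return index
    return None


if __name__ == "__main__":
    train = [1, 733, 16289]          # made-up ids from the training prompt
    infer = [1, 733, 28705, 16289]   # made-up ids with one extra token
    print(first_mismatch(train, infer))
```

Once you know the index of the first differing token, decoding the ids around it usually shows whether the cause is a stray space, a missing newline, or a different special token.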
@@ -1208,11 +1222,20 @@ PRs are **greatly welcome**!
Please run the commands below to set up the environment: Please run the commands below to set up the environment:
```bash ```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'
pip3 install -r requirements-dev.txt -r requirements-tests.txt pip3 install -r requirements-dev.txt -r requirements-tests.txt
pre-commit install pre-commit install
# test # test
pytest tests/ pytest tests/
# optional: run against all files
pre-commit run --all-files
``` ```
Thanks to all of our contributors to date. Help drive open source AI progress forward by contributing to Axolotl. Thanks to all of our contributors to date. Help drive open source AI progress forward by contributing to Axolotl.