Add docs (#947)
* move section * update README * update README * update README * update README * update README * Update README.md Co-authored-by: Wing Lian <wing.lian@gmail.com> --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>
This commit is contained in:
39
README.md
39
README.md
@@ -36,7 +36,9 @@ Features:
|
|||||||
- [Train](#train)
|
- [Train](#train)
|
||||||
- [Inference](#inference)
|
- [Inference](#inference)
|
||||||
- [Merge LORA to Base](#merge-lora-to-base)
|
- [Merge LORA to Base](#merge-lora-to-base)
|
||||||
|
- [Special Tokens](#special-tokens)
|
||||||
- [Common Errors](#common-errors-)
|
- [Common Errors](#common-errors-)
|
||||||
|
- [Tokenization Mismatch b/w Training & Inference](#tokenization-mismatch-bw-inference--training)
|
||||||
- [Need Help?](#need-help-)
|
- [Need Help?](#need-help-)
|
||||||
- [Badge](#badge-)
|
- [Badge](#badge-)
|
||||||
- [Community Showcase](#community-showcase)
|
- [Community Showcase](#community-showcase)
|
||||||
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
|
|||||||
```json
|
```json
|
||||||
{"conversations": [{"from": "...", "value": "..."}]}
|
{"conversations": [{"from": "...", "value": "..."}]}
|
||||||
```
|
```
|
||||||
|
- `llama-2`: the json is the same format as `sharegpt` above, with the following config (see the [config section](#config) for more details)
|
||||||
|
```yml
|
||||||
|
datasets:
|
||||||
|
- path: <your-path>
|
||||||
|
type: sharegpt
|
||||||
|
conversation: llama-2
|
||||||
|
```
|
||||||
- `completion`: raw corpus
|
- `completion`: raw corpus
|
||||||
```json
|
```json
|
||||||
{"text": "..."}
|
{"text": "..."}
|
||||||
@@ -970,6 +979,22 @@ wandb_name:
|
|||||||
wandb_log_model:
|
wandb_log_model:
|
||||||
```
|
```
|
||||||
|
|
||||||
|
##### Special Tokens
|
||||||
|
|
||||||
|
It is important to have special tokens like delimiters, end-of-sequence, beginning-of-sequence in your tokenizer's vocubulary. This will help you avoid tokenization issues and help your model train better. You can do this in axolotl like this:
|
||||||
|
|
||||||
|
```yml
|
||||||
|
special_tokens:
|
||||||
|
bos_token: "<s>"
|
||||||
|
eos_token: "</s>"
|
||||||
|
unk_token: "<unk>"
|
||||||
|
tokens: # these are delimiters
|
||||||
|
- "<|im_start|>"
|
||||||
|
- "<|im_end|>"
|
||||||
|
```
|
||||||
|
|
||||||
|
When you include these tokens in your axolotl config, axolotl adds these tokens to the tokenizer's vocabulary.
|
||||||
|
|
||||||
### Inference
|
### Inference
|
||||||
|
|
||||||
Pass the appropriate flag to the train command:
|
Pass the appropriate flag to the train command:
|
||||||
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
|
|||||||
|
|
||||||
See the [NCCL](docs/nccl.md) guide.
|
See the [NCCL](docs/nccl.md) guide.
|
||||||
|
|
||||||
|
|
||||||
|
### Tokenization Mismatch b/w Inference & Training
|
||||||
|
|
||||||
|
For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings. The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
|
||||||
|
|
||||||
|
If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens. When you are starting out with a new format, you should always do the following:
|
||||||
|
|
||||||
|
1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
|
||||||
|
2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
|
||||||
|
3. Make sure the inference string from #2 looks **exactly** like the data you fine tuned on from #1, including spaces and new lines. If they aren't the same adjust your inference server accordingly.
|
||||||
|
4. As an additional troubleshooting step, you can look look at the token ids between 1 and 2 to make sure they are identical.
|
||||||
|
|
||||||
|
Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this. See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
|
||||||
|
|
||||||
## Need help? 🙋♂️
|
## Need help? 🙋♂️
|
||||||
|
|
||||||
Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
|
Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you
|
||||||
|
|||||||
Reference in New Issue
Block a user