Add docs (#947)

* move section * update README * update README * update README * update README * update README * Update README.md Co-authored-by: Wing Lian <wing.lian@gmail.com> --------- Co-authored-by: Wing Lian <wing.lian@gmail.com>
2023-12-13 14:22:52 -08:00
parent ef24342538
commit 712fd27b3f
1 changed files with 39 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -36,7 +36,9 @@ Features:
  - [Train](#train)
  - [Inference](#inference)
  - [Merge LORA to Base](#merge-lora-to-base)
  - [Special Tokens](#special-tokens)
 - [Common Errors](#common-errors-)
  - [Tokenization Mismatch b/w Inference & Training](#tokenization-mismatch-bw-inference--training)
 - [Need Help?](#need-help-)
 - [Badge](#badge-)
 - [Community Showcase](#community-showcase)
@@ -251,6 +253,13 @@ Have dataset(s) in one of the following format (JSONL recommended):
  ```json
  {"conversations": [{"from": "...", "value": "..."}]}
  ```
 - `llama-2`: the JSON format is the same as `sharegpt` above, used with the following config (see the [config section](#config) for more details)
    ```yml
    datasets:
      - path: <your-path>
        type: sharegpt
        conversation: llama-2
    ```
 - `completion`: raw corpus
  ```json
  {"text": "..."}
@@ -970,6 +979,22 @@ wandb_name:
 wandb_log_model:
 ```
 ##### Special Tokens
 It is important to have special tokens such as delimiters, end-of-sequence, and beginning-of-sequence tokens in your tokenizer's vocabulary.  This helps you avoid tokenization issues and helps your model train better.  You can do this in axolotl like this:
 ```yml
 special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
 tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
 ```
 When you include these tokens in your axolotl config, axolotl adds them to the tokenizer's vocabulary.
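To confirm that the tokens actually ended up in the vocabulary, a quick check along these lines may help. This is a minimal sketch, not part of axolotl: `missing_tokens` is a hypothetical helper, and it assumes a Hugging Face-style tokenizer object that exposes `get_vocab()`. A small stub tokenizer stands in for a real one so the example runs without downloading a model.

```python
def missing_tokens(tokenizer, tokens):
    """Return the tokens that are absent from the tokenizer's vocabulary.

    Assumes `tokenizer.get_vocab()` returns a dict of token string -> id,
    as Hugging Face tokenizers do. Hypothetical helper, not an axolotl API.
    """
    vocab = tokenizer.get_vocab()
    return [tok for tok in tokens if tok not in vocab]


class _StubTokenizer:
    """Tiny stand-in so the example runs without downloading a model."""

    def __init__(self, vocab):
        self._vocab = vocab

    def get_vocab(self):
        return dict(self._vocab)


# Made-up vocabulary: "<|im_end|>" was deliberately left out.
stub = _StubTokenizer({"<s>": 1, "</s>": 2, "<unk>": 0, "<|im_start|>": 32000})
wanted = ["<s>", "</s>", "<unk>", "<|im_start|>", "<|im_end|>"]
print(missing_tokens(stub, wanted))  # ['<|im_end|>']
```

With a real model you would pass the tokenizer saved alongside your fine-tuned model instead of the stub. An empty list means every token is present.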
 ### Inference
 Pass the appropriate flag to the train command:
@@ -1048,6 +1073,20 @@ It's safe to ignore it.
 See the [NCCL](docs/nccl.md) guide.
 ### Tokenization Mismatch b/w Inference & Training
 For many formats, Axolotl constructs prompts by concatenating token ids _after_ tokenizing strings.  The reason for concatenating token ids rather than operating on strings is to maintain precise accounting for attention masks.
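The effect of concatenating after tokenizing can be illustrated with a toy tokenizer. This is only a sketch under strong assumptions: `toy_encode` and `toy_decode` are hypothetical and merely imitate the word-boundary marker used by SentencePiece-style vocabularies. Real tokenizers, especially once special tokens are registered, may behave differently.

```python
MARK = "\u2581"  # the "▁" word-boundary marker used by SentencePiece-style vocabularies


def toy_encode(text):
    """Toy tokenizer: prefix the text with a boundary marker, then split on spaces."""
    marked = MARK + text.replace(" ", MARK)
    return [MARK + piece for piece in marked.split(MARK)[1:]]


def toy_decode(pieces):
    """Join pieces, turn markers back into spaces, and drop the leading space."""
    return "".join(pieces).replace(MARK, " ")[1:]


delimiter, role = "<|im_start|>", "user"

# Tokenize each part separately and concatenate the pieces (ids in a real tokenizer).
concatenated = toy_encode(delimiter) + toy_encode(role)
# Tokenize the already-joined string in one pass.
joined = toy_encode(delimiter + role)

print(repr(toy_decode(concatenated)))  # '<|im_start|> user'
print(repr(toy_decode(joined)))        # '<|im_start|>user'
```

In this toy setup the two paths decode to different strings, which is the kind of unexpected space described in this section.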
 If you decode a prompt constructed by axolotl, you might see spaces between tokens (or lack thereof) that you do not expect, especially around delimiters and special tokens.  When you are starting out with a new format, you should always do the following:
 1. Materialize some data using `python -m axolotl.cli.preprocess your_config.yml --debug`, and then decode the first few rows with your model's tokenizer.
 2. During inference, right before you pass a tensor of token ids to your model, decode these tokens back into a string.
 3. Make sure the inference string from step 2 looks **exactly** like the data you fine-tuned on from step 1, including spaces and new lines.  If they aren't the same, adjust your inference server accordingly.
 4. As an additional troubleshooting step, you can compare the token ids from steps 1 and 2 to make sure they are identical.
 Having misalignment between your prompts during training and inference can cause models to perform very poorly, so it is worth checking this.  See [this blog post](https://hamel.dev/notes/llm/05_tokenizer_gotchas.html) for a concrete example.
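The comparison in steps 3 and 4 can be automated with a small helper. This is a minimal sketch: `first_mismatch` is a hypothetical function, not an axolotl utility, and the token ids are made up for illustration.

```python
def first_mismatch(train_seq, infer_seq):
    """Return the index of the first position where the sequences differ, or None if identical.

    A length difference counts as a mismatch at the end of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip(train_seq, infer_seq)):
        if a != b:
            return i
    if len(train_seq) != len(infer_seq):
        return min(len(train_seq), len(infer_seq))
    return None


# Made-up token ids for illustration only.
train_ids = [1, 32000, 1404, 13, 15043]
infer_ids = [1, 32000, 29871, 1404, 13, 15043]

idx = first_mismatch(train_ids, infer_ids)
print(idx)  # 2
```

Because the helper only needs indexable sequences, it also works on the decoded strings from steps 1 and 2. Printing both strings with `repr()` around the reported index makes stray spaces and new lines visible.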
 ## Need help? 🙋♂️
 Join our [Discord server](https://discord.gg/HhrNrHJPRb) where we can help you