From 77764de0dd5002c922bd511c08490bd70ce557ca Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Thu, 13 Feb 2025 21:02:41 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- docs/dataset-formats/conversation.html | 4 +- docs/dataset-formats/index.html | 590 ++++++++++++++++----- docs/rlhf.html | 513 ++++++++++++++++-- listings.json | 13 - search.json | 102 +++- site_libs/quarto-listing/list.min.js | 2 - site_libs/quarto-listing/quarto-listing.js | 254 --------- sitemap.xml | 76 +-- 9 files changed, 1052 insertions(+), 504 deletions(-) delete mode 100644 listings.json delete mode 100644 site_libs/quarto-listing/list.min.js delete mode 100644 site_libs/quarto-listing/quarto-listing.js diff --git a/.nojekyll b/.nojekyll index 51f17ebc2..057f00e37 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -e7966439 \ No newline at end of file +64f14fad \ No newline at end of file diff --git a/docs/dataset-formats/conversation.html b/docs/dataset-formats/conversation.html index 2f74554a8..769cdcf85 100644 --- a/docs/dataset-formats/conversation.html +++ b/docs/dataset-formats/conversation.html @@ -368,7 +368,7 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin

sharegpt

-

IMPORTANT: ShareGPT is deprecated! Please see the chat_template section below.

+

IMPORTANT: ShareGPT is deprecated! Please see the chat_template section below.

pygmalion

@@ -388,7 +388,7 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
{"conversations": [{"role": "...", "content": "..."}]}
-

See config.qmd for full configs and supported templates.

+

See configs for full configs and supported templates.

Migrating from sharegpt

Most configs can be adapted as follows:

diff --git a/docs/dataset-formats/index.html b/docs/dataset-formats/index.html index 80b5cad80..b622b019c 100644 --- a/docs/dataset-formats/index.html +++ b/docs/dataset-formats/index.html @@ -6,7 +6,7 @@ - + Dataset Formats – Axolotl @@ -31,8 +65,6 @@ ul.task-list li input[type="checkbox"] { - - @@ -69,65 +101,7 @@ ul.task-list li input[type="checkbox"] { "search-label": "Search" } } - - - - - - @@ -350,8 +324,46 @@ window.Quarto = {
- +
+

For pre-training only, Axolotl splits texts that exceed the context length into multiple smaller prompts.

+
+ +
+

Pre-training from Hugging Face hub datasets

+

As an example, to train using a Hugging Face dataset hf_org/name, you can pass the following config:

+
pretraining_dataset: hf_org/name
+
+
+

Pre-training from local dataset files

+

Given a few corpus files: A.jsonl, B.jsonl, and C.jsonl, your config will look like the following:

+
pretraining_dataset:
+  - path: json
+    data_files:
+      - A.jsonl
+      - B.jsonl
+      - C.jsonl
+

While we recommend .jsonl, you can also use the other formats (csv, parquet, arrow, SQL, WebDataset) that are supported by datasets.load_dataset

+
+
+

Pre-training without streaming

+

In the rare case that the dataset is small and can be loaded entirely into memory, another approach to pre-training is to use the completion format. This means the entire dataset is pre-tokenized up front rather than tokenized on demand while streaming.

+

One benefit of this is that the tokenization can be performed separately on a CPU-only machine, and then transferred to a GPU machine for training to save costs.

+

From Hugging Face:

+
datasets:
+  - path: hf_org/name
+    type: completion
+

From local files (either example works):

+
datasets:
+  - path: A.jsonl
+    type: completion
+
+  - path: json
+    data_files: ["A.jsonl", "B.jsonl", "C.jsonl"]
+    type: completion
+
+
+

Pre-training dataset configuration tips

+
+

Setting max_steps

+

When using streaming for large datasets, Axolotl does not know in advance how large the dataset is and does not know when to stop.

+

Therefore, it is necessary to set max_steps: int in your config for pre-training to run, so that Axolotl knows when to stop training.

+

One step is equal to sequence_len * micro_batch_size * gradient_accumulation_steps * total_num_gpus tokens.
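As a back-of-envelope sketch of this arithmetic (all parameter values below are hypothetical examples, not recommendations):

```python
# Helper for choosing max_steps from a target token budget, per the
# formula above. All concrete numbers here are made-up examples.

def tokens_per_step(sequence_len, micro_batch_size,
                    gradient_accumulation_steps, total_num_gpus):
    """Tokens consumed by one optimizer step."""
    return (sequence_len * micro_batch_size
            * gradient_accumulation_steps * total_num_gpus)

def steps_for_budget(token_budget, **kwargs):
    """max_steps needed to roughly cover a target number of tokens."""
    return token_budget // tokens_per_step(**kwargs)

per_step = tokens_per_step(sequence_len=2048, micro_batch_size=2,
                           gradient_accumulation_steps=4, total_num_gpus=8)
print(per_step)  # 131072 tokens per step
print(steps_for_budget(1_000_000_000,
                       sequence_len=2048, micro_batch_size=2,
                       gradient_accumulation_steps=4, total_num_gpus=8))
```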

+
+
+

Group_by_length

+

It is recommended to leave this off when streaming from the Hugging Face hub, as it would require downloading the entire dataset, which can be very large.

+
+
+
+
+

Supervised fine-tuning (SFT)

+

Supervised fine-tuning is the process of training models to respond to an instruction or chat input.

+

As there are a wide variety of dataset formats, Axolotl tries to support a majority of the formats available in public datasets.

+

Axolotl provides four approaches for loading datasets; however, it is easier to work backwards from the dataset you have available to figure out which approach to use.

+

A flow chart is as follows:

+
  1. Do you already have the dataset tokenized? If yes, check Pre-Tokenized Dataset.
  2. Do you want to format the dataset yourself and manually choose each section to mask? If yes, check Template Free Dataset.
  3. Is your dataset in a “conversation” format, containing a list[messages]? If yes, check Conversation Dataset.
  4. Is your dataset in an “instruct” format, containing { instruction, response }? If yes, check Instruction Dataset.
+

If you went through the flow chart and did not find a format that matches, it is recommended to preprocess your dataset into one of the above or to open a GitHub Discussion.

+
+
+
+ +
+
+Tip +
+
+
+

You can mix and match within each approach or across approaches to train a model on a variety of datasets.

+
+
+
+

Pre-Tokenized Dataset

+

We suggest this approach when you want to bring your own tokenized dataset.

+

Axolotl expects the dataset to have three keys:
- input_ids: from tokenizing the formatted prompt
- attention_mask: for masking padding; if you don’t add padding, it is equal to len(input_ids) * [1]
- labels: the same as input_ids, except that any tokens you want to mask should have their labels set to -100
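A minimal sketch of one such pre-tokenized row; the token ids below are made up for illustration:

```python
# One pre-tokenized row with the three expected keys.
# Prompt tokens are masked with -100; response tokens keep their ids as labels.
# All ids here are hypothetical placeholders, not from a real tokenizer.
prompt_ids = [1, 15043]      # e.g. BOS + a prompt token (made-up ids)
response_ids = [22172, 2]    # e.g. a response token + EOS (made-up ids)

input_ids = prompt_ids + response_ids
attention_mask = [1] * len(input_ids)             # no padding, so all ones
labels = [-100] * len(prompt_ids) + response_ids  # mask the prompt portion

row = {"input_ids": input_ids,
       "attention_mask": attention_mask,
       "labels": labels}
print(row)
```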

+
+
+
+ +
+
+Tip +
+
+
+

Make sure to add BOS/EOS tokens to your prompt and mask it appropriately.

+
+
+

A config for this would look like:

+
datasets:
+  - path: A.jsonl
+    type:
+
+
+
+ +
+
+Note +
+
+
+

type: is empty!

+
+
+
+
+

Template Free Dataset

+

We recommend this approach when you want granular control over the prompt formatting, special tokens, and masking, whilst letting Axolotl handle the tokenization. This is very useful if your dataset has unique prompts that differ across samples, where one single general template wouldn’t suffice.

+

In the example below, you can see that there is no fixed structure. At the same time, it’s very flexible, as there are no constraints on how your prompt can look.

+
{
+    "segments": [
+        {
+            "label": true,
+            "text": "<s>Hello\n"
+        },
+        {
+            "label": true,
+            "text": "hi there!. "
+        },
+        {
+            "label": false,
+            "text": "goodbye "
+        },
+        {
+            "label": true,
+            "text": "farewell</s>"
+        }
+    ]
+}
+

Each prompt must have a key called segments, which is a list of { text, label } objects.

+
datasets:
+  - path: A.jsonl
+    type: input_output
+
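As a sketch of how the label flags drive masking, using a stand-in tokenizer (one made-up id per word) rather than a real one; encode_segments is a hypothetical helper, not Axolotl's implementation:

```python
# Sketch: how label=true/false in a segments row translates to masking.
# tokenize() is a stand-in that yields one made-up id per word.
def tokenize(text):
    return [hash(w) % 1000 for w in text.split()]  # placeholder ids

def encode_segments(segments):
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenize(seg["text"])
        input_ids.extend(ids)
        # label=false segments are masked out with -100
        labels.extend(ids if seg["label"] else [-100] * len(ids))
    return input_ids, labels

segments = [
    {"label": True,  "text": "<s>Hello\n"},
    {"label": False, "text": "goodbye "},
    {"label": True,  "text": "farewell</s>"},
]
ids, labels = encode_segments(segments)
print(labels.count(-100))  # 1: only the "goodbye" segment is masked
```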
+
+

Conversation Dataset

+

Conversation messages are a list of messages, each of which usually contains a role and a content key.

+
+
+
+ +
+
+Tip +
+
+
+

Fun fact: Axolotl refers to “chat” messages and conversation messages interchangeably because FastChat popularized the term through its widely used conversation method for formatting chat messages, which predates chat_templates.

+
+
+
+

What are chat_templates?

+

The current most popular and convenient method for inference is to use chat_templates for formatting prompts. Axolotl supports using chat_templates for training to ensure that the model performs in the same environment as in inference.

+

Here’s a quick rundown on chat_template: a chat_template is a Jinja2 template that formats a list of messages into a prompt.

+

An example of a prompt formatted into a popular template called ChatML can be seen below:

+

Single prompt (pretty-printed):

+
{
+    "messages": [
+        {
+            "role": "user",
+            "content": "Hi"
+        },
+        {
+            "role": "assistant",
+            "content": "How can I help you?"
+        },
+        {
+            "role": "user",
+            "content": "Can you add 3+5?"
+        },
+        {
+            "role": "assistant",
+            "content": "The answer is 8."
+        }
+    ]
+}
+

The ChatML template is as follows:

+
{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
+

The above prompt formatted into this template will result in:

+
<|im_start|>user
+Hi<|im_end|>
+<|im_start|>assistant
+How can I help you?<|im_end|>
+<|im_start|>user
+Can you add 3+5?<|im_end|>
+<|im_start|>assistant
+The answer is 8.<|im_end|>
+

By using delimiters (<|im_start|> and <|im_end|>), the template separates the different speakers, which helps the model identify which portion of the prompt belongs to whom.

+
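As a sketch, the ChatML formatting above can be reproduced in plain Python (without Jinja2); to_chatml is a hypothetical helper mirroring the template, not an Axolotl API:

```python
# Plain-Python equivalent of the ChatML Jinja2 template shown above.
def to_chatml(messages, add_generation_prompt=False):
    prompt = ""
    for m in messages:
        prompt += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn for inference-time generation.
        prompt += "<|im_start|>assistant\n"
    return prompt

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "How can I help you?"},
]
print(to_chatml(messages))
```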
+
+

Common Conversation Dataset formats

+

Older conversation datasets with the following format are colloquially called sharegpt datasets.

+
{"conversations": [{"from": "...", "value": "..."}]}
+

Newer conversation datasets usually follow the OpenAI format.

+
{"messages": [{"role": "...", "content": "..."}]}
+

Axolotl supports both formats, and also allows customizing any of the keys.

+
+
+

Chat Template Usage

+

To properly use this method, it is important to identify three things:

+
  1. Which chat_template will you use?
  2. What are the keys in your dataset, and what are the possible roles? For example, in OpenAI format, the keys are messages, role, and content, and the possible roles are system, user, and assistant.
  3. What do you want to mask? For instance, only assistant messages, only the last message, or nothing.
+
+
Choosing a chat_template
+

There are a lot of chat_templates out there. Axolotl supports the common ones: supported chat templates. For example, to use ChatML, it would be chat_template: chatml.

+

However, it is also possible to use the already configured template within the tokenizer by specifying chat_template: tokenizer_default. If you want a fallback (in case some tokenizer does not have it pre-configured), you can do chat_template: tokenizer_default_fallback_chatml to fallback to the ChatML template if a tokenizer template was not found.

+

One last but powerful approach is to bring your own template. This can be set via:

+
chat_template_jinja: # your template
+
+
+
Setting chat_template dataset keys
+

We currently default to OpenAI format for dataset keys, so if that’s your current dataset format, there’s nothing to do here.

+

If your dataset format is different, here are the keys you should check (with their defaults):

+
datasets:
+    ...
+    field_messages: messages
+    message_field_role: role
+    message_field_content: content
+

In some chat_templates (e.g. Gemma), the roles are hardcoded to user and assistant. Consequently, you may need to map the roles in your dataset to these. We currently have defaults that should work for common datasets, but if you get a KeyError, you will need to add a mapping for your roles. Here is an example of how that would look:

+
datasets:
+    ...
+    roles:
+      assistant:
+        - gpt
+        - model
+      user:
+        - human
+

In the example above, all gpt and model values are converted to assistant. All human values are converted to user.
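The mapping logic can be sketched as follows; normalize is a hypothetical helper for illustration, not an Axolotl API:

```python
# Sketch of the role normalization implied by the roles: mapping above.
roles = {"assistant": ["gpt", "model"], "user": ["human"]}

# Invert to raw-role -> canonical-role.
alias_to_role = {alias: canon
                 for canon, aliases in roles.items()
                 for alias in aliases}

def normalize(message):
    raw = message["role"]
    if raw not in alias_to_role and raw not in roles:
        raise KeyError(raw)  # roughly the KeyError mentioned above
    return {**message, "role": alias_to_role.get(raw, raw)}

print(normalize({"role": "gpt", "content": "hello"})["role"])   # assistant
print(normalize({"role": "human", "content": "hi"})["role"])    # user
```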

+
+
+
Handling masking
+

The common use case for chat_template is chat messages; therefore, it is common to mask all non-assistant messages. Assistant messages are the bot messages that you want the model to learn from.

+

To train on all assistant messages, you would set the following configs.

+
datasets:
+    ...
+    roles_to_train: ["assistant"]
+    train_on_eos: "turn"
+

With train_on_eos: turn, the EOS token is trained on only after assistant turns and masked after all other turns. The other options, all and last, choose which EOS tokens to train on.

+

If you want to train on both assistant and narrator roles, simply add narrator to roles_to_train. You would also need to add it to the roles mapping above.

+
datasets:
+    ...
+    roles_to_train: ["assistant", "narrator"]
+    roles:
+      assistant:
+        - gpt
+        - model
+      user:
+        - human
+      narrator: ["narrator"]
+
+
+
+

Applying chat_template

+

Once all the above steps are completed, you could combine all these configs together to form a bespoke configuration for your custom dataset. The final step would be to correctly set the EOS token in your config:

+
datasets:
+  - path: A.jsonl
+    type: chat_template
+
+    # step 1
+    chat_template: chatml
+
+    # step 2
+    field_messages: messages
+    message_field_role: role
+    message_field_content: content
+
+    roles:
+      assistant:
+        - gpt
+        - model
+        - assistant
+      user:
+        - human
+        - user
+
+    # step 3
+    roles_to_train: ["assistant"]
+    train_on_eos: "turn"
+
+special_tokens:
+  eos_token: <|im_end|>
+

If this config were to be applied to the sample dataset above, the output would look as such (which can be retrieved via axolotl preprocess config.yaml --debug):

+
<|im_start|>(-100, 128256) user(-100, 882)
+(-100, 198) Hi(-100, 13347) <|im_end|>(-100, 128257)
+(-100, 198) <|im_start|>(-100, 128256) assistant(-100, 78191)
+(-100, 198) How(4438, 4438)  can(649, 649)  I(358, 358)  help(1520, 1520)  you(499, 499) ?(30, 30) <|im_end|>(128257, 128257)
+(-100, 198) <|im_start|>(-100, 128256) user(-100, 882)
+(-100, 198) Can(-100, 6854)  you(-100, 499)  add(-100, 923)  (-100, 220) 3(-100, 18) +(-100, 10) 5(-100, 20) ?(-100, 30) <|im_end|>(-100, 128257)
+(-100, 198) <|im_start|>(-100, 128256) assistant(-100, 78191)
+(-100, 198) The(791, 791)  answer(4320, 4320)  is(374, 374)  (220, 220) 8(23, 23) .(13, 13) <|im_end|>(128257, 128257)
+(-100, 198)
+

The first number is the label and the second is the token_id. For example, -100 labels appear on non-assistant portions, meaning they are masked during training. For assistant portions, the label is the same as the token_id.
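A sketch of how to interpret a (label, token_id) stream like the debug output above; the pairs below are copied from a few tokens of the sample:

```python
# Pairs of (label, token_id) taken from the debug output above.
pairs = [(-100, 128256), (-100, 882), (4438, 4438), (649, 649), (128257, 128257)]

# Tokens with label -100 are masked; trained tokens have label == token_id.
trained = [tok for lab, tok in pairs if lab != -100]
masked  = [tok for lab, tok in pairs if lab == -100]

assert all(lab == tok for lab, tok in pairs if lab != -100)
print(len(trained), len(masked))  # 3 2
```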

+
+
+
+

Instruction Dataset

+

Instruction datasets are used to train instruction-following models and comprise a prompt, containing an instruction, and a single response. In contrast to chat datasets which may be multi-turn, instruct datasets are typically single-turn.

+

An example of a common format is Alpaca:

+
{"instruction": "...", "input": "...", "output": "..."}
+

Using those keys, a prompt can be built:

+
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+### Instruction:
+{instruction}
+
+### Input:
+{input}
+
+### Response:
+{output}
+

This can be configured as such:

+
datasets:
+  - path: A.jsonl
+    type: alpaca
+
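The Alpaca prompt construction above can be sketched in Python; the row values below are hypothetical:

```python
# Build the Alpaca prompt shown above from one dataset row.
ALPACA = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

row = {"instruction": "Add the numbers.", "input": "3 and 5", "output": "8"}
prompt = ALPACA.format(**row)
print(prompt)
```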

Axolotl supports many kinds of instruction datasets. All of them can be found here (https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/inst_tune.html) with their respective type and sample row format.

+
+

Custom Instruct Prompt Format

+

Due to the myriad possibilities of instruction formats, Axolotl allows customizing your own instruction format without having to dive into the code directly.

+

In the example below, a sample row is formatted into the mistral_v1 format.

+
{"input": "...", "output": "..."}
+
datasets:
+  - path: repo
+    type:
+      system_prompt: ""
+
+      field_system:
+      field_instruction: input
+      field_input:
+      field_output: output
+
+      # multi-line example with input
+      format: |-
+        [INST] {instruction} {input} [/INST]
+
+      # single-line example without input
+      no_input_format: "[INST] {instruction} [/INST]"
+

The config specifies that field_instruction is actually named input, and field_input is empty as there is no input in this sample. Generally, instruction can be thought of as the question to the model, input as additional context, and output as the response. Neither an input nor a system prompt is required. In the end, the most important part is to understand what format you want and how to customize it to your use case.

+
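As a sketch of how format and no_input_format would be applied, mirroring the config above; build_prompt is a hypothetical helper for illustration, not Axolotl's implementation:

```python
# format is used when the row has an input; no_input_format otherwise.
FMT = "[INST] {instruction} {input} [/INST]"
NO_INPUT_FMT = "[INST] {instruction} [/INST]"

def build_prompt(row, instruction_field="input", input_field=None):
    instruction = row[instruction_field]
    extra = row.get(input_field) if input_field else None
    if extra:
        return FMT.format(instruction=instruction, input=extra)
    return NO_INPUT_FMT.format(instruction=instruction)

# Matches the sample row above: field_instruction is the "input" key,
# and field_input is unset.
print(build_prompt({"input": "Add 3+5", "output": "8"}))  # [INST] Add 3+5 [/INST]
```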
+
+
+
+

Reinforcement Learning from Human Feedback (RLHF)

+

There are multiple RLHF methods, each with its own dataset requirements. Please see the RLHF datasets documentation for more detail.

+ + +
+ +