Reorganize Docs (#1468)
71
docs/dataset-formats/conversation.qmd
Normal file
@@ -0,0 +1,71 @@
---
title: Conversation
description: Conversation format for supervised fine-tuning.
order: 1
---

## Formats

### sharegpt

Conversations where `from` is `human` or `gpt`. Optionally, the first entry can use the role `system` to override the default system prompt.

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```
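As a rough illustration of the shape described above, the following sketch checks one JSONL line against the sharegpt layout. It is not part of Axolotl: `validate_sharegpt_row` and `ALLOWED_ROLES` are hypothetical names, and the role set is an assumption taken from the description (`human`, `gpt`, and an optional leading `system`).

```python
import json

# Assumption: these are the only roles expected in a sharegpt row.
ALLOWED_ROLES = {"system", "human", "gpt"}

def validate_sharegpt_row(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    row = json.loads(line)
    turns = row.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["'conversations' must be a non-empty list"]
    problems = []
    for i, turn in enumerate(turns):
        role = turn.get("from")
        if role not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected 'from' value {role!r}")
        elif role == "system" and i != 0:
            problems.append(f"turn {i}: 'system' is only expected as the first turn")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")
    return problems

good = json.dumps({"conversations": [
    {"from": "system", "value": "You are terse."},
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello."},
]})
bad = json.dumps({"conversations": [{"from": "user", "value": "Hi"}]})

print(validate_sharegpt_row(good))  # no problems expected
print(validate_sharegpt_row(bad))   # one problem: unexpected role
```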

Note: setting `type: sharegpt` opens a special config option, `conversation:`, that enables conversion to many Conversation types. See [the docs](../docs/config.qmd) for all config options.

### pygmalion

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

### sharegpt.load_role

Conversations where `role` is used instead of `from`.

```{.json filename="data.jsonl"}
{"conversations": [{"role": "...", "value": "..."}]}
```

### sharegpt.load_guanaco

Conversations where `from` is `prompter` or `assistant` instead of the default sharegpt roles.

```{.json filename="data.jsonl"}
{"conversations": [{"from": "...", "value": "..."}]}
```

### sharegpt_jokes

Creates a chat in which the bot is asked to tell a joke and then explain why the joke is funny.

```{.json filename="data.jsonl"}
{"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
```

## How to add custom prompts for instruction-tuning

For a dataset that is preprocessed for instruction purposes:

```{.json filename="data.jsonl"}
{"input": "...", "output": "..."}
```

You can use this example in your YAML config:

```{.yaml filename="config.yaml"}
datasets:
  - path: repo
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
      format: "[INST] {instruction} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```

See the full config options [here](../docs/config.qmd).
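To make the mapping concrete, here is a minimal sketch of how the `field_instruction`, `field_output`, and `format` settings above could combine for one row. It uses Python's `str.format` as a stand-in; Axolotl's own substitution logic is not shown in this document and may differ. `render_prompt` and the constants are hypothetical names introduced for illustration.

```python
# Values copied from the example config above.
FORMAT = "[INST] {instruction} [/INST]"
FIELD_INSTRUCTION = "input"   # dataset column used as the instruction
FIELD_OUTPUT = "output"       # dataset column used as the target text

def render_prompt(row: dict) -> tuple[str, str]:
    """Return (prompt, target) for one dataset row.

    Assumption: the {instruction} placeholder is filled with the column
    named by field_instruction, and the target is the column named by
    field_output.
    """
    prompt = FORMAT.format(instruction=row[FIELD_INSTRUCTION])
    return prompt, row[FIELD_OUTPUT]

prompt, target = render_prompt({"input": "What is 2 + 2?", "output": "4"})
print(prompt)  # [INST] What is 2 + 2? [/INST]
print(target)  # 4
```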
14
docs/dataset-formats/index.qmd
Normal file
@@ -0,0 +1,14 @@
---
title: Dataset Formats
description: Supported dataset formats.
listing:
  fields: [title, description]
  type: table
  sort-ui: false
  filter-ui: false
  max-description-length: 250
---

Axolotl supports a variety of dataset formats. JSONL is the recommended format. The schema of the JSONL depends on the task and the prompt template you wish to use. Instead of JSONL, you can also use a Hugging Face dataset with columns for each JSONL field.
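As a quick way to see which schema a JSONL file follows, the sketch below parses one JSON object per line and reports the fields that every row shares. It relies only on the Python standard library; `read_jsonl` and `shared_fields` are hypothetical helper names, and the sample rows are made up.

```python
import io
import json

def read_jsonl(stream):
    """Parse one JSON object per non-empty line."""
    rows = []
    for line in stream:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows

def shared_fields(rows):
    """Return the field names present in every row, sorted."""
    if not rows:
        return []
    common = set(rows[0])
    for row in rows[1:]:
        common &= set(row)
    return sorted(common)

# Hypothetical in-memory file standing in for data.jsonl.
sample = io.StringIO(
    '{"instruction": "Add 2 and 3", "input": "", "output": "5"}\n'
    '{"instruction": "Name a color", "output": "blue"}\n'
)
rows = read_jsonl(sample)
print(shared_fields(rows))  # ['instruction', 'output']
```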

The formats below are organized by task:
165
docs/dataset-formats/inst_tune.qmd
Normal file
@@ -0,0 +1,165 @@
---
title: Instruction Tuning
description: Instruction tuning formats for supervised fine-tuning.
order: 2
---

## alpaca

instruction; input (optional)

```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "output": "..."}
```

## jeopardy

question and answer

```{.json filename="data.jsonl"}
{"question": "...", "category": "...", "answer": "..."}
```

## oasst

instruction

```{.json filename="data.jsonl"}
{"INSTRUCTION": "...", "RESPONSE": "..."}
```

## gpteacher

instruction; input (optional)

```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "response": "..."}
```

## reflection

instruction with reflect; input (optional)

```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "output": "...", "reflection": "...", "corrected": "..."}
```

## explainchoice

question, choices, (solution OR explanation)

```{.json filename="data.jsonl"}
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
```

## concisechoice

question, choices, (solution OR explanation)

```{.json filename="data.jsonl"}
{"question": "...", "choices": ["..."], "solution": "...", "explanation": "..."}
```

## summarizetldr

article and summary

```{.json filename="data.jsonl"}
{"article": "...", "summary": "..."}
```

## alpaca_chat

basic instruct for alpaca chat

```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "response": "..."}
```

## alpaca_chat.load_qa

question and answer for alpaca chat

```{.json filename="data.jsonl"}
{"question": "...", "answer": "..."}
```

## alpaca_chat.load_concise

question and answer for alpaca chat, for concise answers

```{.json filename="data.jsonl"}
{"instruction": "...", "input": "...", "response": "..."}
```

## alpaca_chat.load_camel_ai

question and answer for alpaca chat, for load_camel_ai

```{.json filename="data.jsonl"}
{"message_1": "...", "message_2": "..."}
```

## alpaca_w_system.load_open_orca

support for open orca datasets with included system prompts, instruct

```{.json filename="data.jsonl"}
{"system_prompt": "...", "question": "...", "response": "..."}
```

## context_qa

in context question answering from an article

```{.json filename="data.jsonl"}
{"article": "...", "question": "...", "answer": "..."}
```

## context_qa.load_v2

in context question answering (alternate)

```{.json filename="data.jsonl"}
{"context": "...", "question": "...", "answer": "..."}
```

## context_qa.load_404

in context question answering from an article, with default response for no answer from context

```{.json filename="data.jsonl"}
{"article": "...", "unanswerable_question": "..."}
```

## creative_acr.load_answer

instruction and revision

```{.json filename="data.jsonl"}
{"instruction": "...", "revision": "..."}
```

## creative_acr.load_critique

critique

```{.json filename="data.jsonl"}
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "..."}
```

## creative_acr.load_revise

critique and revise

```{.json filename="data.jsonl"}
{"scores": "...", "critiques": "...", "instruction": "...", "answer": "...", "revision": "..."}
```

## metharme

instruction, adds additional eos tokens

```{.json filename="data.jsonl"}
{"prompt": "...", "generation": "..."}
```
26
docs/dataset-formats/pretraining.qmd
Normal file
@@ -0,0 +1,26 @@
---
title: Pre-training
description: Data format for a pre-training completion task.
order: 3
---

For pretraining, there is no prompt template or roles. The only required field is `text`:

```{.json filename="data.jsonl"}
{"text": "first row"}
{"text": "second row"}
...
```
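For instance, raw documents can be turned into this layout with a few lines of standard-library Python. This is only a sketch: `to_pretraining_jsonl` is a hypothetical helper, and skipping empty documents is an assumption of the example rather than an Axolotl requirement.

```python
import json

def to_pretraining_jsonl(documents):
    """Serialize documents as JSONL with a single `text` field per line."""
    lines = []
    for doc in documents:
        text = doc.strip()
        if text:  # assumption: empty documents are dropped
            lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)

print(to_pretraining_jsonl(["first row", "   ", "second row\n"]), end="")
```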

:::{.callout-note}

### Streaming is recommended for large datasets

Axolotl usually loads the entire dataset into memory, which can be impractical for large datasets. Use the following config to enable streaming:

```{.yaml filename="config.yaml"}
pretraining_dataset: # hf path only
...
```

:::
7
docs/dataset-formats/template_free.qmd
Normal file
@@ -0,0 +1,7 @@
---
title: Template-Free
description: Construct prompts without a template.
order: 4
---

See [these docs](../input_output.qmd).
12
docs/dataset-formats/tokenized.qmd
Normal file
@@ -0,0 +1,12 @@
---
title: Custom Pre-Tokenized Dataset
description: How to use a custom pre-tokenized dataset.
order: 5
---

- Do not pass a `type:` in your axolotl config.
- The columns in the dataset must be exactly `input_ids`, `attention_mask`, and `labels`.

```{.yaml filename="config.yml"}
- path: ...
```
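The column requirement above can be checked before training with a small sketch like the one below. `check_tokenized_row` is a hypothetical helper, not an Axolotl function. The token ids are made up, the equal-length check is an assumption, and the `-100` label values follow the common Hugging Face convention for positions ignored by the loss, which is also assumed here.

```python
REQUIRED_COLUMNS = {"input_ids", "attention_mask", "labels"}

def check_tokenized_row(row):
    """Raise ValueError unless the row has exactly the required columns,
    each holding a sequence of the same length (length check is an assumption)."""
    columns = set(row)
    if columns != REQUIRED_COLUMNS:
        missing = sorted(REQUIRED_COLUMNS - columns)
        extra = sorted(columns - REQUIRED_COLUMNS)
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    lengths = {name: len(row[name]) for name in sorted(REQUIRED_COLUMNS)}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")

# Made-up token ids; -100 marks label positions assumed to be ignored by the loss.
example_row = {
    "input_ids": [1, 15043, 3186, 2],
    "attention_mask": [1, 1, 1, 1],
    "labels": [-100, -100, 3186, 2],
}
check_tokenized_row(example_row)  # passes silently
```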