Feat: update doc (#1475) [skip ci]

* feat: update doc contents
* chore: move batch vs ga docs
* feat: update lambdalabs instructions
* fix: refactor dev instructions
2024-04-04 13:43:40 +09:00
parent 5760099bd4
commit c2b64e4dcf
6 changed files with 116 additions and 113 deletions
--- a/docs/dataset-formats/conversation.qmd
+++ b/docs/dataset-formats/conversation.qmd
@@ -1,12 +1,10 @@
 ---
 title: Conversation
 description: Conversation format for supervised fine-tuning.
-order: 1
+order: 3
 ---

-## Formats
-
-### sharegpt
+## sharegpt

 conversations where `from` is `human`/`gpt`. (optional: first row with role `system` to override default system prompt)

@@ -14,15 +12,33 @@ conversations where `from` is `human`/`gpt`. (optional: first row with role `sys
 {"conversations": [{"from": "...", "value": "..."}]}
 ```

-Note: `type: sharegpt` opens a special config `conversation:` that enables conversions to many Conversation types. See [the docs](../docs/config.qmd) for all config options.
+Note: `type: sharegpt` opens special configs:
+- `conversation`: enables conversions to many Conversation types. Refer to the 'name' [here](https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py) for options.
+- `roles`: allows you to specify the roles for input and output. This is useful for datasets with custom roles, such as `tool`, so that masking can be applied to them.
+- `field_human`: specify the key to use instead of `human` in the conversation.
+- `field_model`: specify the key to use instead of `gpt` in the conversation.

-### pygmalion
+```yaml
+datasets:
+  - path: ...
+    type: sharegpt
+
+    conversation: # Options (see Conversation 'name'): https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
+    field_human: # Optional[str]. Human key to use for conversation.
+    field_model: # Optional[str]. Assistant key to use for conversation.
+    # Add additional keys from your dataset as input or output roles
+    roles:
+      input: # Optional[List[str]]. These will be masked based on train_on_input
+      output: # Optional[List[str]].
+```
+
+## pygmalion

 ```{.json filename="data.jsonl"}
 {"conversations": [{"role": "...", "value": "..."}]}
 ```

-### sharegpt.load_role
+## sharegpt.load_role

 conversations where `role` is used instead of `from`

@@ -30,7 +46,7 @@ conversations where `role` is used instead of `from`
 {"conversations": [{"role": "...", "value": "..."}]}
 ```

-### sharegpt.load_guanaco
+## sharegpt.load_guanaco

 conversations where `from` is `prompter` `assistant` instead of default sharegpt

@@ -38,34 +54,10 @@ conversations where `from` is `prompter` `assistant` instead of default sharegpt
 {"conversations": [{"from": "...", "value": "..."}]}
 ```

-### sharegpt_jokes
+## sharegpt_jokes

 creates a chat where bot is asked to tell a joke, then explain why the joke is funny

 ```{.json filename="data.jsonl"}
 {"conversations": [{"title": "...", "text": "...", "explanation": "..."}]}
 ```
-
-## How to add custom prompts for instruction-tuning
-
-For a dataset that is preprocessed for instruction purposes:
-
-```{.json filename="data.jsonl"}
-{"input": "...", "output": "..."}
-```
-
-You can use this example in your YAML config:
-
-```{.yaml filename="config.yaml"}
-datasets:
-  - path: repo
-    type:
-      system_prompt: ""
-      field_system: system
-      field_instruction: input
-      field_output: output
-      format: "[INST] {instruction} [/INST]"
-      no_input_format: "[INST] {instruction} [/INST]"
-```
-
-See full config options under [here](../docs/config.qmd).
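
Reviewer note on the `sharegpt` section above: the record shape it documents can be checked with a few lines of Python before training. This is a minimal sketch, not the project's loader. The function name `is_sharegpt_record` and its `human_key` / `model_key` parameters are hypothetical, and they only loosely mirror what `field_human` / `field_model` are described as doing.

```python
import json


def is_sharegpt_record(line, human_key="human", model_key="gpt"):
    """Return True if one JSONL line matches the documented sharegpt shape.

    Hypothetical helper for illustration only. It checks:
      - the line is a JSON object with a non-empty "conversations" list
      - every turn has a string "value"
      - "from" is human_key or model_key, except an optional first
        turn whose "from" is "system"
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False

    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False

    allowed = {human_key, model_key}
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            return False
        if not isinstance(turn.get("value"), str):
            return False
        sender = turn.get("from")
        if sender == "system" and index == 0:
            continue  # optional system row, allowed only as the first turn
        if sender not in allowed:
            return False
    return True
```

Passing `human_key="prompter", model_key="assistant"` makes the same check accept guanaco-style role names; that reuse is an assumption of this sketch rather than documented behavior.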
--- a/docs/dataset-formats/inst_tune.qmd
+++ b/docs/dataset-formats/inst_tune.qmd
@@ -163,3 +163,27 @@ instruction, adds additional eos tokens
 ```{.json filename="data.jsonl"}
 {"prompt": "...", "generation": "..."}
 ```
+
+## How to add custom prompt format
+
+For a dataset that is preprocessed for instruction purposes:
+
+```{.json filename="data.jsonl"}
+{"input": "...", "output": "..."}
+```
+
+You can use this example in your YAML config:
+
+```{.yaml filename="config.yaml"}
+datasets:
+  - path: repo
+    type:
+      system_prompt: ""
+      field_system: system
+      field_instruction: input
+      field_output: output
+      format: "[INST] {instruction} [/INST]"
+      no_input_format: "[INST] {instruction} [/INST]"
+```
+
+See the full config options [here](../config.qmd).
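
Reviewer note on the custom prompt format added above: the sketch below shows how the `field_*` keys and the `format` template could map one JSONL row to a prompt and a completion. It is an illustration under stated assumptions, not the library's implementation. `prompt_config` is a plain dict copied from the YAML example, `render_example` is a hypothetical helper, and system-prompt handling is left out because the diff does not say how it is applied.

```python
import json

# Plain-dict copy of the `type:` mapping from the YAML example (illustrative).
prompt_config = {
    "system_prompt": "",
    "field_system": "system",
    "field_instruction": "input",
    "field_output": "output",
    "format": "[INST] {instruction} [/INST]",
    "no_input_format": "[INST] {instruction} [/INST]",
}


def render_example(row, cfg):
    """Return (prompt, completion) for one dataset row.

    Hypothetical helper: looks up the instruction and output using the
    configured field names, then fills the `no_input_format` template,
    since the example row carries no separate input beyond the instruction.
    """
    instruction = row[cfg["field_instruction"]]
    completion = row[cfg["field_output"]]
    prompt = cfg["no_input_format"].format(instruction=instruction)
    return prompt, completion


# The sample row is made up for this sketch.
row = json.loads('{"input": "Name a prime number.", "output": "7"}')
prompt, completion = render_example(row, prompt_config)
print(prompt)
print(completion)
```

Note that `field_instruction: input` means the instruction text is read from the row's `input` key, which is easy to misread given the key's name.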
--- a/docs/dataset-formats/pretraining.qmd
+++ b/docs/dataset-formats/pretraining.qmd
@@ -1,7 +1,7 @@
 ---
 title: Pre-training
 description: Data format for a pre-training completion task.
-order: 3
+order: 1
 ---

 For pretraining, there is no prompt template or roles.  The only required field is `text`: