From f2480a1d9199b213066b8fe4e512b2f260e86c6a Mon Sep 17 00:00:00 2001
From: Josh Bleecher Snyder
Date: Wed, 26 Jun 2024 13:13:21 -0700
Subject: [PATCH] improve Pre-Tokenized Dataset docs (#1684) [skip ci]

Fixes #1661
---
 docs/dataset-formats/tokenized.qmd | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/docs/dataset-formats/tokenized.qmd b/docs/dataset-formats/tokenized.qmd
index 8991a2110..b2ea003c0 100644
--- a/docs/dataset-formats/tokenized.qmd
+++ b/docs/dataset-formats/tokenized.qmd
@@ -4,9 +4,25 @@ description: How to use a custom pre-tokenized dataset.
 order: 5
 ---
 
-- Do not pass a `type:` in your axolotl config.
+- Pass an empty `type:` in your axolotl config.
 - Columns in Dataset must be exactly `input_ids`, `attention_mask`, `labels`
+- To indicate that a token should be ignored during training, set its corresponding label to `-100`.
+- Do not add BOS/EOS. Axolotl will add them for you based on the default tokenizer for the model you're using.
+- For pretraining, do not truncate/pad documents to the context window length.
+- For instruction training, documents must be truncated/padded as desired.
+
+Sample config:
 
 ```{.yaml filename="config.yml"}
-- path: ...
+datasets:
+  - path: /path/to/your/file.jsonl
+    ds_type: json
+    type:
+```
+
+Sample jsonl:
+
+```jsonl
+{"input_ids":[271,299,99],"attention_mask":[1,1,1],"labels":[271,-100,99]}
+{"input_ids":[87,227,8383,12],"attention_mask":[1,1,1,1],"labels":[87,227,8383,12]}
 ```
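
For readers reproducing the sample jsonl that the patched docs describe, here is a minimal Python sketch (not part of the patch itself; the `make_record` helper and the output file name are hypothetical) that writes rows with exactly the `input_ids`, `attention_mask`, `labels` columns, using `-100` as the ignore-label value:

```python
import json

def make_record(input_ids, ignore_indices=()):
    """Build one pre-tokenized dataset row. Labels at ignore_indices are
    set to -100 so those tokens are excluded from the training loss."""
    labels = [-100 if i in ignore_indices else tok
              for i, tok in enumerate(input_ids)]
    return {
        "input_ids": list(input_ids),
        "attention_mask": [1] * len(input_ids),  # no padding in this sketch
        "labels": labels,
    }

# Token IDs below are made up, matching the sample rows in the docs.
rows = [
    make_record([271, 299, 99], ignore_indices={1}),
    make_record([87, 227, 8383, 12]),
]

with open("file.jsonl", "w") as f:
    for row in rows:
        # compact separators reproduce the docs' jsonl formatting
        f.write(json.dumps(row, separators=(",", ":")) + "\n")
```

This is only one way to produce such a file; any process that emits one JSON object per line with those three keys (and aligned list lengths) should work.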