support user defined prompters, pretokenized datasets in config, local parquet, local arrow files (#348)

* support user defined prompters, pretokenized datasets in config, local parquet, local arrow files

* fix user defined dataset types

* fix for system prompts

* fix tests

* fix checks for parquet and arrow

* aha moment that d.data_files isn't used

* add documentation for ds_type to add support for parquet and arrow
This commit is contained in:
Wing Lian
2023-08-20 09:17:49 -04:00
committed by GitHub
parent d21318dfb9
commit d2e7f27240
6 changed files with 146 additions and 14 deletions

@@ -392,6 +392,7 @@ datasets:
  - path: vicgalle/alpaca-gpt4
  # The type of prompt to use for training. [alpaca, sharegpt, gpteacher, oasst, reflection]
    type: alpaca # format | format:<prompt_style> (chat/instruct) | <prompt_strategies>.load_<load_fn>
    ds_type: # Optional[str] (json|arrow|parquet) defines the datatype when path is a file
    data_files: # path to source data files
    shards: # number of shards to split data into
    name: # name of dataset configuration to load
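
As a rough illustration of the new `ds_type` key, a dataset entry that points at a local Parquet file might look like the sketch below. The file path is hypothetical and stands in for whatever local file you have. The example assumes, based on the commit notes above, that `path` itself is the file and that `data_files` can be left unset in this case.

```yaml
datasets:
    # Hypothetical local file; replace with your own path.
  - path: ./data/train.parquet
    # Prompt format, as documented above.
    type: alpaca
    # One of json | arrow | parquet; tells the loader how to read `path`.
    ds_type: parquet
```

For an Arrow file the same shape should apply with `ds_type: arrow`. Whether `ds_type` can be omitted and inferred from the file extension is not stated in the documented line, so setting it explicitly is the safer choice.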