wip add new proposed message structure (#1904)

* wip add new proposed message structure

* tokenization

* wip

* wip transform builder

* wip make the chat dataset loadable

* wip chatml + llama 3 new chat objects

* chore: lint

* chore: lint

* fix tokenization

* remove dacite dependency since we're using pydantic now

* fix handling when already correctly split in messages

* make sure to remove chat features from tokenized ds

* move chat to be an input transform for messages

* make sure llama3 has the bos token

* remove non-working special token code

* fix messages strat loader
Wing Lian
2024-10-13 12:15:18 -04:00
committed by GitHub
parent 1834cdc364
commit cd2d89f467
23 changed files with 1285 additions and 15 deletions
