* ignore generation/endgeneration tags
Axolotl handles calculating the mask for assistant turns on its own, and as such these tags are not needed, however currently the analyzer does not recognize them at all and throws an error.
* feat: add phi4 tokenizer test and unblock gemma2
* fix: improve template
* chore: refactor
* chore: lint
---------
Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: Wing Lian <wing@axolotl.ai>
* remove unused field for chat_template.default
"messages" field present in final dataset causes issues with DPO
training otherwise
* lint and fix tests for new return value
* remove unused field for chat_template.default
"messages" field present in final dataset causes issues with DPO
training otherwise
lint and fix tests for new return value
fix for updated expected fields for dpo
remove unused field for chat_template.default
"messages" field present in final dataset causes issues with DPO
training otherwise
fix test still expecting "messages" field
* chore: lint
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* feat: add eos_tokens and train_on_eot for chat_template EOT parsing
* fix: comments
* chore: add some examples of tokens
* feat: add new potential errors for chat_template to faq
* feat: add examples for EOT handling
* fix: change error to warning for missing EOS
* fix: warning typo
* feat: add tests for eot token handling
* fix: remove broken caplog capture in test
* fix: chattemplate strategy with kd missing eot changes
* fix: update chat_template
* fix: handle gemma3 showing a lot of no content for turn 0
* fix: remove unknown config from examples
* fix: test
* fix: temporary disable gemma2 test
* fix: stop overwriting config.text_config unnecessarily
* fix: handling of set cache to the text_config section
* feat: add liger gemma support and bump liger to 0.5.5
* fix: add double use_cache setting
* fix: add support for final_logit_softcap in CCE for gemma2/3
* fix: set use_cache before model load
* feat: add missing layernorm override
* fix: handle gemma3 rmsnorm
* fix: use wrapper to pass dim as hidden_size
* fix: change dim to positional
* fix: patch with wrong mlp
* chore: refactor use_cache handling
* fix import issues
* fix tests.e2e.utils import
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* hf offline decorator for tests to workaround rate limits
* fail quicker so we can see logs
* try new cache name
* limit files downloaded
* phi mini predownload
* offline decorator for phi tokenizer
* handle meta llama 8b offline too
* make sure to return fixtures if they are wrapped too
* more fixes
* more things offline
* more offline things
* fix the env var
* fix the model name
* handle gemma also
* force reload of modules to recheck offline status
* prefetch mistral too
* use reset_sessions so hub picks up offline mode
* more fixes
* rename so it doesn't seem like a context manager
* fix backoff
* switch out tinyshakespeare dataset since it runs a py script to fetch data and doesn't work offline
* include additional dataset
* more fixes
* more fixes
* replace tiny shakespeaere dataset
* skip some tests for now
* use more robust check using snapshot download to determine if a dataset name is on the hub
* typo for skip reason
* use local_files_only
* more fixtures
* remove local only
* use tiny shakespeare as pretrain dataset and streaming can't be offline even if precached
* make sure fixtures aren't offline
improve the offline reset
try bumping version of datasets
reorder reloading and setting
prime a new cache
run the tests now with fresh cache
try with a static cache
* now run all the ci again with hopefully a correct cache
* skip wonky tests for now
* skip wonky tests for now
* handle offline mode for model card creation
* feat: add config for optional parameters in a chat message
* chore: cleanup
* chore: fix nits and add light docs
* docs: update docs/dataset-formats/conversation.qmd
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* feat: configurable message mappings, jinja template analyzer
* chore: handle bradley terry
* docs: update docs
* refactor: change order of mappings, improve message transform
* refactor: make chat awware of property mappings
* chore: remove .python-version
* chore: revert change
* chore: add dataset validation to tests where appropriate
* chore: add dataset validation to tests where appropriate
* chore: clean up handling of ds_cfg
* chore: recursively serialize config
* make sure to use the return value from validate_config
* DefaultDict pickle/unpickle fix
* fix super call for override
* refactor: message fields
* chore: empty commit
* tests: validate config before using
* chore: add config validation to all e2e tests
* chore: add unneeded logging
* chore: add missed config validation
* chore: pass field_messages to prompter
* test: fix borked test
* chore: remove uninteded file
* chore: add deprecation warning and update chat_datasets script
* chore: lint
* refactor: message fields
* feat: update axolotlinputconfig and test_models
- add configdict import in axolotl/utils/config/models/input/v0_4_1/__init__.py
- remove unnecessary line breaks in sftdataset, dpodataset, ktodataset, stepwisesuperviseddataset classes
- update model_dump method in axolotlinputconfig to exclude none values
- correct typo in test_models.py comment
* feat: simplify dpodataset and ktodataset classes in config models
removed several optional fields from dpodataset and ktodataset classes in axolotl/utils/config/models/input/v0_4_1. this simplifies the configuration subsets for these datasets.
* feat: improve readability and structure in dataset configuration models
this commit enhances the readability and structure of the dataset configuration models in the `axolotl/utils/config/models/input/v0_4_1` module. it removes unused `configdict` import and adds line breaks to separate class definitions for better clarity. additionally, a minor documentation fix is included to ensure a newline at the end of the `stepwise_supervised.qmd` file.
* feat: change log level from info to debug in chattemplatestrategy
* feat(prompt_strategies): refactor chattemplateprompter and chattemplatestrategy
- Make `chat_template` a required parameter in `ChatTemplatePrompter` constructor
- Add default value for `message_property_mappings` in `ChatTemplatePrompter` constructor
- Add `messages_array_name` property to `ChatTemplatePrompter`
- Change `processor` type to Optional in `ChatTemplatePrompter`
- Add TypeError check for `processor` in `ChatTemplatePrompter.build_prompt`
- Remove `_messages` property from `ChatTemplateStrategy`
- Make `prompter` a required parameter and add type hint in `ChatTemplateStrategy` constructor
- Remove `messages` getter and setter from `ChatTemplateStrategy`
- Use `prompter.messages_array_name` in `ChatTemplateStrategy.get_conversation_thread`
- Remove condition to set `messages` field in `load` function
* feat(tests/utils): ignore type check in load_model call in test_models.py
* feat: improve type handling and test structure in chat templates
- Add return type hint for `get_chat_template` function in `chat_templates.py`
- Remove unnecessary assignment of `strategy.messages` in several test cases
- Add `messages_array_name` parameter to various test configurations in `test_chat_templates.py` and `test_chat_templates_advanced.py`
- Remove redundant `strategy.messages` assignment in `test_chat_templates_advanced.py`
* feat(axolotl): enhance chat strategy with datasetconfig support
This commit introduces support for DatasetConfig in the ChatTemplateStrategy. It also refines the strategy loader to handle different types of ds_cfg inputs and improves the clarity of the code by formatting and reordering. The key changes include:
- Importing Union from typing and BaseModel from pydantic.
- Adding DatasetConfig as an optional type for ds_cfg in StrategyLoader.
- Adjusting the handling of ds_cfg in StrategyLoader to account for BaseModel instances.
- Refactoring the prompter_params and strategy_params for better readability.
- Changing the reference from prompt[self.messages] to prompt[self.prompter.messages_array_name] in the is_prompt_batched method.
* feat: update message handling in btchattemplatestrategy
* Replace `self.messages` with direct string references to "chosen_messages" and "rejected_messages"
* Append system, user, and assistant content directly to "chosen_messages" and "rejected_messages"
* Add a new attribute "messages_array_name" to the `load` function parameters
* Remove the conditional attribute assignment for "field_messages" in the `load` function
* feat: add config validation in test_kd.py
- Import `validate_config` from `axolotl.utils.config`
- Validate the configuration in `test_llama_kd` and another function in `TestKnowledgeDistillation` class
* feat: enhance config validation and capabilities handling
* Import `EnvCapabilities` and `GPUCapabilities` from `axolotl.utils.config.models.internals`
* Update `validate_config` function to create `KTODataset` and `SFTDataset` instances using `dict(ds_cfg)`
* Replace `capabilities` and `env_capabilities` with instances of `GPUCapabilities` and `EnvCapabilities` respectively in `AxolotlConfigWCapabilities` model dump
* feat: update config validation in axolotl utils
- Remove import of `EnvCapabilities` and `GPUCapabilities` from `axolotl.utils.config.models.internals`
- Update `validate_config` function to use `capabilities` and `env_capabilities` directly instead of creating new instances of `GPUCapabilities` and `EnvCapabilities`
* feat: refactor strategyloader in chat_template.py
- Extracted the creation of strategy parameters into a separate function, `_get_strategy_params(cfg, dataset_config)`
- Created a new function, `_get_strategy_cls()`, to obtain the strategy class
- Replaced `ChatTemplateStrategy` with `strategy_cls` for strategy instantiation
* trigger CI
* chore: revert dataset config changes for kto/dpo
* subject: refactor: rename 'messages_array_name' to 'field_messages'
Body:
- Renamed 'messages_array_name' to 'field_messages' in 'ChatTemplatePrompter' class and its usages in 'chat_template.py'
- Updated 'load' function in 'bradley_terry/chat_template.py' to reflect the change
- Adjusted 'get_chat_template_msg_variables' and 'get_message_vars' methods in 'jinja_template_analyzer.py' to use the new variable name
- Modified 'StrategyLoader' in 'chat_template.py' to use 'field_messages'
- Updated tests in 'test_chat_templates.py' and 'test_chat_templates_advanced.py' to use 'field_messages' instead of 'messages_array_name'
* feat: refactor prompt strategies and update config models
* Remove redundant 'return None' in `axolotl/prompt_strategies/__init__.py`
* Simplify message handling in `axolotl/prompt_strategies/bradley_terry/chat_template.py` by using a single 'messages' list instead of separate 'chosen_messages' and 'rejected_messages' lists
* Update default 'message_property_mappings' in `axolotl/prompt_strategies/bradley_terry/chat_template.py`
* Add 'field_messages' field to `axolotl/utils/config/models/input/v0_4_1/__init__.py` configuration model
* chore: remove unused input
* chore: remove redundant type ignore
* fix: remove old configs and update examples
* fix: type check
* fix: remove loading old config in ChatMessage
* fix: update faq with potential new undefinederror
* fix: add debug if property mapped is not found
* chore: improve explanation for unmapped properties
* fix: update docs with new config
* chore: add note for deprecation config and del old config from dict
---------
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: NanoCode012 <nano@axolotl.ai>
* fix: use apply_chat_template to find turn boundaries and allow tool_calling field
* fix: keys to include in turn
* feat(doc): explicitly recommend setting train_on_eos and roles_to_train
* fix: eos not being masked for tool due to template padding
* chore: clear up docs
* fix: default messages format, train_on_eos: turn, and train on all assistant msg
* fix: properly warn if empty content
* feat: parametrize chat_template tests to test different tokenizers
* fix: set proper default for message key
* fix: update defaults to match load function
* fix: change defaults to use new
* feat: add tool_calling dataset
* feat: add tool_calling test
* fix: add handling of edge case of mistral tokenizer with only system prompt
* feat: refactor all test to follow source code
* fix: remove unnecessary eos_token from phi35
* fix test for phi3.5 since eos was dropped from chat_template
---------
Co-authored-by: Wing Lian <wing@axolotl.ai>
* Allow using tokenizer's default chat template with fallbacks
Summary of changes:
1. Adds `tokenizer_default` as option for `chat_template` in
`chat_template` prompt strategy that allows using the chat template
from tokenizer's config.json
2. Allows falling back to chat templates available in axolotl if
tokenizer does not have a chat template
3. Adds a mistral chat template which supports system message - taken
from https://github.com/chujiezheng/chat_templates/blob/main/chat_templates/mistral-instruct.jinja
---
Why?
Many popular models are not trained with chatml format. As a result for
the model to correctly learn chatml we have to turn on train_on_inputs
which requires more compute and time. If we can use the model's already
learned chat template we can just learn the output tokens
---
Todo:
- Write tests
* Add tests
* Fix lint and bug post merge from main
* Add option `chat_template_jinja` to provide a jinja template
* remove custom mistral template
* Address review comments and add docs
* Update docs/dataset-formats/conversation.qmd
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
* fix: set default to tokenizer template
* Merge branch 'main' into cj_tokenizer_default_prompt_template
* chore: remove redundant function
* fix: re-arrange enum declaration position
* fix: refactor artifact left from main merge
* feat(doc): updated config with chat template options and clarified examples
* chore: clarify doc
* chore: added example for non-default template
* chore: refactor
* fix: test
* fix: config being dropped and unittest to catch that
* chore: lint
* chore: skip duplicate
* fix: rename var after merge
* feat: add test for levy's dpo case
* fix: remove default setting on edge case where chat template overriden in dataset section
* feat: handle sharegpt deprecation better in docs
* feat: add example using fallback
* feat: handles chat_template requiring specific user/assistant order
* fix: update test based on new defaults
* fix: imported name incorrectly updated on merge
* chore: lint
* fix: update dummy message to prevent potential overlap with real content
* fix(doc): formatting
* fix: update bradleyterry to use new chat_template
---------
Co-authored-by: Chirag Jain <jain.chirag925@gmail.com>
* wip add new proposed message structure
* tokenization
* wip
* wip transform builder
* wip make the chat dataset loadable
* wip chatml + llama 3 new chat objects
* chore: lint
* chore: lint
* fix tokenization
* remove dacite dependency since we're using pydantic now
* fix handling when already correctly split in messages
* make sure to remove chat features from tokenized ds
* move chat to be a input transform for messages
* make sure llama3 has the bos token
* remove non-working special token code
* fix messages strat loader
* Add flexible configuration options for chat dataset training
- Introduce roles_to_train parameter to set training labels by role
- Add train_on_eos option to configure training on end-of-sequence tokens
- Implement per-message training configuration in dataset
- Allow fine-grained control over training specific portions of messages
- Add message_field_training and message_field_training_detail settings
- Implement mapping between dataset character offsets and tokenized prompt
- Enhance test suite to cover new functionality
* Fix missing field inits, things weren't working from yaml.
* Add flexible configuration options for chat dataset training
- Introduce roles_to_train parameter to set training labels by role
- Add train_on_eos option to configure training on end-of-sequence tokens
- Implement per-message training configuration in dataset
- Allow fine-grained control over training specific portions of messages
- Add message_field_training and message_field_training_detail settings
- Implement mapping between dataset character offsets and tokenized prompt
- Enhance test suite to cover new functionality
* Fix missing field inits, things weren't working from yaml.
* chore: lint
* Revert test repo back to NousResearch after opening PR to fix the tokenizer_config.json.
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Implementing a basic chat_template strategy for DPO datasets
This mimics the sft chat_template strategy such that users can:
* Specify the messages field
* Specify the per message role and content fields
* speicfy the chosen and rejected fields
* Let the tokenizer construct the raw prompt
* Ensure the chosen and rejected fields don't have any prefix tokens
* Adding additional dpo chat template unittests
* Rename test class
Allow in message objects the additional key `weight`, which can be set
to 0 (or 1) to cause that message to be masked out (or left unmasked)
for training (similar to [1]). This is helpful for training the model to be robust and
capable of error recovery upon a bad assistant message.
A missing `weight` key defaults to weight 1, to guarantee downward compatibility.
[1]: https://github.com/mistralai/mistral-finetune
The strategy now supports configuring several fields: * The data field holding message arrays * the role and
content fields for each message * role mapping from source to target types
additionally this adds a sample llama3-8b instruct template using the chat template
* Fix llama3 chat_template (the {{eos_token}} leads to an extra <|eot_id|> being added in the last turn). Output now matches official Llama 3 Instruct model
* add tests
* chore: lint
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Add Glaive conversation format support
* fix black formatting errors
* Fix black and pylint formatting errors
* only set role_key_tool if provided in the dataset constructor
* Update src/axolotl/prompt_strategies/sharegpt.py
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* sharegpt test
* tokenizer test
* fix formatting
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* plain input/output prompt strategy w/o chat templates
* disable duplicate code check
* make sure to add an eos/eot token to the end of the output so it will stop
* multi turn segement support and test
* add system message to template
* readme update
* added code to register new system message
* register chatml template for test
---------
Co-authored-by: Mads Henrichsen <mads@BrbartiendeMads.lan>
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* fix: `train_on_inputs: true` ignored for sharegpt
* enable unit test for train_on_inputs for sharegpt
---------
Co-authored-by: Wing Lian <wing.lian@gmail.com>
* fix double eos token for chatml
* isolate fix to chatml conversation
* fix add special tokens to include rstrip
* add test for train_on_inputs for sharegpt
* don't use rstrip for chatml