* Add example YAML file for training Mistral using DPO
* added deduplication code
* Add exact deduplication feature and update examples
* Improve deduplication for train/eval overlap

  Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

  The deduplication now handles cases where train and eval datasets have overlapping elements.
* Apply suggestions from code review

  To handle the original case where we do not do deduplication

  Co-authored-by: Wing Lian <wing.lian@gmail.com>
* Improve false collision detection to ensure dataset integrity
  - Added test cases to simulate and verify handling of forced hash collisions between datasets.
  - Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
  - Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
* Moved the constants file to the tests folder
  - Relocated `constants.py` to the `tests` folder to improve modularity and maintain a clear separation between source and test files.
  - Renamed `cicd/tests.py` to `cicd/cicd_tests.py` to resolve a conflict with `tests/__init__.py`, which caused Mypy to fail due to duplicate module names.
  - Updated all references to `cicd.tests` in the codebase to `cicd.cicd_tests` to reflect the renaming and ensure compatibility.
  - These changes ensure Mypy passes the pre-commit hook and maintain alignment with the project's structure.
* revert some changes from previous commit and fix relative import

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Wing Lian <wing@axolotl.ai>
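The commit notes above describe exact deduplication that also removes train/eval overlap and that must not drop rows whose hashes collide while their content differs. The following is a minimal sketch of that idea, not the Axolotl implementation: `row_hash`, `deduplicate`, and the `seen` mapping are hypothetical names introduced here for illustration, and plain lists of dicts stand in for real datasets.

```python
import hashlib
import json


def row_hash(row):
    """Hash a row's canonical JSON form (hypothetical helper)."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(rows, seen=None, hash_fn=row_hash):
    """Return (kept_rows, seen) with exact duplicates removed, order preserved.

    `seen` maps a hash to the rows already kept under that hash. Passing the
    mapping returned for the train split into the eval call also removes
    eval rows that already appear in train.
    """
    seen = {} if seen is None else seen
    kept = []
    for row in rows:
        bucket = seen.setdefault(hash_fn(row), [])
        # Compare content as well as hash, so a hash collision between
        # different rows never causes a row to be dropped.
        if any(row == other for other in bucket):
            continue
        bucket.append(row)
        kept.append(row)
    return kept, seen


train = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
evals = [{"text": "b"}, {"text": "c"}]

train_dedup, seen = deduplicate(train)
eval_dedup, _ = deduplicate(evals, seen=seen)

# Force every row onto the same hash to simulate collisions.
collided, _ = deduplicate(train, hash_fn=lambda row: "same")
```

Keeping the full rows in each bucket trades memory for safety; a real implementation might store only row indices and look the content up on a hash match.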
33 lines
934 B
Python
# constants.py
"""
This module contains constants and configuration dictionaries used for
datasets and other utilities in the Axolotl project, specifically for testing.
"""
# Configuration for Alpaca Messages Dataset
ALPACA_MESSAGES_CONFIG_OG = {
    "path": "fozziethebeat/alpaca_messages_2k_dpo_test",
    "type": "chat_template.default",
    "chat_template": "llama3",
    "field_messages": "conversation",
    "field_chosen": "chosen",
    "field_rejected": "rejected",
    "message_field_role": "role",
    "message_field_content": "content",
    "roles": {
        "system": ["system"],
        "user": ["user"],
        "assistant": ["assistant"],
    },
}

# Revision configuration extending the original
ALPACA_MESSAGES_CONFIG_REVISION = ALPACA_MESSAGES_CONFIG_OG.copy()
ALPACA_MESSAGES_CONFIG_REVISION["revision"] = "ea82cff"


SPECIAL_TOKENS = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
}
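One detail worth keeping in mind when deriving a config the way `ALPACA_MESSAGES_CONFIG_REVISION` is derived: `dict.copy()` is a shallow copy. Adding a top-level key such as `"revision"` leaves the original untouched, but nested values like `"roles"` remain shared between the two dicts. The sketch below illustrates this with a stand-in dict; `base` and its `"path"` value are hypothetical and only mirror the shape of the constants above.

```python
import copy

# Stand-in config with the same shape as the constants above (hypothetical values).
base = {"path": "example/dataset", "roles": {"user": ["user"]}}

# Shallow copy: adding a top-level key does not affect `base`.
shallow = base.copy()
shallow["revision"] = "ea82cff"

# The nested dict is the same object in both, so in-place edits would leak.
shares_roles = shallow["roles"] is base["roles"]

# Deep copy: nested structures are duplicated, so edits stay local.
deep = copy.deepcopy(base)
deep["roles"]["user"].append("human")
```

For the constants in this file the shallow copy is sufficient, since only the top-level `"revision"` key is set; `copy.deepcopy` would matter only if a test mutated `"roles"` in place.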