diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 350b04cca..df12b3c89 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -29,7 +29,7 @@ jobs:
             cuda_version: 12.4.1
             python_version: "3.11"
             pytorch: 2.6.0
-            axolotl_extras:
+            axolotl_extras: vllm
             is_latest: true
     runs-on: axolotl-gpu-runner
     steps:
diff --git a/.nojekyll b/.nojekyll
index 41b744d64..e85d045c6 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-39794ace
\ No newline at end of file
+9eefd995
\ No newline at end of file
diff --git a/docs/api/core.training_args.html b/docs/api/core.training_args.html
index ed1e84efb..8941c3d9e 100644
--- a/docs/api/core.training_args.html
+++ b/docs/api/core.training_args.html
@@ -558,10 +558,11 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-simpo_gamma=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+simpo_gamma=None,
+)
 
 CPO config for CPO training
 
@@ -612,9 +613,10 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 KTO config for KTO training
 
@@ -665,9 +667,10 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 ORPO config for ORPO training
 
@@ -718,9 +721,10 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 PRM config for PRM training
 
@@ -771,9 +775,10 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 Reward config for Reward training
 
@@ -824,9 +829,10 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 Training arguments for Causal trainer
 
 This code is duplicated due to HF TrainingArguments not setting output_dir with a
 default value so it can’t be used as a mixin.
 
@@ -879,9 +885,10 @@ default value so it can’t be used as a mixin.
 
 kd_zscore_base_temp=None,
 kd_top_k_before_softmax=None,
 sequence_parallel_degree=1,
-image_size=None,
-image_resize_algorithm=None,
-)
+ring_attn_func=None,
+image_size=None,
+image_resize_algorithm=None,
+)
 
 Mixin class for the Axolotl training args.
 
diff --git a/docs/api/utils.collators.batching.html b/docs/api/utils.collators.batching.html
index 6bde01cf0..8bca65e4b 100644
--- a/docs/api/utils.collators.batching.html
+++ b/docs/api/utils.collators.batching.html
@@ -509,7 +509,8 @@ includes logic for handling sequence parallelism collation.
 
 position_pad_token_id=0,
 return_tensors='pt',
 sequence_parallel_degree=1,
-)
+ring_attn_func=None,
+)
 
 Collator for multipack specific to the using the BatchSampler
 
@@ -525,7 +526,8 @@ includes logic for handling sequence parallelism collation.
 
 position_pad_token_id=0,
 return_tensors='pt',
 sequence_parallel_degree=1,
-)
+ring_attn_func=None,
+)
 
 Data collator that will dynamically pad the inputs received, as well as the labels and position_ids
 
 Parameters
 
@@ -690,7 +692,8 @@ includes logic for handling sequence parallelism collation.
 
 position_pad_token_id=0,
 return_tensors='pt',
 sequence_parallel_degree=1,
-)
+ring_attn_func=None,
+)
 
 Collator for multipack specific to the using the BatchSampler
 
diff --git a/docs/config.html b/docs/config.html
index 858f556db..669dad140 100644
--- a/docs/config.html
+++ b/docs/config.html
@@ -1159,21 +1159,24 @@ pre > code.sourceCode > span > a:first-child::before { text-decoration: underlin
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
 # Must evenly divide the number of KV heads in your model.
 heads_k_stride: 1
-
-# Path to torch distx for optim 'adamw_anyprecision'
-torchdistx_path:
+# One of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to "varlen_llama3"
+# in the sample packing case, and "batch_ring" in the non-sample packing case.
+ring_attn_func:
 
-# Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
-pretraining_dataset:
+# Path to torch distx for optim 'adamw_anyprecision'
+torchdistx_path:
 
-# Debug mode
-debug:
+# Set to HF dataset for type: 'completion' for streaming instead of pre-tokenize
+pretraining_dataset:
 
-# Seed
-seed:
+# Debug mode
+debug:
 
-# Allow overwrite yml config using from cli
-strict:
+# Seed
+seed:
+
+# Allow overwrite yml config using from cli
+strict:
diff --git a/docs/sequence_parallelism.html b/docs/sequence_parallelism.html
index bf7bfd64c..b815e0698 100644
--- a/docs/sequence_parallelism.html
+++ b/docs/sequence_parallelism.html
@@ -507,7 +507,10 @@ through a ring communication pattern.
 
 # Set to a divisor (> 1) of the number of GPUs available
 sequence_parallel_degree: 4  # Split sequences across 4 GPUs
 # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
-heads_k_stride: 1
+heads_k_stride: 1
+# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
+# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
+ring_attn_func:

 The sequence_parallel_degree should be a divisor of the total number of GPUs. For example:
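Putting the documented pieces together, the new `ring_attn_func` option slots into an existing sequence-parallelism config block. A minimal sketch follows; the key names and allowed values come from this diff, but the 4-GPU degree, the `sample_packing` setting, and the `batch_zigzag` choice are purely illustrative:

```yaml
# Sequence parallelism across 4 GPUs (must evenly divide the GPU count).
sequence_parallel_degree: 4
# Strides across the key dimension; must evenly divide the model's KV heads.
heads_k_stride: 1

# New option from this diff. One of "varlen_llama3", "batch_ring",
# "batch_zigzag", "batch_stripe". When unset, it defaults to "varlen_llama3"
# with sample packing enabled and "batch_ring" otherwise.
sample_packing: false
ring_attn_func: batch_zigzag
```

Leaving `ring_attn_func` unset keeps the documented defaults, so existing configs continue to work unchanged.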