From 39c92de9134e7632267d6b5d8e634b2acc61bcf6 Mon Sep 17 00:00:00 2001
From: Quarto GHA Workflow Runner
Date: Thu, 31 Jul 2025 19:30:34 +0000
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll                                     |    2 +-
 docs/api/core.trainers.grpo.sampler.html      |   10 +-
 docs/api/core.trainers.relora.html            |  936 ----
 docs/api/index.html                           |   10 +-
 docs/api/monkeypatch.relora.html              |   17 -
 .../utils.ctx_managers.sequence_parallel.html |    4 +-
 docs/api/utils.schedulers.html                |   77 +-
 docs/api/utils.schemas.training.html          |   12 +-
 docs/config-reference.html                    |  901 ++--
 docs/sequence_parallelism.html                |   12 +-
 index.html                                    |    1 -
 search.json                                   | 4468 ++++++++---------
 sitemap.xml                                   | 1256 +++--
 13 files changed, 3378 insertions(+), 4328 deletions(-)
 delete mode 100644 docs/api/core.trainers.relora.html

diff --git a/.nojekyll b/.nojekyll
index 782093c68..5e356bf8e 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-b45efa43
\ No newline at end of file
+5f2282be
\ No newline at end of file
diff --git a/docs/api/core.trainers.grpo.sampler.html b/docs/api/core.trainers.grpo.sampler.html
index 25510c734..34fc1dee2 100644
--- a/docs/api/core.trainers.grpo.sampler.html
+++ b/docs/api/core.trainers.grpo.sampler.html
@@ -530,7 +530,7 @@ sequence parallel group.

rank, batch_size=1, repeat_count=1,
-sequence_parallel_degree=1,
+context_parallel_size=1,
shuffle=True, seed=0, drop_last=False,
)
@@ -542,7 +542,7 @@ sequence parallel group.

- Entire batches are repeated for reuse in multiple updates.
- Data is properly distributed across SP groups.

 In the table below, the values represent dataset indices. Each SP group has
-sequence_parallel_degree = 2 GPUs working together on the same data. There are 2
+context_parallel_size = 2 GPUs working together on the same data. There are 2
 SP groups (SP0 and SP1), with world_size = 4 total GPUs.

                                       Sequence Parallel Groups
                                 |       SP0        |       SP1        |
@@ -561,9 +561,9 @@ num_iterations=2 ▼ 1 3 [0 0 0 1 1 1] [2 2 2 3 3 3] <- When using gradient a
 

Parameters

@@ -612,7 +612,7 @@ num_iterations=2 ▼ 1 3 [0 0 0 1 1 1] [2 2 2 3 3 3] <- When using gradient a
diff --git a/docs/api/core.trainers.relora.html b/docs/api/core.trainers.relora.html
deleted file mode 100644
index b910b9da5..000000000
--- a/docs/api/core.trainers.relora.html
+++ /dev/null
@@ -1,936 +0,0 @@
-core.trainers.relora – Axolotl
core.trainers.relora

Module for ReLoRA trainer

Classes

-sequence_parallel_degree   int   Number of ranks in a sequence parallel group.   1
+context_parallel_size      int   Number of ranks in a sequence parallel group.   1

Name            Description
ReLoRATrainer   Trainer subclass that uses the OneCycleLR scheduler

ReLoRATrainer

core.trainers.relora.ReLoRATrainer(*args, **kwargs)

Trainer subclass that uses the OneCycleLR scheduler
\ No newline at end of file
diff --git a/docs/api/index.html b/docs/api/index.html
index 7fa3d3230..860b75a1c 100644
--- a/docs/api/index.html
+++ b/docs/api/index.html
@@ -663,22 +663,18 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
 Module for mamba trainer
-core.trainers.relora
-Module for ReLoRA trainer
 core.trainers.dpo.trainer    DPO trainer for axolotl
 core.trainers.grpo.trainer   Axolotl GRPO trainers (with and without sequence parallelism handling)
 core.trainers.grpo.sampler   Repeat random sampler (similar to the one implemented in
 core.trainers.utils          Utils for Axolotl trainers
diff --git a/docs/api/monkeypatch.relora.html b/docs/api/monkeypatch.relora.html
index 6c26c1323..5edea9d02 100644
--- a/docs/api/monkeypatch.relora.html
+++ b/docs/api/monkeypatch.relora.html
@@ -487,7 +487,6 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
  • Classes
  •
@@ -517,28 +516,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
 ReLoRACallback    Callback to merge LoRA weights into the base model and save full-weight checkpoints
-ReLoRAScheduler   Wraps another scheduler to apply per-lora-restart learning rate warmups.

    ReLoRACallback

    monkeypatch.relora.ReLoRACallback(cfg)

    Callback to merge LoRA weights into the base model and save full-weight checkpoints


    ReLoRAScheduler

    -
    monkeypatch.relora.ReLoRAScheduler(
    -    optimizer,
    -    inner_schedule,
    -    relora_steps,
    -    warmup_steps,
    -    anneal_steps=1,
    -    min_lr_scale=0.001,
    -)
    -

    Wraps another scheduler to apply per-lora-restart learning rate warmups.

    diff --git a/docs/api/utils.ctx_managers.sequence_parallel.html b/docs/api/utils.ctx_managers.sequence_parallel.html index 62cc83df7..2ed4b32ea 100644 --- a/docs/api/utils.ctx_managers.sequence_parallel.html +++ b/docs/api/utils.ctx_managers.sequence_parallel.html @@ -696,7 +696,7 @@ from the full gradient tensor.

    SequenceParallelContextManager

    utils.ctx_managers.sequence_parallel.SequenceParallelContextManager(
         models,
    -    sequence_parallel_degree,
    +    context_parallel_size,
         gradient_accumulation_steps,
         ring_attn_func,
         heads_k_stride,
    @@ -731,7 +731,7 @@ across the sequence parallelism group using a post-forward hook.

 required
-sequence_parallel_degree   int   Number of processes to split sequences over.   required
+context_parallel_size      int   Number of processes to split sequences over.   required
diff --git a/docs/api/utils.schedulers.html b/docs/api/utils.schedulers.html
index f168d2964..aa47e699f 100644
--- a/docs/api/utils.schedulers.html
+++ b/docs/api/utils.schedulers.html
@@ -487,6 +487,7 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
  • Classes
  • Functions @@ -524,6 +525,10 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); A scheduler that interpolates learning rates in a logarithmic fashion +JaggedLRRestartScheduler +Wraps another scheduler to apply per-lora-restart learning rate warmups. + + RexLR Reflected Exponential (REX) learning rate scheduler. @@ -540,16 +545,28 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); )
  • A scheduler that interpolates learning rates in a logarithmic fashion

    +
    +

    JaggedLRRestartScheduler

    +
    utils.schedulers.JaggedLRRestartScheduler(
    +    optimizer,
    +    inner_schedule,
    +    jagged_restart_steps,
    +    jagged_restart_warmup_steps,
    +    jagged_restart_anneal_steps=1,
    +    min_lr_scale=0.001,
    +)
    +

    Wraps another scheduler to apply per-lora-restart learning rate warmups.

    +
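For orientation, a hypothetical Axolotl config fragment that drives this scheduler through the jagged-restart options (key names are taken from the config reference changes later in this patch; the values are illustrative only, not defaults):

relora: true                     # whether to use ReLoRA; used together with the jagged_restart_* options
jagged_restart_steps: 200        # how often to reset for jagged restarts
jagged_restart_warmup_steps: 10  # warmup steps taken after each reset
jagged_restart_anneal_steps: 10  # anneal steps taken before each reset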

    RexLR

    -
    utils.schedulers.RexLR(
    -    optimizer,
    -    max_lr,
    -    min_lr,
    -    total_steps=0,
    -    num_warmup_steps=0,
    -    last_step=0,
    -)
    +
    utils.schedulers.RexLR(
    +    optimizer,
    +    max_lr,
    +    min_lr,
    +    total_steps=0,
    +    num_warmup_steps=0,
    +    last_step=0,
    +)

    Reflected Exponential (REX) learning rate scheduler.

    • Original implementation: https://github.com/IvanVassi/REX_LR
    • @@ -641,12 +658,12 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
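As a point of reference, the config reference updated later in this patch lists rex among the accepted lr_scheduler values, so a minimal, hypothetical config fragment selecting this scheduler could look like the following (values are illustrative):

learning_rate: 2.0e-5
lr_scheduler: rex
warmup_steps: 100  # illustrative; the class also exposes num_warmup_steps when constructed directly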

      get_cosine_schedule_with_min_lr

      -
      utils.schedulers.get_cosine_schedule_with_min_lr(
      -    optimizer,
      -    num_warmup_steps,
      -    num_training_steps,
      -    min_lr_ratio=0.0,
      -)
      +
      utils.schedulers.get_cosine_schedule_with_min_lr(
      +    optimizer,
      +    num_warmup_steps,
      +    num_training_steps,
      +    min_lr_ratio=0.0,
      +)

      Create a learning rate schedule which has

        @@ -657,13 +674,13 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});

      get_cosine_schedule_with_quadratic_warmup

      -
      utils.schedulers.get_cosine_schedule_with_quadratic_warmup(
      -    optimizer,
      -    num_warmup_steps,
      -    num_training_steps,
      -    num_cycles=0.5,
      -    last_epoch=-1,
      -)
      +
      utils.schedulers.get_cosine_schedule_with_quadratic_warmup(
      +    optimizer,
      +    num_warmup_steps,
      +    num_training_steps,
      +    num_cycles=0.5,
      +    last_epoch=-1,
      +)

      Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the initial lr set in the optimizer.

      @@ -725,15 +742,15 @@ initial lr set in the optimizer.

      get_cosine_schedule_with_warmup_decay_constant

      -
      utils.schedulers.get_cosine_schedule_with_warmup_decay_constant(
      -    optimizer,
      -    num_warmup_steps,
      -    num_training_steps,
      -    constant_lr_ratio,
      -    min_lr_ratio,
      -    num_cycles=0.5,
      -    last_epoch=-1,
      -)
      +
      utils.schedulers.get_cosine_schedule_with_warmup_decay_constant(
      +    optimizer,
      +    num_warmup_steps,
      +    num_training_steps,
      +    constant_lr_ratio,
      +    min_lr_ratio,
      +    num_cycles=0.5,
      +    last_epoch=-1,
      +)

      Implementation of Continual Pre-Training of Large Language Models: How to (re)warm your model? (https://arxiv.org/pdf/2308.04014.pdf) Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to min_lr_ratio until num_training_steps * constant_lr_ratio, after constant_rate returns constant value of min_rate diff --git a/docs/api/utils.schemas.training.html b/docs/api/utils.schemas.training.html index fde49cf06..7a7f751ed 100644 --- a/docs/api/utils.schemas.training.html +++ b/docs/api/utils.schemas.training.html @@ -487,6 +487,7 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});

    • Classes
    @@ -518,6 +519,10 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); Training hyperparams configuration subset +JaggedLRConfig +JaggedLR configuration subset, can be used w/ ReLoRA training + + LrGroup Custom learning rate group configuration @@ -528,9 +533,14 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true});
    utils.schemas.training.HyperparametersConfig()

    Training hyperparams configuration subset

    +
    +

    JaggedLRConfig

    +
    utils.schemas.training.JaggedLRConfig()
    +

    JaggedLR configuration subset, can be used w/ ReLoRA training

    +

    LrGroup

    -
    utils.schemas.training.LrGroup()
    +
    utils.schemas.training.LrGroup()

    Custom learning rate group configuration
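A hypothetical lr_groups entry using the three fields of this schema, name, modules, and lr (the module names are illustrative, not prescriptive):

lr_groups:
  - name: embeddings
    modules:
      - embed_tokens
      - lm_head
    lr: 1.0e-5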

    diff --git a/docs/config-reference.html b/docs/config-reference.html index 8bf06695c..04baa8395 100644 --- a/docs/config-reference.html +++ b/docs/config-reference.html @@ -1331,452 +1331,461 @@ gtag('config', 'G-9KYCVJBNMQ', { 'anonymize_ip': true}); # no eval. val_set_size: float | None = 0.0 -# Set to a divisor of the number of GPUs available to split sequences into chunks of -# equal size. Use in long context training to prevent OOM when sequences cannot fit into -# a single GPU's VRAM. E.g., if 4 GPUs are available, set this value to 2 to split each -# sequence into two equal-sized subsequences, or set to 4 to split into four equal-sized -# subsequences. See https://docs.axolotl.ai/docs/sequence_parallelism.html for more -# details. -sequence_parallel_degree: int | None -# Optional; strides across the key dimension. Larger values use more memory but should -# make training faster. Must evenly divide the number of KV heads in your model. -heads_k_stride: int | None -# One of 'varlen_llama3', 'batch_ring', 'batch_zigzag', 'batch_stripe'. Defaults to -# 'varlen_llama3' in the sample packing case, and 'batch_ring' in the non-sample packing -# case. -ring_attn_func: RingAttnFunc | None -# Number of tensor parallel processes in TP group. Only supported with DeepSpeed AutoTP. -tensor_parallel_size: int | None - -# Add or change special tokens. If you add tokens here, you don't need to add them to -# the `tokens` list. -special_tokens: SpecialTokensConfig | None - # For SpecialTokensConfig: - bos_token: str | None - eos_token: str | None - pad_token: str | None - unk_token: str | None - additional_special_tokens: list[str] | None - -# Add extra tokens to the tokenizer -tokens: list[str] | None -# Mapping token_id to new_token_string to override reserved added_tokens in the -# tokenizer. Only works for tokens that are not part of the base vocab (aka are -# added_tokens). Can be checked if they exist in tokenizer.json added_tokens. -added_tokens_overrides: dict[int, str] | None - -# Whether to use torch.compile and which backend to use. setting to `auto` will enable -# torch compile when torch>=2.6.0 -torch_compile: Literal['auto'] | bool | None -# Backend to use for torch.compile -torch_compile_backend: str | None -torch_compile_mode: Literal['default', 'reduce-overhead', 'max-autotune'] | None - -# Maximum number of iterations to train for. It precedes num_epochs which means that if -# both are set, num_epochs will not be guaranteed. e.g., when 1 epoch is 1000 steps => -# `num_epochs: 2` and `max_steps: 100` will train for 100 steps -max_steps: int | None -# Number of warmup steps. Cannot use with warmup_ratio -warmup_steps: int | None -# Warmup ratio. Cannot use with warmup_steps -warmup_ratio: float | None -# Leave empty to eval at each epoch, integer for every N steps. float for fraction of -# total steps -eval_steps: int | float | None -# Number of times per epoch to run evals, mutually exclusive with eval_steps -evals_per_epoch: int | None -# Set to `no` to skip evaluation, `epoch` at end of each epoch, leave empty to infer -# from `eval_steps` -eval_strategy: str | None - -# Leave empty to save at each epoch, integer for every N steps. 
float for fraction of -# total steps -save_steps: int | float | None -# Number of times per epoch to save a checkpoint, mutually exclusive with save_steps -saves_per_epoch: int | None -# Set to `no` to skip checkpoint saves, `epoch` at end of each epoch, `best` when better -# result is achieved, leave empty to infer from `save_steps` -save_strategy: str | None -# Checkpoints saved at a time -save_total_limit: int | None -# Whether to checkpoint a model after the first step of training. Defaults to False. -save_first_step: bool | None - -# Logging frequency -logging_steps: int | None -# Stop training after this many evaluation losses have increased in a row. https://huggi -# ngface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppin -# gCallback -early_stopping_patience: int | None -load_best_model_at_end: bool | None = False -# Save only the model weights, skipping the optimizer. Using this means you can't resume -# from checkpoints. -save_only_model: bool | None = False -# Use tensorboard for logging -use_tensorboard: bool | None -# Enable the pytorch profiler to capture the first N steps of training to the -# output_dir. see https://pytorch.org/blog/understanding-gpu-memory-1/ for more -# information. Snapshots can be visualized @ https://pytorch.org/memory_viz -profiler_steps: int | None -# Which step to start the profiler at. Useful for only capturing a few steps mid-run. -profiler_steps_start: int | None = 0 -# bool of whether to include tokens trainer per second in the training metrics. This -# iterates over the entire dataset once, so it takes some time. -include_tokens_per_second: bool | None - -# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to -# add noise to embeddings. Currently only supported on Llama and Mistral -neftune_noise_alpha: float | None - -# Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to -# `beta` in `ORPOConfig` due to trl mapping. -orpo_alpha: float | None -# Weighting of NLL term in loss from RPO paper -rpo_alpha: float | None -# Target reward margin for the SimPO loss -simpo_gamma: float | None -# Weight of the BC regularizer -cpo_alpha: float | None - -# Factor for desirable loss term in KTO loss -kto_desirable_weight: float | None -# Factor for undesirable loss term in KTO loss -kto_undesirable_weight: float | None -# The beta parameter for the RL training -rl_beta: float | None - -# Defines the max memory usage per gpu on the system. Passed through to transformers -# when loading the model. -max_memory: dict[int | Literal['cpu', 'disk'], int | str] | None -# Limit the memory for all available GPUs to this amount (if an integer, expressed in -# gigabytes); default: unset -gpu_memory_limit: int | str | None -# Whether to use low_cpu_mem_usage -low_cpu_mem_usage: bool | None - -# The name of the chat template to use for training, following values are supported: -# tokenizer_default: Uses the chat template that is available in the -# tokenizer_config.json. If the chat template is not available in the tokenizer, it will -# raise an error. This is the default value. -# alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates -# are available in the axolotl codebase at src/axolotl/utils/chat_templates.py. -# tokenizer_default_fallback_*: where * is the name of the chat template to fallback to. -# E.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not -# available in the tokenizer. 
jinja: Uses a custom jinja template for the chat template. -# The custom jinja template should be provided in the chat_template_jinja field. The -# selected chat template will be saved to the tokenizer_config.json for easier -# inferencing -chat_template: ChatTemplate | Annotated[str, StringConstraints(pattern='^tokenizer_default_fallback_')] | None -# Custom jinja template or path to jinja file for chat template. This will be only used -# if chat_template is set to `jinja` or `null` (in which case chat_template is -# automatically set to `jinja`). Default is null. -chat_template_jinja: str | None -# Additional kwargs to pass to the chat template. This is useful for customizing the -# chat template. For example, you can pass `thinking=False` to add a generation prompt -# to the chat template. -chat_template_kwargs: dict[str, Any] | None -# Custom EOT (End-of-Turn) tokens to mask/unmask during training. These tokens mark the -# boundaries between conversation turns. For example: ['/INST', '</s>', -# '[/SYSTEM_PROMPT]']. If not specified, defaults to just the model's eos_token. This is -# useful for templates that use multiple delimiter tokens. -eot_tokens: list[str] | None -# Changes the default system message. Currently only supports chatml. -default_system_message: str | None - -fix_untrained_tokens: int | list[int] | None - -is_preprocess: bool | None -preprocess_iterable: bool | None - -# Total number of tokens - internal use -total_num_tokens: int | None -total_supervised_tokens: int | None -# You can set these packing optimizations AFTER starting a training at least once. The -# trainer will provide recommended values for these values. -sample_packing_eff_est: float | None -axolotl_config_path: str | None - -# Internal use only - Used to identify which the model is based on -is_falcon_derived_model: bool | None -# Internal use only - Used to identify which the model is based on -is_llama_derived_model: bool | None -# Internal use only - Used to identify which the model is based on. Please note that if -# you set this to true, `padding_side` will be set to 'left' by default -is_mistral_derived_model: bool | None -# Internal use only - Used to identify which the model is based on -is_qwen_derived_model: bool | None - -# Add plugins to extend the pipeline. See `src/axolotl/integrations` for the available -# plugins or doc below for more details. -# https://docs.axolotl.ai/docs/custom_integrations.html -plugins: list[str] | None - -# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files. This -# can also be a relative path to a model on disk -base_model: str (required) -# If the base_model repo on hf hub doesn't include configuration .json files, You can -# set that here, or leave this empty to default to base_model -base_model_config: str | None -cls_model_config: str | None -# Optional tokenizer configuration path in case you want to use a different tokenizer -# than the one defined in the base model -tokenizer_config: str | None -# use_fast option for tokenizer loading from_pretrained, default to True -tokenizer_use_fast: bool | None -# Whether to use the legacy tokenizer setting, defaults to True -tokenizer_legacy: bool | None -# Whether to use mistral-common tokenizer. If set to True, it will use the mistral- -# common tokenizer. 
-tokenizer_use_mistral_common: bool | None -# Corresponding tokenizer for the model AutoTokenizer is a good choice -tokenizer_type: str | None -# transformers processor class -processor_type: str | None -# Trust remote code for untrusted source -trust_remote_code: bool | None - -# Where to save the full-finetuned model to -output_dir: str = ./model-out -# push checkpoints to hub -hub_model_id: str | None -# how to push checkpoints to hub -hub_strategy: str | None -# Save model as safetensors (require safetensors package). Default True -save_safetensors: bool | None = True - -# This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer -load_in_8bit: bool | None = False -# Use bitsandbytes 4 bit -load_in_4bit: bool | None = False - -# If you want to use 'lora' or 'qlora' or leave blank to train all parameters in -# original model -adapter: str | None -# If you already have a lora model trained that you want to load, put that here. This -# means after training, if you want to test the model, you should set this to the value -# of `output_dir`. Note that if you merge an adapter to the base model, a new -# subdirectory `merged` will be created under the `output_dir`. -lora_model_dir: str | None -lora_r: int | None -lora_alpha: int | None -lora_fan_in_fan_out: bool | None -lora_target_modules: str | list[str] | None -# If true, will target all linear modules -lora_target_linear: bool | None -# If you added new tokens to the tokenizer, you may need to save some LoRA modules -# because they need to know the new tokens. For LLaMA and Mistral, you need to save -# `embed_tokens` and `lm_head`. It may vary for other models. `embed_tokens` converts -# tokens to embeddings, and `lm_head` converts embeddings to token probabilities. -lora_modules_to_save: list[str] | None -lora_dropout: float | None = 0.0 -# The layer indices to transform, otherwise, apply to all layers -peft_layers_to_transform: list[int] | None -peft_layers_pattern: list[str] | None - -peft: PeftConfig | None - # For PeftConfig: - # Configuration options for loftq initialization for LoRA - loftq_config: LoftQConfig | None - # For LoftQConfig: - # typically 4 bits - loftq_bits: int = 4 - -# Whether to use DoRA. -peft_use_dora: bool | None -# Whether to use RSLoRA. -peft_use_rslora: bool | None -# List of layer indices to replicate. -peft_layer_replication: list[tuple[int, int]] | None -# How to initialize LoRA weights. Default to True which is MS original implementation. -peft_init_lora_weights: bool | str | None - -# load qlora model in sharded format for FSDP using answer.ai technique. -qlora_sharded_model_loading: bool | None = False -# Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it -# takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge -lora_on_cpu: bool | None -# Whether you are training a 4-bit GPTQ quantized model -gptq: bool | None -# optional overrides to the bnb 4bit quantization configuration -bnb_config_kwargs: dict[str, Any] | None - -# loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4. -loraplus_lr_ratio: float | None -# loraplus learning rate for lora embedding layers. Default value is 1e-6. 
-loraplus_lr_embedding: float | None = 1e-06 - -merge_lora: bool | None - -# Number of steps per ReLoRA restart -relora_steps: int | None -# Number of per-restart warmup steps -relora_warmup_steps: int | None -# Number of anneal steps for each relora cycle -relora_anneal_steps: int | None -# threshold for optimizer magnitude when pruning -relora_prune_ratio: float | None -# True to perform lora weight merges on cpu during restarts, for modest gpu memory -# savings -relora_cpu_offload: bool | None - -# If greater than 1, backpropagation will be skipped and the gradients will be -# accumulated for the given number of steps. -gradient_accumulation_steps: int | None = 1 -# The number of samples to include in each batch. This is the number of samples sent to -# each GPU. Batch size per gpu = micro_batch_size * gradient_accumulation_steps -micro_batch_size: int | None = 1 -# Total batch size, we do not recommended setting this manually -batch_size: int | None -# per gpu micro batch size for evals, defaults to value of micro_batch_size -eval_batch_size: int | None - -# whether to find batch size that fits in memory. Passed to underlying transformers -# Trainer -auto_find_batch_size: bool | None - -# Whether to mask out or include the human's prompt from the training labels -train_on_inputs: bool | None = False -# Group similarly sized data to minimize padding. May be slower to start, as it must -# download and sort the entire dataset. Note that training loss may have an oscillating -# pattern with this enabled. -group_by_length: bool | None - -learning_rate: str | float (required) -embedding_lr: float | None -embedding_lr_scale: float | None -# Specify weight decay -weight_decay: float | None = 0.0 -# Specify optimizer -optimizer: OptimizerNames | CustomSupportedOptimizers | None = OptimizerNames.ADAMW_TORCH_FUSED -# Dictionary of arguments to pass to the optimizer -optim_args: str | dict[str, Any] | None -# The target modules to optimize, i.e. the module names that you would like to train, -# right now this is used only for GaLore algorithm -optim_target_modules: list[str] | Literal['all_linear'] | None -# Path to torch distx for optim 'adamw_anyprecision' -torchdistx_path: str | None -lr_scheduler: SchedulerType | Literal['one_cycle'] | Literal['rex'] | None = SchedulerType.COSINE -# Specify a scheduler and kwargs to use with the optimizer -lr_scheduler_kwargs: dict[str, Any] | None -lr_quadratic_warmup: bool | None -# decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of -# peak lr -cosine_min_lr_ratio: float | None -# freeze lr at some percentage of the step, e.g. 
cosine_constant_lr_ratio=0.8 means -# start cosine_min_lr at 80% of training step -cosine_constant_lr_ratio: float | None -# Learning rate div factor -lr_div_factor: float | None - -lr_groups: list[LrGroup] | None - # For LrGroup: - name: str (required) - modules: list[str] (required) - lr: float (required) - -# adamw hyperparams -adam_epsilon: float | None -# only used for CAME Optimizer -adam_epsilon2: float | None -# adamw hyperparams -adam_beta1: float | None -# adamw hyperparams -adam_beta2: float | None -# only used for CAME Optimizer -adam_beta3: float | None -# Gradient clipping max norm -max_grad_norm: float | None -num_epochs: float = 1.0 - -use_wandb: bool | None -# Set the name of your wandb run -wandb_name: str | None -# Set the ID of your wandb run -wandb_run_id: str | None -# "offline" to save run metadata locally and not sync to the server, "disabled" to turn -# off wandb -wandb_mode: str | None -# Your wandb project name -wandb_project: str | None -# A wandb Team name if using a Team -wandb_entity: str | None -wandb_watch: str | None -# "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only -# at the end of training -wandb_log_model: str | None - -use_mlflow: bool | None -# URI to mlflow -mlflow_tracking_uri: str | None -# Your experiment name -mlflow_experiment_name: str | None -# Your run name -mlflow_run_name: str | None -# set to true to copy each saved checkpoint on each save to mlflow artifact registry -hf_mlflow_log_artifacts: bool | None - -# Enable or disable Comet integration. -use_comet: bool | None -# API key for Comet. Recommended to set via `comet login`. -comet_api_key: str | None -# Workspace name in Comet. Defaults to the user's default workspace. -comet_workspace: str | None -# Project name in Comet. Defaults to Uncategorized. -comet_project_name: str | None -# Identifier for the experiment. Used to append data to an existing experiment or -# control the key of new experiments. Default to a random key. -comet_experiment_key: str | None -# Create a new experiment ("create") or log to an existing one ("get"). Default -# ("get_or_create") auto-selects based on configuration. -comet_mode: str | None -# Set to True to log data to Comet server, or False for offline storage. Default is -# True. -comet_online: bool | None -# Dictionary for additional configuration settings, see the doc for more details. -comet_experiment_config: dict[str, Any] | None - -# the number of activate layers in LISA -lisa_n_layers: int | None -# how often to switch layers in LISA -lisa_step_interval: int | None -# path under the model to access the layers -lisa_layers_attribute: str | None = model.layers - -gradio_title: str | None -gradio_share: bool | None -gradio_server_name: str | None -gradio_server_port: int | None -gradio_max_new_tokens: int | None -gradio_temperature: float | None - -use_ray: bool = False -ray_run_name: str | None -ray_num_workers: int = 1 -resources_per_worker: dict - -# The size of the image to resize to. It can be an integer (resized into padded-square -# image) or a tuple (width, height).If not provided, we will attempt to load from -# preprocessor.size, otherwise, images won't be resized. -image_size: int | tuple[int, int] | None -# The resampling algorithm to use for image resizing. Default is bilinear. Please refer -# to PIL.Image.Resampling for more details. 
-image_resize_algorithm: Literal['bilinear', 'bicubic', 'lanczos'] | Resampling | None - -# optional overrides to the base model configuration -overrides_of_model_config: dict[str, Any] | None -# optional overrides the base model loading from_pretrained -overrides_of_model_kwargs: dict[str, Any] | None -# If you want to specify the type of model to load, AutoModelForCausalLM is a good -# choice too -type_of_model: str | None -# You can specify to choose a specific model revision from huggingface hub -revision_of_model: str | None - -max_packed_sequence_len: int | None -rope_scaling: Any | None -noisy_embedding_alpha: float | None -dpo_beta: float | None -evaluation_strategy: str | None
    +# Number of devices to shard across. If not set, will use all available devices. +dp_shard_size: int | None +# Number of devices to replicate across. +dp_replicate_size: int | None +# Deprecated: use `context_parallel_size` instead +sequence_parallel_degree: int | None +# Set to a divisor of the number of GPUs available to split sequences into chunks of +# equal size. Use in long context training to prevent OOM when sequences cannot fit into +# a single GPU's VRAM. E.g., if 4 GPUs are available, set this value to 2 to split each +# sequence into two equal-sized subsequences, or set to 4 to split into four equal-sized +# subsequences. See https://docs.axolotl.ai/docs/sequence_parallelism.html for more +# details. +context_parallel_size: int | None +# Optional; strides across the key dimension. Larger values use more memory but should +# make training faster. Must evenly divide the number of KV heads in your model. +heads_k_stride: int | None +# One of 'varlen_llama3', 'batch_ring', 'batch_zigzag', 'batch_stripe'. Defaults to +# 'varlen_llama3' in the sample packing case, and 'batch_ring' in the non-sample packing +# case. +ring_attn_func: RingAttnFunc | None +# Number of tensor parallel processes in TP group. Only supported with DeepSpeed AutoTP. +tensor_parallel_size: int | None + +# Add or change special tokens. If you add tokens here, you don't need to add them to +# the `tokens` list. +special_tokens: SpecialTokensConfig | None + # For SpecialTokensConfig: + bos_token: str | None + eos_token: str | None + pad_token: str | None + unk_token: str | None + additional_special_tokens: list[str] | None + +# Add extra tokens to the tokenizer +tokens: list[str] | None +# Mapping token_id to new_token_string to override reserved added_tokens in the +# tokenizer. Only works for tokens that are not part of the base vocab (aka are +# added_tokens). Can be checked if they exist in tokenizer.json added_tokens. +added_tokens_overrides: dict[int, str] | None + +# Whether to use torch.compile and which backend to use. setting to `auto` will enable +# torch compile when torch>=2.6.0 +torch_compile: Literal['auto'] | bool | None +# Backend to use for torch.compile +torch_compile_backend: str | None +torch_compile_mode: Literal['default', 'reduce-overhead', 'max-autotune'] | None + +# Maximum number of iterations to train for. It precedes num_epochs which means that if +# both are set, num_epochs will not be guaranteed. e.g., when 1 epoch is 1000 steps => +# `num_epochs: 2` and `max_steps: 100` will train for 100 steps +max_steps: int | None +# Number of warmup steps. Cannot use with warmup_ratio +warmup_steps: int | None +# Warmup ratio. Cannot use with warmup_steps +warmup_ratio: float | None +# Leave empty to eval at each epoch, integer for every N steps. float for fraction of +# total steps +eval_steps: int | float | None +# Number of times per epoch to run evals, mutually exclusive with eval_steps +evals_per_epoch: int | None +# Set to `no` to skip evaluation, `epoch` at end of each epoch, leave empty to infer +# from `eval_steps` +eval_strategy: str | None + +# Leave empty to save at each epoch, integer for every N steps. 
float for fraction of +# total steps +save_steps: int | float | None +# Number of times per epoch to save a checkpoint, mutually exclusive with save_steps +saves_per_epoch: int | None +# Set to `no` to skip checkpoint saves, `epoch` at end of each epoch, `best` when better +# result is achieved, leave empty to infer from `save_steps` +save_strategy: str | None +# Checkpoints saved at a time +save_total_limit: int | None +# Whether to checkpoint a model after the first step of training. Defaults to False. +save_first_step: bool | None + +# Logging frequency +logging_steps: int | None +# Stop training after this many evaluation losses have increased in a row. https://huggi +# ngface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppin +# gCallback +early_stopping_patience: int | None +load_best_model_at_end: bool | None = False +# Save only the model weights, skipping the optimizer. Using this means you can't resume +# from checkpoints. +save_only_model: bool | None = False +# Use tensorboard for logging +use_tensorboard: bool | None +# Enable the pytorch profiler to capture the first N steps of training to the +# output_dir. see https://pytorch.org/blog/understanding-gpu-memory-1/ for more +# information. Snapshots can be visualized @ https://pytorch.org/memory_viz +profiler_steps: int | None +# Which step to start the profiler at. Useful for only capturing a few steps mid-run. +profiler_steps_start: int | None = 0 +# bool of whether to include tokens trainer per second in the training metrics. This +# iterates over the entire dataset once, so it takes some time. +include_tokens_per_second: bool | None + +# NEFT https://arxiv.org/abs/2310.05914, set this to a number (paper default is 5) to +# add noise to embeddings. Currently only supported on Llama and Mistral +neftune_noise_alpha: float | None + +# Parameter controlling the relative ratio loss weight in the ORPO loss. Passed to +# `beta` in `ORPOConfig` due to trl mapping. +orpo_alpha: float | None +# Weighting of NLL term in loss from RPO paper +rpo_alpha: float | None +# Target reward margin for the SimPO loss +simpo_gamma: float | None +# Weight of the BC regularizer +cpo_alpha: float | None + +# Factor for desirable loss term in KTO loss +kto_desirable_weight: float | None +# Factor for undesirable loss term in KTO loss +kto_undesirable_weight: float | None +# The beta parameter for the RL training +rl_beta: float | None + +# Defines the max memory usage per gpu on the system. Passed through to transformers +# when loading the model. +max_memory: dict[int | Literal['cpu', 'disk'], int | str] | None +# Limit the memory for all available GPUs to this amount (if an integer, expressed in +# gigabytes); default: unset +gpu_memory_limit: int | str | None +# Whether to use low_cpu_mem_usage +low_cpu_mem_usage: bool | None + +# The name of the chat template to use for training, following values are supported: +# tokenizer_default: Uses the chat template that is available in the +# tokenizer_config.json. If the chat template is not available in the tokenizer, it will +# raise an error. This is the default value. +# alpaca/inst/chatml/gemma/cohere/llama3/phi_3/deepseek_v2/jamba: These chat templates +# are available in the axolotl codebase at src/axolotl/utils/chat_templates.py. +# tokenizer_default_fallback_*: where * is the name of the chat template to fallback to. +# E.g. tokenizer_default_fallback_chatml. This is useful when the chat template is not +# available in the tokenizer. 
jinja: Uses a custom jinja template for the chat template. +# The custom jinja template should be provided in the chat_template_jinja field. The +# selected chat template will be saved to the tokenizer_config.json for easier +# inferencing +chat_template: ChatTemplate | Annotated[str, StringConstraints(pattern='^tokenizer_default_fallback_')] | None +# Custom jinja template or path to jinja file for chat template. This will be only used +# if chat_template is set to `jinja` or `null` (in which case chat_template is +# automatically set to `jinja`). Default is null. +chat_template_jinja: str | None +# Additional kwargs to pass to the chat template. This is useful for customizing the +# chat template. For example, you can pass `thinking=False` to add a generation prompt +# to the chat template. +chat_template_kwargs: dict[str, Any] | None +# Custom EOT (End-of-Turn) tokens to mask/unmask during training. These tokens mark the +# boundaries between conversation turns. For example: ['/INST', '</s>', +# '[/SYSTEM_PROMPT]']. If not specified, defaults to just the model's eos_token. This is +# useful for templates that use multiple delimiter tokens. +eot_tokens: list[str] | None +# Changes the default system message. Currently only supports chatml. +default_system_message: str | None + +fix_untrained_tokens: int | list[int] | None + +is_preprocess: bool | None +preprocess_iterable: bool | None + +# Total number of tokens - internal use +total_num_tokens: int | None +total_supervised_tokens: int | None +# You can set these packing optimizations AFTER starting a training at least once. The +# trainer will provide recommended values for these values. +sample_packing_eff_est: float | None +axolotl_config_path: str | None + +# Internal use only - Used to identify which the model is based on +is_falcon_derived_model: bool | None +# Internal use only - Used to identify which the model is based on +is_llama_derived_model: bool | None +# Internal use only - Used to identify which the model is based on. Please note that if +# you set this to true, `padding_side` will be set to 'left' by default +is_mistral_derived_model: bool | None +# Internal use only - Used to identify which the model is based on +is_qwen_derived_model: bool | None + +# Add plugins to extend the pipeline. See `src/axolotl/integrations` for the available +# plugins or doc below for more details. +# https://docs.axolotl.ai/docs/custom_integrations.html +plugins: list[str] | None + +# This is the huggingface model that contains *.pt, *.safetensors, or *.bin files. This +# can also be a relative path to a model on disk +base_model: str (required) +# If the base_model repo on hf hub doesn't include configuration .json files, You can +# set that here, or leave this empty to default to base_model +base_model_config: str | None +cls_model_config: str | None +# Optional tokenizer configuration path in case you want to use a different tokenizer +# than the one defined in the base model +tokenizer_config: str | None +# use_fast option for tokenizer loading from_pretrained, default to True +tokenizer_use_fast: bool | None +# Whether to use the legacy tokenizer setting, defaults to True +tokenizer_legacy: bool | None +# Whether to use mistral-common tokenizer. If set to True, it will use the mistral- +# common tokenizer. 
+tokenizer_use_mistral_common: bool | None +# Corresponding tokenizer for the model AutoTokenizer is a good choice +tokenizer_type: str | None +# transformers processor class +processor_type: str | None +# Trust remote code for untrusted source +trust_remote_code: bool | None + +# Where to save the full-finetuned model to +output_dir: str = ./model-out +# push checkpoints to hub +hub_model_id: str | None +# how to push checkpoints to hub +hub_strategy: str | None +# Save model as safetensors (require safetensors package). Default True +save_safetensors: bool | None = True + +# This will attempt to quantize the model down to 8 bits and use adam 8 bit optimizer +load_in_8bit: bool | None = False +# Use bitsandbytes 4 bit +load_in_4bit: bool | None = False + +# If you want to use 'lora' or 'qlora' or leave blank to train all parameters in +# original model +adapter: str | None +# If you already have a lora model trained that you want to load, put that here. This +# means after training, if you want to test the model, you should set this to the value +# of `output_dir`. Note that if you merge an adapter to the base model, a new +# subdirectory `merged` will be created under the `output_dir`. +lora_model_dir: str | None +lora_r: int | None +lora_alpha: int | None +lora_fan_in_fan_out: bool | None +lora_target_modules: str | list[str] | None +# If true, will target all linear modules +lora_target_linear: bool | None +# If you added new tokens to the tokenizer, you may need to save some LoRA modules +# because they need to know the new tokens. For LLaMA and Mistral, you need to save +# `embed_tokens` and `lm_head`. It may vary for other models. `embed_tokens` converts +# tokens to embeddings, and `lm_head` converts embeddings to token probabilities. +lora_modules_to_save: list[str] | None +lora_dropout: float | None = 0.0 +# The layer indices to transform, otherwise, apply to all layers +peft_layers_to_transform: list[int] | None +peft_layers_pattern: list[str] | None + +peft: PeftConfig | None + # For PeftConfig: + # Configuration options for loftq initialization for LoRA + loftq_config: LoftQConfig | None + # For LoftQConfig: + # typically 4 bits + loftq_bits: int = 4 + +# Whether to use DoRA. +peft_use_dora: bool | None +# Whether to use RSLoRA. +peft_use_rslora: bool | None +# List of layer indices to replicate. +peft_layer_replication: list[tuple[int, int]] | None +# How to initialize LoRA weights. Default to True which is MS original implementation. +peft_init_lora_weights: bool | str | None + +# load qlora model in sharded format for FSDP using answer.ai technique. +qlora_sharded_model_loading: bool | None = False +# Do the LoRA/PEFT loading on CPU -- this is required if the base model is so large it +# takes up most or all of the available GPU VRAM, e.g. during a model and LoRA merge +lora_on_cpu: bool | None +# Whether you are training a 4-bit GPTQ quantized model +gptq: bool | None +# optional overrides to the bnb 4bit quantization configuration +bnb_config_kwargs: dict[str, Any] | None + +# loraplus learning rate ratio lr_B / lr_A. Recommended value is 2^4. +loraplus_lr_ratio: float | None +# loraplus learning rate for lora embedding layers. Default value is 1e-6. +loraplus_lr_embedding: float | None = 1e-06 + +merge_lora: bool | None + +# Whether to use ReLoRA. Use with jagged_restart_*steps options. 
+relora: bool | None +# threshold for optimizer magnitude when pruning +relora_prune_ratio: float | None +# True to perform lora weight merges on cpu during restarts, for modest gpu memory +# savings +relora_cpu_offload: bool | None + +# how often to reset for jagged restarts +jagged_restart_steps: int | None +# how many warmup steps to take after reset for jagged restarts +jagged_restart_warmup_steps: int | None +# how many anneal steps to take before reset for jagged restarts +jagged_restart_anneal_steps: int | None + +# If greater than 1, backpropagation will be skipped and the gradients will be +# accumulated for the given number of steps. +gradient_accumulation_steps: int | None = 1 +# The number of samples to include in each batch. This is the number of samples sent to +# each GPU. Batch size per gpu = micro_batch_size * gradient_accumulation_steps +micro_batch_size: int | None = 1 +# Total batch size, we do not recommended setting this manually +batch_size: int | None +# per gpu micro batch size for evals, defaults to value of micro_batch_size +eval_batch_size: int | None + +# whether to find batch size that fits in memory. Passed to underlying transformers +# Trainer +auto_find_batch_size: bool | None + +# Whether to mask out or include the human's prompt from the training labels +train_on_inputs: bool | None = False +# Group similarly sized data to minimize padding. May be slower to start, as it must +# download and sort the entire dataset. Note that training loss may have an oscillating +# pattern with this enabled. +group_by_length: bool | None + +learning_rate: str | float (required) +embedding_lr: float | None +embedding_lr_scale: float | None +# Specify weight decay +weight_decay: float | None = 0.0 +# Specify optimizer +optimizer: OptimizerNames | CustomSupportedOptimizers | None = OptimizerNames.ADAMW_TORCH_FUSED +# Dictionary of arguments to pass to the optimizer +optim_args: str | dict[str, Any] | None +# The target modules to optimize, i.e. the module names that you would like to train, +# right now this is used only for GaLore algorithm +optim_target_modules: list[str] | Literal['all_linear'] | None +# Path to torch distx for optim 'adamw_anyprecision' +torchdistx_path: str | None +lr_scheduler: SchedulerType | Literal['one_cycle'] | Literal['rex'] | None = SchedulerType.COSINE +# Specify a scheduler and kwargs to use with the optimizer +lr_scheduler_kwargs: dict[str, Any] | None +lr_quadratic_warmup: bool | None +# decay lr to some percentage of the peak lr, e.g. cosine_min_lr_ratio=0.1 for 10% of +# peak lr +cosine_min_lr_ratio: float | None +# freeze lr at some percentage of the step, e.g. 
cosine_constant_lr_ratio=0.8 means +# start cosine_min_lr at 80% of training step +cosine_constant_lr_ratio: float | None +# Learning rate div factor +lr_div_factor: float | None + +lr_groups: list[LrGroup] | None + # For LrGroup: + name: str (required) + modules: list[str] (required) + lr: float (required) + +# adamw hyperparams +adam_epsilon: float | None +# only used for CAME Optimizer +adam_epsilon2: float | None +# adamw hyperparams +adam_beta1: float | None +# adamw hyperparams +adam_beta2: float | None +# only used for CAME Optimizer +adam_beta3: float | None +# Gradient clipping max norm +max_grad_norm: float | None +num_epochs: float = 1.0 + +use_wandb: bool | None +# Set the name of your wandb run +wandb_name: str | None +# Set the ID of your wandb run +wandb_run_id: str | None +# "offline" to save run metadata locally and not sync to the server, "disabled" to turn +# off wandb +wandb_mode: str | None +# Your wandb project name +wandb_project: str | None +# A wandb Team name if using a Team +wandb_entity: str | None +wandb_watch: str | None +# "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only +# at the end of training +wandb_log_model: str | None + +use_mlflow: bool | None +# URI to mlflow +mlflow_tracking_uri: str | None +# Your experiment name +mlflow_experiment_name: str | None +# Your run name +mlflow_run_name: str | None +# set to true to copy each saved checkpoint on each save to mlflow artifact registry +hf_mlflow_log_artifacts: bool | None + +# Enable or disable Comet integration. +use_comet: bool | None +# API key for Comet. Recommended to set via `comet login`. +comet_api_key: str | None +# Workspace name in Comet. Defaults to the user's default workspace. +comet_workspace: str | None +# Project name in Comet. Defaults to Uncategorized. +comet_project_name: str | None +# Identifier for the experiment. Used to append data to an existing experiment or +# control the key of new experiments. Default to a random key. +comet_experiment_key: str | None +# Create a new experiment ("create") or log to an existing one ("get"). Default +# ("get_or_create") auto-selects based on configuration. +comet_mode: str | None +# Set to True to log data to Comet server, or False for offline storage. Default is +# True. +comet_online: bool | None +# Dictionary for additional configuration settings, see the doc for more details. +comet_experiment_config: dict[str, Any] | None + +# the number of activate layers in LISA +lisa_n_layers: int | None +# how often to switch layers in LISA +lisa_step_interval: int | None +# path under the model to access the layers +lisa_layers_attribute: str | None = model.layers + +gradio_title: str | None +gradio_share: bool | None +gradio_server_name: str | None +gradio_server_port: int | None +gradio_max_new_tokens: int | None +gradio_temperature: float | None + +use_ray: bool = False +ray_run_name: str | None +ray_num_workers: int = 1 +resources_per_worker: dict + +# The size of the image to resize to. It can be an integer (resized into padded-square +# image) or a tuple (width, height).If not provided, we will attempt to load from +# preprocessor.size, otherwise, images won't be resized. +image_size: int | tuple[int, int] | None +# The resampling algorithm to use for image resizing. Default is bilinear. Please refer +# to PIL.Image.Resampling for more details. 
+image_resize_algorithm: Literal['bilinear', 'bicubic', 'lanczos'] | Resampling | None + +# optional overrides to the base model configuration +overrides_of_model_config: dict[str, Any] | None +# optional overrides the base model loading from_pretrained +overrides_of_model_kwargs: dict[str, Any] | None +# If you want to specify the type of model to load, AutoModelForCausalLM is a good +# choice too +type_of_model: str | None +# You can specify to choose a specific model revision from huggingface hub +revision_of_model: str | None + +max_packed_sequence_len: int | None +rope_scaling: Any | None +noisy_embedding_alpha: float | None +dpo_beta: float | None +evaluation_strategy: str | None diff --git a/docs/sequence_parallelism.html b/docs/sequence_parallelism.html index 50374ccbf..750b07adf 100644 --- a/docs/sequence_parallelism.html +++ b/docs/sequence_parallelism.html @@ -538,13 +538,13 @@ through a ring communication pattern.

    Configuration

    To enable sequence parallelism, add the following to your configuration file:

    # Set to a divisor (> 1) of the number of GPUs available
    -sequence_parallel_degree: 4  # Split sequences across 4 GPUs
    +context_parallel_size: 4  # Split sequences across 4 GPUs
     # Optional; strides across the key dimension. Larger values use more memory but should make training faster.
     heads_k_stride: 1
     # Optional; one of "varlen_llama3" or "batch_ring". Defaults to
     # "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
     ring_attn_func:
    -

    The sequence_parallel_degree should be a divisor of the total number of GPUs. For example:

    +

    The context_parallel_size should be a divisor of the total number of GPUs. For example:
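As an illustration (assuming an 8-GPU node, which is not part of the original page), either of the following satisfies the divisor requirement; each GPU then processes sequence_length / context_parallel_size tokens of every sequence:

context_parallel_size: 2  # four sequence-parallel groups of 2 GPUs each
# or
context_parallel_size: 4  # two sequence-parallel groups of 4 GPUs each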