From cb7185998b23673cf95e61ca74159cc232f008a8 Mon Sep 17 00:00:00 2001
From: zeke <40004347+KAJdev@users.noreply.github.com>
Date: Mon, 14 Apr 2025 18:33:27 -0800
Subject: [PATCH] remove LICENSE and fix README

---
 .runpod/LICENSE   |  21 ----
 .runpod/README.md | 277 +++++++++++++++++++++-------------------------
 2 files changed, 128 insertions(+), 170 deletions(-)
 delete mode 100644 .runpod/LICENSE

diff --git a/.runpod/LICENSE b/.runpod/LICENSE
deleted file mode 100644
index a80f426da..000000000
--- a/.runpod/LICENSE
+++ /dev/null
@@ -1,21 +0,0 @@
-MIT License
-
-Copyright (c) 2023 runpod-workers
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.

diff --git a/.runpod/README.md b/.runpod/README.md
index 1e2227030..52ac7e5bf 100644
--- a/.runpod/README.md
+++ b/.runpod/README.md
@@ -1,15 +1,5 @@

LLM Training - Full finetune, LoRA, QLoRA, etc. for Llama/Mistral/Gemma

-## RunPod Worker Images
-
-Below is a summary of the available RunPod Worker images, categorized by image stability and CUDA version compatibility.
-
-| Preview Image Tag               | Development Image Tag       |
-| ------------------------------- | --------------------------- |
-| `runpod/llm-finetuning:preview` | `runpod/llm-finetuning:dev` |
-
# Configuration Options

This document outlines all available configuration options for training models. The configuration can be provided as a JSON request.

@@ -19,6 +9,7 @@ This document outlines all available configuration options for training models.

You can use these configuration options:

1. As a JSON request body:

```json
{
  "input": {

@@ -41,187 +32,180 @@ You can use these configuration Options:

### Model Configuration

| Option              | Description                                                                                      | Default              |
| ------------------- | ------------------------------------------------------------------------------------------------ | -------------------- |
| `base_model`        | Path to the base model (local or HuggingFace)                                                    | Required             |
| `base_model_config` | Configuration path for the base model                                                            | Same as `base_model` |
| `revision_of_model` | Specific model revision from the HuggingFace Hub                                                 | Latest               |
| `tokenizer_config`  | Custom tokenizer configuration path                                                              | Optional             |
| `model_type`        | Type of model to load                                                                            | AutoModelForCausalLM |
| `tokenizer_type`    | Type of tokenizer to use                                                                         | AutoTokenizer        |
| `hub_model_id`      | Repository ID where the model will be pushed on the Hugging Face Hub (format: `username/repo-name`) | Optional         |

## Model Family Identification

| Option                     | Default | Description                        |
| -------------------------- | ------- | ---------------------------------- |
| `is_falcon_derived_model`  | `false` | Whether the model is Falcon-based  |
| `is_llama_derived_model`   | `false` | Whether the model is LLaMA-based   |
| `is_qwen_derived_model`    | `false` | Whether the model is Qwen-based    |
| `is_mistral_derived_model` | `false` | Whether the model is Mistral-based |

## Model Configuration Overrides

| Option                                          | Default    | Description                        |
| ----------------------------------------------- | ---------- | ---------------------------------- |
| `overrides_of_model_config.rope_scaling.type`   | `"linear"` | RoPE scaling type (linear/dynamic) |
| `overrides_of_model_config.rope_scaling.factor` | `1.0`      | RoPE scaling factor                |
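For illustration, a minimal request body combining the model options and the RoPE override above might look like the following sketch. The model ID is a placeholder, and the field nesting is assumed to follow the dotted option paths in the tables:

```json
{
  "input": {
    "base_model": "NousResearch/Llama-2-7b-hf",
    "model_type": "AutoModelForCausalLM",
    "tokenizer_type": "AutoTokenizer",
    "is_llama_derived_model": true,
    "overrides_of_model_config": {
      "rope_scaling": { "type": "linear", "factor": 2.0 }
    }
  }
}
```

A `factor` of `2.0` with `linear` scaling roughly doubles the usable context window of a RoPE-based model, usually at some cost in quality.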
### Model Loading Options

| Option         | Description                   | Default |
| -------------- | ----------------------------- | ------- |
| `load_in_8bit` | Load model in 8-bit precision | `false` |
| `load_in_4bit` | Load model in 4-bit precision | `false` |
| `bf16`         | Use bfloat16 precision        | `false` |
| `fp16`         | Use float16 precision         | `false` |
| `tf32`         | Use TensorFloat-32 precision  | `false` |

## Memory and Device Settings

| Option             | Default   | Description             |
| ------------------ | --------- | ----------------------- |
| `gpu_memory_limit` | `"20GiB"` | GPU memory limit        |
| `lora_on_cpu`      | `false`   | Load LoRA on CPU        |
| `device_map`       | `"auto"`  | Device mapping strategy |
| `max_memory`       | `null`    | Max memory per device   |

## Training Hyperparameters

| Option                        | Default   | Description                 |
| ----------------------------- | --------- | --------------------------- |
| `gradient_accumulation_steps` | `1`       | Gradient accumulation steps |
| `micro_batch_size`            | `2`       | Batch size per GPU          |
| `eval_batch_size`             | `null`    | Evaluation batch size       |
| `num_epochs`                  | `4`       | Number of training epochs   |
| `warmup_steps`                | `100`     | Warmup steps                |
| `warmup_ratio`                | `0.05`    | Warmup ratio                |
| `learning_rate`               | `0.00003` | Learning rate               |
| `lr_quadratic_warmup`         | `false`   | Quadratic warmup            |
| `logging_steps`               | `null`    | Logging frequency           |
| `eval_steps`                  | `null`    | Evaluation frequency        |
| `evals_per_epoch`             | `null`    | Evaluations per epoch       |
| `save_strategy`               | `"epoch"` | Checkpoint saving strategy  |
| `save_steps`                  | `null`    | Saving frequency            |
| `saves_per_epoch`             | `null`    | Saves per epoch             |
| `save_total_limit`            | `null`    | Maximum checkpoints to keep |
| `max_steps`                   | `null`    | Maximum training steps      |
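As a sketch of how these options combine: the effective batch size is `micro_batch_size` × `gradient_accumulation_steps` × number of GPUs, so the fragment below trains with an effective batch of 8 on a single GPU. The values are illustrative, not recommendations:

```json
{
  "input": {
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_epochs": 4,
    "learning_rate": 0.00003,
    "warmup_steps": 100,
    "save_strategy": "epoch",
    "save_total_limit": 2
  }
}
```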
### Dataset Configuration

```yaml
datasets:
  - path: vicgalle/alpaca-gpt4 # HuggingFace dataset (TODO: local paths will be supported)
    type: alpaca # Format type (alpaca, gpteacher, oasst, etc.)
    ds_type: json # Dataset type
    data_files: path/to/data # Source data files
    train_on_split: train # Dataset split to use
```

## Chat Template Settings

| Option                   | Default                          | Description            |
| ------------------------ | -------------------------------- | ---------------------- |
| `chat_template`          | `"tokenizer_default"`            | Chat template type     |
| `chat_template_jinja`    | `null`                           | Custom Jinja template  |
| `default_system_message` | `"You are a helpful assistant."` | Default system message |

## Dataset Processing

| Option                        | Default                    | Description                       |
| ----------------------------- | -------------------------- | --------------------------------- |
| `dataset_prepared_path`       | `"data/last_run_prepared"` | Path for prepared dataset         |
| `push_dataset_to_hub`         | `""`                       | Push dataset to HF Hub            |
| `dataset_processes`           | `4`                        | Number of preprocessing processes |
| `dataset_keep_in_memory`      | `false`                    | Keep dataset in memory            |
| `shuffle_merged_datasets`     | `true`                     | Shuffle merged datasets           |
| `dataset_exact_deduplication` | `true`                     | Deduplicate datasets              |

## LoRA Configuration

| Option                     | Default                | Description                    |
| -------------------------- | ---------------------- | ------------------------------ |
| `adapter`                  | `"lora"`               | Adapter type (lora/qlora)      |
| `lora_model_dir`           | `""`                   | Directory with pretrained LoRA |
| `lora_r`                   | `8`                    | LoRA attention dimension       |
| `lora_alpha`               | `16`                   | LoRA alpha parameter           |
| `lora_dropout`             | `0.05`                 | LoRA dropout                   |
| `lora_target_modules`      | `["q_proj", "v_proj"]` | Modules to apply LoRA to       |
| `lora_target_linear`       | `false`                | Target all linear modules      |
| `peft_layers_to_transform` | `[]`                   | Layers to transform            |
| `lora_modules_to_save`     | `[]`                   | Modules to save                |
| `lora_fan_in_fan_out`      | `false`                | Fan-in/fan-out weight layout   |
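For example, a QLoRA-style adapter setup might look like the sketch below. In practice `adapter: "qlora"` is typically paired with `load_in_4bit: true`, though this README does not spell out the worker's exact requirements; all values here simply echo the documented defaults:

```json
{
  "input": {
    "adapter": "qlora",
    "load_in_4bit": true,
    "lora_r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_target_modules": ["q_proj", "v_proj"]
  }
}
```

`lora_alpha` is commonly set to 1-2× `lora_r`; raising `lora_r` increases adapter capacity and memory use.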
## Optimization Settings

| Option                    | Default | Description                |
| ------------------------- | ------- | -------------------------- |
| `train_on_inputs`         | `false` | Train on input prompts     |
| `group_by_length`         | `false` | Group by sequence length   |
| `gradient_checkpointing`  | `false` | Use gradient checkpointing |
| `early_stopping_patience` | `3`     | Early stopping patience    |

## Learning Rate Scheduling

| Option                     | Default    | Description          |
| -------------------------- | ---------- | -------------------- |
| `lr_scheduler`             | `"cosine"` | Scheduler type       |
| `lr_scheduler_kwargs`      | `{}`       | Scheduler parameters |
| `cosine_min_lr_ratio`      | `null`     | Minimum LR ratio     |
| `cosine_constant_lr_ratio` | `null`     | Constant LR ratio    |
| `lr_div_factor`            | `null`     | LR division factor   |

## Optimizer Settings

| Option                 | Default      | Description         |
| ---------------------- | ------------ | ------------------- |
| `optimizer`            | `"adamw_hf"` | Optimizer choice    |
| `optim_args`           | `{}`         | Optimizer arguments |
| `optim_target_modules` | `[]`         | Target modules      |
| `weight_decay`         | `null`       | Weight decay        |
| `adam_beta1`           | `null`       | Adam beta1          |
| `adam_beta2`           | `null`       | Adam beta2          |
| `adam_epsilon`         | `null`       | Adam epsilon        |
| `max_grad_norm`        | `null`       | Gradient clipping   |

## Attention Implementations

| Option                     | Default | Description                      |
| -------------------------- | ------- | -------------------------------- |
| `flash_optimum`            | `false` | Use BetterTransformer (Optimum)  |
| `xformers_attention`       | `false` | Use xformers attention           |
| `flash_attention`          | `false` | Use Flash Attention              |
| `flash_attn_cross_entropy` | `false` | Flash Attention cross entropy    |
| `flash_attn_rms_norm`      | `false` | Flash Attention RMS norm         |
| `flash_attn_fuse_qkv`      | `false` | Fuse QKV operations              |
| `flash_attn_fuse_mlp`      | `false` | Fuse MLP operations              |
| `sdp_attention`            | `false` | Use scaled dot-product attention |
| `s2_attention`             | `false` | Use shifted sparse attention     |
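Putting the scheduler, optimizer, and attention options together, a request fragment might look like this sketch. The beta, epsilon, and weight-decay values are standard AdamW choices, not values mandated by this worker:

```json
{
  "input": {
    "optimizer": "adamw_hf",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "weight_decay": 0.01,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.05,
    "flash_attention": true
  }
}
```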
## Tokenizer Modifications

| Option           | Default | Description                  |
| ---------------- | ------- | ---------------------------- |
| `special_tokens` | -       | Special tokens to add/modify |
| `tokens`         | `[]`    | Additional tokens            |

## Distributed Training

| Option                  | Default | Description           |
| ----------------------- | ------- | --------------------- |
| `fsdp`                  | `null`  | FSDP configuration    |
| `fsdp_config`           | `null`  | FSDP config options   |
| `deepspeed`             | `null`  | DeepSpeed config path |
| `ddp_timeout`           | `null`  | DDP timeout           |
| `ddp_bucket_cap_mb`     | `null`  | DDP bucket capacity   |
| `ddp_broadcast_buffers` | `null`  | DDP broadcast buffers |
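As a final fragment sketch, tokenizer additions and distributed settings might be supplied as below. The token strings and the DeepSpeed config path are placeholders, and the shape of `special_tokens` as a name-to-token mapping is an assumption based on its description above:

```json
{
  "input": {
    "special_tokens": { "pad_token": "<|pad|>" },
    "tokens": ["<|im_start|>", "<|im_end|>"],
    "deepspeed": "path/to/deepspeed_config.json",
    "ddp_timeout": 1800
  }
}
```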

Example Configuration Request:

@@ -299,20 +283,21 @@ Here's a complete example for fine-tuning a LLaMA model using LoRA:
  }
}
```

### Advanced Features

#### Wandb Integration

- `wandb_project`: Project name for Weights & Biases
- `wandb_entity`: Team name in W&B
- `wandb_watch`: Monitor model with W&B
- `wandb_name`: Name of the W&B run
- `wandb_run_id`: ID for the W&B run

#### Performance Optimization

- `sample_packing`: Enable efficient sequence packing
- `eval_sample_packing`: Use sequence packing during evaluation
- `torch_compile`: Enable PyTorch 2.0 compilation

@@ -336,8 +321,6 @@ The following optimizers are supported:

- `sgd`: Stochastic Gradient Descent
- `adagrad`: Adagrad optimizer

## Notes

- Set `load_in_8bit: true` or `load_in_4bit: true` for memory-efficient training

@@ -347,10 +330,6 @@ The following optimizers are supported:

For more detailed information, please refer to the [documentation](https://axolotl-ai-cloud.github.io/axolotl/docs/config.html).

### Errors

- If you face any issues with Flash Attention 2, delete your worker and restart it.