Productivity
News: 🎉🎉🎉 We now support the full training process of the pre-trained Doge-Base, the instruction fine-tuned Doge-Instruct, and the reasoning fine-tuned Doge-R1; please refer to the guide!
[!TIP] We hope to use open-source tools and frameworks as much as possible to simplify the process from data processing to model training, so that beginners can easily understand and use them. 🤗
This project aims to develop a series of dynamic and fast small models to promote their application in embodied intelligence, especially in resource-constrained environments that demand real-time responses, and to advance their practical use in downstream fields.
[!TIP] As of 2025-2-20: the small-doge series has completed pre-training of 3 models, the smallest being only 20M, and they are already capable of smooth conversation!
Model | tokens | max_train_steps | batch_size | learning_rate | scheduler | warmup_ratio | decay_ratio | weight_decay | min_lr_rate |
---|---|---|---|---|---|---|---|---|---|
Doge-20M | 4B | 8,000 | 256 | 8e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
Doge-60M | 16B | 16,000 | 512 | 6e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
Doge-160M | 32B | 24,000 | 768 | 4e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
The following model is still in pre-training, and researchers with the compute to spare are welcome to help (a poor man's plea)!
Model | tokens | max_train_steps | batch_size | learning_rate | scheduler | warmup_ratio | decay_ratio | weight_decay | min_lr_rate |
---|---|---|---|---|---|---|---|---|---|
Doge-320M | 64B | 32,000 | 1024 | 2e-3 | warmup_stable_decay | 0.1 | 0.1 | 0.01 | 0.0 |
As shown in the figure, the sequence-transformation part of the Doge architecture uses Dynamic Mask Attention, which can be understood as self-attention modulated by the value states during training, and as a state space without past-state decay during inference, to solve the problem of existing Transformers or SSMs getting lost in long text. The state-transformation part of Doge uses a Cross Domain Mixture of Experts, which consists of dense linear layers and sparse embedding layers; the sparse parameters can be added on top of a dense weight checkpoint and trained further without retraining the entire model, reducing the cost of continuously iterating on the model. In addition, Doge uses RMSNorm and residual connections with learnable parameters to adapt to the gradient range of deep models.
Dynamic Mask Attention Module
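To make the mechanism concrete, here is a minimal PyTorch sketch of the dynamic-mask idea. It is only an illustration, not the project's implementation: the gate projection `dt_proj` and the exact masking function are assumptions, but it shows how a signal derived from the value states becomes an additive log-space mask on the attention scores.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_attention(q, k, v, dt_proj):
    # q, k, v: (batch, heads, seq_len, head_dim); dt_proj: nn.Linear(heads * head_dim, heads)
    B, H, T, D = q.shape

    # Per-head, per-position gate in (0, 1), derived from the value states.
    gate = torch.sigmoid(dt_proj(v.transpose(1, 2).reshape(B, T, H * D)))   # (B, T, H)

    # Log-space additive mask: gates near 0 push the corresponding key
    # positions toward -inf, i.e. they get dynamically masked out.
    dyn_mask = torch.log(gate).transpose(1, 2).unsqueeze(2)                 # (B, H, 1, T)

    # Ordinary causal mask on top of the dynamic one.
    causal = torch.full((T, T), float("-inf"), device=q.device).triu(1)

    scores = (q @ k.transpose(-2, -1)) / D**0.5 + dyn_mask + causal
    return F.softmax(scores, dim=-1) @ v
```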
Cross Domain Mixture of Experts Module
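Likewise, a rough sketch of the cross-domain MoE idea, under the same caveat (the class name, the rank-1 expert form, and the default hyperparameters below are assumptions for illustration): the dense gated MLP keeps the shape of an existing dense checkpoint, while sparsely retrieved embedding "experts" add extra parameters on top of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainMoE(nn.Module):
    """Simplified illustration: a dense gated MLP plus sparsely retrieved
    embedding experts whose parameters can be added to a dense checkpoint."""

    def __init__(self, hidden, inter, num_experts=1024, top_k=8):
        super().__init__()
        # Dense path: an ordinary gated MLP (same shape as a dense checkpoint).
        self.gate_proj = nn.Linear(hidden, inter)
        self.up_proj = nn.Linear(hidden, inter)
        self.down_proj = nn.Linear(inter, hidden)
        # Sparse path: routing keys plus embedding tables acting as experts.
        self.queries = nn.Linear(hidden, hidden)
        self.keys = nn.Parameter(torch.randn(num_experts, hidden) * 0.02)
        self.down_embed = nn.Embedding(num_experts, hidden)
        self.up_embed = nn.Embedding(num_experts, hidden)
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, seq, hidden)
        # Dense computation, unchanged from a plain MLP.
        dense = self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

        # Route each token to its top-k experts.
        scores = self.queries(x) @ self.keys.t()           # (batch, seq, num_experts)
        _, idx = scores.topk(self.top_k, dim=-1)
        down = self.down_embed(idx)                        # (batch, seq, k, hidden)
        up = self.up_embed(idx)

        # Each retrieved expert acts as a rank-1 map: project down with its
        # down-embedding, apply a nonlinearity, project back up with its up-embedding.
        coeff = F.silu((down * x.unsqueeze(-2)).sum(-1))   # (batch, seq, k)
        sparse = (coeff.unsqueeze(-1) * up).sum(-2)        # (batch, seq, hidden)

        return dense + sparse
```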
Our codebase requires the following environment if you need to pre-train or fine-tune:
We highly recommend that you install the latest version of PyTorch and CUDA for optimal performance.
Of course, you can also use the open-source Docker PyTorch image to avoid the hassle of configuring the environment.
docker pull nvcr.io/nvidia/pytorch:24.12-py3
docker run --privileged --gpus all -it --name PyTorch --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 -v <your code path>:/workspace -v <your datasets path>:/workspace/Doge/datasets nvcr.io/nvidia/pytorch:24.12-py3
- `pip install transformers`: The core framework for all subsequent work.
- `pip install datasets sentencepiece boto3`: Used to download and process datasets.
- `pip install accelerate`: Used for distributed training.
- `pip install trl`: Used for fine-tuning with reinforcement learning.

git clone https://github.com/SmallDoges/small-doge.git
cd small-doge
pip install -e .
We have written a notebook and a training guide that demonstrate the entire process of dataset processing, model training, and model evaluation. You can also use the released models directly. If you are interested, please read the notebook or the training guide in detail; they contain the specific steps and details!
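If you only want to try a released checkpoint, a minimal generation example with 🤗 Transformers might look like the following; the Hub repository id and the use of `trust_remote_code` are assumptions here, so point them at whichever released model you actually downloaded:

```python
# Minimal usage sketch with Hugging Face Transformers. The repository id and
# the need for trust_remote_code are assumptions, not confirmed by this README.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SmallDoge/Doge-20M-Instruct"  # assumed Hub id of a released model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Instruct checkpoints are assumed to ship a chat template; fall back to plain
# tokenization if yours does not.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi, how are you today?"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```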
Doge uses wsd_scheduler as the training scheduler, which divides the learning rate into three stages: warmup, stable, and decay. It allows us to continue training on any new dataset from any checkpoint in the stable stage without spikes in the training loss.
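For intuition, here is a minimal sketch of such a warmup-stable-decay schedule. It assumes linear warmup and linear decay, which may differ from the exact decay shape the project uses, but it reproduces the 10% warmup / 10% decay split from the hyperparameter tables above.

```python
# Illustrative warmup-stable-decay schedule (a sketch, not the project's exact scheduler).
def wsd_lr(step, max_steps, base_lr, warmup_ratio=0.1, decay_ratio=0.1, min_lr_rate=0.0):
    warmup_steps = int(max_steps * warmup_ratio)
    decay_steps = int(max_steps * decay_ratio)
    stable_end = max_steps - decay_steps

    if step < warmup_steps:                      # warmup: 0 -> base_lr
        return base_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # stable: constant base_lr
        return base_lr
    # decay: base_lr -> base_lr * min_lr_rate
    progress = (step - stable_end) / max(decay_steps, 1)
    return base_lr * (1.0 - progress * (1.0 - min_lr_rate))

# Example: Doge-20M (8,000 steps, lr 8e-3) gets 800 warmup steps and 6,400 stable
# steps, matching the checkpoint table below.
print(wsd_lr(800, 8000, 8e-3))   # 8e-3 right after warmup ends
print(wsd_lr(4000, 8000, 8e-3))  # 8e-3 throughout the stable phase
```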
Here are the initial learning rates required to continue training at each checkpoint:
Model | Learning Rate | Schedule | Warmup Steps | Stable Steps |
---|---|---|---|---|
Doge-20M | 8e-3 | wsd_scheduler | 800 | 6400 |
Doge-60M | 6e-3 | wsd_scheduler | 1600 | 12800 |
Doge-160M | 4e-3 | wsd_scheduler | 2400 | 19200 |
Doge-320M | 2e-3 | wsd_scheduler | 3200 | 25600 |
Pre-Training:
Model | Training Data | Steps | Context Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours |
---|---|---|---|---|---|---|---|---|
Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14 |
Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128 |
Doge-160M | HuggingFaceTB/smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522 |
Evaluation:
Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on i7-11 CPU |
---|---|---|---|---|---|---|---|---|
Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142 |
Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62 |
Doge-160M | 29.2 | 4.8 | 44.4 | 66.3 | 38.7 | 34.4 | 52.2 | 28 |
SFT:
Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
---|---|---|---|---|---|---|
Doge-20M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
Doge-60M-Instruct-SFT | HuggingFaceTB/smoltalk | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
DPO:
Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
---|---|---|---|---|---|---|
Doge-20M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 8e-5 | 0.125M | bfloat16 |
Doge-60M-Instruct | HuggingFaceH4/ultrafeedback_binarized | 2 | 1024 | 6e-5 | 0.125M | bfloat16 |
Evaluation:
Model | IFEval (Prompt Strict Acc) | MMLU | BBH | ARC | PIQA | HellaSwag | tokens / s on i7-11 CPU |
---|---|---|---|---|---|---|---|
Doge-20M-Instruct | 7.3 | 26.3 | 18.3 | 29.2 | 57.8 | 27.8 | 142 |
Doge-60M-Instruct | 7.4 | 27.5 | 27.7 | 37.5 | 61.4 | 32.1 | 62 |
Doge-160M-Instruct | 16.8 | 29.7 | 29.1 | 42.8 | 64.1 | 37.1 | 28 |
Training Environment:
[!IMPORTANT]
- If you find this project helpful, please consider giving it a star ⭐!
- Due to time and expertise constraints, there may be omissions in the project. Feel free to submit your insights through Issues or PRs to help improve it; your support is what drives the project's continued progress!
- One person can go fast, but a group of people can go further. If you have already trained a new small-doge model, feel free to share your model weights, training recipes, evaluation results, and other relevant information in Discussions or Issues. It can be a new small-doge variant for a specific downstream task or vertical field, such as sentiment recognition, medical, psychological, financial, or legal Q&A, or an extended training run that explores longer text sequences, more parameters, or larger datasets. Your sharing will greatly promote the development of the community!
If you use this codebase, or otherwise find our work valuable, please cite our paper:
@misc{shi2024wonderfulmatrices,
title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
author={Jingze Shi and Bingheng Wu},
year={2024},
eprint={2412.11834},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.11834},
}