SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning (work in progress)

Yuze Zhao Jintao Huang Jinghan Hu Xingjun Wang Yunlin Mao Daoze Zhang Zeyinzi Jiang Zhikai Wu Baole Ai Ang Wang Wenmeng Zhou Yingda Chen ModelScope Team, Alibaba Group
{yuze.zyz,huangjintao.hjt,xingjun.wxj,maoyunlin.myl,zhangdaoze.zdz,zeyinzi.jzyz,wuzhikai.wzk,
baole.abl,wangang.wa,wenmeng.zwm,yingda.chen}@alibaba-inc.com
hjhcs@zju.edu.cn HangZhou China
https://github.com/modelscope/ms-swift

Abstract.

Recent development in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leverage Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal tasks like Visual Question Answering (VQA) and Optical Character Recognition (OCR), which were previously addressed using different models, can now be tackled based on one foundation model. Consequently, the training and lightweight fine-tuning of LLMs and MLLMs, especially those based on Transformer architecture, has become particularly important. In recognition of these overwhelming needs, we develop SWIFT, a customizable one-stop infrastructure for large models. With support of over $300+$ LLMs and $50+$ MLLMs, SWIFT stands as the open-source framework that provide the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. In addition to the core functionalities of fine-tuning, SWIFT also integrates post-training processes such as inference, evaluation, and model quantization, to facilitate fast adoptions of large models in various application scenarios. With a systematic integration of various training techniques, SWIFT offers helpful utilities such as benchmark comparisons among different training techniques for large models. For fine-tuning models specialized in agent framework, we show that notable improvements on the ToolBench leader-board can be achieved by training with customized dataset on SWIFT, with an increase of 5.2%-21.8% in the Act.EM metric over various baseline models, a reduction in hallucination by 1.6%-14.1%, and an average performance improvement of 8%-17%.

1. Introduction

In the past few years, Transformer (NIPS2017_3f5ee243, ) have been widely recognized as one of the dominant architectures for large models. In the early stages, encoder-only structures are utilized to accomplish tasks such as text-classification and sequence-labeling, with models like BERT(devlin2018bert, ) serving as typical examples. Conversely, encoder-decoder and decoder-only structures were used mostly for text generation tasks. In comparison, vision models often adopts ResNet architecture to handle tasks like Visual Question Answering (VQA), Object Detection, and Image Segmentation. These early approaches to deep learning tasks were characterized by the use of distinct model structures for different tasks.

The abundance in computational power and structured training data has brought out potentials of Transformer-based models, as industry begins to overtake models tuned for single-task. This shift positions Transformer as the preferred architectures for open-domain applications. Most notable examples include the GPT models(NEURIPS2020_1457c0d6, ; radford2019language, ) from OpenAI, as well as M6 (lin2021m6, ) and OFA (pmlr-v162-wang22al, ) models. Such progress highlighted the feasibility of using a single model to address multiple closed-domain tasks. Consequently, the paradigm of leveraging large-scale foundation models for generative tasks has emerged as the new standard for addressing multiple tasks including text-classification and sequence-labeling. The attention mechanism has also gained traction in addressing distinct multi-modal tasks with one foundation model. The release of Qwen-VL (bai2023qwen, ), GLM4-V (glm2024chatglm, ), InternVL (chen2023internvl, ), and DiT models (peebles2023scalablediffusionmodelstransformers, ) , can all satisfy to this trend. These foundation models exhibit robust capabilities in open-domain image-text and video-text question answering, as well as in image generation. They have also shown potentials for recognizing detailed image information and performing bounding-box annotation, achieving results comparable to previous closed-domain solutions.

Throughout the development of large models, open-source communities have played a critical role. Platforms such as Hugging Face¹¹1Web: https://huggingface.co/; GitHub: https://github.com/huggingface, and ModelScope²²2Web: https://modelscope.cn; GitHub: https://github.com/modelscope are both notable examples in promoting sharing and development of large models. Launched in 2017, Hugging Face was initially tasked with the mission to address issues related to the PyTorch version of BERT. The resulting Transformers library later became the de facto standard for implementing large models. At the same time, the Hugging Face trainer also supports multi-node and multi-GPU parallel training methods such as DeepSpeed and FSDP, making it one of the most widely used trainers. For alignment techniques, TRL³³3GitHub:https://github.com/huggingface/trl is introduced to extend base trainer class from Transformers and implements specific methods to support techniques such as Direct Preference Optimization (DPO)(rafailov2024directpreferenceoptimizationlanguage, ), Optimized Reward Policy Optimization (ORPO)(hong2024orpomonolithicpreferenceoptimization, ), and Knowledge Transfer Optimization (KTO)(ethayarajh2024ktomodelalignmentprospect, ).

Given the large number of parameters and high memory consumption of large models, their out-of-box training has become as a significant bottleneck in the proliferation of AI. Early solutions, such as Prefix Tuning (li2021prefixtuning, ), Prompt Tuning (lester-etal-2021-power, ), and P-Tuning (liu2021gpt, ; DBLP:journals/corr/abs-2110-07602, ) open the chapter of resource-efficient training, but they can suffer from “knowledge forgetting” – a phenomenon description the scenario where fine-tuned LLMs may lost it general capacities from the foundation models. The introduction of LoRA (hu2022lora, ) shows the potentials of reduces memory consumption during training significantly comparing to full-parameter training, without loss of generality in models. This allows developers to embark on efficient-training on domain data using hardware that is much easier to access. Subsequently, more similar techniques were introduced, such as the enhancement algorithm rsLoRA (kalajdzievski2023rankstabilizationscalingfactor, ), DoRA (liu2024doraweightdecomposedlowrankadaptation, ), PISSA (meng2024pissaprincipalsingularvalues, ), OLoRA (buyukakyuz2024oloraorthonormallowrankadaptation, ), LoRA+ (hayou2024loraefficientlowrank, ), and the LLaMA-Pro (wu2024llamaproprogressivellama, ) have proliferated, providing an array of new techniques for efficient fine-tuning. In recognition of the vast differences among these different techniques, efforts begin to emerge to unify training interfaces. For example, Hugging Face has introduced the PEFT ⁴⁴4GitHub: https://github.com/huggingface/peft that specializes in collecting and standardizing interfaces for efficient fine-tuning algorithms.

In addition to lightweight training techniques based on additional structures such as LoRA, quantization stands as another solution for reducing memory consumption during training. Typically, LLMs use 16-bit half-precision formats, such as float16 and bfloat16, for both inference and training. By reducing the tensor types to 8-bit or 4-bit, the same model can be loaded with less memory. It is even possible to run the model with 1-bit or 1.5-bit precision; this approach is collectively known as quantization. For instance, BitsAndBytes (dettmers2022llmint8, ) employs segmented quantization with outlier thresholds, AutoGPTQ (frantar2023gptqaccurateposttrainingquantization, ) performs Taylor series decomposition on parameter matrices and uses the Hessian matrix to evaluate parameter importance, and AWQ (lin2024awqactivationawareweightquantization, ) evaluates parameter significance and applies scaling factors for quantization. Due to the complexity of quantization techniques and their poor adaptability to different models, the Hugging Face community has introduced the Optimum library⁵⁵5GitHub: https://github.com/huggingface/optimum as a unified implementation for various quantization methods. Nevertheless, the task of LLM training and fine-tuning remains formidable for most developers, as the aforementioned solutions only cover support of a limited number of models and techniques. In particular, support for newer models and techniques are often lacking in existing solutions. Furthermore, to ensure that the trained models can be effectively deployed, the post-training processes, such as inference and evaluation, are also steps in utilization of trained models. To address this, we introduced SWIFT ⁶⁶6GitHub:https://github.com/modelscope/swift, an open-source framework targeted at facilitating lightweight training of large models, which also incorporates comprehensive functionality for post-training processes. SWIFT assists developers to perform training and inference operations with minimal learning overhead. By streamlining various technical components, sourced or self-developed, in a unified way, SWIFT is tasked to enable efficient training and development pipelines of large models.

Specifically, our contributions can be summarized as follows:

•

We introduce SWIFT, a training framework compatible with the general model standards of the Transformers library. SWIFT integrates libraries such as PEFT and Optimum, enabling pre-training, fine-tuning, and human alignment for LLM and MLLM models. In addition quantization training (QLoRA (dettmers2023qloraefficientfinetuningquantized, )) is supported as well. Today SWIFT supports over 300 LLM models and over 50 MLLM models, encompassing all major open-source models. It also comes with support of over 150 pure text and multi-modal datasets.
•

Other than the standard Attention structures, training and inference for model structures such as Mamba (gu2024mambalineartimesequencemodeling, ) model are also supported in Swift. Furthermore, training with Megatron (shoeybi2020megatronlmtrainingmultibillionparameter, ) structured models is supported as well which facilitates large-scale parallel pre-training across multiple nodes and GPUs to be performed with SWIFT.
•

Several SOTA tuners are implemented or planted in SWIFT project to enhance lightweight training. These tuners are designed to be used independent of our SWIFT training loop to allow more flexible usage.
•

Numerous post-training operations are integrated in SWIFT library, including quantization (BNB/GPTQ/AWQ, etc.), LoRA merging, evaluation (supporting over 100+ pure text and multi-modal evaluation sets), as well as inference and deployment capabilities. For deployment, we support native PyTorch deployment and inference acceleration based on vLLM (kwon2023efficient, ) , and LMDeploy (2023lmdeploy, ), together these integrations provide support for inference against most text and multi-modal large models.

In summary, we have comprehensively constructed a complete technical chain around LLM training, effectively reducing the cost of understanding and using large models. Particularly for the training of multi-modal models, to our knowledge, we are the first open-source framework to establish a comprehensive multi-task training and complete end-to-end solution for numerous multi-modal large models.

	LLaMA-Factory	FireFly	FastChat	Axolotl	LMFlow	SWIFT(Ours)
LoRA	✓	✓	✓	✓	✓	✓
QLoRA	✓	✓	✓	✓	✓	✓
LLaMA-Pro	✓					✓
LongLoRA(chen2024longloraefficientfinetuninglongcontext, )	✓	✓		✓		✓
GaLore(zhao2024galore, )	✓			✓		✓
Q-GaLore(zhang2024qgalore, )				✓		✓
FourierFt(gao2024parameterefficientfinetuningdiscretefourier, )						✓
LoRA+	✓			✓		✓
LISA				✓	✓	✓
DoRA	✓			✓		✓
rsLoRA	✓			✓		✓
UnSloth	✓	✓		✓		✓
LLM-PRETRAIN	✓	✓	✓	✓	✓	✓
LLM-Megatron-PRETRAIN						✓
LLM-SFT	✓	✓	✓	✓	✓	✓
LLM-DPO	✓	✓		✓	✓	✓
LLM-CPO(xu2024contrastivepreferenceoptimizationpushing, )				✓		✓
LLM-ORPO	✓			✓		✓
LLM-KTO	✓			✓		✓
LLM-SimPO(meng2024simposimplepreferenceoptimization, )	✓			✓		✓
MLLM-PRETRAIN	3 models					50+ models
MLLM-SFT	3 models					50+ models
MLLM-RLHF	3 models					50+ models
vLLM	✓		✓		✓	✓
LMDeploy						✓
LLM Evaluation	3 datasets		✓		✓	48 datasets, 2 custom datasets
MLLM Evaluation						95 datasets
WEB-UI	✓		✓			✓

Table 1. The comparison of support for training auxiliary capabilities

2. Related Works

LLaMA-Factory(zheng2024llamafactoryunifiedefficientfinetuning, ) is a versatile, all-in-one large model training framework. This framework is fully compatible with the Hugging Face model ecosystem. Additionally, it supports a WEB-UI based on Gradio, further reducing the cost of usage. LLaMA-Factory supports the pre-training, fine-tuning, and human alignment of over 100 text LLMs. It also facilitates the training of some of them multi-modal models such as LLaVA, PaliGemma, and YI-VL. In terms of evaluation capabilities, it supports the evaluation processes for the CEVAL, MMLU, and CMMLU datasets and enables inference and deployment workflows based on vLLM.

Firefly⁷⁷7GitHub:https://github.com/yangjianxin1/Firefly leverages the transformers training ecosystem (Trainer/PEFT, etc.). Remarkably, it explores training datasets and creates several popular datasets for NLP training, such as firefly-train-1.1M, moss-003-sft-data, and ultrachat. They have also utilized these dataset for lightweight training on various models, including firefly-mixtral-8x7b, which has outperformed Yi-34B-Chat on multiple leaderboards.

FastChat(zheng2023judging, ) is a comprehensive training and inference framework. This framework is equipped with capabilities for model training, inference, deployment, and evaluation. FastChat has leveraged Transformers and PEFT for training, and supports models such as LLaMA, Vicuna, and T5. It supports lightweight fine-tuning using techniques such as LoRA, QLoRA, and XFormers⁸⁸8GitHub: https://github.com/facebookresearch/xformers. For deployment, FastChat supports inference acceleration frameworks like vLLM, SGLang⁹⁹9https://github.com/sgl-project/sglang, and LightLLM¹⁰¹⁰10https://github.com/ModelTC/lightllm. FastChat places a focus on model inference and deployment, with relatively limited training support.

Axolotl¹¹¹¹11https://github.com/axolotl-ai-cloud/axolotl employs training component libraries such as TRL, PEFT, and Transformers. This framework has extended training capabilities, including the encapsulation of the mambassm¹²¹²12GitHub: https://github.com/state-spaces/mamba library, thereby enhancing the ability to train these models using the transformers ecosystem. Axolotl supports the LoRA and QLoRA training of various models across multiple series, including LLaMA, Mistral, Qwen, and Phi, and also supports inference and merge-LoRA operations.

The LMFlow(diao2023lmflow, ) encapsulates model training process in a pipeline style. It supports SFT and RLHF training for LLM models like LLaMA, Gemma, and Qwen, and allows for custom optimizers and tuners, such as LoRA. Additionally, it has developed lightweight fine-tuning techniques like LISA(pan2024lisalayerwiseimportancesampling, ). LMFlow provides capabilities for evaluating LLMs and supports inference and inference acceleration for both pure text and multi-modal models.

We have summarized capabilities of all these frameworks in the table 1 for easy reference.

3. Implementations and Frameworks

We believe that unifying multiple model architectures to enhance the all-around capability of a model, is an important trend in the development of large models. For instance, the primary distinction between text and multi-modal LLMs lies in the additional vision-tower component. The hidden states from this vision-tower, once processed through a projector, are integrated into the LLM’s embeddings. Furthermore, the majority of multi-modal models can support text input and perform inference in the manner of a text-only model.

Pre-training a text-only model typically requires processing data that amounts to the order of tens of terabytes of tokens, which is fastly approaching the limit of exhausting all effective text corpus available. However, from a multi-modal perspective, the hidden states of text, image, and video data can be interchanged in high-dimensional space(radford2021learningtransferablevisualmodels, ). Consequently, models trained through multi-modal pre-training could be considered to possess virtually unlimited data, therefore they will exhibit a significant data advantage over those trained solely on text. To this end, it is our belief that multi-modal models shall become predominant in future model development. In our framework design, we strive to eliminate the gap between training pure text LLMs and multi-modal LLMs, and we do this by establishing unified standards in data processing, model templates, and model training.

Training, or fine-tuning is not the end of LLM applications. Once a model is trained, there is often needs for convenient and efficient evaluation processes to determine model quality. These evaluation processes can even be integrated into the training phase for cross-validation (e.g., incorporating gsm8k evaluation during training), or during inference to conduct comprehensive evaluations on specific datasets. Additionally, post-training quantization of models can be performed to use quantized models for service, ensuring minimal memory usage while maintaining theoretical performance.

This necessity applies to model deployment as well. Efficient inference and deployment of various post-training checkpoints, including original models, LoRA models, LLaMA-Pro models, and quantized models, are equally important. Therefore, integrating upstream and downstream capabilities within the framework, alongside the training itself is crucial. This integration will not only streamline the overall process of model application, but also enable exploration different capabilities together. The joint relationship can be found between evaluation and deployment, between evaluation and training, as well as between quantization and deployment. We believe to truly lower the barrier for model utilization, it is of paramount importance to construct a unified framework for both text and multi-modal LLMs that centered around training capabilities and its downstream integration.

Refer to caption — Figure 1. The framework of SWIFT

3.1. Design of Training Architecture

SWIFT supports several categories of lightweight tuning techniques:

•

Reducing Trainable Parameters: This involves partially training the original model’s parameters. Reducing the number of trainable parameters can effectively decrease the number of gradient values. For instance, LISA, which randomly activates different layers, can significantly reduce memory usage without notably decreasing training accuracy.
•

Model Quantization: Quantization is a crucial method for reducing memory pressure. The main idea is to convert the low-precision floating-point values of the model into 8-bit or lower fixed-point values. SWIFT currently supports six types of quantization: BNB, HQQ, EETQ, AWQ, GPTQ, and AQLM. Generally, quantization methods are combined with additional structures for training, such as QLoRA.
•

Reducing Memory Usage of Gradient Values: Techniques such as GaLore perform SVD decomposition on gradient values, effectively reducing the memory required for storing these values.
•

Freezing the Original Model: This approach supports training with additional structures. Typical implementations include LoRA and AdaLoRA.
•

Sharding or Mixed Precision: Examples include DeepSpeed Zero1/2/3, FSDP, and mixed precision training.

As show in Fig. 1, tuners can leverage and extend the capabilities of the PEFT library. For instance, tuners incorporate techniques such as LoRA, AdaLoRA (zhang2023adaloraadaptivebudgetallocation, ), IA3 (liu2022fewshotparameterefficientfinetuningbetter, ), BOFT (liu2024parameterefficientorthogonalfinetuningbutterfly, ), and Vera (kopiczko2024veravectorbasedrandommatrix, ). These tuners are introduced with adjustments to ensure compatibility and seamless operation within MLLMs during training. Additionally, SWIFT offers support for a much wider range of tuners when comparing with PEFT, including SCEdit (jiang2023sceditefficientcontrollableimage, ) and ResTuning (jiang2023restuningflexibleefficienttuning, ), as well as LLaMA-Pro, LongLoRA, and LISA. These tuners can be used in combination, similar to the MixedPeftModel capability of PEFT, and support offloading of deactivated tuners to CPU or meta devices. This integration of tuners allows them to be applied to not only models supported within SWIFT, but also external models as well. SWIFT provides seamless support for both PEFT tuners and customized tuners through its prepare_model and from_pretrained methods.

In the model functionality module, SWIFT provides a basic model loader that allows for flexible customization of model configurations. Given that various compatibility issues may arise during training, such as dtype conversion errors or tensor in-place change errors, SWIFT utilizes a patcher module to address these issues post model-loading, ensuring smooth operation in different scenarios including single-GPU, multi-GPU, full-parameter, or LoRA training.

In the dataset module, three types of data sources are supported. The first is MsDataset that loads dataset from ModelScope. The second is the ‘datasets‘ module from Hugging Face, which provides loading capabilities for Hugging Face datasets. Lastly, we support user-defined datasets, such as local CSV or JSONL files. A key feature of the dataset module is the pre-processing capability, which serves two main functions: converting different datasets into a standard format. The specific format details can be found in the appendix sectionA.

One of the critical components of the model module is the template. This component ensures that various models supported by SWIFT can correctly produce key fields such as input_ids, attention_masks, pixel_values, and labels according to the design of model training. This module interfaces with the aforementioned standard dataset formats and converts these formats into different inputs as per the requirements of different models. Specifically, for multi-modal grounding tasks, bounding box (bbox) coordinates are converted within the template. For example, a bbox_type ’real’ represents actual pixel values, ’norm_1000’ represents values in thousandths, and ’norm_1’ represents normalized coordinate values. The template converts the actual coordinate values of the data into the coordinate values required by the model.

In the training component of the model, a significant part is the trainer, which includes both the SFT/PT trainer and the human alignment trainer. The former directly inherits from the trainer of Transformers and is used for predicting and training the cross-entropy of the next token. The latter inherits from the corresponding class of TRL and is used for training various RLHF algorithms such as DPO, ORPO, and KTO. For RLHF tasks of multi-modal models, we have made additional modifications and adaptations to ensure that all multi-modal models we support can use any compliant alignment dataset for RLHF training.

Specifically, in the direction of pre-training, SWIFT supports Megatron architecture models. Particularly in the CPT scenario, SWIFT first converts the checkpoints of the transformers architecture into the checkpoints of the Megatron architecture and then continues pre-training the model using various parallel methods of Megatron. After training, the checkpoints can be converted back to the format supported by transformers. SWIFT utilizes the PAI-Megatron-Patch framework ¹³¹³13GitHub: https://github.com/alibaba/Pai-Megatron-Patch for supporting Megatron. The conversion process of checkpoints is supported through the export module in SWIFT.

In the training component, new optimizers such as GaLore and Q-GaLore are integrated, making them readily available for use during training. To further alleviate training pressure, SWIFT supports sequence parallelism technology (2023xtuner, ), which distributes sequences across different processes under DDP conditions, thereby reducing the memory consumption for long-sequence training.

To facilitate the use of LLM training, inference, and evaluation in actual production environments, we have released SWIFT on PYPI and supported various functionalities via the command line. For the training process, SWIFT can be easily invoked using command-line commands, which can be found in the appendix sectionA.

SWIFT provides three commands for different tasks: ‘pt‘ for pre-training, ‘sft‘ for fine-tuning, and ‘rlhf‘ for RLHF. These commands are consistent for both pure text models and multi-modal models. For dataset selection, SWIFT supports the use of the ‘–dataset‘ option to directly use pure text and multi-modal datasets, and it also supports referencing local training files.

The highest-level interface is the web UI. For users who are familiar with graphical interfaces, using the web UI aligns more with their habits. The web UI uses Gradio as the foundational framework. Once the web UI is launched, users can select different training stages and adjust various training hyper-parameters directly in the interface. After the training starts, the interface will display training logs and loss/accuracy charts. For RLHF tasks, the charts will be replaced with metrics such as margin and logps that are relevant to the task type. This workflow is applicable to both pure text models and multi-modal models. Essentially, the SWIFT WEB UI serves as a command-line assembler. The web UI assembles the command-line strings for single-node multi-GPU or multi-node multi-GPU execution, and it uses these commands for multi-process background execution.

For lightweight training, SWIFT supports QLoRA training methods. The quantization methods available include BNB, HQQ (badri2023hqq, ), EETQ (gordon2023eptqenhancedposttrainingquantization, ), AWQ, GPTQ, and AQLM (egiazarian2024extremecompressionlargelanguage, ). In terms of model support, we facilitate the training processes for over 300 NLP models and more than 50 multi-modal models. Specifically, to effectively fine-tune agents, we collaborated to create the MS-Agent dataset ¹⁴¹⁴14WebPage: https://www.modelscope.cn/datasets/iic/ms_agent. This dataset is a relatively rare, high-quality Chinese fine-tuning dataset for agents. Subsequently, we updated the MSAgent-Pro dataset¹⁵¹⁵15WebPage: https://www.modelscope.cn/datasets/iic/MSAgent-Pro, which adopts the ToolBench format. This dataset is very important for supervised fine-tuning (SFT) to enhance agent capabilities, and it includes the Chain of Thought (CoT) process, which significantly improves the effectiveness of multi-turn agent calls. To facilitate agent fine-tuning, SWIFT supports the ‘tools‘ field in dataset formats and allows fine-tuning training using different prompt formats (e.g., ToolBench (qin2023toolllm, ) format, ReACT (yao2023reactsynergizingreasoningacting, ) format, or other formats defined by model templates). This can be seen in the SWIFT standard dataset definitions.

SWIFT supports the ‘loss-scale‘ technique (li2023modelscopeagentbuildingcustomizableagent, ), which increases the training weight for important tokens. This makes it easier for the model to remember important content during learning. We used this technique to increase the weights for crucial parts of agent training, such as Action and Action_Input fields, resulting in significant performance improvements compared to not using ‘loss-scale‘.

In the multi-modal field, SWIFT provides comprehensive support, and various open-domain tasks can be run in SWIFT, such as Vision Question Answering (VQA), Optical Character Recognition (OCR), Grounded Captioning, and Referring Grounding.

3.1.1. Design of Inference and Deployment Architecture

Inference and deployment inherently possess a natural interdependence. The core logic of inference can be applied within deployment, and conversely, deployment can be viewed as a service encapsulation of inference. SWIFT’s inference and deployment can be categorized into three types based on the backend: PyTorch Native(PT), vLLM, and LMDeploy. These three inference frameworks share identical parameters, allowing for the easy expansion of other inference acceleration frameworks in the future. One significant reason for SWIFT’s encapsulation of inference for vLLM and LMDeploy is that, in cases where the original framework does not adequately support the model’s templates, SWIFT can use its own templates to mask the differences between frameworks.

SWIFT employs FastAPI to encapsulate inference as a service, fully complying with the OpenAI universal interface definition¹⁶¹⁶16WebPage: https://platform.openai.com/docs/api-reference. For the deployment of Agent capabilities, SWIFT supports OpenAI standard fields such as tools and tool, and it also supports inference and deployment of Agent data formats like ToolBench and ReACT in terms of data format. We have directly incorporated the concatenation of Agent prompts into the Template, which means that we can easily support the specific Agent formats of different models.

SWIFT supports the inference and deployment of both official and trained models, and these functionalities are equally supported in the WEB-UI. This implies that users can utilize SWIFT both as a deployment framework and as a ChatBot. Notably, we have integrated support for multi-LoRA inference and deployment in both the vLLM and PT backends. Specifically, users can conveniently switch between different LoRA configurations by specifying the respective LoRA names within the OpenAI interface, without the need to merge the LoRA models.

3.1.2. Evaluation Architecture Design

Evaluation and inference deployment are interdependent. The evaluation of models, particularly those that are post-training, depends on whether the models can initiate inference or deployment. In many evaluation frameworks, such as OpenCompass(2023opencompass, ), the standard OpenAI interface is directly used as a dependency, which is one of the reasons for supporting inference and deployment in SWIFT. In practice, developers may use different inference backends, such as vLLM or LMDeploy. Therefore, during evaluation, developers can flexibly choose different backends and deployment forms (e.g., official models, post-training LoRA models, post-training LLaMA-Pro models, merged models, quantized models) for evaluation, making the process more aligned with their actual use cases or ablation study scenarios.

To facilitate the use of custom datasets, SWIFT supports custom evaluation datasets for two types of tasks:

•

Objective Question Evaluation Similar to CEval: Developers can format their datasets as CEval-style CSV files and then conduct evaluations, yielding conclusive results.
•

Subjective Question Evaluation for QA: This evaluation uses standard metrics like ROUGE and BLEU. After writing the data into a CSV file, developers can perform evaluations.
•

For evaluation capabilities, we rely on the EvalScope¹⁷¹⁷17GitHub:https://github.com/modelscope/evalscope framework from the ModelScope Community. This framework constructs evaluation capabilities by integrating OpenCompass (for text models) and VLMEvalKit(duan2024vlmevalkit, ) (for multi-modal models). By incorporating EvalScope, SWIFT currently supports over 100+ total pure text evaluation sets and multi-modal evaluation sets, as well as the aforementioned two types of custom evaluation datasets and their evaluation processes.

3.1.3. Design of Quantization and Export Architecture

The export module is primarily used for merging tuners, converting checkpoint formats, and quantization. Currently, the following types of export operations are supported within this module:

•

Merging Tuners: This includes merging tuners such as LoRA, LongLoRA, and LLaMA-Pro.
•

Converting Checkpoints: This involves the mutual conversion of checkpoints between the Transformers format and the Megatron format.
•

Quantization: At this stage, we support three quantization methods: AWQ, GPTQ, and BNB.

Quantize Method	QAT	QLoRA	PTQ
BNB	✓	✓	✓
HQQ	✓	✓
EETQ	✓	✓
AWQ		✓	✓
GPTQ		✓	✓
AQLM		✓

Table 2. Support for Quantization Methods

4. Exporting to Ollama: This process includes the incorporation of the model’s template configuration, allowing users to conveniently run the model using the ‘ollama‘ command.

4. Experiments

In addition to the algorithmic framework, we have also explored the tuning of the models and the underlying technology. Our objective is for SWIFT to serve not only as a framework but also as a way to validate the technology itself. To this end, we have divided our exploration of LLM training into several directions.

4.1. Lightweight Tuning Benchmark

We utilized SWIFT to replicate and validate the impact of various lightweight tuning algorithms on models. Using qwen-7b-chat as the base model, we conducted training on a single A100-80G GPU, comparing memory usage and loss, among other metrics.

Hyper-parameter	Value
batch_size	1
gradient_accumulation_steps	16
epoch	1
max_length	2048
learning_rate	5e-5
gradient_checkpointing	true
flash_attn	true
weight_decay	0.01
warmup_ratio	0.03
lora_rank	8
galore_rank	128
llamapro_new_blocks	4
lisa_activated_layers	2

Table 3. Tuner benchmark hyper-parameter settings

The experiment hyper-parameters displays in chart 3, and the experiment results are summarized in table 4.

Tuner	Train/Eval loss	Trainable (M)	Memory (GiB)	Speed (samples/s)
AdaLoRA	0.57 / 1.07	26.84 (0.35%)	32.55	0.92
DoRA	0.53 / 1.01	19.25 (0.25%)	32.46	0.51
GaLore	0.55 / 1.00	7721.32 (100%)	47.02	1.10
Q-GaLore	0.54 / 1.00	7721.32 (100%)	41.53	1.45
LLaMAPro	0.53 / 1.00	809.58 (9.49%)	38.11	1.53
LoRA+	0.53 / 0.98	17.89 (0.23%)	32.35	0.95
LoRA	0.53 / 1.01	17.89 (0.23%)	32.35	0.95
RsLoRA	0.53 / 0.99	17.89 (0.23%)	32.35	0.94
LISA	0.62 / 1.06	-	31.11	2.66
Full	0.54 / 0.95	7721.32 (100%)	73.53	1.43

Table 4. Profiles of various Tuners

Among the benchmarks, ”Full” represents the control group experiment using full-parameter training. It can be observed that the lowest memory consumption and fastest speed are achieved by LISA. Within the additional structure tuners, the lowest evaluation loss is recorded by LoRA+. In gradient reduction methods, Q-GaLore exhibits the lowest memory consumption. None of these tuning methods are included in the PEFT library.

4.2. Agent Training

Agent training constitutes an important category within model SFT. The quality of Agent training determines whether the model can be applied within an Agent framework to solve practical business problems. Generally, Agent capabilities are categorized into three types:

1. Document retrieval

2. Code Interpreter

3. API Calling

In the narrow sense, Agent training typically refers to the API Calling. This capability is positively correlated with the model’s Chain of Thought (CoT) capability; the stronger the CoT capability, the better the model’s understanding of APIs and its ability to reflect upon errors.

In this study, we utilized a mixed dataset comprising the ToolBench dataset and the AgentFlan (chen2024agentflandesigningdatamethods, ) dataset, and conducted a series of experiments. We employed the LLaMA3-8b-instruct model and the Qwen2-7b-instruct model for training, and compared the results before and after training. The hyper-parameter settings are shown in the table 5.

Hyper-parameter	Value
batch_size	1
gradient_accumulation_steps	32
epoch	1
max_length	4096
learning_rate	2e-5
gradient_checkpointing	true
flash_attn	true
lora_target_modules	All linears
lora_rank	8

Table 5. Agent experiment hyper-parameter settings

In certain experiments, we employed the loss-scale technique to enhance the weights of some important tokens.

The ablation study comparing the loss-scale technique based on the LLaMA3-8b-instruct model is shown in the table 6 and 7.

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	74.22	36.17	15.68	20.0	12.14
w/o loss-scale	84.29	55.71	4.85	49.40	25.06
w/ loss-scale	85.1	58.15	1.57	52.10	26.02

Table 6. LoRA (in-domain) ablation tests for loss-scale

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	69.47	34.21	14.72	20.25	14.07
w/o loss-scale	85.10	55.55	5.26	48.52	31.22
w/ loss-scale	85.79	59.43	2.56	52.19	31.43

Table 7. LoRA (out-of-domain) ablation tests for loss-scale

The experimental results indicate that introducing the loss-scale significantly improved all evaluation metrics.

The experimental results for Qwen2-7b-instruct are shown in the table 8 and 9.

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	74.11	54.74	4.16	46.53	8.51
GPT4	80.28	55.52	5.98	48.74	28.69
LoRA(Ours)	77.05	56.97	0.9	49.53	19.81
Full(Ours)	83.37	60.01	2.58	54.41	26.34

Table 8. Qwen2-7b-instruct ToolBench (in-domain) results

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	73.17	57.67	3.84	48.58	11.23
GPT4	77.80	55.26	5.12	47.45	30.61
LoRA(Ours)	78.05	58.91	1.53	51.28	26.04
Full(Ours)	82.57	60.14	1.79	55.25	31.34

Table 9. Qwen2-7b-instruct ToolBench (out-of-domain) results

It can be observed that, compared to the official Qwen2 model, the average metrics after training improved by 8.25%, and the model hallucinations were reduced to single digits. Moreover, most metrics surpassed those of GPT-4.

The experimental results for LLaMA3-8b-instruct are shown in the table 10 and 11.

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	74.22	36.17	15.68	20.0	12.14
LoRA(Ours)	84.58	44.73	15.11	38.90	22.22

Table 10. Llama3-8b-instruct ToolBench (in-domain) results

Model	Plan.EM	Act.EM	Hallu Rate	Avg.F1	R-L
Original	69.47	34.21	14.72	20.25	14.07
LoRA(Ours)	84.3	49.56	13.19	43.09	24.85

Table 11. Llama3-8b-instruct ToolBench (out-of-domain) results

Based on LoRA training, the average metrics of LLaMA3 improved by 17%. This demonstrates that open-source models and datasets are meaningful for Agent training in practical vertical scenarios. We have summarized the hyper-parameters and other experiences from the training process to facilitate replication and application by other developers. The mentioned dataset ¹⁸¹⁸18https://www.modelscope.cn/datasets/iic/MSAgent-Pro and models ¹⁹¹⁹19WebPage: https://modelscope.cn/models/swift/qwen2-7b-agent-instruct ²⁰²⁰20WebPage: https://modelscope.cn/models/swift/llama3-8b-agent-instruct-v2 can all be found on ModelScope.

5. Conclusion

In this paper, we described SWIFT, a lightweight, one-stop large model training framework from ModelScope. We hope that this framework can eliminate the mismatches between different models, datasets, and SOTA technologies, providing developers with a standardized solution that can solve the entire problem in a closed-loop manner. SWIFT supports over 300 LLMs and 50 MLLMs, and provides an easy-to-use WEB-UI based on the command line. Developers can perform various command-line operations on the WEB-UI, greatly reducing the cost of use. However, due to limited development time and other factors, SWIFT still has more features planned, such as:

1. Better support for Megatron large-scale parallel training. Currently, SWIFT’s support for Megatron models does not fully cover mainstream LLMs and MLLMs. We hope that SWIFT can provide greater pre-training convenience for foundational model developers.

2. More in-depth multi-modal research. While SWIFT already supports training for most mainstream multi-modal models, we still lack more in-depth work on multi-modal datasets and models, such as providing high-quality datasets to prevent knowledge forgetting or training new multi-modal models using ModelScope’s self-developed datasets. Additionally, we hope to conduct more in-depth research on multi-modal Agents, multi-modal CoT, and multi-modal alignment training.

3. Support for RAG systems. We hope that SWIFT’s training technology can be more SOTA and robust, making it easier to connect to various AI systems, such as enhancement training for RAG system models, helping RAG systems improve recall rates and answer accuracy.

References

[1] Hicham Badri and Appu Shaji. Half-quadratic quantization of large machine learning models, November 2023.
[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
[4] Kerim Büyükakyüz. Olora: Orthonormal low-rank adaptation of large language models, 2024.
[5] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models, 2024.
[6] Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models, 2024.
[7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[8] LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023.
[9] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
[10] XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023.
[11] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[12] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, and Tong Zhang. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420, 2023.
[15] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, 2024.
[16] Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization, 2024.
[17] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024.
[18] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023.
[19] Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li. Parameter-efficient fine-tuning with discrete fourier transform, 2024.
[20] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
[21] Ofir Gordon, Hai Victor Habi, and Arnon Netzer. Eptq: Enhanced post-training quantization via label-free hessian, 2023.
[22] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2024.
[23] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models, 2024.
[24] Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model, 2024.
[25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[26] Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Ao Ma, Yiliang Lv, Yujun Shen, Deli Zhao, and Jingren Zhou. Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone, 2023.
[27] Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, and Jingfeng Zhang. Scedit: Efficient and controllable image diffusion generation via skip connection editing, 2023.
[28] Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora, 2023.
[29] Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. Vera: Vector-based random matrix adaptation, 2024.
[30] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[31] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
[32] Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, Hongzhu Shi, Ji Zhang, Fei Huang, and Jingren Zhou. Modelscope-agent: Building your customizable agent system with open-source large language models, 2023.
[33] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021.
[34] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024.
[35] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, et al. M6: A chinese multimodal pretrainer. arXiv preprint arXiv:2103.00823, 2021.
[36] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.
[37] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation, 2024.
[38] Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization, 2024.
[39] Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. CoRR, abs/2110.07602, 2021.
[40] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. arXiv:2103.10385, 2021.
[41] Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models, 2024.
[42] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward, 2024.
[43] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning, 2024.
[44] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023.
[45] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023.
[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[47] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[48] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024.
[49] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[51] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23318–23340. PMLR, 17–23 Jul 2022.
[52] Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. Llama pro: Progressive llama with block expansion, 2024.
[53] Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation, 2024.
[54] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023.
[55] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning, 2023.
[56] Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang. Q-galore: Quantized galore with int4 projection and layer-adaptive low-rank gradients, 2024.
[57] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection, 2024.
[58] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[59] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024.

Appendix A Supported Models and Datasets

Supported Models	Modal	Structure
LLaMA Series	NLP	Decoder-only
Mistral/Mixtral Series	NLP	Decoder-only
Gemma Series	NLP	Decoder-only
Phi Series	NLP	Decoder-only
Qwen1/1.5/2 Series	NLP	Decoder-only
YI Series	NLP	Decoder-only
ChatGLM1/2/3 Series	NLP	Decoder-only
DeepSeek1/2 Series	NLP	Decoder-only
InternLM1/2 Series	NLP	Decoder-only
Mamba	NLP	SSM
PaliGemma Series	Visual	Decoder-only
Qwen-VL Series	Visual	Decoder-only
GLM4v	Visual	Decoder-only
DeepSeek-VL Series	Visual	Decoder-only
LLaVA Series	Visual	Decoder-only
InternVL1/2 Series	Visual	Decoder-only
Phi3-vision	Visual	Decoder-only
Yi-VL Series	Visual	Decoder-only
MiniCPM Series	Visual	Decoder-only
lorence Series	Visual	Encoder-Decoder
Qwen-Audio Series	Audio	Decoder-only

Table 12. Part of models SWIFT supported

Supported Datasets	Modal	Task	Language
alpaca-en	NLP	QA	English
synthetic-text-to-sql	NLP	Text to Sql	English
firefly-train-1.1M	NLP	QA	Chinese
deepctrl-sft	NLP	QA	Chinese
ruozhiba	NLP	QA	Chinese
ms-agent	NLP	Agent	Chinese
ms-agent-pro	NLP	Agent	English
chinese-c4	NLP	Pretrain	Chinese
fineweb	NLP	Pretrain	Chinese
okvqa/a-okvqa	Vision	VQA	English
chart-qa	Vision	VQA	English
ocr-vqa	Vision	OCR	English
llava-pretrain	Vision	VQA	English
llava-instruct-150k	Vision	VQA	English
mantis-instruct	Vision	VQA	English
grit	Vision	VQA	English
science-qa	Vision	VQA	English
refcoco/refcocog	Vision	Grounding	English
rlaif-v	Vision	RLHF	English
aishell1-zh-mini	Audio	Audio QA	English

Table 13. Part of datasets SWIFT supported

Appendix B Loss-scale settings

Pattern	Value
The response of tool selection query	3.0
The response of param recalling query	3.0
The response of param name query	3.0
The response of param value query	3.0
The content of ’Name:’	3.0
The content of ’Action:’	3.0
The content of ’Action Input:’	3.0
The content of ’Tool:’	3.0
The content of ’Command’	3.0
The content of ’Arguments:’	3.0
’Observation:’	2.0

Table 14. The weight of content in Agent training for loss-scale tesing

Appendix C SWIFT commands

Listing 1: The training and inference code of tuners

⬇

1# Prepare a tuner

2model = Swift.prepare_model(model, {’lora’: LoRAConfig(),

3 ’llamapro’: LLaMAProConfig()})

4# Load checkpoint

5model = Swift.from_pretrained(model, ’some-training-ckpt-dir’)

7# Simple code for training

8model = Model.from_pretrained(’qwen/Qwen2-7B-Instruct’…)

9model = Swift.prepare_model(model,

10 {’first_tuner’: LoraConfig(…),

11 ’second_tuner’: LLaMAProConfig(…))

12train_data = MsDataset.load(’<dataset-id>’, split=’train’)

13eval_data = MsDataset.load(’<dataset-id>’, split=’eval’)

15trainer = Seq2SeqTrainer(

16 model=model,

17 args=Seq2SeqTrainingArguments(learning_rate=1e-4…),

18 train_dataset=train_data, eval_dataset=eval_data)

20trainer.train()

Listing 2: The standard prompts of SWIFT

⬇

1# QA

2{”query”: ”Calculate␣22+45”, ”response”: ”The␣answer␣is␣67.”}

3# QA with history and tools

4{”system”: ”You␣are␣a␣good␣math␣teacher.”,

5”query”: ”Calculate␣22+45”,

6”response”: ”The␣answer␣is␣67.”,

7”history”: [[”Can␣you␣calculate␣math?”,

8”Yes,␣I␣can␣do␣math␣calculation.”]],

9”tools”: [{”type”: ”function”, ”function”:

10{”name”: ”get_current_weather”, …]}, …}

11# RLHF

12{”query”: ”Calculate␣22+45”, ”response”: ”The␣answer␣is␣67.”,

13”rejected_response”: ”I␣cannot␣calculate␣math.”}

14# VQA

15{”query”: ”<image>What␣is␣in␣the␣image?”,

16”response”: ”The␣image␣shows␣a␣little␣girl␣walking.”,

17”images”: [”/coco2017/train/10045.jpg”]}

18# Multi-Modal RLHF

19{”query”: ”<image>What␣is␣in␣the␣image?”,

20”response”: ”The␣image␣shows␣a␣little␣girl␣walking.”,

21”rejected_response”: ”I␣cannot␣see␣any␣image.”,

22”images”: [”/coco2017/train/10045.jpg”]}

23# Grounding

24{”query”: ”<image>Where␣is␣<ref-object>?”,

25”response”: ”The␣position␣is␣<bbox>”,

26”images”: [”/coco2017/train/10045.jpg”],

27”objects”: ”[{\”caption\”:␣\”guy␣in␣red\”,

28\”bbox\”:␣[138,␣136,␣235,␣359],

29\”bbox_type\”:␣\”real\”,␣\”image\”:␣0}}”

Listing 3: The SWIFT command lines

⬇

1# Multi-GPU sft command

2CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \

3NPROC_PER_NODE=8 \

4swift sft \

5 –model_type qwen1half-32b-chat \

6 –dataset blossom-math-zh \

7 –deepspeed default-zero3

8# Single GPU RLHF command

9swift rlhf

10 –model_type llama3-8b-instruct \

11 –rlhf_type dpo \

12 –dataset hh-rlhf

13# Inference a multi-modal model

14swift infer

15 –model_type internvl2-8b \

16 –infer_backend lmdeploy

17# Deploy a checkpoint by vllm

18swift deploy

19 –ckpt_dir /mnt/my-custom/ckpt-1100 \

20 –infer_backend vllm

21# Evaluate an nlp model

22swift eval

23 –model_type llama3-8b-instruct \

24 –eval_dataset ceval gsm8k

25# Evaluate a multi-modal model

26swift eval

27 –model_type internvl2-8b \

28 –eval_dataset COCO_VAL

29# Evaluate an OpenAI url

30swift eval

31 –eval_url https://127.0.0.1/8000 \

32 –eval_dataset mmlu

33# Evaluate use a custom dataset

34swift eval

35 –model_type llama3-8b-instruct \

36 –custom_eval_config /mnt/my-dataset.json

37# Merge LoRA

38swift export –ckpt_dir /mnt/my-custom/ckpt-1100 –merge_lora true

39# Quantize

40swift export –ckpt_dir /mnt/my-custom/ckpt-1100 –quant_method awq

41# Export Ollama

42swift export –ckpt_dir /mnt/my-custom/ckpt-1100 –to_ollama true

43# To Megatron

44swift export –model_type qwen2-7b-instruct –to_megatron true