The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows multiple LLMs, each optimized for a specific task, to be served simultaneously without the need for extensive retraining.
If you do not have an `adapter_model.safetensors` checkpoint file, you have to manually convert `adapter_model.bin` into `adapter_model.safetensors`. You can use the official app or the Python script for conversion.
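If you prefer to run the conversion yourself, a minimal sketch for a typical PyTorch adapter checkpoint looks like the following (the file paths are placeholders; adjust them to your checkpoint):

```python
import torch
from safetensors.torch import save_file

# Load the PyTorch adapter checkpoint (weights only, on CPU).
state_dict = torch.load("adapter_model.bin", map_location="cpu")

# safetensors requires contiguous tensors, so normalize before saving.
state_dict = {name: tensor.contiguous() for name, tensor in state_dict.items()}

# Write the converted checkpoint next to the original adapter_config.json.
save_file(state_dict, "adapter_model.safetensors")
```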
To serve adapter models, use the `--adapter-model` argument.
- `--adapter-model`: Add an adapter model, specified as an adapter name and a path. The path can also be a Hugging Face hub model name. This option can be provided as part of `[LAUNCH_OPTIONS]`, described at Running Friendli Container: Launch Options.
To serve multiple adapter models, provide `--adapter-model` with a comma-separated string (e.g. `--adapter-model "adapter_name_0:/adapter/model1,adapter_name_1:/adapter/model2"`).
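If you manage many adapters, you can assemble this value programmatically. The snippet below only illustrates the expected `name:path` format; the adapter names and paths are placeholders:

```python
# Map adapter names to local paths or Hugging Face hub repo names (placeholders).
adapters = {
    "adapter_name_0": "/adapter/model1",
    "adapter_name_1": "FinGPT/fingpt-forecaster_dow30_llama2-7b_lora",
}

# --adapter-model expects "name:path" pairs joined by commas.
adapter_model_arg = ",".join(f"{name}:{path}" for name, path in adapters.items())
print(adapter_model_arg)
# adapter_name_0:/adapter/model1,adapter_name_1:FinGPT/fingpt-forecaster_dow30_llama2-7b_lora
```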
If a `tokenizer_config.json` file is present in an adapter checkpoint path, the engine uses the chat template defined in that `tokenizer_config.json` instead of the base model's.
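To check whether an adapter checkpoint ships its own chat template, one way is to look for a `chat_template` entry in its `tokenizer_config.json`. A minimal sketch using `huggingface_hub` (the repo name is only an example):

```python
import json
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

try:
    # Fetch tokenizer_config.json from the adapter repo, if present.
    path = hf_hub_download(
        repo_id="FinGPT/fingpt-forecaster_dow30_llama2-7b_lora",  # example adapter
        filename="tokenizer_config.json",
    )
except EntryNotFoundError:
    print("No tokenizer_config.json in the adapter repo; the base model's chat template is used.")
else:
    with open(path) as f:
        chat_template = json.load(f).get("chat_template")
    print(chat_template or "tokenizer_config.json is present but defines no chat_template.")
```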
For example, you can serve `meta-llama/Llama-2-7b-chat-hf` together with the `FinGPT/fingpt-forecaster_dow30_llama2-7b_lora` adapter model.
To send an inference request to an adapter model, specify `model` in the body of the request. For example, assuming you set the launch option `--adapter-model` to `"<adapter-model-name>:<adapter-file-path>"`, you can send a request to the adapter model as follows.
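A sketch of such a request, assuming the engine exposes an OpenAI-compatible completions endpoint on `localhost:8000` (adjust the URL, port, and prompt to your deployment):

```python
import requests

# "model" selects the adapter; the name must match <adapter-model-name> from --adapter-model.
response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed endpoint; adjust to your deployment
    json={
        "model": "adapter_name_0",  # adapter name given in the launch option
        "prompt": "What is the outlook for AAPL next week?",  # example prompt
        "max_tokens": 64,
    },
)
print(response.json())
```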
If you do not specify the `model` field in your request, the base model is used to generate the response. You can send a request to the base model as shown below.
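Under the same assumptions as above, omitting the `model` field routes the request to the base model:

```python
import requests

# No "model" field: the request is served by the base model.
response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed endpoint; adjust to your deployment
    json={
        "prompt": "Summarize the latest earnings call in one sentence.",  # example prompt
        "max_tokens": 64,
    },
)
print(response.json())
```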
Note the following constraints when serving adapter models:

- Only adapters trained with `peft` are supported.
- The base model checkpoint and the adapter model checkpoints should have the same datatype.
- When serving multiple adapters simultaneously, each adapter model should have the same target modules. In Hugging Face, the target modules are listed in `adapter_config.json` (see the sketch below for a quick check).
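As a quick sanity check before launching, you can compare the `target_modules` of the adapters you plan to serve together. A sketch, assuming each adapter directory contains a `peft`-style `adapter_config.json` (the directory paths are placeholders):

```python
import json
from pathlib import Path

# Local adapter checkpoint directories to be served together (placeholders).
adapter_dirs = ["/adapter/model1", "/adapter/model2"]

# Each peft adapter directory carries an adapter_config.json listing "target_modules".
target_modules = {
    d: sorted(json.loads((Path(d) / "adapter_config.json").read_text())["target_modules"])
    for d in adapter_dirs
}

for d, modules in target_modules.items():
    print(f"{d}: {modules}")

# All adapters must target the same modules to be served simultaneously.
assert len({tuple(m) for m in target_modules.values()}) == 1, "target modules differ"
```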