Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container.
Before you begin, prepare your Friendli personal access token (`FRIENDLI_TOKEN`) and your Friendli Container secret (`FRIENDLI_CONTAINER_SECRET`). Then:

1. Log in to the Friendli registry.
2. Pull the Friendli Container image.
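A minimal login-and-pull sketch is shown below. The registry address, repository name, and image tag are assumptions for illustration (as is using the token as the registry password); substitute the registry, image, and credentials provided to you.

```sh
# Log in to the Friendli container registry using your personal access token.
# NOTE: registry.friendli.ai, the "trial" repository, and the "latest" tag are
# illustrative assumptions; use the values provided for your account.
export FRIENDLI_TOKEN="YOUR_FRIENDLI_TOKEN"
echo "$FRIENDLI_TOKEN" | docker login registry.friendli.ai -u "YOUR_EMAIL" --password-stdin

# Pull the Friendli Container image.
docker pull registry.friendli.ai/trial:latest
```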
If your model is in the safetensors format, which is compatible with Hugging Face transformers, you can serve the model directly with Friendli Container. Friendli Container supports direct loading of safetensors checkpoints for many model types. You can find the complete list of supported models on the Supported Models page.
If your model is not in the supported model list, please contact us.
Here are the instructions to run Friendli Container to serve a Hugging Face model:
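The docker run command is sketched below. The container image stored in `FRIENDLI_CONTAINER_IMAGE` and the cache mount path inside the container are assumptions for illustration; use the image you pulled, and `meta-llama/Llama-3.1-8B-Instruct` is only an example model.

```sh
# Sketch: serve a Hugging Face model on a single GPU.
# FRIENDLI_CONTAINER_IMAGE is an illustrative assumption; use the image you pulled.
export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial:latest"
export GPU_ENUMERATION='"device=0"'

docker run --gpus $GPU_ENUMERATION -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name meta-llama/Llama-3.1-8B-Instruct \
  [LAUNCH_OPTIONS]
```

The `-p 8000:8000` mapping matches the default `--web-server-port`, and mounting `~/.cache/huggingface` lets the container reuse models already in your local Hugging Face cache (the in-container path is an assumption).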
`[LAUNCH_OPTIONS]` should be replaced with the launch options for Friendli Container, described in the tables below.
By running the above command, you will have a running Docker container that exposes an HTTP endpoint for handling inference requests.
- To run the container on multiple GPUs with tensor parallelism, list the GPUs to use in `$GPU_ENUMERATION` (e.g., `'"device=0,1,2,3"'`) and use the `--num-devices` (or `-d`) option to specify the tensor parallelism degree (e.g., `--num-devices 4`); see the sketch after this list.
- To run the container with pipeline parallelism, list the GPUs to use in `$GPU_ENUMERATION` (e.g., `'"device=0,1,2,3"'`) and use the `--num-workers` (or `-n`) option to specify the pipeline parallelism degree (e.g., `--num-workers 4`).
- Since access to `meta-llama/Llama-3.1-8B-Instruct` is allowed only for authorized users, you need to provide your Hugging Face User Access Token through the `HF_TOKEN` environment variable. It works the same for all private repositories.
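For example, a four-GPU tensor-parallel launch of this gated model could look like the following sketch (again, the image referenced by `FRIENDLI_CONTAINER_IMAGE` is an assumption):

```sh
# Sketch: 4-GPU tensor parallelism for a gated model; HF_TOKEN authorizes the download.
export GPU_ENUMERATION='"device=0,1,2,3"'
export HF_TOKEN="YOUR_HUGGING_FACE_USER_ACCESS_TOKEN"

docker run --gpus $GPU_ENUMERATION -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -e HF_TOKEN=$HF_TOKEN \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name meta-llama/Llama-3.1-8B-Instruct \
  --num-devices 4
```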
The following tables describe the launch options for Friendli Container.

Options | Type | Summary | Default | Required |
---|---|---|---|---|
--version | - | Print Friendli Container version. | - | ❌ |
--help | - | Print Friendli Container help message. | - | ❌ |

Options | Type | Summary | Default | Required |
---|---|---|---|---|
--web-server-port | INT | Web server port. | 8000 | ❌ |
--metrics-port | INT | Prometheus metrics export port. | 8281 | ❌ |
--hf-model-name | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks if the model is already cached at ~/.cache/huggingface/hub and uses it if available. If not, it will download the model from the Hugging Face Models Hub before creating the inference endpoint. When a local path is provided, it will load the model from the location without downloading. This option is only available for models in a safetensors format. | - | ❌ |
--tokenizer-file-path | TEXT | Absolute path of the tokenizer file. This option is not needed when tokenizer.json is located under the path specified by --ckpt-path. | - | ❌ |
--tokenizer-add-special-tokens | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer’s add_special_tokens argument. The default value is false for versions < v1.6.0. | true | ❌ |
--tokenizer-skip-special-tokens | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer’s skip_special_tokens argument. | true | ❌ |
--dtype | CHOICE: [bf16, fp16, fp32] | Data type of weights and activations. Choose one of <fp16|bf16|fp32>. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of torch_dtype in config.json file or assumes fp16. | fp16 | ❌ |
--bad-stop-file-path | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | ❌ |
--num-request-threads | INT | Thread pool size for handling HTTP requests. | 4 | ❌ |
--timeout-microseconds | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | ❌ |
--ignore-nan-error | BOOLEAN | If set to true, NaN errors are ignored. Otherwise, the server responds with a 400 status code when NaN values are detected while processing a request. | - | ❌ |
--max-batch-size | INT | Max number of sequences that can be processed in a batch. | 384 | ❌ |
--num-devices , -d | INT | Number of devices to use for tensor parallelism (i.e., the tensor parallelism degree). | 1 | ❌ |
--num-workers , -n | INT | Number of workers to use in a pipeline (i.e., pipeline parallelism degree). | 1 | ❌ |
--search-policy | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at Optimizing Inference with Policy Search. | false | ❌ |
--terminate-after-search | BOOLEAN | Terminates engine container after the policy search. | false | ❌ |
--algo-policy-dir | TEXT | Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at Optimizing Inference with Policy Search. | current working dir | ❌ |
--adapter-model | TEXT | Adds an adapter model, given as <adapter_name>:<adapter_ckpt_path>. The path can be a model name from the Hugging Face Models Hub. | - | ❌ |
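As a sketch of how these options are combined, launch options are appended after the image name in the docker run command. The adapter name and checkpoint path below are hypothetical placeholders, and the image is the same illustrative assumption as above.

```sh
# Sketch: a launch combining a few optional launch options.
# "my-adapter" and "username/my-lora-adapter" are hypothetical placeholders.
docker run --gpus $GPU_ENUMERATION -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name meta-llama/Llama-3.1-8B-Instruct \
  --dtype bf16 \
  --max-batch-size 256 \
  --adapter-model my-adapter:username/my-lora-adapter
```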
Options | Type | Summary | Default | Required |
---|---|---|---|---|
--max-input-length | INT | Maximum input length. | - | ✅ |
--max-output-length | INT | Maximum output length. | - | ✅ |
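Putting it together, a launch that sets the two required options might look like the following sketch (the length values are illustrative, and the image is the same assumption as above):

```sh
# Sketch: setting the two required launch options with illustrative length limits.
docker run --gpus $GPU_ENUMERATION -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name meta-llama/Llama-3.1-8B-Instruct \
  --max-input-length 4096 \
  --max-output-length 2048
```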