For specialized cases such as MoE (Mixture of Experts) or quantized models, searching for an optimal execution policy lets Friendli Engine boost inference performance by 1.5x to 2x, improving throughput and reducing latency.
| Options | Type | Summary | Default |
|---|---|---|---|
| `--algo-policy-dir` | TEXT | Path to the directory where the searched optimal policy file is saved. | current working directory |
| `--search-policy` | BOOLEAN | Runs a policy search to find the best Friendli execution policy for the given configuration (model type, GPU, NVIDIA driver version, quantization scheme, etc.). | false |
| `--terminate-after-search` | BOOLEAN | Terminates the engine container after the policy search completes. | false |
As examples, consider a quantized model, FriendliAI/Llama-3.1-8B-Instruct-fp8, and an MoE model, mistralai/Mixtral-8x7B-Instruct-v0.1, served with TP=4. In both cases the searched policy file is saved to $POLICY_DIR.
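The commands below are a minimal sketch of such a policy-search run for the FP8 model (a TP=4 sketch for the Mixtral model appears at the end of this section). Only `--algo-policy-dir` and `--search-policy` come from the options table above; the container image name, the `FRIENDLI_CONTAINER_SECRET` variable, the `--hf-model-name` option, and the port and mount paths are assumptions about a typical Friendli Container launch and may differ in your setup.

```sh
# Sketch: policy search for an FP8-quantized model.
# Image name, secret variable, port, and mount paths are assumptions; adjust to your setup.
export POLICY_DIR=$PWD/policy   # host directory where the searched policy file is stored
mkdir -p "$POLICY_DIR"

docker run --gpus '"device=0"' -p 8000:8000 \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```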
If a policy file already exists in that directory, the engine searches only the parts of the policy space that are still needed and updates the file accordingly. After the policy search, the engine starts serving the endpoint using the policy file. To stop the container right after the search instead of serving, pass the --terminate-after-search true option.
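A search-only run might look like the sketch below (same assumptions as above about the image name, secret variable, and mount paths):

```sh
# Sketch: run the policy search only, then exit without serving the endpoint.
docker run --gpus '"device=0"' \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true \
  --terminate-after-search true
```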
Note that the searched policy is tied to the configuration it was generated for: the model (e.g., FriendliAI/Llama-3.1-8B-Instruct-fp8, or mistralai/Mixtral-8x7B-Instruct-v0.1 with TP=4), the GPU, the NVIDIA driver version, the quantization scheme, and the parallelism degree (set with the --num-devices and --num-workers options). Run the policy search again whenever any of these change.
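As a sketch, a policy search for the Mixtral example with tensor parallelism across four GPUs could look like the following; again, everything except --num-devices, --algo-policy-dir, and --search-policy is an assumption about the launch command:

```sh
# Sketch: MoE model with TP=4; the resulting policy is tied to this parallelism degree,
# so changing the GPU count or model requires running the search again.
docker run --gpus '"device=0,1,2,3"' -p 8000:8000 \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --num-devices 4 \
  --algo-policy-dir /policy \
  --search-policy true
```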