For specialized cases such as MoE (Mixture of Experts) or quantized models, searching for an optimal execution policy lets Friendli Engine boost inference performance by 1.5x to 2x, improving throughput and reducing latency.
| Options | Type | Summary | Default |
|---|---|---|---|
| `--algo-policy-dir` | TEXT | Path to the directory where the searched optimal policy file is saved. | current working directory |
| `--search-policy` | BOOLEAN | Runs a policy search to find the best Friendli execution policy for the given configuration (model type, GPU, NVIDIA driver version, quantization scheme, etc.). | false |
| `--terminate-after-search` | BOOLEAN | Terminates the engine container after the policy search completes. | false |
As examples, consider a quantized model, FriendliAI/Llama-3.1-8B-Instruct-fp8, and an MoE model, mistralai/Mixtral-8x7B-Instruct-v0.1, served with TP=4. In both cases the searched policy file is saved to $POLICY_DIR.
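The commands below are a minimal sketch of such a policy-search run for the FP8 model (a TP=4 sketch for the Mixtral model appears at the end of this section). Only `--algo-policy-dir` and `--search-policy` come from the options table above; the container image name, the `FRIENDLI_CONTAINER_SECRET` variable, the `--hf-model-name` option, and the port and mount paths are assumptions about a typical Friendli Container launch and may differ in your setup.

```sh
# Sketch: policy search for an FP8-quantized model.
# Image name, secret variable, port, and mount paths are assumptions; adjust to your setup.
export POLICY_DIR=$PWD/policy   # host directory where the searched policy file is stored
mkdir -p "$POLICY_DIR"

docker run --gpus '"device=0"' -p 8000:8000 \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```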
If a policy file already exists in that directory, the engine searches only the parts of the policy space that are still needed and updates the file accordingly. After the policy search, the engine starts serving the endpoint using the policy file. To stop the container right after the search instead of serving, pass the --terminate-after-search true option.
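A search-only run might look like the sketch below (same assumptions as above about the image name, secret variable, and mount paths):

```sh
# Sketch: run the policy search only, then exit without serving the endpoint.
docker run --gpus '"device=0"' \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true \
  --terminate-after-search true
```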
Note that the searched policy is tied to the configuration it was generated for: the model (e.g., FriendliAI/Llama-3.1-8B-Instruct-fp8, or mistralai/Mixtral-8x7B-Instruct-v0.1 with TP=4), the GPU, the NVIDIA driver version, the quantization scheme, and the parallelism degree (set with the --num-devices and --num-workers options). Run the policy search again whenever any of these change.
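As a sketch, a policy search for the Mixtral example with tensor parallelism across four GPUs could look like the following; again, everything except --num-devices, --algo-policy-dir, and --search-policy is an assumption about the launch command:

```sh
# Sketch: MoE model with TP=4; the resulting policy is tied to this parallelism degree,
# so changing the GPU count or model requires running the search again.
docker run --gpus '"device=0,1,2,3"' -p 8000:8000 \
  -v "$POLICY_DIR:/policy" \
  -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --num-devices 4 \
  --algo-policy-dir /policy \
  --search-policy true
```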