With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments without extensive infrastructure expertise or significant capital expenditure. This tutorial guides you through launching and deploying LLMs using Friendli Dedicated Endpoints. Through step-by-step instructions and hands-on examples, you'll learn how to:
Select and deploy pre-trained LLMs from Hugging Face repositories
Deploy and manage your models using the Friendli Engine
Monitor and optimize your inference deployments
By the end of this tutorial, you’ll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let’s get started and explore the possibilities of Friendli Dedicated Endpoints!
Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard.
If not done already, start the free trial for Dedicated Endpoints.
Create a new project, then click on the ‘New Endpoint’ button.
Fill in the basic information:
Endpoint name: Choose a unique name for your endpoint (e.g., “My New Endpoint”).
Select the model:
Model Repository: Select “Hugging Face” as the model provider.
Model ID: Enter "meta-llama/Meta-Llama-3-8B-Instruct" as the model ID. As the search results load, click the top result that exactly matches the repository ID.
By default, the endpoint pulls the latest commit on the model's default branch. You may manually select a specific branch, tag, or commit instead. If you're using your own model, see Format Requirements. An optional sanity check on the repository ID is sketched below.
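Before creating the endpoint, you can optionally confirm that the repository ID resolves on Hugging Face and whether the model is gated. The snippet below is a minimal sketch using the huggingface_hub library; the token placeholder and printed fields are illustrative assumptions, not part of the Friendli workflow.

```python
# Optional sanity check: confirm the Hugging Face repository exists and note whether it is gated.
# Assumes `huggingface_hub` is installed and, for gated models such as Meta Llama 3,
# a Hugging Face access token with permission to view the repository.
from huggingface_hub import model_info

repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
info = model_info(repo_id, token="hf_...")  # replace with your Hugging Face token

print("Repository:", info.id)
print("Latest commit on default branch:", info.sha)
print("Gated:", info.gated)  # gated repos require accepting the license before use
```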
Select the instance:
Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models.
For larger models, some instance options may be unavailable because the model is guaranteed not to fit in the instance's VRAM.
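To see why a single A100 80G comfortably fits an 8B-parameter model, a rough back-of-envelope estimate helps: weights in 16-bit precision take about 2 bytes per parameter, with additional headroom needed for the KV cache and activations. The snippet below is a simple illustrative calculation, not Friendli's sizing logic; the overhead factor is an assumed ballpark.

```python
# Back-of-envelope VRAM estimate for serving an LLM in 16-bit precision.
# The 1.2x overhead factor for KV cache and activations is an assumed
# ballpark figure, not a Friendli sizing rule.
def estimate_vram_gb(num_params_billion: float, bytes_per_param: int = 2, overhead: float = 1.2) -> float:
    weight_gb = num_params_billion * bytes_per_param
    return weight_gb * overhead

print(f"Llama 3 8B (fp16): ~{estimate_vram_gb(8):.0f} GB")    # ~19 GB, fits on 1x A100 80G
print(f"Llama 3 70B (fp16): ~{estimate_vram_gb(70):.0f} GB")  # ~168 GB, exceeds a single A100 80G
```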
Edit the configurations:
Autoscaling: By default, autoscaling ranges from 0 to 2 replicas. This means the deployment scales down to zero (sleeps) when it is not receiving traffic, which reduces cost.
Advanced configuration: Some LLM options, including the batch size and token configurations, can be adjusted. For this tutorial, we'll leave them as-is.
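Once the endpoint is up, you can send requests to it. The sketch below assumes the endpoint exposes an OpenAI-compatible chat completions API and uses the openai Python client; the base URL, the use of the endpoint ID as the model name, and the FRIENDLI_TOKEN variable are assumptions for illustration — check your endpoint's overview page in Friendli Suite for the exact values.

```python
# Minimal sketch of querying a deployed endpoint, assuming an OpenAI-compatible API.
# FRIENDLI_TOKEN and the endpoint ID are placeholders; copy the real values from
# your endpoint's overview page in Friendli Suite.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",  # assumed base URL; verify in the docs
    api_key=os.environ["FRIENDLI_TOKEN"],
)

response = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",  # dedicated endpoints are addressed by endpoint ID
    messages=[{"role": "user", "content": "Summarize what Friendli Dedicated Endpoints does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Note that with autoscaling set to a minimum of 0 replicas, the first request after an idle period may take longer while the deployment wakes up.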