Deploying Llama-3-1 8B on AWS

Tools

AI Gallery

Blog

/ML Studio

Tools

AI Gallery

Blog

/ML Studio

Tools

AI Gallery

Blog

/ML Studio

Build full stack Apps from prompts

Deploying Llama-3-1 8B on AWS

Faizan Khan

@faizan10114

Published on Jun 10, 2024

Deploying Llama-3-1 8B on AWS

Faizan Khan

@faizan10114

Published on Jun 10, 2024

In this article we will show you how to host a LLaMa 3 model on AWS. LLaMA 3 is a line of open-source models released by Meta. The latest version of which is LLaMA 3. Meta has plans to incorporate LLaMA 3 into most of its social media applications. Meta has released two versions of LLaMa 3, one with 8B parameters, and one with 70B parameters. [1] [2]

The 70B version of LLaMA 3 has been trained on a custom-built 24k GPU cluster on over 15T tokens of data, which is roughly 7x larger than that used for LLaMA 2. [2] [3]

Where to download it from

LLaMA 3 can be downloaded for free from Meta’s website and pulled in from hugging-face. It is offered in two variants: pre-trained, which is a basic model for next token prediction, and instruction-tuned, which is fine-tuned to adhere to user commands. It can be downloaded for free from Meta's website in two different parameter sizes: 8 billion (8B) and 70 billion (70B). Users can sign up to access these versions. [2]

You can also get LLaMA-3 from hugging-face, which is what we are going to do.

How good is it

According to famous leaderboards like LMSYS and hugging-face, the LLaMA-3 8B outperforms GPT-3.5-turbo and the LLaMA-3 80B outperforms GPT-4 base model as of today. Similarly among the open-source models the LLaMA3 8B outperforms most famous open-source models like Google's Gemma 7B and Mistral 7B Instruct.

The instance size required for the model

In order to deploy any model, we first need to determine the compute requirements. As a basic rule of thumb, the size of the parameter is the space it needs on the disk. However, in order to load it in memory, there is usually an overhead, so roughly a 1.5x-4x the size it takes on the disk. [4]

Since LLaMA3-8B has 8 billion params. It roughly requires 16GB storage, and around 24GB RAM. Now the RAM could be GPU or a CPU, with the GPU resulting in faster inference. We are going to use the GPU. Similarly, LLaMA3-70B requires around 140GB of storage and roughly 160GB of RAM.

Which instance to choose on AWS

Ok so we need roughly 24GB of RAM and 16 GB of disk space for LLaMA-3-8B. We want to go for instances that are optimized for compute with a single GPU, preferably one with the latest version. For the purpose of this tutorial, we will go with the g5.xlarge. Its has Nvidia A10 GPU, which gives better performance than the T-4 instances, and cheaper than the ones with the A100 GPUs. [5]

Cost and Pricing

All EC2 instances have on-demand pricing, unless they are reserved. This means that you are charged for the amount of time it's running. You will not be charged if you stop the instance. The price for a g5.xlarge in US-east-1 is roughly 1$/hr. This amounts to a total of 732$ per month, if its always on. You will save money if the server spins down after inactivity. You can attach certain triggers to handle that but it is beyond the scope of this article. [6] . So the total cost of running llama-3.1 8B is 732$ per month

Launching the Instance

Go to the launch instance portal in the relevant zone. In the application and OS images section type ami-041855, click enter. This is the id of the base image we will use to spin up our instance.

This will take you to the AMI search portal. Go to the Community AMIs tab , and select the Nvidia Deep Learning Base AMI.

Once you select the image, the view should look like the following.

In the instance type, panel, select g5.xlarge from the dropdown

Make sure you attach a key-pair to the instance.

Leave the default network settings, i.e. only allow ssh Traffic for now. We will modify this in a bit.

In the configure storage, attach an external volume. Attach an extra storage of 128GB, this is mostly if you want to persist some data between reboots. The default storage on the instance is ephemeral, which means that data can get lost if the instance is restarted.

The summary of the instance should look like the following, click on launch instance

Once the instance is running, modify the security group to allow incoming requests at port 8000. This is because our inference server will be running on port 8000.

Preparing the instance

We are going to use vLLM to serve the language model on our server [6]. vLLM is a python package designed for fast inference of open-source language models. It optimizes the inference by using multiple methods such as continuous batching the incoming queries. In a multiGPU instance this can result in faster inferences.

Once the instance is running. Establish an SSH connection to the instance from your terminal. You can do that by using the following command.

ssh -i path_to_your_key.pem ubuntu@instance_public_ip

Once inside the instance, create a virtual env, and install vllm with the following command. This will take a few minutes

 pip install vllm

In case you encounter compatibility problems while installing, it may be simpler for you to compile vLLM from the source or use their Docker image: have a look at the vLLM installation instructions.

Launch the Inference Server

Make sure you have logged into the hugging-face portal and applied for the license. You can do that by going to the relevant model page and applying there. In our case the model that we are interested in is meta-llama/Meta-Llama-3-8B. Once you are approved, make sure to run the following in the terminal, to configure your hugging-face API key.

Huggingface-cli login

Once the download is completed, you can simply run an OpenAI compatible inference server by running the following command. The name of the model comes from the relevant hugging-face name [7]. This takes a few mins to run the server fully

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct

Once the server is running, you will see something like the following

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

This means that our server is running on port 8000, and you can call it from your computer by running the following curl command. The vllm exposes an OpenAI compatible endpoint, and you can essentially use any OpenAI chat completion client to access this model.

curl http://server_public_ip_address:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "What are the most popular quantization techniques for LLMs?"
}'

You now have a production ready inference server that can handle many parallel requests, thanks to vLLMs continuous batching. However, it can only serve so much, if the number of requests increases, then you might have to think about horizontal scaling i.e. spinning up other instances and load balancing the requests among them. Similarly, make sure you also think about attaching a spin-down policy to the instance, otherwise the monthly bill can get pretty expensive. If you don’t want to do all of that yourself, you can check out SlashML, which handles all of these things for you.

If you have questions about LLaMA 3 and AI deployment in general, please don't hesitate to ask us, it's always a pleasure to help!

Checkout our awesome apps ⬇️⬇️⬇️

If you want to chat with data and generate visualizations, please visit SirPlotsAlot

If you want to chat with Any github Codebase, please visit CodesALot

If you are struggling to integrate AI into your apps

Build AI enbaled apps from a single prompt

Build full stack Apps from prompts

READ OTHER POSTS