Deploying Llama 3.1 on Amazon SageMaker

Faizan Khan

@faizan10114

Published on Jan 3, 2025

In this guide, we'll walk through the process of deploying Meta's Llama 3.1 model on Amazon SageMaker. We'll cover everything from setting up your AWS environment to deploying and testing the model.

Prerequisites

Before we begin, make sure you have:

  • An AWS account

  • Python 3.11 or later installed

  • Basic familiarity with Python and AWS concepts

Step 1: Setting Up Your AWS Environment

1.1 Create an IAM User

First, you'll need an IAM user with appropriate permissions:

  1. Go to AWS IAM Console (https://console.aws.amazon.com/iam/)

  2. Click "Users" → "Add user"

  3. Set a username and enable "Access key - Programmatic access"

  4. Attach the following policies:

    • AmazonSageMakerFullAccess

    • AmazonS3FullAccess

Remember to save your access key ID and secret access key securely.

1.2 Install Required Python Packages

Create a new Python virtual environment and install the necessary packages:
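For example (the environment name is arbitrary, and any recent releases of sagemaker and boto3 should work):

python -m venv llama-env
source llama-env/bin/activate
pip install sagemaker boto3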


1.3 Configure AWS Credentials

Set up your AWS credentials using one of these methods:
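The simplest is the AWS CLI, using the keys you saved in step 1.1 (values shown are placeholders):

aws configure
# AWS Access Key ID [None]: <your-access-key-id>
# AWS Secret Access Key [None]: <your-secret-access-key>
# Default region name [None]: us-east-1
# Default output format [None]: json

Or, equivalently, via environment variables:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=us-east-1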


Step 2: Preparing the Deployment Code

Create a new Python script named deploy_llama.py. It serves the model with the Hugging Face LLM inference container (text-generation-inference, or TGI):

import json
import os

import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

def create_llama_model(
    role_arn,
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    instance_type="ml.g5.2xlarge",
):
    # Get the Hugging Face LLM (TGI) container image for this region
    image_uri = get_huggingface_llm_image_uri("huggingface")

    # Define the model; the gated Llama repository requires a Hugging Face
    # token for an account that has accepted Meta's license terms
    huggingface_model = HuggingFaceModel(
        image_uri=image_uri,
        role=role_arn,
        env={
            "HF_MODEL_ID": model_id,
            "SM_NUM_GPUS": "1",  # ml.g5.2xlarge has a single A10G GPU
            "HUGGING_FACE_HUB_TOKEN": os.environ["HUGGING_FACE_HUB_TOKEN"],
        },
    )

    # Deploy the model to a real-time endpoint; the health check timeout
    # is raised because downloading the weights takes several minutes
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name="llama3-1-endpoint",
        container_startup_health_check_timeout=600,
    )

    return predictor

def get_sagemaker_role():
    """Get or create SageMaker execution role"""
    iam = boto3.client('iam')
    
    # Try to get existing SageMaker role
    try:
        role = iam.get_role(RoleName='SageMakerExecutionRole')
        return role['Role']['Arn']
    except iam.exceptions.NoSuchEntityException:
        # Create new role if it doesn't exist
        role_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "sagemaker.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
        
        iam.create_role(
            RoleName='SageMakerExecutionRole',
            AssumeRolePolicyDocument=json.dumps(role_policy)
        )
        
        # Attach necessary policies
        policies = [
            'arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
            'arn:aws:iam::aws:policy/AmazonS3FullAccess'
        ]
        
        for policy in policies:
            iam.attach_role_policy(
                RoleName='SageMakerExecutionRole',
                PolicyArn=policy
            )
        
        role = iam.get_role(RoleName='SageMakerExecutionRole')
        return role['Role']['Arn']

if __name__ == "__main__

Step 3: Deploying the Model

  1. Make sure you have access to Llama 3.1. You'll need to:

    • Request access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct repository on Hugging Face and accept Meta's license terms

    • Create a Hugging Face access token so the deployment script can download the weights

  2. Run the deployment script:
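
The script expects your Hugging Face token in the HUGGING_FACE_HUB_TOKEN environment variable, as written in deploy_llama.py above:

export HUGGING_FACE_HUB_TOKEN=<your-hf-token>
python deploy_llama.py

Deployment can take ten minutes or more while SageMaker provisions the instance and the container downloads the model weights.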

Step 4: Querying the Model

Create a script named query_llama.py:

import boto3
import json

def query_endpoint(text_input, endpoint_name="llama3-1-endpoint"):
    # Create a SageMaker runtime client
    client = boto3.client('sagemaker-runtime')
    
    # Prepare the input; these generation parameters follow the
    # text-generation-inference (TGI) schema used by the endpoint
    payload = {
        "inputs": text_input,
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9,
        }
    }
    
    # Query the endpoint
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    # Parse and return the response
    result = json.loads(response['Body'].read().decode())
    return result

if __name__ == "__main__

Important Considerations

  1. Costs: Keep in mind that running a SageMaker endpoint incurs costs. The ml.g5.2xlarge instance type costs approximately $1.52 per hour in the US East region, which works out to roughly $36 per day if the endpoint is left running.

  2. Model Size: This guide uses the 8B parameter version of Llama 3.1. For the larger 70B and 405B versions, you'll need more powerful instance types.

  3. Instance Types:

    • 8B model: ml.g5.2xlarge

    • 70B model: ml.g5.48xlarge or ml.p4d.24xlarge

    • 405B model: ml.p5.48xlarge or larger

  4. Cleanup: To avoid unnecessary charges, delete endpoints when not in use:
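
If you still have the predictor object, predictor.delete_endpoint() removes both the endpoint and its configuration. Otherwise, use the boto3 client; the endpoint and its config share a name here because the SageMaker SDK names the config after the endpoint:

import boto3

client = boto3.client('sagemaker')
client.delete_endpoint(EndpointName='llama3-1-endpoint')
client.delete_endpoint_config(EndpointConfigName='llama3-1-endpoint')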


Troubleshooting

Common issues and solutions:

  1. Model Access Error: Make sure you've been granted access to Llama 3.1 on Hugging Face, accepted the license terms, and passed a valid access token to the endpoint.

  2. Instance Limit Error: You might need to request a service quota increase for your chosen instance type.

  3. Memory Issues: If you see out-of-memory errors, try a larger instance type or reduce the batch size in your requests.

Conclusion

You now have a working Llama 3.1 deployment on Amazon SageMaker! Remember to monitor your costs and delete unused endpoints. For production deployments, consider adding error handling, monitoring, and auto-scaling configurations.
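
As a starting point for auto-scaling, here is a minimal sketch using the Application Auto Scaling API. It assumes the endpoint name from this guide and SageMaker's default variant name, AllTraffic; the target value is illustrative and should be tuned to your workload:

import boto3

autoscaling = boto3.client('application-autoscaling')

# Register the endpoint variant as a scalable target (1 to 2 instances)
resource_id = 'endpoint/llama3-1-endpoint/variant/AllTraffic'
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=2,
)

# Scale on the built-in invocations-per-instance metric
autoscaling.put_scaling_policy(
    PolicyName='llama3-invocations-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
    },
)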

For more information, refer to:

  • Amazon SageMaker documentation: https://docs.aws.amazon.com/sagemaker/

  • Hugging Face on SageMaker: https://huggingface.co/docs/sagemaker

  • Llama 3.1 model card: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
