Deploying Llama 3 on Azure

Faizan Khan
@faizan10114
Published on Jan 5, 2025
This guide walks you through deploying Llama 3 on Azure Machine Learning using Python. We'll cover environment setup, deployment, and monitoring.

Prerequisites

  • Azure subscription

  • Access to Llama 3 (request from Meta)

  • Python 3.11+

Compute Requirements

Minimum requirements for the Llama 3 8B variant deployed in this guide:

  • Llama 3 8B: 16GB GPU RAM


Recommended Azure VM sizes:

  • 8B: Standard_NC6s_v3 (1x V100 16GB)
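
GPU availability varies by region, and the troubleshooting section below also suggests verifying VM size availability. If you want to confirm that a size like Standard_NC6s_v3 is offered in your target region before creating anything, here is a minimal sketch using the azure-mgmt-compute package (an extra dependency: pip install azure-mgmt-compute; the subscription ID is a placeholder):

# check_skus.py - list NC-series GPU SKUs available in a region
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "your-subscription-id"
compute_client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Filter SKUs to the region, then match GPU families by name
for sku in compute_client.resource_skus.list(filter="location eq 'eastus'"):
    if sku.resource_type == "virtualMachines" and sku.name.startswith("Standard_NC"):
        # Any restriction (e.g. NotAvailableForSubscription) means you cannot use it yet
        status = "restricted" if sku.restrictions else "available"
        print(f"{sku.name}: {status}")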

Step 1: Environment Setup

1.1 Install Required Packages

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install requirements
pip install azure-ai-ml azure-identity transformers torch azureml-core


# Create Resource Group
az group create --name "llama-rg" --location "eastus"

# Create Azure ML Workspace
az ml workspace create \
    --name "llama-workspace" \
    --resource-group "llama-rg" \
    --location "eastus"


# Register all the required providers: DO NOT SKIP THIS
az provider register --namespace Microsoft.MachineLearningServices
az provider register --namespace Microsoft.ContainerRegistry
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.PolicyInsights
az provider register --namespace Microsoft.Cdn
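
Registration is asynchronous and can take a few minutes. To confirm the providers have finished registering before moving on, here is a minimal sketch using the azure-mgmt-resource package (an extra dependency: pip install azure-mgmt-resource):

# check_providers.py - verify that required resource providers are registered
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "your-subscription-id"
resource_client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

for ns in [
    "Microsoft.MachineLearningServices",
    "Microsoft.ContainerRegistry",
    "Microsoft.KeyVault",
    "Microsoft.Storage",
]:
    provider = resource_client.providers.get(ns)
    print(f"{ns}: {provider.registration_state}")  # expect "Registered"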

1.2 Request Quota Increase

  1. Visit Azure Portal: https://portal.azure.com

  2. Go to Subscriptions → Your Subscription → Usage + quotas

  3. Select "Machine Learning" service

  4. Request quota increase for (you can check your current usage first; see the sketch after this list):

    • NC-series vCPUs for V100s

    • Or ND A100 v4-series vCPUs for A100s
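
Before filing a request, you can check your current regional usage and limits programmatically. A minimal sketch, again assuming azure-mgmt-compute is installed:

# check_quota.py - print current vCPU usage vs. limit for GPU VM families
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "your-subscription-id"
compute_client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Usage is reported per region, per VM family
for usage in compute_client.usage.list("eastus"):
    if "NC" in usage.name.value or "ND" in usage.name.value:
        print(f"{usage.name.localized_value}: {usage.current_value}/{usage.limit}")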


Step 2: Deployment Code

Create a file deploy_llama_azure.py:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
from azure.identity import DefaultAzureCredential
import json

def create_azure_endpoint(subscription_id, resource_group, workspace_name, model_name):

    credential = DefaultAzureCredential()

    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )



    try:
        workspace = ml_client.workspaces.get(name=workspace_name)
        print(f"Found workspace: {workspace.name}")
    except Exception as e:
        print(f"Error accessing workspace: {str(e)}")
        raise


    # Build the registry model ID for the Hugging Face registry model
    registry_name = "HuggingFace"
    model_id = f"azureml://registries/{registry_name}/models/{model_name}/labels/latest"

    # Endpoint names must be unique per Azure region, so append a timestamp
    import time
    endpoint_name = "hf-ep-" + str(int(time.time()))

    # Create the endpoint and wait for provisioning to finish
    ml_client.begin_create_or_update(ManagedOnlineEndpoint(name=endpoint_name)).wait()


    # Create environment for the deployment
    environment = Environment(
        name="bert-env",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file={
            "channels": ["conda-forge", "pytorch"],
            "dependencies": [
                "python=3.11",
                "pip",
                "pytorch",
                "transformers",
                "numpy"
            ]
        }
    )


    print("Endpoint created:", endpoint_name)

    # Deploy the model to the endpoint; this can take 10+ minutes
    ml_client.online_deployments.begin_create_or_update(ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model_id,
        environment=environment,
        instance_type="Standard_NC6s_v3",  # GPU size recommended above for the 8B model
        instance_count=1,
    )).wait()


    # Route 100% of traffic to the new deployment
    endpoint = ml_client.online_endpoints.get(endpoint_name)
    endpoint.traffic = {"demo": 100}
    ml_client.begin_create_or_update(endpoint).result()

    print(endpoint.scoring_uri)
    return endpoint

def main():
    # Azure ML workspace details
    subscription_id = "your-subscription-id"
    resource_group = "your-resource-group"
    workspace_name = "your-workspace-name"

    config = {
        "model_id": "meta-llama-meta-llama-3-8b-instruct"
    }

    endpoint = create_azure_endpoint(subscription_id, resource_group, workspace_name, config['model_id'])

    print(f"Endpoint URL: {endpoint.scoring_uri}")

if __name__ == "__main__":
    main()

Step 3: Deploy the Model

  1. Run deployment:

python deploy_llama_azure.py
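
Provisioning can take a while. While it runs, you can poll the endpoint and deployment state from another shell; a minimal sketch, reusing the workspace details from the deploy script (the endpoint name below is a placeholder for the one the script prints):

# check_status.py - inspect endpoint and deployment provisioning state
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "hf-ep-1736000000"  # use the name printed by deploy_llama_azure.py

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

endpoint = ml_client.online_endpoints.get(endpoint_name)
print("Endpoint state:", endpoint.provisioning_state)

deployment = ml_client.online_deployments.get(name="demo", endpoint_name=endpoint_name)
print("Deployment state:", deployment.provisioning_state)  # "Succeeded" when ready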

Step 4: Querying the Deployment

Create query_endpoint.py:

import json
import os

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

def query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query):
    credential = DefaultAzureCredential()
    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )

    # Test data
    test_data = {
        "inputs": query
    }

    # Save the test data to a temporary file
    with open("test_request.json", "w") as f:
        json.dump(test_data, f)

    # Get prediction
    response = ml_client.online_endpoints.invoke(
        endpoint_name=endpoint_name,
        request_file = 'test_request.json'
    )

    print('Raw Response Content:', response)
    # Clean up the temporary request file
    os.remove("test_request.json")
    return response

if __name__=="__main__":
    # Azure ML workspace details
    subscription_id = "your-subscription-id"
    resource_group = "your-resource-group"
    workspace_name = "your-workspace-name"
    endpoint_name = 'my-unique-endpoint'  # the endpoint name printed by deploy_llama_azure.py

    query = "whats the meaning of life?"
    print(query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query))

Execute query_endpoint.py:

python query_endpoint.py
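
The invoke call returns the raw response body as a string, and the exact schema depends on the model's scoring handler. A defensive way to inspect it, extending the __main__ block above (a sketch, not tied to a specific response format):

import json

raw = query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query)
try:
    # Pretty-print if the body is JSON; fall back to the raw string otherwise
    print(json.dumps(json.loads(raw), indent=2))
except json.JSONDecodeError:
    print(raw)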
Cost Management

Estimated costs (East US region):

  • Standard_NC6s_v3: ~$0.90/hour

  • Standard_NC12s_v3: ~$1.80/hour

  • Standard_NC96ads_A100_v4: ~$32.77/hour
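
Managed online endpoints bill for every hour their instances run, whether or not they serve traffic; at ~$0.90/hour, an idle NC6s_v3 instance still comes to roughly $650/month. Delete endpoints you are not using. A minimal sketch (placeholders as in the earlier scripts):

# cleanup.py - delete an endpoint to stop all charges for it
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "my-unique-endpoint"  # the endpoint to tear down

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Deleting the endpoint also removes its deployments and frees the GPU instances
ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()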

Monitoring

Using Azure Portal

  1. Go to Azure ML Studio

  2. Select Endpoints

  3. Click on your endpoint

  4. View metrics:

    • Request latency

    • CPU/Memory usage

    • GPU utilization

    • Success rate

Using Python

Metrics can also be pulled programmatically with the azure-monitor-query package (pip install azure-monitor-query):

from datetime import timedelta

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Look up the endpoint to get its Azure resource ID
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
endpoint = ml_client.online_endpoints.get(endpoint_name)

# Query the last hour of endpoint metrics
metrics_client = MetricsQueryClient(DefaultAzureCredential())
response = metrics_client.query_resource(
    resource_uri=endpoint.id,
    metric_names=["RequestLatency", "GPUUtilization"],
    timespan=timedelta(hours=1)
)

# Print metrics
for metric in response.metrics:
    print(f"{metric.name}:")
    for timeseries in metric.timeseries:
        for data in timeseries.data:
            print(f"  {data.timestamp}: {data.average}")

Troubleshooting

Common issues and solutions:

  1. Quota Limits

    • Check current quota in Azure Portal

    • Request increase if needed

    • Consider different regions

  2. Deployment Failures

    • Check the activity log in Azure Portal, or pull container logs (see the sketch after this list)

    • Verify VM size availability

    • Check model compatibility

  3. Performance Issues

    • Monitor GPU utilization

    • Adjust batch size

    • Check for memory leaks
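
For deployment failures and performance issues alike, the deployment's container logs are usually the fastest signal. A minimal sketch using the SDK's log retrieval (workspace details and endpoint name as in the earlier scripts):

# get_logs.py - fetch recent container logs from the deployment
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "my-unique-endpoint"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Pull the last 100 log lines from the "demo" deployment's inference container
logs = ml_client.online_deployments.get_logs(
    name="demo",
    endpoint_name=endpoint_name,
    lines=100,
)
print(logs)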

Conclusion

You now have a Llama 3 model running on Azure ML! Remember to:

  • Monitor costs

  • Update model versions

  • Implement security best practices

For more information:

Try out our dashboard

Deploy any model in your private cloud or the SlashML cloud

©2024 – Made with ❤️ & ☕️ in Montreal