Deploying Llama 3 on Azure

Faizan Khan
@faizan10114
Published on Jan 5, 2025
This guide walks you through deploying Llama 3 on Azure Machine Learning using Python. We'll cover environment setup, deployment, and monitoring.

Prerequisites

  • Azure subscription

  • Access to Llama 3 (request from Meta)

  • Python 3.11+

Compute Requirements

Minimum requirements for the Llama 3 8B variant deployed in this guide:

  • Llama 3 8B: 16GB GPU RAM


Recommended Azure VM sizes:

  • 8B: Standard_NC6s_v3 (1x V100 16GB)
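
GPU availability varies by region, and the troubleshooting section below also suggests verifying VM size availability. If you want to confirm that a size like Standard_NC6s_v3 is offered in your target region before creating anything, here is a minimal sketch using the azure-mgmt-compute package (an extra dependency: pip install azure-mgmt-compute; the subscription ID is a placeholder):

# check_skus.py - list NC-series GPU SKUs available in a region
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "your-subscription-id"
compute_client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Filter SKUs to the region, then match GPU families by name
for sku in compute_client.resource_skus.list(filter="location eq 'eastus'"):
    if sku.resource_type == "virtualMachines" and sku.name.startswith("Standard_NC"):
        # Any restriction (e.g. NotAvailableForSubscription) means you cannot use it yet
        status = "restricted" if sku.restrictions else "available"
        print(f"{sku.name}: {status}")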

Step 1: Environment Setup

1.1 Install Required Packages

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install requirements
pip install azure-ai-ml azure-identity transformers torch azureml-core


# Create Resource Group
az group create --name "llama-rg" --location "eastus"

# Create Azure ML Workspace
az ml workspace create \
    --name "llama-workspace" \
    --resource-group "llama-rg" \
    --location "eastus"


# Register all the required providers: DO NOT SKIP THIS
az provider register --namespace Microsoft.MachineLearningServices
az provider register --namespace Microsoft.ContainerRegistry
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.ContainerService
az provider register --namespace Microsoft.PolicyInsights
az provider register --namespace Microsoft.Cdn
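
Registration is asynchronous and can take a few minutes. To confirm the providers have finished registering before moving on, here is a minimal sketch using the azure-mgmt-resource package (an extra dependency: pip install azure-mgmt-resource):

# check_providers.py - verify that required resource providers are registered
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "your-subscription-id"
resource_client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

for ns in [
    "Microsoft.MachineLearningServices",
    "Microsoft.ContainerRegistry",
    "Microsoft.KeyVault",
    "Microsoft.Storage",
]:
    provider = resource_client.providers.get(ns)
    print(f"{ns}: {provider.registration_state}")  # expect "Registered"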

1.2 Request Quota Increase

  1. Visit Azure Portal: https://portal.azure.com

  2. Go to Subscriptions → Your Subscription → Usage + quotas

  3. Select "Machine Learning" service

  4. Request quota increase for (you can check your current usage first; see the sketch after this list):

    • NC-series vCPUs for V100s

    • Or ND A100 v4-series vCPUs for A100s
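
Before filing a request, you can check your current regional usage and limits programmatically. A minimal sketch, again assuming azure-mgmt-compute is installed:

# check_quota.py - print current vCPU usage vs. limit for GPU VM families
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "your-subscription-id"
compute_client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

# Usage is reported per region, per VM family
for usage in compute_client.usage.list("eastus"):
    if "NC" in usage.name.value or "ND" in usage.name.value:
        print(f"{usage.name.localized_value}: {usage.current_value}/{usage.limit}")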


Step 2: Deployment Code

Create a file deploy_llama_azure.py:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
from azure.identity import DefaultAzureCredential
import json

def create_azure_endpoint(subscription_id, resource_group, workspace_name, model_name):

    credential = DefaultAzureCredential()

    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )



    try:
        workspace = ml_client.workspaces.get(name=workspace_name)
        print(f"Found workspace: {workspace.name}")
    except Exception as e:
        print(f"Error accessing workspace: {str(e)}")
        raise


    # Build the registry model ID for the Hugging Face registry model
    registry_name = "HuggingFace"
    model_id = f"azureml://registries/{registry_name}/models/{model_name}/labels/latest"

    # Endpoint names must be unique per Azure region, so append a timestamp
    import time
    endpoint_name = "hf-ep-" + str(int(time.time()))

    # Create the endpoint and wait for provisioning to finish
    ml_client.begin_create_or_update(ManagedOnlineEndpoint(name=endpoint_name)).wait()


    # Create environment for the deployment
    environment = Environment(
        name="bert-env",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file={
            "channels": ["conda-forge", "pytorch"],
            "dependencies": [
                "python=3.11",
                "pip",
                "pytorch",
                "transformers",
                "numpy"
            ]
        }
    )


    print("Endpoint created:", endpoint_name)

    # Deploy the model to the endpoint; this can take 10+ minutes
    ml_client.online_deployments.begin_create_or_update(ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model_id,
        environment=environment,
        instance_type="Standard_NC6s_v3",  # GPU size recommended above for the 8B model
        instance_count=1,
    )).wait()


    # Route 100% of traffic to the new deployment
    endpoint = ml_client.online_endpoints.get(endpoint_name)
    endpoint.traffic = {"demo": 100}
    ml_client.begin_create_or_update(endpoint).result()

    print(endpoint.scoring_uri)
    return endpoint

def main():
    # Azure ML workspace details
    subscription_id = "your-subscription-id"
    resource_group = "your-resource-group"
    workspace_name = "your-workspace-name"

    config = {
        "model_id": "meta-llama-meta-llama-3-8b-instruct"
    }

    endpoint = create_azure_endpoint(subscription_id, resource_group, workspace_name, config['model_id'])

    print(f"Endpoint URL: {endpoint.scoring_uri}")

if __name__ == "__main__":
    main()

Step 3: Deploy the Model

  1. Run deployment:

python deploy_llama_azure.py
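
Provisioning can take a while. While it runs, you can poll the endpoint and deployment state from another shell; a minimal sketch, reusing the workspace details from the deploy script (the endpoint name below is a placeholder for the one the script prints):

# check_status.py - inspect endpoint and deployment provisioning state
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "hf-ep-1736000000"  # use the name printed by deploy_llama_azure.py

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

endpoint = ml_client.online_endpoints.get(endpoint_name)
print("Endpoint state:", endpoint.provisioning_state)

deployment = ml_client.online_deployments.get(name="demo", endpoint_name=endpoint_name)
print("Deployment state:", deployment.provisioning_state)  # "Succeeded" when ready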

Step 4: Querying the Deployment

Create query_endpoint.py:

import json
import os

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

def query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query):
    credential = DefaultAzureCredential()
    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )

    # Test data
    test_data = {
        "inputs": query
    }

    # Save the test data to a temporary file
    with open("test_request.json", "w") as f:
        json.dump(test_data, f)

    # Get prediction
    response = ml_client.online_endpoints.invoke(
        endpoint_name=endpoint_name,
        request_file = 'test_request.json'
    )

    print('Raw Response Content:', response)
    # Clean up the temporary request file
    os.remove("test_request.json")
    return response

if __name__=="__main__":
    # Azure ML workspace details
    subscription_id = "your-subscription-id"
    resource_group = "your-resource-group"
    workspace_name = "your-workspace-name"
    endpoint_name = 'my-unique-endpoint'  # the endpoint name printed by deploy_llama_azure.py

    query = "whats the meaning of life?"
    print(query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query))

Execute query_endpoint.py:

python query_endpoint.py
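
The invoke call returns the raw response body as a string, and the exact schema depends on the model's scoring handler. A defensive way to inspect it, extending the __main__ block above (a sketch, not tied to a specific response format):

import json

raw = query_azure_endpoint(subscription_id, resource_group, workspace_name, endpoint_name, query)
try:
    # Pretty-print if the body is JSON; fall back to the raw string otherwise
    print(json.dumps(json.loads(raw), indent=2))
except json.JSONDecodeError:
    print(raw)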
Cost Management

Estimated costs (East US region):

  • Standard_NC6s_v3: ~$0.90/hour

  • Standard_NC12s_v3: ~$1.80/hour

  • Standard_NC96ads_A100_v4: ~$32.77/hour
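
Managed online endpoints bill for every hour their instances run, whether or not they serve traffic; at ~$0.90/hour, an idle NC6s_v3 instance still comes to roughly $650/month. Delete endpoints you are not using. A minimal sketch (placeholders as in the earlier scripts):

# cleanup.py - delete an endpoint to stop all charges for it
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "my-unique-endpoint"  # the endpoint to tear down

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Deleting the endpoint also removes its deployments and frees the GPU instances
ml_client.online_endpoints.begin_delete(name=endpoint_name).wait()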

Monitoring

Using Azure Portal

  1. Go to Azure ML Studio

  2. Select Endpoints

  3. Click on your endpoint

  4. View metrics:

    • Request latency

    • CPU/Memory usage

    • GPU utilization

    • Success rate

Using Python

Metrics can also be pulled programmatically with the azure-monitor-query package (pip install azure-monitor-query):

from datetime import timedelta

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Look up the endpoint to get its Azure resource ID
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
endpoint = ml_client.online_endpoints.get(endpoint_name)

# Query the last hour of endpoint metrics
metrics_client = MetricsQueryClient(DefaultAzureCredential())
response = metrics_client.query_resource(
    resource_uri=endpoint.id,
    metric_names=["RequestLatency", "GPUUtilization"],
    timespan=timedelta(hours=1)
)

# Print metrics
for metric in response.metrics:
    print(f"{metric.name}:")
    for timeseries in metric.timeseries:
        for data in timeseries.data:
            print(f"  {data.timestamp}: {data.average}")

Troubleshooting

Common issues and solutions:

  1. Quota Limits

    • Check current quota in Azure Portal

    • Request increase if needed

    • Consider different regions

  2. Deployment Failures

    • Check the activity log in Azure Portal, or pull container logs (see the sketch after this list)

    • Verify VM size availability

    • Check model compatibility

  3. Performance Issues

    • Monitor GPU utilization

    • Adjust batch size

    • Check for memory leaks
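
For deployment failures and performance issues alike, the deployment's container logs are usually the fastest signal. A minimal sketch using the SDK's log retrieval (workspace details and endpoint name as in the earlier scripts):

# get_logs.py - fetch recent container logs from the deployment
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
endpoint_name = "my-unique-endpoint"

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)

# Pull the last 100 log lines from the "demo" deployment's inference container
logs = ml_client.online_deployments.get_logs(
    name="demo",
    endpoint_name=endpoint_name,
    lines=100,
)
print(logs)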

Conclusion

You now have a Llama 3 model running on Azure ML! Remember to:

  • Monitor costs

  • Update model versions

  • Implement security best practices

For more information:

Try out our dashboard

Deploy any model in your private cloud or the SlashML cloud

©2024 – Made with ❤️ & ☕️ in Montreal