Faizan Khan

@faizan10114

Published on Jan 4, 2025

Deploying Llama 3.1 on GCP Vertex AI

This guide walks you through deploying Llama 3.1 on Google Cloud's Vertex AI platform using Python from your terminal. We'll cover everything from environment setup to handling compute quotas.

Prerequisites

  • Google Cloud account

  • Python 3.10+

  • Access to Llama 3.1 (request from Meta)

  • Hugging Face account and access token (needed to download the gated model weights)

Compute Requirements

Llama 3.1 has different variants with different resource needs:

  • Llama 3.1 8B: Minimum 16GB GPU RAM (A2 High-Memory)

Recommended instance types:

  • 8B: a2-highgpu-2g

Step 1: Environment Setup

1.1 Install Google Cloud SDK

Install the Google Cloud SDK and authenticate using:
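For example, on Linux or macOS:

# Install the gcloud CLI (see https://cloud.google.com/sdk/docs/install for other install options)
curl https://sdk.cloud.google.com | bash
exec -l $SHELL

# Authenticate and set up application-default credentials
gcloud auth login
gcloud auth application-default login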

1.2 Initialize Google Cloud
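Set your default project, enable the Vertex AI API, and pick the region where you have GPU quota (YOUR_PROJECT_ID and us-east1 are placeholders):

gcloud init
gcloud config set project YOUR_PROJECT_ID
gcloud services enable aiplatform.googleapis.com
gcloud config set ai/region us-east1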


1.3 Set Up Python Environment
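Create a virtual environment and install the Vertex AI SDK (the package is google-cloud-aiplatform):

python3 -m venv venv
source venv/bin/activate
pip install --upgrade google-cloud-aiplatform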


Step 2: Request Quota Increase

Before deploying, ensure you have sufficient quota for GPU instances:

  1. Visit the Google Cloud Console: https://console.cloud.google.com

  2. Go to IAM & Admin → Quotas

  3. Filter for "GPUs (all regions)"

  4. Select the quota for your region (e.g., "GPUs (us-east1)")

  5. Click "EDIT QUOTAS"

  6. Enter new quota limit:

    • For the 8B model: Request at least 1 A2 GPU

    • For the 70B model: Request at least 4 A2 GPUs

  7. Fill in the request form:

Note: Quota approval can take 24-48 hours.
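You can also check your current GPU quota from the terminal. A rough check, assuming the A100 quota metric NVIDIA_A100_GPUS and the us-east1 region:

gcloud compute regions describe us-east1 | grep -B 1 -A 1 NVIDIA_A100_GPUS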

Step 3: Deployment Code

Create a new file deploy_llama.py:

from google.cloud import aiplatform
import os

def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-4g",
):
    """
    Deploy a Hugging Face model using pre-built container
    """
    # Initialize Vertex AI
    aiplatform.init(project=project_id, location=location)
    
    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "NUM_SHARD": "1",
        # "HF_TOKEN": ""  # Add your Hugging Face token if needed
    }


    # Create model using pre-built container
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        # Using the official container for Hugging Face models
        serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310",
        serving_container_environment_variables=env_vars
    )
  
    print(f"model uploaded: {model}")

    # Deploy model to endpoint
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        # Must match the GPU count of the machine type (a2-highgpu-4g has 4 A100s)
        accelerator_count=4,
        sync=True
    )
    
    print(f"Model deployed to endpoint: {endpoint.resource_name}")
    return endpoint


def create_completion(
    endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7
):
    """
    Generate text using deployed model
    """
    # Vertex AI expects a list of instances; the TGI container reads the
    # prompt from the "inputs" field
    response = endpoint.predict(
        instances=[{
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "top_p": 0.95,
                "top_k": 40,
            },
        }]
    )
    return response

if __name__ == "__main__":
    # Example usage; replace the project ID and region with your own values
    endpoint = deploy_hf_model(
        project_id="your-project-id",
        location="us-east1",
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    )
    print(create_completion(endpoint, "Tell me a joke"))

Step 4: Deploy the Model

  1. Run the deployment:
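Llama 3.1 weights are gated on Hugging Face, so make your access token available to the script first (the variable name matches the os.environ.get("HF_TOKEN") lookup in deploy_llama.py):

export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # placeholder token
python deploy_llama.py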


Step 5: Monitor the Deployment

Using gcloud CLI:

# List endpoints
gcloud ai endpoints list --region=us-east1

# Get endpoint details
gcloud ai endpoints describe ENDPOINT_ID --region=us-east1

# Get endpoint predictions (--json-request expects a path to a JSON file)
echo '{"instances": [{"inputs": "Tell me a joke"}]}' > request.json
gcloud ai endpoints predict ENDPOINT_ID \
    --region=us-east1 \
    --json-request=request.json

Using Google Cloud Console:

  1. Go to Vertex AI → Models

  2. Find your model in the list

  3. Click on the Endpoints tab

  4. Monitor metrics:

    • Prediction requests

    • Latency

    • Error rate

Cost Optimization

To minimize costs:

  1. Delete endpoints when not in use:
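For example, with the gcloud CLI (ENDPOINT_ID and DEPLOYED_MODEL_ID come from gcloud ai endpoints describe; use whichever region you deployed to):

# Undeploy the model first; this releases the GPU and stops the compute billing
gcloud ai endpoints undeploy-model ENDPOINT_ID \
    --region=us-east1 \
    --deployed-model-id=DEPLOYED_MODEL_ID

# Then delete the endpoint itself
gcloud ai endpoints delete ENDPOINT_ID --region=us-east1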

Troubleshooting

Common issues and solutions:

  1. Quota Exceeded

    • Check current quota: gcloud compute regions describe REGION

    • Request increase as described above

  2. Out of Memory

    • Reduce batch size in environment variables (see the sketch after this list)

    • Use larger instance type

    • Reduce sequence length

  3. Model Access Error

    • Ensure Hugging Face token is set

    • Verify Meta approval for Llama 3.1
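
For the out-of-memory case, here is a rough sketch of more conservative values for the env_vars dictionary in deploy_llama.py; the exact limits you can afford depend on your GPU and model (the values below are hypothetical):

# Hypothetical, more conservative TGI settings for a memory-constrained GPU
env_vars = {
    "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "MAX_INPUT_LENGTH": "256",           # shorter prompts
    "MAX_TOTAL_TOKENS": "512",           # prompt + generated tokens
    "MAX_BATCH_PREFILL_TOKENS": "512",   # fewer tokens per prefill batch
    "NUM_SHARD": "1",
}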


For more information, try out our dashboard: deploy any model in your private cloud or the SlashML cloud.

©2024 – Made with ❤️ & ☕️ in Montreal