Jneid Jneid

@jjneid94

Published on Jan 29, 2025

Host DeepSeek-R1 Distilled Llama-8B on GCP

For this tutorial, we are using deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, one of the smaller distills in the DeepSeek-R1 family; the same steps apply to the larger Llama-8B variant with an appropriately sized instance.

This is a step-by-step guide to deploying the model with Magemaker, covering both its interactive and YAML-based deployment options.

Step 1: GCP Setup

1. Create a Google Cloud account if you haven't already

2. Install gcloud CLI:

   # Follow instructions at cloud.google.com/sdk/docs/install-sdk
   gcloud init

3. Enable Vertex AI API in your project:

  • Go to Google Cloud Console

  • Search for "Vertex AI API"

  • Click "Enable"

For a more detailed, step-by-step configuration walkthrough, use this.
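
If you prefer the CLI over the console, the Vertex AI API can also be enabled with gcloud (a sketch, assuming `gcloud init` has already set your active project):

```shell
# Enable the Vertex AI API for the active project
gcloud services enable aiplatform.googleapis.com

# Confirm it is enabled
gcloud services list --enabled --filter="name:aiplatform.googleapis.com"
```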

Step 2: Authentication

# Login and set application default credentials
gcloud auth application-default login
# Verify your configuration
gcloud config list

Step 3: Create YAML Configuration

Create a file named `deploy-deepseek-gcp.yaml`:

deployment: !Deployment
  destination: gcp
  endpoint_name: deepseek-r1-distill
  accelerator_count: 1
  instance_type: g2-standard-12
  accelerator_type: NVIDIA_L4
  num_gpus: null
  quantization: null
models:
- !Model
  id: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  location: null
  predict: null
  source: huggingface
  task: text-generation
  version: null

Step 4: Deploy

# Install Magemaker if you haven't already
pip install magemaker
# Deploy using the YAML file
magemaker --deploy deploy-deepseek-gcp.yaml

Step 5: Verify Deployment

  1. Go to Google Cloud Console

  2. Navigate to Vertex AI → Model Registry

  3. Check your endpoint status

Step 6: Test the Endpoint

Use this Python code to test your deployment:

from google.cloud import aiplatform

# Replace {project}, {location}, and {endpoint_id} with the values
# from your deployment (visible under Vertex AI in the console)
endpoint = aiplatform.Endpoint(
    endpoint_name="projects/{project}/locations/{location}/endpoints/{endpoint_id}"
)

# Send a single text-generation request to the deployed model
response = endpoint.predict(
    instances=[{
        "inputs": "Write a Python function to calculate fibonacci numbers"
    }]
)
print(response)
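
The shape of `response.predictions` depends on the serving container; with Hugging Face text-generation containers it is typically a list of dicts with a `generated_text` field, though that is an assumption worth checking against your own output. A small best-effort helper for pulling the text out (the `extract_generated_text` name is ours, not part of any SDK):

```python
def extract_generated_text(predictions):
    """Best-effort extraction of generated text from a Vertex AI
    prediction payload. Handles two shapes commonly returned by
    text-generation containers: a list of dicts carrying a
    'generated_text' key, or a plain list of strings."""
    if not predictions:
        return ""
    first = predictions[0]
    if isinstance(first, dict):
        return first.get("generated_text", "")
    return str(first)

# Example with a mocked payload (the real one comes from endpoint.predict):
sample = [{"generated_text": "def fib(n): ..."}]
print(extract_generated_text(sample))  # → def fib(n): ...
```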

Common Issues and Solutions

Quota Issues

If you encounter quota errors:

  1. Go to IAM & Admin → Quotas

  2. Search for "NVIDIA L4 GPUs"

  3. Request quota increase

Authentication Issues

# Verify your credentials
gcloud auth list
# Reset if needed
gcloud auth login
gcloud auth application-default login


Instance Availability

  • Check if g2-standard-12 is available in your region

  • Try different regions if needed
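
Availability can also be checked from the CLI (a sketch; the zone name below is only an example):

```shell
# List zones where the g2-standard-12 machine type is offered
gcloud compute machine-types list --filter="name=g2-standard-12"

# Or inspect a specific zone directly
gcloud compute machine-types describe g2-standard-12 --zone=us-central1-a
```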

Monitoring Your Deployment

Monitor through Google Cloud Console:

  1. Vertex AI → Endpoints

  2. Cloud Monitoring

  3. Cloud Logging

Cost Management

Pricing Breakdown

  • g2-standard-12 with NVIDIA L4: ~$1 per hour

  • Additional costs:

    • Network egress

    • API calls

    • Storage for model artifacts

Cost Optimization Tips

1. Delete endpoints when not in use:

   magemaker --cloud gcp
   # Select "Delete a model endpoint"

2. Use batch processing when possible

3. Monitor usage patterns

4. Set up billing alerts

5. Consider scheduled shutdowns for non-critical workloads

Monthly Cost Estimates

  • 24/7 running: ~$720/month

  • 8 hours/day: ~$240/month

  • 4 hours/day: ~$120/month
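
The estimates above follow directly from the ~$1/hour rate; a quick sketch for computing your own (rates are approximate and region-dependent, and this ignores egress, API-call, and storage charges):

```python
def monthly_cost(hourly_rate_usd, hours_per_day, days_per_month=30):
    """Rough monthly compute cost for an endpoint that runs
    hours_per_day hours each day at a flat hourly rate."""
    return hourly_rate_usd * hours_per_day * days_per_month

# ~$1/hour for g2-standard-12 with NVIDIA L4 (approximate)
for hours in (24, 8, 4):
    print(f"{hours:>2} h/day: ~${monthly_cost(1.0, hours):.0f}/month")
# → 24 h/day: ~$720/month, 8 h/day: ~$240/month, 4 h/day: ~$120/month
```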

Next Steps

  1. Set up monitoring alerts

  2. Configure auto-scaling if needed

  3. Implement proper error handling

  4. Test with different prompts
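
For the error-handling item, one common pattern is retrying transient failures with exponential backoff. A minimal sketch (the `predict_fn` callable stands in for `endpoint.predict`; this wrapper is our illustration, not part of the Vertex AI SDK):

```python
import time

def predict_with_retry(predict_fn, instances, max_attempts=3, base_delay=1.0):
    """Call predict_fn(instances=...), retrying on failure with
    exponential backoff (base_delay, 2x, 4x, ...) before giving up."""
    for attempt in range(max_attempts):
        try:
            return predict_fn(instances=instances)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage sketch against a live endpoint:
# predict_with_retry(endpoint.predict, [{"inputs": "Hello"}])
```

In production you would narrow the `except Exception` to the transient error types you actually see (timeouts, 429s, 503s) so that genuine bugs fail fast.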



We are open-sourcing Magemaker!! Stay tuned!!!

As always, happy coding!!!


If you have any questions, please do not hesitate to reach out at faizan|jneid@slashml.com.

Try out our dashboard


Deploy any model in your private cloud or on SlashML Cloud


©2024 – Made with ❤️ & ☕️ in Montreal
