Monday, October 21, 2024

Deploying Llama-3.1-Nemotron-70B-Instruct with VLLM and FastAPI: An In-Depth Guide

Introduction

Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to enhance the helpfulness of LLM-generated responses to user queries. The model scores 85.0 on Arena Hard, 57.6 on AlpacaEval 2 LC, and 8.98 on GPT-4-Turbo MT-Bench, benchmarks known to be predictive of LMSys Chatbot Arena Elo, demonstrating strong alignment and performance. As of 1 Oct 2024, Llama-3.1-Nemotron-70B-Instruct ranks #1 across these three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), surpassing other strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

In this guide, we will walk through the process of deploying the Llama-3.1-Nemotron-70B-Instruct model using VLLM and FastAPI. This setup leverages VLLM's efficient model serving capabilities alongside FastAPI's rapid API development framework to create a scalable and efficient deployment for large language models. We'll go through the entire codebase, explaining each section in detail, and finally deploy the API using Uvicorn.

Prerequisites

Before getting started, make sure you have the following Python packages installed. You can install them using pip:

pip install fastapi==0.111.0 transformers==4.45.2 peft==0.12.0 trl==0.10.1 bitsandbytes==0.43.3 pytest httpx uvicorn==0.30.1 vllm==0.6.3.post1

Ensure you have NVIDIA GPUs with sufficient memory: in bfloat16, the 70B weights alone occupy roughly 140 GB, and the configuration below shards the model across 4 GPUs with tensor parallelism (e.g., 4x A100 80GB). CUDA and PyTorch should be set up correctly to utilize the GPUs.
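
As a quick sanity check before loading the model, you can confirm that PyTorch sees your GPUs (a minimal snippet, independent of the rest of the code):

import torch

# List the GPUs visible to PyTorch and their total memory
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")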

Download Llama-3.1-Nemotron-70B-Instruct-HF from Hugging Face here:

https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

or

huggingface-cli download nvidia/Llama-3.1-Nemotron-70B-Instruct-HF --local-dir Llama-3.1-Nemotron-70B-Instruct-HF

Project Structure

The project is structured as follows:

project_root/
│
├── src/
│   ├── llama_model.py        # Defines the model and handles initialization
│   ├── llama_service.py      # Service layer for generating responses
│
├── main.py                   # FastAPI application and endpoint definitions
├── requirements.txt          # Dependencies for the project
└── README.md                 # Project documentation
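
For reference, requirements.txt simply pins the same packages as the installation command above:

fastapi==0.111.0
transformers==4.45.2
peft==0.12.0
trl==0.10.1
bitsandbytes==0.43.3
pytest
httpx
uvicorn==0.30.1
vllm==0.6.3.post1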

Step 1: Model Initialization (src/llama_model.py)

In this module, we define a class called LlamaModel to initialize and configure the model using VLLM. This class is crucial for setting up the environment and loading the model efficiently.

Code Walkthrough


from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import torch
import os
import logging
import time 

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"
os.environ["TOKENIZERS_PARALLELISM"] = "True"

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

TENSOR_PARALLEL_SIZE = 4
GPU_MEMORY_UTILIZATION = 0.95

class LlamaModel:
    def __init__(self, model_path: str, tensor_parallel_size: int = TENSOR_PARALLEL_SIZE, gpu_memory_utilization=GPU_MEMORY_UTILIZATION):
        self.model_path = model_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
        self.tensor_parallel_size = tensor_parallel_size
        self.gpu_memory_utilization = gpu_memory_utilization
        
        self.num_gpus = torch.cuda.device_count()
        logger.info(f"Number of GPUs available: {self.num_gpus}")

        try:
            self.model = LLM(
                model=self.model_path,
                tensor_parallel_size=self.tensor_parallel_size,
                dtype=torch.bfloat16,
                trust_remote_code=True,
                max_num_seqs=125,
                max_model_len=2048,
                # quantization="bitsandbytes",
                # load_format="bitsandbytes",
                gpu_memory_utilization=self.gpu_memory_utilization
            )
            logger.info(f"Model loaded successfully from {model_path}")
        except Exception as e:
            logger.error(f"Error loading model: {str(e)}")
            raise RuntimeError("Model loading failed")

    def generate_text(
        self, 
        prompt: str, 
        system_prompt: str = "You are an asistant", 
        max_new_tokens: int = 2000, 
        top_k: int = 50, 
        temperature: float = 0.4, 
        top_p: float = 0.9
    ) -> str:
        
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]

        try:
            prompts = self.tokenizer.apply_chat_template(
                messages, 
                add_generation_prompt=True, 
                tokenize=False
            )

            sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_new_tokens, top_k=top_k)

            start_time = time.time()  # Start timing before generating
            outputs = self.model.generate(prompts, sampling_params)
            end_time = time.time()  # End timing after generation

            generated_text = outputs[0].outputs[0].text

            # Calculate metrics
            self.num_tokens_generated = len(self.tokenizer.encode(generated_text))
            self.generation_time = end_time - start_time
            self.throughput = self.num_tokens_generated / self.generation_time if self.generation_time > 0 else 0

            # Log metrics
            logger.info(f"Tokens generated: {self.num_tokens_generated}")
            logger.info(f"Time taken: {self.generation_time:.2f} seconds")
            logger.info(f"Throughput: {self.throughput:.2f} tokens per second")

            return generated_text

        except Exception as e:
            logger.error(f"Error generating text: {str(e)}")
            raise RuntimeError("Text generation failed")

This code defines a class called LlamaModel that leverages the VLLM library and Hugging Face’s transformers for loading and generating text with the Llama model. The code begins by setting environment variables for CUDA memory allocation and parallel tokenization, optimizing resource usage. The class initializes the model using a specified model_path, setting parameters for tensor parallelism and GPU memory utilization, which is critical for efficient inference on large models.

In the __init__ method, the code checks the number of available GPUs and logs this information. It attempts to load the model with various configurations, including bfloat16 precision for reduced memory usage and faster computation. If loading fails, an error is logged, and a runtime exception is raised.

The generate_text method takes user input (a prompt) and parameters for the generation process, such as the maximum number of new tokens and sampling settings like top_k, temperature, and top_p. It formats the prompt using the tokenizer and defines sampling parameters. The method times the generation process to measure throughput and performance, logging metrics such as the number of tokens generated and the time taken.

Overall, this code is structured to efficiently load a large model for inference and manage the generation process with flexibility, while also logging critical information for performance monitoring and troubleshooting.
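
To illustrate how this class is used on its own (outside the API), a minimal sketch might look like the following; the model path is a placeholder for wherever you downloaded the weights:

from src.llama_model import LlamaModel

# Hypothetical standalone usage; replace the path with your local copy of the weights
model = LlamaModel("Llama-3.1-Nemotron-70B-Instruct-HF")
answer = model.generate_text(
    "Explain tensor parallelism in one paragraph.",
    system_prompt="You are an AI assistant",
    max_new_tokens=256,
)
print(answer)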

Step 2: Service Layer (src/llama_service.py)

The service layer acts as a mediator between the model and the API layer. It handles prompt inputs and calls the model's generate_text method.

import os
from src.llama_model import LlamaModel  # absolute import so it resolves when running from the project root
import logging

logger = logging.getLogger(__name__)

MODEL_PATH = os.getenv("MODEL_PATH", "../../local-llama-models/Llama-3.1-Nemotron-70B-Instruct-HF")

class LlamaService:
    def __init__(self):
        try:
            self.model = LlamaModel(MODEL_PATH)
            logger.info("LlamaService initialized successfully")
        except Exception as e:
            logger.error(f"Failed to initialize LlamaService: {str(e)}")
            raise RuntimeError("LlamaService initialization failed")

    def generate_response(self, prompt: str, system_prompt: str, max_new_tokens: int, top_k: int, temperature: float, top_p: float) -> str:
        try:
            return self.model.generate_text(prompt, system_prompt=system_prompt, max_new_tokens=max_new_tokens, top_k=top_k, temperature=temperature, top_p=top_p)
        except Exception as e:
            logger.error(f"Failed to generate response: {str(e)}")
            raise

The code defines a class called LlamaService that serves as a service layer for interacting with the LlamaModel. It first imports the necessary libraries and retrieves the model path from an environment variable, MODEL_PATH, which defaults to the specified path if not set.

In the __init__ method, the class attempts to initialize an instance of LlamaModel using the retrieved model path. If the initialization is successful, an informational log message is generated. If any error occurs during this process, an error message is logged, and a RuntimeError is raised, ensuring the failure is handled gracefully.

The generate_response method is responsible for generating a response based on the given prompt and various parameters such as system_prompt, max_new_tokens, top_k, temperature, and top_p. It calls the generate_text method of the LlamaModel instance and handles any exceptions by logging the error and re-raising the exception. This design ensures that the service layer effectively manages both model initialization and text generation while handling errors appropriately.
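
Because the path is read with os.getenv, you can point the service at your own copy of the weights without touching the code, for example (placeholder path):

MODEL_PATH=/path/to/Llama-3.1-Nemotron-70B-Instruct-HF uvicorn main:app --host 0.0.0.0 --port 8000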

Step 3: API Deployment (main.py)

This file defines the FastAPI application. It sets up the /generate endpoint to accept POST requests with prompt data, and calls the LlamaService to generate responses.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.llama_service import LlamaService
import logging

# Initialize logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="Llama-3.1-Nemotron-70B-Instruct-HF")
service = LlamaService()

class Prompt(BaseModel):
    text: str
    system_prompt: str = "You are an AI assistant"
    max_new_tokens: int = Field(default=1024, gt=0, le=1000000)  # the model is loaded with max_model_len=2048, so output is bounded by the context window
    top_k: int = Field(default=50, ge=0, le=1000)
    temperature: float = Field(default=0.4, gt=0.0, le=1.0)
    top_p: float = Field(default=0.9, gt=0.0, le=1.0)

@app.post("/generate", summary="Generate text response", description="This endpoint generates a text response based on the given prompt.")
async def generate(prompt: Prompt):
    try:
        logger.info(f"Received prompt: {prompt.text}")
        response = service.generate_response(
            prompt.text,
            prompt.system_prompt,
            prompt.max_new_tokens,
            prompt.top_k,
            prompt.temperature,
            prompt.top_p
        )
        logger.info("Text generated successfully")
        return {"response": response}
    except Exception as e:
        logger.error(f"Failed to generate text: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Internal Server Error: {str(e)}")
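
Since pytest and httpx are already in the requirements, a simple smoke test using FastAPI's TestClient could look like this (a hypothetical tests/test_generate.py; importing main loads the full model, so it needs the same GPU setup):

# tests/test_generate.py (hypothetical)
from fastapi.testclient import TestClient

from main import app  # importing main initializes LlamaService and loads the model

client = TestClient(app)

def test_generate():
    payload = {"text": "Say hello in one sentence.", "max_new_tokens": 32}
    response = client.post("/generate", json=payload)
    assert response.status_code == 200
    assert "response" in response.json()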

Running the Application

To run the API:

uvicorn main:app --host 0.0.0.0 --port 8000
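
Once the server is up, you can exercise the /generate endpoint with any HTTP client, for example with curl:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Explain what tensor parallelism is.", "max_new_tokens": 256}'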
