Introduction
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. It scores 85.0 on Arena Hard, 57.6 on AlpacaEval 2 LC, and 8.98 on GPT-4-Turbo MT-Bench, all of which are known to be predictive of LMSys Chatbot Arena Elo. As of 1 October 2024, Llama-3.1-Nemotron-70B-Instruct ranks #1 on all three of these automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), surpassing strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
In this guide, we will walk through the process of deploying the Llama-3.1-Nemotron-70B-Instruct model using vLLM and FastAPI. This setup combines vLLM's efficient model-serving capabilities with FastAPI's lightweight API framework to create a scalable and efficient deployment for large language models. We'll go through the entire codebase, explaining each section in detail, and finally deploy the API using Uvicorn.
Prerequisites
Before getting started, make sure you have the following Python packages installed. You can install them using pip:
pip install fastapi==0.111.0 transformers==4.45.2 peft==0.12.0 trl==0.10.1 bitsandbytes==0.43.3 pytest httpx uvicorn==0.30.1 vllm==0.6.3.post1
Ensure you have NVIDIA GPUs with sufficient memory: in bfloat16 the 70B weights alone take roughly 140 GB, and the configuration below shards the model across 4 GPUs via tensor parallelism (e.g., 4x A100 80GB). CUDA and PyTorch should be set up correctly to utilize the GPUs.
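As an optional sanity check before loading the model, you can confirm that PyTorch sees the GPUs:

import torch

# Optional sanity check: list the GPUs visible to PyTorch
assert torch.cuda.is_available(), "CUDA is not available; check your driver/PyTorch installation"
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")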
Download Llama-3.1-Nemotron-70B-Instruct-HF from Hugging Face:
https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
or
huggingface-cli download nvidia/Llama-3.1-Nemotron-70B-Instruct-HF --local-dir Llama-3.1-Nemotron-70B-Instruct-HF
Project Structure
The project is structured as follows:
project_root/
│
├── src/
│ ├── llama_model.py # Defines the model and handles initialization
│ ├── llama_service.py # Service layer for generating responses
│
├── main.py # FastAPI application and endpoint definitions
├── requirements.txt # Dependencies for the project
└── README.md # Project documentation
Step 1: Model Initialization (src/llama_model.py)
In this module, we define a class called LlamaModel to initialize and configure the model using vLLM. This class is crucial for setting up the environment and loading the model efficiently.
Code Walkthrough
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import torch
import os
import logging
import time
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"
os.environ["TOKENIZERS_PARALLELISM"] = "True"
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
TENSOR_PARALLEL_SIZE = 4
GPU_MEMORY_UTILIZATION = 0.95
class LlamaModel:
def __init__(self, model_path: str, tensor_parallel_size: int = TENSOR_PARALLEL_SIZE, gpu_memory_utilization=GPU_MEMORY_UTILIZATION):
self.model_path = model_path
self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
self.tensor_parallel_size = tensor_parallel_size
self.gpu_memory_utilization = gpu_memory_utilization
self.num_gpus = torch.cuda.device_count()
logger.info(f"Number of GPUs available: {self.num_gpus}")
try:
self.model = LLM(
model=self.model_path,
tensor_parallel_size=self.tensor_parallel_size,
dtype=torch.bfloat16,
trust_remote_code=True,
max_num_seqs=125,
max_model_len=2048,
# quantization="bitsandbytes",
# load_format="bitsandbytes",
gpu_memory_utilization=self.gpu_memory_utilization
)
logger.info(f"Model loaded successfully from {model_path}")
except Exception as e:
logger.error(f"Error loading model: {str(e)}")
raise RuntimeError("Model loading failed")
def generate_text(
self,
prompt: str,
system_prompt: str = "You are an asistant",
max_new_tokens: int = 2000,
top_k: int = 50,
temperature: float = 0.4,
top_p: float = 0.9
) -> str:
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt},
]
try:
prompts = self.tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=False
)
sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_new_tokens, top_k=top_k)
start_time = time.time() # Start timing before generating
outputs = self.model.generate(prompts, sampling_params)
end_time = time.time() # End timing after generation
generated_text = outputs[0].outputs[0].text
# Calculate metrics
self.num_tokens_generated = len(self.tokenizer.encode(generated_text))
self.generation_time = end_time - start_time
self.throughput = self.num_tokens_generated / self.generation_time if self.generation_time > 0 else 0
# Log metrics
logger.info(f"Tokens generated: {self.num_tokens_generated}")
logger.info(f"Time taken: {self.generation_time:.2f} seconds")
logger.info(f"Throughput: {self.throughput:.2f} tokens per second")
return generated_text
except Exception as e:
logger.error(f"Error generating text: {str(e)}")
raise RuntimeError("Text generation failed")
This code defines a class called LlamaModel that leverages the vLLM library and Hugging Face's transformers for loading and generating text with the Llama model. The code begins by setting environment variables for CUDA memory allocation and parallel tokenization, optimizing resource usage. The class initializes the model using the specified model_path, setting parameters for tensor parallelism and GPU memory utilization, which is critical for efficient inference on large models.
In the __init__ method, the code checks the number of available GPUs and logs this information. It attempts to load the model with various configurations, including bfloat16 precision for reduced memory usage and faster computation. If loading fails, an error is logged and a runtime exception is raised.
The generate_text method takes user input (a prompt) and parameters for the generation process, such as the maximum number of new tokens and sampling settings like top_k, temperature, and top_p. It formats the prompt using the tokenizer's chat template and defines sampling parameters. The method times the generation process to measure throughput and performance, logging metrics such as the number of tokens generated and the time taken.
Overall, this code is structured to efficiently load a large model for inference and manage the generation process with flexibility, while also logging critical information for performance monitoring and troubleshooting.
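Before wiring this into the API, it can be useful to exercise the class on its own. A minimal sketch, assuming the weights were downloaded into a local folder named Llama-3.1-Nemotron-70B-Instruct-HF (adjust the path and prompt to your setup):

from src.llama_model import LlamaModel

# Hypothetical smoke test; point the path at your local copy of the weights
model = LlamaModel("Llama-3.1-Nemotron-70B-Instruct-HF")
print(model.generate_text("How many r's are in the word strawberry?", max_new_tokens=128))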
Step 2: Service Layer (src/llama_service.py)
The service layer acts as a mediator between the model and the API layer. It handles prompt inputs and calls the model's generate_text method.
import os
from src.llama_model import LlamaModel
import logging
logger = logging.getLogger(__name__)
MODEL_PATH = os.getenv("MODEL_PATH", "../../local-llama-models/Llama-3.1-Nemotron-70B-Instruct-HF")
class LlamaService:
def __init__(self):
try:
self.model = LlamaModel(MODEL_PATH)
logger.info("LlamaService initialized successfully")
except Exception as e:
logger.error(f"Failed to initialize LlamaService: {str(e)}")
raise RuntimeError("LlamaService initialization failed")
def generate_response(self, prompt: str, system_prompt: str, max_new_tokens: int, top_k: int, temperature: float, top_p: float) -> str:
try:
return self.model.generate_text(prompt, system_prompt=system_prompt, max_new_tokens=max_new_tokens, top_k=top_k, temperature=temperature, top_p=top_p)
except Exception as e:
logger.error(f"Failed to generate response: {str(e)}")
raise
The code defines a class called LlamaService that serves as a service layer for interacting with the LlamaModel. It first imports the necessary libraries and retrieves the model path from the MODEL_PATH environment variable, which defaults to the specified path if it is not set.
In the __init__ method, the class attempts to initialize an instance of LlamaModel using the retrieved model path. If the initialization is successful, an informational log message is generated. If any error occurs during this process, an error message is logged and a RuntimeError is raised, ensuring the failure is handled gracefully.
The generate_response method is responsible for generating a response based on the given prompt and various parameters such as system_prompt, max_new_tokens, top_k, temperature, and top_p. It calls the generate_text method of the LlamaModel instance and handles any exceptions by logging the error and re-raising the exception. This design ensures that the service layer effectively manages both model initialization and text generation while handling errors appropriately.
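The default MODEL_PATH above points to ../../local-llama-models/Llama-3.1-Nemotron-70B-Instruct-HF, so if you downloaded the weights elsewhere (for example into the project root with the huggingface-cli command from the prerequisites), override the environment variable before the service is constructed. A minimal sketch, with the path as an assumption:

import os

# Hypothetical override: point MODEL_PATH at your local weights before importing the service
os.environ["MODEL_PATH"] = "Llama-3.1-Nemotron-70B-Instruct-HF"

from src.llama_service import LlamaService

service = LlamaService()
print(service.generate_response(
    "Summarize what vLLM does in one sentence.",
    system_prompt="You are an AI assistant",
    max_new_tokens=128,
    top_k=50,
    temperature=0.4,
    top_p=0.9,
))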
Step 3: API Deployment (main.py)
This file defines the FastAPI application. It sets up the /generate endpoint to accept POST requests with prompt data and calls the LlamaService to generate responses.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from src.llama_service import LlamaService
import logging
# Initialize logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="Llama-3.1-Nemotron-70B-Instruct-HF")
service = LlamaService()
class Prompt(BaseModel):
text: str
system_prompt: str = "You are an AI assistant"
    max_new_tokens: int = Field(default=1024, gt=0, le=2048)  # keep requests within the max_model_len (2048) configured in llama_model.py
top_k: int = Field(default=50, ge=0, le=1000)
temperature: float = Field(default=0.4, gt=0.0, le=1.0)
top_p: float = Field(default=0.9, gt=0.0, le=1.0)
@app.post("/generate", summary="Generate text response", description="This endpoint generates a text response based on the given prompt.")
async def generate(prompt: Prompt):
try:
logger.info(f"Received prompt: {prompt.text}")
response = service.generate_response(
prompt.text,
prompt.system_prompt,
prompt.max_new_tokens,
prompt.top_k,
prompt.temperature,
prompt.top_p
)
logger.info("Text generated successfully")
return {"response": response}
except Exception as e:
logger.error(f"Failed to generate text: {str(e)}")
raise HTTPException(status_code=500, detail=f"Internal Server Error: {str(e)}")
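Since pytest and httpx are already in the dependency list, you can add a simple end-to-end test using FastAPI's TestClient. The file name test_main.py is only a suggestion, and note that importing main constructs LlamaService and loads the full model, so this sketch needs the same GPU setup as the server:

from fastapi.testclient import TestClient

from main import app  # importing main builds LlamaService and loads the model

client = TestClient(app)

def test_generate_returns_response():
    # Smoke test: a short prompt should return HTTP 200 and a non-empty "response" field
    payload = {"text": "Say hello in one short sentence.", "max_new_tokens": 32}
    result = client.post("/generate", json=payload)
    assert result.status_code == 200
    assert result.json()["response"]

Run it with pytest test_main.py.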
Running the Application
To run the API:
uvicorn main:app --host 0.0.0.0 --port 8000
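Once the server is running, you can call the /generate endpoint from any HTTP client, for example with httpx, which is already installed above (the prompt text is just a placeholder):

import httpx

# Example request against the running server (assumes it listens on localhost:8000)
payload = {
    "text": "Explain tensor parallelism in two sentences.",
    "system_prompt": "You are an AI assistant",
    "max_new_tokens": 256,
    "top_k": 50,
    "temperature": 0.4,
    "top_p": 0.9,
}
response = httpx.post("http://localhost:8000/generate", json=payload, timeout=300.0)
print(response.json()["response"])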