LiteLLM Proxy - 1K RPS Load test on locust
Tutorial on how to get to 1K+ RPS with LiteLLM Proxy on locust
Pre-Testing Checklist
- Ensure you're using the latest
-stable
version of litellm - Ensure you're following ALL best practices for production
- Locust - Ensure you're Locust instance can create 1K+ requests per second
- 👉 You can use our maintained locust instance here
- If you're self hosting locust
- Use this machine specification for running litellm proxy
- Enterprise LiteLLM - Use
prometheus
as a callback in yourproxy_config.yaml
to get metrics on your load test Setlitellm_settings.callbacks
to monitor success/failures/all types of errorslitellm_settings:
callbacks: ["prometheus"] # Enterprise LiteLLM Only - use prometheus to get metrics on your load test
Load Test - Fake OpenAI Endpoint
Expected Performance
Metric | Value |
---|---|
Requests per Second | 1174+ |
Median Response Time | 96ms |
Average Response Time | 142.18ms |
Run Test
- Add
fake-openai-endpoint
to your proxy config.yaml and start your litellm proxy litellm provides a hostedfake-openai-endpoint
you can load test against
model_list:
- model_name: fake-openai-endpoint
litellm_params:
model: openai/fake
api_key: fake-key
api_base: https://exampleopenaiendpoint-production.up.railway.app/
litellm_settings:
callbacks: ["prometheus"] # Enterprise LiteLLM Only - use prometheus to get metrics on your load test
pip install locust
Create a file called
locustfile.py
on your local machine. Copy the contents from the litellm load test located hereStart locust Run
locust
in the same directory as yourlocustfile.py
from step 2locust -f locustfile.py --processes 4
Run Load test on locust
Head to the locust UI on http://0.0.0.0:8089
Set Users=1000, Ramp Up Users=1000, Host=Base URL of your LiteLLM Proxy
Expected results
Load test - Endpoints with Rate Limits
Run a load test on 2 LLM deployments each with 10K RPM Quota. Expect to see ~20K RPM
Expected Performance
- We expect to see 20,000+ successful responses in 1 minute
- The remaining requests fail because the endpoint exceeds it's 10K RPM quota limit - from the LLM API provider
Metric | Value |
---|---|
Successful Responses in 1 minute | 20,000+ |
Requests per Second | ~1170+ |
Median Response Time | 70ms |
Average Response Time | 640.18ms |
Run Test
- Add 2
gemini-vision
deployments on your config.yaml. Each deployment can handle 10K RPM. (We setup a fake endpoint with a rate limit of 1000 RPM on the/v1/projects/bad-adroit-crow
route below )
All requests with model="gemini-vision"
will be load balanced equally across the 2 deployments.
model_list:
- model_name: gemini-vision
litellm_params:
model: vertex_ai/gemini-1.0-pro-vision-001
api_base: https://exampleopenaiendpoint-production.up.railway.app/v1/projects/bad-adroit-crow-413218/locations/us-central1/publishers/google/models/gemini-1.0-pro-vision-001
vertex_project: "adroit-crow-413218"
vertex_location: "us-central1"
vertex_credentials: /etc/secrets/adroit_crow.json
- model_name: gemini-vision
litellm_params:
model: vertex_ai/gemini-1.0-pro-vision-001
api_base: https://exampleopenaiendpoint-production-c715.up.railway.app/v1/projects/bad-adroit-crow-413218/locations/us-central1/publishers/google/models/gemini-1.0-pro-vision-001
vertex_project: "adroit-crow-413218"
vertex_location: "us-central1"
vertex_credentials: /etc/secrets/adroit_crow.json
litellm_settings:
callbacks: ["prometheus"] # Enterprise LiteLLM Only - use prometheus to get metrics on your load test
pip install locust
Create a file called
locustfile.py
on your local machine. Copy the contents from the litellm load test located hereStart locust Run
locust
in the same directory as yourlocustfile.py
from step 2locust -f locustfile.py --processes 4 -t 60
Run Load test on locust
Head to the locust UI on http://0.0.0.0:8089 and use the following settings
Expected results
- Successful responses in 1 minute = 19,800 = (69415 - 49615)
- Requests per second = 1170
- Median response time = 70ms
- Average response time = 640ms
Prometheus Metrics for debugging load tests
Use the following prometheus metrics to debug your load tests / failures
Metric Name | Description |
---|---|
litellm_deployment_failure_responses | Total number of failed LLM API calls for a specific LLM deployment. Labels: "requested_model", "litellm_model_name", "model_id", "api_base", "api_provider", "hashed_api_key", "api_key_alias", "team", "team_alias", "exception_status", "exception_class" |
litellm_deployment_cooled_down | Number of times a deployment has been cooled down by LiteLLM load balancing logic. Labels: "litellm_model_name", "model_id", "api_base", "api_provider", "exception_status" |
Machine Specifications for Running Locust
Metric | Value |
---|---|
locust --processes 4 | 4 |
vCPUs on Load Testing Machine | 2.0 vCPUs |
Memory on Load Testing Machine | 450 MB |
Replicas of Load Testing Machine | 1 |
Machine Specifications for Running LiteLLM Proxy
👉 Number of Replicas of LiteLLM Proxy=20 for getting 1K+ RPS
Service | Spec | CPUs | Memory | Architecture | Version |
---|---|---|---|---|---|
Server | t2.large . | 2vCPUs | 8GB | x86 |
Locust file used for testing
import os
import uuid
from locust import HttpUser, task, between
class MyUser(HttpUser):
wait_time = between(0.5, 1) # Random wait time between requests
@task(100)
def litellm_completion(self):
# no cache hits with this
payload = {
"model": "fake-openai-endpoint",
"messages": [{"role": "user", "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150 }],
"user": "my-new-end-user-1"
}
response = self.client.post("chat/completions", json=payload)
if response.status_code != 200:
# log the errors in error.txt
with open("error.txt", "a") as error_log:
error_log.write(response.text + "\n")
def on_start(self):
self.api_key = os.getenv('API_KEY', 'sk-1234')
self.client.headers.update({'Authorization': f'Bearer {self.api_key}'})