Life of a Request

High Level architecture

Request Flow

User Sends Request: The process begins when a user sends a request to the LiteLLM Proxy Server (Gateway).
Virtual Keys: At this stage the Bearer token in the request is checked to ensure it is valid and under it's budget. Here is the list of checks that run for each request
- 2.1 Check if the Virtual Key exists in Redis Cache or In Memory Cache
- 2.2 If not in Cache, Lookup Virtual Key in DB
Rate Limiting: The MaxParallelRequestsHandler checks the rate limit (rpm/tpm) for the the following components:
- Global Server Rate Limit
- Virtual Key Rate Limit
- User Rate Limit
- Team Limit
LiteLLM proxy_server.py: Contains the /chat/completions and /embeddings endpoints. Requests to these endpoints are sent through the LiteLLM Router
LiteLLM Router: The LiteLLM Router handles Load balancing, Fallbacks, Retries for LLM API deployments.
litellm.completion() / litellm.embedding(): The litellm Python SDK is used to call the LLM in the OpenAI API format (Translation and parameter mapping)
Post-Request Processing: After the response is sent back to the client, the following asynchronous tasks are performed:
- Logging to LangFuse (logging destination is configurable)
- The MaxParallelRequestsHandler updates the rpm/tpm usage for the
  - Global Server Rate Limit
  - Virtual Key Rate Limit
  - User Rate Limit
  - Team Limit
- The _PROXY_track_cost_callback updates spend / usage in the LiteLLM database. Here is everything tracked in the DB per request

Life of a Request

High Level architecture​

Request Flow​

High Level architecture

Request Flow