The enterprise adoption of Large Language Models (LLMs) presents architectural challenges that extend far beyond hosting a model. Organizations demand robust, high-performance API gateways capable of providing identity-based routing, strict rate limiting, zero-trust security, and granular observability. LLM traffic is uniquely demanding because it typically involves long-lived, chunked HTTP/2 streams that generate intense memory and connection pressure. A traditional API gateway risks imposing a “streaming tax”: the latency overhead a proxy adds when it buffers these continuous token streams.
To evaluate these demands, we engineered a comprehensive validation framework for the Envoy AI Gateway deployed on a VMware vSphere Kubernetes Service (VKS) cluster running on top of VMware Cloud Foundation (VCF). Instead of treating this as a simple functional check, we strove to make our evaluation as rigorous and scientifically grounded as possible. By relying on realistic traffic simulations and massive, high-cardinality datasets, we aimed to empirically evaluate the gateway’s routing efficiency under complex, unpredictable workloads.
Here is a look into our methodology and the architectural discoveries we made along the way. We believe you will find these insights useful for scaling your own AI infrastructure.
1. Simulating Real-World Traffic with Queuing Theory Models
Traditional performance testing methodologies often rely on static think times or smooth request intervals, which fail to capture the chaotic, unpredictable nature of LLM workloads. Real-world enterprise traffic features sudden, sharp spikes followed by long periods of inactivity.
To build a genuinely realistic validation suite, we abandoned static delays. Instead, we engineered our tests using specific mathematical arrival distributions mapped to established queuing theory models:
- Markov-Modulated Poisson Processes (MMPP): We utilized this to simulate multi-turn API Orchestration. Autonomous agents do not trigger independent requests, as their tool execution graphs create causal, cascaded chains. We used a 2-State ON/OFF Burst Generator to mimic the intense, highly correlated logic cycles of an agentic swarm rapidly consuming external APIs.
- Gamma Distribution (CV=1.5): We utilized this to model automated CI/CD logging and Terminal Siege tasks. Real-world empirical data confirms that request patterns are heavily skewed. The high variance of the Gamma distribution rigorously tested the system’s ability to handle massive, unstructured text blocks hitting the gateway unpredictably, maximizing Key-Value (KV) cache thrashing.
- Exponential Distributions: We utilized these for UI Navigation and Developer Assistance profiles. Grounded deeply in classical queuing theory, exponential inter-arrival times accurately simulate uncoordinated human pacing, reflecting how human knowledge workers read, pause, and interact with interfaces independently.
This interplay of distinct mathematical models created a chaotic, highly realistic load signature that genuinely pushed the limits of the hardware. It allowed us to pinpoint the architectural compute boundary, known as the T2 Saturation Point, at 224 concurrent users.
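As a rough illustration of the three arrival models described above, the sketch below samples inter-arrival times with Python's standard library. It is not the harness we ran: the function names, rates, and dwell times are hypothetical, and the only value taken from the text is CV=1.5. For a Gamma distribution the coefficient of variation is 1/sqrt(shape), so a target CV fixes the shape at 1/CV², and the scale follows from the desired mean gap.

```python
import random

def exponential_gaps(rate_per_s, n, rng):
    """Uncoordinated human pacing: i.i.d. exponential gaps (CV = 1)."""
    return [rng.expovariate(rate_per_s) for _ in range(n)]

def gamma_gaps(mean_gap_s, cv, n, rng):
    """Skewed, bursty gaps with a chosen coefficient of variation.

    Gamma(shape k, scale theta) has mean k*theta and CV 1/sqrt(k),
    so k = 1/cv**2 and theta = mean_gap_s * cv**2.
    """
    shape = 1.0 / cv**2
    scale = mean_gap_s * cv**2
    return [rng.gammavariate(shape, scale) for _ in range(n)]

def mmpp_on_off_arrivals(rate_on, rate_off, mean_on_s, mean_off_s, horizon_s, rng):
    """Arrival timestamps from a 2-state ON/OFF Markov-modulated Poisson process.

    Each state lasts an exponential dwell time; inside a state, arrivals are
    Poisson at that state's rate (rate_off may be 0 for a silent OFF state).
    """
    t, on, arrivals = 0.0, True, []
    while t < horizon_s:
        dwell = rng.expovariate(1.0 / (mean_on_s if on else mean_off_s))
        end = min(t + dwell, horizon_s)
        rate = rate_on if on else rate_off
        if rate > 0:
            # Restarting the exponential clock at a state change is valid
            # because the exponential distribution is memoryless.
            arrival = t + rng.expovariate(rate)
            while arrival < end:
                arrivals.append(arrival)
                arrival += rng.expovariate(rate)
        t, on = end, not on
    return arrivals

if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so a run is repeatable
    print(len(mmpp_on_off_arrivals(50.0, 0.0, 1.0, 4.0, 60.0, rng)), "burst arrivals in 60 s")
    print(sum(gamma_gaps(2.0, 1.5, 1000, rng)) / 1000, "mean gamma gap (s)")
```

A load generator would sleep for each sampled gap (or schedule each MMPP timestamp) before firing the next request, with one sampler per workload profile.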
2. Avoiding the “Prefix Caching Trap” with High-Cardinality Data
Testing with low-cardinality data, such as repeatedly looping through the same ten user sessions, creates a devastating architectural trap. Under massive concurrency, identical repeated prompts cause the inference engine’s Prefix Caching (PC) mechanism to achieve an artificial hit rate approaching 99.9%. This bypasses the prefill compute phase and entirely masks the true processing bottlenecks and VRAM fragmentation that occur under authentic production load.
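The effect can be reproduced with a toy model. The sketch below is a deliberate simplification: it treats each session as a single cacheable prefix in an LRU cache, whereas a real inference engine caches at a finer granularity. The function name, the cache size, and the request counts are all hypothetical; only the contrast between ten looping sessions and roughly 20,000 unique ones comes from the text.

```python
import random
from collections import OrderedDict

def prefix_cache_hit_rate(num_unique_sessions, num_requests, cache_slots, seed=0):
    """Fraction of requests whose session prefix is already cached (LRU eviction)."""
    rng = random.Random(seed)
    cache = OrderedDict()  # session id -> placeholder for cached prefix state
    hits = 0
    for _ in range(num_requests):
        session = rng.randrange(num_unique_sessions)
        if session in cache:
            hits += 1
            cache.move_to_end(session)      # mark as most recently used
        else:
            cache[session] = True           # "prefill" and store the prefix
            if len(cache) > cache_slots:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / num_requests

if __name__ == "__main__":
    low = prefix_cache_hit_rate(num_unique_sessions=10, num_requests=10_000, cache_slots=1_000)
    high = prefix_cache_hit_rate(num_unique_sessions=20_000, num_requests=10_000, cache_slots=1_000)
    print(f"10 looping sessions:    hit rate {low:.3f}")
    print(f"20,000 unique sessions: hit rate {high:.3f}")
```

With only ten sessions, every request after the first visit to each session is a hit, so prefill work almost disappears from the measurement. With a session pool far larger than the cache, most requests miss and the prefill cost becomes visible again.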
To rigorously test the compute layer, we utilized a massive 124GB high-cardinality dataset explicitly engineered to contain over 20,000 unique interaction sessions:
- Profile A (API Orchestration): This profile included 5,000 unique sessions from ToolBench-v1.
- Profile B (Terminal Siege): This profile included 865 unique sessions from Terminal-Bench 2.0.
- Profile C (UI Navigation): This profile included 5,000 unique sessions from OmniACT.
- Profiles D & E (Developer Assistance): These profiles included 10,000 combined sessions from SWE-rebench and CoderForge.
Equally critical was our implementation of a “Zero-CPU Parsing Guarantee.” We unrolled the vast conversational trajectories offline into standalone HTTP payloads contained in .jsonl files. At runtime, the load generation workers simply read raw strings and fired bytes over the wire, eliminating JSON serialization overhead. This prevented client-side CPU exhaustion from artificially capping the request throughput before the gateway was ever fully stressed.
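The split between offline preparation and runtime replay might look like the following sketch. It assumes an OpenAI-style chat-completions request body and emits one payload per assistant turn; both the payload shape and the helper names (`unroll_session`, `write_jsonl`, `iter_raw_payloads`) are hypothetical illustrations, not the format of our actual dataset.

```python
import json

def unroll_session(messages, model):
    """Offline step: emit one standalone request body per assistant turn.

    Each body carries the conversation history up to (not including) that
    turn, so it can be replayed on its own without any client-side state.
    """
    lines = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            body = {"model": model, "messages": messages[:i], "stream": True}
            # json.dumps never emits raw newlines, so one body fits on one line.
            lines.append(json.dumps(body, separators=(",", ":")))
    return lines

def write_jsonl(path, lines):
    """Offline step: persist the pre-serialized bodies, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

def iter_raw_payloads(path):
    """Runtime step: yield each body as raw bytes with no JSON parsing."""
    with open(path, "rb") as f:
        for raw in f:
            payload = raw.rstrip(b"\n")
            if payload:
                yield payload
```

At runtime a worker would pass each yielded `bytes` object straight to its HTTP client as the request body, so the only per-request client work is a file read and a socket write.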
3. Verifying Core Enterprise Invariants
Before executing high-concurrency stress tests, it was imperative to programmatically validate Envoy’s core routing and security capabilities in an isolated suite. The gateway relies on native Custom Resource Definition (CRD) capabilities to enforce policy-driven control at the edge:
- Identity-Based Routing (AIGatewayRoute): The gateway dynamically inspects the x-workload-app HTTP header to identify the calling application. It routes standard background operations to a cost-effective on-premise vLLM cluster (Tier 2), and elevates complex, reasoning-heavy agentic tasks to an external cloud model (Tier 1). Traffic identifying as unregistered is immediately rejected directly at the edge with an HTTP 404 Not Found response.
- Zero-Trust Security (BackendSecurityPolicy): Enterprise security mandates that edge clients must never possess direct access to upstream backend API keys. The gateway seamlessly intercepts inbound traffic, proactively strips client-provided authorization headers, and securely injects genuine API keys retrieved dynamically from Kubernetes Secrets. Direct bypass attempts fail entirely with HTTP 400 authentication errors.
- Token-Aware Rate Limiting (BackendTrafficPolicy): To prevent any single rogue agent from monopolizing the GPU cluster’s Video RAM (VRAM), we established token-aware local rate limiting. The gateway successfully enforces strict budgets by proactively shedding excess load with HTTP 429 Too Many Requests responses. This critical defense mechanism guarantees that explosive parallel bursts do not trigger Out-Of-Memory (OOM) crashes on the backend.
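The three invariants above can be modeled in a few lines of Python to make the expected behavior concrete. This is a behavioral sketch only, not Envoy configuration and not the gateway's implementation: the application names, backend labels, and budget numbers are hypothetical, and the rate limiter assumes that token usage is deducted after each response and that a request is shed once the budget is exhausted.

```python
import time

# Hypothetical registered workloads and the tier each one maps to.
TIER_BY_APP = {
    "background-batch": "tier2-onprem-vllm",
    "agentic-reasoning": "tier1-cloud",
}

def route(headers):
    """Identity-based routing: pick a backend from x-workload-app, else 404."""
    backend = TIER_BY_APP.get(headers.get("x-workload-app", ""))
    return (200, backend) if backend else (404, None)

def upstream_headers(headers, backend_api_key):
    """Zero-trust: drop any client Authorization header, inject the real key."""
    clean = {k: v for k, v in headers.items() if k.lower() != "authorization"}
    clean["Authorization"] = f"Bearer {backend_api_key}"
    return clean

class TokenBudget:
    """Token-aware limiter: the budget is counted in LLM tokens, not requests."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.remaining = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        gained = (now - self.last) * self.refill_per_s
        self.remaining = min(self.capacity, self.remaining + gained)
        self.last = now

    def admit(self):
        """Return 200 while budget remains, 429 once it is exhausted."""
        self._refill()
        return 200 if self.remaining > 0 else 429

    def record_usage(self, tokens):
        """Deduct the tokens a completed response actually consumed."""
        self._refill()
        self.remaining -= tokens
```

A validation suite can assert these outcomes against the live gateway in the same way: an unregistered `x-workload-app` value should yield 404, the upstream request should never carry the client's own credentials, and a caller that overspends its token budget should receive 429 until the budget refills.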
Conclusion
By grounding our methodology in real-world queuing theory and massive, high-cardinality data, we showed that the Envoy AI Gateway is an exceptionally robust control plane. The system protected the strict Service Level Objectives (SLOs) required for human interactivity, even while managing chaotic, highly correlated agentic bursts. These empirical findings confirm that organizations do not have to sacrifice inference speed to achieve foundational enterprise security, routing, and reliability. For a deeper dive into these findings, refer to our white paper, “Gateway to Enterprise AI: Architecting and Scaling LLM Workloads with Envoy.”