IT Brief Ireland - Technology news for CIOs & IT decision-makers

F5 boosts Kubernetes AI inference with NVIDIA BlueField-3

Thu, 19th Mar 2026

F5 has expanded BIG-IP Next for Kubernetes with NVIDIA BlueField-3 DPUs, targeting AI inference infrastructure used by enterprises and cloud providers.

The updated offering is designed to improve the economics of running AI workloads by increasing token throughput, reducing latency and supporting shared infrastructure for multiple users. In AI systems, tokens are the units of output generated during inference, such as words, symbols or data fragments.

BIG-IP Next for Kubernetes now uses NVIDIA NIM statistics, Dynamo runtime signals and GPU telemetry to make routing decisions before workloads are executed. The goal is to direct inference jobs to the most suitable accelerators in real time while reducing delays and recompute.
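The article does not detail how these routing decisions are computed, but the general pattern of telemetry-driven inference routing can be sketched as follows. This is a hypothetical illustration, not F5's implementation: the `Accelerator` fields, the scoring weights and the `route` helper are all assumptions standing in for the NIM statistics, Dynamo runtime signals and GPU telemetry mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    queue_depth: int        # pending requests reported by the runtime (assumed signal)
    gpu_utilisation: float  # 0.0-1.0, from GPU telemetry (assumed signal)
    kv_cache_free: float    # fraction of KV cache still available (assumed signal)

def score(acc: Accelerator) -> float:
    # Lower is better: penalise long queues and busy GPUs, and reward
    # free KV cache so prompts are less likely to need recomputing.
    # The weights here are arbitrary, for illustration only.
    return acc.queue_depth + 10 * acc.gpu_utilisation - 5 * acc.kv_cache_free

def route(pool: list[Accelerator]) -> Accelerator:
    """Send the next inference job to the least congested accelerator."""
    return min(pool, key=score)

pool = [
    Accelerator("gpu-a", queue_depth=8, gpu_utilisation=0.9, kv_cache_free=0.1),
    Accelerator("gpu-b", queue_depth=2, gpu_utilisation=0.4, kv_cache_free=0.7),
]
print(route(pool).name)  # gpu-b: shorter queue, cooler GPU, more cache free
```

The key point is that the decision happens before the workload executes, using signals the data plane already sees, rather than reacting after a GPU is saturated.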

The announcement reflects a broader shift in the AI market as businesses move from experimentation to commercial services. That transition has increased focus on operational measures such as time to first token, cost per token and the amount of output each GPU can sustain.
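The operational measures named above are simple to define. As a minimal sketch (the dollar figure and throughput below are assumed, not from the article):

```python
def time_to_first_token(request_start: float, first_token_at: float) -> float:
    """Seconds a user waits before any output appears (TTFT)."""
    return first_token_at - request_start

def cost_per_token(gpu_hour_cost: float, tokens_per_second: float) -> float:
    """Cost of one generated token, given sustained throughput on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour

# e.g. an assumed $2/hour GPU sustaining an assumed 1,000 tokens/s
print(cost_per_token(2.0, 1000.0))
print(time_to_first_token(10.0, 10.35))
```

Framed this way, anything that raises sustained tokens per second on the same hardware directly lowers cost per token, which is why routing and offload improvements matter commercially.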

Performance data

Testing validated by The Tolly Group found that BIG-IP Next for Kubernetes running with NVIDIA BlueField-3 DPUs delivered up to a 40% increase in token throughput, a 61% faster time to first token and a 34% reduction in overall request latency, according to F5.
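To see what the headline throughput figure means economically, consider a quick calculation using the reported 40% uplift (the baseline throughput below is an assumed, illustrative number):

```python
baseline_tps = 1_000.0  # assumed baseline, tokens/second, for illustration
uplift = 0.40           # 40% throughput gain reported by F5 / The Tolly Group

improved_tps = baseline_tps * (1 + uplift)

# At a fixed hardware cost, cost per token scales inversely with throughput,
# so a 40% throughput gain cuts cost per token by 1 - 1/1.4.
cost_reduction = 1 - 1 / (1 + uplift)

print(improved_tps)                    # 1400.0
print(round(cost_reduction * 100, 1))  # 28.6 (percent)
```

In other words, on the vendor's own figures, the same GPU estate serves roughly two-sevenths more cheaply per token without any hardware additions.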

Those gains were achieved without changes to existing AI models. The setup offloads networking, encryption, AI-aware load balancing and traffic management tasks to the DPU, leaving host CPUs and GPUs with more capacity for inference work.

That is significant for operators seeking better returns on costly GPU estates. Enterprises and GPU-as-a-service providers are increasingly focused on extracting more output from installed hardware rather than simply adding more accelerators.

"AI infrastructure is no longer just about access to GPU or scaling their deployments. It has evolved into maximising economic output per accelerator," said Kunal Anand, Chief Product Officer, F5.

He added: "Together with NVIDIA, we are enabling AI factories to treat token production as a measurable business metric. BIG-IP Next for Kubernetes provides the intelligence and governance required to increase GPU yield, reduce cost per token, and scale shared AI platforms confidently."

NVIDIA described the integration as a way to improve inference efficiency without altering the underlying models. The collaboration combines F5's Kubernetes-based traffic management layer with BlueField DPUs, which are designed to handle data processing and networking functions outside the main CPU.

"NVIDIA's accelerated computing infrastructure coupled with F5's AI-aware Application Delivery and Security Platform unlocks superior AI factory tokenomics-delivering scalable and cost-effective inference without making any changes to the models," said Kevin Deierling, SVP, Networking, NVIDIA.

He added: "Together, F5 and NVIDIA are empowering enterprises to scale AI factory inference efficiently and economically."

Shared AI

The new features also address the growing use of agent-driven AI workloads, which can involve persistent, context-aware interactions rather than one-off requests. Those patterns place more pressure on traffic control and resource allocation across clusters.

The platform now supports inference-aware routing for agentic AI workflows, integration with the NVIDIA DOCA Platform Framework for BlueField deployment and lifecycle management, EVPN-VXLAN with dynamic VRFs for network-level multi-tenancy, and integrated security, token governance and observability in Kubernetes environments.
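The article does not describe how token governance works internally; a minimal sketch of the general idea, with a wholly hypothetical `TokenBudget` class (not F5's API), might look like this:

```python
class TokenBudget:
    """Hypothetical per-tenant token quota, illustrating the kind of
    'token governance' a shared AI platform needs: each tenant gets a
    fixed allocation, and requests that would exceed it are refused."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def try_consume(self, tokens: int) -> bool:
        """Reserve tokens for a request; return False if over quota."""
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True

budget = TokenBudget(limit=100)
print(budget.try_consume(60))  # True: 60 of 100 used
print(budget.try_consume(60))  # False: would exceed the 100-token quota
```

Enforcing quotas like this at the traffic layer, alongside network-level isolation such as EVPN-VXLAN with per-tenant VRFs, is what lets one GPU pool serve many tenants without one customer's workload starving another's.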

Multi-tenancy has become a key issue for cloud providers and large organisations that want to share GPU infrastructure across internal teams or external customers. The updated product is intended to help preserve performance isolation and service consistency in those environments.

Market pressure

The commercial pressure behind these changes is straightforward. AI infrastructure remains expensive, and operators are looking for ways to lower the cost of each unit of output without degrading user experience.

That has led vendors to focus on what some in the sector describe as AI factory economics, where the value of infrastructure is measured by sustained output and utilisation rather than raw installed hardware. In that context, routing, network overhead and latency become as much a part of the financial equation as of the technical one.

BIG-IP Next for Kubernetes is intended to act as a control layer for those decisions by governing token consumption, traffic flows and infrastructure use. F5 argues that organisations can increase returns from GPUs already in production instead of compensating for inefficiencies through overprovisioning.

The figures released alongside the update suggest vendors are trying to shift discussion of AI systems away from headline compute numbers and towards the practical cost of serving inference requests at scale. For cloud providers and enterprises building revenue-generating AI services, that puts more emphasis on throughput, latency and resource sharing than on model changes alone.

In the test results cited by F5, the performance gains were delivered within an existing infrastructure footprint and without model modifications.