F5 & Nvidia extend AI tie-up to cut inference costs
F5 has expanded its technical integration with Nvidia for AI inference infrastructure, pairing its BIG-IP Next for Kubernetes software with Nvidia's BlueField-3 data processing units (DPUs) as enterprise spending shifts from model training to deployment.
The companies are positioning the combined system around the economics of inference, focusing on token throughput, time to first token, and cost per token. Tokens are the units of output generated by AI models during inference. Higher throughput and lower latency can reduce infrastructure cost per unit of AI output.
Inference has become a larger line item in AI budgets as more organisations move from pilots to production services. Nvidia has argued that real-time deployment is becoming a bigger driver of AI infrastructure demand than training, particularly for interactive applications.
Integration focus
The expanded integration links BIG-IP Next for Kubernetes, which sits in the application traffic path in Kubernetes environments, with BlueField-3 DPUs. DPUs are specialised processors for networking and security tasks. Offloading this work can reduce host CPU overhead and ease data movement to GPUs.
BIG-IP Next for Kubernetes now uses statistics and telemetry from Nvidia components to guide routing and traffic management for inference requests. It draws on Nvidia NIM statistics, runtime signals from Nvidia Dynamo, and GPU telemetry to direct requests to what it determines are the most appropriate accelerators.
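As a rough illustration of telemetry-guided routing, the sketch below scores accelerators by load signals and sends a request to the least-loaded one. The signal names (`queue_depth`, `gpu_util`, `kv_cache_free`) and the weighting are illustrative assumptions, not actual NIM, Dynamo, or GPU-telemetry metric names or F5's routing logic.

```python
# Hypothetical sketch of telemetry-weighted inference routing.
# Metric names and weights are illustrative, not real NIM/Dynamo fields.
from dataclasses import dataclass

@dataclass
class AcceleratorStats:
    name: str
    queue_depth: int      # pending requests reported by the runtime
    gpu_util: float       # 0.0-1.0 utilisation from GPU telemetry
    kv_cache_free: float  # 0.0-1.0 free KV-cache capacity

def score(s: AcceleratorStats) -> float:
    # Lower is better: penalise long queues and saturated GPUs,
    # reward free KV-cache headroom.
    return s.queue_depth * 2.0 + s.gpu_util - s.kv_cache_free

def pick_accelerator(stats: list[AcceleratorStats]) -> str:
    # Route the request to the accelerator with the best (lowest) score.
    return min(stats, key=score).name

pool = [
    AcceleratorStats("gpu-a", queue_depth=4, gpu_util=0.95, kv_cache_free=0.1),
    AcceleratorStats("gpu-b", queue_depth=1, gpu_util=0.60, kv_cache_free=0.5),
]
print(pick_accelerator(pool))  # gpu-b: shorter queue and more headroom
```

The point of weighting queue depth most heavily is that requests stuck behind a long queue are the ones most likely to time out and be retried, which is the "re-compute" problem the companies describe.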
The goal is higher sustained GPU utilisation, which matters because GPUs are often the most expensive component in AI stacks. The companies said the approach reduces "re-compute": repeated work caused by queuing, timeouts, or suboptimal routing decisions under heavy load.
"AI infrastructure is no longer just about access to GPU or scaling their deployments. It has evolved into maximizing economic output per accelerator," said Kunal Anand, Chief Product Officer, F5.
Independent testing
Performance figures cited by the companies come from testing validated by The Tolly Group, a research firm that benchmarks networking and security products. In those tests, BIG-IP Next for Kubernetes running with BlueField-3 DPUs delivered up to a 40% increase in token throughput, a 61% faster time to first token, and a 34% reduction in overall request latency.
These metrics reflect key concerns in production AI environments. Time to first token is critical for user-facing applications because it measures the delay before a model begins responding. Latency captures the broader end-to-end experience, while throughput indicates how much output a platform can produce over time.
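To see how throughput feeds into cost per token, here is a back-of-envelope calculation using the cited 40% throughput figure. The hourly cost and baseline tokens-per-second values are assumed for illustration; only the percentage comes from the testing above.

```python
# Back-of-envelope: effect of a 40% throughput gain on cost per token,
# assuming infrastructure spend stays fixed. Dollar and tps values are
# illustrative assumptions, not figures from the F5/Nvidia testing.
hourly_cost = 100.0            # $/hour for a GPU node (assumed)
baseline_tps = 1000.0          # tokens/second before the integration (assumed)

improved_tps = baseline_tps * 1.40   # up to 40% higher token throughput

cost_before = hourly_cost / (baseline_tps * 3600)
cost_after = hourly_cost / (improved_tps * 3600)

reduction = 1 - cost_after / cost_before
print(f"cost per token falls by {reduction:.1%}")
```

At fixed spend, a 40% throughput increase cuts cost per token by roughly 29% (1 − 1/1.4), which is why the companies frame the integration around per-token economics rather than raw speed.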
The companies also said the improvements did not require model changes. That matters for enterprises that want infrastructure gains without revisiting model tuning or application logic.
"NVIDIA's accelerated computing infrastructure coupled with F5's AI-aware Application Delivery and Security Platform unlocks superior AI factory tokenomics, delivering scalable and cost-effective inference without making any changes to the models," said Kevin Deierling, SVP, Networking, NVIDIA.
Multi-tenant demand
Alongside performance, F5 and Nvidia are pitching the integrated approach for secure multi-tenant AI environments. Multi-tenancy is increasingly important as enterprises share GPU resources across departments and GPU-as-a-service providers build platforms for multiple customers.
In this release, the companies described support for inference-aware routing for agentic AI workflows. They also cited integration with Nvidia's DOCA Platform Framework for BlueField deployment and lifecycle management. The announcement highlighted EVPN-VXLAN networking with dynamic virtual routing and forwarding for network-level tenant separation, plus integrated security, token governance, and observability in Kubernetes environments.
The broader context is a shift toward more persistent, context-aware agent-based applications. These systems can generate different traffic patterns than conventional request-response services, including bursts of inference calls that stress routing, queuing, and resource isolation across shared infrastructure.
Operational implications
F5 is positioning BIG-IP Next for Kubernetes as a control layer for AI infrastructure, where routing, encryption offload, traffic management, and security policy affect cost per token. BlueField DPUs handle more packet processing and cryptographic work that would otherwise compete for CPU cycles on servers connected to GPUs.
This reflects a broader shift in AI infrastructure design toward tuning the entire path between an application and a model. The aim is to keep GPUs focused on inference rather than waiting on bottlenecks in networking, security inspection, or orchestration.
F5 and Nvidia said they will continue to provide tools and best practices for inference-architecture optimisation, focusing on extracting more value from existing GPU deployments as demand for real-time AI services grows.