Optimizing AI Efficiency: NVIDIA and Google Collaborate to Cut Inference Costs
At the Google Cloud Next conference, Google and NVIDIA unveiled a hardware roadmap aimed at cutting the cost of AI inference at scale.
The companies introduced the new A5X bare-metal instances, powered by NVIDIA Vera Rubin NVL72 rack-scale systems. This architecture, developed through hardware and software co-design, aims to reduce inference cost per token by up to ten times compared with previous generations, while achieving ten times higher token throughput per megawatt.
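As a rough illustration of what those headline multipliers mean in practice, here is a minimal sketch; the baseline figures below are placeholders assumed for illustration, not published benchmarks:

```python
# Hypothetical baseline figures (assumed for illustration only).
prev_cost_per_million_tokens = 10.00   # USD per million tokens, previous generation
prev_tokens_per_megawatt = 1_000_000   # tokens per second per megawatt, previous generation

IMPROVEMENT = 10  # the "up to ten times" claim for Vera Rubin NVL72

# Applying both claimed gains to the assumed baseline:
new_cost_per_million_tokens = prev_cost_per_million_tokens / IMPROVEMENT
new_tokens_per_megawatt = prev_tokens_per_megawatt * IMPROVEMENT

print(new_cost_per_million_tokens)  # 1.0
print(new_tokens_per_megawatt)      # 10000000
```

The point of the multipliers compounding at scale is that a tenfold cost-per-token reduction applies to every token served, so the absolute savings grow with inference volume.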
Connecting thousands of processors risks communication bottlenecks; the A5X instances address this by pairing NVIDIA ConnectX-9 SuperNICs with Google Virgo networking technology.
This setup can scale to 80,000 NVIDIA Rubin GPUs within a single-site cluster and up to 960,000 GPUs across a multisite deployment. Operating at this scale requires sophisticated workload management to route data efficiently across nearly a million parallel processors.
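Purely as arithmetic on the stated maxima, a full multisite deployment spans a dozen single-site clusters:

```python
# Scale figures from the announcement.
gpus_per_site = 80_000      # NVIDIA Rubin GPUs in one single-site cluster
gpus_multisite = 960_000    # GPUs across a maximum multisite deployment

# Number of single-site clusters implied by the multisite maximum.
sites_at_maximum = gpus_multisite // gpus_per_site
print(sites_at_maximum)  # 12
```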
Mark Lohmeyer, VP and GM of AI and Computing Infrastructure at Google Cloud, emphasized the importance of running demanding workloads on an integrated, AI-optimized infrastructure stack for the next decade of AI advancements.
Addressing data governance concerns, Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs are now available on Google Distributed Cloud. This deployment method enables organizations to keep sensitive frontier models within their controlled environments to meet compliance mandates.
With NVIDIA Confidential Computing, data remains encrypted even while being processed, securing training environments. The new Confidential G4 VMs, equipped with NVIDIA RTX PRO 6000 Blackwell GPUs, give regulated industries high-performance hardware while maintaining data privacy standards.
For agentic AI training, NVIDIA Nemotron 3 Super is integrated into the Gemini Enterprise Agent Platform, offering tools for developers to customize reasoning and multimodal models for agentic tasks.
The Managed Training Clusters on the Gemini Enterprise Agent Platform reduce operational overhead by automating cluster sizing, failure recovery, and job execution, letting data science teams focus on model quality.
For physical simulations and integration with existing industrial systems, NVIDIA's AI infrastructure and physical AI libraries on Google Cloud support heavy industry and manufacturing applications by providing tools for simulating real-world workflows.
By leveraging NVIDIA NIM microservices, developers can deploy vision-based agents and robots to interpret and navigate physical environments, advancing industrial digital twin capabilities.
The collaboration between NVIDIA and Google Cloud aims to provide a computing foundation that facilitates the transition of experimental agents and simulations into production systems for securing fleets and optimizing factories.