Google TPU8 Splits Training and Inference, Pioneering a Fine-Grained Paradigm for AI Memory

By: M 2026-06-04 10:10 (UTC+0)

If stacking "big memory" were the universal answer to AI inference bottlenecks, why doesn't Google, which owns Gemini and invests over $100 billion annually, follow this approach?

With its eighth-generation TPU, Google breaks from the tradition of single-chip iteration by splitting training and inference into two distinct chips: the TPU 8t, designed for large-scale training, and the TPU 8i, optimized for inference and AI agents. The split lets each chip be matched precisely to the demands of its workload.

Why Did Google Choose to Separate Training and Inference?

As AI models evolve from simple chatbots into more complex AI agents, computational requirements are undergoing a fundamental shift and AI workloads are increasingly diverging. With model sizes growing exponentially, using the same chip for both training and inference, two tasks with conflicting demands, has become increasingly difficult.

· Training: Requires continuous, massive reads of petabyte-scale datasets from storage, demanding higher bandwidth and greater throughput.

· Inference: Requires extremely fast random reads of KV cache fragments for each request, demanding lower latency and higher concurrency.

The optimal designs for training and inference pull in opposite directions: optimizing for throughput tends to increase latency, while optimizing for latency reduces peak throughput.
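The trade-off can be seen in a toy batching model. This is a minimal sketch, not a description of how either TPU schedules work: the fixed per-batch overhead and the per-request cost are hypothetical constants chosen only to show the trend.

```python
# Toy batching model. Both constants are hypothetical, picked only to
# illustrate the throughput-versus-latency trend described above.
FIXED_OVERHEAD_MS = 2.0   # assumed cost paid once per batch
PER_REQUEST_MS = 0.25     # assumed extra cost of each request in the batch


def batch_latency_ms(batch_size: int) -> float:
    """Time until every request in the batch has completed."""
    return FIXED_OVERHEAD_MS + PER_REQUEST_MS * batch_size


def throughput_rps(batch_size: int) -> float:
    """Requests completed per second at a given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)


if __name__ == "__main__":
    for batch_size in (1, 8, 64, 512):
        print(
            f"batch={batch_size:4d}  "
            f"latency={batch_latency_ms(batch_size):7.2f} ms  "
            f"throughput={throughput_rps(batch_size):8.1f} req/s"
        )
```

In this model larger batches amortize the fixed overhead, so throughput rises, but every request waits for the whole batch, so latency rises with it.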

Thus, Google split the TPU 8 series into the TPU 8t for training and the TPU 8i for inference, each excelling in its own role, significantly improving efficiency and cost-effectiveness.

TPU 8t and TPU 8i: Same Origin, Target-Specific Design

Both chips are Google's first to feature the custom Axion Arm CPU, use TSMC's 2nm process, and employ fourth-generation liquid cooling. Performance per watt is doubled from previous generations. Mass production is expected by the end of 2027.

TPU 8t is designed for training, focusing on efficiently getting data from storage to ultra-large clusters of chips.

· Faster storage access: Introduces TPU Direct RDMA and TPU Direct Storage, enabling direct data transfer between TPU memory and network interface cards, and direct access between TPUs and high-speed managed storage, with no CPU involvement required. Petabyte-scale datasets can be streamed directly to the chip, with storage access up to 10x faster than the previous generation.

· Greater scalability: A single superpod integrates 9,600 chips, and a single training cluster can scale to over one million TPU chips. Virgo Network can link over 134,000 TPU 8t chips in a single fabric with up to 47 petabits/sec of non-blocking bisection bandwidth. This fabric delivers over 1,700 EFlops with near-linear scaling performance.

· Better performance: A single pod delivers 121 EFlops of FP4 compute, 3x the previous generation. With 2x the scale-up bandwidth on the inter-chip interconnect (ICI) and up to 4x the raw scale-out DCN bandwidth, TPU 8t drastically reduces data bottlenecks.

Source: Google, Made by CFM
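The "near-linear scaling" claim for TPU 8t can be sanity-checked from the figures quoted above. The sketch below assumes the fabric-level compute figure is FP4, like the per-pod figure, and treats the "over" values as exact, so the results are approximate.

```python
EXA = 10**18
PETA = 10**15

# Figures quoted in the article for TPU 8t.
POD_CHIPS = 9_600
POD_FP4_FLOPS = 121 * EXA

FABRIC_CHIPS = 134_000        # "over 134,000" chips, treated as exact here
FABRIC_FLOPS = 1_700 * EXA    # "over 1.7K" EFlops, assumed to be FP4

per_chip_in_pod = POD_FP4_FLOPS / POD_CHIPS
per_chip_in_fabric = FABRIC_FLOPS / FABRIC_CHIPS

# A ratio close to 1.0 means per-chip compute is preserved at fabric scale.
scaling_ratio = per_chip_in_fabric / per_chip_in_pod

# How many 9,600-chip superpods the fabric corresponds to.
pods_in_fabric = FABRIC_CHIPS / POD_CHIPS

if __name__ == "__main__":
    print(f"per chip, within a pod:    {per_chip_in_pod / PETA:.1f} PFlops")
    print(f"per chip, within a fabric: {per_chip_in_fabric / PETA:.1f} PFlops")
    print(f"scaling ratio:             {scaling_ratio:.3f}")
    print(f"superpods per fabric:      {pods_in_fabric:.1f}")
```

Under these assumptions the per-chip compute at fabric scale matches the per-pod value to within about one percent, and the fabric corresponds to roughly 14 superpods.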

TPU 8i is designed for inference, focusing on keeping data as close to the compute core as possible, reducing backend storage access.

· Larger on-chip cache: Features 384MB of on-chip SRAM, over 3x the previous generation, alongside 288GB of HBM. This allows larger KV caches to reside entirely on the chip, significantly reducing core wait times during long-context decoding.

· Lower latency: By integrating a specialized Collectives Acceleration Engine (CAE), TPU 8i reduces the on-chip latency of collectives by 5x. The new Boardfly architecture embeds network connections directly into the compute chip, reducing data movement between nodes and cutting latency for communication-intensive workloads by up to 50%.

· Higher bandwidth and compute performance: Interconnect bandwidth doubles to 19.2 Tb/s, which benefits modern mixture-of-experts (MoE) models. A single pod scales to 1,152 chips and delivers 11.6 EFlops of FP8 compute, a significant improvement over the previous generation.

Source: Google, Made by CFM
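To give a feel for why per-chip memory matters for KV caches, the sketch below applies the standard KV cache sizing formula (a key and a value vector per layer per KV head) to the TPU 8i capacities quoted above. The model shape and the weight footprint are hypothetical, not a Gemini configuration, and decimal MB/GB units are assumed.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV cache bytes for one token: key + value per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(capacity_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens of KV cache that fit in the given capacity."""
    return capacity_bytes // per_token_bytes


# Hypothetical model shape, chosen only for illustration.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)

# TPU 8i per-chip capacities from the article (decimal units assumed).
SRAM_BYTES = 384 * 10**6
HBM_BYTES = 288 * 10**9

# Hypothetical share of HBM occupied by model weights.
ASSUMED_WEIGHT_BYTES = 140 * 10**9

if __name__ == "__main__":
    print(f"KV cache per token: {PER_TOKEN / 1024:.0f} KiB")
    print(f"tokens in SRAM:     {tokens_that_fit(SRAM_BYTES, PER_TOKEN):,}")
    print("tokens in free HBM: "
          f"{tokens_that_fit(HBM_BYTES - ASSUMED_WEIGHT_BYTES, PER_TOKEN):,}")
```

Changing the model shape or the weight share shifts the numbers considerably, so the output should be read as an order-of-magnitude illustration only.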

The two TPU 8 chips are not designed by simply stacking specifications. Instead, their hardware is tailored to task requirements, and both offer notably more memory than previous-generation products. The TPU 8t is a "super warehouse" for training: a 9,600-chip superpod aggregates 2PB of total memory. The TPU 8i is a "high-speed cache" for inference: a pod holds 331.8TB in total, less than the TPU 8t, but each chip carries a larger 288GB of HBM and an unusually large 384MB of on-chip SRAM. The TPU 8i's HBM bandwidth of 8,601 GB/s is about 32% higher than the TPU 8t's 6,528 GB/s.
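The aggregate figures in this paragraph follow from the per-chip numbers. A short check, assuming decimal units (1TB = 1,000GB) and the 1,152-chip TPU 8i pod size quoted earlier:

```python
GB_PER_TB = 1_000  # decimal units assumed

# TPU 8i: pod memory from chip count and per-chip HBM.
TPU8I_CHIPS_PER_POD = 1_152
TPU8I_HBM_GB = 288
tpu8i_pod_memory_tb = TPU8I_CHIPS_PER_POD * TPU8I_HBM_GB / GB_PER_TB

# HBM bandwidth per chip, in GB/s, as quoted in the article.
TPU8I_HBM_GBPS = 8_601
TPU8T_HBM_GBPS = 6_528
bandwidth_gain_pct = (TPU8I_HBM_GBPS / TPU8T_HBM_GBPS - 1) * 100

if __name__ == "__main__":
    print(f"TPU 8i pod memory: {tpu8i_pod_memory_tb:.1f} TB")
    print(f"TPU 8i HBM bandwidth advantage: {bandwidth_gain_pct:.0f}%")
```

Both results agree with the article's 331.8TB and roughly 32% figures.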

Make the Right Chip, Not the Strongest Chip

Google's differentiator lies mainly in system architecture innovation and system-level cost efficiency, rather than in the pursuit of extreme per-chip performance that NVIDIA favors. NVIDIA's Rubin GPU offers higher bandwidth, more FP4 capability, and more NVLink features per GPU than the eighth-generation TPU, giving it a clear lead in single-chip performance. The Groq 3 LPU inference chip delivers 150 TB/s of SRAM bandwidth and 2.5 TB/s of expansion bandwidth with 500MB of SRAM. While its on-chip cache and bandwidth exceed those of the TPU 8i, the LPU must work alongside GPUs and costs significantly more than the TPU 8i.

However, Google holds two advantages. First, it fully separates training and inference chips so that each can be optimized aggressively, and as inference keeps scaling, the economics of custom ASICs outperform those of general-purpose GPUs. Second, with its custom architecture, a single superpod scales to 9,600 chips (far beyond the 72 GPUs in an NVL72), and the optimized system software stack keeps effective compute time extremely high even at massive scale. According to official data, performance per dollar improves by 2.7x for the TPU 8t and by 80% for the TPU 8i relative to the previous Ironwood TPU.

Source: NVIDIA, Google, Made by CFM
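The performance-per-dollar figures can be restated as the relative cost of a fixed amount of work. This sketch reads "2.7x" as a 2.7-fold multiplier and "80% higher" as a 1.8-fold multiplier over Ironwood; both readings are assumptions about how the official figures are defined.

```python
def relative_cost(perf_per_dollar_multiplier: float) -> float:
    """Cost of a fixed amount of work relative to the baseline (1.0)."""
    return 1.0 / perf_per_dollar_multiplier


TPU8T_MULTIPLIER = 2.7   # assumed reading of "2.7x"
TPU8I_MULTIPLIER = 1.8   # assumed reading of "80% higher"

# Scale-up domain sizes quoted in the article.
SUPERPOD_CHIPS = 9_600
NVL72_GPUS = 72

if __name__ == "__main__":
    print(f"TPU 8t cost vs Ironwood: {relative_cost(TPU8T_MULTIPLIER):.0%}")
    print(f"TPU 8i cost vs Ironwood: {relative_cost(TPU8I_MULTIPLIER):.0%}")
    print(f"superpod vs NVL72 size:  {SUPERPOD_CHIPS / NVL72_GPUS:.0f}x")
```

Under this reading, the same training job would cost a little over a third of its Ironwood price on the TPU 8t, and the same inference load a little over half on the TPU 8i.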

NVIDIA uses "top-tier hardware for all scenarios", relying on high-end storage advantages to handle all kinds of AI data read/write needs.

Google takes a "scenario-customized approach", abandoning one-size-fits-all storage and innovating system architectures for the two core scenarios of training and inference. This balances cost and power while meeting the massive, high-frequency, and differentiated storage demands of the AI agent era.

Google's split of training and inference ends the crude model of using a single chip for all storage needs across scenarios. It pushes AI memory beyond the era of generic adaptation into a new phase of scenario-driven, fine-grained customization.

In the new AI era, the optimal chip is not the strongest one, but the right one for the job. The optimal memory solution is not extreme parameter stacking, but precise scenario matching.
