Revolutionizing AI Training: AMD GPUs Hit Milestone Achievement
Zyphra, AMD, and IBM spent a year validating AMD’s GPUs and software platform for large-scale AI model training, resulting in the creation of ZAYA1.
Together, the three companies successfully trained ZAYA1, recognized as the first large-scale Mixture-of-Experts foundation model developed entirely on AMD GPUs and networking. The achievement shows that scaling AI training no longer has to depend on NVIDIA hardware.
The training of the model utilized AMD’s Instinct MI300X chips, Pensando networking technology, and ROCm software, all operated within IBM Cloud’s infrastructure. Notably, the setup was conventional in nature, resembling an enterprise cluster configuration, but without the use of NVIDIA components.
Zyphra highlights that ZAYA1’s performance matches or exceeds that of established open models in various domains such as reasoning, mathematics, and coding. This development provides businesses facing GPU supply constraints or escalating prices with a valuable alternative that maintains high performance capabilities.
How Zyphra Leveraged AMD GPUs to Reduce Costs Without Compromising AI Training Performance
When planning training budgets, organizations typically prioritize factors like memory capacity, communication speed, and consistent iteration times over sheer theoretical throughput.
The MI300X GPUs, with 192GB of high-bandwidth memory each, give engineers room to run initial training without immediately resorting to complex model parallelism. This simplifies projects that are otherwise intricate and time-consuming to optimize.
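To make the memory argument concrete, here is a rough back-of-the-envelope estimate (a minimal sketch with assumed precision and optimizer-state sizes, not Zyphra's published configuration) of what a dense training setup needs per GPU:

```python
# Rough, illustrative estimate of per-GPU training memory.
# Byte counts are assumptions (bf16 weights/grads, fp32 master copy + two Adam moments).

def training_memory_gb(n_params_billion: float,
                       bytes_per_param: int = 2,    # bf16 weights
                       bytes_per_grad: int = 2,     # bf16 gradients
                       bytes_optimizer: int = 12):  # fp32 master weights + two moments
    """Approximate GB needed for weights, gradients, and optimizer state."""
    n = n_params_billion * 1e9
    total_bytes = n * (bytes_per_param + bytes_per_grad + bytes_optimizer)
    return total_bytes / 1e9

# An 8.3B-parameter model needs on the order of ~130 GB before activations,
# which fits within a single MI300X's 192 GB without tensor or pipeline parallelism.
print(f"{training_memory_gb(8.3):.0f} GB")
```

Under these assumptions the states fit on one GPU, which is why the large memory pool removes much of the early parallelism engineering.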
Zyphra configured each node with eight MI300X GPUs interconnected via Infinity Fabric, with each GPU paired with a Pollara network card. A dedicated network manages dataset reads and checkpointing, keeping the design straightforward, minimizing switch costs, and holding iteration times steady.
ZAYA1: A High-Performing AI Model
The ZAYA1-base model activates 760 million parameters out of a total of 8.3 billion and underwent training across three stages using 12 trillion tokens. The model leverages compressed attention, a refined routing system for token allocation, and lighter residual scaling for deeper layer stability.
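The core of that sparsity is a routed Mixture-of-Experts layer, where each token activates only a few experts. The following is a minimal top-k routing sketch for illustration; the sizes, router, and expert design are assumptions, not ZAYA1's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer: each token is routed to k
    experts, so only a fraction of the layer's parameters is active per
    token. Dimensions and routing are illustrative, not ZAYA1's."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)   # torch.Size([16, 512])
```

With 760 million of 8.3 billion parameters active, only roughly a tenth of the weights participate in any given forward pass.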
Zyphra paired the Muon and AdamW optimizers, adapting Muon to AMD hardware by fusing kernels and cutting unnecessary memory traffic so that the optimizer step would not dominate each iteration. Batch sizes were ramped up progressively, which depended on the storage pipeline delivering tokens quickly enough.
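A common way to combine the two optimizers is to give matrix-shaped weights to Muon and everything else (embeddings, norms, biases) to AdamW. The sketch below assumes you supply your own Muon implementation (it is not part of torch.optim), and the batch-size ramp is a hypothetical linear schedule, not Zyphra's actual one:

```python
import torch

def build_optimizers(model, muon_cls, lr_muon=0.02, lr_adamw=3e-4):
    """Split parameters between a Muon-style optimizer (2-D weight matrices)
    and AdamW (embeddings, norms, biases). `muon_cls` is whichever Muon
    implementation you use; its signature here is an assumption."""
    matrix_params = [p for p in model.parameters() if p.ndim == 2]
    other_params  = [p for p in model.parameters() if p.ndim != 2]
    return [
        muon_cls(matrix_params, lr=lr_muon),
        torch.optim.AdamW(other_params, lr=lr_adamw, weight_decay=0.1),
    ]

def batch_size_schedule(step, start=512, end=4096, ramp_steps=10_000):
    """Linearly ramp the global batch size; in practice the ramp is limited
    by how fast the storage pipeline can deliver tokens."""
    frac = min(step / ramp_steps, 1.0)
    return int(start + frac * (end - start))
```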
These strategies culminate in an AI model trained on AMD hardware that competes with larger counterparts like Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. The MoE structure allows only a fraction of the model to run at a time, aiding in memory management during inference and reducing serving costs.
For instance, a bank could develop a specialized model for investigations without requiring complex parallelism initially. The ample memory of MI300X GPUs allows for iterative improvements, while ZAYA1’s compressed attention shortens prefill time during evaluation.
Optimizing ROCm for AMD GPUs
Transitioning a mature NVIDIA-based workflow to ROCm posed challenges for Zyphra. Instead of blindly porting components, the team assessed how the AMD hardware actually behaved and adjusted model dimensions, GEMM patterns, and microbatch sizes to match the MI300X’s compute characteristics.
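One simple expression of that tuning is rounding layer widths to hardware-friendly multiples so the resulting GEMMs land on efficient tile sizes. The multiple of 256 below is an illustrative assumption, not a published MI300X rule:

```python
def round_up(dim: int, multiple: int = 256) -> int:
    """Round a model dimension up to a hardware-friendly multiple so the
    resulting GEMMs hit efficient tile sizes. The choice of 256 is an
    illustrative assumption."""
    return ((dim + multiple - 1) // multiple) * multiple

# e.g. pick a feed-forward width near 4x a hidden size of 2800, then align it
print(round_up(4 * 2800))   # 11264 instead of 11200
```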
For optimal performance, Infinity Fabric requires all eight GPUs in a node to participate in collective operations, while Pollara reaches peak throughput with larger messages, so Zyphra sized its fusion buffers accordingly. Long-context training, extended from 4k to 32k tokens, relied on ring attention over sharded sequences and tree attention during decoding to avoid bottlenecks.
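The same "fewer, bigger messages" idea shows up in standard data-parallel training as gradient bucketing. A minimal sketch using PyTorch DDP's `bucket_cap_mb` knob is below; the 100 MB value is an assumption for illustration, not Zyphra's actual fusion-buffer size:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    """Wrap a model with DDP using larger gradient buckets, so each
    all-reduce carries a bigger message over the network."""
    model = model.to(f"cuda:{local_rank}")  # ROCm devices also appear as "cuda" in PyTorch
    return DDP(model, device_ids=[local_rank], bucket_cap_mb=100)  # default is 25 MB
```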
Storage considerations were also pragmatic, with smaller models emphasizing IOPS and larger models requiring sustained bandwidth. Zyphra consolidated dataset shards to minimize scattered reads and enhanced per-node page caches for faster checkpoint recovery, crucial during extended runs involving rewinds.
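Shard consolidation itself can be as simple as merging many small files into a few large ones so the training job issues sequential rather than scattered reads. The file layout and 4 GB target below are hypothetical:

```python
from pathlib import Path

def consolidate_shards(src_dir: str, dst_dir: str, target_bytes: int = 4 * 2**30):
    """Merge many small shard files into ~4 GB files to replace scattered
    random reads with fewer, larger sequential ones. Layout is illustrative."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    buffer, size, out_idx = [], 0, 0
    for shard in sorted(Path(src_dir).glob("*.bin")):
        data = shard.read_bytes()
        buffer.append(data)
        size += len(data)
        if size >= target_bytes:
            (dst / f"shard_{out_idx:05d}.bin").write_bytes(b"".join(buffer))
            buffer, size, out_idx = [], 0, out_idx + 1
    if buffer:
        (dst / f"shard_{out_idx:05d}.bin").write_bytes(b"".join(buffer))
```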
Ensuring Cluster Stability
Extended training jobs inevitably hit hardware and network faults. Zyphra’s Aegis service monitors logs and system metrics, detecting anomalies like NIC glitches or ECC errors and rectifying issues automatically. The team also extended RCCL timeouts to prevent brief network interruptions from killing entire jobs.
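In PyTorch, the collective timeout is set when the process group is created. A minimal sketch is below; on ROCm builds the "nccl" backend is backed by RCCL, and the 30-minute value is an illustrative choice rather than Zyphra's setting:

```python
import datetime
import torch.distributed as dist

# Raise the collective timeout so a brief network hiccup does not abort the
# whole job. Rank, world size, and master address come from the launcher's
# environment variables as usual.
dist.init_process_group(
    backend="nccl",                             # RCCL on ROCm builds of PyTorch
    timeout=datetime.timedelta(minutes=30),     # illustrative value
)
```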
Checkpointing is distributed across all GPUs to avoid a single-writer bottleneck. Zyphra reports significantly faster saves compared to standard methods, improving uptime and reducing operational burden.
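A generic way to get this behavior is sharded checkpointing, where every rank writes its own slice of the state in parallel. The sketch below uses PyTorch's distributed checkpoint API (available in recent versions) and is not Zyphra's system:

```python
import torch.distributed.checkpoint as dcp

def save_sharded_checkpoint(model, optimizer, step, root="checkpoints"):
    """Every rank writes its own shard of model and optimizer state in
    parallel, so save time scales with GPU count instead of funnelling
    through rank 0. Paths and layout are illustrative."""
    state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.save(state, checkpoint_id=f"{root}/step_{step}")
```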
Implications of the ZAYA1 AMD Training Milestone for AI Procurement
The report draws a clear distinction between NVIDIA’s ecosystem and AMD’s equivalents, highlighting differences such as NVLink vs. Infinity Fabric, NCCL vs. RCCL, and cuBLASLt vs. hipBLASLt, among others. The authors assert that the AMD stack is now robust enough for large-scale model development.
This does not imply that enterprises should replace existing NVIDIA clusters entirely. A more pragmatic approach involves retaining NVIDIA for production purposes while utilizing AMD for stages that benefit from MI300X GPU memory capacity and ROCm’s flexibility. This diversifies supplier risk and boosts overall training capacity without significant disruption.
Ultimately, the recommendations include treating model shape as adjustable, designing networks based on actual training operations, prioritizing fault tolerance to preserve GPU hours, and modernizing checkpointing to enhance training continuity.
These insights stem from Zyphra, AMD, and IBM’s experience in training a substantial MoE AI model on AMD GPUs, offering organizations a valuable blueprint for expanding AI capabilities beyond reliance on a single vendor.
Discover More: Google’s Commitment to Enhanced AI Infrastructure in the Coming Years