As a result, we need to increase the staging memory size to buffer more data. To implement this, NVIDIA chose shared memory as the staging memory for the Tensor Cores, which explains why shared memory grew while the register file size remained constant. However, Blackwell's shared memory size did not increase over Hopper's. (A minimal sketch of this staging pattern follows at the end of this section.)

Where Hopper-powered servers like the NVIDIA H200 optimize existing workflows, Blackwell redefines what's possible. In this piece, we'll dissect Blackwell's revolutionary design, contrast it directly with today's Hopper-based NVIDIA AI servers, and map its strategic implications for enterprises.

With the 3.x redesign, CUTLASS aimed to maximize coverage of the space of GEMM implementations through a hierarchical system of composable, orthogonal building blocks, while also improving code readability and extending support to newer NVIDIA architectures such as Hopper and Blackwell (see the composition sketch below).

NVIDIA Blackwell is up to 2.5 times faster than Hopper, a gain driven by advances such as the second-generation Transformer Engine, a decompression engine, and a much faster chip-to-chip interconnect.

The landscape of AI computing is witnessing a revolutionary transformation with NVIDIA's move from the Hopper to the Blackwell architecture. In March 2025, NVIDIA unveiled the Blackwell Ultra AI factory platform, marking a pivotal shift toward the age of AI reasoning.
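To make the staging pattern concrete, below is a minimal CUDA sketch that stages 16x16 operand tiles through shared memory before handing them to the Tensor Cores via the public WMMA API. The kernel name `wmma_gemm_staged` and the one-warp-per-block layout are illustrative choices of ours, and the sketch assumes M, N, and K are multiples of 16; real kernels (and CUTLASS) use deeper multi-stage pipelines with asynchronous copies, but the data path is the same: global memory to shared-memory staging buffer to Tensor Core operand registers.

```cuda
// Minimal sketch (not production code): shared memory as the staging
// buffer between global memory and the Tensor Cores. Requires sm_70+;
// assumes M, N, K are multiples of 16. Compile with e.g. nvcc -arch=sm_80.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

constexpr int TILE = 16;

// One warp per block; each block computes one 16x16 tile of
// C = A * B, with A (MxK), B (KxN), C (MxN) all row-major.
__global__ void wmma_gemm_staged(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
    // Staging buffers: operand tiles are parked here before the MMA.
    __shared__ half As[TILE * TILE];
    __shared__ half Bs[TILE * TILE];

    int tileRow = blockIdx.y * TILE;  // first row of this block's C tile
    int tileCol = blockIdx.x * TILE;  // first column of this block's C tile
    int lane    = threadIdx.x;        // 0..31

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += TILE) {
        // Stage one 16x16 tile of A and of B: 256 elements, 8 per lane.
        for (int i = lane; i < TILE * TILE; i += 32) {
            int r = i / TILE, c = i % TILE;
            As[i] = A[(tileRow + r) * K + (k + c)];
            Bs[i] = B[(k + r) * N + (tileCol + c)];
        }
        __syncwarp();  // staged data must be visible to the whole warp

        // Feed the staged tiles to the Tensor Cores.
        wmma::load_matrix_sync(aFrag, As, TILE);
        wmma::load_matrix_sync(bFrag, Bs, TILE);
        wmma::mma_sync(acc, aFrag, bFrag, acc);
        __syncwarp();  // don't overwrite As/Bs while they are still read
    }

    wmma::store_matrix_sync(&C[tileRow * N + tileCol], acc, N,
                            wmma::mem_row_major);
}
// Launch example:
// wmma_gemm_staged<<<dim3(N / 16, M / 16), 32>>>(A, B, C, M, N, K);
```

The point for the architecture discussion above: the larger the staging buffer, the more tiles a kernel can keep in flight per SM, which is why shared-memory capacity rather than register-file size became the lever NVIDIA adjusted for Tensor Core throughput.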
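The "composable, orthogonal building blocks" of CUTLASS 3.x are easiest to see in code. The following host-side sketch is paraphrased from the composition pattern used in CUTLASS 3.x's own Hopper examples: a collective mainloop and a collective epilogue are produced by builders, composed into a kernel, and wrapped in a device-level adapter. Exact template parameters, header paths, and builder signatures vary between CUTLASS releases, so treat this as an illustration of the layering rather than copy-paste code.

```cuda
// Sketch of CUTLASS 3.x's hierarchical composition (based on the
// patterns in the CUTLASS 3.x Hopper examples; details vary by release).
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using namespace cute;

using ElementA = cutlass::half_t;             // A matrix element type
using ElementB = cutlass::half_t;             // B matrix element type
using ElementC = float;                       // C/D matrix element type
using ElementAccumulator = float;             // accumulator type

using TileShape    = Shape<_128, _128, _64>;  // CTA tile (M, N, K)
using ClusterShape = Shape<_1, _1, _1>;       // threadblock cluster shape

// Building block 1: the epilogue collective (writes out the result).
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, cutlass::layout::RowMajor, 4,
    ElementC, cutlass::layout::RowMajor, 4,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

// Building block 2: the mainloop collective (stages tiles, runs the MMAs).
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, cutlass::layout::RowMajor, 8,
    ElementB, cutlass::layout::ColumnMajor, 8,
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,     // pipeline depth chosen for you
    cutlass::gemm::collective::KernelScheduleAuto  // schedule chosen for you
  >::CollectiveOp;

// Compose the two collectives into a kernel, then into a device-level GEMM.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,   // runtime problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

Because each layer is orthogonal, swapping the tile shape, the schedule, or the target architecture tag (for example, from `Sm90` for Hopper to the corresponding Blackwell tag) changes one building block without rewriting the rest, which is how the redesign extends coverage to newer architectures.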