NPU (Hardware Accelerator for Neural Networks)

Pipelined tile-based multiplication

Tile Pipelining

We need to multiply large matrices with a small systolic array.
The systolic array uses tile pipelining: it keeps fetching the next tile's input while computing the current one, removing idle cycles.
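A minimal NumPy sketch of the idea, with an explicitly double-buffered inner loop; tile size and matrix shapes are arbitrary, and on real hardware the "prefetch" would overlap the multiply instead of running before it:

```python
import numpy as np

def tiled_matmul_pipelined(A, B, T=128):
    """Multiply A (M x K) by B (K x N) one T x T tile at a time.

    The k-loop uses explicit double buffering: while the systolic array would
    be busy with `cur`, the next pair of input tiles is already being fetched
    into `nxt` (here the overlap is only conceptual)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            nxt = (A[i:i+T, 0:T], B[0:T, j:j+T])          # fetch first tile pair
            for k in range(0, K, T):
                cur = nxt
                if k + T < K:                              # prefetch next tile pair
                    nxt = (A[i:i+T, k+T:k+2*T], B[k+T:k+2*T, j:j+T])
                C[i:i+T, j:j+T] += cur[0] @ cur[1]         # compute on current tiles
    return C

A = np.random.rand(512, 384).astype(np.float32)
B = np.random.rand(384, 256).astype(np.float32)
assert np.allclose(tiled_matmul_pipelined(A, B), A @ B, atol=1e-3)
```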

Convolution on GPU

We use a BLAS (Basic Linear Algebra Subprograms) library for matrix multiplication.

cuBLAS and cuDNN both duplicate the input image into a large array so that the convolution can be computed as a matrix multiplication.
cuBLAS actually creates the large matrix in memory, while cuDNN stores the original image and reformats it as it goes into the GPU cache.
cuDNN therefore utilizes main memory better, at the small cost of the software's on-the-fly data reformatting!
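A tiny sketch of the "duplicate the image into a large matrix" idea (commonly called im2col), for a single-channel image, stride 1, no padding; the actual cuBLAS/cuDNN layouts differ:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll every kh x kw window of a 2D image into one row of a big matrix."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, kh * kw), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i+kh, j:j+kw].ravel()
    return cols

x = np.random.rand(6, 6).astype(np.float32)
k = np.random.rand(3, 3).astype(np.float32)

# Convolution (really cross-correlation) expressed as a matrix-vector product.
y = (im2col(x, 3, 3) @ k.ravel()).reshape(4, 4)

# Direct sliding-window computation for reference.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(4)] for i in range(4)])
assert np.allclose(y, ref, atol=1e-5)
```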

However, in a hardware accelerator, the hardware itself must reformat the image.
In the TPU architecture, the systolic data setup unit converts each 2D window into a 1D vector so the convolution can be performed as a matrix multiplication.

Weight stationary in Systolic Array

Google found that memory bandwidth determines performance. Why?

Reason 1: The matrix-vector (M*V) multiplications in MLPs and RNNs are bandwidth-intensive because each weight is used only once.

cf. M*V multiplication is heavily used in recommendation systems.

Solution 1: A large batch turns the matrix-vector multiplication into a matrix-matrix multiplication, which offers higher data reuse.

For interactive services (e.g. recommendation), batch formation delays the user's response and thereby limits the batch size.

Solution 2: Weight stationary in Systolic Array

Instead of pipelining the weights, we load them into the systolic array once and keep them there.

During the matrix multiplication, only one of the two matrices (the one streamed from the internal buffer) needs high bandwidth.
Thus, when memory bandwidth is low, weight stationary can offer much better performance than output stationary!

Internal buffer -> high bandwidth: provides a new input tile every step.
DRAM -> low bandwidth: we store each weight tile into the systolic array once and reuse it.

Problem 1: Latency exists! We should hide the latency of reading weights behind computation.
Compare the latency of reading the next tile's weights from slow DRAM vs. the latency of the systolic array multiplication on the current tile.
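A back-of-the-envelope comparison of the two latencies; all numbers here are made-up placeholders, not real TPU specs:

```python
# Weight-stationary latency hiding: can loading the next weight tile from DRAM
# be overlapped with the multiplication on the current tile?
# All numbers below are illustrative assumptions, not real TPU parameters.

tile = 128                      # systolic array is tile x tile
bytes_per_weight = 1            # int8 weights
dram_bw = 30e9                  # DRAM bandwidth in bytes/s (assumed)
clock = 700e6                   # accelerator clock in Hz (assumed)

def load_time():
    return tile * tile * bytes_per_weight / dram_bw

def compute_time(batch_rows):
    # A weight-stationary array streams one input row per cycle, so multiplying
    # a (batch_rows x tile) input by the resident (tile x tile) weights takes
    # roughly batch_rows cycles (ignoring fill/drain).
    return batch_rows / clock

for batch_rows in (32, 256, 2048):
    hidden = compute_time(batch_rows) >= load_time()
    print(f"batch_rows={batch_rows:5d}  compute={compute_time(batch_rows)*1e6:8.2f} us  "
          f"load={load_time()*1e6:8.2f} us  latency hidden: {hidden}")
```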

Problem 2: We need a large partial-sum buffer to accumulate the partial sums.

Hiding latency of reading weights

Due to DRAM's low bandwidth, the systolic array has to wait until DRAM delivers the next tile.

We can use higher DRAM bandwidth, better DRAM bandwidth utilization (e.g. int8), or ... just increase the systolic array's computation time?!

Computing a larger matrix reuses each tile fetched from DRAM more, so we can hide the latency.
By using a larger batch size, we make the input matrix taller and thus reuse DRAM's tile more!
For interactive services, we can process multiple users' inputs as one batch.

Memory-Compute Balance

Memory-Compute Balance: in the ideal case, computation time and memory access time should be the same!
If memory access takes longer, we have a memory bottleneck.
If computation takes longer, we have a compute bottleneck.
This is why batch size is important!
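A tiny sketch of the balance point for one fully-connected layer, using rough TPUv1-like numbers (assumed for illustration): compute time is about 2·B·K·N / peak_ops and memory time is about weight_bytes / bandwidth, so the batch size where they match is the balance point.

```python
# Memory-compute balance: find the batch size where the time to stream the
# weights from memory equals the time to do the multiply-accumulates.
# peak_ops / mem_bw / layer shape are rough assumptions for illustration only.

peak_ops = 92e12        # int8 ops/s (assumed, TPUv1-like)
mem_bw = 34e9           # bytes/s (assumed, DDR3-like)
K = N = 2048            # layer dimensions (assumed)
bytes_per_weight = 1    # int8

weight_bytes = K * N * bytes_per_weight
mem_time = weight_bytes / mem_bw

for B in (1, 64, 512, 1365, 4096):
    compute_time = 2 * B * K * N / peak_ops
    limit = "memory-bound" if compute_time < mem_time else "compute-bound"
    print(f"B={B:5d}  compute={compute_time*1e6:7.2f} us  memory={mem_time*1e6:7.2f} us  -> {limit}")

# Balance point: 2*B*K*N / peak_ops == K*N / mem_bw  =>  B == peak_ops / (2 * mem_bw)
print("balance batch size ~", peak_ops / (2 * mem_bw))
```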

e.g. Unlike TPUv1, TPUv2 used High Bandwidth Memory (HBM) and 16-bit bfloat.
Since memory access becomes faster, computation must also become faster to maintain the memory-compute balance.

TPUv2 also uses bfloat16 instead of float32.
Unlike float16, bfloat16 has the same exponent width as float32.
Theoretically, the representable value range is the same as float32! (From about $10^{-38}$ to $3 \cdot 10^{38}$)
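A quick sanity check of that range from the exponent widths alone (5 exponent bits for float16, 8 for bfloat16 and float32); plain IEEE-style arithmetic, nothing hardware-specific:

```python
def fp_range(exp_bits, mantissa_bits):
    """Largest/smallest normal magnitudes for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_val = (2 - 2 ** (-mantissa_bits)) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    return min_normal, max_val

for name, e, m in (("float16", 5, 10), ("bfloat16", 8, 7), ("float32", 8, 23)):
    lo, hi = fp_range(e, m)
    print(f"{name:8s}  min normal ~ {lo:.3e}   max ~ {hi:.3e}")

# bfloat16 and float32 share the 8-bit exponent, so both cover roughly
# 1.2e-38 .. 3.4e+38, while float16 only reaches about 6.1e-05 .. 6.6e+04.
```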

Improvements in TPUs

TPUv2

TPUv2 has an interconnect router that can directly connect multiple TPU chips.

TPUv2 uses two TPU cores, PCIe queues to connect with the CPU (like GPUs, it can't run without the CPU issuing commands!), and bfloat16.

Each TPU core has a vector unit. The vector unit has a 32K x 32-bit vector memory, performs elementwise operations with 2 ALUs, and sends and retrieves vectors to/from the matrix multiply unit.

TPUv2 has a 128x128 systolic array!
A smaller array has better utilization, while a larger array has better data reuse.
After considering 256x256, 128x128, and 64x64, they chose 128x128 as the best option.
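A crude way to see the utilization side of the tradeoff: count how many multiply-accumulate slots are wasted to padding when layers of various shapes are cut into array-sized tiles (the shapes below are made up, and the model ignores everything except padding):

```python
import math

def utilization(M, K, N, dim):
    """Fraction of a dim x dim systolic array's MAC slots that do useful work
    when an (M x K) x (K x N) multiply is cut into dim-sized tiles (edges padded)."""
    tiles = math.ceil(M / dim) * math.ceil(K / dim) * math.ceil(N / dim)
    useful_macs = M * K * N
    return useful_macs / (tiles * dim ** 3)

# Illustrative layer shapes (not from any real benchmark).
for (M, K, N) in ((200, 300, 500), (1024, 1024, 1024), (96, 96, 96)):
    print(M, K, N, {d: round(utilization(M, K, N, d), 2) for d in (64, 128, 256)})
```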

TPUv3

TPUv3 is similar to TPUv2.

It now uses 2 systolic arrays per TPU core, HBM with better bandwidth and capacity, an interconnect router with more nodes, etc.

TPU Board

Each TPU board has 4 TPU chips.
The TPUs are interconnected in a 2D torus.
Each TPU board is connected to its host (CPU) with PCIe.

TPUv3 Scaling

Google ran benchmarks with 6 production applications they're using: two each of MLP, CNN, and RNN.

The RNNs and CNNs scaled linearly, but the MLPs (recommendation systems) did not.

TPUv4i

TPUv4i has a single core (TensorCore) with 4 systolic arrays connected to vector memory (VMEM).
CMEM is a cache between HBM and VMEM.
It can cache previously computed matrices!

Each systolic array computes a 32x32 matrix multiplication using 4x4 dot-product units.
4 inputs and 4 weights are multiplied at once and reduced in a 4-input adder tree.
By removing 3 accumulators per group of 4 multipliers, area is reduced by 40% and power by 25% with respect to a plain 128x128 systolic array.
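A pure-Python sketch of that reorganization: four products are reduced by a small adder tree before touching the accumulator, so one accumulator serves a group of four multipliers (an illustration of the idea, not the TPUv4i datapath):

```python
def dot4(a, b):
    """4-input multiplier + adder tree: four products reduced in one step,
    so only one accumulator is needed instead of four."""
    p = [a[i] * b[i] for i in range(4)]        # 4 parallel multiplies
    return (p[0] + p[1]) + (p[2] + p[3])       # 2-level adder tree

def dot(a, b):
    """Length-N dot product built from 4-wide groups (N assumed divisible by 4)."""
    acc = 0
    for i in range(0, len(a), 4):
        acc += dot4(a[i:i+4], b[i:i+4])        # one accumulation per group of 4
    return acc

a = list(range(32))
b = list(range(32, 64))
assert dot(a, b) == sum(x * y for x, y in zip(a, b))
```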

TPUv4i also used bfloat16 and int8.
Problem: poor segmentation degrades the result of the image quality enhancement.

Vision for driving assistance and autonomy

Tesla's HydraNet takes 8 cameras, 16 timesteps, and a batch size of 32.
We need to process 8 × 16 × 32 = 4096 HD images in a single forward pass!

Tesla FSD Chip

The first chip to actually accelerate convolution!
They used a systolic array with 8-bit inputs.

To feed and store the large inputs, Tesla uses large SRAMs with high bandwidth.

ExaPOD: Tesla's Supercomputer

Self-driving cars need lots of training data. Tesla used 1.5PB for the final dataset!

D1 Training chip

A single D1 training chip can be connected to its neighbors in four directions; 5x5 chips are arranged into a training tile.
It is connected to the CPU with PCIe, and all training data is stored in the CPU's memory.
There is no large memory on the D1 chip! Therefore we want to avoid repeating data. The steps below run two CNN layers across four tiles (a toy code sketch follows the list):

  1. Parameters for the two layers are stored distributed across two tiles, split across channels.
  2. Inputs are shared across four tiles, split across batches (1/4 of the batch each).
  3. The first layer's parameters are replicated to form the full parameters in two tiles.
  4. These parameters are then replicated into the other two tiles.
  5. The first layer is run.
  6. The second layer's parameters are replicated into the other two tiles (one copy per two tiles, for model parallelism).
  7. The replicated first-layer parameters and the inputs are removed.
  8. Each first-layer output (per 1/4 of the batch) is split across two tiles, split across channels.
  9. The second layer is run (producing two partial sums per 1/4 of the batch).
  10. The two partial sums (per 1/4 of the batch) are added.
  11. The replicated second-layer parameters are removed.
  12. We have now run two CNN layers, and the parameters are still on the chips (so they can be reused)!
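A NumPy toy of the distribution pattern above, with dense layers standing in for the CNN layers and plain arrays standing in for the four tiles (all shapes are assumptions); it just checks that the split / replicate / partial-sum steps reproduce the ordinary two-layer result:

```python
import numpy as np

B, C_in, C_mid, C_out = 8, 16, 32, 24      # batch and channel sizes (assumed)
rng = np.random.default_rng(0)
x  = rng.standard_normal((B, C_in))
W1 = rng.standard_normal((C_in, C_mid))
W2 = rng.standard_normal((C_mid, C_out))

# Steps 1-2: resident storage -- each layer's weights split across 2 tiles by
# channel, inputs split across 4 tiles by batch (1/4 of the batch each).
W1_shards = np.split(W1, 2, axis=1)        # tiles 0,1 each hold half of W1's output channels
W2_shards = np.split(W2, 2, axis=0)        # tiles 2,3 each hold half of W2's input channels
x_shards  = np.split(x, 4, axis=0)

# Steps 3-5: gather the full W1 (replicate shards to every tile) and run layer 1
# data-parallel: each tile processes its 1/4 of the batch with the full W1.
W1_full = np.concatenate(W1_shards, axis=1)
h_shards = [xb @ W1_full for xb in x_shards]

# Steps 6-10: layer 2 runs model-parallel: each tile holds half of W2's input
# channels, so each produces a partial sum that is then added pairwise.
y_shards = []
for h in h_shards:                          # per 1/4 of the batch
    h_halves = np.split(h, 2, axis=1)       # step 8: split activations by channel
    partial = [h_halves[k] @ W2_shards[k] for k in range(2)]   # step 9
    y_shards.append(partial[0] + partial[1])                   # step 10
y = np.concatenate(y_shards, axis=0)

assert np.allclose(y, (x @ W1) @ W2)        # matches the undistributed result
```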

Dojo compiler

Most engineers at Tesla are software engineers!
Because their design was unique, they had to make their own compiler for PyTorch.

The result was better than NVIDIA's solution!

Other NPUs

AMD MI300X Series

  • Uses 8 HBM stacks for memory
  • Stacks of silicon dies for compute
  • Tends to have a smaller server size and consumes less power

SambaNova's Chiplet-based Accelerator

  • Best chiplet-based inference accelerator
  • Large silicon die in the middle, surrounded by DRAM and HBM

Cerebras Wafer Scale Engine (WSE)

  • Proposes using an entire wafer as one large silicon chip
    cf. Usually you cut the wafer to get the desired number of chips
  • About 1 million cores inside one wafer!
  • Zero skipping - faster computation for sparse weights

Zero skipping

A significant portion of the input values in a CNN are zero!
Pruning and ReLU produce zero values.

SwarmX

Motivation: matrix-matrix multiplication can be viewed as a sum of scalar-vector multiplications (one sum per output column).

If the scalar is zero, we skip it!
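A minimal sketch of that view for one output column: y = A @ x accumulated column by column as x[j] * A[:, j], skipping every column whose scalar x[j] is zero (NumPy illustration, not the actual Cerebras dataflow):

```python
import numpy as np

def matvec_skip_zeros(A, x):
    """y = A @ x as a sum of scalar-vector products, skipping zero scalars."""
    y = np.zeros(A.shape[0], dtype=A.dtype)
    skipped = 0
    for j, xj in enumerate(x):
        if xj == 0:                 # zero skipping: no multiply, no accumulate
            skipped += 1
            continue
        y += xj * A[:, j]
    return y, skipped

A = np.random.rand(8, 16).astype(np.float32)
x = np.random.rand(16).astype(np.float32)
x[np.random.rand(16) < 0.5] = 0.0   # ReLU/pruning-style sparsity

y, skipped = matvec_skip_zeros(A, x)
assert np.allclose(y, A @ x, atol=1e-5)
print(f"skipped {skipped} of {len(x)} columns")
```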

NVIDIA Tensor Core

A Tensor Core performs 64 multiplications per cycle (4 instances of a 4x4 outer-product multiplier).
This can multiply two 4x4 matrices in 1 cycle.

Each outer-product multiplier gets one column of the first matrix and one row of the second.
In 1 cycle, it performs the outer product (similar to a systolic array), and an adder tree sums the outer products' corresponding elements.
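A small sketch of a 4x4x4 multiply built from 4 outer products of 16 multiplications each (64 total); this shows the decomposition, not NVIDIA's exact datapath:

```python
import numpy as np

A = np.random.rand(4, 4).astype(np.float32)
B = np.random.rand(4, 4).astype(np.float32)

# Four outer products (one per k), each doing 4x4 = 16 multiplications,
# summed together: 4 * 16 = 64 multiplications for the 4x4x4 matmul.
C = np.zeros((4, 4), dtype=np.float32)
for k in range(4):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B, atol=1e-5)
```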

Zero Skipping in Nvidia A100

For every 4 parameters, we prune 2.
This shrinks the parameters to half the size, but we need extra memory for the non-zero indices.
The Sparse Tensor Core selects the input activations at the non-zero indices with a mux.

We need 2 bits per index: a 12.5% overhead for 16-bit data, 25% for 8-bit data.

During training, we prune half of the parameters and update the weights only for the parameters that aren't pruned.
Sparse Tensor Cores then multiply a 16x32 and a 32x8 matrix in 2 cycles.
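A toy model of the 2:4 scheme: keep the 2 largest-magnitude weights in every group of 4, store their 2-bit in-group positions, and at multiply time gather only the matching activations (a NumPy sketch of the idea, not the Sparse Tensor Core hardware):

```python
import numpy as np

def prune_2_of_4(w):
    """Keep the 2 largest-magnitude weights in every group of 4.
    Returns the compressed values and their 2-bit in-group indices."""
    groups = w.reshape(-1, 4)
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # indices of top-2 per group
    keep.sort(axis=1)
    vals = np.take_along_axis(groups, keep, axis=1)    # compressed: half the values
    return vals, keep.astype(np.uint8)                 # 2 bits of metadata per value

def sparse_dot(vals, idx, x):
    """Dot product using only the kept weights; the index metadata acts as the
    mux that selects the matching activations."""
    x_groups = x.reshape(-1, 4)
    x_sel = np.take_along_axis(x_groups, idx.astype(np.int64), axis=1)
    return float(np.sum(vals * x_sel))

w = np.random.randn(16).astype(np.float32)
x = np.random.randn(16).astype(np.float32)

vals, idx = prune_2_of_4(w)

# Reference: the same pruned weights laid out densely, multiplied the normal way.
dense_w = np.zeros_like(w).reshape(-1, 4)
np.put_along_axis(dense_w, idx.astype(np.int64), vals, axis=1)
assert np.isclose(sparse_dot(vals, idx, x), float(np.sum(dense_w.ravel() * x)))
```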

Samsung NPU v1

Exynos has zero weight skipping!

For each kernel element, we produce an intermediate output by multiplying that kernel element with the corresponding input feature map.

If the kernel is 3x3, we need 9 cycles. However, if the kernel has zero values, we can skip them to save cycles.

This is the first commercial neural network accelerator!
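A sketch of that per-kernel-element loop with zero-weight skipping (NumPy, single channel, stride 1; the cycle count is simply the number of non-zero kernel elements):

```python
import numpy as np

def conv2d_skip_zero_weights(x, k):
    """Convolution computed one kernel element at a time: each non-zero kernel
    element scales a shifted window of the input feature map and is accumulated;
    zero kernel elements are skipped (cycles saved)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=x.dtype)
    cycles = 0
    for di in range(kh):
        for dj in range(kw):
            if k[di, dj] == 0:          # zero weight skipping
                continue
            cycles += 1
            out += k[di, dj] * x[di:di + out.shape[0], dj:dj + out.shape[1]]
    return out, cycles

x = np.random.rand(8, 8).astype(np.float32)
k = np.array([[0, 1, 0],
              [2, 0, 3],
              [0, 4, 0]], dtype=np.float32)   # 5 of 9 weights are zero

out, cycles = conv2d_skip_zero_weights(x, k)
ref = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(6)] for i in range(6)])
assert np.allclose(out, ref, atol=1e-5)
print(f"used {cycles} of 9 cycles")
```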

Samsung NPU v2

Zero-activation skipping!

  • Feature-map-aware zero skipping: move the feature map and weights to fill in the zero features, so we can perform fewer dot products.
  • Feature-map lossless compressor: if the feature map has many zero features, we can compress it into a smaller representation.
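One simple way to realize such a lossless compressor is a bitmask of non-zero positions plus the packed non-zero values; this is an illustrative sketch, not Samsung's actual format:

```python
import numpy as np

def compress(fmap):
    """Bitmask + packed non-zeros: cheap lossless compression when the
    feature map is mostly zero (e.g. after ReLU)."""
    mask = fmap != 0
    return mask, fmap[mask]

def decompress(mask, values):
    fmap = np.zeros(mask.shape, dtype=values.dtype)
    fmap[mask] = values
    return fmap

fmap = np.maximum(np.random.randn(16, 16).astype(np.float32), 0)  # ReLU output
mask, values = compress(fmap)

orig_bytes = fmap.nbytes
comp_bytes = mask.size // 8 + values.nbytes     # 1 bit per element + non-zero values
print(f"{orig_bytes} B -> ~{comp_bytes} B ({values.size} non-zeros of {fmap.size})")

assert np.array_equal(decompress(mask, values), fmap)
```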

Precision in Nvidia A100 and H100

We increasingly use smaller types: float8, int8, etc.

If the data size is halved, we only need to fetch half as many weight bytes (or we can compute twice as many weights in the same time), so performance is doubled!

NVIDIA has found that, across different networks, smaller types barely affect accuracy.

Future of accelerators

Reasoning models need more token generation.
But power limits the data center; we should maximize the number of tokens generated per unit of power!

Power efficiency is becoming more and more important.
Probably the only hope for startup companies to beat big tech?

Modern GPUs can access other GPUs' memory (NVLink).
A GPU can now access 155TB of memory at high bandwidth!
Configurations may change for 1T-parameter model training.
Network speed is also becoming important.