Notes on FCRC Plenary: Computing in the Foundation Model Era
Full video: https://www.youtube.com/watch?v=gADw3NtGDVE
Foundation Model Era
- Examples: ChatGPT, Stable Diffusion
- Have billions of parameters and are trained on huge amounts of data (text or images).
- In-context learning: the same model, customized only with a plain-English prompt, achieves remarkable accuracy on all sorts of tasks it was never trained for.
Capabilities
- Fix bugs
- Generate Art
- Design drugs (AlphaFold)
Model size is important
It can explain jokes:
Input: I tried 10000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished.
A 1.3B-parameter model just regurgitates the joke.
A 175B-parameter model (GPT-3) actually explains the joke.
Systems for foundation models
How do we provide scale and efficiency?
How do we achieve performance and programmability?
- Solution 1: Hard-code the ML algorithm into a silicon chip -- fixed-function, ASIC-like performance.
- Question: Can we attain high efficiency together with flexibility, like a general-purpose x86 processor?
Answer: we need a vertically integrated solution (co-design) across the stack:
- ML Algorithm
- Dataflow compilers (the core of the execution model for these ML algorithms; needs programming-language support)
- (New hardware) reconfigurable dataflow architectures
ML algorithms
Transformers and Attention
- Key: model the input sequence with QKV: Query, Key, Value
- Typical sequence length: 1k-8k tokens
- Goal: Increase the sequence length
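For reference, the standard scaled dot-product attention these bullets describe (the well-known formula, not spelled out in the talk):

  Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V,   with Q, K, V ∈ ℝ^(N×d)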
Long Sequence Length Benefits
- NLP: Large context required to understand books, plays, codebases.
- CV: the model can take higher-resolution input; most current models work at relatively low resolution.
- Allows larger inputs in general.
Attention is slow
Input: Q, K, V
Output: Dropout(Softmax(Mask(Q·Kᵀ))) · V
- The computation and memory are quadratic in sequence length.
- The N×N attention matrix does not fit in on-chip memory, so the standard implementation requires repeated accesses to off-chip memory (see the sketch below).
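A minimal NumPy sketch of that standard, unfused implementation (mask and dropout omitted; names and sizes are mine):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # (N, N): quadratic in N
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax
    return P @ V                                  # (N, d)

# At N = 8192 tokens, S alone is 8192 * 8192 * 4 bytes = 256 MB per head --
# far larger than on-chip SRAM, so S spills to slow off-chip memory.
N, d = 2048, 64
Q = np.random.randn(N, d).astype(np.float32)
out = naive_attention(Q, Q, Q)
```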
FlashAttention is much faster
Fuses the components of the attention algorithm into a single kernel and tiles the computation into blocks that fit in on-chip memory (sketched below).
- You cannot tile the computation arbitrarily; you need to understand the algorithm (the softmax must be computed incrementally across tiles).
- 2-4x speedup on GPUs and a 10-20x reduction in memory required.
- ChatGPT uses it.
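A simplified sketch of the tiling idea (illustrative only, not the real fused GPU kernel; keeping all queries resident and the block size of 128 are my simplifications). The trick is the online softmax: a running max and running denominator let each tile be folded in without ever materializing the N×N matrix.

```python
import numpy as np

def tiled_attention(Q, K, V, block=128):
    """FlashAttention-style streaming: only an (N, block) score tile exists."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))                 # running (unnormalized) output
    m = np.full(N, -np.inf)              # running row-wise max of scores
    l = np.zeros(N)                      # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j+block].T * scale   # score tile, never the full matrix
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)        # rescale what was accumulated so far
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ V[j:j+block]
        m = m_new
    return O / l[:, None]
```

The result matches the naive version (`np.allclose(tiled_attention(Q, Q, Q), naive_attention(Q, Q, Q))`), which is the point of the "you need to understand the algorithm" bullet: softmax looks global but can be computed incrementally.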
Why don't we use sparsity to train machine learning models?
"They tried all kinds of ideas: lottery tickets, hashing schemes, dynamic sparsity masks but the net result is that they've slowed down the training or they lose accuracy."
- Sparsity is not hardware-efficient; hardware likes to work in blocks.
Pixelated Butterfly: Monarch matrices -- doing sparsity in a structured way (toy sketch below)
- result: https://youtu.be/gADw3NtGDVE?t=1825
- They haven't tried it on the largest models yet (as of 2023-06-19).
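A toy sketch of the structured idea, as I read the Pixelated Butterfly / Monarch papers (simplified; the exact factorization and permutations vary): a Monarch matrix is a product of two block-diagonal matrices interleaved with a fixed reshape-transpose permutation, so an N×N "sparse" multiply becomes a few small dense block matmuls -- exactly the shape hardware likes.

```python
import numpy as np

def monarch_matvec(x, L_blocks, R_blocks):
    """y = M x with M = P @ blockdiag(L) @ P @ blockdiag(R), where P is the
    (m, m) reshape-transpose permutation and N = m * m.
    Cost: O(N^1.5) multiplies instead of O(N^2) for a dense matvec."""
    m = len(R_blocks)
    x = x.reshape(m, m)
    x = np.einsum('bij,bj->bi', R_blocks, x)   # blockdiag(R): m dense m x m blocks
    x = x.T                                    # permutation P
    x = np.einsum('bij,bj->bi', L_blocks, x)   # blockdiag(L)
    return x.T.reshape(-1)                     # undo permutation, flatten

m = 32                                         # N = 1024
L = np.random.randn(m, m, m)                   # m blocks of size m x m
R = np.random.randn(m, m, m)
y = monarch_matvec(np.random.randn(m * m), L, R)
```

Each block multiply is dense, so the "sparsity" never fights the hardware's preference for blocks.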
Can we get rid of this quadratic attention algorithm?
- Monarch Mixer
- HyenaDNA: state-space / signal-processing methods
- These are based on the FFT, and FFTs don't map well onto current ML accelerator hardware (sketch below).
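The workhorse in these FFT-based models is a long convolution, computed in O(N log N) rather than attention's O(N²); a minimal sketch (circular convolution over real signals; names are mine):

```python
import numpy as np

def fft_long_conv(u, k):
    """Circular convolution of a length-N sequence u with a filter k that can
    be as long as the sequence itself -- the mixing primitive in
    state-space / Hyena-style layers. Cost: O(N log N)."""
    N = len(u)
    U = np.fft.rfft(u, n=N)
    K = np.fft.rfft(k, n=N)
    return np.fft.irfft(U * K, n=N)   # pointwise product in frequency domain

N = 8192
y = fft_long_conv(np.random.randn(N), np.random.randn(N))
```

The hardware complaint above is about exactly this primitive: the FFT's butterfly access pattern maps poorly onto matmul-centric accelerators.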
ML models are dataflow -- how can the compiler optimize the dataflow in these algorithms?
- Fusion
- Tiling
- Metapipelining: hierarchical pipelining
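A toy illustration of what fusion and tiling buy (conceptual only; NumPy still materializes per-tile temporaries, and the real dataflow compilers do this on an intermediate representation, not in Python):

```python
import numpy as np

def unfused(x, W):
    h = x @ W                      # full intermediate written to memory
    g = np.maximum(h, 0)           # second full intermediate (ReLU)
    return g.sum(axis=-1)          # read everything back for the reduction

def fused_tiled(x, W, tile=256):
    # matmul -> ReLU -> reduce on one tile of rows at a time, while the
    # tile is still resident in fast memory; no full-size intermediates.
    out = np.empty(x.shape[0])
    for i in range(0, x.shape[0], tile):
        h = x[i:i+tile] @ W
        out[i:i+tile] = np.maximum(h, 0).sum(axis=-1)
    return out
```

Metapipelining extends the same idea hierarchically: whole fused stages are themselves pipelined over tiles.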
Sparse Abstract Machine
Mosaic: An Interoperable Compiler for Tensor Algebra
Reconfigurable Dataflow Architectures
Reconfigurable Dataflow Architecture: Pattern Compute Unit (2017)
If you have a vector problem, you should build a vector computer.
We have a dataflow problem, so we should build a dataflow computer.
- SambaNova is built on top of this
- Compare with Nvidia A100
SN10 results
Q & A
What should we focus on: efficient general-purpose computing or specialized computing?