Notes on FCRC Plenary: Computing in the Foundation Model Era
Full video: https://www.youtube.com/watch?v=gADw3NtGDVE
Foundation Model Era
- Examples: ChatGPT, Stable Diffusion
- Have billions of parameters and are trained on huge amounts of data (text or images).
- In-context learning: the same model, customized only with a plain-English prompt, achieves remarkable accuracy on all sorts of tasks it was never trained for.
Capabilities
- Fix bugs
- Generate Art
- Design drugs (AlphaFold)
Model size is important
It can explain jokes:
Input: I tried 10000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished.
A 1.3B-parameter model just regurgitates the joke.
A 175B-parameter model (GPT-3) actually explains the joke.
Systems for foundation models
How do we provide scale and efficiency?
How do we achieve performance and programmability?
- Solution 1: Hard-code the ML algorithm into a silicon chip -- fixed-function, ASIC-like performance.
- Question: Can we attain high efficiency together with flexibility, like a general-purpose x86 processor?
Answer: we need a vertically integrated solution (co-design) across the stack:
- ML Algorithm
- Dataflow compilers (the core of the execution model for these ML algorithms; needs programming-language support)
- (New hardware) reconfigurable dataflow architectures
ML algorithms
Transformers and Attention
- Key: model the input sequence with QKV: Query, Key, Value
- Typical sequence length: 1k-8k tokens
- Goal: Increase the sequence length
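For reference, the standard scaled dot-product attention these bullets describe (the well-known formula, not spelled out in the talk):

  Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V,   with Q, K, V ∈ ℝ^(N×d)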
Long Sequence Length Benefits
- NLP: Large context required to understand books, plays, codebases.
- CV: the model can take higher-resolution input; most current models work at relatively low resolution.
- Allows larger inputs in general.
Attention is slow
Input: Q, K, V
Output: Dropout(Softmax(Mask(Q·Kᵀ))) · V
- The computation and memory are quadratic in sequence length.
- The N×N attention matrix does not fit in on-chip memory, so the standard implementation requires repeated accesses to off-chip memory (see the sketch below).
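A minimal NumPy sketch of that standard, unfused implementation (mask and dropout omitted; names and sizes are mine):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # (N, N): quadratic in N
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax
    return P @ V                                  # (N, d)

# At N = 8192 tokens, S alone is 8192 * 8192 * 4 bytes = 256 MB per head --
# far larger than on-chip SRAM, so S spills to slow off-chip memory.
N, d = 2048, 64
Q = np.random.randn(N, d).astype(np.float32)
out = naive_attention(Q, Q, Q)
```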
FlashAttention is much faster
Fuses the components of the attention algorithm into a single kernel and tiles the computation into blocks that fit in on-chip memory (sketched below).
- You cannot tile the computation arbitrarily; you need to understand the algorithm (the softmax must be computed incrementally across tiles).
- 2-4x speedup on GPUs and a 10-20x reduction in memory required.
- ChatGPT uses it.
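A simplified sketch of the tiling idea (illustrative only, not the real fused GPU kernel; keeping all queries resident and the block size of 128 are my simplifications). The trick is the online softmax: a running max and running denominator let each tile be folded in without ever materializing the N×N matrix.

```python
import numpy as np

def tiled_attention(Q, K, V, block=128):
    """FlashAttention-style streaming: only an (N, block) score tile exists."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))                 # running (unnormalized) output
    m = np.full(N, -np.inf)              # running row-wise max of scores
    l = np.zeros(N)                      # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j+block].T * scale   # score tile, never the full matrix
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)        # rescale what was accumulated so far
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ V[j:j+block]
        m = m_new
    return O / l[:, None]
```

The result matches the naive version (`np.allclose(tiled_attention(Q, Q, Q), naive_attention(Q, Q, Q))`), which is the point of the "you need to understand the algorithm" bullet: softmax looks global but can be computed incrementally.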
Why don't we use sparsity to train machine learning models?
"They tried all kinds of ideas: lottery tickets, hashing schemes, dynamic sparsity masks but the net result is that they've slowed down the training or they lose accuracy."
- Sparsity is not hardware-efficient; hardware likes to work in blocks.
Pixelated Butterfly: Monarch matrices -- doing sparsity in a structured way (toy sketch below)
- result: https://youtu.be/gADw3NtGDVE?t=1825
- They haven't tried it on the largest models yet (as of 2023-06-19).
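A toy sketch of the structured idea, as I read the Pixelated Butterfly / Monarch papers (simplified; the exact factorization and permutations vary): a Monarch matrix is a product of two block-diagonal matrices interleaved with a fixed reshape-transpose permutation, so an N×N "sparse" multiply becomes a few small dense block matmuls -- exactly the shape hardware likes.

```python
import numpy as np

def monarch_matvec(x, L_blocks, R_blocks):
    """y = M x with M = P @ blockdiag(L) @ P @ blockdiag(R), where P is the
    (m, m) reshape-transpose permutation and N = m * m.
    Cost: O(N^1.5) multiplies instead of O(N^2) for a dense matvec."""
    m = len(R_blocks)
    x = x.reshape(m, m)
    x = np.einsum('bij,bj->bi', R_blocks, x)   # blockdiag(R): m dense m x m blocks
    x = x.T                                    # permutation P
    x = np.einsum('bij,bj->bi', L_blocks, x)   # blockdiag(L)
    return x.T.reshape(-1)                     # undo permutation, flatten

m = 32                                         # N = 1024
L = np.random.randn(m, m, m)                   # m blocks of size m x m
R = np.random.randn(m, m, m)
y = monarch_matvec(np.random.randn(m * m), L, R)
```

Each block multiply is dense, so the "sparsity" never fights the hardware's preference for blocks.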
Can we get rid of this quadratic attention algorithm?
- Monarch Mixer
- HyenaDNA: state-space / signal-processing methods
- These are based on the FFT, and FFTs don't map well onto current ML accelerator hardware (sketch below).
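The workhorse in these FFT-based models is a long convolution, computed in O(N log N) rather than attention's O(N²); a minimal sketch (circular convolution over real signals; names are mine):

```python
import numpy as np

def fft_long_conv(u, k):
    """Circular convolution of a length-N sequence u with a filter k that can
    be as long as the sequence itself -- the mixing primitive in
    state-space / Hyena-style layers. Cost: O(N log N)."""
    N = len(u)
    U = np.fft.rfft(u, n=N)
    K = np.fft.rfft(k, n=N)
    return np.fft.irfft(U * K, n=N)   # pointwise product in frequency domain

N = 8192
y = fft_long_conv(np.random.randn(N), np.random.randn(N))
```

The hardware complaint above is about exactly this primitive: the FFT's butterfly access pattern maps poorly onto matmul-centric accelerators.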
ML models are dataflow -- how can the compiler optimize the dataflow in these algorithms?
- Fusion
- Tiling
- Metapipelining: hierarchical pipelining
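A toy illustration of what fusion and tiling buy (conceptual only; NumPy still materializes per-tile temporaries, and the real dataflow compilers do this on an intermediate representation, not in Python):

```python
import numpy as np

def unfused(x, W):
    h = x @ W                      # full intermediate written to memory
    g = np.maximum(h, 0)           # second full intermediate (ReLU)
    return g.sum(axis=-1)          # read everything back for the reduction

def fused_tiled(x, W, tile=256):
    # matmul -> ReLU -> reduce on one tile of rows at a time, while the
    # tile is still resident in fast memory; no full-size intermediates.
    out = np.empty(x.shape[0])
    for i in range(0, x.shape[0], tile):
        h = x[i:i+tile] @ W
        out[i:i+tile] = np.maximum(h, 0).sum(axis=-1)
    return out
```

Metapipelining extends the same idea hierarchically: whole fused stages are themselves pipelined over tiles.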
Sparse Abstract Machine
Mosaic: An Interoperable Compiler for Tensor Algebra
Reconfigurable Dataflow Architectures
Reconfigurable Dataflow Architecture: Pattern Compute Unit (2017)
If you have a vector problem, you should build a vector computer.
We have a dataflow problem, so we should build a dataflow computer.
- SambaNova is built on top of this
- Compare with Nvidia A100
SN10 results
Q & A
What should we focus on: efficient general-purpose computing or specialized computing?