Notes on FCRC Plenary: Computing in the Foundation Model Era

Full video: https://www.youtube.com/watch?v=gADw3NtGDVE

Foundation Model Era

  • Examples: ChatGPT, Stable Diffusion
  • They have billions of parameters and are trained on huge amounts of data (text or images).
  • In-context learning: you can reuse the same representation, with minor customization via English text, and get amazing accuracy on all sorts of tasks the model was never trained for (a minimal illustration follows this list).
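
A minimal sketch of what an in-context-learning prompt looks like (a hypothetical translation example, not one from the talk):

    # Illustrative few-shot prompt (hypothetical example, not from the talk).
    prompt = (
        "Translate English to French.\n"
        "sea otter => loutre de mer\n"
        "cheese => fromage\n"
        "plush giraffe =>"
    )
    # A sufficiently large model typically completes the last line correctly,
    # even though it was never trained specifically for translation.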

Capabilities

  • Fix bugs
  • Generate Art
  • Design drugs (AlphaFold)

Model size is important

It can explain jokes:

Input: I tried 10000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished.

A 1.3B-parameter model will just regurgitate the joke.

The 175B-parameter model (GPT-3) actually explains the joke.

Systems for foundation models

How do we provide scale and efficiency?

How do we achieve performance and programmability?

  • Solution 1: Hardcode the ML algorithm into a silicon chip -- fixed-function, ASIC-like performance.
  • Question: Can we attain high efficiency together with flexibility, like a general-purpose x86 processor?

Answer to the question: Need a vertically integrated solution (co-design)

  1. ML Algorithm
  2. Dataflow compilers (the core of the execution model for these machine learning algorithms; needs programming language support)
  3. (New hardware) reconfigurable dataflow architectures

ML algorithms

Transformers and Attention

  • Key: model the input sequence with Q, K, V (Query, Key, Value) projections -- see the sketch after this list
  • Typical sequence length: 1k-8k tokens
  • Goal: Increase the sequence length
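
A minimal NumPy sketch of the Q, K, V projections (single head, no batching; the shapes and random weights are illustrative assumptions, not values from the talk):

    import numpy as np

    # Project a sequence of token embeddings X into queries, keys, and values.
    seq_len, d_model = 1024, 64                    # typical sequence lengths are 1k-8k tokens
    rng = np.random.default_rng(0)

    X = rng.standard_normal((seq_len, d_model))    # input token embeddings
    W_q = rng.standard_normal((d_model, d_model))  # learned projection weights (random here)
    W_k = rng.standard_normal((d_model, d_model))
    W_v = rng.standard_normal((d_model, d_model))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # each has shape (seq_len, d_model)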

Long Sequence Length Benefits

  • NLP: Large context required to understand books, plays, codebases.
  • CV: the model can take higher-resolution input; most models operate at a relatively low resolution.
  • Allows larger inputs in general.

Attention is slow

Input: Q, K, V

Output: Dropout(Softmax(Mask(QK^T))) V

  • The computation and memory are quadratic in the sequence length.
  • The attention matrix will not fit in on-chip memory, so the standard implementation requires repeated accesses to off-chip memory (see the sketch after this list).
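
A minimal NumPy sketch of the standard implementation (single head, the usual 1/sqrt(d) scaling, no dropout), showing the seq_len x seq_len score matrix that has to be materialized:

    import numpy as np

    def standard_attention(Q, K, V, mask=None):
        """Naive attention: materializes the full (seq_len, seq_len) score matrix."""
        d = Q.shape[-1]
        S = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len): quadratic in seq_len
        if mask is not None:
            S = np.where(mask, S, -np.inf)        # mask[i, j] is True where j may be attended to
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)        # row-wise softmax
        return P @ V                              # (seq_len, d)

    # At seq_len = 8192 in fp16, S alone is 8192 * 8192 * 2 bytes ~= 128 MB per head,
    # far larger than on-chip SRAM, so it is repeatedly moved to and from off-chip memory.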

FlashAttention is much faster

FlashAttention fuses the different components of the attention algorithm into one kernel and tiles the computation into blocks that fit in on-chip memory (sketched after the list below).

  • You cannot tile the computation arbitrarily. You need to understand the algorithm.
  • 2-4x improvement in performance on GPUs and a 10-20x reduction in memory required.
  • ChatGPT uses it.
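
A simplified NumPy sketch of the idea (tiling over blocks of keys/values plus an online softmax; masking, dropout, and the query-side tiling of the real kernel are omitted):

    import numpy as np

    def tiled_attention(Q, K, V, block=256):
        """Tiled attention with an online softmax: only (seq_len, block) score tiles
        are ever materialized, never the full (seq_len, seq_len) matrix."""
        n, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        O = np.zeros((n, d))
        row_max = np.full(n, -np.inf)        # running max of scores for each query row
        row_sum = np.zeros(n)                # running softmax denominator for each row

        for start in range(0, n, block):     # stream over blocks of keys/values
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = (Q @ Kb.T) * scale           # (n, block) tile of scores
            new_max = np.maximum(row_max, S.max(axis=1))
            rescale = np.exp(row_max - new_max)          # correct previously accumulated results
            P = np.exp(S - new_max[:, None])
            row_sum = row_sum * rescale + P.sum(axis=1)
            O = O * rescale[:, None] + P @ Vb
            row_max = new_max
        return O / row_sum[:, None]

    # Matches the naive softmax(Q K^T / sqrt(d)) @ V up to floating-point error,
    # while only keeping O(seq_len * block) score entries live at a time.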

Why don't we use sparsity to train machine learning models?

"They tried all kinds of ideas: lottery tickets, hashing schemes, dynamic sparsity masks but the net result is that they've slowed down the training or they lose accuracy."

  • Sparsity is not hardware efficient. Hardware likes to work in blocks.

Pixelated Butterfly and Monarch matrices -- doing sparsity in a structured way
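
An illustrative sketch of the general idea: replace a dense n x n matrix with a product of block-diagonal factors and a fixed permutation, so the hardware still multiplies dense blocks. This is only in the spirit of the Monarch construction; the actual factorization and permutation used in the papers differ in their details.

    import numpy as np

    def block_diagonal(blocks):
        """Assemble a block-diagonal matrix from a list of equally sized square blocks."""
        m, b = len(blocks), blocks[0].shape[0]
        out = np.zeros((m * b, m * b))
        for i, blk in enumerate(blocks):
            out[i * b:(i + 1) * b, i * b:(i + 1) * b] = blk
        return out

    rng = np.random.default_rng(0)
    m = 8
    n = m * m                              # a dense n x n matrix would have n^2 = 4096 parameters
    L = block_diagonal([rng.standard_normal((m, m)) for _ in range(m)])
    R = block_diagonal([rng.standard_normal((m, m)) for _ in range(m)])

    # Fixed stride permutation: view indices as an (m, m) grid and transpose it.
    perm = np.arange(n).reshape(m, m).T.reshape(-1)
    P = np.eye(n)[perm]

    M = P @ L @ P.T @ R                    # structured matrix with only 2 * n^1.5 = 1024 parameters
    y = M @ rng.standard_normal(n)         # in practice applied as small dense (batched) matmuls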

Can we get rid of this quadratic attention algorithm?

  • Monarch Mixer
  • HyenaDNA: a state-space / signal-processing method
    • Based on the FFT; FFTs don't work very well on current ML accelerator hardware (see the sketch after this list).
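
The reason FFTs matter here: these models replace attention with a long convolution over the sequence, which can be computed in O(n log n) with the FFT instead of O(n^2) directly. A minimal NumPy sketch:

    import numpy as np

    def fft_long_convolution(x, k):
        """Causal convolution of a length-n signal x with a length-n filter k,
        computed in O(n log n) via the FFT (zero-padded to avoid wrap-around)."""
        n = len(x)
        m = 2 * n                          # pad so circular convolution equals linear convolution
        y = np.fft.irfft(np.fft.rfft(x, m) * np.fft.rfft(k, m), m)
        return y[:n]                       # keep the causal part

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)          # input sequence
    k = rng.standard_normal(4096)          # long implicit filter (e.g. from a state-space model)
    y = fft_long_convolution(x, k)

    # Same result as direct convolution, but O(n log n) instead of O(n^2):
    assert np.allclose(y, np.convolve(x, k)[:len(x)])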

ML models are dataflow graphs -- how can the compiler optimize the dataflow in these algorithms? (A toy illustration follows the list below.)

  • Fusion
  • Tiling
  • Metapipelining: hierarchical pipelining
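
A toy NumPy illustration of fusion and tiling on a hypothetical element-wise chain (not the compiler's actual output): the unfused version writes full-size intermediates to memory, while the fused, tiled version keeps each tile's intermediates local.

    import numpy as np

    def unfused(x):
        a = np.exp(x)                      # full-size intermediate written to memory
        b = a * 2.0                        # another full-size intermediate
        return b + 1.0

    def fused_tiled(x, tile=4096):
        out = np.empty_like(x)
        for i in range(0, len(x), tile):
            t = x[i:i + tile]                          # a tile sized to fit on-chip memory
            out[i:i + tile] = np.exp(t) * 2.0 + 1.0    # all three ops fused over the tile
        return out

    x = np.random.default_rng(0).standard_normal(1 << 20)
    assert np.allclose(unfused(x), fused_tiled(x))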

Sparse Abstract Machine

Mosaic: An Interoperable Compiler for Tensor Algebra

Reconfigurable Dataflow Architectures

Reconfigurable Dataflow Architecture: Pattern Compute Unit (2017)

If you have a vector problem then you should build a vector computer.

We have a dataflow problem. We should build a dataflow computer.

SN10 results

Q & A

What are the key challenges on the PL side in supporting these new programming models and hardware design techniques?

Thoughts on leveraging the newest language models, like GPT-4, to help with the design and implementation of hardware, algorithms, and languages?

As system designers and hardware designers, how can we provide features to aid this work?

What is the trade-off between the two trends in dataflow architecture? Why choose functional units that require upfront configuration?

What should we focus on: efficient general-purpose computing or specialized computing?

What should come first: developing the best algorithm and then improving the hardware for it, or building good hardware and then adapting the algorithm to it?

Can you shed some light on why it's difficult for ML accelerators to do FFTs? There is a lot of specialized hardware that does FFTs.

How do you optimize DRAM reads in SambaNova's hardware?