Why have diffusion models displaced GANs so quickly? Consider the tale of the (very strange) first DALLE model. In 2021, diffusion models were almost unheard of, yet the creators of DALLE had already rejected the GAN approach. Here’s why. 🧵
DALLE is an image model, but it was built like a language model. It was trained on image-caption pairs. Each caption was encoded as 256 tokens. Each image was broken into a 32x32 grid of patches, and each patch was encoded as a token. All the tokens were then merged into a single sequence.
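A minimal sketch of that sequence construction, with made-up token IDs and vocab sizes standing in for the real tokenizers:

```python
import numpy as np

# Hypothetical sketch of DALLE-style sequence construction.
# Token IDs and vocab sizes below are illustrative stand-ins.
TEXT_LEN = 256          # caption encoded as 256 text tokens
GRID = 32               # image quantized into a 32x32 grid of patch tokens

rng = np.random.default_rng(0)
text_tokens = rng.integers(0, 16384, size=TEXT_LEN)      # stand-in caption token ids
image_tokens = rng.integers(0, 8192, size=GRID * GRID)   # stand-in patch token ids

# Merge into one sequence: caption tokens first, then the 1024 image tokens.
sequence = np.concatenate([text_tokens, image_tokens])
assert sequence.shape == (TEXT_LEN + GRID * GRID,)       # 1280 tokens total
```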
A transformer-based "language" model was trained on these sequences, ignoring the fact that some tokens represent text and some represent image patches. The model reads in a partial sequence of tokens and predicts the next token in the sequence.
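Once trained, an image is generated one token at a time. Here's a toy sketch of that autoregressive loop, with a random stand-in function in place of the actual transformer:

```python
import numpy as np

# Toy sketch of autoregressive sampling. `fake_model` is a stand-in for
# the transformer, which would return logits over the combined vocabulary.
VOCAB = 16384 + 8192    # illustrative text + image vocab sizes
rng = np.random.default_rng(0)

def fake_model(tokens):
    """Stand-in for the transformer: logits for the next token."""
    return rng.normal(size=VOCAB)

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = fake_model(tokens)
        probs = np.exp(logits - logits.max())   # softmax over the vocab
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))  # sample next token
    return tokens

out = generate([1, 2, 3], n_new=5)  # extend a partial sequence by 5 tokens
assert len(out) == 8
```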
Part of the magic comes from the (IMO unexpected) strong performance of quantized auto-encoders, which are used to translate images into tokens and back.
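The core of such a quantizer is simple: snap each patch embedding to its nearest codebook vector, and keep only the index. A toy sketch (codebook and embedding sizes here are shrunk for illustration; they are not the real model's dimensions):

```python
import numpy as np

# Toy vector quantization: encode each patch embedding as the index of
# its nearest codebook entry, then decode by looking the vector back up.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))   # 256 codes, 8-dim (illustrative sizes)
patches = rng.normal(size=(1024, 8))   # one 32x32 grid of patch embeddings

# Encode: squared distance from every patch to every code, take the nearest.
d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = d.argmin(axis=1)              # shape (1024,), ints in [0, 256)

# Decode: indices back to vectors (lossy -- this is the quantization).
recon = codebook[tokens]
assert tokens.shape == (1024,) and recon.shape == (1024, 8)
```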