Generate data in random order? A masked language model as a generative model? All this and more in "Autoregressive Diffusion Models" with @agritsenko @BastingsJasmijn @poolio @vdbergrianne @TimSalimans. For details see arxiv.org/abs/2110.02037. Some explanations below...
Training of Autoregressive Diffusion Models (ARDMs). In their basic form, they are trained like a masked language model, except that the _number_ of masked variables also varies. At each training step, some variables are masked, and those are predicted from the remaining ones.
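A rough sketch of one such training step, in NumPy. Everything named here is an assumption for illustration: `ardm_training_loss`, the stand-in model `uniform_log_probs`, the choice of 0 as the placeholder for masked variables, and the rescaling by D / (D - t + 1), which is meant to turn the one-step loss into an estimate of the sum over all D steps. See the paper for the exact objective.

```python
import numpy as np

K = 4  # number of classes per variable (assumed for this example)

def uniform_log_probs(x_in, mask):
    """Hypothetical stand-in for a network: uniform log-probs, shape (D, K)."""
    return np.full((x_in.shape[0], K), -np.log(K))

def ardm_training_loss(x, log_probs_fn, rng):
    """One-sample loss estimate for a single datapoint x of shape (D,)."""
    D = x.shape[0]
    t = rng.integers(1, D + 1)        # step t, uniform in {1, ..., D}
    sigma = rng.permutation(D)        # sigma[i]: 0-based rank of variable i in the order
    mask = sigma < t - 1              # True for the t-1 variables that stay visible
    x_in = np.where(mask, x, 0)       # masked variables get placeholder value 0
    log_probs = log_probs_fn(x_in, mask)        # (D, K)
    ll = log_probs[np.arange(D), x]             # log-prob of the true value per variable
    ll_masked = np.where(mask, 0.0, ll)         # only the masked variables contribute
    return -D * ll_masked.sum() / (D - t + 1)   # average over masked, rescaled by D

rng = np.random.default_rng(0)
x = np.array([0, 3, 1, 2, 2])
loss = ardm_training_loss(x, uniform_log_probs, rng)
```

With the uniform stand-in, every masked variable costs log K, so the estimate equals D * log K whichever t is drawn. A real model would replace `uniform_log_probs` with a network that sees `x_in` and `mask`.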
Sampling takes multiple steps. First a random generation order is picked. Then, one variable at a time, the model does a forward pass and samples a value. Each sampled value is filled in for the next forward pass.
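The sampling loop can be sketched like this. It is a minimal illustration, not the authors' code: `ardm_sample` and `uniform_log_probs` are hypothetical names, and the model is assumed to take the partially filled array plus a boolean mask and return log-probabilities of shape (D, K).

```python
import numpy as np

K = 4  # number of classes per variable (assumed for this example)

def uniform_log_probs(x_in, mask):
    """Hypothetical stand-in for a network: uniform log-probs, shape (D, K)."""
    return np.full((x_in.shape[0], K), -np.log(K))

def ardm_sample(log_probs_fn, D, K, rng):
    """Generate D variables one at a time, in a random order."""
    order = rng.permutation(D)            # order[k]: index of the variable generated at step k
    x = np.zeros(D, dtype=np.int64)       # unfilled variables hold placeholder value 0
    mask = np.zeros(D, dtype=bool)        # True once a variable has been generated
    for i in order:
        log_probs = log_probs_fn(x, mask) # forward pass on what is filled in so far
        p = np.exp(log_probs[i])
        p = p / p.sum()                   # renormalize against rounding error
        x[i] = rng.choice(K, p=p)         # sample a value for variable i
        mask[i] = True                    # fill it in for the next forward pass
    return x

sample = ardm_sample(uniform_log_probs, D=6, K=K, rng=np.random.default_rng(0))
```

Written this way, generating D variables costs D forward passes, which is what motivates the parallel extension.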
Extensions include parallelization (using multiple predictions at once) and generation in stages (predicting more significant bits first).
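A simplified sketch of the parallelization idea: since one forward pass gives a prediction for every masked variable, several of them can be sampled from the same pass. The fixed `chunk` size, the name `ardm_sample_parallel`, and the stand-in model are assumptions made for illustration; how many variables to generate at each step is a design choice that the paper treats more carefully. Generation in stages is not covered here.

```python
import numpy as np

K = 4  # number of classes per variable (assumed for this example)

def uniform_log_probs(x_in, mask):
    """Hypothetical stand-in for a network: uniform log-probs, shape (D, K)."""
    return np.full((x_in.shape[0], K), -np.log(K))

def ardm_sample_parallel(log_probs_fn, D, K, chunk, rng):
    """Generate D variables in random order, `chunk` variables per forward pass."""
    order = rng.permutation(D)
    x = np.zeros(D, dtype=np.int64)
    mask = np.zeros(D, dtype=bool)
    n_passes = 0
    for start in range(0, D, chunk):
        idx = order[start:start + chunk]   # variables generated in this step
        log_probs = log_probs_fn(x, mask)  # a single forward pass for the whole chunk
        n_passes += 1
        for i in idx:                      # sample each variable independently
            p = np.exp(log_probs[i])
            p = p / p.sum()
            x[i] = rng.choice(K, p=p)
        mask[idx] = True                   # fill in the chunk for the next pass
    return x, n_passes

x, n_passes = ardm_sample_parallel(uniform_log_probs, D=10, K=K, chunk=4,
                                   rng=np.random.default_rng(0))
```

With D=10 and chunk=4 this takes 3 forward passes instead of 10. The trade-off is that variables within a chunk are sampled independently of each other, given everything generated before.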
ARDMs are conceptually very interesting as well. Their basic form lies at the intersection of "order agnostic ARMs" and "absorbing discrete diffusion". In fact, the _continuous-time limit_ of absorbing diffusion turns out to be an order agnostic ARM.