
Consider the entropy of the distribution of token X in these examples:

"Four X"

and

"Four X and seven years ago".

In the first case X could be pretty much anything, but in the second case we both know the only likely completion.
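A back-of-the-envelope sketch of how much the extra right-hand context collapses the entropy (the distributions here are made up for illustration; the real ones depend on the model):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up next-token distributions for X:
# with only "Four X", mass is spread over many plausible tokens...
broad = [0.05] * 20            # 20 equally likely candidates
# ...but with "and seven years ago" known, "score" dominates.
peaked = [0.95, 0.03, 0.02]

print(entropy_bits(broad))     # log2(20), about 4.32 bits
print(entropy_bits(peaked))    # about 0.33 bits
```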

So it seems like there would be a huge advantage in not having to run autoregressively. But in practice it's less significant than you might imagine, because the AR model can internally model the probability of X conditioned on the stuff it hasn't output yet. In fact, because (without reinforcement) training causes it to converge on the target probability of the whole output, the AR model must do some form of lookahead internally.
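That "probability of the whole output" property is just the chain rule; a trivial sketch of the factorization (the numbers are made up):

```python
import math

# Chain rule: p(x_1..x_n) = product over i of p(x_i | x_<i).
# If training matches the model's whole-sequence probability to the
# target, each per-token conditional must implicitly account for
# what the model is likely to output later.
conditionals = [0.9, 0.5, 0.8]          # made-up p(x_i | x_<i)
seq_prob = math.prod(conditionals)      # p(x_1, x_2, x_3)
print(seq_prob)                         # about 0.36
```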

(That said RLHF seems to break this product of the probabilities property pretty badly, so maybe it will be the case that diffusion will suffer less intelligence loss ::shrugs::).



> in the second case we both know the only likely completion.

You two may, but I don't. 'Decades'? 'Months'? 'Wives'? 'Jobs'? 'Conservative PMs'?


Diffusion models are built around this type of internal lookahead from the start (accurate near prediction, progressively less accurate far prediction, step forward, repeat). They just do it in the coarse-to-fine direction, i.e. in a different dimension, and had more thought put into shortcuts and speed-accuracy tradeoffs in this process. RL is also used with both types of models. It's not immediately obvious that one must necessarily be more efficient.


Both are distributions conditioned on the context they were given, so as you said in the second paragraph, the difference is not significant. I see what you mean, though, and maybe there are use cases where diffusion is preferable. To me it seems the context conditioning and internal model are sufficient, so this problem doesn't really occur.


::nods:: in the case of diffusion though "conditional on its own (eventual) output" is more transparent and explicit.

As an example of one place it might make a difference: suppose some external syntax restriction in the sampler is going to enforce that the next character after a space is "{".

Your normal AR LLM doesn't know about this restriction and may pick the tokens leading up to the "{" in a way which is regrettable given that there is going to be a {. The diffusion model, OTOH, can avoid that error.

In the case where there isn't an artificial constraint on the sampler this doesn't come up, because when it's outputting the earlier tokens the AR model knows in some sense about its own probability of outputting a { later on.

But in practice pretty much everyone engages in some amount of sampler twiddling, even if just cutting off low probability tokens.
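A minimal sketch of the kind of sampler twiddling in question (the toy vocabulary and logits are invented for illustration):

```python
def constrained_pick(logits, allowed):
    """Greedy sampling restricted to an externally allowed token set.
    The AR model chose all earlier tokens without knowing this mask."""
    return max(allowed, key=lambda tok: logits[tok])

# Made-up logits for the next position after a space.
logits = {"(": 2.1, "{": -1.5, "[": 0.3}

print(max(logits, key=logits.get))        # unconstrained pick: (
# An external syntax rule says only "{" is legal here, so the
# sampler overrides the model's preference:
print(constrained_pick(logits, {"{"}))    # forced pick: {
```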

As far as the internal model being sufficient: clearly it is, or AR LLMs could hardly produce coherent English. But although it's sufficient, it may not be particularly training- or weight-efficient.

I don't really know how these diffusion text models are trained so I can't really speculate, but it does seem to me that getting to make multiple passes might let them get by with less circuit depth. I think of it in terms of: every AR step must expend effort predicting something about the next few steps in order to output something sensible here, and this has to be done over and over again, even though it doesn't change.


Totally separate from this line of discussion: if you want to use an LLM for, say, copyediting, it's pretty obvious to me how a diffusion model could get much better results.

Like if you take your existing document and measure the probability of your actual word vs an AR model's output, various words are going to show up as erroneously improbable even when the following text makes them obvious. A diffusion model should just be able to score the entire text conditioned on the entire text, rather than just the text in front of each word.
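A sketch of the contrast, with a hypothetical stand-in scorer (`ToyModel` and its numbers are invented; a real setup would use an actual AR model and a masked/diffusion scorer):

```python
class ToyModel:
    """Hypothetical scorer: rates 'fair' as likely only when the
    right-hand context ('share') is visible, mimicking a word that
    the following text makes obvious."""
    def logprob(self, token, context):
        if token == "fair":
            return -0.1 if "share" in context else -6.0
        return -1.0

def ar_scores(model, tokens):
    # AR-style: each token scored given only the text in front of it.
    return [model.logprob(t, tokens[:i]) for i, t in enumerate(tokens)]

def full_context_scores(model, tokens):
    # Diffusion-style: each token scored given the entire text.
    return [model.logprob(t, tokens[:i] + ["<mask>"] + tokens[i + 1:])
            for i, t in enumerate(tokens)]

m = ToyModel()
doc = ["a", "fair", "share"]
print(ar_scores(m, doc))            # 'fair' looks erroneously improbable
print(full_context_scores(m, doc))  # 'fair' scored as obvious
```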



