Unfortunately, I fear I'll always be cheap regarding model size. Instinctively, my reaction is still "*millions of parameters?!?!*"
@francoisfleuret For humongous data, you need something that can absorb a lot of info. For now, this is parameters. Otherwise, it's our current best way of making optimization easier. Hopefully we'll find better ways eventually.
@giffmana @francoisfleuret Parameter-efficient learning is often taken for granted. I wish we go beyond the "fitting" paradigm and learn more with less.
@ahatamiz1 @francoisfleuret You may like our distillation paper then, which does exactly that: scholar.google.ch/scholar?q=dist…
@giffmana @francoisfleuret This is great! It seems like FunMatch is the key. I'd be curious to try this approach, but I'm wondering if you have any experiments with ViT? The paper shows an incredibly high 82.8% top-1 for ResNet-50, so ViT should do even better.
@ahatamiz1 @francoisfleuret We started the project before we invented ViT, and didn't want to change everything midway. However, I've done some ad-hoc experiments with ViT and it works just as well. I haven't pushed them to their limits yet.
@giffmana @francoisfleuret This is amazing. I was going to employ KL for our GC ViT, but this gives me all the motivation to jumpstart the effort and try it. Thanks for the pointer to this great work.
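The KL-based distillation the thread is discussing trains a student to match the teacher's full predictive distribution rather than one-hot labels. A minimal sketch of that loss in pure Python, assuming temperature-softened logits; the function names and temperature convention (scaling by T²) are illustrative, not taken from the paper's code:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T^2 so gradient magnitudes stay comparable across
    temperatures (the usual convention in distillation work).
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

This only sketches the matching objective; the paper's results also depend on training setup (consistent augmentations for teacher and student, long schedules), which no loss function alone captures.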