@percyliang Yes, but as you know, "Foundation" is too close to "Foundational", and many of us find that troubling. That is why I'm proposing a more neutral term. For use, maybe we could just call them "Upstream models".

@giffmana @francoisfleuret @tdietterich @percyliang Pretty sure it is quite important... We wouldn't be able to scale close to that scale with fully supervised models.

@ggdupont @francoisfleuret @tdietterich @percyliang Funny you tell me that, because I have several papers doing exactly that...

@giffmana @ggdupont @tdietterich @percyliang You do not think the best strategy to train models for image understanding will eventually be mostly self-supervised?

@francoisfleuret @ggdupont @tdietterich @percyliang Only time will tell, but currently, this strategy performs comparatively poorly.

@giffmana @francoisfleuret @ggdupont @percyliang My impression was that self-supervised is competitive with supervised in computer vision. Is this wrong? In particular, doesn't self-supervised permit training on much more data?

@tdietterich @francoisfleuret @ggdupont @percyliang Yes, it currently still fails: on ImageNet-1k there are now competitive methods, but scaling *the same* data 10x, on ImageNet-21k, they still fall far behind supervised. The stated goal, training on infinite web data, is superseded by (supervised!) image-text, which works much better.

@giffmana @tdietterich @francoisfleuret @percyliang Given the same data AND the right labels, supervised learning does get better results. Does it get the same level of generalisation/multitasking? (Again, for text, self-supervised allows more flexibility and scales higher, but I'm curious if it also happens on images.)

@ggdupont @tdietterich @francoisfleuret @percyliang That's a great question/point. I think for small scale yes, for large scale, it's not clear/settled yet! (Mostly due to lack of good self-sup at scale)