@giffmana I have not been able to achieve comparable results fine-tuning the CLIP image tower, or even in22k supervised -> 1k weights, without using it. The 1k val is higher and the OOD test set scores are better, ie more robust. I've done some hparam search, maybe not exhaustive enough?
@jbschiratti Every model I train myself for timm is trained with the train script in the repo. For this CLIP fine-tune task I'm currently using this branch, as it has SLURM support and the L/14 + H/14 need a 'few' GPUs github.com/rwightman/pyto…
@giffmana Will also be getting lots of help from other 🤗 folk for the doc push! My doc skills are still suspect, I keep getting distracted by shiny models...
typo, 'parameter or module' mappings. Here's a 'coarse' (ie by stage, not block) mapping of layer_id -> module name
Regexes returned by 'group_matcher', a method on every model; it's used internally in the optimizer factory when you pass 'layer_decay'. New param groups will be created and the timm LR sched will apply the decay factor... it can also be used manually to grab parameter or group mappings
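A minimal sketch of grabbing that matcher yourself (assuming a recent timm where `group_matcher` is exposed; the exact regexes differ per architecture):

```python
import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False)

# coarse=True groups by stage rather than per block; the optimizer factory
# applies these same regexes internally when you pass layer_decay
matcher = model.group_matcher(coarse=True)
print(matcher)  # dict of group name -> regex (or list of (regex, ordinal)) over param names
```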
I didn't just try to map parameter children / modules into a list (that isn't consistent across models). I sat down and wrote regexes (ugh) for every single model to appropriately map stem / block / stage / heads to meaningful 'layers', either blocks or 'coarse' stages
Also known as discriminative LR decay, this applies a decaying LR to the model params as you move away from the head. It's very useful for fine-tuning from a large pretrain dataset (or semi/unsupervised train -> supervised) without blowing away properties from the pretrain.
I've been told timm has a lot of hidden features. Yes, the docs need improving, that's a WIP! Curious about one of those features I've been using a lot lately in CLIP ViT fine-tuning? Every model in timm, when used with the optimizer factory, supports layer-wise LR decay.
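A rough usage sketch (model name, LR, and decay values are just illustrative, not recommended hparams):

```python
import timm
from timm.optim import create_optimizer_v2

model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=1000)

# as I understand it, group i gets lr * layer_decay ** (num_groups - 1 - i),
# so the head sees the full LR and groups near the stem are scaled down
optimizer = create_optimizer_v2(
    model,
    opt='adamw',
    lr=1e-4,
    weight_decay=0.05,
    layer_decay=0.75,
)
```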
@averma12 I believe right now the model in the hub has to be a 'CLIPModel' to support the zero-shot image classification task. ie it needs to be an image-text model that allows building a 'zero-shot' classifier via text prompts for a set of classes. huggingface.co/models?pipelin…
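For reference, a minimal zero-shot sketch with a Hub CLIPModel (model name, image path, and labels are just examples):

```python
from transformers import pipeline

# builds a zero-shot classifier from text prompts over the candidate labels
classifier = pipeline('zero-shot-image-classification', model='openai/clip-vit-base-patch32')
preds = classifier(
    'path/to/image.jpg',  # a URL or PIL.Image also works
    candidate_labels=['a photo of a dog', 'a photo of a cat', 'a photo of a bird'],
)
print(preds)  # list of {'score': ..., 'label': ...}, highest score first
```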
All CLIP models on the Hugging Face Hub now have a snazzy zero-shot widget, including the latest LAION-2B trained B/32, L/14, H/14, and g/14 🥳 twitter.com/mishig25/statu…
@simonw The comments re 'committing code' are a bit off for today's workflows. Most are using distributed version control (ie git, mercurial, etc). Branch branch branch, commit away. Worry about review before merging back to stable branches; making commits a gate would be nuts these days...
@sainingxie Saining, on the CLIP + LAION topic, I was going to push for ConvNeXT + LAION-2B runs if we get the compute budget, have there been any experiments you're aware of that look promising? Any ideas re image model tower size vs text tower or hparam gotchas you might be aware of?
@sainingxie One immediate q I had reading that paper on a first pass was whether it'd make sense to apply L1 across multiple pairs of feat maps through the network. And also what'd happen if FT + L1 feat loss were done in the same session.. ie loss = classification_loss + l1_feat_map_loss
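One reading of that idea as a sketch (a frozen copy of the pretrained net supplies the L1 feature-map target; the weighting and using only the final feature map are assumptions):

```python
import copy
import timm
import torch
import torch.nn.functional as F

model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=1000)

# frozen copy of the pretrained weights provides the reference feature maps
frozen = copy.deepcopy(model).eval()
for p in frozen.parameters():
    p.requires_grad = False

def combined_loss(images, targets, l1_weight=1.0):
    feats = model.forward_features(images)   # final feature map / tokens
    logits = model.forward_head(feats)
    with torch.no_grad():
        ref_feats = frozen.forward_features(images)
    return F.cross_entropy(logits, targets) + l1_weight * F.l1_loss(feats, ref_feats)
```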
@sainingxie I have not tried that (yet). Do have the paper open in a tab somewhere and thought the idea was worth trying, seems simple enough. First things first I needed to see where I could get FT results for a LAION related study.
Weights are not released yet, I will do that as a series. Hopefully soon. The source ViT models are in timm though, with a `_clip_laion2b` suffix. They use the same model hub entries as the OpenCLIP and HF Transformers models.
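E.g. to find and load them in timm (the exact names depend on the release, so treat the wildcard as an assumption):

```python
import timm

# list the LAION-2B CLIP image towers by suffix, then load one;
# pretrained=True pulls weights from the shared HF hub entries
names = timm.list_models('*_clip_laion2b')
print(names)
model = timm.create_model(names[0], pretrained=True)
```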
The FT LAION models have different OOD capability than the others. Comparing the ImageNet-Sketch / Renditions results, the direct in1k FT L/14 and H/14 top the charts for those two test sets while being meh at in1k itself.
The H/14 behaviour mirrors the feel of training ViT from scratch on 'smaller' datasets w/o adding lots of augreg tricks (anyone who's tried will be familiar). It'll be interesting to see if in22k gives the extra data needed to FT well for the 1k target.
Further experiments are ongoing, I'm hoping for some magic with H/14 and intermediate FT. For the direct FT, H/14 was peaking very early in the LR schedule. Increasing augreg helped it barely pass L/14 and pushed the peak back a bit, but I couldn't extend the gains with a longer sched.
One of my aims was to pass BEiT results, so digging into their FT process more, I decided to try intermediate FT on ImageNet-22k. The best BEiT weights are via two-stage adaptation. 83.3 @ 224 B/32, 85 @ 384 B/32, and now 87.9 @ 224 L/14. Looking much better!
For first runs I fine-tuned directly from the image tower CLIP weights (loadable in timm now via the HF hub models). This went okay, but I was hoping for more. I squeezed 87.4 @ 224 L/14, 87.8 @ 336, 82.2 @ 224 B/32, 84.4 @ 384 B/32. H/14 @ 224 only 87.6. Not bad, but not wow.
I've been plugging away at some ViT fine-tuning experiments this week while slowly recovering from a nasty cold. As part of the LAION-2B + CLIP exploration, I'm fine-tuning the recently released weights to ImageNet. Some expectations were upended, but interesting weights inbound.
@michalwols @giffmana @ducha_aiki @rom1504 @ApacheArrow You can still address that by sharding with oversampling (the increase in data on cheap storage is still much less $ than random access). Filtering down is easy, but there are limits to that. You can also get creative and tier data into different sets of shards and adjust the mix on read. Not OOB though
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow The nature of cloud storage / spinning disks is that it usually doesn't make much sense to do partial reads. In the time it takes to do the seek/request, you can download more data than you need and throw away the rest; it's the seeks / RTT that are the limiting factor.
@michalwols @ducha_aiki @rom1504 @giffmana @ApacheArrow For training, record formats are optimal because you usually want to see every record. For analysis, columnar / DB makes sense. So store metadata and extracted features in that form and point to the data in record blobs. Covers both use cases ....
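A hypothetical sketch of that split, with metadata / features in a columnar file pointing back into record shards (paths and columns are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# training reads the record shards sequentially; analysis / filtering happens
# on this columnar table, which points back into the shards by key
meta = pa.table({
    'sample_key': ['000000001', '000000002'],
    'shard':      ['shard-00000.tar', 'shard-00000.tar'],
    'caption':    ['a photo of a dog', 'a photo of a cat'],
    'clip_score': [0.31, 0.28],
})
pq.write_table(meta, 'metadata.parquet')
```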