I've been told timm has a lot of hidden features. Yes, the docs need improving, that's a WIP! Curious about one of those features I've been using a lot lately in CLIP ViT fine-tuning? Every model in timm, when used with the optimizer factory, supports layer-wise LR decay.
Also known as discriminative LR decay, this applies a decaying LR to the model params as you move away from the head. It's very useful for fine-tuning from a large pretraining dataset (or semi/unsupervised train -> supervised) without blowing away properties learned in pretraining.
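The math behind it is simple. Here's a minimal sketch (not timm's actual implementation): each layer's LR is the base LR scaled by decay^(distance from the head), so the head trains at full LR and the stem barely moves. In timm this is exposed through the optimizer factory's `layer_decay` arg (e.g. `create_optimizer_v2(model, opt='adamw', lr=1e-4, layer_decay=0.75)`).

```python
# Sketch of the layer-wise LR decay schedule (illustrative, not timm's code):
# layers near the head keep the base LR, earlier layers get geometrically
# smaller LRs.

def layer_wise_lrs(base_lr, num_layers, decay=0.75):
    """Per-layer LRs: index 0 = stem/embed, last index = head."""
    # Layer i gets base_lr * decay ** (num_layers - 1 - i), so the head
    # (last layer) trains at full base_lr and the stem at the smallest LR.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layer_wise_lrs(1e-4, num_layers=4, decay=0.5)
# head: 1e-4, stem: 1e-4 * 0.5**3 = 1.25e-5
```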
I didn't just try to map parameter children / modules into a flat list (that isn't consistent across models). I sat down and wrote regexes (ugh) for every single model to appropriately map stem / block / stage / heads to meaningful 'layers', either individual blocks or 'coarse' stages.
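To give a feel for the regex approach: the sketch below buckets parameter names into layer ids the way a ViT-style group matcher might. The patterns and the `layer_id` helper are illustrative, not timm's actual per-model regexes.

```python
import re

# Hypothetical patterns for a ViT-like model: stem -> group 0, blocks keep
# their block index, head/final norm -> last group.
PATTERNS = [
    (re.compile(r'^patch_embed'), 0),        # stem params -> first group
    (re.compile(r'^blocks\.(\d+)'), None),   # take block index from the match
    (re.compile(r'^(norm|head)'), -1),       # head / final norm -> last group
]

def layer_id(param_name, num_blocks):
    """Map a parameter name to a layer id in [0, num_blocks + 1]."""
    for pat, fixed in PATTERNS:
        m = pat.match(param_name)
        if m:
            if fixed is None:
                return int(m.group(1)) + 1   # blocks start after the stem
            return num_blocks + 1 if fixed == -1 else fixed
    return num_blocks + 1                    # unmatched params train with the head
```

Each layer id then indexes into the decayed LR list to build the optimizer's param groups.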