Our new paper uses insights from the classic 8-point algorithm to tweak modern ViTs, giving a clean end-to-end model that takes two RGB images as input and outputs their relative translation + rotation.
It helps to think carefully about a model's inductive biases! twitter.com/_crockwell/sta…
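For reference, here's the classical eight-point algorithm the paper draws inspiration from, as a minimal NumPy sketch (the unnormalized textbook version; Hartley's coordinate normalization and the paper's actual ViT-based method are not shown):

```python
import numpy as np

def eight_point(x1, x2):
    """Estimate the fundamental matrix F from >= 8 point
    correspondences via the classic eight-point algorithm.
    x1, x2: (N, 2) matched image coordinates satisfying
    x2^T F x1 = 0 for each pair (up to noise)."""
    assert x1.shape[0] >= 8
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    ones = np.ones_like(u1)
    # Each correspondence gives one linear constraint on the 9 entries of F.
    A = np.stack([u2 * u1, u2 * v1, u2,
                  v2 * u1, v2 * v1, v2,
                  u1, v1, ones], axis=1)
    # Least-squares solution: right singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Project onto rank-2 matrices (a valid F has rank 2).
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```

Decomposing the resulting matrix (an essential matrix, if the coordinates are calibrated) yields exactly the relative rotation + translation the end-to-end model predicts directly.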
Our new paper detects objects and predicts their 3D shape and layout, trained without any 3D shapes! Instead we use multiview 2D silhouettes + differentiable rendering losses.
This is an important step in scaling up 3D perception, where we can't rely on large-scale ground-truth. twitter.com/georgiagkioxar…
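As a rough illustration of the kind of supervision involved (a common soft-silhouette loss used with differentiable renderers; the paper's actual losses may differ), assuming a rendered soft mask and a binary target mask:

```python
import numpy as np

def soft_silhouette_loss(pred, target, eps=1e-6):
    """One minus the soft IoU between a rendered soft silhouette
    (values in [0, 1]) and a binary target mask. A generic
    silhouette supervision signal for differentiable rendering;
    a simplification, not the paper's exact objective."""
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return 1.0 - inter / (union + eps)
```

The key point is that everything here is differentiable in `pred`, so gradients flow back through the renderer into the predicted 3D shape -- no 3D ground truth needed.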
It was fun talking about the future of #CV and #AI and how the research community needs to unlock new applications. I am passionately against squeezing out a few percentage points on benchmarks, and in favor of solving novel and important problems. twitter.com/twimlai/status…
Wow, ResNet just passed AlexNet for total citations -- 102k vs 101k (per Google Scholar).
AlexNet citations have flattened since 2019, but ResNet continues to accelerate!
@ClementDelangue @kdexd @gauravkaul7 @zubinaysola We intentionally avoid data with people due to privacy concerns — both through subreddit selection and filtering with face detectors. So images with people don’t give great results; the model usually ignores people entirely (as in your examples!)
😺 RedCaps is a large-scale dataset of 12M image-text pairs collected from Reddit for vision and vision-and-language tasks. The data is collected from a manually curated set of subreddits (350 total).
If we prompt the model to generate captions in the style of the /r/cakewin subreddit, then 2/5 captions know it's Elmo and another 2/5 recognize it as a cake for kids!
But a model trained on RedCaps has no problem recognizing the last one as a cake -- 5/5 captions recognize it as a cake, and one nails it as an "elmo cake"!
With 12M+ image-text pairs, our new RedCaps dataset is one of the largest public vision+language datasets.
We hope it will be useful for multimodal pretraining, image captioning, and more! twitter.com/kdexd/status/1…
A new exact algorithm for computing 3D IoU of batches of oriented 3D bounding boxes is now in Pytorch3D! 450x faster than previous methods! twitter.com/georgiagkioxar…
Spoiled by 2D, I was shocked to find out there are no good ways to compute exact IoU of oriented 3D boxes. So, we came up with a new algorithm which is exact, simple, efficient and batched. Naturally, we have C++ and CUDA support #PyTorch3D
Read more: tinyurl.com/27zywpws
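For contrast, the axis-aligned case we're spoiled by in 2D extends trivially to 3D -- it's arbitrary orientation that makes exact IoU hard. A sketch of the easy axis-aligned version in NumPy:

```python
import numpy as np

def aabb_iou_3d(a, b):
    """Exact IoU of two axis-aligned 3D boxes, each given as
    (xmin, ymin, zmin, xmax, ymax, zmax). The easy case:
    intersection of axis-aligned boxes is itself a box.
    Oriented boxes have polyhedral intersections, which is
    what the new PyTorch3D algorithm handles exactly."""
    lo = np.maximum(a[:3], b[:3])          # lower corner of intersection
    hi = np.minimum(a[3:], b[3:])          # upper corner of intersection
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)
```

In PyTorch3D itself, the exact oriented-box routine is exposed (if I recall the API correctly) as `pytorch3d.ops.box3d_overlap`, taking batches of (N, 8, 3) box corners and returning intersection volumes and IoUs.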
@_ryli Sorry, we don’t release solutions since we reuse assignments year to year
There are a few other tricks -- 3D reprojection for spatial consistency, curriculum learning, sample reranking, and more. See the full paper for all the details: arxiv.org/abs/2108.05892
We want high-resolution 256×256 images. That would be very slow with standard autoregressive modeling on pixels. So we instead use a VQ-VAE (similar to DALL-E) and autoregressively generate a low-resolution 32×32 grid of latent codes, which a CNN decoder converts to an image.
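The core VQ-VAE trick is simple: snap each continuous encoder feature to its nearest codebook entry, so the image becomes a small grid of discrete indices. A minimal NumPy sketch of that lookup (illustrative only, not the paper's implementation):

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-codebook lookup used in a VQ-VAE.
    z: (H, W, D) continuous encoder features.
    codebook: (K, D) learned code vectors.
    Returns (H, W) integer indices -- the discrete grid the
    autoregressive prior models -- and the (H, W, D) quantized
    features fed to the decoder."""
    # Squared distance from every spatial feature to every code: (H, W, K).
    d = ((z[..., None, :] - codebook) ** 2).sum(-1)
    idx = d.argmin(-1)
    return idx, codebook[idx]
```

The payoff: the autoregressive model generates 32×32 = 1,024 tokens instead of 256×256×3 pixel values, and the CNN decoder handles the upsampling in one shot.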