shonenkov / CLIP-ODS
CLIP Object Detection: search for objects in an image using natural language. #Zeroshot #Unsupervised #CLIP #ODS
☆138 Updated 2 years ago
Related projects
Alternatives and complementary repositories for CLIP-ODS
- [NeurIPS 2022] Official PyTorch implementation of Optimizing Relevance Maps of Vision Transformers Improves Robustness. This code allows … ☆127 Updated last year
- PyTorch code for MUST ☆104 Updated last year
- Generate text captions for images from their embeddings. ☆100 Updated last year
- [NeurIPS 2023] This repository includes the official implementation of our paper "An Inverse Scaling Law for CLIP Training" ☆297 Updated 5 months ago
- A task-agnostic vision-language architecture as a step towards General Purpose Vision ☆92 Updated 3 years ago
- ☆47 Updated 3 years ago
- Official repository for "Revisiting Weakly Supervised Pre-Training of Visual Perception Models". https://arxiv.org/abs/2201.08371. ☆173 Updated 2 years ago
- Release of ImageNet-Captions ☆45 Updated last year
- PyTorch implementation of the LOST unsupervised object discovery method ☆236 Updated last year
- This repo contains documentation and code needed to use PACO dataset: data loaders and training and evaluation scripts for objects, parts… ☆270 Updated 9 months ago
- Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training ☆132 Updated last year
- Easily compute clip embeddings from video frames ☆136 Updated last year
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet ☆209 Updated last year
- Get hundreds of millions of image+url pairs from the Crawling at Home dataset and preprocess them ☆205 Updated 5 months ago
- (CVPR 2022) PyTorch implementation of "Self-supervised transformers for unsupervised object discovery using normalized cut" ☆298 Updated last year
- GRiT: A Generative Region-to-text Transformer for Object Understanding (https://arxiv.org/abs/2212.00280) ☆302 Updated 10 months ago
- Reproducible scaling laws for contrastive language-image learning (https://arxiv.org/abs/2212.07143) ☆153 Updated 11 months ago
- ☆265 Updated 2 years ago
- Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral ☆238 Updated last year
- ☆43 Updated 3 years ago
- Let's make a video clip ☆93 Updated 2 years ago
- Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training. ☆364 Updated last year
- An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities ☆154 Updated 2 years ago
- Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks, de… ☆97 Updated 2 years ago
- [NeurIPS 2022] The official implementation of "Learning to Discover and Detect Objects". ☆108 Updated last year
- L-Verse: Bidirectional Generation Between Image and Text ☆108 Updated last year
- [ECCV'22] Official repository of paper titled "Class-agnostic Object Detection with Multi-modal Transformer". ☆300 Updated last year
- Using pretrained encoder and language models to generate captions from multimedia inputs. ☆95 Updated last year
- ☆100 Updated 9 months ago
- Optimized library for large-scale extraction of frames and audio from video. ☆203 Updated last year