hoanganhpham1006 / COST
This is the official implementation of the paper "Video Dialog as Conversation about Objects Living in Space-Time".
☆32Updated 3 years ago
Alternatives and similar repositories for COST
Users who are interested in COST are comparing it to the repositories listed below.
- ☆14Updated 4 years ago
- source code and pre-trained/fine-tuned checkpoint for NAACL 2021 paper LightningDOT☆72Updated 3 years ago
- ☆131Updated 2 years ago
- A unified framework to jointly model images, text, and human attention traces.☆79Updated 4 years ago
- Use CLIP to represent video for Retrieval Task☆70Updated 4 years ago
- Archive of Tasks and Results of the Video Browser Showdown☆13Updated 8 months ago
- [ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources☆45Updated 3 years ago
- Data Release for VALUE Benchmark☆30Updated 3 years ago
- A task-agnostic vision-language architecture as a step towards General Purpose Vision☆92Updated 4 years ago
- Visual Language Transformer Interpreter - An interactive visualization tool for interpreting vision-language transformers☆97Updated 2 years ago
- Machine Reading Comprehension special for the Vietnamese language☆42Updated 3 years ago
- ☆26Updated 4 years ago
- 1st Place Solution in Google Universal Image Embedding☆67Updated 2 years ago
- 👨🏻💻 Code release for Vietnamese chatbot from scratch [Published in IEEE IMCOM 2022]☆17Updated last year
- TextAdaIN: Paying Attention to Shortcut Learning in Text Recognizers☆21Updated 3 years ago
- Repository for Multilingual-VQA task created during HuggingFace JAX/Flax community week.☆34Updated 4 years ago
- Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps [AAAI2021]☆57Updated 3 years ago
- Traffic Video Event Retrieval via Text Query using Vehicle Appearance and Motion Attributes☆10Updated 4 years ago
- ☆45Updated last year
- Code and data for ImageCoDe, a contextual vision-and-language benchmark☆41Updated last year
- Implementation of LaTr: Layout-aware transformer for scene-text VQA, a novel multimodal architecture for Scene Text Visual Question Answer…☆55Updated last year
- Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities☆80Updated 3 years ago
- ☆47Updated 6 months ago
- Repository for the paper "Data Efficient Masked Language Modeling for Vision and Language".☆18Updated 4 years ago
- A reading list of papers about Visual Question Answering.☆35Updated 3 years ago
- MLPs for Vision and Language Modeling (Coming Soon)☆27Updated 3 years ago
- Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training☆138Updated 2 years ago
- Code and Resources for the Transformer Encoder Reasoning Network (TERN) - https://arxiv.org/abs/2004.09144☆58Updated last year
- [BMVC22] Official Implementation of ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment"☆55Updated 3 years ago
- ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.☆85Updated 2 years ago