Thorin215 / FocusedAD
Repo of FocusedAD
☆12 Updated last month
Alternatives and similar repositories for FocusedAD
Users who are interested in FocusedAD are comparing it to the repositories listed below
- Code for paper: Unified Text-to-Image Generation and Retrieval ☆15 Updated 10 months ago
- [ECCV'24 Oral] PiTe: Pixel-Temporal Alignment for Large Video-Language Model ☆16 Updated 3 months ago
- [ICML 2025] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models ☆23 Updated 2 weeks ago
- Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning ☆22 Updated last month
- This is the implementation of CounterCurate, the data curation pipeline of both physical and semantic counterfactual image-caption pairs. ☆18 Updated 10 months ago
- This repo contains code for the paper "Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM" ☆13 Updated last month
- Official implementation of ECCV24 paper: POA ☆24 Updated 9 months ago
- A benchmark dataset and simple code examples for measuring the perception and reasoning of multi-sensor Vision Language models. ☆18 Updated 4 months ago
- [EMNLP 2024] Official code for "Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models" ☆18 Updated 7 months ago
- "Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs" 2023 ☆14 Updated 5 months ago
- Project for SNARE benchmark ☆11 Updated 11 months ago
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration ☆34 Updated 4 months ago
- The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search" ☆24 Updated last week
- PyTorch code for "ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning" ☆20 Updated 6 months ago
- This is the official repo for ByteVideoLLM/Dynamic-VLM ☆20 Updated 5 months ago
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs ☆11 Updated last month
- The code for "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation" [CVPR2025] ☆15 Updated 2 months ago
- Implementation of the model: "(MC-ViT)" from the paper: "Memory Consolidation Enables Long-Context Video Understanding" ☆21 Updated last month
- ☆14 Updated 7 months ago
- Official Repository of Personalized Visual Instruct Tuning ☆28 Updated 2 months ago
- ☆43 Updated 3 weeks ago
- ☆41 Updated 6 months ago
- ☆18 Updated 3 weeks ago
- PyTorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models ☆28 Updated last year
- ☆12 Updated 4 months ago
- ☆10 Updated 6 months ago
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains ☆40 Updated last week
- Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning ☆15 Updated 6 months ago
- Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching" ☆35 Updated 9 months ago
- Official implementation of the paper "MMInA: Benchmarking Multihop Multimodal Internet Agents" ☆43 Updated 2 months ago