DAMO-NLP-SG / VideoRefer
[CVPR 2025] The code for "VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM"
☆193 · Updated last week
Alternatives and similar repositories for VideoRefer:
Users who are interested in VideoRefer are comparing it to the repositories listed below.
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ☆303 · Updated 2 months ago
- Official repository of T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT ☆79 · Updated last week
- The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM". ☆246 · Updated 4 months ago
- [ICML 2025 Spotlight] An official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding? ☆139 · Updated 2 weeks ago
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ☆466 · Updated 4 months ago
- Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models ☆153 · Updated 5 months ago
- GPT-ImgEval: Evaluating GPT-4o’s state-of-the-art image generation capabilities ☆252 · Updated this week
- [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization ☆563 · Updated 11 months ago
- ☆135 · Updated 4 months ago
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation ☆114 · Updated 6 months ago
- Official code base for paper EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidan… ☆104 · Updated last month
- A post-training method to enhance CLIP's fine-grained visual representations with generative models. ☆48 · Updated last month
- ☆138 · Updated 2 weeks ago
- LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation ☆36 · Updated 2 months ago
- ☆228 · Updated 5 months ago
- [ICLR 2025] BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities ☆142 · Updated 3 months ago
- ✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy ☆275 · Updated last month
- Code release for "UniVS: Unified and Universal Video Segmentation with Prompts as Queries" (CVPR2024) ☆182 · Updated 5 months ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models ☆94 · Updated last year
- (ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator ☆107 · Updated last month
- Liquid: Language Models are Scalable and Unified Multi-modal Generators ☆555 · Updated last month
- A family of versatile and state-of-the-art video tokenizers. ☆382 · Updated last month
- [NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization. ☆293 · Updated 10 months ago
- [AAAI 2025] Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation ☆145 · Updated 4 months ago
- Official implementation of Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data ☆83 · Updated 6 months ago
- Multi-granularity Correspondence Learning from Long-term Noisy Videos [ICLR 2024, Oral] ☆113 · Updated last year
- Ola: Pushing the Frontiers of Omni-Modal Language Model ☆334 · Updated 2 months ago
- Video-Inpaint-Anything: This is the inference code for our paper CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, C… ☆291 · Updated 7 months ago
- Evaluating text-to-image/video/3D models with VQAScore ☆293 · Updated last month
- A collection of multimodal reasoning papers, codes, datasets, benchmarks and resources. ☆199 · Updated last week