hanghuacs / FineCaption
☆22Updated last month
Alternatives and similar repositories for FineCaption:
Users who are interested in FineCaption are comparing it to the repositories listed below.
- Code release for "SegLLM: Multi-round Reasoning Segmentation"☆55Updated this week
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos☆94Updated 2 weeks ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆45Updated 2 months ago
- Diffusion Powers Video Tokenizer for Comprehension and Generation☆38Updated last month
- ☆37Updated 3 months ago
- 🔥 Aurora Series: A more efficient multimodal large language model series for video.☆61Updated last month
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences☆32Updated 3 months ago
- Liquid: Language Models are Scalable Multi-modal Generators☆57Updated 3 weeks ago
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection☆40Updated last week
- [NeurIPS 2024] The official implementation of the research paper "FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Atten…☆34Updated last month
- Code for "VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement"☆39Updated last month
- ☆16Updated last year
- 🔥 [CVPR 2024] Official implementation of "See, Say, and Segment: Teaching LMMs to Overcome False Premises (SESAME)"☆30Updated 6 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment☆35Updated last week
- ☆14Updated 2 months ago
- Official PyTorch implementation - Video Motion Transfer with Diffusion Transformers☆32Updated last month
- Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis☆83Updated 5 months ago
- Official repository of "Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning"☆23Updated 3 weeks ago
- [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM☆64Updated 2 months ago
- Code for the paper: Towards Semantic Equivalence of Tokenization in Multimodal LLM☆46Updated 3 months ago
- InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption 🔍☆28Updated 3 weeks ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want☆63Updated 2 months ago
- ☆20Updated 6 months ago
- [TCSVT 2024] Temporally Consistent Referring Video Object Segmentation with Hybrid Memory☆14Updated 2 months ago
- The official repository for paper "PruneVid: Visual Token Pruning for Efficient Video Large Language Models".☆22Updated 2 weeks ago
- Official Repository of Personalized Visual Instruct Tuning☆26Updated 2 months ago
- CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient☆75Updated last month
- The official code of the paper "PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction".☆50Updated this week
- This is the official repo for ByteVideoLLM/Dynamic-VLM☆18Updated 3 weeks ago