CuriseJia / FreeStyleRet
Precision Search through Multi-Style Inputs
☆45, updated last month
Related projects:
- [ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions ☆112, updated 2 months ago
- [NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT ☆129, updated 4 months ago
- Video dataset dedicated to portrait-mode video recognition. ☆35, updated 5 months ago
- The official implementation of RAR ☆61, updated 5 months ago
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆93, updated last month
- Unified Multi-modal IAA Baseline and Benchmark ☆68, updated 5 months ago
- Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis ☆75, updated 2 months ago
- An unofficial implementation of the paper “DiffEdit: Diffusion-based semantic image editing with mask guidance” ☆23, updated last year
- LAVIS - A One-stop Library for Language-Vision Intelligence ☆47, updated last month
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆103, updated 3 weeks ago
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ☆52, updated 5 months ago
- ☆64, updated 4 months ago
- [CVPR 2024] Dynamic Prompt Optimizing for Text-to-Image Generation ☆56, updated 2 months ago
- Explore the Limits of Omni-modal Pretraining at Scale ☆80, updated 2 weeks ago
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing ☆87, updated 5 months ago
- ☆113, updated 2 months ago
- Repository for 23'MM accepted paper "Curriculum-Listener: Consistency- and Complementarity-Aware Audio-Enhanced Temporal Sentence Groundi… ☆36, updated 8 months ago
- ☆78, updated 8 months ago
- ☆42, updated 2 months ago
- [ECCV2024] Towards Reliable Advertising Image Generation Using Human Feedback ☆29, updated last month
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆105, updated last month
- PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM is accepted by CVPR 2024. ☆174, updated 3 months ago
- Please refer to our official repo at https://github.com/IVGSZ/Flash-VStream. ☆48, updated last month
- ☆28, updated 2 weeks ago
- Implements VAR+CLIP for image generation ☆64, updated last month
- ECCV2024_Parrot Captions Teach CLIP to Spot Text ☆58, updated last week
- ☆82, updated 2 months ago
- Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models ☆45, updated 4 months ago
- Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment ☆44, updated 3 months ago
- Lion: Kindling Vision Intelligence within Large Language Models ☆52, updated 7 months ago