yliu-cs / PiTe
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
☆12 Updated last month
Related projects
Alternatives and complementary repositories for PiTe
- ☆12 Updated last month
- 🔥 Aurora Series: A more efficient multimodal large language model series for video. ☆41 Updated 2 weeks ago
- ☕️ CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion ☆27 Updated 4 months ago
- Disentangled Pre-training for Human-Object Interaction Detection ☆17 Updated last week
- ☆31 Updated 8 months ago
- Data-Efficient Multimodal Fusion on a Single GPU ☆47 Updated 6 months ago
- [NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. ☆40 Updated 3 weeks ago
- [CVPR 2024] "Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition" ☆11 Updated 8 months ago
- [CVPR2024] Official implementation of the paper: Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning ☆36 Updated 5 months ago
- [NeurIPS 2024] One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos ☆29 Updated last week
- Codes for ICML 2024 paper: "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition" ☆40 Updated 4 months ago
- Official Repository of Personalized Visual Instruct Tuning ☆23 Updated last week
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model ☆17 Updated 3 months ago
- ☆35 Updated last month
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect… ☆32 Updated 4 months ago
- Official implementation of MIA-DPO ☆32 Updated last week
- ☆17 Updated 4 months ago
- Video Diffusion State Space Models ☆19 Updated 7 months ago
- [TCSVT 2024] Temporally Consistent Referring Video Object Segmentation with Hybrid Memory ☆12 Updated 3 weeks ago
- ☆39 Updated 11 months ago
- [CVPR 2024] Improving language-visual pretraining efficiency by performing cluster-based masking on images. ☆22 Updated 5 months ago
- ☆20 Updated 3 months ago
- [NeurIPS 2024] Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective ☆38 Updated last week
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression ☆38 Updated 3 months ago
- [ECCV2024] Learning Video Context as Interleaved Multimodal Sequences ☆29 Updated last month
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image and Video Generation (arXiv: 2410.12761) ☆18 Updated 3 weeks ago
- Official source codes of "TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation" ☆24 Updated last month
- ☆57 Updated last year
- ☆14 Updated 3 weeks ago