AskYoutubeAI / AskVideos-VideoCLIP
☆50 · Updated last month
Related projects:
- ☆62 · Updated 5 months ago
- Official implementation of "Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models" ☆35 · Updated 8 months ago
- SATO: Stable Text-to-Motion Framework ☆97 · Updated last month
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆115 · Updated 2 weeks ago
- ☆58 · Updated 10 months ago
- Code repo for "Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding" ☆19 · Updated last month
- ☆38 · Updated 4 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆53 · Updated last month
- ☆47 · Updated 11 months ago
- (WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, B… ☆77 · Updated last week
- Multimodal-Procedural-Planning ☆90 · Updated last year
- ☆65 · Updated 3 months ago
- KokoMind: Can LLMs Understand Social Interactions? ☆103 · Updated 11 months ago
- ☆65 · Updated last year
- ☆131 · Updated 3 weeks ago
- Implementation of the premier Text to Video model from OpenAI ☆57 · Updated last week
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆148 · Updated 2 months ago
- Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scra… ☆49 · Updated last year
- Incredibly descriptive audiovisual summaries for videos ☆39 · Updated last month
- Research work on multimodal cognitive AI ☆54 · Updated 3 weeks ago
- ☆84 · Updated 8 months ago
- This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?" ☆115 · Updated 3 months ago
- A family of highly capable yet efficient large multimodal models ☆155 · Updated 3 weeks ago
- ☆132 · Updated 7 months ago
- An official codebase for the paper "CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos" (ICCV 23) ☆52 · Updated last year
- Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation ☆128 · Updated 9 months ago
- Implementation of the paper "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆30 · Updated last month
- ☆83 · Updated last year
- LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1 ☆78 · Updated this week
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model" ☆85 · Updated 6 months ago