Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"
☆79 · Oct 29, 2025 · Updated 4 months ago
Alternatives and similar repositories for SAIL
Users who are interested in SAIL are comparing it to the repositories listed below.
- On Path to Multimodal Generalist: General-Level and General-Bench ☆18 · Jul 11, 2025 · Updated 7 months ago
- coded with and corrected by Google Anti-Gravity ☆13 · Nov 23, 2025 · Updated 3 months ago
- ☆42 · Jul 9, 2025 · Updated 7 months ago
- [ICLR'26] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology ☆75 · Jan 26, 2026 · Updated last month
- [ICLR 26] The official code repository for the paper "Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions". ☆15 · Feb 9, 2026 · Updated 2 weeks ago
- ☆10 · Jul 5, 2024 · Updated last year
- A tiny package supporting distributed computation of COCO metrics for PyTorch models. ☆15 · Feb 28, 2023 · Updated 3 years ago
- Benchmarking Multi-Image Understanding in Vision and Language Models ☆12 · Jul 29, 2024 · Updated last year
- ☆13 · Jan 22, 2025 · Updated last year
- This is an implementation of the paper "Are We Done with Object-Centric Learning?" ☆12 · Sep 11, 2025 · Updated 5 months ago
- Collaborative retina modelling across datasets and species. ☆18 · Updated this week
- [TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆147 · Nov 14, 2024 · Updated last year
- OpenVision (ICCV 2025), OpenVision 2 (CVPR 2026), and OpenVision 3 ☆457 · Feb 21, 2026 · Updated last week
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆368 · Jul 24, 2025 · Updated 7 months ago
- 🕵 Code for our EMNLP 2025 Main paper: "FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games" ☆24 · Dec 14, 2025 · Updated 2 months ago
- [AAAI'25] CharacterBench: Benchmarking Character Customization of Large Language Models ☆19 · Aug 1, 2025 · Updated 7 months ago
- Code For Our Work: DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [ECCV-2024] ☆14 · Jul 11, 2024 · Updated last year
- An environment for testing tensorflow models. ☆14 · Oct 29, 2017 · Updated 8 years ago
- [NeurIPS 2025] Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models ☆64 · Nov 27, 2025 · Updated 3 months ago
- Code for ICLR'24 workshop ME-FoMo-How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation ☆38 · Oct 18, 2024 · Updated last year
- Official PyTorch implementation of RACRO (https://www.arxiv.org/abs/2506.04559) ☆19 · Jul 1, 2025 · Updated 8 months ago
- ☆42 · Dec 6, 2025 · Updated 2 months ago
- Ola: Pushing the Frontiers of Omni-Modal Language Model ☆385 · Jun 13, 2025 · Updated 8 months ago
- ☆31 · Aug 7, 2025 · Updated 6 months ago
- Official Project Page for HLA: Higher-order Linear Attention (https://arxiv.org/abs/2510.27258) ☆45 · Jan 6, 2026 · Updated last month
- [AAAI 2025] Explore In-Context Segmentation via Latent Diffusion Models ☆22 · Mar 25, 2025 · Updated 11 months ago
- Code and data for paper "Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation". ☆23 · Oct 22, 2025 · Updated 4 months ago
- The official repo for LIFT: Language-Image Alignment with Fixed Text Encoders ☆42 · Jun 10, 2025 · Updated 8 months ago
- Load any clip model with a standardized interface ☆22 · Oct 20, 2025 · Updated 4 months ago
- TPDiff: Temporal Pyramid Video Diffusion Model ☆25 · Mar 13, 2025 · Updated 11 months ago
- [AAAI 2022] DCAN: Improving Temporal Action Detection via Dual Context Aggregation ☆17 · Nov 13, 2022 · Updated 3 years ago
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation (CVPR'25) ☆18 · Apr 2, 2025 · Updated 11 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆95 · Mar 1, 2025 · Updated last year
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆1,548 · Jun 14, 2025 · Updated 8 months ago
- [ECCV 2024] Tokenize Anything via Prompting ☆603 · Dec 11, 2024 · Updated last year
- [ICCV 2025] Official repo for "GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation" ☆198 · Jan 7, 2026 · Updated last month
- ICML2025 ☆63 · Aug 28, 2025 · Updated 6 months ago
- A Unified Efficient Pyramid Transformer for Semantic Segmentation, ICCVW 2021 ☆31 · Oct 11, 2021 · Updated 4 years ago
- Un-*** 50 billions multimodality dataset ☆23 · Sep 14, 2022 · Updated 3 years ago