albanie / foundation-models
Video descriptions of research papers relating to foundation models and scaling
☆30 · Updated 2 years ago
Alternatives and similar repositories for foundation-models:
- ☆43 · Updated 2 months ago
- ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models (ICLR 2024, official implementation) ☆15 · Updated last year
- ☆31 · Updated last year
- Implementation of MaMMUT, a simple vision-encoder text-decoder architecture for multimodal tasks from Google, in PyTorch ☆100 · Updated last year
- ☆64 · Updated last year
- PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023) ☆33 · Updated 2 years ago
- Code for the paper "CiT: Curation in Training for Effective Vision-Language Data" ☆78 · Updated 2 years ago
- https://arxiv.org/abs/2209.15162 ☆49 · Updated 2 years ago
- ☆23 · Updated 5 months ago
- Code release for "Improved Baselines for Vision-Language Pre-training" ☆60 · Updated 10 months ago
- ☆29 · Updated 2 years ago
- Code for experiments for "ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy" ☆101 · Updated 6 months ago
- Holistic evaluation of multimodal foundation models ☆43 · Updated 7 months ago
- ☆24 · Updated last year
- Patching open-vocabulary models by interpolating weights ☆91 · Updated last year
- M4 experiment logbook ☆57 · Updated last year
- ☆23 · Updated 5 months ago
- Personal experiments around routing tokens to different autoregressive attention modules, akin to mixture-of-experts ☆117 · Updated 5 months ago
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models ☆43 · Updated 9 months ago
- Original code base for "On Pretraining Data Diversity for Self-Supervised Learning" ☆13 · Updated 2 months ago
- ☆117 · Updated 2 years ago
- ☆49 · Updated last year
- [ICLR 2025] Source code for the paper "A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegr…" ☆69 · Updated 3 months ago
- ☆51 · Updated 9 months ago
- A minimal implementation of a LLaVA-style VLM with interleaved image, text, and video processing ability ☆90 · Updated 3 months ago
- Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets ☆156 · Updated 11 months ago
- A PyTorch implementation of the paper "ViP: A Differentially Private Foundation Model for Computer Vision" ☆36 · Updated last year
- Official repository for the General Robust Image Task (GRIT) Benchmark ☆53 · Updated last year
- Code for "Pretrained Language Models as Visual Planners for Human Assistance" ☆60 · Updated last year
- Official repository for the paper "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns…" ☆16 · Updated last year