reka-ai / reka-vibe-eval
Multimodal language model benchmark, featuring challenging examples
☆171 Updated 6 months ago
Alternatives and similar repositories for reka-vibe-eval
Users who are interested in reka-vibe-eval are comparing it to the repositories listed below
- This repository is maintained to release dataset and models for multimodal puzzle reasoning. ☆95 Updated 4 months ago
- LL3M: Large Language and Multi-Modal Model in Jax ☆72 Updated last year
- M4 experiment logbook ☆58 Updated last year
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆138 Updated 8 months ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆205 Updated 10 months ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆149 Updated last month
- Self-Alignment with Principle-Following Reward Models ☆162 Updated 2 months ago
- This is the official repository for Inheritune. ☆111 Updated 5 months ago
- Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch ☆177 Updated 3 weeks ago
- ☆96 Updated 9 months ago
- ☆70 Updated 4 months ago
- Language models scale reliably with over-training and on downstream tasks ☆97 Updated last year
- Code for the EMNLP 2024 paper "Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps" ☆128 Updated 11 months ago
- Code for Paper: Harnessing Webpage UIs for Text-Rich Visual Understanding ☆52 Updated 7 months ago
- [ICCV 2025] Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis. ☆144 Updated this week
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆138 Updated 7 months ago
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073 ☆29 Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users ☆228 Updated 8 months ago
- ☆98 Updated last year
- [NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆86 Updated 9 months ago
- Python Library to evaluate VLM models' robustness across diverse benchmarks ☆208 Updated last week
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆102 Updated 4 months ago
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya ☆117 Updated this week
- Scaling Computer-Use Grounding via UI Decomposition and Synthesis ☆85 Updated 3 weeks ago
- Official Implementation of ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay ☆88 Updated last month
- ☆135 Updated 8 months ago
- PASTA: Post-hoc Attention Steering for LLMs ☆121 Updated 7 months ago
- Public Inflection Benchmarks ☆68 Updated last year
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks" ☆54 Updated last month
- The HELMET Benchmark ☆156 Updated 2 months ago