RAIVNLab / mnms
m&ms: A Benchmark to Evaluate Tool-Use for multi-step multi-modal tasks
☆30 · Updated 5 months ago
Related projects:
- Web-grounded natural language instructions ☆11 · Updated 3 weeks ago
- ☆46 · Updated 2 weeks ago
- ☆31 · Updated 8 months ago
- ☆14 · Updated 2 months ago
- ☆42 · Updated 5 months ago
- Generating diverse counterfactual data for Natural Language Understanding tasks using Large Language Models (LLMs). The generator support… ☆34 · Updated last year
- Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K, by adding irrelevant se… ☆51 · Updated last year
- Lightweight tool to identify Data Contamination in LLMs evaluation ☆39 · Updated 6 months ago
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts" ☆51 · Updated 5 months ago
- ☆44 · Updated 8 months ago
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs ☆29 · Updated 7 months ago
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073 ☆24 · Updated 2 months ago
- Repository for paper Tools Are Instrumental for Language Agents in Complex Environments ☆32 · Updated 8 months ago
- Source code of "Reasons to Reject? Aligning Language Models with Judgments" ☆54 · Updated 6 months ago
- Restore safety in fine-tuned language models through task arithmetic ☆25 · Updated 5 months ago
- Self-Explore to avoid the pit! Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards ☆39 · Updated 4 months ago
- NLPBench: Evaluating NLP-Related Problem-solving Ability in Large Language Models ☆9 · Updated 10 months ago
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks ☆20 · Updated 7 months ago
- ☆80 · Updated 9 months ago
- InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising ☆32 · Updated 2 months ago
- Reproduction of "RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment" ☆63 · Updated last year
- ☆32 · Updated 10 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆86 · Updated 3 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆45 · Updated 6 months ago
- The official implementation of paper "Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agen… ☆20 · Updated 6 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆46 · Updated 5 months ago
- [COLM'24] "How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?" ☆18 · Updated last week
- my commonly-used tools ☆46 · Updated last month
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models" ☆38 · Updated 2 months ago
- ☆14 · Updated last week