A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆87Aug 10, 2024Updated last year
Alternatives and similar repositories for BIG-Bench-Mistake
Users that are interested in BIG-Bench-Mistake are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- Open-source repository for the OOPSLA'24 paper "CYCLE: Learning to Self-Refine Code Generation"☆10Mar 8, 2024Updated 2 years ago
- Code for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning https://arxiv.org/pdf/2410.01044☆35Oct 3, 2024Updated last year
- EMNLP 2022: Analyzing and Evaluating Faithfulness in Dialogue Summarization☆13Mar 20, 2025Updated last year
- FeedbackQA: Improving Question Answering Post-Deployment with Interactive Feedback☆12Jul 13, 2022Updated 3 years ago
- An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors☆27Mar 2, 2026Updated last month
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- DSTC9 Submission☆16Apr 12, 2021Updated 5 years ago
- ☆17Aug 1, 2024Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers☆23Feb 9, 2025Updated last year
- The Lean Theorem Proving Environment☆15May 7, 2023Updated 2 years ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation☆169Oct 11, 2024Updated last year
- code for Scaling Laws of RoPE-based Extrapolation☆73Oct 16, 2023Updated 2 years ago
- ☆54Aug 25, 2023Updated 2 years ago
- ☆27Jun 11, 2025Updated 10 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy☆78Oct 9, 2025Updated 6 months ago
- Wordpress hosting with auto-scaling - Free Trial • AdFully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
- Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)☆27Oct 3, 2025Updated 6 months ago
- ACL24☆11Jun 7, 2024Updated last year
- A dashboard for exploring timm learning rate schedulers☆20Nov 22, 2024Updated last year
- Official implementation of ECCV24 paper: POA☆24Aug 8, 2024Updated last year
- ☆136May 8, 2025Updated 11 months ago
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.☆794Oct 4, 2024Updated last year
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023)☆24Oct 10, 2023Updated 2 years ago
- Pytorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24)☆63Apr 18, 2024Updated 2 years ago
- ACL 2022: Just Rank: Rethinking Evaluation with Word and Sentence Similarities☆35Dec 14, 2022Updated 3 years ago
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large …☆25May 29, 2024Updated last year
- Repository having the code and models from the paper: data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student traini…☆13Mar 18, 2024Updated 2 years ago
- A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.☆62Oct 21, 2024Updated last year
- ☆131Jul 8, 2024Updated last year
- Computationally Modelling Resisting Strategies in Persuasive Conversations☆12Feb 6, 2022Updated 4 years ago
- ☆27Nov 25, 2025Updated 4 months ago
- Code for paper: Long cOntext aliGnment via efficient preference Optimization☆24Oct 10, 2025Updated 6 months ago
- Seamless Voice Interactions with LLMs☆12Oct 28, 2023Updated 2 years ago
- Improving word mover’s distance by leveraging self-attention matrix (Published in EMNLP 2023 Findings)☆10Mar 10, 2026Updated last month
- GPUs on demand by Runpod - Special Offer Available • AdRun AI, ML, and HPC workloads on powerful cloud GPUs—without limits or wasted spend. Deploy GPUs in under a minute and pay by the second.
- [ICLR 2025] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist☆35Oct 23, 2024Updated last year
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks?☆57Apr 17, 2023Updated 3 years ago
- Proof recording for Lean 3☆27Sep 30, 2021Updated 4 years ago
- Redundancy Undermines the Trustworthiness of Self-Interpretable GNNs, International Conference on Machine Learning (ICML), 2025☆15Jun 23, 2025Updated 9 months ago
- ☆34Mar 21, 2026Updated 3 weeks ago
- Scratchpad/Chain-of-Thought Prompts☆12Jun 6, 2022Updated 3 years ago
- ☆13Sep 27, 2022Updated 3 years ago