A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆87 · Aug 10, 2024 · Updated last year
Alternatives and similar repositories for BIG-Bench-Mistake
Users who are interested in BIG-Bench-Mistake are comparing it to the repositories listed below.
- Open-source repository for the OOPSLA'24 paper "CYCLE: Learning to Self-Refine Code Generation" ☆10 · Mar 8, 2024 · Updated 2 years ago
- Code for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning https://arxiv.org/pdf/2410.01044 ☆35 · Oct 3, 2024 · Updated last year
- EMNLP 2022: Analyzing and Evaluating Faithfulness in Dialogue Summarization ☆13 · Mar 20, 2025 · Updated last year
- FeedbackQA: Improving Question Answering Post-Deployment with Interactive Feedback ☆12 · Jul 13, 2022 · Updated 3 years ago
- A flexible & scalable MLLM-based AIGC detection pipeline ☆31 · Oct 27, 2025 · Updated 5 months ago
- An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors ☆26 · Mar 2, 2026 · Updated 3 weeks ago
- DSTC9 Submission ☆16 · Apr 12, 2021 · Updated 4 years ago
- ☆16 · Aug 1, 2024 · Updated last year
- [NAACL 2025] Representing Rule-based Chatbots with Transformers ☆23 · Feb 9, 2025 · Updated last year
- The Lean Theorem Proving Environment ☆15 · May 7, 2023 · Updated 2 years ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation ☆168 · Oct 11, 2024 · Updated last year
- Code for Scaling Laws of RoPE-based Extrapolation ☆73 · Oct 16, 2023 · Updated 2 years ago
- ☆54 · Aug 25, 2023 · Updated 2 years ago
- ACL24 ☆11 · Jun 7, 2024 · Updated last year
- ☆133 · May 8, 2025 · Updated 10 months ago
- Evolutionary Search for expert-level performance on any task with environmental feedback ☆14 · Oct 12, 2025 · Updated 5 months ago
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ☆789 · Oct 4, 2024 · Updated last year
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023) ☆24 · Oct 10, 2023 · Updated 2 years ago
- PyTorch implementation for "Compressed Context Memory For Online Language Model Interaction" (ICLR'24) ☆63 · Apr 18, 2024 · Updated last year
- ACL 2022: Just Rank: Rethinking Evaluation with Word and Sentence Similarities ☆35 · Dec 14, 2022 · Updated 3 years ago
- HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation [ACL 2023] ☆14 · Jul 11, 2023 · Updated 2 years ago
- [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large … ☆25 · May 29, 2024 · Updated last year
- Repository having the code and models from the paper: data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student traini… ☆13 · Mar 18, 2024 · Updated 2 years ago
- A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages. ☆62 · Oct 21, 2024 · Updated last year
- ☆132 · Jul 8, 2024 · Updated last year
- Computationally Modelling Resisting Strategies in Persuasive Conversations ☆12 · Feb 6, 2022 · Updated 4 years ago
- ☆27 · Nov 25, 2025 · Updated 4 months ago
- Code for paper: Long cOntext aliGnment via efficient preference Optimization ☆24 · Oct 10, 2025 · Updated 5 months ago
- Seamless Voice Interactions with LLMs ☆12 · Oct 28, 2023 · Updated 2 years ago
- [ICLR 2025] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist ☆35 · Oct 23, 2024 · Updated last year
- Redundancy Undermines the Trustworthiness of Self-Interpretable GNNs, International Conference on Machine Learning (ICML), 2025 ☆14 · Jun 23, 2025 · Updated 9 months ago
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks? ☆57 · Apr 17, 2023 · Updated 2 years ago
- Proof recording for Lean 3 ☆27 · Sep 30, 2021 · Updated 4 years ago
- Scratchpad/Chain-of-Thought Prompts ☆12 · Jun 6, 2022 · Updated 3 years ago
- ☆13 · Sep 27, 2022 · Updated 3 years ago
- [ACL 2024] CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and … ☆103 · Jul 29, 2024 · Updated last year
- Dataflow-guided retrieval augmentation for repository-level code completion, ACL 2024 (main) ☆34 · Mar 24, 2025 · Updated last year
- Code for the EMNLP 2020 paper "Re-examining the Role of Schema Linking in Text-to-SQL". ☆28 · Nov 23, 2020 · Updated 5 years ago
- Evaluation on Logical Reasoning and Abstract Reasoning Challenges ☆29 · Apr 21, 2025 · Updated 11 months ago