arnav-gudibande / koala-test-set
The test set for Koala
☆45 Updated last year
Related projects
Alternatives and complementary repositories for koala-test-set
- The data processing pipeline for the Koala chatbot language model ☆117 Updated last year
- Small and Efficient Mathematical Reasoning LLMs ☆71 Updated 9 months ago
- ☆170 Updated last year
- ☆75 Updated last year
- Official implementation for 'Extending LLMs’ Context Window with 100 Samples' ☆73 Updated 9 months ago
- ☆111 Updated last month
- [NAACL 2024] Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models ☆82 Updated 8 months ago
- Code for the arXiv paper: "LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond" ☆58 Updated 7 months ago
- ☆63 Updated 2 years ago
- Official code for ACL 2023 (short, findings) paper "Recursion of Thought: A Divide and Conquer Approach to Multi-Context Reasoning with L… ☆42 Updated last year
- Code for ICLR 2024 paper "CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets" ☆47 Updated 5 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆41 Updated 8 months ago
- ☆175 Updated last year
- Self-Alignment with Principle-Following Reward Models ☆148 Updated 8 months ago
- [ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ☆69 Updated 7 months ago
- A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs. ☆75 Updated 9 months ago
- Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format ☆25 Updated last year
- ☆44 Updated 5 months ago
- Implementation of the paper: "Answering Questions by Meta-Reasoning over Multiple Chains of Thought" ☆92 Updated 9 months ago
- Source code and datasets for "How well do Large Language Models perform in Arithmetic tasks?" ☆57 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆84 Updated 7 months ago
- CodeUltraFeedback: aligning large language models to coding preferences ☆65 Updated 4 months ago
- A set of utilities for running few-shot prompting experiments on large-language models ☆112 Updated last year
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆38 Updated 3 weeks ago
- The GitHub repo for Goal Driven Discovery of Distributional Differences via Language Descriptions ☆68 Updated last year
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆73 Updated 3 months ago
- Code of ICLR paper: https://openreview.net/forum?id=-cqvvvb-NkI ☆91 Updated last year
- [ICLR 2024] COLLIE: Systematic Construction of Constrained Text Generation Tasks ☆52 Updated last year
- This is the official repository of the paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" ☆85 Updated last month