Shangyint / langProBe
☆26 · Updated last week
Alternatives and similar repositories for langProBe
Users that are interested in langProBe are comparing it to the libraries listed below
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆225 · Updated 7 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆136 · Updated last year
- ☆62 · Updated 8 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆246 · Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆132 · Updated last year
- ☆77 · Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆227 · Updated last year
- ☆242 · Updated last year
- A package to generate summaries of long-form text and evaluate the coherence of these summaries. Official package for our ICLR 2024 paper… ☆128 · Updated last year
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent ☆169 · Updated last month
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models ☆119 · Updated 6 months ago
- A simple unified framework for evaluating LLMs ☆261 · Updated 9 months ago
- [ICLR 2025] BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval ☆189 · Updated 4 months ago
- Reproducible, flexible LLM evaluations ☆337 · Updated last week
- ☆107 · Updated last year
- ☆56 · Updated last year
- AI Logging for Interpretability and Explainability🔬 ☆140 · Updated last year
- Code implementation of synthetic continued pretraining ☆148 · Updated last year
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆230 · Updated 6 months ago
- The HELMET Benchmark ☆198 · Updated 2 months ago
- ☆32 · Updated last year
- ☆187 · Updated 7 months ago
- [NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't… ☆129 · Updated last year
- Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task. ☆154 · Updated 5 months ago
- Inspecting and Editing Knowledge Representations in Language Models ☆119 · Updated 2 years ago
- ☆55 · Updated last year
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆328 · Updated last year
- [ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ☆107 · Updated last year
- Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge" ☆78 · Updated 2 years ago
- [ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" ☆78 · Updated last year