ConsequentAI / fnevalLinks

Functional Benchmarks and the Reasoning Gap

☆88

Alternatives and similar repositories for fneval

Users that are interested in fneval are comparing it to the libraries listed below

Sorting:

SalesforceAIResearch / LaTRO
☆120Updated 6 months ago
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160Updated last year
kanishkg / stream-of-search
Repository for the paper Stream of Search: Learning to Search in Language
☆150Updated 6 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71Updated last year
allenai / infinigram-api
☆77Updated 2 weeks ago
casper-hansen / OpenCoconut
OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.
☆173Updated 7 months ago
oriyor / assistantbench
Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"
☆60Updated 8 months ago
SALT-NLP / demonstrated-feedback
☆126Updated 10 months ago
allenai / DataDecide
☆32Updated 3 weeks ago
JacobPfau / fillerTokens
☆67Updated last year
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60Updated 11 months ago
Alex-Gurung / ReasoningNCP
Official repo for Learning to Reason for Long-Form Story Generation
☆68Updated 4 months ago
allenai / CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
☆91Updated last year
google-deepmind / mishax
☆138Updated 4 months ago
architsharma97 / dpo-rlaif
☆100Updated last year
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆228Updated last month
msclar / formatspread
Code accompanying "How I learned to start worrying about prompt formatting".
☆109Updated 2 months ago
r-three / phatgoose
Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization"
☆87Updated last year
joshuacnf / Ctrl-G
☆89Updated 7 months ago
ScalingIntelligence / Archon
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
☆177Updated 5 months ago
hamishivi / EasyLM
Large language models (LLMs) made easy, EasyLM is a one stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…
☆75Updated last year
teknium1 / LLM-Benchmark-Logs
Just a bunch of benchmark logs for different LLMs
☆119Updated last year
InflectionAI / Inflection-Benchmarks
Public Inflection Benchmarks
☆68Updated last year
IBM / ModuleFormer
ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp…
☆223Updated last year
KaiNylund / lm-weights-encode-time
☆69Updated last year
sher222 / LeReT
Learning to Retrieve by Trying - Source code for Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
☆49Updated 9 months ago
google / sycophancy-intervention
Scripts for generating synthetic finetuning data for reducing sycophancy.
☆114Updated 2 years ago
princeton-nlp / USACO
Can Language Models Solve Olympiad Programming?
☆118Updated 7 months ago
neelsjain / BYOD
The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"
☆107Updated last year
ruiqi-zhong / D5
The GitHub repo for Goal Driven Discovery of Distributional Differences via Language Descriptions
☆70Updated 2 years ago