huggingface / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆33Updated 3 months ago
Alternatives and similar repositories for lm-evaluation-harness
Users who are interested in lm-evaluation-harness are comparing it to the libraries listed below.
- Verifiers for LLM Reinforcement Learning☆60Updated 2 months ago
- Codebase accompanying the Summary of a Haystack paper.☆78Updated 9 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment☆57Updated 9 months ago
- ☆51Updated 7 months ago
- Official Code Repository for the paper "Distilling LLM Agent into Small Models with Retrieval and Code Tools"☆104Updated 2 weeks ago
- Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks".☆170Updated 2 weeks ago
- Complex Function Calling Benchmark.☆114Updated 5 months ago
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents☆82Updated this week
- Source code of the paper: RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering [F…☆66Updated last year
- Code for EMNLP 2024 paper "Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning"☆54Updated 8 months ago
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets.☆77Updated 8 months ago
- ☆60Updated 2 weeks ago
- ☆123Updated 8 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)☆137Updated 7 months ago
- ☆39Updated 11 months ago
- Train your own SOTA deductive reasoning model☆94Updated 3 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M…☆223Updated 7 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators☆42Updated last year
- ☆124Updated 2 months ago
- The first dense retrieval model that can be prompted like an LM☆73Updated last month
- ☆57Updated 8 months ago
- ☆118Updated 9 months ago
- ☆115Updated 4 months ago
- Source code for the collaborative reasoner research project at Meta FAIR.☆91Updated 2 months ago
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024)☆47Updated 5 months ago
- ☆86Updated last month
- ☆71Updated last year
- ☆62Updated 11 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate"☆157Updated 2 weeks ago
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages☆48Updated 6 months ago