gkamradt / LLMTest_NeedleInAHaystack

Doing simple retrieval from LLM models at various context lengths to measure accuracy
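The one-line description compresses the whole method: hide a fact (the "needle") at a chosen depth inside a long filler context (the "haystack"), ask the model to retrieve it, and sweep over context lengths and depths to see where recall breaks down. The sketch below is a minimal illustration under stated assumptions, not the repository's code. Lengths are measured in characters rather than tokens, `toy_model` is a hypothetical stand-in for a real LLM API call, and scoring is a plain substring match rather than whatever evaluator the repository actually uses.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_haystack(filler: str, needle: str, length: int, depth_percent: float) -> str:
    """Return a context of `length` characters (assumption: characters, not
    tokens) with `needle` inserted near `depth_percent` (0 = start, 100 = end)."""
    if not filler:
        raise ValueError("filler must be non-empty")
    # Reserve room for the needle plus one separating space.
    body_len = max(length - len(needle) - 1, 0)
    # Repeat the filler until it covers the budget, then truncate.
    body = (filler * (body_len // len(filler) + 1))[:body_len]
    # Target offset, snapped back to the previous sentence boundary so the
    # needle is not spliced into the middle of a sentence.
    target = int(len(body) * depth_percent / 100)
    boundary = body.rfind(". ", 0, target)
    pos = boundary + 2 if boundary != -1 else 0
    return body[:pos] + needle + " " + body[pos:]


def run_grid(
    model: Callable[[str], str],
    filler: str,
    needle: str,
    question: str,
    answer: str,
    lengths: Iterable[int],
    depths: Iterable[float],
) -> Dict[Tuple[int, float], bool]:
    """Query `model` once per (length, depth) cell and record whether the
    expected answer appears in its reply (simple substring scoring)."""
    depths = list(depths)  # allow reuse across every length
    results: Dict[Tuple[int, float], bool] = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack(filler, needle, length, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            reply = model(prompt)
            results[(length, depth)] = answer.lower() in reply.lower()
    return results


def toy_model(prompt: str, window: int = 2000) -> str:
    """Hypothetical stand-in for an LLM: it only 'sees' the last `window`
    characters of the prompt, mimicking a model with a short context."""
    visible = prompt[-window:]
    return "sandwich" if "secret word is sandwich" in visible else "I don't know"
```

Usage: `run_grid(toy_model, "The grass is green. The sky is blue. ", "The secret word is sandwich.", "What is the secret word?", "sandwich", [500, 5000], [0, 100])` finds the needle everywhere at 500 characters, but at 5000 characters only when it sits near the end, because the toy model's window drops the start of the context. Plotting such a grid as a length × depth heatmap is the usual way to read the results.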

☆1,951

Alternatives and similar repositories for LLMTest_NeedleInAHaystack

Users interested in LLMTest_NeedleInAHaystack are comparing it to the libraries listed below.

tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,816 · Updated 7 months ago
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆1,766 · Updated this week
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆2,821 · Updated this week
hendrycks / test
Measuring Massive Multitask Language Understanding | ICLR 2021
☆1,461 · Updated 2 years ago
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,536 · Updated last year
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆2,516 · Updated this week
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,032 · Updated 2 years ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,568 · Updated last year
tencent-ailab / persona-hub
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
☆1,249 · Updated 5 months ago
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,367 · Updated this week
allenai / open-instruct
AllenAI's post-training codebase
☆3,083 · Updated this week
trotsky1997 / MathBlackBox
☆1,028 · Updated 7 months ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,279 · Updated last week
maitrix-org / llm-reasoners
A library for advanced large language model reasoning
☆2,193 · Updated last month
prometheus-eval / prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
☆974 · Updated 3 months ago
stanfordnlp / pyreft
Stanford NLP Python library for Representation Finetuning (ReFT)
☆1,500 · Updated 5 months ago
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,844 · Updated last year
philschmid / deep-learning-pytorch-huggingface
☆1,254 · Updated 5 months ago
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆884 · Updated last month
AkariAsai / self-rag
This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai,…
☆2,147 · Updated last year
mlfoundations / dclm
DataComp for Language Models
☆1,342 · Updated 4 months ago
lucidrains / self-rewarding-lm-pytorch
Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI
☆1,394 · Updated last year
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆2,704 · Updated 6 months ago
huggingface / search-and-learn
Recipes to scale inference-time compute of open models
☆1,110 · Updated 2 months ago
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,583 · Updated last year
NVIDIA / RULER
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
☆1,205 · Updated last week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,068 · Updated 3 weeks ago
ContextualAI / HALOs
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
☆873 · Updated 2 weeks ago
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆736 · Updated 4 months ago
yuchenlin / LLM-Blender
[ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive…
☆952 · Updated 9 months ago