booydar / babilong

BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.

☆215

Alternatives and similar repositories for babilong

Users who are interested in babilong are comparing it to the libraries listed below.

huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆273 Updated last year
dwzhu-pku / PoSE
Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Length (ICLR 2024)
☆204 Updated last year
imoneoi / multipack
Multipack distributed sampler for fast padding-free training of LLMs
☆201 Updated last year
princeton-nlp / HELMET
The HELMET Benchmark
☆178 Updated 2 months ago
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆250 Updated 11 months ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆242 Updated 11 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆257 Updated this week
RulinShao / retrieval-scaling
Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".
☆216 Updated 2 months ago
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆251 Updated 6 months ago
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆117 Updated last year
TIGER-AI-Lab / LongICLBench
Code and Data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR2025]
☆108 Updated 8 months ago
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆218 Updated 4 months ago
dwzhu-pku / LongEmbed
LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)
☆144 Updated 11 months ago
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆233 Updated 3 months ago
da03 / Internalize_CoT_Step_by_Step
☆195 Updated 6 months ago
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆155 Updated 6 months ago
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆314 Updated last year
SALT-NLP / demonstrated-feedback
☆128 Updated last year
dust-tt / llama-ssp
Experiments on speculative sampling with Llama models
☆125 Updated 2 years ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆311 Updated last year
IBM / ModuleFormer
ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp…
☆224 Updated last month
SalesforceAIResearch / LaTRO
☆122 Updated 8 months ago
Re-Align / URIAL
☆312 Updated last year
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆233 Updated last month
sail-sg / oat
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
☆539 Updated this week
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆163 Updated last year
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆261 Updated last year
bminixhofer / zett
Code for Zero-Shot Tokenizer Transfer
☆138 Updated 9 months ago
Digitous / LLM-SLERP-Merge
Spherical Merge Pytorch/HF format Language Models with minimal feature loss.
☆138 Updated 2 years ago
princeton-nlp / ProLong
Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
☆231 Updated last month