normster / llm_rules
RuLES: a benchmark for evaluating rule-following in language models
☆224 · Updated 3 months ago
Alternatives and similar repositories for llm_rules
Users who are interested in llm_rules are comparing it to the libraries listed below.
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (NeurIPS 2024) ☆189 · Updated last year
- The official evaluation suite and dynamic data release for MixEval. ☆241 · Updated 6 months ago
- NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day ☆254 · Updated last year
- ☆149 · Updated last year
- Manage scalable open LLM inference endpoints in Slurm clusters ☆257 · Updated 10 months ago
- Evaluating LLMs with fewer examples ☆156 · Updated last year
- Code and data accompanying our paper on arXiv "Faithful Chain-of-Thought Reasoning". ☆159 · Updated last year
- Scaling Data-Constrained Language Models ☆334 · Updated 8 months ago
- Extract full next-token probabilities via language model APIs ☆248 · Updated last year
- Fast bare-bones BPE for modern tokenizer training ☆156 · Updated 2 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆198 · Updated last month
- This is the repo for the paper Shepherd -- A Critic for Language Model Generation ☆218 · Updated last year
- ☆120 · Updated 8 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆219 · Updated last year
- Code for the paper "Fishing for Magikarp" ☆155 · Updated 2 weeks ago
- ☆288 · Updated last year
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆171 · Updated 4 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆302 · Updated last year
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆108 · Updated last year
- Simple next-token-prediction for RLHF ☆226 · Updated last year
- ☆97 · Updated 11 months ago
- Replicating O1 inference-time scaling laws ☆87 · Updated 6 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆207 · Updated 8 months ago
- Evaluating LLMs with CommonGen-Lite ☆90 · Updated last year
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 8 months ago
- ☆517 · Updated 6 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆193 · Updated 6 months ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆188 · Updated 9 months ago
- ☆75 · Updated last month
- A repository for research on medium sized language models. ☆495 · Updated 3 weeks ago