carlini / yet-another-applied-llm-benchmark
A benchmark to evaluate language models on questions I've previously asked them to solve.
☆916 Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for yet-another-applied-llm-benchmark
- Fine-tune mistral-7B on 3090s, a100s, h100s ☆702 Updated last year
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆1,634 Updated this week
- Deep learning for dummies. All the practical details and useful utilities that go into working with real models. ☆715 Updated last month
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,045 Updated this week
- System 2 Reasoning Link Collection ☆693 Updated 3 weeks ago
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆811 Updated this week
- ReFT: Representation Finetuning for Language Models ☆1,159 Updated 2 weeks ago
- Automatically evaluate your LLMs in Google Colab ☆559 Updated 6 months ago
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆282 Updated last month
- Minimalistic large language model 3D-parallelism training ☆1,260 Updated this week
- Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard a… ☆798 Updated 2 weeks ago
- Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI ☆1,336 Updated 7 months ago
- TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients. ☆1,824 Updated 2 weeks ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆1,565 Updated 3 months ago
- Optimizing inference proxy for LLMs ☆1,563 Updated this week
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,095 Updated last week
- nanoGPT style version of Llama 3.1 ☆1,246 Updated 3 months ago
- [ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling ☆1,529 Updated 4 months ago
- Automated Design of Agentic Systems ☆1,038 Updated this week
- A bibliography and survey of the papers surrounding o1 ☆754 Updated this week
- Generate textbook-quality synthetic LLM pretraining data ☆488 Updated last year
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆517 Updated 2 weeks ago