jamesmurdza / agenteval

Automated testing and benchmarking for code generation agents.

☆18

Alternatives and similar repositories for agenteval

Users who are interested in agenteval are comparing it to the libraries listed below.

toufunao / SCM4LLMs
☆32 Updated 2 years ago
shoggoth13 / agents-deconstructed
☆57 Updated 2 years ago
stunningpixels / lou-eval
Track the progress of LLM context utilisation
☆54 Updated 6 months ago
argilla-io / notus
Notus is a collection of fine-tuned LLMs using SFT, DPO, SFT+DPO, and/or any other RLHF techniques, while always keeping a data-first app…
☆169 Updated last year
SebastianBodza / EnsembleForecasting
Using multiple LLMs for ensemble Forecasting
☆16 Updated last year
Arize-ai / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆105 Updated last month
akjindal53244 / Arithmo
Small and Efficient Mathematical Reasoning LLMs
☆72 Updated last year
automix-llm / automix
Mixing Language Models with Self-Verification and Meta-Verification
☆109 Updated 10 months ago
BerriAI / bettertest
☆73 Updated last year
tanyuqian / cappy
NeurIPS 2023 - Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
☆44 Updated last year
official-elinas / zeus-llm-trainer
Zeus LLM Trainer is a rewrite of Stanford Alpaca aiming to be the trainer for all Large Language Models
☆69 Updated 2 years ago
reactorsh / ambrosia
clean up your LLM datasets
☆113 Updated 2 years ago
VikParuchuri / classified
Score LLM pretraining data with classifiers
☆54 Updated last year
davanstrien / data-for-fine-tuning-llms
☆80 Updated last year
kookaburracodes / investor-education-chatchain
Not financial advice.
☆27 Updated 2 years ago
togethercomputer / Llama-2-7B-32K-Instruct
☆85 Updated 2 years ago
1rgs / tokenwiz
A clone of OpenAI's Tokenizer page for HuggingFace Models
☆45 Updated last year
S1M0N38 / dspy-arxiv
Explore the use of DSPy for extracting features from PDFs 🔎
☆47 Updated last year
aymeric-roucher / LongContext_vs_RAG_NeedleInAHaystack
Comparing retrieval abilities from GPT4-Turbo and a RAG system on a toy example for various context lengths
☆35 Updated last year
emrgnt-cmplxty / zero-shot-replication
☆73 Updated 2 years ago
venuv / LangSynth
Conduct consumer interviews with synthetic focus groups using LLMs and LangChain
☆43 Updated 2 years ago
matthewrenze / jhu-concise-cot
The Benefits of a Concise Chain of Thought on Problem Solving in Large Language Models
☆22 Updated 11 months ago
fsndzomga / baby_agi_dspy
a version of baby agi using dspy and typed predictors
☆17 Updated last year
nateraw / replicate-examples
☆74 Updated last year
yoheinakajima / autofinetune
auto fine tune of models with synthetic data
☆75 Updated last year
Technoculture / personal-graph
Simple Graph Memory for AI applications
☆89 Updated 5 months ago
mobarski / alpaca-libre
Reimplementation of the task generation part from the Alpaca paper
☆118 Updated 2 years ago
kumar-shridhar / Screws
SCREWS: A Modular Framework for Reasoning with Revisions
☆27 Updated 2 years ago
allenai / CommonGen-Eval
Evaluating LLMs with CommonGen-Lite
☆91 Updated last year
yoheinakajima / asymmetrix
☆132 Updated 2 years ago