jamesmurdza / agenteval
Automated testing and benchmarking for code generation agents.
☆18 · Updated last year
Alternatives and similar repositories for agenteval:
Users interested in agenteval are comparing it to the libraries listed below.
- Using multiple LLMs for ensemble forecasting ☆16 · Updated last year
- ☆57 · Updated last year
- ☆48 · Updated 6 months ago
- Writing Blog Posts with Generative Feedback Loops! ☆47 · Updated last year
- A public implementation of the ReLoRA pretraining method, built on Lightning AI's PyTorch Lightning suite. ☆33 · Updated last year
- A version of BabyAGI using DSPy and typed predictors ☆17 · Updated last year
- ☆24 · Updated last year
- The Benefits of a Concise Chain of Thought on Problem Solving in Large Language Models ☆22 · Updated 5 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆39 · Updated last year
- An LLM reads a paper and produces a working prototype ☆52 · Updated 3 weeks ago
- Conduct consumer interviews with synthetic focus groups using LLMs and LangChain ☆43 · Updated last year
- Testing PaliGemma 2 fine-tuning on a reasoning dataset ☆18 · Updated 4 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆90 · Updated 3 months ago
- Track the progress of LLM context utilisation ☆54 · Updated 3 weeks ago
- Simple examples using Argilla tools to build AI ☆52 · Updated 5 months ago
- Doing simple retrieval from LLM models at various context lengths to measure accuracy ☆99 · Updated last year
- ☆32 · Updated last year
- Claude API Test Project ☆87 · Updated last year
- ☆66 · Updated 11 months ago
- Data preparation code for the CrystalCoder 7B LLM ☆44 · Updated 11 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆49 · Updated 9 months ago
- Scripts to create your own MoE models using MLX ☆89 · Updated last year
- ☆77 · Updated 11 months ago
- ☆48 · Updated last year
- ☆33 · Updated 2 years ago
- A framework for evaluating function calls made by LLMs ☆37 · Updated 9 months ago
- NeurIPS 2023 - Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer ☆43 · Updated last year
- Comparing retrieval abilities of GPT-4 Turbo and a RAG system on a toy example at various context lengths ☆35 · Updated last year
- OpenMindedChatbot is a proof of concept that leverages the power of open-source Large Language Models (LLMs) with function calling capabil… ☆29 · Updated last year
- Zeus LLM Trainer is a rewrite of Stanford Alpaca aiming to be the trainer for all Large Language Models ☆69 · Updated last year