GoodAI / goodai-ltm-benchmark
A library for benchmarking the long-term memory and continual learning capabilities of LLM-based agents, with all the tests and code you need to evaluate your own agents. See more in the blogpost:
☆82 Updated last year
Alternatives and similar repositories for goodai-ltm-benchmark
Users who are interested in goodai-ltm-benchmark are comparing it to the libraries listed below.
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 Updated 11 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆111 Updated last year
- Just a bunch of benchmark logs for different LLMs ☆119 Updated last year
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀 ☆102 Updated 4 months ago
- ☆63 Updated 6 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆133 Updated last year
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆95 Updated 2 months ago
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆89 Updated 2 weeks ago
- Source code for our paper: "SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals". ☆69 Updated last year
- ☆105 Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆190 Updated 9 months ago
- LLM reads a paper and produces a working prototype ☆60 Updated 8 months ago
- Train your own SOTA deductive reasoning model ☆107 Updated 9 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆174 Updated 11 months ago
- accompanying material for sleep-time compute paper ☆118 Updated 7 months ago
- The first dense retrieval model that can be prompted like an LM ☆89 Updated 7 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆66 Updated last year
- Functional Benchmarks and the Reasoning Gap ☆90 Updated last year
- ☆136 Updated 9 months ago
- ☆41 Updated last year
- ☆55 Updated last year
- Evaluating LLMs with CommonGen-Lite ☆93 Updated last year
- ☆86 Updated 2 years ago
- Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT) ☆125 Updated 10 months ago
- Official code for the paper "ADaPT: As-Needed Decomposition and Planning with Language Models" ☆90 Updated last year
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆156 Updated 10 months ago
- Track the progress of LLM context utilisation ☆55 Updated 8 months ago
- ☆105 Updated 11 months ago
- ☆68 Updated 6 months ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆115 Updated 8 months ago