GoodAI / goodai-ltm-benchmark
A library for benchmarking the long-term memory and continual learning capabilities of LLM-based agents, with all the tests and code you need to evaluate your own agents. See more in the blogpost:
☆81Updated 11 months ago
Alternatives and similar repositories for goodai-ltm-benchmark
Users who are interested in goodai-ltm-benchmark are comparing it to the libraries listed below
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025)☆91Updated 10 months ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments☆93Updated 2 months ago
- Just a bunch of benchmark logs for different LLMs☆119Updated last year
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo)☆89Updated this week
- The first dense retrieval model that can be prompted like an LM☆89Updated 7 months ago
- Functional Benchmarks and the Reasoning Gap☆90Updated last year
- ☆62Updated 5 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding.☆173Updated 10 months ago
- Mixing Language Models with Self-Verification and Meta-Verification☆110Updated 11 months ago
- LLM reads a paper and produces a working prototype☆60Updated 7 months ago
- ☆68Updated last year
- Train your own SOTA deductive reasoning model☆107Updated 9 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents☆131Updated last year
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.☆189Updated 9 months ago
- ☆86Updated last year
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning?☆84Updated 8 months ago
- ☆105Updated 11 months ago
- Track the progress of LLM context utilisation☆55Updated 7 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?"☆66Updated 11 months ago
- Training an LLM to use a calculator with multi-turn reinforcement learning, achieving a **62% absolute increase in evaluation accuracy**.☆60Updated 7 months ago
- Google DeepMind's PromptBreeder for automated prompt engineering implemented in LangChain Expression Language.☆159Updated last year
- An automated tool for discovering insights from research paper corpora☆137Updated last year
- Evaluating LLMs with CommonGen-Lite☆93Updated last year
- ☆102Updated last year
- Code and data for the paper "Why think step by step? Reasoning emerges from the locality of experience"☆62Updated 8 months ago
- Doing simple retrieval from LLMs at various context lengths to measure accuracy☆106Updated 2 months ago
- 🔧 Compare how Agent systems perform on several benchmarks. 📊🚀☆102Updated 4 months ago
- Lean implementation of various multi-agent LLM methods, including Iteration of Thought (IoT)☆123Updated 9 months ago
- ☆55Updated last year
- ☆126Updated 6 months ago