GoodAI / goodai-ltm-benchmark
A library for benchmarking the long-term memory and continual learning capabilities of LLM-based agents, with all the tests and code you need to evaluate your own agents. See more in the blogpost.
☆70 · Updated 5 months ago
Alternatives and similar repositories for goodai-ltm-benchmark
Users who are interested in goodai-ltm-benchmark are comparing it to the libraries listed below.
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆80 · Updated 7 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆90 · Updated 3 months ago
- Just a bunch of benchmark logs for different LLMs ☆119 · Updated 9 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆104 · Updated 5 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆110 · Updated 8 months ago
- ☆114 · Updated 2 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 7 months ago
- Track the progress of LLM context utilisation ☆54 · Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆173 · Updated 2 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents ☆123 · Updated 11 months ago
- ☆48 · Updated last year
- A strongly typed Python DSL for developing message passing multi agent systems ☆52 · Updated last year
- ☆81 · Updated 4 months ago
- LILO: Library Induction with Language Observations ☆86 · Updated 8 months ago
- ☆80 · Updated last month
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆77 · Updated 2 months ago
- Evaluating LLMs with CommonGen-Lite ☆90 · Updated last year
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆54 · Updated 5 months ago
- OpenCoconut implements a latent reasoning paradigm where we generate thoughts before decoding. ☆172 · Updated 4 months ago
- Training an LLM to use a calculator with multi-turn reinforcement learning, achieving a **62% absolute increase in evaluation accuracy**. ☆37 · Updated last week
- Steer LLM outputs towards a certain topic/subject and enhance response capabilities using activation engineering by adding steering vecto… ☆235 · Updated 3 months ago
- ☆46 · Updated this week
- ☆82 · Updated last year
- Official code for the paper "ADaPT: As-Needed Decomposition and Planning with Language Models" ☆78 · Updated last year
- ☆50 · Updated 5 months ago
- Harness used to benchmark aider against SWE Bench benchmarks ☆71 · Updated 10 months ago
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆65 · Updated last month
- ☆66 · Updated 11 months ago
- The first dense retrieval model that can be prompted like an LM ☆72 · Updated last week
- An example implementation of RLHF (or, more accurately, RLAIF) built on MLX and HuggingFace. ☆26 · Updated 10 months ago