lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆151 · Updated last week
Alternatives and similar repositories for confabulations
Users interested in confabulations are comparing it to the libraries listed below.
- Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLM… ☆49 · Updated last week
- ☆288 · Updated last month
- ☆117 · Updated 2 weeks ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆140 · Updated 2 months ago
- Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words ☆85 · Updated last week
- Guaranteed Structured Output from any Language Model via Hierarchical State Machines ☆128 · Updated 2 weeks ago
- AI management tool ☆115 · Updated 6 months ago
- Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI ☆223 · Updated last year
- Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" ☆70 · Updated last week
- Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a sm… ☆51 · Updated last week
- This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti… ☆207 · Updated last week
- Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud… ☆26 · Updated last month
- klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs ☆71 · Updated 7 months ago
- ☆114 · Updated 4 months ago
- ☆157 · Updated 10 months ago
- LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth… ☆32 · Updated last month
- Train your own SOTA deductive reasoning model ☆92 · Updated 2 months ago
- A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private co… ☆266 · Updated this week
- Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a… ☆36 · Updated last month
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining. ☆30 · Updated last month
- Low-Rank adapter extraction for fine-tuned transformers models ☆173 · Updated last year
- II-Researcher: a new open-source framework designed to aid building search / research agents ☆248 · Updated last week
- Distributed inference for MLX LLMs ☆91 · Updated 9 months ago
- llm-consortium orchestrates multiple LLMs, iteratively refines & achieves consensus. ☆249 · Updated 2 weeks ago
- Easily view and modify JSON datasets for large language models ☆75 · Updated 2 months ago
- smolLM with Entropix sampler on PyTorch ☆151 · Updated 6 months ago
- ☆130 · Updated 3 weeks ago
- Fast parallel LLM inference for MLX ☆187 · Updated 10 months ago
- ☆202 · Updated 3 weeks ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆87 · Updated 3 weeks ago