lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆151 · Updated last week
Alternatives and similar repositories for confabulations
Users interested in confabulations are comparing it to the libraries listed below.
- Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLM… ☆49 · Updated last week
- ☆288 · Updated last month
- ☆117 · Updated 2 weeks ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆140 · Updated 2 months ago
- Benchmark that evaluates LLMs using 651 NYT Connections puzzles extended with extra trick words ☆85 · Updated last week
- Guaranteed Structured Output from any Language Model via Hierarchical State Machines ☆128 · Updated 2 weeks ago
- AI management tool ☆115 · Updated 6 months ago
- Generate Synthetic Data Using OpenAI, MistralAI or AnthropicAI ☆223 · Updated last year
- Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching" ☆70 · Updated last week
- Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a sm… ☆51 · Updated last week
- This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti… ☆207 · Updated last week
- Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud… ☆26 · Updated last month
- klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs ☆71 · Updated 7 months ago
- ☆114 · Updated 4 months ago
- ☆157 · Updated 10 months ago
- LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth… ☆32 · Updated last month
- Train your own SOTA deductive reasoning model ☆92 · Updated 2 months ago
- A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private co… ☆266 · Updated this week
- Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a… ☆36 · Updated last month
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining. ☆30 · Updated last month
- Low-Rank adapter extraction for fine-tuned transformers models ☆173 · Updated last year
- II-Researcher: a new open-source framework designed to aid building search / research agents ☆248 · Updated last week
- Distributed inference for MLX LLMs ☆91 · Updated 9 months ago
- llm-consortium orchestrates multiple LLMs, iteratively refines & achieves consensus. ☆249 · Updated 2 weeks ago
- Easily view and modify JSON datasets for large language models ☆75 · Updated 2 months ago
- smolLM with Entropix sampler on PyTorch ☆151 · Updated 6 months ago
- ☆130 · Updated 3 weeks ago
- Fast parallel LLM inference for MLX ☆187 · Updated 10 months ago
- ☆202 · Updated 3 weeks ago
- Client Code Examples, Use Cases and Benchmarks for Enterprise h2oGPTe RAG-Based GenAI Platform ☆87 · Updated 3 weeks ago