qcri / LLMeBench

Benchmarking Large Language Models

☆98

Alternatives and similar repositories for LLMeBench

Users who are interested in LLMeBench are comparing it to the repositories listed below.

simran-khanuja / awesome-cultural-nlp
Resources for cultural NLP research
☆101 Updated 3 months ago
chaitanyamalaviya / ExpertQA
[Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers
☆131 Updated last year
amazon-science / mintaka
Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022)
☆114 Updated 2 years ago
viswavi / few-shot-clustering
☆78 Updated 10 months ago
microsoft / HaDes
Token-level Reference-free Hallucination Detection
☆96 Updated 2 years ago
microsoft / Multilingual-Evaluation-of-Generative-AI-MEGA
Code for Multilingual Eval of Generative AI paper published at EMNLP 2023
☆70 Updated last year
ParticleMedia / RAGTruth
GitHub repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
☆192 Updated 8 months ago
mbzuai-nlp / bactrian-x
A Multilingual Replicable Instruction-Following Model
☆94 Updated 2 years ago
McGill-NLP / instruct-qa
Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering"
☆86 Updated 11 months ago
salesforce / AuditNLG
AuditNLG: Auditing Generative AI Language Modeling for Trustworthiness
☆102 Updated 6 months ago
kasnerz / tabgenie
A multi-purpose toolkit for table-to-text generation: web interface, Python bindings, CLI commands.
☆55 Updated last year
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆223 Updated 8 months ago
google-research / true
Code and data accompanying the paper "TRUE: Re-evaluating Factual Consistency Evaluation".
☆81 Updated 3 weeks ago
google-research-datasets / Attributed-QA
We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in …
☆54 Updated 2 years ago
MaLA-LM / GlotEval
GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way
☆14 Updated 3 weeks ago
bigscience-workshop / lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
☆105 Updated 2 years ago
allenai / peS2o
Pretraining Efficiently on S2ORC!
☆165 Updated 9 months ago
faridlazuarda / cultural-llm-papers
A curated list of research papers and resources on Cultural LLM.
☆46 Updated 10 months ago
facebookresearch / ResponsibleNLP
Repository for research in the field of Responsible NLP at Meta.
☆202 Updated 2 months ago
primeqa / clapnq
☆41 Updated 6 months ago
zetaalphavector / InPars
Inquisitive Parrots for Search
☆194 Updated 2 months ago
allenai / catwalk
This project studies the performance and robustness of language models and task-adaptation methods.
☆150 Updated last year
naver / bergen
Benchmarking library for RAG
☆219 Updated 3 weeks ago
dreji18 / Fairness-in-AI
Detecting Bias and ensuring Fairness in AI solutions
☆98 Updated 2 years ago
guyfe / Tweetsumm
A dataset focused on summarization of dialogs, which represents the rich domain of Twitter customer care conversations
☆32 Updated last year
OSU-NLP-Group / AttrScore
Code, datasets, models for the paper "Automatic Evaluation of Attribution by Large Language Models"
☆56 Updated 2 years ago
SalesforceAIResearch / FaithEval
☆45 Updated last month
worldbank / GISTEmbed
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings
☆43 Updated last year
microsoft / llm-data-creation
Model, Code & Data for the EMNLP'23 paper "Making Large Language Models Better Data Creators"
☆135 Updated last year
asahi417 / lmppl
Calculate perplexity on a text with pre-trained language models. Supports MLM (e.g. DeBERTa), recurrent LM (e.g. GPT3), and encoder-decoder …
☆162 Updated last month