night-chen / ToolQALinks

ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels (easy/hard) across eight real-life scenarios.

☆279

Alternatives and similar repositories for ToolQA

Users that are interested in ToolQA are comparing it to the libraries listed below

Sorting:

princeton-nlp / ALCE
[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627
☆497Updated last year
shizhediao / active-prompt
Source code for the paper "Active Prompting with Chain-of-Thought for Large Language Models"
☆245Updated last year
TIGER-AI-Lab / Program-of-Thoughts
Data and Code for Program of Thoughts [TMLR 2023]
☆287Updated last year
GAIR-NLP / auto-j
Generative Judge for Evaluating Alignment
☆247Updated last year
freshllms / freshqa
Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214)
☆375Updated last week
raunak-agarwal / instruction-datasets
Datasets for Instruction Tuning of Large Language Models
☆257Updated last year
AI21Labs / in-context-ralm
☆291Updated last year
Ber666 / ToolkenGPT
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral)
☆262Updated last year
kaistAI / CoT-Collection
[EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
☆246Updated last year
bhargaviparanjape / language-programmes
☆173Updated 2 years ago
sambanova / toolbench
ToolBench, an evaluation suite for LLM tool manipulation capabilities.
☆163Updated last year
facebookresearch / Shepherd
This is the repo for the paper Shepherd -- A Critic for Language Model Generation
☆217Updated 2 years ago
nelson-liu / lost-in-the-middle
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
☆360Updated last year
yueyu1030 / AttrPrompt
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
☆153Updated last year
anchen1011 / FireAct
FireAct: Toward Language Agent Fine-tuning
☆282Updated 2 years ago
ParticleMedia / RAGTruth
Github repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models"
☆204Updated 10 months ago
glgh / awesome-llm-human-preference-datasets
A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval.
☆380Updated 2 years ago
xlang-ai / Binder
[ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages"
☆322Updated 2 years ago
zorazrw / filco
[Preprint] Learning to Filter Context for Retrieval-Augmented Generaton
☆198Updated last year
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆517Updated last year
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆313Updated last year
salesforce / DialogStudio
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI
☆515Updated 8 months ago
Leezekun / Directional-Stimulus-Prompting
[NeurIPS 2023] Codebase for the paper: "Guiding Large Language Models with Directional Stimulus Prompting"
☆113Updated 2 years ago
thunlp / ChatEval
Codes for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate"
☆303Updated last year
veronica320 / Faithful-COT
Code and data accompanying our paper on arXiv "Faithful Chain-of-Thought Reasoning".
☆163Updated last year
StonyBrookNLP / ircot
Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL23
☆236Updated last year
DaoD / INTERS
This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning"
☆204Updated 10 months ago
nlpxucan / evol-instruct
☆274Updated 2 years ago
i-Eval / FairEval
☆140Updated 2 years ago
Spico197 / Humpback
🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.
☆139Updated 5 months ago