THUDM / ComplexFuncBench
Complex Function Calling Benchmark.
☆99 · Updated 3 months ago
Alternatives and similar repositories for ComplexFuncBench
Users interested in ComplexFuncBench are comparing it to the repositories listed below.
- Codebase accompanying the Summary of a Haystack paper. ☆77 · Updated 7 months ago
- Official repository for the paper "ReasonIR: Training Retrievers for Reasoning Tasks". ☆112 · Updated last week
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆77 · Updated this week
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆144 · Updated 3 weeks ago
- ☆72 · Updated 6 months ago
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 · Updated 8 months ago
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆90 · Updated 2 months ago
- Reformatted Alignment ☆115 · Updated 7 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆135 · Updated 6 months ago
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆56 · Updated last month
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆188 · Updated last week
- ☆117 · Updated 8 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆108 · Updated 8 months ago
- The first dense retrieval model that can be prompted like an LM ☆71 · Updated this week
- ☆114 · Updated 2 months ago
- minimal GRPO implementation from scratch ☆88 · Updated last month
- ☆100 · Updated last month
- ☆65 · Updated 2 months ago
- This is the official repository for Inheritune. ☆111 · Updated 3 months ago
- 🚢 Data Toolkit for Sailor Language Models ☆90 · Updated 2 months ago
- Repo for "Z1: Efficient Test-time Scaling with Code" ☆58 · Updated last month
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆199 · Updated this week
- Tina: Tiny Reasoning Models via LoRA ☆192 · Updated 2 weeks ago
- The HELMET Benchmark ☆142 · Updated 3 weeks ago
- ☆109 · Updated 3 months ago
- ☆120 · Updated 7 months ago
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; COLM 2024) ☆47 · Updated 3 months ago
- Code for "Democratizing Reasoning Ability: Tailored Learning from Large Language Model", EMNLP 2023 ☆34 · Updated last year
- ☆45 · Updated last month