apple / ToolSandbox
☆127 Updated 3 months ago
Related projects
Alternatives and complementary repositories for ToolSandbox
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [EMNLP 2024] ☆103 Updated last month
- Code and Data for Tau-Bench ☆201 Updated 3 weeks ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆115 Updated last week
- ☆116 Updated 5 months ago
- Expert Specialized Fine-Tuning ☆145 Updated last month
- Codebase accompanying the Summary of a Haystack paper. ☆72 Updated 2 months ago
- Code repo for "Agent Instructs Large Language Models to be General Zero-Shot Reasoners" ☆80 Updated 2 months ago
- Benchmark baseline for retrieval qa applications ☆95 Updated 7 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆195 Updated 2 weeks ago
- An Analytical Evaluation Board of Multi-turn LLM Agents ☆250 Updated 6 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆124 Updated 3 weeks ago
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆178 Updated last month
- Evaluating tool-augmented LLMs in conversation settings ☆72 Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆145 Updated last week
- [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? ☆107 Updated 2 months ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ☆156 Updated 7 months ago
- Evaluating LLMs with fewer examples ☆134 Updated 7 months ago
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ☆106 Updated last month
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Length (ICLR 2024) ☆199 Updated 6 months ago
- ☆103 Updated 3 months ago
- AWM: Agent Workflow Memory ☆205 Updated last month
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆92 Updated 5 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M… ☆180 Updated 3 weeks ago
- NexusRaven-13B, a new SOTA Open-Source LLM for function calling. This repo contains everything for reproducing our evaluation on NexusRav… ☆308 Updated last year
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆110 Updated 3 weeks ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆113 Updated 5 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆146 Updated 3 weeks ago
- ☆217 Updated 3 months ago
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ☆144 Updated 8 months ago
- Official Implementation of "Multi-Head RAG: Solving Multi-Aspect Problems with LLMs" ☆175 Updated 2 weeks ago