open-compass / CompassJudger
☆91Updated last month
Alternatives and similar repositories for CompassJudger:
Users who are interested in CompassJudger are comparing it to the libraries listed below.
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"☆230Updated last month
- AutoCoA (Automatic generation of Chain-of-Action) is an agent model framework that enhances the multi-turn tool usage capability of reaso…☆75Updated 2 weeks ago
- ☆264Updated 8 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"☆53Updated 11 months ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks"☆140Updated last week
- ☆262Updated 2 weeks ago
- [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step☆265Updated last year
- ☆101Updated 3 months ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement☆181Updated last year
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038)☆177Updated this week
- ☆125Updated 3 weeks ago
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning☆218Updated 2 months ago
- The official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining"☆148Updated 2 weeks ago
- Reformatted Alignment☆115Updated 6 months ago
- Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.☆148Updated last year
- ☆26Updated last month
- ☆84Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation☆67Updated last month
- Generative Judge for Evaluating Alignment☆232Updated last year
- A Comprehensive Survey on Long Context Language Modeling☆113Updated last week
- MPO: Boosting LLM Agents with Meta Plan Optimization☆43Updated 3 weeks ago
- [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs☆102Updated 3 months ago
- ☆44Updated 3 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning☆162Updated 2 weeks ago
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models☆39Updated 4 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA☆119Updated 4 months ago
- ☆115Updated 2 months ago
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning☆161Updated 2 weeks ago
- [NeurIPS2024] MATH-Vision dataset and code to measure multimodal mathematical reasoning capabilities.☆97Updated 3 weeks ago
- ☆138Updated 3 weeks ago