bytedance / BytevalKit-LLM
☆29Updated 7 months ago
Alternatives and similar repositories for BytevalKit-LLM
Users that are interested in BytevalKit-LLM are comparing it to the libraries listed below
- WritingBench: A Comprehensive Benchmark for Generative Writing☆156Updated last month
- ☆23Updated last year
- Code LLM pretraining, fine-tuning, and DPO data processing: an industry SOTA processing pipeline☆50Updated last year
- InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (ICML 2024)☆179Updated 8 months ago
- Flames is a highly adversarial benchmark in Chinese for LLM's harmlessness evaluation developed by Shanghai AI Lab and Fudan NLP Group.☆63Updated last year
- ☆165Updated 3 months ago
- Generative Judge for Evaluating Alignment☆250Updated 2 years ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues☆140Updated last year
- Official github repo for AutoDetect, an automated weakness detection framework for LLMs.☆46Updated last year
- [COLM'24] Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration☆32Updated last year
- The official repo for our paper: LegalAgentBench: Evaluating LLM Agents in Legal Domain☆40Updated last year
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.☆255Updated last year
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!☆161Updated last week
- ☆182Updated 9 months ago
- Industrial-level evaluation benchmarks for Coding LLMs in the full life-cycle of AI native software developing. An enterprise-grade evaluation system for code LLMs, continuously being opened up.☆105Updated 9 months ago
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.☆213Updated 9 months ago
- ☆76Updated last year
- Open Source Implementation of Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evo…☆98Updated 6 months ago
- Multilingual safety benchmark for Large Language Models☆53Updated last year
- [ACL'24] WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations☆13Updated last year
- The demo, code and data of FollowRAG☆75Updated 7 months ago
- ☆432Updated 3 months ago
- [ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis☆178Updated 4 months ago
- ☆322Updated last year
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models☆58Updated 6 months ago
- A Comprehensive Benchmark for Software Development.☆127Updated last year
- GitHub page for "Large Language Model-Brained GUI Agents: A Survey"☆217Updated 7 months ago
- AN O1 REPLICATION FOR CODING☆334Updated last year
- Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework☆205Updated 3 weeks ago
- CodeRAG-Bench: Can Retrieval Augment Code Generation?☆167Updated last year