thu-coai / ShieldLM
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Related projects:
- Official GitHub repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety.
- [ACL 2024] SALAD benchmark & MD-Judge
- Flames: a highly adversarial Chinese benchmark for evaluating LLM harmlessness, developed by Shanghai AI Lab and the Fudan NLP Group.
- [NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
- Official GitHub repo for AutoDetect, an automated weakness detection framework for LLMs.
- S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
- SC-Safety: a multi-turn adversarial safety benchmark for Chinese LLMs
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
- Fudan 白泽 (Baize) LLM safety benchmark suite (Summer 2024 edition)
- MarkLLM: An Open-Source Toolkit for LLM Watermarking.
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
- Accepted by IJCAI-24 Survey Track
- Generative Judge for Evaluating Alignment
- Code for the paper "RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation"
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR 2024]
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning".
- Collection of training data management explorations for large language models
- A reading list on LLM-based Synthetic Data Generation 🔥
- The official implementation of the ICLR 2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models".