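This paper proposes a safety evaluation benchmark for Chinese LLMs based on "文心一言" (ERNIE Bot), covering 8 typical safety scenarios and 6 types of instruction attack. It also proposes a framework and process for safety evaluation: the test prompts are either written by hand or collected from open-source data, and human intervention is combined with the strong evaluation ability of an LLM acting as a "co-evaluator".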
☆33Sep 1, 2023Updated 2 years ago
Alternatives and similar repositories for Safety-Evaluating
Users that are interested in Safety-Evaluating are comparing it to the libraries listed below
- LLM evaluation.☆16Nov 7, 2023Updated 2 years ago
- Data for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder"☆20Oct 26, 2023Updated 2 years ago
- SC-Safety: a multi-round adversarial safety benchmark for Chinese large language models☆150Mar 15, 2024Updated last year
- Official github repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024]☆273Jul 28, 2025Updated 7 months ago
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization☆29Jul 9, 2024Updated last year
- Chinese safety prompts for evaluating and improving the safety of LLMs.☆1,132Feb 27, 2024Updated 2 years ago
- On the Robustness of GUI Grounding Models Against Image Attacks☆12Apr 8, 2025Updated 11 months ago
- BrainWash: A Poisoning Attack to Forget in Continual Learning☆12Apr 15, 2024Updated last year
- AIGC report series 2022-2023☆11Feb 25, 2024Updated 2 years ago
- [ICLR 2022] Boosting Randomized Smoothing with Variance Reduced Classifiers☆12Mar 29, 2022Updated 3 years ago
- Implementation (in progress) of Dieng et al.'s TopicRNN intended to be used as a baseline and starting point.☆10Jun 26, 2018Updated 7 years ago
- Code for AISTATS'25 paper - On the Power of Adaptive Weighted Aggregation in Heterogeneous Federated Learning and Beyond☆13Sep 23, 2025Updated 5 months ago
- Implementation of our paper published in Springer's Signal, Image and Video Processing☆12Dec 5, 2020Updated 5 years ago
- Repository for the resources of the CoNLL 2020 paper "What Are You Trying to Do? Semantic Typing of Event Processes"☆11Jan 5, 2021Updated 5 years ago
- The implementation of our IEEE S&P 2024 paper "Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples".☆11Jun 28, 2024Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"☆107May 20, 2025Updated 9 months ago
- PRSA: Prompt Stealing Attacks against Real-World Prompt Services (USENIX Security '25)☆24Dec 25, 2025Updated 2 months ago
- Official implementation of SIGIR 2022 Paper "Task-Oriented Dialogue System as Natural Language Generation".☆14Apr 6, 2022Updated 3 years ago
- [Preprint] Backdoor Attacks on Federated Learning with Lottery Ticket Hypothesis☆10Sep 23, 2021Updated 4 years ago
- 2020-natural-language-processing-project☆10Dec 18, 2020Updated 5 years ago
- A research workbench for developing and testing attacks against large language models, with a focus on prompt injection vulnerabilities a…☆39Mar 2, 2026Updated last week
- The code and datasets of our ACM MM 2024 paper "Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed …☆11Sep 27, 2024Updated last year
- Chinese text classification using a CNN + highway network architecture☆14Sep 25, 2017Updated 8 years ago
- [NeurIPS 2024] "Membership Inference on Text-to-image Diffusion Models via Conditional Likelihood Discrepancy"☆12Sep 15, 2025Updated 5 months ago
- Machine Learning Data Fairness and Bias☆13Mar 1, 2026Updated last week
- Multi-class AdaBoost algorithm (SAMME)☆10Nov 8, 2019Updated 6 years ago
- TensorFlow implementation of "Generating Sentences from a Continuous Space"☆11Sep 16, 2019Updated 6 years ago
- Face recognition with softmax, SphereFace, CosFace, and ArcFace losses, in PyTorch (Python 3)☆10Apr 27, 2020Updated 5 years ago
- Unofficial implementation of EMNLP-IJCNLP19 paper "Event Detection with Multi-Order Graph Convolution and Aggregated Attention"☆14Jan 11, 2021Updated 5 years ago
- Code for the CVPR '23 paper, "Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning"☆10Jun 9, 2023Updated 2 years ago
- Code for Spectral Norm of Convolutional Layers with Circular and Zero Paddings and Efficient Bound of Lipschitz Constant for Convolutiona…☆15Feb 2, 2024Updated 2 years ago