Flossiee / HonestyLLM
[NeurIPS 2024] HonestLLM: Toward an Honest and Helpful Large Language Model
☆18 · Updated last month
Related projects
Alternatives and complementary repositories for HonestyLLM
- [ACL 2024] SALAD benchmark & MD-Judge ☆103 · Updated last month
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆46 · Updated 3 months ago
- Official code for the paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" ☆58 · Updated last month
- Official implementation of "Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization" ☆109 · Updated 5 months ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆70 · Updated 2 months ago
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆60 · Updated 7 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆83 · Updated 5 months ago
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024) ☆51 · Updated 4 months ago
- A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc. ☆168 · Updated 3 weeks ago
- LLM Unlearning ☆123 · Updated last year
- A survey on harmful fine-tuning attacks against large language models ☆70 · Updated this week
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs; empirical tricks for LLM jailbreaking (NeurIPS 2024) ☆81 · Updated 3 weeks ago
- [arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker" ☆120 · Updated 8 months ago
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models (NeurIPS 2024) ☆58 · Updated last month
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆45 · Updated 7 months ago
- This repository provides an original implementation of "Detecting Pretraining Data from Large Language Models" by *Weijia Shi, *Anirudh Aji… ☆207 · Updated last year
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling ☆18 · Updated this week
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆60 · Updated last month
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆111 · Updated this week
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs). ☆111 · Updated last year
- A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey"… ☆71 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆239 · Updated 8 months ago