poloclub / llm-landscape

NeurIPS'24 - LLM Safety Landscape

☆31

Alternatives and similar repositories for llm-landscape

Users who are interested in llm-landscape are comparing it to the repositories listed below.

sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 Updated 10 months ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆38 Updated 3 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99 Updated last year
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆89 Updated last year
ejones313 / auditing-llms
☆59 Updated 2 years ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆62 Updated last year
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆65 Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆89 Updated 6 months ago
Vaidehi99 / InfoDeletionAttacks
☆47 Updated 9 months ago
milesaturpin / cot-unfaithfulness
☆51 Updated 2 years ago
licong-lin / negative-preference-optimization
☆68 Updated last year
thu-coai / SafeUnlearning
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆32 Updated last year
locuslab / acr-memorization
☆37 Updated 11 months ago
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆62 Updated last year
decoding-comp-trust / comp-trust
Codebase for decoding compressed trust.
☆25 Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆63 Updated 5 months ago
aengusl / latent-adversarial-training
☆46 Updated last year
sail-sg / Cheating-LLM-Benchmarks
[ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral)
☆84 Updated last year
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆37 Updated 5 months ago
franciscoliu / Awesome-GenAI-Unlearning
☆175 Updated last week
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆62 Updated last year
ethz-spylab / unlearning-vs-safety
☆25 Updated last year
princeton-nlp / benign-data-breaks-safety
☆41 Updated last year
sail-sg / Agent-Smith
[ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
☆116 Updated last year
bpwu1 / confidence-regulation-neurons
Confidence Regulation Neurons in Language Models (NeurIPS 2024)
☆14 Updated 9 months ago
zjysteven / mink-plus-plus
[ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs
☆50 Updated 5 months ago
Princeton-SysML / Jailbreak_LLM
☆188 Updated last year
weichen-yu / LM-Extraction
☆43 Updated 2 years ago
amazon-science / controlling-llm-memorization
☆38 Updated 2 years ago