domenicrosati/representation-noising

Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning

☆24

Alternatives and similar repositories for representation-noising

Users who are interested in representation-noising are comparing it to the libraries listed below.

princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13 · Jun 20, 2025 · Updated last year
git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆51 · Jan 15, 2026 · Updated 6 months ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆68 · Jun 9, 2025 · Updated last year
git-disl / Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆29 · Sep 10, 2024 · Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91 · Mar 30, 2025 · Updated last year
Breakend / SelfDestructingModels
☆14 · Aug 9, 2023 · Updated 2 years ago
hsajjad / ConceptX
Analyzing Latent Concept in Pre-trained Transformer Models
☆12 · Jul 18, 2022 · Updated 4 years ago
ybwang119 / label_recovery
[ICLR 2024] Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks
☆14 · Feb 6, 2024 · Updated 2 years ago
fangjf1 / OpenSafeMLRM
The first toolkit for MLRM safety evaluation, providing unified interface for mainstream models, datasets, and jailbreaking methods!
☆15 · Apr 8, 2025 · Updated last year
hammlab / PoisoningCertifiedDefenses
How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)
☆14 · Jul 16, 2021 · Updated 5 years ago
TomSheng21 / R-TPT
CVPR 2025 - R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
☆22 · Aug 28, 2025 · Updated 11 months ago
boyiwei / CoTaEval
[NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models
☆17 · Jul 17, 2024 · Updated 2 years ago
render-examples / rust-graphql
Rust GraphQL Server Example with Juniper and Rocket
☆11 · Apr 10, 2026 · Updated 3 months ago
git-disl / Booster
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…
☆41 · Mar 22, 2025 · Updated last year
SchwinnL / circuit-breakers-eval
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18 · Apr 15, 2025 · Updated last year
ethz-spylab / jailbreak-tax
☆24 · Feb 17, 2026 · Updated 5 months ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆24 · Dec 8, 2024 · Updated last year
ChnQ / TracingLLM
☆30 · May 22, 2024 · Updated 2 years ago
kevinyaobytedance / llm_unlearn
LLM Unlearning
☆185 · Oct 20, 2023 · Updated 2 years ago
EnnengYang / RepresentationSurgery
Representation Surgery for Multi-Task Model Merging. ICML, 2024.
☆49 · Oct 10, 2024 · Updated last year
thomsn / easy_gene
Easy genetic algorithm
☆14 · Mar 5, 2018 · Updated 8 years ago
zhliu0106 / learning-to-refuse
Official Implementation of "Learning to Refuse: Towards Mitigating Privacy Risks in LLMs"
☆10 · Dec 13, 2024 · Updated last year
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83 · Mar 1, 2025 · Updated last year
ruyimarone / data-portraits
Documenting large text datasets 🖼️ 📚
☆14 · Dec 17, 2024 · Updated last year
UCSC-VLAA / STAR-1
[AAAI'26 Oral] Official Implementation of STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
☆38 · Apr 7, 2025 · Updated last year
manifoldco / kubernetes-credentials
Kubernetes CRD to load Manifold Credentials as Secrets
☆21 · Dec 20, 2019 · Updated 6 years ago
aengusl / latent-adversarial-training
☆48 · Sep 29, 2024 · Updated last year
princeton-nlp / benign-data-breaks-safety
☆47 · Oct 1, 2024 · Updated last year
pplonski / nlp-apps-mercury
☆14 · Feb 22, 2022 · Updated 4 years ago
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆71 · Feb 22, 2024 · Updated 2 years ago
chuhac / Reasoning-to-Defend
[EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
☆12 · Aug 22, 2025 · Updated 11 months ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 · Jul 15, 2024 · Updated 2 years ago
lxuechen / ml-swissknife
An ML research codebase built with friends :)
☆25 · Aug 25, 2024 · Updated last year
ngocbh / trimkv
[TrimKV] Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs - [DBTrimKV] Make Each Token Count: Towards Improving Lo…
☆15 · Updated this week
tanganke / subspace_fusion
Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"
☆14 · Mar 28, 2024 · Updated 2 years ago
render-examples / create-react-app
create-react-app deployed on Render
☆22 · Jul 12, 2024 · Updated 2 years ago
IBM / NeuralFuse
[NeurIPS'24] "NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes" by Hao-Lun …
☆10 · Sep 18, 2025 · Updated 10 months ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
manifoldco / manifold-laravel
Manifold configuration module for PHP framework Laravel
☆22 · Dec 12, 2018 · Updated 7 years ago