Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning
☆24Dec 12, 2024Updated last year
Alternatives and similar repositories for representation-noising
Users that are interested in representation-noising are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- [ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs☆13Jun 20, 2025Updated 11 months ago
- This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)☆50Jan 15, 2026Updated 5 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆67Jun 9, 2025Updated last year
- This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)☆28Sep 10, 2024Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications☆90Mar 30, 2025Updated last year
- 1-Click AI Models by DigitalOcean Gradient • AdDeploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
- The first toolkit for MLRM safety evaluation, providing unified interface for mainstream models, datasets, and jailbreaking methods!☆15Apr 8, 2025Updated last year
- [NDSS'25] The official implementation of safety misalignment.☆19Jan 8, 2025Updated last year
- [ICLR 2024] Towards Elminating Hard Label Constraints in Gradient Inverision Attacks☆14Feb 6, 2024Updated 2 years ago
- This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…☆39Mar 22, 2025Updated last year
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)☆14Jul 16, 2021Updated 4 years ago
- Source code of IPA, https://escholarship.org/uc/item/2p0805dq☆12Jun 27, 2024Updated last year
- ☆24Feb 17, 2026Updated 4 months ago
- The official code of Multi-player Nash Preference Optimization [ICLR 2026]☆35Feb 4, 2026Updated 4 months ago
- Unofficial implementation of the "3D Bounding Box Estimation Using Deep Learning and Geometry" paper in Tensorflow.☆13Aug 23, 2023Updated 2 years ago
- Open source password manager - Proton Pass • AdSecurely store, share, and autofill your credentials with Proton Pass, the end-to-end encrypted password manager trusted by millions.
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)☆83Mar 1, 2025Updated last year
- ☆24Dec 8, 2024Updated last year
- Official Implementation of "Learning to Refuse: Towards Mitigating Privacy Risks in LLMs"☆10Dec 13, 2024Updated last year
- LLM Unlearning☆185Oct 20, 2023Updated 2 years ago
- A pip library for calculating the Shapley Value for computing the marginal contribution of each client in a Federated Learning environmen…☆29Dec 10, 2023Updated 2 years ago
- Documenting large text datasets 🖼️ 📚☆14Dec 17, 2024Updated last year
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL☆66Sep 5, 2025Updated 9 months ago
- ☆15Feb 26, 2025Updated last year
- Official Implementation of "Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts" at EMNLP 202…☆13Oct 27, 2024Updated last year
- Wordpress hosting with auto-scaling - Free Trial Offer • AdFully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
- [NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search☆17Jan 24, 2026Updated 4 months ago
- [ICLR 2025 SSI-FM] Self-Taught Self-Correction for Small Language Models☆11Sep 19, 2025Updated 8 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives☆71Feb 22, 2024Updated 2 years ago
- [EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking☆12Aug 22, 2025Updated 9 months ago
- Project of ACL 2025 "UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models"☆14Mar 25, 2025Updated last year
- Identification of the Adversary from a Single Adversarial Example (ICML 2023)☆10Jul 15, 2024Updated last year
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆18Dec 14, 2025Updated 6 months ago
- ☆21May 14, 2025Updated last year
- Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"☆14Mar 28, 2024Updated 2 years ago
- GPU virtual machines on DigitalOcean Gradient AI • AdGet to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
- [AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…☆11Feb 7, 2026Updated 4 months ago
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…☆13Jan 26, 2025Updated last year
- A package that extracts Persian time and date markers by applying regexes -- AACL 2022☆28Nov 29, 2022Updated 3 years ago
- Stochastic Multiple Target Sampling Gradient Descent (NeurIPS 2022)☆13Sep 19, 2022Updated 3 years ago
- ICCV 2021, We find most existing triggers of backdoor attacks in deep learning contain severe artifacts in the frequency domain. This Rep…☆48Apr 27, 2022Updated 4 years ago
- code and resources for our paper "Achieving Joint Training Accuracy in Continual Learning" in AAAI2025☆14Feb 25, 2025Updated last year
- A survey on harmful fine-tuning attack for large language model (ACM CSUR)☆245May 19, 2026Updated last month