domenicrosati / representation-noising
Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning
☆23Dec 12, 2024Updated last year
Alternatives and similar repositories for representation-noising
Users that are interested in representation-noising are comparing it to the libraries listed below
- This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)☆49Jan 15, 2026Updated last month
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs☆13Jun 20, 2025Updated 7 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆66Jun 9, 2025Updated 8 months ago
- The first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods!☆14Apr 8, 2025Updated 10 months ago
- ☆13Aug 9, 2023Updated 2 years ago
- [ICLR 2024] Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks☆14Feb 6, 2024Updated 2 years ago
- [NDSS'25] The official implementation of safety misalignment.☆17Jan 8, 2025Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications☆89Mar 30, 2025Updated 10 months ago
- Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting☆18Apr 15, 2025Updated 10 months ago
- CVPR 2025 - R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning☆21Aug 28, 2025Updated 5 months ago
- ☆24Aug 7, 2025Updated 6 months ago
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models☆17Jul 17, 2024Updated last year
- My personal web page☆11Oct 20, 2025Updated 3 months ago
- ☆24Dec 8, 2024Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)☆75Mar 1, 2025Updated 11 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives☆70Feb 22, 2024Updated last year
- Program and links to the material for the GloBIAS Training School 2025, Kobe, Japan.☆22Oct 27, 2025Updated 3 months ago
- MirMachine, a command line tool to detect microRNA homologs in genome sequences.☆13Dec 3, 2025Updated 2 months ago
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL☆59Sep 5, 2025Updated 5 months ago
- A chat we built live at https://twitch.tv/xabadu 📺🍅🔥☆10Mar 5, 2023Updated 2 years ago
- Documentation at☆14Mar 27, 2025Updated 10 months ago
- Collections for the tutorial "Electrónica digital para Makers con FPGAs Libres" (Digital Electronics for Makers with Free FPGAs)☆11Dec 4, 2018Updated 7 years ago
- Chaos Magick Sigils☆15Jan 30, 2026Updated 2 weeks ago
- Add AI to the Linux terminal☆10Apr 28, 2024Updated last year
- Official Implementation of "Learning to Refuse: Towards Mitigating Privacy Risks in LLMs"☆10Dec 13, 2024Updated last year
- Identification of the Adversary from a Single Adversarial Example (ICML 2023)☆10Jul 15, 2024Updated last year
- Code for experiments on self-prediction as a way to measure introspection in LLMs☆16Dec 10, 2024Updated last year
- ☆44Oct 1, 2024Updated last year
- ICCV 2021, We find most existing triggers of backdoor attacks in deep learning contain severe artifacts in the frequency domain. This Rep…☆48Apr 27, 2022Updated 3 years ago
- ☆47Sep 29, 2024Updated last year
- Useful abstraction golang library for building AI-powered reasoning apps☆11May 31, 2023Updated 2 years ago
- [AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…☆10Feb 7, 2026Updated last week
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆17Dec 14, 2025Updated 2 months ago
- ☆12Jan 22, 2023Updated 3 years ago
- ☆14Dec 1, 2025Updated 2 months ago
- Simple Calculator: I created a simple calculator to perform operations.☆13Jun 21, 2024Updated last year
- My notes on natural history, science, and technology.☆17Dec 21, 2025Updated last month
- ☆14Feb 26, 2025Updated 11 months ago
- Modular Matrix Exponentiation Cryptography☆10Nov 27, 2023Updated 2 years ago