LukeBailey181/obfuscated-activations

LukeBailey181 / obfuscated-activations

Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses

☆31

Alternatives and similar repositories for obfuscated-activations

Users interested in obfuscated-activations are comparing it to the libraries listed below.

TeunvdWeij / sandbagging
☆20 · Nov 15, 2024 · Updated last year
JoshEngels / SAE-Dark-Matter
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆23 · Feb 6, 2025 · Updated last year
SolidShen / RIPPLE_official
☆20 · Feb 11, 2024 · Updated 2 years ago
thestephencasper / latent_adversarial_training
☆24 · Jul 25, 2024 · Updated last year
ielab / vec2text-dense_retriever-threat
Is Vec2Text Really a Threat to Dense Retrieval Systems?
☆18 · Nov 29, 2024 · Updated last year
PierrickPochelu / JaxDecompiler
Jax Decompiler
☆16 · Apr 22, 2025 · Updated last year
j-towns / scanagram
Tidy autoregressive inference in JAX
☆15 · Sep 1, 2025 · Updated 10 months ago
FlyingPumba / InterpBench
A benchmark for mechanistic discovery of circuits in Transformers
☆17 · Dec 15, 2024 · Updated last year
Confirm-Solutions / flrt
Fluent student-teacher redteaming
☆23 · Jul 25, 2024 · Updated last year
fzwark / Secure_LLM_System
☆16 · Mar 9, 2025 · Updated last year
IBM / URET
Universal Robustness Evaluation Toolkit (for Evasion)
☆32 · Sep 17, 2025 · Updated 10 months ago
Boyeep / Operating-System-2nd-Semester
Operating Systems Semester 2 coursework covering Linux, shell scripting, process management, concurrency, and synchronization.
☆21 · Jun 11, 2026 · Updated last month
longtermrisk / openweights
A python sdk for LLM finetuning and inference on runpod infrastructure
☆30 · May 12, 2026 · Updated 2 months ago
Blkalkin / Optimal-TestTime
☆10 · Mar 24, 2025 · Updated last year
VITA-Group / Random-Shuffling-BackdoorDetect
[NeurIPS 2022] "Randomized Channel Shuffling: Minimal-Overhead Backdoor Attack Detection without Clean Datasets" by Ruisi Cai*, Zhenyu Zh…
☆21 · Oct 1, 2022 · Updated 3 years ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆41 · Feb 12, 2024 · Updated 2 years ago
CryptoAILab / misalignment
[NDSS'25] The official implementation of safety misalignment.
☆19 · Jan 8, 2025 · Updated last year
OscarXZQ / delta_activations
Official code release for Delta Activations: A Representation for Finetuned Large Language Models
☆20 · Sep 5, 2025 · Updated 10 months ago
FossMec / Code-a-pookalam
Code-a-pookalam competition at Govt. Model Engineering College
☆11 · Oct 30, 2019 · Updated 6 years ago
AlgebraLoveme / PIRA
☆23 · Jul 12, 2026 · Updated last week
ashmaster / ctrl-c
An app and a web client in order to send text from mobile to computer
☆11 · Jan 26, 2023 · Updated 3 years ago
ApolloResearch / sample
Repository with sample code using Apollo's suggested engineering practices
☆15 · Dec 16, 2024 · Updated last year
RobertCsordas / onion_representations
☆13 · Aug 19, 2024 · Updated last year
Gwinhen / DRUPE
Distribution Preserving Backdoor Attack in Self-supervised Learning
☆20 · Jan 27, 2024 · Updated 2 years ago
atharva-naik / VADEC
Codes and Datasets for our SIGIR 2021 Paper: "Understanding the Role of Affect Dimensions in Detecting Emotions from Tweets: A Multi-task…
☆12 · Apr 21, 2021 · Updated 5 years ago
EleutherAI / training-jacobian
☆24 · Dec 11, 2024 · Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
wbopan / safety-residual-space
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆22 · Jun 12, 2026 · Updated last month
changjonathanc / llmproc
LLMProc: Unix-inspired runtime that treats LLMs as processes.
☆33 · Jul 17, 2025 · Updated last year
KaiyuanZh / SOFT
[USENIX Security 2025] SOFT: Selective Data Obfuscation for Protecting LLM Fine-tuning against Membership Inference Attacks
☆23 · Sep 18, 2025 · Updated 10 months ago
safety-research / false-facts
☆50 · Jul 4, 2025 · Updated last year
NLie2 / what_features_jailbreak_LLMs
☆18 · Mar 30, 2025 · Updated last year
compsec-snu / pfi
PFI: Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents
☆31 · Mar 26, 2025 · Updated last year
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆35 · Jun 11, 2025 · Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆90 · Nov 27, 2024 · Updated last year
microsoft / TaskTracker
TaskTracker is an approach to detecting task drift in Large Language Models (LLMs) by analysing their internal activations. It provides a…
☆92 · Sep 1, 2025 · Updated 10 months ago
RU-System-Software-and-Security / FeatureRE
☆27 · Nov 9, 2022 · Updated 3 years ago
ApolloResearch / deception-detection
☆44 · Feb 11, 2025 · Updated last year
nickboucher / imperceptible
Bad Characters: Imperceptible NLP Attacks
☆36 · Apr 9, 2024 · Updated 2 years ago