zjunlp/steer-target-atoms

[ACL 2025] Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

☆41

Alternatives and similar repositories for steer-target-atoms

Users who are interested in steer-target-atoms are comparing it to the libraries listed below.

Wangyuhao06 / IKEA
Implementation of Implicit Knowledge Extraction Attack.
☆24 · Jul 14, 2026 · Updated last week
UCSC-REAL / FLAT
[ICLR 2025] FLAT: LLM Unlearning via Loss Adjustment with Only Forget Data
☆14 · Feb 26, 2025 · Updated last year
AIRI-Institute / SAE-Reasoning
☆99 · Mar 28, 2025 · Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆241 · May 23, 2024 · Updated 2 years ago
LLLeoLi / LARF
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15 · Jul 22, 2025 · Updated last year
CaoYuanpu / BiPO
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
☆50 · Jul 28, 2024 · Updated last year
fuyahuii / ConSK-GCN
The PyTorch code for paper: "CONSK-GCN: Conversational Semantic- and Knowledge-Oriented Graph Convolutional Network for Multimodal Emotio…
☆13 · Oct 21, 2022 · Updated 3 years ago
Model-GLUE / Model-GLUE
☆18 · Aug 19, 2024 · Updated last year
VovyH / MultiAgent-Search
[2025 Shanghai AI Laboratory InternLM Training Camp: Top Ten and Outstanding Project]
☆43 · Sep 22, 2025 · Updated 10 months ago
wbopan / safety-residual-space
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆23 · Jun 12, 2026 · Updated last month
ydyjya / SafetyHeadAttribution
☆70 · Jun 1, 2025 · Updated last year
shaoshuo-ss / LeaFBench
Official code for our paper "SoK: Large Language Model Copyright Auditing via Fingerprinting"
☆18 · Dec 31, 2025 · Updated 6 months ago
Doby-Xu / ST
Official Code for CVPR 2024 paper: Permutation Equivariance of Transformers and Its Applications.
☆17 · Nov 12, 2024 · Updated last year
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆29 · Nov 20, 2024 · Updated last year
joeljang / RLPHF
Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging
☆120 · Oct 23, 2023 · Updated 2 years ago
khhung-906 / Attention-Tracker
Code for our NAACL2025 accepted paper: Attention Tracker: Detecting Prompt Injection Attacks in LLMs
☆28 · Sep 19, 2025 · Updated 10 months ago
aaronmueller / MIB
Landing page for MIB: A Mechanistic Interpretability Benchmark
☆26 · Aug 15, 2025 · Updated 11 months ago
HanjiangHu / NBF-LLM
The official code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks".
☆18 · Jun 24, 2026 · Updated last month
DPamK / BadAgent
☆33 · Feb 27, 2025 · Updated last year
Astarojth / AgentAuditor-ASSEBench
☆40 · May 29, 2026 · Updated last month
albert-y1n / PISmith
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
☆22 · Jul 17, 2026 · Updated last week
FatemehShiri / Spatial-MM
☆12 · Jan 10, 2025 · Updated last year
CharlesJW222 / FALCON
FALCON: Fine-grained Activation Manipulation for LLM Unlearning (NeurIPS 2025)
☆26 · Jul 7, 2026 · Updated 2 weeks ago
HomuraT / CIDF
Causal Inference-based Debiasing Framework for Knowledge Graph Completion
☆13 · Mar 19, 2024 · Updated 2 years ago
mochuishle / Thesis-Review-Skill
Review your thesis from the perspective of a reviewer. Distilling Years of Experience from a University Instructor.
☆15 · May 3, 2026 · Updated 2 months ago
MikaStars39 / StableMask
PyTorch implementation of StableMask (ICML'24)
☆15 · Jun 27, 2024 · Updated 2 years ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆24 · Dec 8, 2024 · Updated last year
qcznlp / uncertainty_attack
☆23 · Sep 2, 2025 · Updated 10 months ago
cooperleong00 / Awesome-LLM-Interpretability
A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc.
☆308 · Jan 22, 2026 · Updated 6 months ago
Trustworthy-ML-Lab / CB-LLMs
[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliabilit…
☆33 · Feb 5, 2026 · Updated 5 months ago
jayneelparekh / learn-to-steer
[NeurIPS 2025] Official Implementation for Learning to Steer: Input-dependent Steering for Multimodal LLMs
☆19 · Dec 14, 2025 · Updated 7 months ago
princeton-nlp / ELIZA-Transformer
[NAACL 2025] Representing Rule-based Chatbots with Transformers
☆23 · Feb 9, 2025 · Updated last year
ZHITENGLI / AdaSVD
PyTorch code for our paper "AdaSVD: Adaptive Singular Value Decomposition for Large Language Models"
☆15 · Mar 9, 2025 · Updated last year
ant-research / M2-Miner
[ICLR 2026] M2-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
☆55 · Apr 22, 2026 · Updated 3 months ago
TurkuNLP / bert-eval
☆10 · Oct 15, 2019 · Updated 6 years ago
mlwu22 / RED
Implementation code for ACL 2024: Advancing Parameter Efficiency in Fine-tuning via Representation Editing
☆15 · Apr 20, 2024 · Updated 2 years ago
Yunhao-Feng / BackdoorAgent
BackdoorAgent is a stage-aware framework and benchmark that instruments LLM-agent workflows (planning, memory, tools) to systematically i…
☆43 · Mar 16, 2026 · Updated 4 months ago
Joluck / MiSS
MiSS is a novel PEFT method that features a low-rank structure but introduces a new update mechanism distinct from LoRA, achieving an exc…
☆35 · Mar 9, 2026 · Updated 4 months ago
hwanchang00 / ChatInject
[ICLR 2026] Official implementation of "ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents"
☆17 · Mar 23, 2026 · Updated 4 months ago