wollschlager / geometry-of-refusal
Code for the paper "The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence"
☆23Jul 31, 2025Updated 6 months ago
Alternatives and similar repositories for geometry-of-refusal
Users who are interested in geometry-of-refusal are comparing it to the libraries listed below.
- This project proposes a method to defend against adversarial attacks. By combining the proposed preprocessing method with an adversariall…☆10Oct 4, 2018Updated 7 years ago
- A tiny easily hackable implementation of a feature dashboard.☆15Oct 21, 2025Updated 3 months ago
- Fingerprint large language models☆49Jul 11, 2024Updated last year
- ☆25Jun 10, 2025Updated 8 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆66Jun 9, 2025Updated 8 months ago
- Code for the AAAI 2023 paper "CodeAttack: Code-based Adversarial Attacks for Pre-Trained Programming Language Models"☆35Apr 18, 2023Updated 2 years ago
- Code for reproducing the paper "Neural Networks Fail to Learn Periodic Functions and How to Fix It" as part of the ML Reproducibility Cha…☆11Apr 16, 2021Updated 4 years ago
- Improving Alignment and Robustness with Circuit Breakers☆258Sep 24, 2024Updated last year
- Python package for Geometric / Clifford Algebra with Pytorch.☆14Jan 25, 2026Updated 3 weeks ago
- Code for experiments on self-prediction as a way to measure introspection in LLMs☆16Dec 10, 2024Updated last year
- ☆47Sep 29, 2024Updated last year
- TrustAgent: Towards Safe and Trustworthy LLM-based Agents☆56Feb 7, 2025Updated last year
- Official release of code for the paper "RL Is a Hammer and LLMs Are Nails: A Simple RL Approach to Stronger Prompt Injection Attacks"☆39Updated this week
- [NAACL'25] Evaluating LLMs for Causal Queries☆13Feb 18, 2025Updated 11 months ago
- ☆17May 3, 2025Updated 9 months ago
- This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector"☆47Oct 13, 2025Updated 4 months ago
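- A collection of quick-start guides for cryptocurrency newcomers, including a comprehensive set of crypto and blockchain resources, tool directories, a quick introduction to common crypto terms and jargon, and a detailed anti-scam guide to help you avoid risks☆19Feb 10, 2026Updated last week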
- Official Repo of Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents☆58Oct 28, 2025Updated 3 months ago
- Code for the CSF 2018 paper "Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting"☆39Jan 28, 2019Updated 7 years ago
- ☆14Jul 5, 2022Updated 3 years ago
- volumetric brush implementation, heavily inspired by toxiclibs☆14Dec 10, 2022Updated 3 years ago
- Code for Representation Bending Paper☆16Jul 15, 2025Updated 7 months ago
- ☆13Apr 17, 2025Updated 10 months ago
- Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"☆16Dec 4, 2024Updated last year
- Translate PDF to ePub by Gemini☆19Jun 18, 2025Updated 8 months ago
- This is the code for our paper: PLACES: Prompting Language Models for Social Conversation Synthesis☆11Feb 17, 2023Updated 3 years ago
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs☆13Jun 20, 2025Updated 7 months ago
- Complete educational guide with 100+ working code examples for building production-ready AI agents using the OpenAI Agents SDK. Learn t…☆15Jan 10, 2026Updated last month
- Project of ACL 2025 "UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models"☆15Mar 25, 2025Updated 10 months ago
- ☆13Jan 14, 2026Updated last month
- ☆11Sep 28, 2022Updated 3 years ago
- Codes for the ICLR 2022 paper: Trigger Hunting with a Topological Prior for Trojan Detection☆11Sep 19, 2023Updated 2 years ago
- For Competition on Adversarial Attacks and Defenses 2018☆40Jan 4, 2019Updated 7 years ago
- [EMNLP 2022] Distillation-Resistant Watermarking (DRW) for Model Protection in NLP☆13Aug 17, 2023Updated 2 years ago
- ☆13Aug 4, 2023Updated 2 years ago
- The official implementation of the W&B Models and Weave MCP server.☆27Feb 9, 2026Updated last week
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge☆14Feb 20, 2024Updated last year
- Concept Learning Dynamics☆16Oct 29, 2024Updated last year
- Jump ReLU☆11Apr 8, 2019Updated 6 years ago