☆26Sep 5, 2024Updated last year
Alternatives and similar repositories for sycophancy-to-subterfuge-paper
Users who are interested in sycophancy-to-subterfuge-paper are comparing it to the repositories listed below
- Hypercorn is an ASGI and WSGI Server based on Hyper libraries and inspired by Gunicorn.☆14Jan 12, 2026Updated last month
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆134Mar 9, 2024Updated 2 years ago
- ☆54Feb 13, 2024Updated 2 years ago
- Display and customize Markdown text in SwiftUI☆40Jan 28, 2025Updated last year
- ☆92Mar 4, 2024Updated 2 years ago
- ☆13Jul 5, 2024Updated last year
- ☆12Oct 23, 2022Updated 3 years ago
- Sparse Autoencoder Training Library☆55May 1, 2025Updated 10 months ago
- Notebooks accompanying Anthropic's "Toy Models of Superposition" paper☆137Sep 14, 2022Updated 3 years ago
- Situational Awareness Dataset☆46Dec 14, 2024Updated last year
- A TinyStories LM with SAEs and transcoders☆14Apr 3, 2025Updated 11 months ago
- Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"☆19Oct 11, 2024Updated last year
- ☆52Oct 23, 2023Updated 2 years ago
- ☆36Apr 30, 2024Updated last year
- ☆87Oct 8, 2025Updated 5 months ago
- Implementation of Influence Function approximations for differently sized ML models, using PyTorch☆16Sep 15, 2023Updated 2 years ago
- CycleQD is a framework for parameter space model merging.☆48Feb 1, 2025Updated last year
- Lightweight demo using the Anthropic Python SDK to experiment with Claude's Search and Retrieval capabilities over a variety of knowledge…☆180Jun 30, 2024Updated last year
- This library supports evaluating disparities in generated image quality, diversity, and consistency between geographic regions.☆20Jun 3, 2024Updated last year
- MCP Market☆25Apr 1, 2025Updated 11 months ago
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…☆23Dec 3, 2024Updated last year
- ☆26Feb 18, 2026Updated 2 weeks ago
- ☆30Jul 17, 2023Updated 2 years ago
- ☆349Nov 4, 2024Updated last year
- Improving Steering Vectors by Targeting Sparse Autoencoder Features☆27Nov 20, 2024Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing☆64Oct 27, 2024Updated last year
- Service for quickly aliasing and redirecting to long URLs☆25Apr 26, 2023Updated 2 years ago
- Get up and running with the Gemini API using a simple Journaling App + Angular☆51Feb 13, 2026Updated 3 weeks ago
- ☆46Jan 17, 2026Updated last month
- Attribution-based Parameter Decomposition☆34Jun 11, 2025Updated 8 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆248Feb 27, 2026Updated last week
- ☆134Oct 28, 2023Updated 2 years ago
- Code and data for paper "Context-faithful Prompting for Large Language Models".☆42Mar 23, 2023Updated 2 years ago
- [ICRA 2026] StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes☆20Feb 17, 2026Updated 3 weeks ago
- my personal mcp server☆13Apr 23, 2025Updated 10 months ago
- MiniMax-Provider-Verifier offers a rigorous, vendor-agnostic way to verify whether third-party deployments of the Minimax M2 model are co…☆29Feb 18, 2026Updated 2 weeks ago
- Project exploring 3D volumetric rendering of NEXRAD radar data.☆11Oct 23, 2023Updated 2 years ago
- OLMost every training recipe you need to perform data interventions with the OLMo family of models.☆66Updated this week
- Community maintained hardware plugin for vLLM on AWS Neuron☆24Feb 26, 2026Updated last week