yellowtownhz/sycophancy-interpretability

☆15

Alternatives and similar repositories for sycophancy-interpretability

Users who are interested in sycophancy-interpretability are comparing it to the repositories listed below.

DYR1 / MoGU
Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆18 · Jan 14, 2025 · Updated last year
aisa-group / decomposing-eval-awareness
Decomposing and measuring evaluation awareness in existing benchmarks and our proposed EvalAwareBench.
☆19 · Jun 1, 2026 · Updated last month
vfleaking / PTST
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22 · Sep 21, 2025 · Updated 10 months ago
wangrongding / folder-print
🌿 Quickly generate a folder's directory structure; supports setting the directory depth and exporting to a markdown file.
☆13 · Oct 19, 2022 · Updated 3 years ago
centerforaisafety / mask
Code for evaluating AI systems on the MASK honesty benchmark.
☆24 · Mar 6, 2025 · Updated last year
tim-lawson / mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
☆30 · Feb 6, 2026 · Updated 5 months ago
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆130 · Oct 25, 2023 · Updated 2 years ago
zzeng13 / DISC
Automatic Idiomatic Expression Detection
☆13 · Sep 26, 2021 · Updated 4 years ago
aladinD / SafeMERGE
Code for SafeMERGE (ICLR 2025).
☆15 · Apr 1, 2025 · Updated last year
alexzhang13 / world-models-papers
Selected list of papers on World Models that I found interesting and/or useful.
☆40 · Feb 8, 2025 · Updated last year
claws-lab / MisinfoCorrect
Code and Data for WWW'23 paper Reinforcement Learning-based Counter-Misinformation Response Generation: A Case Study of COVID-19 Vaccine …
☆27 · Jun 28, 2023 · Updated 3 years ago
XuZhao0 / Model-Selection-Reasoning
Model Selection with Large Language Models for Reasoning (EMNLP2023 Findings)
☆30 · Dec 23, 2023 · Updated 2 years ago
thu-coai / LongSafety
[ACL 2025] LongSafety: Evaluating Long-Context Safety of Large Language Models
☆16 · Jun 18, 2025 · Updated last year
Zhang-Yihao / Adversarial-Representation-Engineering
Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.
☆20 · Dec 6, 2024 · Updated last year
Liu-Hy / nlp-contrib-graph
Official implementation of the winning system at SemEval-2021 Task 11 - NLP Contribution Graph (Best System Paper Award 🏆)
☆11 · Aug 24, 2025 · Updated 11 months ago
LLLeoLi / LARF
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆16 · Jul 22, 2025 · Updated last year
microsoft / MSMARCO-Passage-Ranking-Submissions
Submission archive for the MS MARCO passage ranking leaderboard
☆13 · Apr 21, 2023 · Updated 3 years ago
engelen / vonmiseskde
Python Von Mises Kernel Density Estimator implementation
☆11 · Jun 15, 2017 · Updated 9 years ago
H-TayyarMadabushi / AStitchInLanguageModels
Data and Baselines for AStitchInLanguageModels dataset
☆13 · Oct 31, 2022 · Updated 3 years ago
ellenmellon / DIALKI
DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization
☆10 · Aug 3, 2022 · Updated 3 years ago
XinnuoXu / AugNLG
☆14 · May 26, 2021 · Updated 5 years ago
martiansideofthemoon / longeval-summarization
Official repository for our EACL 2023 paper "LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization" (https…
☆45 · Aug 10, 2024 · Updated last year
amazon-science / wikiwiki-dataset
☆11 · May 11, 2022 · Updated 4 years ago
jenni-ai / T2FW
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights
☆20 · Oct 9, 2022 · Updated 3 years ago
chexov / image2pipe
simple "image2pipe" ffmpeg wrapper for python
☆18 · Aug 30, 2024 · Updated last year
irenasaracay / model-equality-testing
Test equality between a black-box LLM API and a reference distribution
☆20 · Oct 29, 2024 · Updated last year
wbopan / safety-residual-space
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆23 · Jun 12, 2026 · Updated last month
yhzhu99 / tutorials
tutorials
☆22 · Aug 12, 2022 · Updated 3 years ago
chicosirius / think-or-not
☆22 · May 23, 2025 · Updated last year
taki0112 / denoising-diffusion-gan-Tensorflow
Tensorflow implementation of "Tackling the Generative Learning Trilemma with Denoising Diffusion GANs" (ICLR 2022 Spotlight)
☆21 · Aug 3, 2022 · Updated 3 years ago
H-TayyarMadabushi / SemEval_2022_Task2-idiomaticity
Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
☆16 · Feb 3, 2022 · Updated 4 years ago
john-hewitt / conditional-probing
Codebase for running (conditional) probing experiments
☆21 · Nov 13, 2022 · Updated 3 years ago
zhu-minjun / SafetyLock
Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!
☆11 · Oct 16, 2024 · Updated last year
ckkissane / sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
☆13 · Jul 18, 2024 · Updated 2 years ago
git-disl / Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆29 · Sep 10, 2024 · Updated last year
Zhoues / Technical-Learning-Notes
This repository contains all the notes I took in the learning process of all the technologies during my study! This repository records the notes I took while learning various technologies during my undergraduate studies…
☆21 · Mar 14, 2023 · Updated 3 years ago
d223302 / TRACT
☆24 · Mar 21, 2025 · Updated last year
CALLMELARE / asoul-ui
☆27 · Feb 7, 2023 · Updated 3 years ago
theeluwin / kata
Let's study.
☆20 · Mar 30, 2026 · Updated 4 months ago