huashen218 / bidirectional-alignment-reading-list

ICLR 2025 Workshop & CHI 2025 SIG: "Bidirectional Human-AI Alignment"

☆44

Alternatives and similar repositories for bidirectional-alignment-reading-list

Users who are interested in bidirectional-alignment-reading-list are comparing it to the repositories listed below.

neubig / research-career-tools
☆164 Updated 11 months ago
tonychenxyz / selfie
This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…
☆53 Updated 11 months ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆140 Updated 11 months ago
jianggy / MPI
This repo contains code for our NeurIPS 2023 spotlight paper: Evaluating and Inducing Personality in Pre-trained Language Models
☆55 Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆182 Updated 6 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆138 Updated 4 months ago
interp-reasoning / thought-anchors
⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.
☆89 Updated last week
sotopia-lab / sotopia
Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)
☆257 Updated last month
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆159 Updated 8 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆118 Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆83 Updated 8 months ago
lorenzkuhn / semantic_uncertainty
☆180 Updated last year
MiaoXiong2320 / llm-uncertainty
code repo for ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs"
☆135 Updated last year
BunsenFeng / modular_pluralism
Modular Pluralism @ EMNLP 2024
☆20 Updated last year
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆40 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆115 Updated last month
evandez / REMEDI
Inspecting and Editing Knowledge Representations in Language Models
☆119 Updated 2 years ago
activatedgeek / calibration-tuning
☆52 Updated 7 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
HannahKirk / prism-alignment
The Prism Alignment Project
☆84 Updated last year
KihoPark / linear_rep_geometry
☆108 Updated 8 months ago
giorgiopiatti / GovSim
Governance of the Commons Simulation (GovSim)
☆59 Updated 9 months ago
causalNLP / cladder
We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.
☆131 Updated last year
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆66 Updated 11 months ago
redwoodresearch / Easy-Transformer
☆128 Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated 2 years ago
ZFancy / awesome-activation-engineering
A curated list of resources for activation engineering
☆107 Updated last month
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆191 Updated last year
bowen-upenn / Agent_Rationality
[NAACL 2025] Towards Rationality in Language and Multimodal Agents: A Survey
☆34 Updated 8 months ago
davidbau / baukit
☆237 Updated last year