andyzoujm / representation-engineering

Representation Engineering: A Top-Down Approach to AI Transparency

☆886

Alternatives and similar repositories for representation-engineering

Users who are interested in representation-engineering are comparing it to the libraries listed below.

likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆549 Updated 8 months ago
openai / sparse_autoencoder
☆525 Updated last year
kmeng01 / rome
Locating and editing factual associations in GPT (NeurIPS 2022)
☆671 Updated last year
ContextualAI / HALOs
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
☆888 Updated last week
kmeng01 / memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
☆516 Updated last year
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆815 Updated last month
jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆985 Updated this week
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆631 Updated this week
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆639 Updated 4 months ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆530 Updated 2 months ago
RUCAIBox / HaluEval
This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.
☆514 Updated last year
EdinburghNLP / awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
☆966 Updated 3 months ago
voidism / DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
☆518 Updated 8 months ago
davidbau / baukit
☆231 Updated last year
SinclairCoder / Instruction-Tuning-Papers
Reading list of Instruction-tuning. A trend starts from Natural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022).
☆770 Updated 2 years ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆272 Updated last year
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆373 Updated 11 months ago
yule-BUAA / MergeLM
Codebase for Merging Language Models (ICML 2024)
☆849 Updated last year
teacherpeterpan / self-correction-llm-papers
This is a collection of research papers for Self-Correcting Large Language Models with Automated Feedback.
☆552 Updated 11 months ago
vec2text / vec2text
utilities for decoding deep representations (like sentence embeddings) back to text
☆949 Updated 2 months ago
cooperleong00 / Awesome-LLM-Interpretability
A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc.
☆273 Updated 6 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆278 Updated 3 months ago
chujiezheng / chat_templates
Chat Templates for 🤗 HuggingFace Large Language Models
☆703 Updated 9 months ago
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆514 Updated last year
uclaml / SPIN
The official implementation of Self-Play Fine-Tuning (SPIN)
☆1,203 Updated last year
jlko / semantic_uncertainty
Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments).
☆371 Updated last year
zjunlp / Prompt4ReasoningPapers
[ACL 2023] Reasoning with Language Model Prompting: A Survey
☆983 Updated 4 months ago
madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
☆742 Updated last year
jxzhangjhu / Awesome-LLM-Uncertainty-Reliability-Robustness
Awesome-LLM-Robustness: a curated list of Uncertainty, Reliability and Robustness in Large Language Models
☆786 Updated 4 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆269 Updated last year