peterljq / Parsimonious-Concept-Engineering

PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)

☆41

Alternatives and similar repositories for Parsimonious-Concept-Engineering

Users who are interested in Parsimonious-Concept-Engineering compare it to the repositories listed below.

stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆150 Updated 5 months ago
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆186 Updated 7 months ago
activatedgeek / calibration-tuning
☆52 Updated 7 months ago
locuslab / acr-memorization
☆37 Updated 11 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆84 Updated 8 months ago
MadryLab / DsDm
☆51 Updated last year
jiahai-feng / binding-iclr
☆16 Updated last year
milesaturpin / cot-unfaithfulness
☆51 Updated 2 years ago
sail-sg / Cheating-LLM-Benchmarks
[ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral)
☆84 Updated last year
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆64 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆123 Updated 2 months ago
KihoPark / linear_rep_geometry
☆110 Updated 9 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆77 Updated last year
tatsu-lab / test_set_contamination
☆41 Updated 2 years ago
haotiansun14 / BBox-Adapter
Lightweight Adapting for Black-Box Large Language Models
☆24 Updated last year
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆63 Updated last year
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆38 Updated 3 months ago
poloclub / llm-landscape
NeurIPS'24 - LLM Safety Landscape
☆33 Updated last month
bpwu1 / confidence-regulation-neurons
Confidence Regulation Neurons in Language Models (NeurIPS 2024)
☆14 Updated 10 months ago
dannyallover / overthinking_the_truth
☆29 Updated last year
yfqiu-nlp / sea-llm
Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"
☆29 Updated 11 months ago
roeehendel / icl_task_vectors
☆102 Updated 2 years ago
tim-lawson / mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
☆27 Updated 9 months ago
javiferran / sae_entities
☆66 Updated 8 months ago
ahans30 / goldfish-loss
[NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs
☆93 Updated last year
efarrell1 / train_sparse_autoencoder
Trains Sparse Autoencoders based on outputs from language models
☆11 Updated last year
katiekang1998 / reasoning_generalization
☆33 Updated 10 months ago
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆44 Updated 7 months ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆57 Updated last month
socialfoundations / tttlm
Test-time training on nearest neighbors for large language models
☆48 Updated last year