Uppaal / detox-edit

☆4

Alternatives and similar repositories for detox-edit

Users who are interested in detox-edit are comparing it to the repositories listed below

deeplearning-wisc / haloscope
source code for NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection"
☆48 Updated 3 months ago
AlexanderVNikitin / kernel-language-entropy
Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24)
☆24 Updated 7 months ago
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆80 Updated 3 months ago
jaechan-repo / muse_bench
☆26 Updated 11 months ago
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆25 Updated last year
javiferran / sae_entities
☆54 Updated 4 months ago
princeton-nlp / benign-data-breaks-safety
☆41 Updated 9 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆59 Updated 9 months ago
tanganke / weight-ensembling_MoE
Code for paper "Merging Multi-Task Models via Weight-Ensembling Mixture of Experts"
☆27 Updated last year
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆59 Updated last year
licong-lin / negative-preference-optimization
☆60 Updated last year
nik-dim / tall_masks
Official repository of "Localizing Task Information for Improved Model Merging and Compression" [ICML 2024]
☆45 Updated 8 months ago
EnnengYang / AdaMerging
AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR, 2024.
☆87 Updated 8 months ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆101 Updated 4 months ago
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆35 Updated last month
VITA-Group / SEAL
Official code for SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆30 Updated 3 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆94 Updated last year
UCSC-VLAA / vllm-safety-benchmark
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
☆81 Updated last year
weixuan-wang123 / SADI
☆11 Updated 4 months ago
yfqiu-nlp / sea-llm
Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"
☆24 Updated 6 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆74 Updated 4 months ago
ethz-spylab / unlearning-vs-safety
☆23 Updated 9 months ago
swj0419 / muse_bench
☆22 Updated 4 months ago
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆77 Updated 9 months ago
which47 / LLMCL
Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning
☆34 Updated 8 months ago
zepingyu0512 / in-context-mechanism
code for EMNLP 2024 paper: How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for M…
☆13 Updated 8 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆71 Updated 9 months ago
milesaturpin / cot-unfaithfulness
☆46 Updated last year
zzwjames / FailureLLMUnlearning
An official implementation of "Catastrophic Failure of LLM Unlearning via Quantization" (ICLR 2025)
☆27 Updated 4 months ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆38 Updated 8 months ago