wangyu-ustc / LargeScaleWashing

The official implementation of the paper "Large Scale Knowledge Washing"

☆10

Alternatives and similar repositories for LargeScaleWashing

Users who are interested in LargeScaleWashing are comparing it to the repositories listed below.

srzer / MOD
Official code for "Decoding-Time Language Model Alignment with Multiple Objectives".
☆27 Updated last year
dpaleka / stealing-part-lm-supplementary
Some code for "Stealing Part of a Production Language Model"
☆22 Updated last year
princeton-nlp / benign-data-breaks-safety
☆41 Updated last year
ruiqi-zhong / nlparam
Augmenting Statistical Models with Natural Language Parameters
☆29 Updated last year
deeplearning-wisc / args
☆46 Updated last year
mbzuai-nlp / finchain
A symbolic benchmark for verifiable chain-of-thought financial reasoning. Includes executable templates, 58 topics across 12 domains, and…
☆20 Updated 3 weeks ago
shuoli90 / Rank-Calibration
This is the repo for constructing a comprehensive and rigorous evaluation framework for LLM calibration.
☆13 Updated last year
OSU-NLP-Group / AgentSafety
☆128 Updated 2 weeks ago
Open-Social-World / autolibra
AutoLibra: Metric Induction for Agents from Open-Ended Human Feedback
☆16 Updated last month
wangyu-ustc / LVChat
The official implementation of the paper "LVChat: Facilitating Long Video Comprehension"
☆14 Updated last year
lawraa / LLM-Discussion
☆20 Updated 2 weeks ago
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆84 Updated last year
tatsu-lab / linguistic_calibration
Align your LM to express calibrated verbal statements of confidence in its long-form generations.
☆27 Updated last year
PKU-Alignment / llms-resist-alignment
[ACL2025 Best Paper] Language Models Resist Alignment
☆36 Updated 5 months ago
OpenBMB / CPO
☆23 Updated last year
Junjie-Ye / RoTBench
[EMNLP 2024] RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
☆14 Updated 6 months ago
OSU-NLP-Group / AgentAttack
☆22 Updated last year
abhishekpanigrahi1996 / Skill-Localization-by-grafting
☆51 Updated last year
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆86 Updated last year
swj0419 / muse_bench
☆29 Updated 8 months ago
TrustGen / TrustEval-toolkit
Toolkit for evaluating the trustworthiness of generative foundation models.
☆123 Updated 2 months ago
yfqiu-nlp / sea-llm
Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"
☆28 Updated 10 months ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
zhxieml / remiss-jailbreak
☆33 Updated last year
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆163 Updated 6 months ago
princeton-nlp / unintentional-unalignment
[ICLR 2025] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
☆31 Updated 9 months ago
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆89 Updated last year
qiancheng0 / ModelingAgent
☆18 Updated 2 months ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆38 Updated 2 months ago
uw-nsl / safechain
[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
☆25 Updated 7 months ago