SPIN-UMass / Stealing-the-Decoding-Algorithms-of-Language-Models

☆8

Alternatives and similar repositories for Stealing-the-Decoding-Algorithms-of-Language-Models

Users who are interested in Stealing-the-Decoding-Algorithms-of-Language-Models are comparing it to the repositories listed below.

DYR1 / MoGU
Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆15 Updated 4 months ago
McGill-NLP / AdversarialTriggers
Code for "Universal Adversarial Triggers Are Not Universal."
☆17 Updated last year
weizeming / momentum-attack-llm
☆21 Updated 4 months ago
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆84 Updated last year
David-Li0406 / AI-Supervision-Risk
☆20 Updated 2 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆61 Updated 4 months ago
Princeton-SysML / kNNLM_privacy
Official implementation of Privacy Implications of Retrieval-Based Language Models (EMNLP 2023). https://arxiv.org/abs/2305.14888
☆35 Updated 11 months ago
princeton-nlp / benign-data-breaks-safety
☆41 Updated 8 months ago
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆54 Updated last year
eric-mitchell / serac
Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
☆68 Updated 2 years ago
ThuCCSLab / MergeGuard
[CCS-LAMPS'24] LLM IP Protection Against Model Merging
☆15 Updated 7 months ago
swj0419 / muse_bench
☆21 Updated 2 months ago
zhaoyiran924 / Probe-Sampling
[NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
☆28 Updated 6 months ago
declare-lab / resta
Restore safety in fine-tuned language models through task arithmetic
☆28 Updated last year
mireshghallah / neighborhood-curvature-mia
☆21 Updated last year
zjysteven / mink-plus-plus
[ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs
☆37 Updated last week
Twilight92z / Quantize-Watermark
☆20 Updated last year
Vaidehi99 / InfoDeletionAttacks
☆44 Updated 3 months ago
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆24 Updated 11 months ago
agiresearch / TrustAgent
TrustAgent: Towards Safe and Trustworthy LLM-based Agents
☆41 Updated 4 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆57 Updated 8 months ago
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆35 Updated last week
SALT-NLP / Efficient_Unlearning
☆38 Updated last year
leix28 / prompt-universal-vulnerability
Implementation of the paper "Exploring the Universal Vulnerability of Prompt-based Learning Paradigm" on Findings of NAACL 2022
☆29 Updated 2 years ago
decoding-comp-trust / comp-trust
Codebase for decoding compressed trust.
☆23 Updated last year
locuslab / acr-memorization
☆34 Updated 5 months ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆59 Updated last year
ejones313 / auditing-llms
☆54 Updated 2 years ago
SALT-NLP / chain-of-thought-bias
☆26 Updated 8 months ago
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆63 Updated last year