tmylla / REEF

The repository for the paper "REEF: Representation Encoding Fingerprints for Large Language Models", which aims to protect the IP of open-source LLMs.

☆26

Alternatives and similar repositories for REEF:

Users interested in REEF are comparing it to the repositories listed below.

ChnQ / TracingLLM
☆24 Updated 6 months ago
AI4Good24 / PsySafe
☆34 Updated 2 weeks ago
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆44 Updated 5 months ago
thu-coai / SafeUnlearning
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆22 Updated 5 months ago
UCSC-VLAA / vllm-safety-benchmark
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
☆72 Updated last year
ChenWu98 / agent-attack
[Arxiv 2024] Adversarial attacks on multimodal agents
☆44 Updated 5 months ago
princeton-nlp / benign-data-breaks-safety
☆23 Updated 2 months ago
boyiwei / alignment-attribution-code
Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆62 Updated 2 months ago
ShuoTang123 / MATRIX
Implementation of the MATRIX framework (ICML 2024)
☆42 Updated 7 months ago
sail-sg / Attention-Sink
[ATTRIB @ NeurIPS 2024 Oral] When Attention Sink Emerges in Language Models: An Empirical View
☆36 Updated 2 months ago
renqibing / ActorAttack
☆57 Updated last month
pipilurj / MLLM-protector
The official repository for paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance"
☆31 Updated 7 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆85 Updated 6 months ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆48 Updated 8 months ago
Dongping-Chen / MLLM-Judge
[ICML 2024 Oral] Official code repository for MLLM-as-a-Judge.
☆58 Updated 3 weeks ago
renqibing / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆30 Updated last month
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆87 Updated 3 weeks ago
yjw1029 / Self-Reminder
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆44 Updated last year
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆48 Updated 2 months ago
OpenKG-ORG / EasyDetect
An Easy-to-use Hallucination Detection Framework for LLMs.
☆48 Updated 7 months ago
sail-sg / Cheating-LLM-Benchmarks
[SafeGenAi @ NeurIPS 2024] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
☆66 Updated last month
Vaidehi99 / InfoDeletionAttacks
☆39 Updated last year
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆110 Updated 2 weeks ago
OSU-NLP-Group / AgentAttack
☆17 Updated last month
sail-sg / MMCBench
☆27 Updated 10 months ago
LZY-the-boys / Twin-Merging
[NeurIPS2024] Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging
☆43 Updated 2 weeks ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆52 Updated 4 months ago
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆66 Updated 9 months ago
tatsu-lab / test_set_contamination
☆34 Updated last year
hxhcreate / VLSBench
Data and Code for Paper VLSBench: Unveiling Visual Leakage in Multimodal Safety
☆25 Updated last week