microsoft / ValueCompass
☆25 · Updated 10 months ago
Alternatives and similar repositories for ValueCompass
Users interested in ValueCompass are comparing it to the libraries listed below.
- ☆51 · Updated last year
- [ACL 2024] SALAD benchmark & MD-Judge ☆158 · Updated 5 months ago
- ☆10 · Updated 10 months ago
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey ☆106 · Updated last year
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … ☆70 · Updated this week
- ☆28 · Updated 2 months ago
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆147 · Updated 4 months ago
- This repository contains a regularly updated paper list for LLMs-reasoning-in-latent-space. ☆154 · Updated this week
- ☆51 · Updated 2 months ago
- AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper) ☆314 · Updated last month
- The repository of the paper "REEF: Representation Encoding Fingerprints for Large Language Models" aims to protect the IP of open-source… ☆62 · Updated 7 months ago
- ☆135 · Updated 6 months ago
- Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel, exciting long2short methods on large reasoning models. It contains… ☆244 · Updated 3 weeks ago
- Accepted by ECCV 2024 ☆149 · Updated 10 months ago
- ☆157 · Updated 11 months ago
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs). ☆155 · Updated last year
- ☆101 · Updated 7 months ago
- A survey on harmful fine-tuning attacks for large language models ☆205 · Updated this week
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆95 · Updated 3 months ago
- Awesome agents in the era of large language models ☆68 · Updated last year
- This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector" ☆43 · Updated 9 months ago
- LLM hallucination paper list ☆322 · Updated last year
- [ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models ☆63 · Updated 5 months ago
- Awesome papers in LLM interpretability ☆545 · Updated 2 weeks ago
- ☆106 · Updated 4 months ago
- Awesome SAE papers ☆44 · Updated 3 months ago
- Toolkit for evaluating the trustworthiness of generative foundation models. ☆114 · Updated 2 weeks ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆83 · Updated 5 months ago
- ☆34 · Updated 11 months ago
- Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning" ☆69 · Updated 6 months ago