FanZT6 / FairMT-bench
☆14 · Updated 11 months ago
Alternatives and similar repositories for FairMT-bench
Users that are interested in FairMT-bench are comparing it to the libraries listed below
- A curated list of resources for activation engineering ☆123 · Updated 4 months ago
- [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the trustworthiness of generative foundation models. ☆125 · Updated 5 months ago
- This repository contains a regularly updated paper list for LLMs-reasoning-in-latent-space. ☆276 · Updated last week
- AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025 (Outstanding Paper) ☆415 · Updated 3 months ago
- ☆174 · Updated 3 months ago
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … ☆81 · Updated last week
- awesome SAE papers ☆71 · Updated 8 months ago
- Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures ☆30 · Updated 2 weeks ago
- [ICLR 2025] Released code for paper "Spurious Forgetting in Continual Learning of Language Models" ☆59 · Updated 9 months ago
- LLM Unlearning ☆181 · Updated 2 years ago
- ☆56 · Updated last year
- A curated list of personalized alignment resources (continually updated). ☆57 · Updated 3 months ago
- ☆186 · Updated 3 weeks ago
- [ACL'25 Main] SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence! | Help your LLM make better use of context documents: a simple attention-based approach ☆24 · Updated 11 months ago
- A survey on harmful fine-tuning attacks for large language models ☆232 · Updated last month
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆173 · Updated 9 months ago
- [ICLR 2025] Code and Data Repo for Paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation" ☆93 · Updated last year
- ☆28 · Updated last month
- A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository agg… ☆183 · Updated 3 months ago
- A versatile toolkit for applying Logit Lens to modern large language models (LLMs). Currently supports Llama-3.1-8B and Qwen-2.5-7B, enab… ☆155 · Updated 5 months ago
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL ☆59 · Updated 5 months ago
- A Diagnostic Guardrail Framework for AI Agent Safety and Security ☆340 · Updated this week
- Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning … ☆87 · Updated 5 months ago
- Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel, exciting long2short methods on large reasoning models. It contains… ☆258 · Updated 6 months ago
- This is the official GitHub repository for our survey paper "Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language … ☆171 · Updated 8 months ago
- ☆101 · Updated 7 months ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆89 · Updated 10 months ago
- Chain of Thoughts (CoT) is so hot! So long! We need a short reasoning process! ☆72 · Updated 10 months ago
- ☆64 · Updated 8 months ago
- [ICLR 2026] The implementation of paper "AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint" ☆37 · Updated 2 months ago