Flossiee / HonestyLLM
[NeurIPS 2024] HonestLLM: Toward an Honest and Helpful Large Language Model
☆18 · Updated last month
Related projects
Alternatives and complementary repositories for HonestyLLM
- [ACL 2024] SALAD benchmark & MD-Judge ☆103 · Updated last month
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆46 · Updated 3 months ago
- Official code for the paper "Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications" ☆58 · Updated last month
- Official implementation of "Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization" ☆109 · Updated 5 months ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆70 · Updated 2 months ago
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆60 · Updated 7 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆83 · Updated 5 months ago
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack (ICLR 2024) ☆51 · Updated 4 months ago
- A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc. ☆168 · Updated 3 weeks ago
- LLM Unlearning ☆123 · Updated last year
- A survey on harmful fine-tuning attacks against large language models ☆70 · Updated this week
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs; empirical tricks for LLM jailbreaking (NeurIPS 2024) ☆81 · Updated 3 weeks ago
- [arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker" ☆120 · Updated 8 months ago
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models (NeurIPS 2024) ☆58 · Updated last month
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆45 · Updated 7 months ago
- This repository provides an original implementation of "Detecting Pretraining Data from Large Language Models" by *Weijia Shi, *Anirudh Aji… ☆207 · Updated last year
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling ☆18 · Updated this week
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆60 · Updated last month
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆111 · Updated this week
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs). ☆111 · Updated last year
- A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey"… ☆71 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆239 · Updated 8 months ago