ShaoShuai0605 / MisevolutionView external linksLinks
Official Repo of Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
☆58Oct 28, 2025Updated 3 months ago
Alternatives and similar repositories for Misevolution
Users that are interested in Misevolution are comparing it to the libraries listed below
Sorting:
- Project of ACL 2025 "UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models"☆15Mar 25, 2025Updated 10 months ago
- [ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs☆13Jun 20, 2025Updated 7 months ago
- [NDSS'25] The official implementation of safety misalignment.☆17Jan 8, 2025Updated last year
- ☆23Oct 30, 2025Updated 3 months ago
- ☆19Jun 21, 2025Updated 7 months ago
- ☆44Oct 1, 2024Updated last year
- ☆24Dec 8, 2024Updated last year
- Code for the paper "AsFT: Anchoring Safety During LLM Fune-Tuning Within Narrow Safety Basin".☆35Jul 10, 2025Updated 7 months ago
- ☆30May 22, 2024Updated last year
- [AAAI 2026] Data and Code for Paper IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks☆40Nov 24, 2025Updated 2 months ago
- [COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free☆52Apr 6, 2025Updated 10 months ago
- CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics☆27Nov 1, 2025Updated 3 months ago
- The Oyster series is a set of safety models developed in-house by Alibaba-AAIG, devoted to building a responsible AI ecosystem. | Oyster …☆59Sep 11, 2025Updated 5 months ago
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL☆59Sep 5, 2025Updated 5 months ago
- [USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns☆13Mar 1, 2025Updated 11 months ago
- A framework for steering MoE models by detecting and controlling behavior-linked experts.☆29Sep 12, 2025Updated 5 months ago
- [ICLR 2024] This is the official implementation for the paper: "Beyond imitation: Leveraging fine-grained quality signals for alignment"☆10May 5, 2024Updated last year
- Code for experiments on self-prediction as a way to measure introspection in LLMs☆16Dec 10, 2024Updated last year
- [KDD'23] This is the code repo for our KDD'23 paper "DyGen: Learning from Noisy Labels via Dynamics-Enhanced Generative Modeling".☆11Jun 14, 2023Updated 2 years ago
- ☆11Oct 25, 2024Updated last year
- Identification of the Adversary from a Single Adversarial Example (ICML 2023)☆10Jul 15, 2024Updated last year
- ☆17May 3, 2025Updated 9 months ago
- [NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search☆17Jan 24, 2026Updated 3 weeks ago
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆17Dec 14, 2025Updated 2 months ago
- To mitigate position bias in LLMs, especially in long-context scenarios, we scale only one dimension of LLMs, reducing position bias and …☆11Jun 18, 2024Updated last year
- Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"☆14Mar 28, 2024Updated last year
- 一个学院网络规划与设计方案(同济大学20计科计算机网络课程设计)☆13Sep 13, 2023Updated 2 years ago
- [AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…☆10Feb 7, 2026Updated last week
- ICML2025: One Image is Worth a Thousand Words: A Usability Preservable Text-Image Collaborative Erasing Framework☆14Jun 24, 2025Updated 7 months ago
- ☆14Feb 26, 2025Updated 11 months ago
- Data Science Challenge☆11May 14, 2021Updated 4 years ago
- The officalimplement of dLLM-Factory☆25Jul 12, 2025Updated 7 months ago
- This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)☆49Jan 15, 2026Updated 3 weeks ago
- Representation Surgery for Multi-Task Model Merging. ICML, 2024.☆47Oct 10, 2024Updated last year
- ☆16Nov 18, 2024Updated last year
- ☆16Feb 17, 2025Updated 11 months ago
- Code for paper "Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators"☆16Dec 4, 2024Updated last year
- code of paper "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM"☆14Nov 17, 2023Updated 2 years ago
- Official repository for WWW'24 paper "MemeCraft: Contextual and Stance-Driven Multimodal Meme Generation"☆12Jul 25, 2024Updated last year