chziakas / redeval
A library for red-teaming LLM applications with LLMs.
☆28 Updated last year
Alternatives and similar repositories for redeval
Users who are interested in redeval compare it to the libraries listed below.
- Red-Teaming Language Models with DSPy ☆235 Updated 8 months ago
- A prompt injection game to collect data for robust ML research ☆65 Updated 9 months ago
- A subset of jailbreaks automatically discovered by the Haize Labs haizing suite. ☆98 Updated 6 months ago
- Code for the paper "Fishing for Magikarp" ☆173 Updated 5 months ago
- ☆165 Updated 4 months ago
- ☆35 Updated 11 months ago
- ☆19 Updated last year
- A repository of Language Model Vulnerabilities and Exposures (LVEs). ☆112 Updated last year
- Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming" ☆49 Updated last year
- ☆94 Updated 11 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆65 Updated last year
- ☆29 Updated 5 months ago
- Code for reproducing our paper "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing" ☆31 Updated 7 months ago
- Mixing Language Models with Self-Verification and Meta-Verification ☆109 Updated 10 months ago
- Open Source Replication of Anthropic's Alignment Faking Paper ☆50 Updated 7 months ago
- ☆35 Updated 5 months ago
- Papers about red teaming LLMs and Multimodal models. ☆152 Updated 5 months ago
- Sphynx Hallucination Induction ☆53 Updated 9 months ago
- The fastest Trust Layer for AI Agents ☆144 Updated 5 months ago
- Open Implementations of LLM Analyses ☆107 Updated last year
- ☆26 Updated last year
- Efficient Dictionary Learning with Switch Sparse Autoencoders (SAEs) ☆25 Updated 11 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆34 Updated 6 months ago
- Curation of prompts that are known to be adversarial to large language models ☆184 Updated 2 years ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆92 Updated 11 months ago
- Code to break Llama Guard ☆32 Updated last year
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆69 Updated last year
- Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique ☆18 Updated last year
- Codebase accompanying the Summary of a Haystack paper. ☆79 Updated last year
- ☆55 Updated last year