McGill-NLP / safearena
SafeArena is a benchmark for assessing the harmful capabilities of web agents
☆19 Updated 6 months ago
Alternatives and similar repositories for safearena
Users who are interested in safearena are comparing it to the repositories listed below
- ☆57 Updated 2 years ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆66 Updated 11 months ago
- Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task. ☆150 Updated last month
- [ICLR 2025] General-purpose activation steering library ☆114 Updated last month
- Steering Llama 2 with Contrastive Activation Addition ☆191 Updated last year
- Sparse probing paper full code. ☆62 Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆98 Updated 2 years ago
- ☆164 Updated 11 months ago
- ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models. ☆149 Updated 2 months ago
- Inspecting and Editing Knowledge Representations in Language Models ☆117 Updated 2 years ago
- ☆179 Updated last year
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge ☆14 Updated last year
- ☆234 Updated last year
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts" ☆77 Updated last year
- Repository for the Bias Benchmark for QA dataset. ☆129 Updated last year
- AbstainQA, ACL 2024 ☆28 Updated last year
- ☆127 Updated last year
- ☆116 Updated last year
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆95 Updated 2 years ago
- This repository contains data, code and models for contextual noncompliance. ☆24 Updated last year
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆157 Updated last week
- ☆98 Updated last year
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆190 Updated 8 months ago
- ☆189 Updated 3 months ago
- ☆27 Updated last year
- Function Vectors in Large Language Models (ICLR 2024) ☆181 Updated 6 months ago
- A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic… ☆394 Updated 6 months ago
- ☆181 Updated 11 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆117 Updated last year
- ☆92 Updated last year