felixbinder / introspection_self_prediction
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆12 · Updated 3 months ago
Alternatives and similar repositories for introspection_self_prediction:
Users interested in introspection_self_prediction are comparing it to the repositories listed below.
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data" ☆17 · Updated last year
- ☆28 · Updated last year
- ☆17 · Updated last year
- The repository contains code for Adaptive Data Optimization ☆20 · Updated 3 months ago
- Lottery Ticket Adaptation ☆39 · Updated 4 months ago
- We introduce EMMET and unify model editing with popular algorithms ROME and MEMIT. ☆16 · Updated 3 months ago
- Efficient Scaling laws and collaborative pretraining. ☆16 · Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆62 · Updated last week
- ☆31 · Updated 2 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆29 · Updated 2 months ago
- ☆22 · Updated 6 months ago
- ☆38 · Updated last year
- ☆14 · Updated 4 months ago
- ☆21 · Updated 5 months ago
- ☆34 · Updated last year
- Measuring the situational awareness of language models ☆34 · Updated last year
- Aioli: A unified optimization framework for language model data mixing ☆22 · Updated 2 months ago
- Code for "Merging Text Transformers from Different Initializations" ☆19 · Updated 2 months ago
- Python package for generating datasets to evaluate reasoning and retrieval of large language models ☆16 · Updated this week
- DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails ☆17 · Updated last month
- ☆27 · Updated last year
- [ICLR 2025] "Training LMs on Synthetic Edit Sequences Improves Code Synthesis" (Piterbarg, Pinto, Fergus) ☆17 · Updated last month
- Knowledge Unlearning for Large Language Models ☆22 · Updated 3 weeks ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26 · Updated 10 months ago
- https://footprints.baulab.info ☆17 · Updated 5 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆67 · Updated last year
- ☆18 · Updated 8 months ago
- Tasks for describing differences between text distributions. ☆16 · Updated 7 months ago
- ☆35 · Updated 2 years ago
- Minimum Description Length probing for neural network representations ☆19 · Updated 2 months ago