egozverev / Should-It-Be-Executed-Or-Processed
Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper.
☆52 · Updated 2 months ago
Alternatives and similar repositories for Should-It-Be-Executed-Or-Processed
Users who are interested in Should-It-Be-Executed-Or-Processed are comparing it to the libraries listed below.
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆54 · Updated 5 months ago
- A better way of testing, inspecting, and analyzing AI Agent traces. ☆35 · Updated last week
- Thorn in a HaizeStack test for evaluating long-context adversarial robustness. ☆26 · Updated 9 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆31 · Updated last week
- ☆56 · Updated last week
- ☆48 · Updated 6 months ago
- Functional Benchmarks and the Reasoning Gap ☆86 · Updated 7 months ago
- A framework for pitting LLMs against each other in an evolving library of games ⚔ ☆32 · Updated last month
- Small, simple agent task environments for training and evaluation ☆18 · Updated 6 months ago
- The first dense retrieval model that can be prompted like an LM ☆72 · Updated last week
- Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation ☆29 · Updated 3 months ago
- Python library to use Pleias-RAG models ☆46 · Updated 2 weeks ago
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆80 · Updated 7 months ago
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆53 · Updated 3 months ago
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆90 · Updated 3 months ago
- ☆20 · Updated 5 months ago
- ☆38 · Updated 2 months ago
- Evaluating LLMs with fewer examples ☆153 · Updated last year
- ☆51 · Updated 6 months ago
- ☆87 · Updated last week
- Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment ☆57 · Updated 8 months ago
- A library for benchmarking the Long Term Memory and Continual learning capabilities of LLM based agents. With all the tests and code you… ☆70 · Updated 5 months ago
- The code repository for the CURLoRA research paper. Stable LLM continual fine-tuning and catastrophic forgetting mitigation. ☆44 · Updated 8 months ago
- Synthetic data derived by templating, few shot prompting, transformations on public domain corpora, and monte carlo tree search. ☆32 · Updated 2 months ago
- Official repo for Learning to Reason for Long-Form Story Generation ☆51 · Updated 3 weeks ago
- Code for RATIONALYST: Pre-training Process-Supervision for Improving Reasoning https://arxiv.org/pdf/2410.01044 ☆32 · Updated 7 months ago
- Clue inspired puzzles for testing LLM deduction abilities ☆35 · Updated last month
- ☆46 · Updated this week
- RepoQA: Evaluating Long-Context Code Understanding ☆108 · Updated 6 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆11 · Updated 7 months ago