JHU-CLSP / turking-benchLinks
Web-grounded natural language instructions
☆17Updated 9 months ago
Alternatives and similar repositories for turking-bench
Users that are interested in turking-bench are comparing it to the libraries listed below
Sorting:
- Evaluate the Quality of Critique☆36Updated last year
- ☆19Updated last year
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups☆43Updated 8 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]☆31Updated 7 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"☆54Updated last year
- B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners☆84Updated 4 months ago
- ☆59Updated last year
- ☆27Updated 2 years ago
- This repository contains data, code and models for contextual noncompliance.☆24Updated last year
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024)☆37Updated 8 months ago
- Improving Language Understanding from Screenshots. Paper: https://arxiv.org/abs/2402.14073☆30Updated last year
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks☆30Updated last year
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs"☆32Updated last year
- [ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners"☆111Updated 2 years ago
- ☆41Updated last year
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism☆30Updated last year
- ☆57Updated 4 months ago
- ☆14Updated last year
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning☆27Updated last year
- Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"☆21Updated 6 months ago
- ☆14Updated 2 months ago
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023)☆23Updated last year
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation☆49Updated last year
- ☆44Updated last year
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"☆48Updated last year
- Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding☆27Updated last year
- [EMNLP '23] Discriminator-Guided Chain-of-Thought Reasoning☆49Updated 11 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision☆123Updated last year
- [ICLR2025 Spotlight] Agent Trajectory Synthesis via Guiding Replay with Web Tutorials☆41Updated 7 months ago
- The official repository for the paper "From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning".☆66Updated 2 years ago