JHU-CLSP / turking-bench
Web-grounded natural language instructions
☆13 · Updated last month
Related projects
Alternatives and complementary repositories for turking-bench
- Is In-Context Learning Sufficient for Instruction Following in LLMs? · ☆25 · Updated 5 months ago
- ☆19 · Updated last year
- Supporting code for ReCEval paper · ☆26 · Updated 2 months ago
- Analyzing LLM Alignment via Token distribution shift · ☆13 · Updated 9 months ago
- [EMNLP-2022 Findings] Code for paper “ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback”. · ☆24 · Updated last year
- Tasks for describing differences between text distributions. · ☆16 · Updated 3 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" · ☆49 · Updated 8 months ago
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism · ☆25 · Updated 4 months ago
- Code for our paper: "GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models" · ☆51 · Updated last year
- Code and data for paper "Context-faithful Prompting for Large Language Models". · ☆39 · Updated last year
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks · ☆22 · Updated 2 months ago
- The official repository for the paper "From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning". · ☆62 · Updated last year
- ☆40 · Updated 11 months ago
- Evaluate the Quality of Critique · ☆35 · Updated 5 months ago
- This repository contains data, code and models for contextual noncompliance. · ☆18 · Updated 4 months ago
- Directional Preference Alignment · ☆50 · Updated last month
- Repository for paper Tools Are Instrumental for Language Agents in Complex Environments · ☆32 · Updated last month
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" · ☆44 · Updated 10 months ago
- ☆46 · Updated 10 months ago
- Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?" · ☆47 · Updated last month
- ☆29 · Updated 7 months ago
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning · ☆20 · Updated 8 months ago
- ☆54 · Updated 2 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation · ☆44 · Updated 11 months ago
- MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following · ☆13 · Updated 3 weeks ago
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023) · ☆23 · Updated last year
- code for "Natural Language to Code Translation with Execution" · ☆39 · Updated 2 years ago
- ☆44 · Updated 2 months ago
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups · ☆19 · Updated 2 weeks ago
- Visual and Embodied Concepts evaluation benchmark · ☆21 · Updated last year