princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆194Updated 6 months ago
Related projects
Alternatives and complementary repositories for intercode
- A set of utilities for running few-shot prompting experiments on large-language models☆113Updated last year
- Accepted by Transactions on Machine Learning Research (TMLR)☆119Updated last month
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"☆212Updated last month
- Enhancing AI Software Engineering with Repository-level Code Graph☆94Updated 2 months ago
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation☆114Updated last month
- An Analytical Evaluation Board of Multi-turn LLM Agents☆250Updated 6 months ago
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024☆133Updated 3 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents☆138Updated 3 months ago
- Code for paper "LEVER: Learning to Verify Language-to-Code Generation with Execution" (ICML'23)☆79Updated last year
- ☆81Updated 4 months ago
- [NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents☆276Updated 2 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…☆110Updated 3 weeks ago
- A codebase for "Language Models can Solve Computer Tasks"☆225Updated 6 months ago
- Code and data accompanying our paper on arXiv "Faithful Chain-of-Thought Reasoning".☆155Updated 6 months ago
- ToolBench, an evaluation suite for LLM tool manipulation capabilities.☆144Updated 8 months ago
- Chain-of-Hindsight, A Scalable RLHF Method☆220Updated last year
- ☆170Updated last year
- ☆86Updated last year
- RepoQA: Evaluating Long-Context Code Understanding☆100Updated 2 weeks ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)☆122Updated 3 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral)☆235Updated 7 months ago
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use☆115Updated 7 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw☆52Updated last month
- A benchmark list for evaluation of large language models.☆68Updated 4 months ago
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training☆219Updated 5 months ago
- Code for arXiv 2023: Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback☆201Updated last year
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".☆222Updated 3 weeks ago
- An extensible benchmark for evaluating large language models on planning☆291Updated 6 months ago
- Data and code for "DocPrompting: Generating Code by Retrieving the Docs" @ICLR 2023☆231Updated 11 months ago
- This is the repo for the paper Shepherd -- A Critic for Language Model Generation☆213Updated last year