princeton-nlp / intercodeLinks

[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898

☆223

Alternatives and similar repositories for intercode

Users that are interested in intercode are comparing it to the libraries listed below

Sorting:

Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆169Updated 11 months ago
Zyq-scut / RLTF
Accepted by Transactions on Machine Learning Research (TMLR)
☆130Updated 10 months ago
sambanova / toolbench
ToolBench, an evaluation suite for LLM tool manipulation capabilities.
☆157Updated last year
qishenghu / InstructCoder
InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw
☆62Updated 10 months ago
niansong1996 / lever
Code for paper "LEVER: Learning to Verifiy Language-to-Code Generation with Execution" (ICML'23)
☆89Updated 2 years ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆232Updated 2 months ago
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆151Updated 9 months ago
reasoning-machines / prompt-lib
A set of utilities for running few-shot prompting experiments on large-language models
☆122Updated last year
zorazrw / awesome-tool-llm
☆237Updated 11 months ago
lukasberglund / reversal_curse
☆291Updated last year
princeton-nlp / USACO
Can Language Models Solve Olympiad Programming?
☆119Updated 6 months ago
r2e-project / r2e
r2e: turn any github repository into a programming agent environment
☆129Updated 3 months ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆152Updated last year
aorwall / SWE-bench-docker
☆100Updated last year
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆235Updated 3 months ago
hkust-nlp / AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
☆332Updated last year
princeton-nlp / WebShop
[NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
☆379Updated 10 months ago
kohjingyu / search-agents
Code for the paper 🌳 Tree Search for Language Model Agents
☆208Updated last year
R2E-Gym / R2E-Gym
Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
☆136Updated 3 weeks ago
xlang-ai / DS-1000
[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".
☆251Updated 9 months ago
bhargaviparanjape / language-programmes
☆172Updated 2 years ago
SwiftSage / SwiftSage
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks
☆311Updated 9 months ago
posgnu / rci-agent
A codebase for "Language Models can Solve Computer Tasks"
☆234Updated last year
SWE-bench / SWE-smith
Scaling Data for SWE-agents
☆328Updated this week
evalplus / repoqa
RepoQA: Evaluating Long-Context Code Understanding
☆113Updated 9 months ago
anthropics / ConstitutionalHarmlessnessPaper
☆239Updated 2 years ago
openai / safety-rbr-code-and-data
Code and example data for the paper: Rule Based Rewards for Language Model Safety
☆190Updated last year
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆153Updated last year
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆207Updated last month
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆56Updated 9 months ago