xlang-ai / DS-1000Links

[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".

☆251

Alternatives and similar repositories for DS-1000

Users that are interested in DS-1000 are comparing it to the libraries listed below

Sorting:

Leolty / repobench
✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024
☆169Updated 11 months ago
TIGER-AI-Lab / Program-of-Thoughts
Data and Code for Program of Thoughts [TMLR 2023]
☆280Updated last year
amazon-science / cceval
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023)
☆153Updated last year
facebookresearch / cruxeval
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
☆151Updated 9 months ago
nuprl / MultiPL-E
A multi-programming language benchmark for LLMs
☆265Updated 2 weeks ago
Ber666 / ToolkenGPT
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral)
☆264Updated last year
Zyq-scut / RLTF
Accepted by Transactions on Machine Learning Research (TMLR)
☆130Updated 9 months ago
thunlp / DebugBench
The repository for paper "DebugBench: "Evaluating Debugging Capability of Large Language Models".
☆79Updated last year
niansong1996 / lever
Code for paper "LEVER: Learning to Verifiy Language-to-Code Generation with Execution" (ICML'23)
☆89Updated 2 years ago
xlang-ai / Binder
[ICLR 2023] Code for the paper "Binding Language Models in Symbolic Languages"
☆320Updated last year
ntunlp / ExecEval
A distributed, extensible, secure solution for evaluating machine generated code with unit tests in multiple programming languages.
☆56Updated 9 months ago
sambanova / toolbench
ToolBench, an evaluation suite for LLM tool manipulation capabilities.
☆157Updated last year
nelson-liu / lost-in-the-middle
Code and data for "Lost in the Middle: How Language Models Use Long Contexts"
☆354Updated last year
nickrosh / evol-teacher
Open Source WizardCoder Dataset
☆159Updated 2 years ago
open-compass / DevEval
A Comprehensive Benchmark for Software Development.
☆111Updated last year
zorazrw / awesome-tool-llm
☆237Updated 11 months ago
GAIR-NLP / auto-j
Generative Judge for Evaluating Alignment
☆244Updated last year
reddy-lab-code-research / PPOCoder
Code for the TMLR 2023 paper "PPOCoder: Execution-based Code Generation using Deep Reinforcement Learning"
☆114Updated last year
xingyaoww / mint-bench
Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…
☆128Updated last year
shuyanzhou / docprompting
Data and code for "DocPrompting: Generating Code by Retrieving the Docs" @ICLR 2023
☆248Updated last year
shrivastavadisha / repo_level_prompt_generation
☆124Updated 2 years ago
bigcode-project / octopack
🐙 OctoPack: Instruction Tuning Code Large Language Models
☆472Updated 5 months ago
princeton-nlp / intercode
[NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898
☆223Updated last year
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆309Updated 10 months ago
code-rag-bench / code-rag-bench
CodeRAG-Bench: Can Retrieval Augment Code Generation?
☆150Updated 8 months ago
ezelikman / STaR
Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022)
☆206Updated 2 years ago
bigcode-project / the-stack-v2
Code for the curation of The Stack v2 and StarCoder2 training data
☆114Updated last year
xlang-ai / xlang-paper-reading
Paper collection on building and evaluating language model agents via executable language grounding
☆359Updated last year
bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
☆971Updated last week
CoderEval / CoderEval
A collection of practical code generation tasks and tests in open source projects. Complementary to HumanEval by OpenAI.
☆149Updated 7 months ago