abacaj / code-eval
Run evaluation on LLMs using human-eval benchmark
★406 · Updated last year
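code-eval scores model completions against OpenAI's HumanEval benchmark; harnesses in this space typically report the unbiased pass@k estimator introduced in the HumanEval paper (Chen et al., 2021). A minimal sketch of that estimator (the function name is illustrative, not code-eval's API):

```python
# Sketch: unbiased pass@k estimator from the HumanEval paper.
# n = completions sampled per problem, c = completions that passed
# the unit tests, k = evaluation budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 200 samples of which 2 pass, `pass_at_k(200, 2, 1)` gives 0.01, i.e. the plain pass rate; larger k values reward models whose correct solutions are spread across samples.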
Alternatives and similar repositories for code-eval:
Users interested in code-eval are comparing it to the repositories listed below:
- Open Source WizardCoder Dataset ★157 · Updated last year
- 🐙 OctoPack: Instruction Tuning Code Large Language Models ★462 · Updated 2 months ago
- [ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation". ★240 · Updated 5 months ago
- ★268 · Updated last year
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ★544 · Updated last year
- ✨ RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems - ICLR 2024 ★156 · Updated 8 months ago
- NexusRaven-13B, a new SOTA Open-Source LLM for function calling. This repo contains everything for reproducing our evaluation on NexusRav… ★315 · Updated last year
- Official repository for LongChat and LongEval ★518 · Updated 10 months ago
- [NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation ★304 · Updated last month
- A framework for the evaluation of autoregressive code generation language models. ★930 · Updated 5 months ago
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (NeurIPS 2023) ★136 · Updated 8 months ago
- ★308 · Updated 10 months ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ★517 · Updated 2 months ago
- Fine-tune SantaCoder for Code/Text Generation. ★191 · Updated 2 years ago
- Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" (ICLR 2024) ★368 · Updated 7 months ago
- Generative Judge for Evaluating Alignment ★235 · Updated last year
- A bagel, with everything. ★318 · Updated last year
- A library for easily merging multiple LLM experts and efficiently training the merged LLM. ★470 · Updated 7 months ago
- [ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long-context language model evaluation benchmark ★375 · Updated 9 months ago
- ★84 · Updated last year
- [NeurIPS 2023 D&B] Code repository for the InterCode benchmark: https://arxiv.org/abs/2306.14898 ★211 · Updated 11 months ago
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR 2024] ★546 · Updated 4 months ago
- [COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition ★626 · Updated 8 months ago
- ★314 · Updated 7 months ago
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ★458 · Updated last year
- Compress your input to ChatGPT or other LLMs to let them process 2x more content and save 40% memory and GPU time. ★368 · Updated last year
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ★317 · Updated 6 months ago
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ★463 · Updated last year
- [ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning ★650 · Updated 10 months ago
- ★526 · Updated 7 months ago