Qurrent-AI / RES-Q
RES-Q: Evaluating the Code-Editing Capability of Large Language Model Systems at the Repository Scale
☆ 23 · Updated 2 months ago
Related projects:
- Open-sourced predictions, execution logs, trajectories, and results from model inference and evaluation runs on the SWE-bench task. ☆ 79 · Updated 2 weeks ago
- Functional Benchmarks and the Reasoning Gap. ☆ 74 · Updated last month
- A collection of benchmark logs for different LLMs. ☆ 112 · Updated last month
- Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions. ☆ 38 · Updated last month
- Graph-based method for end-to-end, context-aware code completion at the repository level. ☆ 42 · Updated 2 weeks ago
- Draw more samples. ☆ 159 · Updated 2 months ago
- Evaluating LLMs with CommonGen-Lite. ☆ 83 · Updated 6 months ago
- Steer LLM outputs towards a certain topic/subject and enhance response capabilities using activation engineering by adding steering vecto… ☆ 192 · Updated 4 months ago
- Mixing Language Models with Self-Verification and Meta-Verification. ☆ 96 · Updated 10 months ago
- AWM: Agent Workflow Memory. ☆ 121 · Updated last week
- Harness used to benchmark aider against the SWE-bench benchmark. ☆ 44 · Updated 2 months ago
- Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents. ☆ 102 · Updated 3 months ago
- Red-Teaming Language Models with DSPy. ☆ 116 · Updated 5 months ago
- Enhancing AI Software Engineering with Repository-level Code Graph. ☆ 60 · Updated 3 weeks ago
- Aidan Bench attempts to measure <big_model_smell> in LLMs. ☆ 64 · Updated this week
- Code for the paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization". ☆ 140 · Updated 3 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters. ☆ 217 · Updated 2 months ago
- CodeSage: Code Representation Learning At Scale (ICLR 2024). ☆ 76 · Updated 2 months ago
- Scaling is a distributed training library and installable dependency designed to scale up neural networks, with a dedicated module for tr… ☆ 38 · Updated 3 weeks ago
- RepoQA: Evaluating Long-Context Code Understanding. ☆ 96 · Updated this week
- ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp… ☆ 218 · Updated 5 months ago
- A new benchmark for measuring LLMs' capability to detect bugs in large codebases. ☆ 23 · Updated 3 months ago