OpenCoder-llm/opc_data_filtering

OpenCoder-llm / opc_data_filtering

Heuristic filtering framework for RefineCode

☆82

Alternatives and similar repositories for opc_data_filtering

Users who are interested in opc_data_filtering are comparing it to the repositories listed below.

OpenCoder-llm / OpenCoder-llm
The Open Cookbook for Top-Tier Code Large Language Model
☆2,043 · Dec 8, 2024 · Updated last year
richardodliu / OpenCodeEval
☆50 · Aug 21, 2025 · Updated 6 months ago
zhenyuhe00 / SWE-Swiss
SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution
☆104 · Sep 24, 2025 · Updated 5 months ago
dunzeng / MORE
Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment
☆16 · Aug 6, 2024 · Updated last year
AISE-TUDelft / Code4MeEvaluation
Language Models for Code Completion: a Practical Evaluation
☆13 · Jan 19, 2024 · Updated 2 years ago
yyDing1 / ScaleQuest
[ACL 2025] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLM…
☆68 · Oct 27, 2024 · Updated last year
microsoft / RedStone
The RedStone repository includes code for preparing extensive datasets used in training large language models.
☆156 · Jan 22, 2026 · Updated last month
alon-albalak / online-data-mixing
An implementation of online data mixing for the Pile dataset, based on the GPT-NeoX library.
☆13 · Jan 9, 2024 · Updated 2 years ago
UCSB-NLP-Chang / Prereq_tune
Implementation for the paper "Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning"
☆11 · Jan 10, 2025 · Updated last year
xszheng2020 / memorization
An Empirical Study of Memorization in NLP (ACL 2022)
☆13 · Jun 22, 2022 · Updated 3 years ago
step-law / steplaw
☆211 · Oct 27, 2025 · Updated 4 months ago
angie-chen55 / pref-learning-ranking-acc
☆13 · Jun 4, 2024 · Updated last year
RUC-GSAI / Llama-3-SynE
Llama-3-SynE: A Significantly Enhanced Version of Llama-3 with Advanced Scientific Reasoning and Chinese Language Capabilities | Continual pre-training improves …
☆37 · May 31, 2025 · Updated 9 months ago
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆78 · May 2, 2025 · Updated 10 months ago
xingyaoww / mint-bench
Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha…
☆133 · Jun 4, 2024 · Updated last year
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆13 · Aug 8, 2025 · Updated 6 months ago
limenlp / safer-instruct
This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17 · Feb 22, 2024 · Updated 2 years ago
rioyokotalab / swallow-code-math
Ongoing research project for code&math LLMs
☆27 · Jul 4, 2025 · Updated 7 months ago
banksy23 / XCoder
☆36 · Jul 7, 2025 · Updated 7 months ago
real-absolute-AI / Unnatural_Language
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24 · May 20, 2025 · Updated 9 months ago
keezen / ntk_alibi
NTK scaled version of ALiBi position encoding in Transformer.
☆69 · Aug 16, 2023 · Updated 2 years ago
yegcjs / mixinglaws
☆109 · Jul 15, 2025 · Updated 7 months ago
OpenBMB / infllmv2_cuda_impl
☆94 · Feb 11, 2026 · Updated 2 weeks ago
bigcode-project / octopack
🐙 OctoPack: Instruction Tuning Code Large Language Models
☆478 · Feb 5, 2025 · Updated last year
liujch1998 / memo-trap
☆22 · Jan 25, 2023 · Updated 3 years ago
a-m-team / a-m-models
a-m-team's exploration in large language modeling
☆194 · May 29, 2025 · Updated 9 months ago
yiqingxyq / RepoST
Code for "[COLM'25] RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing"
☆23 · Mar 18, 2025 · Updated 11 months ago
ganler / code-r1
Reproducing R1 for Code with Reliable Rewards
☆290 · May 5, 2025 · Updated 9 months ago
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆47 · Apr 15, 2025 · Updated 10 months ago
bytedance / SandboxFusion
☆932 · Dec 11, 2025 · Updated 2 months ago
huggingface / cosmopedia
☆565 · Nov 20, 2024 · Updated last year
Yanlewen / TradeTrap
🧨 TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?
☆66 · Nov 27, 2025 · Updated 3 months ago
bigcode-project / bigcode-dataset
☆489 · Aug 15, 2024 · Updated last year
ArtificialZeng / ChatGLM-Efficient-Tuning-Explained
☆23 · Jul 17, 2023 · Updated 2 years ago
INK-USC / ReCross
ReCross: Unsupervised Cross-Task Generalization via Retrieval Augmentation
☆24 · May 1, 2022 · Updated 3 years ago
HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆123 · May 6, 2025 · Updated 9 months ago
shizhediao / Post-Training-Data-Flywheel
We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs.
☆61 · Oct 3, 2024 · Updated last year
SIMONLQY / RethinkMCTS
☆31 · Oct 2, 2024 · Updated last year
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆32 · Jan 23, 2025 · Updated last year