yuleiqin/fantastic-data-engineering

Fantastic Data Engineering for Large Language Models

☆92

Alternatives and similar repositories for fantastic-data-engineering

Users who are interested in fantastic-data-engineering are comparing it to the repositories listed below.

GX-XinGao / GRA
The Code and Script of "David's Slingshot: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis"
☆34 · Jun 13, 2025 · Updated last year
HypherX / Evolution-Analysis
☆25 · Dec 13, 2024 · Updated last year
LHL3341 / MetaLadder
MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer (EMNLP 2025)
☆12 · Apr 18, 2025 · Updated last year
Leey21 / CipherBank
☆14 · Jun 13, 2025 · Updated last year
xiatingyu / SFT-DataSelection-at-scale
☆34 · Feb 9, 2025 · Updated last year
Bolin97 / awesome-instruction-selector
Paper list and datasets for the paper: A Survey on Data Selection for LLM Instruction Tuning
☆48 · Jan 22, 2026 · Updated 6 months ago
2003pro / TAGCOS
This is the official implementation of TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
☆13 · Jul 21, 2024 · Updated 2 years ago
opendatalab / REST
☆34 · Jul 15, 2025 · Updated last year
zmzhang2000 / trustworthy-alignment
Official repository for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning
☆12 · Sep 2, 2024 · Updated last year
tianyi-lab / Superfiltering
[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
☆189 · Jun 25, 2025 · Updated last year
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆261 · Apr 29, 2025 · Updated last year
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆600 · Dec 9, 2024 · Updated last year
ZifanL / TSDS
Implementation of TSDS: Data Selection for Task-Specific Model Finetuning. An optimal-transport framework for selecting domain-specific a…
☆19 · Dec 25, 2024 · Updated last year
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆875 · Mar 17, 2025 · Updated last year
QizhiPei / MathFusion
MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion (ACL 2025)
☆37 · Jul 16, 2025 · Updated last year
dual-view-molecule-pretraining / dmp
☆11 · Jun 4, 2021 · Updated 5 years ago
luojie1024 / WeatherBot
A simple Chinese weather robot (chatbot) built using the Rasa technology stack (Rasa 2.0, Rasa X, Slack)
☆29 · Dec 26, 2020 · Updated 5 years ago
icip-cas / SSO
A scalable automated alignment method for large language models. Resources for "Aligning Large Language Models via Self-Steering Optimiza…
☆20 · Nov 21, 2024 · Updated last year
meowpass / FollowComplexInstruction
Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…
☆55 · Jun 24, 2024 · Updated 2 years ago
TransfromerMeetsGraph / GNNLearner
Solution of KDD cup 2021
☆11 · Jun 16, 2021 · Updated 5 years ago
Leey21 / Awesome-Long-CoT-Data
Awesome Long-CoT Data
☆22 · Mar 26, 2025 · Updated last year
yale-nlp / refdpo
☆16 · Jul 23, 2024 · Updated 2 years ago
wangyuxinwhy / lmclient
Python client designed specifically for large-scale requests to the OpenAI interface
☆23 · Feb 29, 2024 · Updated 2 years ago
pkulcwmzx / knowledge-boundary
[ACL 2024] Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation
☆10 · May 26, 2024 · Updated 2 years ago
Leey21 / A-Data-Centric-Study
☆18 · Mar 2, 2026 · Updated 4 months ago
tigerchen52 / awesome_role_of_small_models
A curated list of the role of small models in the LLM era
☆111 · Sep 23, 2024 · Updated last year
GPT-Alternatives / gpt_alternatives
☆75 · Oct 21, 2023 · Updated 2 years ago
icip-cas / awesome-auto-alignment
Collection of papers for scalable automated alignment.
☆92 · Oct 22, 2024 · Updated last year
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆288 · Aug 20, 2023 · Updated 2 years ago
maitrix-org / dynamic-alignment-optimization
[EMNLP'24 (Main)] DRPO (Dynamic Rewarding with Prompt Optimization) is a tuning-free approach for self-alignment. DRPO leverages a search-…
☆24 · Nov 17, 2024 · Updated last year
Rayrtfr / FasterTransformer
Transformer related optimization, including BERT, GPT
☆17 · Jul 29, 2023 · Updated 3 years ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆532 · Oct 20, 2024 · Updated last year
savadikarc / wegeft
WeGeFT: Weight‑Generative Fine‑Tuning for Multi‑Faceted Efficient Adaptation of Large Models
☆23 · Jul 10, 2025 · Updated last year
RUCAIBox / Slow_Thinking_with_LLMs
A series of technical reports on Slow Thinking with LLMs
☆767 · Aug 13, 2025 · Updated 11 months ago
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆106 · Mar 6, 2025 · Updated last year
instance-wise-ordered-transformer / IOT
☆20 · Feb 26, 2021 · Updated 5 years ago
yangliuy / Intent-Aware-Ranking-Transformers
Code on IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems (WWW 2020)
☆11 · Apr 18, 2021 · Updated 5 years ago
shizhediao / Post-Training-Data-Flywheel
We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs.
☆66 · Oct 3, 2024 · Updated last year
HCY123902 / atg-w-fg-rw
☆10 · May 27, 2024 · Updated 2 years ago