lmmlzn/Awesome-LLMs-Datasets

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/lmmlzn/Awesome-LLMs-Datasets)

lmmlzn / Awesome-LLMs-Datasets

Summarize existing representative LLMs text datasets.

☆1,478

Alternatives and similar repositories for Awesome-LLMs-Datasets

Users that are interested in Awesome-LLMs-Datasets are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

Zjh-819 / LLMDataHub
View on GitHub
A quick guide (especially) for trending instruction finetuning datasets
☆3,408Nov 28, 2023Updated 2 years ago
open-compass / opencompass
View on GitHub
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, …
☆7,241Updated this week
mlabonne / llm-datasets
View on GitHub
Curated list of datasets and tools for post-training.
☆4,712Apr 29, 2026Updated 3 months ago
OpenRLHF / OpenRLHF
View on GitHub
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,855Jul 14, 2026Updated 2 weeks ago
magesh-technovator / awesome-ai-applications
View on GitHub
A Comprehensive survey on business use cases of AI that help them thrive in the digital economy
☆13Oct 7, 2020Updated 5 years ago
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
hiyouga / LlamaFactory
View on GitHub
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
☆73,582Updated this week
EleutherAI / lm-evaluation-harness
View on GitHub
A framework for few-shot evaluation of language models.
☆13,443Jul 13, 2026Updated 2 weeks ago
Tongji-KGLLM / RAG-Survey
View on GitHub
☆2,140May 8, 2024Updated 2 years ago
BradyFU / Awesome-Multimodal-Large-Language-Models
View on GitHub
Latest Advances on Multimodal Large Language Models
☆17,958Updated this week
yangjianxin1 / Firefly
View on GitHub
Firefly: 大模型训练工具，支持训练Qwen2.5、Qwen2、Yi1.5、Phi-3、Llama3、Gemma、MiniCPM、Yi、Deepseek、Orion、Xverse、Mixtral-8x7B、Zephyr、Mistral、Baichuan2、Llma2、…
☆6,649Oct 24, 2024Updated last year
onejune2018 / Awesome-LLM-Eval
View on GitHub
Awesome-LLM-Eval: a curated list of tools, datasets/benchmark, demos, leaderboard, papers, docs and models, mainly for Evaluation on LLMs…
☆654Nov 24, 2025Updated 8 months ago
RUCAIBox / LLMSurvey
View on GitHub
The official GitHub page for the survey paper "A Survey of Large Language Models".
☆12,196Mar 11, 2025Updated last year
hijkzzz / Awesome-LLM-Strawberry
View on GitHub
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.
☆6,893Dec 17, 2025Updated 7 months ago
PhoebusSi / Alpaca-CoT
View on GitHub
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791Dec 12, 2023Updated 2 years ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
esbatmop / MNBVC
View on GitHub
MNBVC(Massive Never-ending BT Vast Chinese corpus)超大规模中文语料集。对标chatGPT训练的40T数据。MNBVC数据集不但包括主流文化，也包括各个小众文化甚至火星文的数据。MNBVC数据集包括新闻、作文、小说、书籍、杂志…
☆4,244Jul 13, 2026Updated 2 weeks ago
AiHubCN / Awesome-Chinese-LLM
View on GitHub
整理开源的中文大语言模型，以规模较小、可私有化部署、训练成本较低的模型为主，包括底座模型，垂直领域微调及应用，数据集与教程等。
☆22,701May 10, 2026Updated 2 months ago
ZigeW / data_management_LLM
View on GitHub
Collection of training data management explorations for large language models
☆343Aug 2, 2024Updated last year
FlagOpen / FlagEmbedding
View on GitHub
Retrieval and Retrieval-augmented LLMs
☆11,990Apr 22, 2026Updated 3 months ago
multimodal-art-projection / MAP-NEO
View on GitHub
☆986Feb 7, 2025Updated last year
verl-project / verl
View on GitHub
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆22,699Updated this week
yaodongC / awesome-instruction-dataset
View on GitHub
A collection of open-source dataset to train instruction-following LLMs (ChatGPT,LLaMA,Alpaca)
☆1,153Jan 4, 2024Updated 2 years ago
wasiahmad / Awesome-LLM-Synthetic-Data
View on GitHub
A reading list on LLM based Synthetic Data Generation 🔥
☆1,547Jun 5, 2025Updated last year
LianjiaTech / BELLE
View on GitHub
BELLE: Be Everyone's Large Language model Engine（开源中文对话大模型）
☆8,279Oct 16, 2024Updated last year
Open source password manager - Proton Pass • Ad
Securely store, share, and autofill your credentials with Proton Pass, the end-to-end encrypted password manager trusted by millions.
WooooDyy / LLM-Agent-Paper-List
View on GitHub
The paper list of the 86-page SCIS cover paper "The Rise and Potential of Large Language Model Based Agents: A Survey" by Zhiheng Xi et a…
☆8,169Sep 12, 2025Updated 10 months ago
InternLM / InternLM
View on GitHub
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
☆7,249Oct 30, 2025Updated 8 months ago
yizhongw / self-instruct
View on GitHub
Aligning pretrained language models with instruction data generated by themselves.
☆4,607Mar 27, 2023Updated 3 years ago
THUDM / LongBench
View on GitHub
LongBench v2 and LongBench (ACL 25'&24')
☆1,215Jan 15, 2025Updated last year
datajuicer / data-juicer
View on GitHub
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
☆6,795Updated this week
hymie122 / RAG-Survey
View on GitHub
Collecting awesome papers of RAG for AIGC. We propose a taxonomy of RAG foundations, enhancements, and applications in paper "Retrieval-…
☆1,788Aug 20, 2024Updated last year
NVIDIA / Megatron-LM
View on GitHub
Ongoing research training transformer models at scale
☆17,247Updated this week
InternLM / xtuner
View on GitHub
A Next-Generation Training Engine Built for Ultra-Large MoE Models
☆5,166Updated this week
Hannibal046 / Awesome-LLM
View on GitHub
Awesome-LLM: a curated list of Large Language Model
☆27,203Jul 31, 2025Updated 11 months ago
Deploy open-source AI quickly and easily - Special Bonus Offer • Ad
Runpod Hub is built for open source. One-click deployment and autoscaling endpoints without provisioning your own infrastructure.
modelscope / ms-swift
View on GitHub
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-V4, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL…
☆14,974Updated this week
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,237Jul 22, 2026Updated last week
huggingface / open-r1
View on GitHub
Fully open reproduction of DeepSeek-R1
☆26,410Apr 2, 2026Updated 3 months ago
allenai / OLMo
View on GitHub
Modeling, training, eval, and inference code for OLMo
☆6,609Nov 24, 2025Updated 8 months ago
Instruction-Tuning-with-GPT-4 / GPT-4-LLM
View on GitHub
Instruction Tuning with GPT-4
☆4,333Jun 11, 2023Updated 3 years ago
huggingface / peft
View on GitHub
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
☆21,460Updated this week
huggingface / alignment-handbook
View on GitHub
Robust recipes to align language models with human and AI preferences
☆5,651May 26, 2026Updated 2 months ago