jiangnanboy/llm_corpus_quality

jiangnanboy / llm_corpus_quality

Chinese corpus cleaning and quality assessment for large-model pre-training

☆80

Alternatives and similar repositories for llm_corpus_quality

Users who are interested in llm_corpus_quality are comparing it to the repositories listed below.

jiangnanboy / doc_ai
Converts OCR and other models from Paddle to ONNX format and loads them with DJL, the Java deep learning framework, to experiment with inference.
☆14 · Nov 15, 2022 · Updated 3 years ago
jiangnanboy / text_security_audit
Text security audit: a detection system that filters sensitive content using semantic models.
☆39 · Feb 14, 2025 · Updated last year
jiangnanboy / llm_security
Uses classification and sensitive-word detection to run safety checks on the input and output of generative large models, identifying risky content as early as possible.
☆28 · Sep 9, 2024 · Updated last year
jiangnanboy / punctuation_prediction
Chinese sentence punctuation prediction.
☆29 · Oct 19, 2022 · Updated 3 years ago
jiangnanboy / AutoText
Intelligent automatic text processing tool. AutoText's main functions include text error correction, image OCR, layout detection, and table structure recognition.
☆27 · May 17, 2023 · Updated 3 years ago
hiyoung123 / DuplicateRemove
A SimHash-based text deduplication algorithm.
☆20 · Jun 18, 2021 · Updated 5 years ago
jiangnanboy / layout_analysis4j
Uses java-yolov8 to detect the layout of Chinese document images.
☆27 · May 5, 2023 · Updated 3 years ago
jiangnanboy / table_ocr_java
Table detection in images and OCR to CSV with Java.
☆10 · Jul 18, 2023 · Updated 3 years ago
jiangnanboy / onnx-java
onnx-java: loads ONNX models in Java and runs inference.
☆21 · May 19, 2022 · Updated 4 years ago
WalkerMitty / PDFparser
Here is a demo for PDF parser (Including OCR, object detection tools)
☆36 · Oct 14, 2024 · Updated last year
jiangnanboy / spark_data_mining
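Spark tutorial for big data mining. Covers app traffic operations analysis, ALS recommendation, SMOTE sampling, RFM customer value segmentation, AHP-based customer value scoring, business district mining from mobile location data, Markov-based smart email prediction, time series forecasting, association rules, movie and friend recommendation, and more.
☆40 · Sep 10, 2022 · Updated 3 years ago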
everks / dial-clean
Chinese dialogue data cleaning.
☆32 · Nov 8, 2022 · Updated 3 years ago
jiangnanboy / java-springboot-paddleocr-v2
This project uses JNI to load the C++-compiled DLL libraries of paddle-ocr and uses Spring Boot for web deployment and access.
☆37 · Dec 30, 2024 · Updated last year
liushulinle / CRASpell
The code for our ACL2022 findings paper: CRASpell: A Contextual Typo Robust Approach with Copy Mechanism to Improve Chinese Spelling Cor…
☆77 · May 16, 2022 · Updated 4 years ago
jiangnanboy / text_generation
Title and keywords are used to generate text.
☆12 · Dec 6, 2021 · Updated 4 years ago
rwv / caj2pdf-go
📚 A Go port for caj2pdf/caj2pdf
☆10 · Feb 23, 2023 · Updated 3 years ago
panda-run / spring-djl
Spring Deep Java Library: integrates the DJL framework with other Spring frameworks for deep learning model training and inference.
☆24 · Mar 16, 2022 · Updated 4 years ago
jiangnanboy / text_grapher
Analyzes articles in Java and displays the results as a graph (mainly keyword extraction, entity extraction, dependency parsing, etc.).
☆12 · Apr 14, 2023 · Updated 3 years ago
jsfsds / pubmed_search
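An intelligent medical literature analysis tool built on the Model Context Protocol (MCP). It aims to help researchers, medical practitioners, and students quickly search the PubMed database and use large language models (LLMs) to analyze and summarize literature abstracts.
☆10 · Jun 17, 2026 · Updated last month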
HillZhang1999 / NaSGEC
Code & Data for our Paper "NaSGEC: Multi-Domain Chinese Grammatical Error Correction for Native Speaker Texts" (ACL 2023 Findings)
☆96 · Feb 18, 2025 · Updated last year
FlagOpen / FlagData
☆364 · Jun 13, 2024 · Updated 2 years ago
adetion / txtfilemerge
TXT text corpus data cleaning: (1) merge TXT files; (2) filter out noise strings; (3) mask person, place, and organization names; (4) convert other encodings to UTF-8.
☆19 · Oct 14, 2022 · Updated 3 years ago
shuliu586 / AI_Chinese_DataSet_KnowledgeDAO
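Chinese datasets for AI training (continuously updated) and a map of AI companies. Current datasets: 8,000 catering-industry questions, Baidu Zhidao, the Chinese Alpaca dataset, a computer-science domain dataset, the Vicuna dataset, the RedPajama dataset, Chinese Wikipedia entries, and website forum Q&A.
☆66 · Nov 29, 2023 · Updated 2 years ago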
alanshi / charset_mnbvc
This project aims to quickly detect and convert the encodings of large numbers of text files to support data cleaning for the mnbvc corpus project.
☆70 · Oct 17, 2025 · Updated 9 months ago
jiangnanboy / albert_re
albert-fc for RE (Relation Extraction): Chinese relation extraction.
☆20 · Apr 24, 2023 · Updated 3 years ago
skyobin / logwaf
A WAF that uses nginx Lua for front-end defense and Hadoop for user behavior analysis.
☆11 · Nov 17, 2016 · Updated 9 years ago
zhangzg1 / rag-llm
An LLM-based RAG project that implements both text-based and knowledge-graph-based RAG.
☆28 · Jul 3, 2026 · Updated 2 weeks ago
masr2000 / CLG-CGEC
☆51 · Dec 1, 2023 · Updated 2 years ago
qzw1210 / geeco
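Gecco is a lightweight, easy-to-use web crawler written in Java. Gecco integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redission, so you can write a crawler quickly by configuring a few jQuery-style selectors. The Gecco framework has…
☆12 · Mar 9, 2017 · Updated 9 years ago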
liwenju0 / error_text_gen
Generates the large amounts of data needed by text error correction models (such as Gector).
☆15 · Jan 5, 2023 · Updated 3 years ago
jiangnanboy / python_search
Query-document matching search using TF-IDF, LSA, and doc2vec from sklearn and gensim.
☆21 · Sep 11, 2022 · Updated 3 years ago
jiangnanboy / albert_link_prediction
albert-fc for LP (Link Prediction): Chinese entity link prediction.
☆19 · Apr 21, 2023 · Updated 3 years ago
THUKElab / CCL2023-CLTC-THU_KELab
This repository open-sources our GEC system submitted by THU KELab (sz) in the CCL2023-CLTC Track 1: Multidimensional Chinese Learner Tex…
☆15 · Nov 25, 2023 · Updated 2 years ago
xlxwalex / FCGEC
The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model
☆122 · Apr 12, 2026 · Updated 3 months ago
dhcode-cpp / Engram-pytorch
PyTorch implementation of DeepSeek Engram
☆19 · Mar 24, 2026 · Updated 3 months ago
XuRui314 / GLM4v-Finetune
Support finetuning GLM4v with zero2
☆16 · Jun 29, 2024 · Updated 2 years ago
Alpha-Innovator / DocParser
☆18 · Jan 13, 2025 · Updated last year
qinxuewu / alibaba-cloud
Spring Cloud Alibaba series learning examples.
☆11 · Nov 3, 2019 · Updated 6 years ago
jiangnanboy / jcorrector
jcorrector: a Chinese text error correction tool with spelling check.
☆82 · Mar 2, 2026 · Updated 4 months ago