Chinese corpus cleaning and quality evaluation for large-model pre-training
☆79 · Jul 25, 2024 · Updated last year
Alternatives and similar repositories for llm_corpus_quality
Users who are interested in llm_corpus_quality are comparing it to the libraries listed below.
- Text security audit: semantic-model filtering and a sensitive-content detection system ☆38 · Feb 14, 2025 · Updated last year
- Text deduplication algorithm based on SimHash ☆20 · Jun 18, 2021 · Updated 4 years ago
- TABLE DETECTION IN IMAGES AND OCR TO CSV WITH JAVA ☆10 · Jul 18, 2023 · Updated 2 years ago
- A demo PDF parser (including OCR and object detection tools) ☆36 · Oct 14, 2024 · Updated last year
- t5-model-onnx: Chinese spelling correction. ☆15 · Sep 18, 2022 · Updated 3 years ago
- Intent classification implemented with deep networks. ☆11 · Feb 26, 2021 · Updated 5 years ago
- Spark tutorial for big data mining. Covers app traffic operations analysis, ALS recommendation, SMOTE sampling, RFM customer value segmentation, AHP (analytic hierarchy process) customer value scoring, business-district mining from mobile location data, Markov-based smart email prediction, time-series forecasting, association rules, movie and friend recommendation, and more. ☆40 · Sep 10, 2022 · Updated 3 years ago
- MacBERT for Chinese Spelling Correction ☆16 · May 23, 2022 · Updated 3 years ago
- Build a simple basic multimodal large model from scratch. 🤖 ☆48 · Jun 19, 2024 · Updated last year
- Official repository of Graph RAG-Tool Fusion and ToolLinkOS dataset. ☆23 · Feb 13, 2025 · Updated last year
- near-synonym: an LLM-based toolkit for Chinese antonyms and synonyms. It can also compute word, sentence, and text similarity. ☆31 · Apr 29, 2025 · Updated 11 months ago
- The code for our ACL2022 findings paper: CRACSpell: A Contextual Typo Robust Approach with Copy Mechanism to Improve Chinese Spelling Cor… ☆77 · May 16, 2022 · Updated 3 years ago
- Spring Deep Java Library: integrates the DJL framework with other Spring frameworks for deep learning model training and inference. ☆24 · Mar 16, 2022 · Updated 4 years ago
- 📚 A Go port of caj2pdf/caj2pdf ☆10 · Feb 23, 2023 · Updated 3 years ago
- ☆364 · Jun 13, 2024 · Updated last year
- This project performs fast encoding detection and conversion on large numbers of text files to support data cleaning for the MNBVC corpus project ☆70 · Oct 17, 2025 · Updated 5 months ago
- Code & Data for our Paper "NaSGEC: Multi-Domain Chinese Grammatical Error Correction for Native Speaker Texts" (ACL 2023 Findings) ☆97 · Feb 18, 2025 · Updated last year
- Uses classification and sensitive-word detection to run safety checks on the inputs and outputs of generative large models, identifying risky content as early as possible. ☆28 · Sep 9, 2024 · Updated last year
- Chinese layout detection: YOLOv8 is used to detect the layout of Chinese document images. ☆60 · Apr 28, 2023 · Updated 2 years ago
- Code related to Spring Boot usage ☆11 · May 26, 2018 · Updated 7 years ago
- ☆12 · Oct 12, 2021 · Updated 4 years ago
- OCR, PDF to DOCX ☆23 · Nov 4, 2022 · Updated 3 years ago
- Kaggle AIMO2 solution with token-efficient reasoning LLM recipes ☆46 · Aug 7, 2025 · Updated 8 months ago
- albert-fc for RE (Relation Extraction): Chinese relation extraction ☆19 · Apr 24, 2023 · Updated 2 years ago
- model2onnx: converts RoBERTa and MacBERT models to ONNX format and runs inference. ☆19 · Jul 13, 2022 · Updated 3 years ago
- An LLM-based RAG project that implements both text-based and knowledge-graph-based RAG ☆29 · Dec 11, 2025 · Updated 3 months ago
- A simple Pomodoro Timer written in Tauri and React. ☆13 · Nov 9, 2023 · Updated 2 years ago
- Generates the large amounts of data needed by text error-correction models (such as Gector). ☆14 · Jan 5, 2023 · Updated 3 years ago
- A WAF that uses nginx Lua for front-end defense and Hadoop for user behavior analysis ☆11 · Nov 17, 2016 · Updated 9 years ago
- Swin-Unet (Swin Transformer Unet) is used to recognize table structure in document images ☆27 · Feb 23, 2024 · Updated 2 years ago
- Query-to-document matching search using TF-IDF, LSA, and doc2vec from sklearn and gensim ☆21 · Sep 11, 2022 · Updated 3 years ago
- albert-fc for LP (Link Prediction): Chinese entity link prediction ☆19 · Apr 21, 2023 · Updated 2 years ago
- Build Neo4J Knowledge Graphs from Excel files ☆23 · Nov 18, 2024 · Updated last year
- A detailed look at well-known large tech companies' industry practice in search, recommendation, and advertising, along with cutting-edge papers and practical technical write-ups ☆20 · Mar 24, 2024 · Updated 2 years ago
- The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model ☆120 · Dec 10, 2024 · Updated last year
- ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark ☆51 · Sep 2, 2025 · Updated 7 months ago
- A plugin wrapper that launches Amap (Gaode), Tencent, and Baidu Maps from H5 pages for in-car navigation ☆10 · Dec 10, 2022 · Updated 3 years ago
- Using Seq2Seq transformers for Text2SQL task on WikiSQL dataset. ☆12 · Jan 8, 2022 · Updated 4 years ago
- Open-source Human Feedback Library ☆11 · Oct 25, 2023 · Updated 2 years ago