Cleaning and quality assessment of Chinese corpora for large-model pre-training
☆81 · Jul 25, 2024 · Updated last year
Alternatives and similar repositories for llm_corpus_quality
Users that are interested in llm_corpus_quality are comparing it to the libraries listed below.
- TextCNN for advertising detection ☆11 · Jan 12, 2024 · Updated 2 years ago
- Chinese sentence punctuation prediction ☆29 · Oct 19, 2022 · Updated 3 years ago
- Text security audit: semantic-model filtering and a sensitive-content detection system ☆39 · Feb 14, 2025 · Updated last year
- AutoText, an intelligent automatic text processing tool. Its main functions include text error correction, image OCR, layout detection, and table structure recognition. The main functions of this project include … ☆27 · May 17, 2023 · Updated 3 years ago
- Chinese layout detection with java-yolov8: detects the layout of Chinese document images ☆27 · May 5, 2023 · Updated 3 years ago
- A demo PDF parser (including OCR and object detection tools) ☆36 · Oct 14, 2024 · Updated last year
- t5-model-onnx: Chinese spelling correction ☆15 · Sep 18, 2022 · Updated 3 years ago
- Intent classification with deep networks ☆11 · Feb 26, 2021 · Updated 5 years ago
- Spark tutorial for big data mining. Covers app traffic operations analysis, ALS recommendation, SMOTE sampling, RFM customer value segmentation, AHP (analytic hierarchy process) customer value scoring, business district mining from mobile location data, Markov-based smart email prediction, time series forecasting, association rules, movie and friend recommendation, and more. ☆40 · Sep 10, 2022 · Updated 3 years ago
- MacBERT for Chinese Spelling Correction ☆16 · May 23, 2022 · Updated 3 years ago
- X-Trainer collaborative arm platform (±0.05 mm) with VR/gamepad teleop data adapters and NVIDIA GPU-accelerated simulation. ☆44 · Mar 27, 2026 · Updated last month
- Chinese dialogue data cleaning ☆32 · Nov 8, 2022 · Updated 3 years ago
- Official repository of Graph RAG-Tool Fusion and ToolLinkOS dataset. ☆23 · Feb 13, 2025 · Updated last year
- near-synonym: an LLM-based toolkit for Chinese antonyms and synonyms. It can also compute word, sentence, and text similarity. ☆31 · Apr 29, 2025 · Updated last year
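- This project uses JNI to load the C++-compiled DLL libraries of paddle-ocr and uses Spring Boot for web deployment and access. ☆37 · Dec 30, 2024 · Updated last year
- Generates text from a title and keywords. ☆12 · Dec 6, 2021 · Updated 4 years ago
- Spring Deep Java Library: integrates the DJL framework with other Spring frameworks for deep learning model training and inference. ☆24 · Mar 16, 2022 · Updated 4 years ago
- ☆12 · Mar 2, 2020 · Updated 6 years ago
- Enterprise gateway for LLM APIs: an internal API management, distribution, and aggregation system. Converts multiple large models into a unified OpenAI-compatible interface, with special support for the Chinese open-source models deepseek, qwen, kimi, and glm. Suitable for individuals or enterprises for unified LLM API management and channel distribution (key management and redistribution). Updated long-term, … ☆40 · Sep 12, 2025 · Updated 8 months ago
- An intelligent medical literature analysis tool built on the Model Context Protocol (MCP). It aims to help researchers, medical practitioners, and students quickly search the PubMed database and use large language models (LLMs) to analyze and summarize literature abstracts. ☆10 · May 18, 2025 · Updated last year
- Chinese datasets for AI training (continuously updated) and a map of AI companies. Current datasets: 8,000 catering-industry questions, Baidu Zhidao, Alpaca Chinese dataset, computer-science domain dataset, Vicuna dataset, RedPajama dataset, Chinese Wikipedia entries dataset, and website/forum Q&A dataset. ☆65 · Nov 29, 2023 · Updated 2 years ago
- ☆365 · Jun 13, 2024 · Updated last year
- This project performs fast encoding detection and conversion on large numbers of text files to support data cleaning for the mnbvc corpus project. ☆71 · Oct 17, 2025 · Updated 7 months ago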
- Code & Data for our Paper "NaSGEC: Multi-Domain Chinese Grammatical Error Correction for Native Speaker Texts" (ACL 2023 Findings) ☆97 · Feb 18, 2025 · Updated last year
- A command-line version of 飞秋 (FeiQ) implemented in Go ☆13 · Nov 26, 2018 · Updated 7 years ago
- Spring Boot usage example code ☆11 · May 26, 2018 · Updated 7 years ago
- ☆12 · Oct 12, 2021 · Updated 4 years ago
- ☆22 · Dec 8, 2022 · Updated 3 years ago
- ☆48 · Mar 21, 2022 · Updated 4 years ago
- Goal: build a small, elegant, more linguistically grounded llama tokenizer that supports Chinese, English, and Japanese ☆20 · Jun 2, 2024 · Updated last year
- Kaggle AIMO2 solution with token-efficient reasoning LLM recipes ☆50 · Aug 7, 2025 · Updated 9 months ago
- Python implementation of Lloyd-Max quantizer. ☆24 · Apr 4, 2021 · Updated 5 years ago
- An LLM-based RAG project that implements both text-based and knowledge-graph-based RAG ☆29 · Dec 11, 2025 · Updated 5 months ago
- Generates the large volumes of data needed by text error correction models such as Gector. ☆14 · Jan 5, 2023 · Updated 3 years ago
- A simple Pomodoro Timer written in Tauri and React. ☆13 · Nov 9, 2023 · Updated 2 years ago
- ☆51 · Dec 1, 2023 · Updated 2 years ago
- Query-document matching and search using TF-IDF, LSA, and doc2vec from sklearn and gensim ☆21 · Sep 11, 2022 · Updated 3 years ago
- Gecco is a lightweight, easy-to-use web crawler written in Java. It integrates frameworks such as jsoup, httpclient, fastjson, spring, htmlunit, and redission, so you can write a crawler quickly by configuring a few jQuery-style selectors. The Gecco framework has … ☆12 · Mar 9, 2017 · Updated 9 years ago
- This repository open-sources our GEC system submitted by THU KELab (sz) in the CCL2023-CLTC Track 1: Multidimensional Chinese Learner Tex… ☆15 · Nov 25, 2023 · Updated 2 years ago