Cleaning and quality evaluation of Chinese corpora for large-model pre-training
☆76 Jul 25, 2024 Updated last year
Alternatives and similar repositories for llm_corpus_quality
Users that are interested in llm_corpus_quality are comparing it to the libraries listed below
- TextCNN for advertising detection ☆11 Jan 12, 2024 Updated 2 years ago
- Converts Paddle models (OCR and others) to ONNX format, then loads the ONNX models with DJL, the Java deep learning framework, to try out inference. ☆13 Nov 15, 2022 Updated 3 years ago
- Chinese sentence punctuation prediction. ☆29 Oct 19, 2022 Updated 3 years ago
- Text security audit: semantic-model filtering and a sensitive-content detection system ☆38 Feb 14, 2025 Updated last year
- Intelligent text automatic processing tool. AutoText's main functions include text error correction, image OCR, layout detection, and table structure recognition. The main functions of this project include … ☆27 May 17, 2023 Updated 2 years ago
- Chinese layout detection: java-yolov8 is used to detect the layout of Chinese document images ☆27 May 5, 2023 Updated 2 years ago
- TABLE DETECTION IN IMAGES AND OCR TO CSV WITH JAVA ☆10 Jul 18, 2023 Updated 2 years ago
- A demo of a PDF parser (including OCR and object detection tools) ☆36 Oct 14, 2024 Updated last year
- Key information extraction from images of cards, certificates, and receipts with an LLM (large language model). Basically, it can extract key information from … ☆16 Jul 22, 2024 Updated last year
- Build a simple, basic multimodal large model from scratch. 🤖 ☆47 Jun 19, 2024 Updated last year
- Chinese dialogue data cleaning ☆32 Nov 8, 2022 Updated 3 years ago
- near-synonym, an LLM-based toolkit for Chinese antonyms and synonyms. It can also compute word, sentence, and text similarity. ☆31 Apr 29, 2025 Updated 10 months ago
- This project uses JNI to load the C++ compiled dll libraries of paddle-ocr, and uses springboot for web deployment and access. … ☆37 Dec 30, 2024 Updated last year
- Evaluation repository of Wikipedia index with Dria ☆10 Mar 14, 2024 Updated 2 years ago
- Title and keywords are used to generate text. ☆12 Dec 6, 2021 Updated 4 years ago
- The code for our ACL 2022 Findings paper: CRACSpell: A Contextual Typo Robust Approach with Copy Mechanism to Improve Chinese Spelling Cor… ☆77 May 16, 2022 Updated 3 years ago
- Spring Deep Java Library: integrates the DJL framework with other Spring frameworks for deep learning model training and inference. ☆24 Mar 16, 2022 Updated 4 years ago
- Video classification annotation and spatio-temporal video annotation ☆45 Aug 24, 2023 Updated 2 years ago
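- Enterprise gateway for large-model APIs: an internal API management, distribution, and aggregation system that converts multiple large models into a unified OpenAI-compatible interface, with special support for Chinese open-source models such as deepseek, qwen, kimi, and glm. It can be used by individuals or within enterprises for unified large-model API management and channel distribution (key management and redistribution). Updated over the long term, … ☆40 Sep 12, 2025 Updated 6 months ago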
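- Re-identification of campus bicycles at surveillance-camera image quality, including ReID model recognition, vector database retrieval, and a UI display ☆11 Feb 13, 2024 Updated 2 years ago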
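- Uses Java to analyze articles and display the results as a graph (mainly keyword extraction, entity extraction, dependency parsing, etc.). ☆12 Apr 14, 2023 Updated 2 years ago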
- Python script that can be embedded into a PyTorch model to print the model size. ☆19 Apr 19, 2021 Updated 4 years ago
- ☆363 Jun 13, 2024 Updated last year
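- This project aims to quickly detect and convert the encoding of large numbers of text files, to support data cleaning for the mnbvc corpus project ☆70 Oct 17, 2025 Updated 5 months ago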
- Code & Data for our Paper "NaSGEC: Multi-Domain Chinese Grammatical Error Correction for Native Speaker Texts" (ACL 2023 Findings) ☆98 Feb 18, 2025 Updated last year
- Uses classification and sensitive-word detection to run safety checks on the input and output of generative large models, identifying risky content as early as possible. … ☆28 Sep 9, 2024 Updated last year
- A minimal toolkit for Context Engineering: select, compress, and persist context with pure functions. ☆36 Jan 20, 2026 Updated 2 months ago
- Chinese layout detection: yolov8 is used to detect the layout of Chinese document images. ☆58 Apr 28, 2023 Updated 2 years ago
- Code examples for using Spring Boot ☆11 May 26, 2018 Updated 7 years ago
- ☆12 Oct 12, 2021 Updated 4 years ago
- An OpenGL (via libdrm) sample for rk3399 ARM Linux ☆21 Jul 6, 2017 Updated 8 years ago
- ☆23 Dec 8, 2022 Updated 3 years ago
- OCR, PDF to DOCX ☆23 Nov 4, 2022 Updated 3 years ago
- TXT text corpus data cleaning: (1) merge TXT files; (2) filter out noise strings; (3) mask person names, place names, and organization names; (4) convert other encodings to UTF-8 ☆19 Oct 14, 2022 Updated 3 years ago
- albert-fc for RE (Relation Extraction) in Chinese ☆19 Apr 24, 2023 Updated 2 years ago
- model2onnx: converts roberta and macbert models to ONNX format and runs inference. ☆19 Jul 13, 2022 Updated 3 years ago
- An LLM-based RAG project that implements both text-based RAG and knowledge-graph-based RAG ☆29 Dec 11, 2025 Updated 3 months ago
- ☆49 Dec 1, 2023 Updated 2 years ago
- Swin-Unet (Swin Transformer Unet) is used to recognize table structure in document images ☆28 Feb 23, 2024 Updated 2 years ago