adetion / txtfilemerge
Text corpus data cleaning for TXT files: (1) merge TXT files; (2) filter out noise strings; (3) mask personal names, place names, and organization names; (4) convert other encodings to UTF-8.
☆18 Updated 2 years ago
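The four steps in the description can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the function names, the noise patterns, the list of candidate encodings, and the dictionary-based masking (a stand-in for real named entity recognition) are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical noise patterns (URLs, HTML-like tags); the repository's own
# filter list is not shown on this page.
NOISE_PATTERNS = [re.compile(r"https?://\S+"), re.compile(r"<[^>]+>")]

# Encodings tried in order (an assumption). "utf-8-sig" also accepts plain
# UTF-8 and drops a leading byte order mark if one is present.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030")


def read_text_any_encoding(path):
    """Decode a file by trying each candidate encoding in turn."""
    raw = Path(path).read_bytes()
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def filter_noise(text):
    """Remove every match of the noise patterns."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return text


def mask_terms(text, terms, mask="*"):
    """Replace each known name with a same-length run of mask characters.

    Longer terms are handled first so that a short term cannot split a
    longer one that contains it.
    """
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, mask * len(term))
    return text


def merge_txt_files(src_dir, out_path, terms=()):
    """Clean every *.txt file in src_dir and write one merged UTF-8 file."""
    parts = []
    for path in sorted(Path(src_dir).glob("*.txt")):
        text = read_text_any_encoding(path)
        parts.append(mask_terms(filter_noise(text), terms).strip())
    Path(out_path).write_text("\n".join(parts) + "\n", encoding="utf-8")
```

A call such as `merge_txt_files("corpus", "merged.txt", terms=["张三", "北京"])` would write one cleaned UTF-8 file. Keeping the output file outside the source directory avoids merging it back in on a second run.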
Alternatives and similar repositories for txtfilemerge
Users that are interested in txtfilemerge are comparing it to the libraries listed below
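- Sample crawlers for scraping various data sources (Baidu Baike, Zhihu, Weibo, Jianshu, Sogou lexicon), useful for collecting NLP corpora ☆12 Updated 2 months ago
- Hate speech corpus ☆23 Updated 2 years ago
- MiniRBT (a series of small Chinese pre-trained models) ☆292 Updated last month
- Chinese NLP resource collection: corpora, related frameworks, and articles. ☆26 Updated 3 years ago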
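- ☆157 Updated last year
- A classical Chinese poetry corpus crawled from the internet, covering poems from the pre-Qin period to the present day, 1,014,508 poems in total ☆34 Updated 3 years ago
- A collection of currently available open-source Chinese dialogue datasets ☆176 Updated 2 years ago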
- [COLING 2022] CSL: A Large-scale Chinese Scientific Literature Dataset ☆639 Updated 2 years ago
- CINO: Pre-trained Language Models for Chinese Minority (pre-trained models for ethnic minority languages) ☆252 Updated last month
- text analysis, supporting multiple methods including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, an… ☆362 Updated 4 months ago
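- Chinese text similarity calculator ☆158 Updated 11 months ago
- YAYI information extraction large model: instruction-tuned on millions of manually constructed, high-quality information extraction examples, developed by the Zhongke Wenge (中科闻歌) algorithm team. (Repo for YAYI Unified Information Extraction Model) ☆308 Updated last year
- People's Daily crawler (Python) ☆140 Updated last month
- A comprehensive platform for Chinese text correction that integrates academic research, model training, model evaluation, and inference deployment, covering two core areas: spelling correction and grammatical error correction. ☆376 Updated 2 weeks ago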
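- ☆390 Updated last month
- Tokenizing Chinese corpora with SentencePiece ☆12 Updated last year
- A PyTorch implementation of the PaddleNLP UIE model ☆645 Updated 2 years ago
- Parallel corpus for Classical Chinese to Modern Chinese translation ☆109 Updated 3 years ago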
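- Yuren 13B is an information synthesis large language model that has been continuously trained based on Llama 2 13B, which builds upon the… ☆15 Updated last year
- A simple and fast tool for word segmentation and named entity recognition ☆608 Updated last month
- A sentence embedding generation tool based on pre-trained models ☆138 Updated 2 years ago
- ChatGPT WebUI using gradio. A simple, easy-to-use web UI for LLM chat and retrieval-based knowledge question answering (RAG) ☆134 Updated last year
- Chinese dialogue data cleaning ☆30 Updated 2 years ago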
- Mimix: A Text Generation Tool and Pretrained Chinese Models ☆157 Updated 10 months ago
- An NLP package for Chinese text: Preprocessing, Tokenization, Chinese Fonts, Word Embeddings, Text Similarity and Sentiment Analysis. A lightweight Chinese… ☆30 Updated 10 months ago
- Making NLP something everyone can do. Open source is hard work, so remember to star! ☆101 Updated 2 years ago
- Custom fine-tuning on top of open-source Chinese large language models, to get a language model of your own. ☆50 Updated 2 years ago
- Resource collection for the "Digital Humanities Tutorial" ☆102 Updated last year
- Alpaca Chinese Dataset -- a Chinese instruction fine-tuning dataset ☆213 Updated 11 months ago
- PERT: Pre-training BERT with Permuted Language Model ☆365 Updated last month