adetion / txtfilemerge
TXT text corpus data cleaning: (1) merge TXT files; (2) filter out noise strings; (3) mask the names of people, places, and organizations; (4) convert other encodings to UTF-8
☆18 Updated 2 years ago
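As a rough illustration of the four cleaning steps named in the description above, here is a minimal Python sketch. It is not the txtfilemerge implementation: the function names, the candidate encoding list, the noise patterns, and the dictionary-based masking (a caller-supplied list of entity strings standing in for real named entity recognition) are all assumptions made for this example.

```python
import re
from pathlib import Path

# Hypothetical defaults; none of these come from txtfilemerge itself.
CANDIDATE_ENCODINGS = ("utf-8-sig", "gb18030")
NOISE_PATTERNS = (r"https?://\S+", r"<[^>]+>")  # URLs and HTML-like tags


def decode_bytes(raw: bytes) -> str:
    """Try each candidate encoding in order; fall back to lossy UTF-8."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def clean_text(text, entities=(), mask="***"):
    """Remove noise strings, then mask each listed entity string."""
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, "", text)
    # Longest first, so a short entity never splits a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, mask)
    return text


def merge_txt_files(src_dir, out_path, entities=()):
    """Merge every .txt file under src_dir into one UTF-8 file."""
    out_path = Path(out_path)
    # Collect inputs before opening the output, and skip the output itself.
    paths = sorted(
        p for p in Path(src_dir).rglob("*.txt")
        if p.resolve() != out_path.resolve()
    )
    with out_path.open("w", encoding="utf-8") as out:
        for path in paths:
            text = decode_bytes(path.read_bytes())
            out.write(clean_text(text, entities).strip() + "\n")
    return len(paths)
```

Trying encodings in a fixed order is only a heuristic, since some byte sequences decode without error under more than one encoding. A real pipeline would likely use an encoding detector and an NER model in place of the fixed lists assumed here.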
Alternatives and similar repositories for txtfilemerge
Users interested in txtfilemerge are comparing it to the repositories listed below
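- Hate speech corpus ☆22 Updated 2 years ago
- Chinese text similarity calculator ☆155 Updated 10 months ago
- CINO: Pre-trained Language Models for Chinese Minority (pre-trained models for minority languages) ☆249 Updated last month
- MiniRBT (a series of small Chinese pre-trained models) ☆287 Updated last month
- Sample crawlers for scraping various data sources (Baidu Baike, Zhihu, Weibo, Jianshu, Sogou lexicon), useful for collecting natural language processing corpora ☆12 Updated last month
- A Chinese mental health support Q&A dataset with rich annotations of support strategies. It can be used to generate long counseling texts that are rich in support strategies. ☆223 Updated last year
- A collection of the currently available open-source Chinese dialogue datasets ☆171 Updated 2 years ago
- YAYI information extraction large model: instruction-tuned on million-scale, manually constructed, high-quality information extraction data; developed by the Zhongke Wenge (中科闻歌) algorithm team. (Repo for YAYI Unified Information Extraction Model) ☆308 Updated last year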
- We released BERT-wwm, a Chinese pre-training model based on Whole Word Masking technology, and models closely related to this technology.… ☆62 Updated 2 years ago
- Minimal keyword extraction with BERT ☆87 Updated 3 years ago
- SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding ☆226 Updated last year
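- A collection of tools for fine-tuning large models ☆26 Updated last year
- "Taoli" (桃李): a large language model for international Chinese education ☆183 Updated last year
- A Chinese medical ChatGPT based on LLaMa, trained on a large-scale pretraining corpus and a multi-turn dialogue dataset. ☆371 Updated last year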
- Mimix: A Text Generation Tool and Pretrained Chinese Models ☆157 Updated 9 months ago
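- Alpaca Chinese Dataset -- a Chinese instruction fine-tuning dataset ☆213 Updated 10 months ago
- Making NLP something everyone can do. Open source is not easy, so remember to star! ☆102 Updated 2 years ago
- Customized fine-tuning on top of open-source Chinese large models, so that you can have your own dedicated language model. ☆50 Updated 2 years ago
- PyTorch implementation of the PaddleNLP UIE model ☆644 Updated 2 years ago
- (1) Rotary position embedding encoder with elastic-interval normalization plus peft LoRA quantized training, improving performance support for tens of thousands of tokens. (2) Evidence-theory interpretive learning, improving the model's complex logical reasoning ability. (3) Compatible with the alpaca data format. ☆45 Updated 2 years ago
- clueai toolkit: 3 lines of code, 3 minutes, customize the API you need! ☆233 Updated 2 years ago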
- text analysis, supporting multiple methods including word count, readability, document similarity, sentiment analysis, Word2Vec/GloVe, an… ☆362 Updated 3 months ago
- Continued pre-training of Chinese BERT ☆31 Updated 4 years ago
- Code & Data for our Paper "NaSGEC: Multi-Domain Chinese Grammatical Error Correction for Native Speaker Texts" (ACL 2023 Findings) ☆92 Updated 5 months ago
- [COLING 2022] CSL: A Large-scale Chinese Scientific Literature Dataset ☆637 Updated 2 years ago
- [EMNLP 2024] MeChat: a Chinese mental health dialogue large model ☆483 Updated 8 months ago
- ChatGLM2-6B fine-tuning, SFT/LoRA, instruction finetune ☆109 Updated 2 years ago
- ☆388 Updated 3 weeks ago
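- ChatGPT WebUI using gradio. Provides a simple, easy-to-use web UI for LLM chat and RAG knowledge-retrieval Q&A ☆132 Updated 11 months ago
- A simple, fast tool for word segmentation and named entity recognition ☆606 Updated 2 weeks ago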