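This project aims to quickly detect and convert the encodings of large numbers of text files, in support of data cleaning for the MNBVC corpus project.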
☆71 Oct 17, 2025 Updated 7 months ago
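To make "encoding detection and conversion" concrete, here is a minimal sketch using only the Python standard library. It is not charset_mnbvc's API or algorithm, which this page does not describe. The candidate list and the helper names `detect_encoding` and `to_utf8` are assumptions made for illustration. The sketch simply trial-decodes the bytes with each candidate in order and re-encodes the result as UTF-8.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list (an assumption, not taken from charset_mnbvc).
# Order matters: strict encodings such as UTF-8 go first, because permissive
# legacy codecs accept many byte sequences that are not really theirs.
CANDIDATES = ("utf-8", "gb18030", "big5")


def detect_encoding(data: bytes, candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def to_utf8(data: bytes, candidates: Sequence[str] = CANDIDATES) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (detected encoding, data re-encoded as UTF-8), or (None, None)."""
    name = detect_encoding(data, candidates)
    if name is None:
        return None, None
    return name, data.decode(name).encode("utf-8")


if __name__ == "__main__":
    sample = "\u4e2d\u6587".encode("gb18030")  # two Chinese characters in GB18030
    print(to_utf8(sample))
```

Trial decoding cannot reliably tell legacy encodings apart from one another, so a real cleaning pipeline would likely need statistical or heuristic checks on top of it.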
Alternatives and similar repositories for charset_mnbvc
Users that are interested in charset_mnbvc are comparing it to the libraries listed below.
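- Text deduplication ☆78 May 23, 2024 Updated last year
- MNBVC (Massive Never-ending BT Vast Chinese corpus), a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) text. The MNBVC dataset includes news, essays, novels, books, magazines… ☆4,193 May 5, 2026 Updated 2 weeks ago
- MNBVC project: ShareGPT corpus cleaning ☆16 Oct 4, 2023 Updated 2 years ago
- ☆33 Feb 9, 2025 Updated last year
- Implementing BERT + CRF with PyTorch for Chinese NER. ☆10 Mar 7, 2022 Updated 4 years ago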
- MD5 hashing using the NDK ☆32 Jun 2, 2016 Updated 9 years ago
- An LLM Mock Server that supports simulating the protocols of all LLM providers. ☆14 Oct 18, 2025 Updated 7 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆318 Mar 20, 2023 Updated 3 years ago
- Python implementation of AI-powered research assistant that performs iterative, deep research on any topic by combining search engines, w… ☆49 Mar 22, 2025 Updated last year
- A more efficient GLM implementation! ☆54 Feb 18, 2023 Updated 3 years ago
- Cleaning and quality assessment of Chinese corpora for large-model pre-training ☆81 Jul 25, 2024 Updated last year
- A purer tokenizer with a higher compression ratio ☆488 Nov 27, 2024 Updated last year
- Retrieves parquet files from Hugging Face, identifies and quantifies junky data, duplication, contamination, and biased content in datase… ☆53 Jul 6, 2023 Updated 2 years ago
- HyperFrames-fix: one-click generation of upload-ready videos that attract traffic. The goal is to adapt HyperFrames video generation to the needs of the Chinese domestic market: (1) fluent Chinese speech, (2) new Chinese short-video styles, (3) a streamlined end-to-end generation workflow. Fast-paced short videos in landscape or portrait format can now basically be produced with one click, for example from any article, a doc,… ☆81 May 7, 2026 Updated last week
- (ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning ☆19 Nov 22, 2025 Updated 5 months ago
- A lightweight script for processing HTML page to markdown format with support for code blocks ☆82 Apr 14, 2024 Updated 2 years ago
- INSET: Sentence Infilling with Inter-sentential Transformer ☆30 Nov 21, 2020 Updated 5 years ago
- Web archiving utility library ☆11 May 5, 2026 Updated 2 weeks ago
- ☆13 Jan 20, 2023 Updated 3 years ago
- Data and codes for BioBERT-MRC ☆11 Oct 5, 2021 Updated 4 years ago
- ☆17 Jul 25, 2025 Updated 9 months ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆71 Aug 5, 2025 Updated 9 months ago
- I-SHEEP: Iterative Self-enHancEmEnt Paradigm of LLMs through Self-Instruct and Self-Assessment ☆17 Jan 16, 2025 Updated last year
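- SeekMoney-ai: finds business opportunities across all video and social media channels. Multi-platform data collection (Douyin, Xiaohongshu, TikTok, Bilibili, WeChat Channels, YouTube); a semantic clustering algorithm based on OpenAI-compatible/GLM + embedding + DBSCAN; calls opena… ☆68 Mar 13, 2026 Updated 2 months ago
- All-in-one text de-duplication ☆759 Mar 9, 2026 Updated 2 months ago
- Steps for survival analysis in Python, including survival analysis (stepwise and univariate), KM curves, decision curves, ROC curves, and a comparison of training and test sample distributions ☆11 Dec 21, 2020 Updated 5 years ago
- Making large AI models cheaper, faster and more accessible ☆15 Apr 20, 2023 Updated 3 years ago
- Zhishu (智枢), a multimodal intelligent platform for emergency response and disaster mitigation. Building on the strongest disciplines of Harbin Institute of Technology, it deeply integrates multi-source heterogeneous data such as satellite remote sensing, industry distribution, IoT sensing, and social media, and builds a cluster of agents covering flood, meteorological, earthquake, and wildfire models to identify disasters precisely, quantify losses, and support disaster management, filling the gap in multi-agent catastrophe-model platforms in China ☆35 Aug 15, 2025 Updated 9 months ago
- Repo for "TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets" at SDU@AAAI-22 ☆15 Aug 3, 2023 Updated 2 years ago
- A list of advice on doing research that is useful for me :) ☆13 Aug 17, 2019 Updated 6 years ago
- Saito --> NEW REPOSITORY --> ☆12 Dec 31, 2025 Updated 4 months ago
- Graduation project: a cloud application platform integrating smart-city and meteorological data, based on AI + GraphCast ☆17 May 5, 2024 Updated 2 years ago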
- Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding ☆15 Feb 3, 2022 Updated 4 years ago
- ☆10 Oct 6, 2015 Updated 10 years ago
- A huge dataset for Document Visual Question Answering ☆23 Jul 29, 2024 Updated last year
- Text classification with few-shot methods, based on the THUCNews dataset ☆10 Jun 4, 2020 Updated 5 years ago
- This repository is a sub branch of AI Knowledge Tree, mainly focus on Natural Language Processing. ☆27 Jun 14, 2021 Updated 4 years ago
- Dataset of classical Chinese texts ☆21 Jun 29, 2023 Updated 2 years ago
- Remote MCP server that gives LLMs access to run network commands ☆55 Apr 9, 2026 Updated last month