opendatalab / WanJuan3.0
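WanJuan3.0 ("万卷·丝路", WanJuan Silk Road) is a comprehensive plain-text corpus that collects publicly available web content, literature, patents, and other materials from multiple countries and regions. The total data size exceeds 1.2TB and the total token count exceeds 300B, which the authors describe as internationally leading. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each of which exceeds 150GB.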
☆24Updated 3 months ago
Alternatives and similar repositories for WanJuan3.0
Users who are interested in WanJuan3.0 are comparing it to the repositories listed below
- WanJuan1.0 multimodal corpus☆560Updated last year
- datasets resource☆113Updated 3 weeks ago
- ☆226Updated last year
- ☆172Updated 2 years ago
- ☆324Updated 11 months ago
- MPB (Miner-PDF-Benchmark) is an end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios.☆22Updated 5 months ago
- Imitate OpenAI with Local Models☆88Updated 8 months ago
- ☆18Updated this week
- A multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024)☆386Updated 9 months ago
- Dingo: A Comprehensive Data Quality Evaluation Tool☆144Updated last week
- We are the first fully commercially usable role-play large language model.☆40Updated 9 months ago
- A demo built on Megrez-3B-Instruct, integrating a web search tool to enhance the model's question-and-answer capabilities.☆38Updated 5 months ago
- A simple Semantic Kernel semantic function debugging tool.☆29Updated last year
- A Python Package to Access World-Class Generative Models☆127Updated 11 months ago
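- Aims to collect open-source data from various industries to guide and promote the development of industry-specific large language models☆45Updated 6 months ago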
- A dataset template for guiding chat-models to self-cognition, including information about the model’s identity, capabilities, usage, limi…☆27Updated last year
- The code and data for GrammarGPT.☆169Updated last year
- The Open-Source Data Annotation Platform☆811Updated 2 months ago
- The official code for "Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning"☆261Updated last year
- A purer Tokenizer with a higher compression rate☆481Updated 5 months ago
- It's an open-source LLM based on an MoE structure.☆58Updated 10 months ago
- vLLM client with minimal dependencies☆13Updated last year
- 律知, a large language model for legal consultation☆38Updated last year
- Comprehensive evaluation of Chinese versions of the open-source Llama2 models, based on SuperCLUE's OPEN benchmark | Llama2 Chinese evaluation with SuperCLUE☆126Updated last year
- deep learning☆149Updated last week
- llama inference for tencentpretrain☆98Updated last year
- HanFei-1.0 (韩非), the first legal large language model in China trained with full-parameter training☆116Updated last year
- The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.☆441Updated 7 months ago
- Collection of Chinese Books☆184Updated last year
- Open-source e-book "Machine Learning Engineering" (《机器学习工程》); contributions to improve the Chinese edition are welcome☆73Updated last year