opendatalab / WanJuan3.0

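WanJuan3.0 ("万卷·丝路", WanJuan Silk Road) is a comprehensive plain-text corpus that collects publicly available web content, literature, patents, and other materials from multiple countries and regions. Its total data size exceeds 1.2 TB and its total token count exceeds 300B, placing it at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each exceeding 150 GB in size.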
24 · Updated 3 months ago

Alternatives and similar repositories for WanJuan3.0

Users who are interested in WanJuan3.0 are comparing it to the libraries listed below
