bojone/bytepiece

A purer tokenizer with a higher compression rate

☆488

Alternatives and similar repositories for bytepiece

Users who are interested in bytepiece are comparing it to the libraries listed below.

SunDoge / bytepiece-rs
A purer tokenizer with a higher compression rate, in Rust
☆14 · Dec 21, 2024 · Updated last year
bojone / rerope
Rectified Rotary Position Embeddings
☆395 · May 20, 2024 · Updated 2 years ago
hscspring / bytepiece-rs
The Bytepiece Tokenizer Implemented in Rust.
☆15 · Nov 28, 2023 · Updated 2 years ago
bojone / NBCE
Naive Bayes-based Context Extension
☆328 · Dec 9, 2024 · Updated last year
esbatmop / MNBVC
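MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, aiming to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes news, essays, novels, books, magazines…
☆4,244 · Jul 13, 2026 · Updated 2 weeks ago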
liwenju0 / cutword
A simple, fast tool for word segmentation and named entity recognition
☆636 · Sep 26, 2025 · Updated 10 months ago
JunnYu / RoFormer_pytorch
RoFormer V1 & V2 pytorch
☆523 · May 18, 2022 · Updated 4 years ago
OpenLMLab / MOSS-RLHF
Secrets of RLHF in Large Language Models Part I: PPO
☆1,426 · Mar 3, 2024 · Updated 2 years ago
LianjiaTech / BELLE
BELLE: Be Everyone's Large Language model Engine (an open-source Chinese dialogue LLM)
☆8,279 · Oct 16, 2024 · Updated last year
hiyouga / FastEdit
🩹Editing large language models within 10 seconds⚡
☆1,370 · Aug 13, 2023 · Updated 2 years ago
baichuan-inc / Baichuan2
A series of large language models developed by Baichuan Intelligent Technology
☆4,088 · Nov 8, 2024 · Updated last year
IDEA-CCNL / Fengshenbang-LM
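Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, intended to serve as infrastructure for Chinese AIGC and cognitive intelligence.
☆4,125 · Jun 8, 2026 · Updated last month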
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,855 · Jul 14, 2026 · Updated 2 weeks ago
TigerResearch / TigerBot
TigerBot: A multi-language multi-task LLM
☆2,259 · Dec 28, 2024 · Updated last year
baichuan-inc / Baichuan-7B
A large-scale 7B pretraining language model developed by BaiChuan-Inc.
☆5,651 · Jul 18, 2024 · Updated 2 years ago
AetherCortex / Llama-X
Open Academic Research on Improving LLaMA to SOTA LLM
☆1,605 · Aug 30, 2023 · Updated 2 years ago
dandelionsllm / pandallm
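Panda is an open-source overseas Chinese large language model project launched in May 2023. It explores the full technology stack of the large-model era and aims to promote innovation and collaboration in Chinese natural language processing.
☆1,032 · Oct 19, 2023 · Updated 2 years ago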
PhoebusSi / Alpaca-CoT
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791 · Dec 12, 2023 · Updated 2 years ago
ztxz16 / fastllm
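fastllm is a high-performance LLM inference library with no backend dependencies. It supports tensor-parallel inference for dense models and mixed-mode inference for MoE models; any GPU with 10 GB or more of memory can run the full DeepSeek model. A dual-socket 9004/9005 server plus a single GPU can deploy the original full-precision DeepSeek model at 20 tps for a single concurrent request; the INT4-quantized model reaches 30tp…
☆4,869 · Updated this week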
bojone / Keras-DDPM
A Keras implementation of generative diffusion models
☆335 · Feb 14, 2025 · Updated last year
twang2218 / vocab-coverage
An analysis of the Chinese-language cognitive ability of language models
☆235 · Sep 9, 2023 · Updated 2 years ago
opendatalab / WanJuan1.0
WanJuan 1.0 multimodal corpus
☆574 · Oct 20, 2023 · Updated 2 years ago
bojone / FSQ
Keras implementation of Finite Scalar Quantization
☆87 · Oct 31, 2023 · Updated 2 years ago
Unakar / Logic-RL
Reproduce R1 Zero on Logic Puzzle
☆2,450 · Mar 20, 2025 · Updated last year
GanjinZero / RRHF
[NIPS2023] RRHF & Wombat
☆805 · Sep 22, 2023 · Updated 2 years ago
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆24,559 · Updated this week
FlagOpen / FlagEmbedding
Retrieval and Retrieval-augmented LLMs
☆11,990 · Apr 22, 2026 · Updated 3 months ago
L1aoXingyu / llm-infer-bench
☆12 · Sep 1, 2023 · Updated 2 years ago
CVI-SZU / Linly
Chinese-LLaMA 1&2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training and instruction fine-tuning datasets
☆3,045 · Apr 14, 2024 · Updated 2 years ago
ymcui / Chinese-LLaMA-Alpaca
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment
☆18,945 · Apr 19, 2026 · Updated 3 months ago
bojone / bert4keras
Keras implementation of transformers for humans
☆5,417 · Nov 11, 2024 · Updated last year
SkyworkAI / Skywork
Skywork series models are pre-trained on 3.2TB of high-quality multilingual (mainly Chinese and English) and code data. We have open-sour…
☆1,497 · Mar 7, 2025 · Updated last year
OpenMOSS / MOSS
An open-source tool-augmented conversational language model from Fudan University
☆12,203 · May 27, 2026 · Updated 2 months ago
yangjianxin1 / Firefly
Firefly: an LLM training tool that supports training Qwen2.5, Qwen2, Yi1.5, Phi-3, Llama3, Gemma, MiniCPM, Yi, Deepseek, Orion, Xverse, Mixtral-8x7B, Zephyr, Mistral, Baichuan2, Llama2, …
☆6,649 · Oct 24, 2024 · Updated last year
wenet-e2e / WeTextProcessing
Text Normalization & Inverse Text Normalization
☆802 · Updated this week
JIA-Lab-research / LongLoRA
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
☆2,689 · Aug 14, 2024 · Updated last year
dbiir / UER-py
Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo
☆3,111 · May 9, 2024 · Updated 2 years ago
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆6,445 · Mar 27, 2024 · Updated 2 years ago
princeton-nlp / SimCSE
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
☆3,655 · Oct 16, 2024 · Updated last year