daac-tools/vaporetto

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

☆295

Alternatives and similar repositories for vaporetto

Users who are interested in vaporetto are comparing it to the libraries listed below.

daac-tools / vibrato
🎤 vibrato: Viterbi-based accelerated tokenizer
☆415 · Updated this week
daac-tools / daachorse
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure in Rust.
☆270 · Updated this week
daac-tools / python-vaporetto
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. (Python wrapper)
☆21 · May 30, 2026 · Updated last month
lindera / lindera
A multilingual morphological analysis library.
☆644 · Updated this week
daac-tools / crawdad
🦞 Rust library of natural language dictionaries using character-wise double-array tries.
☆38 · Jan 13, 2025 · Updated last year
WorksApplications / chikkarpy
Japanese synonym library
☆55 · Feb 7, 2022 · Updated 4 years ago
WorksApplications / sudachi.rs
Sudachi in Rust 🦀 and the new generation of SudachiPy
☆459 · Jun 29, 2026 · Updated 3 weeks ago
WorksApplications / SudachiTra
Japanese tokenizer for Transformers
☆80 · Dec 15, 2023 · Updated 2 years ago
megagonlabs / bunkai
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
☆200 · Mar 26, 2024 · Updated 2 years ago
daac-tools / find-simdoc
Finding all pairs of similar documents time- and memory-efficiently
☆62 · Mar 13, 2025 · Updated last year
himkt / konoha
🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with small code changes.
☆263 · Updated this week
himkt / awesome-bert-japanese
📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
☆132 · Mar 15, 2023 · Updated 3 years ago
hotchpotch / fast-bunkai
⚡ Japanese sentence splitting (日本語文境界判定器), 40–250× faster via a Rust-accelerated Python library with near-perfect API compatibility with …
☆74 · Oct 14, 2025 · Updated 9 months ago
octanove / shiba
PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.
☆89 · Nov 3, 2023 · Updated 2 years ago
kajyuuen / daaja
This repository contains implementations of data augmentation for Japanese NLP.
☆64 · Feb 16, 2023 · Updated 3 years ago
stockmarkteam / ner-wikipedia-dataset
Japanese named entity recognition dataset built from Wikipedia
☆143 · Sep 2, 2023 · Updated 2 years ago
taishi-i / toiro
A tool for comparing tokenizers
☆122 · Nov 9, 2025 · Updated 8 months ago
megagonlabs / jrte-corpus
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
☆77 · Jun 23, 2023 · Updated 3 years ago
takuyaa / yada
Yada is yet another double-array trie library aiming for fast search and compact data representation.
☆48 · Jun 7, 2026 · Updated last month
altescy / colt
🐎 Colt: Effortlessly configure and construct Python objects with colt, a lightweight library inspired by AllenNLP and Tango
☆26 · Jul 13, 2026 · Updated last week
megagonlabs / ginza-transformers
Use custom tokenizers in spacy-transformers
☆16 · Aug 9, 2022 · Updated 3 years ago
WorksApplications / chiVe
Japanese word embedding with Sudachi and NWJC 🌿
☆177 · Mar 1, 2024 · Updated 2 years ago
daac-tools / python-daachorse
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)
☆21 · May 30, 2026 · Updated last month
retarfi / language-pretraining
Pre-training Language Models for Japanese
☆50 · Jul 2, 2023 · Updated 3 years ago
cl-tohoku / keigo_transfer_task
Evaluation dataset for the honorific (keigo) conversion task
☆21 · Nov 24, 2022 · Updated 3 years ago
Leko / goya
Japanese Morphological Analysis written in Rust
☆84 · Dec 30, 2021 · Updated 4 years ago
kampersanda / sif-embedding
Rust implementation of SIF and uSIF: Simple and fast sentence embedding
☆19 · Jan 22, 2025 · Updated last year
yukiar / OTAlign
Repository for the ACL 2023 paper: Unbalanced Optimal Transport for Unbalanced Word Alignment
☆38 · Sep 13, 2023 · Updated 2 years ago
HojiChar / HojiChar
The robust text processing pipeline framework enabling customizable, efficient, and metric-logged text preprocessing.
☆128 · Updated this week
WorksApplications / Sudachi
A Japanese Tokenizer for Business
☆990 · Jul 14, 2026 · Updated last week
ir100 / ir100
100 Knocks for Information Retrieval (a set of 100 exercises)
☆93 · Dec 3, 2025 · Updated 7 months ago
megagonlabs / ginza
A Japanese NLP library using spaCy as its framework, based on Universal Dependencies
☆862 · Jul 10, 2026 · Updated last week
ku-nlp / kwja
An integrated Japanese analyzer based on foundation models
☆145 · Updated this week
yahoojapan / JGLUE
JGLUE: Japanese General Language Understanding Evaluation
☆346 · Mar 31, 2025 · Updated last year
WorksApplications / ViSudachi
A tool for visualizing the internal structures of the morphological analyzer Sudachi
☆18 · Jun 9, 2022 · Updated 4 years ago
polm / fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
☆533 · Oct 24, 2025 · Updated 8 months ago
ndl-lab / huriganacorpus-ndlbib
Furigana (reading annotation) dataset created from national bibliography data
☆32 · Sep 21, 2021 · Updated 4 years ago
neologd / namelti
Namelti: An automatic transcription generation library for person names in Katakana
☆24 · Jul 10, 2023 · Updated 3 years ago
de9uch1 / semsis
A library for semantic similarity search
☆26 · Jan 31, 2025 · Updated last year