🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
☆21 Jun 1, 2025 Updated 9 months ago
Alternatives and similar repositories for python-vaporetto
Users that are interested in python-vaporetto are comparing it to the libraries listed below
- Evaluation dataset for the honorific (keigo) conversion task ☆21 Nov 24, 2022 Updated 3 years ago
- Finding all pairs of similar documents time- and memory-efficiently ☆62 Mar 13, 2025 Updated 11 months ago
- Datasets related to law and court precedents ☆49 Jan 8, 2025 Updated last year
- ☆10 Sep 14, 2022 Updated 3 years ago
- Includes a file with zstd compression in Rust ☆13 Feb 17, 2023 Updated 3 years ago
- Tokyo Metropolitan University Paraphrase Corpus (TMUP) ☆11 Jun 12, 2017 Updated 8 years ago
- Repository for the atmaCup #11 Public 4th / Private 5th solution. ☆12 Aug 3, 2021 Updated 4 years ago
- 🦀 Rust library of natural language dictionaries using character-wise double-array tries. ☆36 Jan 13, 2025 Updated last year
- 🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer ☆252 Feb 7, 2026 Updated 3 weeks ago
- Japanese data from the Google UDT 2.0. ☆28 Mar 24, 2023 Updated 2 years ago
- Trials of pre-trained BERT models for the medical domain in Japanese. ☆12 Nov 21, 2020 Updated 5 years ago
- An easy-to-use ML pipeline package for Python inspired by scikit-learn pipeline and PyTorch layers. ☆12 Aug 27, 2023 Updated 2 years ago
- ☆14 Nov 24, 2022 Updated 3 years ago
- Code for COLING 2020 Paper ☆13 Feb 3, 2026 Updated last month
- Pytorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer. ☆89 Nov 3, 2023 Updated 2 years ago
- This repository is a collection of MLOps case studies. ☆36 Sep 3, 2023 Updated 2 years ago
- This repository has implementations of data augmentation for NLP for Japanese. ☆64 Feb 16, 2023 Updated 3 years ago
- DIRECT: Direct and Indirect REsponses in Conversational Text Corpus ☆17 Jul 1, 2021 Updated 4 years ago
- Code and dataset "ZEST" from "Learning from task descriptions", Weller et al, EMNLP 2020 ☆17 Mar 15, 2021 Updated 4 years ago
- Rust binding of primitiv ☆20 Jun 3, 2018 Updated 7 years ago
- 🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse) ☆20 Mar 15, 2025 Updated 11 months ago
- Arguments parser with class for Python, inspired by StructOpt ☆62 Sep 17, 2023 Updated 2 years ago
- Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese with BERT models ☆20 Updated this week
- ☆16 Jan 3, 2025 Updated last year
- Codes to pre-train Japanese T5 models ☆40 Sep 7, 2021 Updated 4 years ago
- The evaluation scripts of JMTEB (Japanese Massive Text Embedding Benchmark) ☆84 Jan 6, 2026 Updated last month
- The robust text processing pipeline framework enabling customizable, efficient, and metric-logged text preprocessing. ☆125 Nov 13, 2025 Updated 3 months ago
- Utility scripts for preprocessing Wikipedia texts for NLP ☆78 Apr 9, 2024 Updated last year
- ☆17 May 31, 2023 Updated 2 years ago
- ☆19 Jan 28, 2021 Updated 5 years ago
- A simple implementation of SimCSE ☆78 Oct 31, 2022 Updated 3 years ago
- Pre-training Language Models for Japanese ☆50 Jul 2, 2023 Updated 2 years ago
- Viterbi-based accelerated tokenizer (Python wrapper) ☆43 Sep 4, 2024 Updated last year
- Repository for JSICK ☆45 May 31, 2023 Updated 2 years ago
- Funer is Rule based Named Entity Recognition tool. ☆22 Apr 21, 2022 Updated 3 years ago
- lightweight, fast and robust columnar dataframe for data analytics with online update ☆23 Aug 14, 2021 Updated 4 years ago
- 1st place solution source code of Kaggle Happy Whale competition ☆58 May 24, 2022 Updated 3 years ago
- 🎤 vibrato: Viterbi-based accelerated tokenizer ☆398 Feb 7, 2026 Updated 3 weeks ago
- Discovering Universal Geometry in Embeddings with ICA (Published in EMNLP 2023) ☆20 Jun 17, 2025 Updated 8 months ago