stephantul / unitoken

Tokenization across languages. Useful as preprocessing for subword tokenization.

☆21

Alternatives and similar repositories for unitoken

Users who are interested in unitoken are comparing it to the libraries listed below.

jackbandy / bookcorpus-datasheet
Documentation effort for the BookCorpus dataset
☆34 Updated 4 years ago
pmbaumgartner / spacy-setfit-textcat
☆30 Updated 3 years ago
litus-ai / classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
☆87 Updated 2 months ago
iaramer / dobbi
An open-source NLP library: fast text cleaning and preprocessing
☆23 Updated 4 years ago
MilaNLProc / language-invariant-properties
☆22 Updated 3 years ago
Kaleidophon / token2index
A lightweight but powerful library to build token indices for NLP tasks, compatible with major Deep Learning frameworks like PyTorch and …
☆51 Updated last year
huggingface / model_card
☆30 Updated 4 years ago
darrow-labs / LegalLens
☆10 Updated last year
salesforce / TaiChi
Open source library for few shot NLP
☆78 Updated 2 years ago
allenai / smashed
SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi…
☆35 Updated last year
nlp-uoregon / famie
FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction
☆24 Updated 3 years ago
facebookresearch / text_characterization_toolkit
A library for computing diverse text characteristics and using them to analyze data sets and models with ease.
☆40 Updated 3 years ago
UKPLab / eacl2024-lagonn
Source code and data for Like a Good Nearest Neighbor
☆30 Updated 10 months ago
joelgrus / streamlit-allennlp
allennlp + streamlit demo
☆21 Updated 6 years ago
bhavsarpratik / semantic-search
[WIP] Behold, semantic-search, built over sentence-transformers to make it easy for search engineers to evaluate, optimise and deploy mod…
☆15 Updated 2 years ago
thakur-nandan / sprint
SPRINT Toolkit helps you evaluate diverse neural sparse models easily using a single click on any IR dataset.
☆47 Updated 2 years ago
writer / replaCy
spaCy match and replace, maintaining conjugation
☆36 Updated 2 years ago
pmbaumgartner / setfit
☆43 Updated 2 years ago
spyysalo / wiki-bert-pipeline
Generate BERT vocabularies and pretraining examples from Wikipedias
☆17 Updated 5 years ago
orevaahia / magnet-tokenization
☆13 Updated last year
rajammanabrolu / Q-BERT
Agents that build knowledge graphs and explore textual worlds by asking questions
☆79 Updated 2 years ago
BramVanroy / spacy-extreme
An example of how to use spaCy for extremely large files without running into memory issues
☆36 Updated 3 years ago
webis-de / summary-explorer
Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.
☆45 Updated last year
megagonlabs / ruler
Data Programming by Demonstration (DPBD) for Document Classification
☆35 Updated 4 years ago
jxpress / setfit-pytorch-lightning
☆44 Updated 2 years ago
knodle / knodle
A PyTorch-based open-source framework that provides methods for improving the weakly annotated data and allows researchers to efficiently…
☆108 Updated last year
lucy3 / whos_filtered
☆14 Updated last year
openredact / nerwhal
This is a prototype of a multi-lingual suite for named-entity recognition in Python.
☆21 Updated last year
chr1st1ank / narrow-down
Fast fuzzy text search
☆11 Updated 2 years ago
nyu-mll / pretraining-learning-curves
The repository for the paper "When Do You Need Billions of Words of Pretraining Data?"
☆21 Updated 5 years ago