akb89 / witokit

A Python toolkit to generate a tokenized dump of Wikipedia for NLP

☆11

Alternatives and similar repositories for witokit:

Users who are interested in witokit compare it to the libraries listed below.

jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11 Updated 2 months ago
allenai / HyBayes
Bayesian Assessment of Hypotheses
☆24 Updated last year
anlausch / ArguminSci
Analyze Argumentation and Rhetorical Aspects in Scientific Writing.
☆19 Updated 2 years ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 Updated 6 years ago
mayhewsw / pytorch-truecaser
A simple neural truecaser written in PyTorch and AllenNLP.
☆32 Updated 7 months ago
iesl / stance
Learned string similarity for entity names using optimal transport.
☆34 Updated 4 years ago
revuel / PatternOmatic
Finds linguistic patterns effortlessly
☆34 Updated last year
clips / wordkit
Featurize words into orthographic and phonological vectors.
☆40 Updated last year
allenai / pybart
Converter from UD-trees to BART representation
☆36 Updated 10 months ago
wenkokke / dep2con
Several algorithms for converting dependency structures into constituency structures.
☆10 Updated 2 years ago
mjpost / bin
bin files
☆13 Updated last month
facebookresearch / irt-leaderboard
Leaderboards are widely used in NLP and push the field forward. While leaderboards are a straightforward ranking of NLP models, this simp…
☆17 Updated 2 years ago
jaredleekatzman / Wordly
ADS Project
☆14 Updated 9 years ago
bitextor / bifixer
Tool to fix bitexts and tag near-duplicates for removal
☆29 Updated 5 months ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 Updated 3 years ago
GorkaUrbizu / Coreference-Corpora-Resources
List of corpora annotated for coreference for different languages
☆17 Updated 5 months ago
SapienzaNLP / mcl-wic
SemEval-2021 Multilingual and Cross-lingual Word-in-Context Task
☆18 Updated 3 years ago
facebookresearch / text-simplification-evaluation
Reference-less Quality Estimation of Text Simplification Systems
☆48 Updated last year
armatthews / TokenizeAnything
A re-implementation of redpony/cdec's tokenize-anything.pl script in Python
☆8 Updated 8 years ago
boberle / dependency2tree
Convert CoNLL output of a dependency parser into a LaTeX or Graphviz tree
☆12 Updated 4 years ago
BramVanroy / spacy-extreme
An example of how to use spaCy for extremely large files without running into memory issues
☆36 Updated 2 years ago
trevorcohn / mantis
Deep learning model of machine translation using attentional and structural biases
☆13 Updated 7 years ago
adrianeboyd / boyd-wnut2018
Code and data for: Low Resource Grammatical Error Correction Using Wikipedia Edits (WNUT 2018)
☆14 Updated 6 months ago
bltlab / mot
Multilingual Open Text
☆25 Updated 2 months ago
kensk8er / langdist
Multilingual Language Modeling Toolkit
☆11 Updated 7 years ago
explosion / spacy-alignments
💫 A spaCy package for Yohei Tamura's Rust tokenizations library
☆27 Updated last year
MilaNLProc / bertlang
A web interface to understand language-specific BERT-models
☆17 Updated 9 months ago
bltlab / seqscore
SeqScore: Scoring for named entity recognition and other sequence labeling tasks
☆22 Updated 2 weeks ago
akb89 / pyfn
A Python module to process data for Frame Semantic Parsing
☆23 Updated 4 years ago
transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆10 Updated 11 months ago