jonathandunn / corpus_similarity

Measure the similarity of text corpora for 74 languages

☆13

Alternatives and similar repositories for corpus_similarity

Users who are interested in corpus_similarity are comparing it to the libraries listed below.

SemiringInc / Mueller-Report-Corpus
The Mueller Report Corpus V 0.1
☆11 Updated 5 years ago
clarinsi / tweetcat
TweetCaT - a tool for building Twitter corpora of smaller languages or specific geographical regions
☆12 Updated 8 years ago
bjascob / pyInflect
A Python module for word inflections designed for use with spaCy.
☆92 Updated 5 years ago
amir-zeldes / rstWeb
Repository for rstWeb, a browser-based annotation interface for Rhetorical Structure Theory
☆43 Updated 8 months ago
arne-cl / discoursegraphs
Linguistic converter / merging tool for multi-level annotated corpora. Graph-based (using Python and NetworkX).
☆50 Updated 2 years ago
GateNLP / broad_twitter_corpus
The Broad Twitter Corpus, an NER dataset in English stratified for time, location, social media genre, socioeconomic factors (COLING 2016…
☆68 Updated 3 years ago
ghpaetzold / massalign
Alignment and annotation for comparable documents.
☆22 Updated 6 years ago
babylonhealth / hmrb
☆70 Updated 2 years ago
proycon / flat
FoLiA Linguistic Annotation Tool -- Flat is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.g…
☆113 Updated 5 months ago
amir-zeldes / gum
Repository for the Georgetown University Multilayer Corpus (GUM)
☆98 Updated 2 weeks ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆52 Updated 5 years ago
tokestermw / spacy_grammar
LanguageTool-style grammar handling with spaCy 2.0
☆42 Updated 6 years ago
fnl / segtok
Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic fe…
☆170 Updated 3 years ago
UKPLab / eacl2017-oodFrameNetSRL
Implementation of a simple frame identification approach (SimpleFrameId) described in the paper "Out-of-domain FrameNet Semantic Role Lab…
☆15 Updated 8 years ago
vered1986 / panic
PANiC - PAraphrasing Noun-Compounds
☆15 Updated 7 years ago
explosion / wikid
Generate a SQLite database from Wikipedia & Wikidata dumps.
☆35 Updated last year
burrsettles / readability
Text readability metrics in Python.
☆11 Updated 11 years ago
mikahama / natas
Python 3 library for processing historical English
☆67 Updated 11 months ago
kensho-technologies / qwikidata
Python tools for interacting with Wikidata
☆154 Updated last year
impresso / CLEF-HIPE-2020
Identifying Historical People, Places and other Entities: Shared Task on Named Entity Recognition and Linking on Historical Newspapers at…
☆22 Updated 11 months ago
TimKam / compound-word-splitter
A compound word splitter for Python
☆48 Updated 3 years ago
mholtzscher / spacy_readability
spaCy pipeline component for adding text readability metadata to Doc objects.
☆56 Updated 6 years ago
MartinoMensio / spacy-dbpedia-spotlight
A spaCy wrapper for DBpedia Spotlight
☆110 Updated 2 years ago
ldtoolkit / conceptnet-rocks
Python library to work with ConceptNet offline
☆10 Updated 2 years ago
proycon / folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu…
☆65 Updated last year
kermitt2 / grobid-ner
A Named-Entity Recogniser based on Grobid.
☆55 Updated 2 months ago
LuminosoInsight / exquisite-corpus
Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings.
☆52 Updated 4 years ago
nert-nlp / streusle
STREUSLE: a corpus with comprehensive lexical semantic annotation (multiword expressions, supersenses)
☆66 Updated last month
clarinsi / csmtiser
A tool for text normalisation via character-level machine translation
☆13 Updated 5 years ago
SapienzaNLP / ewiser
A Word Sense Disambiguation system integrating implicit and explicit external knowledge.
☆69 Updated 3 years ago