jonathandunn / common_crawl_corpus

Scripts for building a geo-located web corpus using Common Crawl data

☆11

Alternatives and similar repositories for common_crawl_corpus

Users who are interested in common_crawl_corpus are comparing it to the repositories listed below.

transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆11 Updated last year
cisnlp / GlotWeb
🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
☆15 Updated last month
BramVanroy / spacy-extreme
An example of how to use spaCy for extremely large files without running into memory issues
☆36 Updated 3 years ago
tokestermw / spacy_grammar
LanguageTool-style grammar handling with spaCy 2.0
☆42 Updated 7 years ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆52 Updated 5 years ago
writer / fitbert
Use BERT to Fill in the Blanks
☆83 Updated 3 years ago
chozelinek / europarl
Toolkit to compile a comparable/parallel corpus from European Parliament proceedings
☆16 Updated 5 years ago
RxNLP / PyRXNLP
Build intelligent data-driven applications with minimal effort. Sentence Clustering, Topics Extraction, Text Similarity, Opinion Summariz…
☆41 Updated 5 years ago
mediacloud / date_guesser
A library to extract a publication date from a web page, along with a measure of the accuracy.
☆41 Updated 6 years ago
plkumjorn / GrASP
An implementation of GrASP (Shnarch et al., 2017)
☆21 Updated 3 years ago
StonyBrookNLP / PerSenT
[COLING2020] A challenge dataset for Person SenTiment analysis in news domain.
☆11 Updated 3 years ago
mariananeves / annotation-tools
☆64 Updated 2 years ago
TurkuNLP / wikibert
BERT models for many languages created from Wikipedia texts
☆33 Updated 5 years ago
loomchild / maligna
Bilingual sentence aligner
☆28 Updated 2 years ago
projecte-aina / spacy
Pre-production releases for spaCy in Catalan
☆14 Updated 3 years ago
modernmt / DataCollection
Data collection, alignment and TAUS repository
☆23 Updated 7 years ago
kermitt2 / grobid-ner
A Named-Entity Recogniser based on Grobid.
☆54 Updated 4 months ago
AudayBerro / automatedParaphrase
Automated paraphrase generation
☆36 Updated 2 years ago
pmbaumgartner / spacy-setfit-textcat
☆30 Updated 3 years ago
bltlab / paranames
ParaNames: A multilingual resource for parallel names
☆36 Updated last year
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 Updated 4 years ago
UKPLab / linspector
☆25 Updated 5 years ago
HKUST-KnowComp / MLMA_hate_speech
Dataset and code of our EMNLP 2019 paper "Multilingual and Multi-Aspect Hate Speech Analysis"
☆57 Updated 10 months ago
gkiril / MinSCIE
MinScIE is an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations.
☆15 Updated 6 years ago
pmbaumgartner / remerge-mwe
REMERGE - Multi-Word Expression discovery algorithm
☆14 Updated 2 years ago
BramVanroy / spacy_conll
Pipeline component for spaCy (and other spaCy-wrapped parsers such as spacy-stanza and spacy-udpipe) that adds CoNLL-U properties to a Do…
☆82 Updated last year
alexyorke / butter-fingers
A Python library to generate highly realistic typos (fuzz-testing)
☆12 Updated 6 months ago
zyocum / dedup
Find duplicate text files.
☆15 Updated 8 months ago
lum-ai / odinson
Odinson is a powerful and highly optimized open-source framework for rule-based information extraction. Odinson couples a simple, yet pow…
☆72 Updated last year
Mrezvan94 / Harassment-Corpus
Harassment Lexicon and Corpus
☆30 Updated 7 years ago