jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11 Updated 3 months ago
Alternatives and similar repositories for common_crawl_corpus
Users who are interested in common_crawl_corpus are comparing it to the libraries listed below.
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆11 Updated last year
- A collection of over 1.5 million tweets translated to French, with their sentiment. ☆35 Updated 8 years ago
- A Named-Entity Recogniser based on Grobid. ☆54 Updated 2 months ago
- MinScIE is an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations. ☆15 Updated 6 years ago
- SciWING is a modern toolkit for scientific document processing from WING-NUS ☆63 Updated 2 years ago
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ☆14 Updated 4 months ago
- This repository provides our datasets for Arabic emotion detection in Twitter ☆9 Updated 7 years ago
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 Updated 2 years ago
- ☆64 Updated 2 years ago
- Röttger et al. (WOAH at NAACL 2022): "Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models" ☆17 Updated 3 years ago
- Religious Hate Speech Detection for Arabic Tweets ☆24 Updated 6 years ago
- A context-based spellchecker for correcting OCR output. ☆20 Updated 2 years ago
- Use BERT to Fill in the Blanks ☆83 Updated 3 years ago
- A spaCy wrapper for DBpedia Spotlight ☆110 Updated 2 years ago
- A spaCy wrapper of OpenTapioca for named entity linking on Wikidata ☆94 Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Code and models for our CLEF-HIPE (Named Entity Processing on Historical Newspapers) submissions ☆19 Updated 2 years ago
- spaCy pipeline component for generating spaCy KnowledgeBase Alias Candidates for Entity Linking ☆85 Updated 2 years ago
- List of corpora annotated for coreference for different languages ☆17 Updated last year
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 Updated 5 years ago
- CHOLAN: A Modular Approach for Neural Entity Linking on Wikipedia and Wikidata ☆32 Updated 3 years ago
- Python tools for interacting with Wikidata ☆154 Updated last year
- Python package for stylometry ☆63 Updated 4 years ago
- CrowdTruth framework for crowdsourcing ground truth for training & evaluation of AI systems ☆61 Updated last year
- Data collection, alignment and TAUS repository ☆23 Updated 7 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆164 Updated 2 years ago
- A Flexible Deep Learning Approach to Fuzzy String Matching ☆146 Updated 9 months ago
- Extracting useful metadata from Wikipedia dumps in any language. ☆27 Updated 5 years ago
- ☆18 Updated 4 years ago
- ParaNames: A multilingual resource for parallel names ☆34 Updated last year