jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11 · Updated 2 months ago
Alternatives and similar repositories for common_crawl_corpus:
Users that are interested in common_crawl_corpus are comparing it to the libraries listed below
- An example of how to use spaCy for extremely large files without running into memory issues ☆36 · Updated 2 years ago
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆10 · Updated 11 months ago
- Finds linguistic patterns effortlessly ☆35 · Updated last year
- an experimental implementation of Burrows' Delta in Python 3 ☆20 · Updated 3 years ago
- Build intelligent data-driven applications with minimal effort. Sentence Clustering, Topics Extraction, Text Similarity, Opinion Summariz… ☆40 · Updated 5 years ago
- Lexicons for the Multilingual UCREL Semantic Analysis System ☆40 · Updated last year
- Topic modelling with SpaCy, Gensim and Textacy ☆19 · Updated 6 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 · Updated 5 years ago
- Wikidata embedding ☆51 · Updated 2 months ago
- Searching in-memory corpus with Corpus Query Language (CQL) ☆19 · Updated last month
- Analyze Argumentation and Rhetorical Aspects in Scientific Writing. ☆19 · Updated 2 years ago
- A python module to process data for Frame Semantic Parsing ☆23 · Updated 4 years ago
- Bagpipes spaCy is a collection of custom spaCy pipeline components designed to enhance text processing capabilities. ☆13 · Updated 5 months ago
- a python package for cleaning Gutenberg books and dataset ☆33 · Updated last year
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆50 · Updated 4 years ago
- Code and models for our CLEF-HIPE (Named Entity Processing on Historical Newspapers) submissions ☆19 · Updated last year
- Python 3 library for processing historical English ☆64 · Updated 5 months ago
- Python SDK for the TextRazor Text Analytics API ☆20 · Updated last year
- Wrapper for DKPro Core to extract linguistic information from books. ☆16 · Updated 2 years ago
- Language Tool style grammar handling with spaCy 2.0 ☆42 · Updated 6 years ago
- SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time ☆39 · Updated last year
- MinScIE is an Open Information Extraction system which provides structured knowledge enriched with semantic information about citations. ☆15 · Updated 5 years ago
- several algorithms for converting dependency structures into constituency structures. ☆10 · Updated 2 years ago
- python package for calculating famous measures in computational linguistics ☆13 · Updated 2 months ago
- sequence tagging with spaCy and crfsuite ☆19 · Updated last year
- This repository provides various Python methods for finding and aggregating synonyms for an individual word or a list of words. ☆33 · Updated last year
- TopicScan: Visualization and validation interface for NMF Topic Modeling ☆23 · Updated 4 years ago
- A Named-Entity Recogniser based on Grobid. ☆50 · Updated 4 months ago