commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆422 Updated last month
Alternatives and similar repositories for cc-pyspark:
Users who are interested in cc-pyspark are comparing it to the libraries listed below.
- Index Common Crawl archives in tabular format ☆113 Updated this week
- Statistics of Common Crawl monthly archives mined from URL index files ☆175 Updated this week
- A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/ ☆188 Updated 6 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆168 Updated 2 months ago
- A Python utility for downloading Common Crawl data ☆233 Updated last year
- News crawling with StormCrawler - stores content as WARC ☆338 Updated 3 weeks ago
- Heuristic-based boilerplate removal tool ☆758 Updated 2 weeks ago
- Streaming WARC/ARC library for fast web archive IO ☆404 Updated 3 months ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆166 Updated 2 years ago
- Article extraction benchmark: dataset and evaluation scripts ☆306 Updated 10 months ago
- Tools to construct and process webgraphs from Common Crawl data ☆87 Updated this week
- Python port of the Boilerpipe library ☆86 Updated 6 months ago
- spacy-wordnet creates annotations that easily allow the use of WordNet and WordNet Domains by using the NLTK WordNet interface ☆253 Updated 6 months ago
- Index URLs in Common Crawl ☆193 Updated 7 years ago
- Common Crawl fork of Apache Nutch ☆32 Updated this week
- Various Jupyter notebooks about Common Crawl data ☆51 Updated 3 weeks ago
- Fuzzy matching and more functionality for spaCy. ☆255 Updated 8 months ago
- NER toolkit for HTML data ☆259 Updated 10 months ago
- SpikeX - spaCy Pipes for Knowledge Extraction ☆397 Updated 3 years ago
- 🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy ☆309 Updated last year
- 💙 Emoji handling and metadata for spaCy with custom extension attributes ☆181 Updated last year
- ☆168 Updated 9 months ago
- Python3 bindings for the Compact Language Detector v3 (CLD3) ☆150 Updated last year
- Self-Supervision for Named Entity Disambiguation at the Tail ☆215 Updated 2 years ago
- Common Crawl Index Server ☆66 Updated 2 weeks ago
- Information extraction from English and German texts based on predicate logic ☆135 Updated last year
- Clean personally identifiable information from dirty dirty text. ☆406 Updated last year
- Full text geoparsing as a Python library ☆745 Updated 3 years ago
- 📂 Additional lookup tables and data resources for spaCy ☆105 Updated last month
- 🍳 Recipes for Prodigy, our fully scriptable annotation tool ☆490 Updated 7 months ago