rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 Updated 8 years ago
Alternatives and similar repositories for elasticrawl:
Users who are interested in elasticrawl are comparing it to the libraries listed below:
- Human-Powered Data Analysis with Mechanical Turk ☆300 Updated 12 years ago
- Index URLs in Common Crawl ☆194 Updated 7 years ago
- Hadoop jobs for the WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 Updated 6 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆166 Updated 3 years ago
- Extract postal addresses from the DOM ☆66 Updated 12 years ago
- GraphAware Timer-Driven Runtime Module that executes a PageRank-like algorithm on the graph ☆26 Updated 7 years ago
- Raw Wikipedia counts for entity linking ☆19 Updated 7 years ago
- The Summarizer from the Web IR / NLP Group (WING), hence SWING, is a modular, state-of-the-art automatic extractive text summarization sy… ☆39 Updated 10 years ago
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts ☆59 Updated 12 years ago
- Semanticizest: dump parser and client ☆20 Updated 8 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- ☆21 Updated 6 years ago
- Parser and standardizer for politician, individual and organization names. ☆129 Updated 7 years ago
- Wikipedia Live Monitor ☆21 Updated 4 months ago
- Demo code for learning_text_transformer ☆25 Updated 10 years ago
- Browser add-on and web server to support collection and analysis of web browsing data. ☆13 Updated 9 years ago
- Supervised learning for novelty detection in text ☆78 Updated 8 years ago
- Gulp plugin to deploy tensorflow in aws lambda ☆17 Updated 8 years ago
- ☆43 Updated 9 years ago
- FacetView is a pure javascript frontend for ElasticSearch. ☆290 Updated 9 years ago
- Wikidata and Wikipedia API client. ☆35 Updated last year
- Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser ☆13 Updated 3 years ago
- Ruby client library for controlling Google Refine ☆44 Updated 7 years ago
- A collection and conversion of WARN notices from California ☆12 Updated 8 years ago
- Prototype plugin to support topic modeling using LDA in Elasticsearch ☆20 Updated 9 years ago
- ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest 10s of millions of files (image… ☆95 Updated 6 years ago
- Lentil is no longer supported. Lentil is a Ruby on Rails Engine that supports the harvesting of images from Instagram. ☆58 Updated 6 years ago
- Visualization and summarization of a collection of documents. ☆20 Updated 2 years ago
- An attempt at creating a silver/gold standard dataset for backtesting yesterday & today's content-extractors ☆34 Updated 10 years ago
- Pipeline for distributed Natural Language Processing, made in Python ☆64 Updated 8 years ago