rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49Updated 8 years ago
Alternatives and similar repositories for elasticrawl:
Users who are interested in elasticrawl are comparing it to the libraries listed below.
- Human-Powered Data Analysis with Mechanical Turk☆300Updated 12 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Updated 6 years ago
- Index URLs in Common Crawl☆193Updated 7 years ago
- A lightweight server to allow HTTP requests to the Stanford Named Entity Recognizer and a heavily modified CLAVIN geoparser.☆119Updated 2 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆166Updated 2 years ago
- Semanticizest: dump parser and client☆20Updated 8 years ago
- Extract postal addresses from the DOM☆66Updated 12 years ago
- Wikidata and Wikipedia API client.☆35Updated last year
- Parser and standardizer for politician, individual and organization names.☆129Updated 7 years ago
- Updates to Zope's keyphrase extractor (forked from 1.1.0)☆66Updated 7 years ago
- A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/☆186Updated 6 years ago
- Analysis plugin for ElasticSearch providing capability for processing inline annotations in documents.☆35Updated 11 years ago
- Tribe extracts a network from an email mbox and writes it to a graphml file for visualization and analysis.☆79Updated last year
- Raw Wikipedia counts for entity linking☆19Updated 7 years ago
- ☆43Updated 9 years ago
- A simple Python library/tool for pulling location information from unstructured text☆185Updated 14 years ago
- ☆21Updated 6 years ago
- ☆24Updated 9 years ago
- Tools to download and process name data from various sources.☆90Updated 11 years ago
- Apache Nutch fork tuned for web services and data discovery.☆9Updated 9 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition☆136Updated 8 years ago
- Ruby client library for controlling Google Refine☆44Updated 6 years ago
- Supervised learning for novelty detection in text☆78Updated 8 years ago
- Empower Curiosity / Redshift analytics platform☆77Updated 3 years ago
- Solr Dictionary Annotator (Microservice for Spark)☆71Updated 5 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆56Updated 3 years ago
- Fuzzy Categorical Distances☆14Updated 4 years ago
- Solrstrap is a Query-Result interface for Solr written in JavaScript, HTML and CSS☆86Updated 7 years ago
- Visualization and summarization of a collection of documents.☆20Updated 2 years ago
- FacetView is a pure javascript frontend for ElasticSearch.☆291Updated 9 years ago