tpickett / mongo-elasticsearch-nutch
Docker image for creating a single Apache Nutch server, with MongoDB as crawl storage and Elasticsearch for indexing
☆17Updated 9 years ago
Alternatives and similar repositories for mongo-elasticsearch-nutch:
Users who are interested in mongo-elasticsearch-nutch are comparing it to the libraries listed below:
- A Python library to detect and extract listing data from HTML pages.☆109Updated 7 years ago
- Solr AutoComplete implementation☆59Updated 7 years ago
- A bundle of useful Elasticsearch plugins☆110Updated 9 months ago
- FacetView is a pure javascript frontend for ElasticSearch.☆291Updated 9 years ago
- A Scrapy pipeline which sends items to an Elasticsearch server☆327Updated 2 years ago
- Tools for web page segmentation. In development☆17Updated 6 years ago
- An extension to the demo template of ElasticUI, a beautiful AngularJS frontend to ElasticSearch for faceted navigation☆39Updated 9 years ago
- Naive Bayes Classifier implemented with Elasticsearch Aggregations☆51Updated 10 years ago
- Search UI for Elasticsearch☆325Updated 3 years ago
- A curated list of Awesome Apache Solr links and resources.☆107Updated 3 years ago
- Scrapy middleware for the autologin☆37Updated 6 years ago
- Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)☆204Updated 8 months ago
- Angular JS Solr and Elasticsearch and OpenSearch Diagnostic Search Services☆26Updated 7 months ago
- docker scrapyd scrapy boot2docker crawler - a spider Python application that can be "Dockerized".☆42Updated 9 years ago
- An efficient simhash implementation for python☆124Updated 5 years ago
- Pipeline for distributed Natural Language Processing, made in Python☆65Updated 8 years ago
- Easy extraction of keywords and engines from search engine results pages (SERPs).☆90Updated 3 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends☆55Updated last year
- Aviation grade news article metadata extraction☆36Updated last year
- Web Crawler for Elasticsearch☆234Updated 5 years ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.☆96Updated 7 years ago
- Web page segmentation and noise removal☆55Updated 11 months ago
- Solrstrap is a Query-Result interface for Solr written in JavaScript, HTML and CSS☆86Updated 7 years ago
- An Apache Lucene TokenFilter that uses word2vec vectors for term expansion.☆24Updated 10 years ago
- Demo of the Newspaper article extraction library.☆29Updated 10 years ago
- A python implementation of DEPTA☆83Updated 8 years ago
- Web Content Extraction Through Machine Learning☆185Updated 10 years ago
- Suite of tools for detecting changes in web pages and their rendering☆54Updated last year
- Automatic Item List Extraction☆87Updated 8 years ago
- A web service that computes a set of readability metrics for text. We currently support the following metrics: Automated Readability Inde…☆71Updated 2 years ago