bejean / crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
☆96 Updated 7 years ago
Alternatives and similar repositories for crawl-anywhere:
Users who are interested in crawl-anywhere are comparing it to the libraries listed below.
- ☆65 Updated 8 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆212 Updated 2 years ago
- An open source search engine for corporate data and websites. ☆106 Updated 7 years ago
- ☆28 Updated 8 years ago
- Feed discovery to share :) ☆40 Updated 8 years ago
- XTractor is an algorithmic text extractor from web pages written in Java. It builds upon the "commonly used web design practices" approac… ☆43 Updated 8 years ago
- Automatic, zero-config web scraping -- written in Java, has no dependency on Java EE or app servers, and the web scraper has a restful/JS… ☆155 Updated 7 years ago
- A tool for managing website extraction configs ☆37 Updated 11 years ago
- OpenBlock is a web application and RESTful service that allows users to browse and search their local area for "hyper-local news" ☆61 Updated 3 years ago
- A Nutch 2.2.1 plugin which allows users to shuffle off the responsibility for retrieving pages to a selenium hub/node spoke system. This … ☆16 Updated 8 years ago
- Examples of Text Mining in WEKA ☆64 Updated 11 years ago
- Collects multimedia content shared through social networks. ☆19 Updated 9 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- ElasticSearch CookBook Second Edition - Code repository ☆41 Updated 10 years ago
- Twitter River Plugin for elasticsearch (STOPPED) ☆204 Updated 5 months ago
- Sample Index script for ElasticSearch. Includes data CSV. ☆25 Updated last year
- A JSON-aware ElasticSearch front end ☆299 Updated 10 years ago
- Repackaging of Boilerpipe published on Maven Central Repository. ☆53 Updated last year
- XML interface for Elasticsearch REST ☆44 Updated 8 years ago
- Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm… ☆49 Updated 12 years ago
- Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading p… ☆142 Updated 2 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 4 years ago
- FacetView is a pure javascript frontend for ElasticSearch. ☆291 Updated 9 years ago
- solr-logstash ☆43 Updated 8 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆59 Updated 6 years ago
- Elasticsearch Latent Semantic Indexing experimentation ☆33 Updated 5 years ago
- Sentiment analysis framework developed by CERTH. ☆22 Updated 9 years ago
- Thoth is a real-time solr monitor and search analysis engine. It's a set of tools that can help you collect, visualize and leverage data … ☆68 Updated 10 years ago
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 5 years ago