bejean / crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
☆98 Updated 8 years ago
Alternatives and similar repositories for crawl-anywhere
Users who are interested in crawl-anywhere are comparing it to the libraries listed below.
- Html Content / Article Extractor in Scala - open sourced from Gravity Labs - http://gravity.com ☆343 Updated 6 years ago
- ☆66 Updated 8 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆222 Updated 2 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 9 years ago
- XTractor is an algorithmic text extractor from web pages written in Java. It builds upon the "commonly used web design practices" approac… ☆43 Updated 9 years ago
- Automatic, zero-config web scraping -- written in Java, has no dependency on Java EE or app servers, and the web scraper has a restful/JS… ☆155 Updated 8 years ago
- A Python library to detect and extract listing data from an HTML page. ☆108 Updated 8 years ago
- a pure javascript frontend for ElasticSearch search indices. ☆80 Updated 7 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆197 Updated this week
- A platform for backing crowdsourcing websites, built in golang for elasticsearch ☆360 Updated 5 years ago
- Open-source Enterprise Grade Search Engine Software ☆512 Updated 3 years ago
- Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm… ☆49 Updated 13 years ago
- Blog crawler for the blogforever project. ☆23 Updated 11 years ago
- Keeps a mirror of DBpedia live in sync ☆27 Updated 4 years ago
- Crawljax ☆535 Updated 2 years ago
- An open source search engine for corporate data and websites. ☆107 Updated 8 years ago
- open source big data integration, analytics, and visualization ☆419 Updated 8 years ago
- Approve or reject statements from third-party datasets ☆146 Updated 7 years ago
- A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector ☆251 Updated 7 years ago
- Carrot2 plugin for ElasticSearch ☆293 Updated 2 years ago
- A POC at replicating Facebook Graph Search with Cypher and Neo4j ☆101 Updated 12 years ago
- FacetView is a pure javascript frontend for ElasticSearch. ☆291 Updated 10 years ago
- Algorithmic summarizer for RSS/Atom Feeds, Web Urls and arbitrary text. Codebase for the application deployed at http://tldrzr.herokuapp.… ☆53 Updated 9 years ago
- a full cross platform video screen capture tool and host. Java based screen recorder, and Django based web backend. Also included is a di… ☆106 Updated 9 years ago
- solr-logstash ☆43 Updated 9 years ago
- crawler for YouTube ☆48 Updated 11 years ago
- Text classification using Naive Bayes and Elasticsearch ☆152 Updated 9 years ago
- Solrstrap is a Query-Result interface for Solr written in JavaScript, HTML and CSS ☆87 Updated 8 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆55 Updated last year
- This is an old repo, for latest maintained version go here - https://github.com/socioboard/Socioboard-Core-3.0 ☆245 Updated 8 years ago