bejean / crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
☆96 Updated 7 years ago
Alternatives and similar repositories for crawl-anywhere:
Users who are interested in crawl-anywhere are comparing it to the libraries listed below:
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 8 years ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 Updated 2 years ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 Updated 3 years ago
- ☆28 Updated 8 years ago
- Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm… ☆49 Updated 12 years ago
- ☆18 Updated 8 years ago
- Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi… ☆187 Updated this week
- A Python library to detect and extract listing data from an HTML page. ☆108 Updated 7 years ago
- Crawler for YouTube ☆48 Updated 11 years ago
- Automatic, zero-config web scraping -- written in Java, has no dependency on Java EE or app servers, and the web scraper has a restful/JS… ☆155 Updated 7 years ago
- A tool for managing website extraction configs ☆37 Updated 11 years ago
- Recommendations Serving Engine using Python ☆28 Updated 9 years ago
- Algorithmic summarizer for RSS/Atom Feeds, Web Urls and arbitrary text. Codebase for the application deployed at http://tldrzr.herokuapp.… ☆53 Updated 8 years ago
- ☆66 Updated 8 years ago
- A pure JavaScript frontend for ElasticSearch search indices. ☆79 Updated 7 years ago
- API Hub is a web UI for browsing and searching a catalog of Rest.li APIs. ☆74 Updated 5 years ago
- A platform for backing crowdsourcing websites, built in golang for elasticsearch ☆360 Updated 4 years ago
- OpenBlock is a web application and RESTful service that allows users to browse and search their local area for "hyper-local news" ☆61 Updated 3 years ago
- Crawler is a bare-bones spider designed to quickly and effectively build an index of all files and pages on a given Web site as well as t… ☆91 Updated 12 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- FreebaseAPI is a library to use the Freebase API (data mapper + low level API) ☆42 Updated 10 years ago
- Term List Matching Plugin for ElasticSearch ☆26 Updated 11 years ago
- ☆18 Updated 9 years ago
- Personalized Recommendations helps you find the best place to go on vacations by using the Concept Insights and Tradeoff Analytics Watso… ☆16 Updated 9 years ago
- Screenshot as a service with phantomJS headless browser ☆58 Updated 11 years ago
- Compile Yahoo! Pipes to Javascript (Node.js) ☆44 Updated 12 years ago
- Approve or reject statements from third-party datasets ☆146 Updated 6 years ago
- ☆223 Updated 9 years ago
- A language detection Web Service ☆53 Updated 7 years ago
- Skeleton for Meetup - Building your own recommendation engine in an hour ☆29 Updated 3 years ago