Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
☆188 · Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆244 · Updated 3 weeks ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆34 · Updated 7 months ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆215 · Updated 2 years ago
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆96 · Updated 7 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 9 years ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆907 · Updated this week
- Suite of tools for detecting changes in web pages and their rendering ☆54 · Updated last year
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆415 · Updated 2 years ago
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆22 · Updated 7 months ago
- Carrot2 plugin for ElasticSearch ☆291 · Updated 2 years ago
- Common web archive utility code. ☆55 · Updated 2 months ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 4 years ago
- Implementation of Norconex Committer for Elasticsearch. ☆11 · Updated 3 years ago
- The LAW next generation crawler. ☆87 · Updated 3 years ago
- Java library for reading and writing WARC files with a typed API ☆48 · Updated 4 months ago
- A port of the arclabs 'readability' package to Java ☆72 · Updated 12 years ago
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆117 · Updated this week
- A text tagger based on Lucene / Solr, using FST technology ☆176 · Updated last year
- Palladian is a Java-based toolkit with functionality for text processing, classification, information extraction, and data retrieval from… ☆38 · Updated this week
- Fast in-memory graph structure, powering Gephi ☆75 · Updated this week
- Solr Query Segmenter for structuring unstructured queries ☆21 · Updated 4 years ago
- ACHE is a web crawler for domain-specific search. ☆468 · Updated last year
- Automatic, zero-config web scraping -- written in Java, has no dependency on Java EE or app servers, and the web scraper has a restful/JS… ☆155 · Updated 7 years ago
- Repackaging of Boilerpipe published on Maven Central Repository. ☆53 · Updated last year
- Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that … ☆32 · Updated 2 years ago
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 · Updated 3 years ago
- Distributed Realtime Search with Lucene and MongoDB ☆59 · Updated 7 years ago
- Command line tool for Apache Lucene ☆162 · Updated last month
- Web Crawler for Elasticsearch ☆235 · Updated 5 years ago
- ☆66 · Updated 8 years ago