Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories, such as search engines.
☆197 · Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below:
- A set of reusable Java components that implement functionality common to any web crawler ☆251 · Updated last week
- A scalable, mature and versatile web crawler based on Apache Storm ☆953 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆223 · Updated 3 years ago
- Open-source Enterprise Grade Search Engine Software ☆512 · Updated 3 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆419 · Updated 2 years ago
- High-security graph database ☆64 · Updated 3 years ago
- Fast in-memory graph structure, powering Gephi ☆76 · Updated 3 weeks ago
- The LAW next generation crawler. ☆90 · Updated 4 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 · Updated last year
- Common web archive utility code. ☆57 · Updated 3 weeks ago
- A curated list of Awesome Apache Solr links and resources. ☆110 · Updated 4 years ago
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆133 · Updated 5 months ago
- Carrot2: Text Clustering Algorithms and Applications ☆839 · Updated last month
- Java library for reading and writing WARC files with a typed API ☆52 · Updated this week
- TinkerPop3 Graph Structure Implementation for OrientDB ☆94 · Updated this week
- Mirror of Apache ManifoldCF ☆80 · Updated 2 months ago
- Apache OpenNLP Sandbox ☆45 · Updated this week
- ModeShape is a distributed, hierarchical, transactional, and consistent data store with support for queries, full-text search, events, ve… ☆220 · Updated 2 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆72 · Updated last year
- Browser-driven explorer for lucene indexes ☆74 · Updated 4 years ago
- Java library for parsing semi-structured text files ☆65 · Updated 4 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 · Updated 7 years ago
- Carrot2 plugin for ElasticSearch ☆294 · Updated 2 years ago
- Silk is a port of the Kibana 4 project. ☆70 · Updated 9 years ago
- Java access to Neo4J graph databases at multiple levels of abstraction ☆84 · Updated 5 years ago
- Distributed processing framework for search solutions ☆82 · Updated 3 years ago
- Constellio 8 ☆23 · Updated 4 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆55 · Updated 2 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆177 · Updated 2 years ago
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆34 · Updated 2 years ago