Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories, such as search engines.
☆196 · Updated 2 weeks ago
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆251 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆223 · Updated 3 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆419 · Updated 2 years ago
- The LAW next generation crawler. ☆90 · Updated 4 years ago
- A scalable, mature and versatile web crawler based on Apache Storm ☆956 · Updated this week
- Apache OpenNLP Sandbox ☆46 · Updated this week
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 · Updated last year
- Java library for parsing semi-structured text files ☆65 · Updated 4 years ago
- Java library for reading and writing WARC files with a typed API ☆52 · Updated 3 weeks ago
- High-security graph database ☆64 · Updated 3 years ago
- Common web archive utility code. ☆59 · Updated last week
- Apache Solr interpreter for Apache Zeppelin ☆29 · Updated 2 years ago
- Extra pluggable modules for Apache MetaModel (but licensed with LGPL) ☆17 · Updated 4 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆72 · Updated last year
- A curated list of Awesome Apache Solr links and resources. ☆110 · Updated 4 years ago
- Fast in-memory graph structure, powering Gephi ☆79 · Updated this week
- Mirror of Apache ManifoldCF ☆80 · Updated last week
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆283 · Updated 7 years ago
- API definition, resources and reference implementation of URL Frontiers ☆56 · Updated last month
- Distributed processing framework for search solutions ☆82 · Updated 3 years ago
- ModeShape is a distributed, hierarchical, transactional, and consistent data store with support for queries, full-text search, events, ve… ☆221 · Updated 3 years ago
- Browser-driven explorer for Lucene indexes ☆74 · Updated 4 years ago
- Solr Redis Extensions ☆53 · Updated last year
- Storm / Solr Integration ☆19 · Updated last year
- Java text categorization system ☆57 · Updated 8 years ago
- A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and … ☆48 · Updated 4 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆112 · Updated 3 years ago
- Core API for Silverpeas ☆52 · Updated last week
- Document Ingestion Framework for Search Systems ☆37 · Updated 3 weeks ago
- Carrot2: Text Clustering Algorithms and Applications ☆841 · Updated last month