Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

☆197

Alternatives and similar repositories for crawlers

Users who are interested in crawlers are comparing it to the libraries listed below.

crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆250 Updated last week
apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆950 Updated last week
commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆222 Updated 2 years ago
jaeksoft / opensearchserver
Open-source Enterprise Grade Search Engine Software
☆512 Updated 3 years ago
USCDataScience / sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
☆419 Updated 2 years ago
flaxsearch / marple
Browser-driven explorer for lucene indexes
☆74 Updated 4 years ago
apache / opennlp-sandbox
Apache OpenNLP Sandbox
☆45 Updated this week
nsoft / jesterj
Document Ingestion Framework for Search Systems
☆37 Updated 2 months ago
Anant / awesome-solr
A curated list of Awesome Apache Solr links and resources.
☆110 Updated 4 years ago
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆50 Updated 2 months ago
orientechnologies / orientdb-gremlin
TinkerPop3 Graph Structure Implementation for OrientDB
☆94 Updated 3 weeks ago
gephi / graphstore
Fast in-memory graph structure, powering Gephi
☆75 Updated this week
visallo / vertexium
High-security graph database
☆64 Updated 3 years ago
apache / gora
The Apache Gora open source framework provides an in-memory data model and persistence for big data.
☆121 Updated last year
apache / manifoldcf
Mirror of Apache ManifoldCF
☆80 Updated 2 months ago
sweble / sweble-wikitext
The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW…
☆72 Updated last year
zuliaio / zuliasearch
Zulia Search Engine
☆33 Updated 2 weeks ago
carrot2 / carrot2
Carrot2: Text Clustering Algorithms and Applications
☆836 Updated 2 weeks ago
querqy / querqy
Query preprocessor for Java-based search engines (Querqy Core and Lucene implementation)
☆189 Updated this week
Syncleus / Ferma
An ORM / OGM for the TinkerPop graph stack.
☆139 Updated 3 years ago
iipc / webarchive-commons
Common web archive utility code.
☆57 Updated this week
lucidworks / silk
Silk is a port of the Kibana 4 project.
☆70 Updated 9 years ago
crawler-commons / url-frontier
API definition, resources and reference implementation of URL Frontiers
☆54 Updated 3 weeks ago
Findwise / Hydra
Distributed processing framework for search solutions
☆82 Updated 2 years ago
sematext / query-segmenter
Solr Query Segmenter for structuring unstructured queries
☆22 Updated 4 years ago
behas / lucene-skos
SKOS Support for Apache Lucene and Solr
☆56 Updated 4 years ago
francelabs / datafari
Open Source, Distributed, Big Data Enterprise Search Engine
☆90 Updated 2 weeks ago
LAW-Unimi / BUbiNG
The LAW next generation crawler.
☆89 Updated 4 years ago
bejean / crawl-anywhere
Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
☆98 Updated 8 years ago
openvenues / jpostal
Java/JNI bindings to libpostal for fast international street address parsing/normalization
☆130 Updated 5 months ago