Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
☆191 · Updated this week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆244 · Updated last week
- A scalable, mature and versatile web crawler based on Apache Storm ☆923 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆216 · Updated 2 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆416 · Updated 2 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 · Updated last year
- Apache OpenNLP Sandbox ☆43 · Updated this week
- Fast in-memory graph structure, powering Gephi ☆75 · Updated 3 weeks ago
- Browser-driven explorer for Lucene indexes ☆74 · Updated 3 years ago
- ModeShape is a distributed, hierarchical, transactional, and consistent data store with support for queries, full-text search, events, ve… ☆217 · Updated 2 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆34 · Updated this week
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆125 · Updated 3 weeks ago
- The LAW next generation crawler. ☆87 · Updated 3 years ago
- High-security graph database ☆64 · Updated 3 years ago
- Mirror of Apache OpenNLP Add-ons ☆17 · Updated last week
- API definition, resources and reference implementation of URL Frontiers ☆50 · Updated 2 weeks ago
- Java library for parsing semi-structured text files ☆65 · Updated 3 years ago
- The next generation of open source search ☆92 · Updated 8 years ago
- restSQL service and core framework ☆146 · Updated 6 years ago
- Zulia Search Engine ☆33 · Updated last week
- Extra pluggable modules for Apache MetaModel (but licensed with LGPL) ☆17 · Updated 3 years ago
- Carrot2: Text Clustering Algorithms and Applications ☆814 · Updated this week
- Solr Redis Extensions ☆53 · Updated last year
- Palladian is a Java-based toolkit with functionality for text processing, classification, information extraction, and data retrieval from… ☆39 · Updated this week
- Scriptella is an open source ETL (Extract-Transform-Load) and script execution tool written in Java. Note: The project is no longer under… ☆107 · Updated 2 months ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆282 · Updated 7 years ago
- A curated list of Awesome Apache Solr links and resources. ☆109 · Updated 3 years ago
- Crawljax ☆526 · Updated last year
- XML for Analysis (XMLA) server based upon an olap4j connection ☆23 · Updated 8 years ago
- Apache NiFi Custom Processor Extracting Text From Files with Apache Tika ☆35 · Updated last year
- ☆28 · Updated 9 years ago