Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and storing it in various data repositories such as search engines.
☆188 · Updated this week
Alternatives and similar repositories for crawlers:
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆244 · Updated 3 weeks ago
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆213 · Updated 2 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆34 · Updated 6 months ago
- Common web archive utility code. ☆55 · Updated last month
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆56 · Updated 4 years ago
- High-security graph database ☆62 · Updated 2 years ago
- Silk is a port of the Kibana 4 project. ☆70 · Updated 8 years ago
- TinkerPop3 Graph Structure Implementation for OrientDB ☆94 · Updated last week
- Java library for reading and writing WARC files with a typed API ☆48 · Updated 4 months ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 8 years ago
- The LAW next generation crawler. ☆87 · Updated 3 years ago
- Fast in-memory graph structure, powering Gephi ☆75 · Updated 5 months ago
- Solr Query Segmenter for structuring unstructured queries ☆21 · Updated 3 years ago
- A text tagger based on Lucene / Solr, using FST technology ☆176 · Updated last year
- An ORM / OGM for the TinkerPop graph stack. ☆137 · Updated 2 years ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 · Updated last year
- Carrot2 plugin for ElasticSearch ☆291 · Updated 2 years ago
- Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark. ☆415 · Updated 2 years ago
- Browser-driven explorer for lucene indexes ☆74 · Updated 3 years ago
- Java access to Neo4J graph databases at multiple levels of abstraction ☆85 · Updated 4 years ago
- Open Source, Distributed, Big Data Enterprise Search Engine ☆69 · Updated last month
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆281 · Updated 7 years ago
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆116 · Updated last year
- Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text. ☆34 · Updated last year
- ☆28 · Updated 8 years ago
- Neo4j JDBC driver ☆69 · Updated last year
- A Solr browser and administration tool ☆27 · Updated 4 years ago
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆22 · Updated 7 months ago
- A parallel download manager for web scraping that supports proxy servers ☆9 · Updated 3 weeks ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 · Updated 5 years ago