Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories such as search engines.
☆190 Updated 3 weeks ago
Alternatives and similar repositories for crawlers
Users who are interested in crawlers are comparing it to the libraries listed below.
- A set of reusable Java components that implement functionality common to any web crawler ☆244 Updated this week
- A scalable, mature and versatile web crawler based on Apache Storm ☆920 Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆216 Updated 2 years ago
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆22 Updated 9 months ago
- The Apache Gora open source framework provides an in-memory data model and persistence for big data. ☆121 Updated last year
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆96 Updated 8 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆34 Updated last month
- Apache OpenNLP Sandbox ☆43 Updated this week
- API definition, resources and reference implementation of URL Frontiers ☆50 Updated 2 weeks ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 Updated 9 years ago
- An open-source data management platform for knowledge workers (https://github.com/dswarm/dswarm-documentation/wiki) ☆54 Updated 7 years ago
- ☆28 Updated 9 years ago
- The next generation of open source search ☆92 Updated 8 years ago
- SKOS Support for Apache Lucene and Solr ☆56 Updated 4 years ago
- WARC (Web Archive) Input and Output Formats for Hadoop ☆35 Updated 10 years ago
- Solr AutoComplete implementation ☆59 Updated 7 years ago
- Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. ☆282 Updated 7 years ago
- Java access to Neo4J graph databases at multiple levels of abstraction ☆85 Updated 4 years ago
- Mirror of Apache Stanbol (incubating) ☆112 Updated last year
- Java/JNI bindings to libpostal for fast international street address parsing/normalization ☆120 Updated last month
- Mirror of Apache James Mime4j ☆54 Updated 3 months ago
- High-security graph database ☆64 Updated 3 years ago
- Browser-driven explorer for lucene indexes ☆74 Updated 3 years ago
- Skywalker for Elasticsearch is like Luke for Lucene ☆79 Updated 5 years ago
- Example of running MDX on Druid via Mondrian and Calcite ☆26 Updated 8 years ago
- Geographic Place, Date/time, and Pattern entity extraction toolkit along with text extraction from unstructured data and GIS outputters. ☆44 Updated last week
- BatchRefine adds batch processing capabilities to OpenRefine ☆50 Updated 8 years ago
- Silk is a port of the Kibana 4 project. ☆70 Updated 9 years ago
- Some code to deduce an OS/Platform/Browser out of a user-agent string ☆53 Updated 7 years ago
- Implementation of Norconex Committer for Elasticsearch. ☆11 Updated 3 years ago