Norconex / crawlers
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories, such as search engines.
☆181 · Updated this week
Related projects:
- A set of reusable Java components that implement functionality common to any web crawler ☆233 · Updated last month
- A scalable, mature and versatile web crawler based on Apache Storm ☆879 · Updated this week
- The Common Crawl Crawler Engine and Related MapReduce code (2008-2012) ☆214 · Updated last year
- Common web archive utility code. ☆50 · Updated last week
- Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration. ☆96 · Updated 7 years ago
- Java library for reading and writing WARC files with a typed API ☆46 · Updated 2 months ago
- Open-source Enterprise Grade Search Engine Software ☆499 · Updated 2 years ago
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what… ☆32 · Updated last year
- Suite of tools for detecting changes in web pages and their rendering ☆53 · Updated 9 months ago
- A text tagger based on Lucene / Solr, using FST technology ☆173 · Updated 9 months ago
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop ☆55 · Updated 3 years ago
- Open Source, Distributed, Big Data Enterprise Search Engine ☆68 · Updated this week
- API definition, resources and reference implementation of URL Frontiers ☆44 · Updated last week
- Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to netw… ☆21 · Updated last year
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 8 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆70 · Updated 5 months ago
- Solr Query Segmenter for structuring unstructured queries ☆21 · Updated 3 years ago
- Distributed processing framework for search solutions ☆81 · Updated last year
- Java text categorization system ☆54 · Updated 7 years ago
- DKPro JWPL (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all informat… ☆81 · Updated 7 months ago
- Storm / Solr Integration ☆19 · Updated 7 months ago
- Solr Redis Extensions ☆52 · Updated 7 months ago
- Apache OpenNLP Sandbox ☆42 · Updated this week
- Browser-driven explorer for lucene indexes ☆72 · Updated 3 years ago
- WARC and ARC indexing and discovery tools. ☆114 · Updated last month
- Carrot2: Text Clustering Algorithms and Applications ☆764 · Updated last week
- Java port of langid.py (language identifier) ☆28 · Updated 11 years ago
- A port of the arclabs 'readability' package to Java ☆72 · Updated 12 years ago