Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
☆198Mar 2, 2026Updated last week
Alternatives and similar repositories for crawlers
Users who are interested in crawlers compare it to the libraries listed below.
- Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, what…☆34Feb 21, 2026Updated 2 weeks ago
- FoGFaaS: Add serverless computing (FaaS) to iFogSim☆22Mar 30, 2025Updated 11 months ago
- A scalable, mature and versatile web crawler based on Apache Storm☆969Updated this week
- Test resources support☆11Updated this week
- Farm Animal Tracking and Breeding Project☆15May 28, 2025Updated 9 months ago
- Collection of Singularity build files and scripts to create them for popular Linux Distributions☆10Jun 23, 2022Updated 3 years ago
- Named Entity Recognition and Pattern Mining☆22Mar 10, 2020Updated 5 years ago
- Generic library shared between several projects.☆14Feb 23, 2026Updated 2 weeks ago
- In this very simple Docker Swarm demo, we create Docker hosts with Docker Machine and then install a small Elasticsearch cluster.☆12Jul 31, 2016Updated 9 years ago
- My dotfiles for zsh, vim, git, mintty and more.☆17Updated this week
- Code for the paper Faster Phrase-Based Decoding by Refining Feature State☆14Jan 9, 2023Updated 3 years ago
- Spark SQL online editor☆13Dec 11, 2022Updated 3 years ago
- Open-source Enterprise Grade Search Engine Software☆513Sep 3, 2022Updated 3 years ago
- Spring Boot Web with Hessian☆11Jul 2, 2014Updated 11 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Topic Model or LDA in Cython☆21Apr 9, 2011Updated 14 years ago
- Mirror of Apache Corinthia (Incubating)☆16Apr 28, 2017Updated 8 years ago
- Fureteur is a simple, configurable, fault-tolerant web crawler written in Scala☆28Oct 14, 2014Updated 11 years ago
- Fetch and insert AI-generated summaries of web content. Combine with Send To Kindle for quick summaries and full articles. Support for Mi…☆23Nov 30, 2025Updated 3 months ago
- A set of reusable Java components that implement functionality common to any web crawler☆254Feb 26, 2026Updated last week
- HMAC authentication for RESTful web applications☆54Dec 5, 2024Updated last year
- Text Simplification System and Dataset☆15Jul 19, 2017Updated 8 years ago
- ☆17May 25, 2015Updated 10 years ago
- This is a Java library which can be used to crawl the content of some of web properties (www.salesforce.com, blogs.salesforce.com for exa…☆25May 15, 2025Updated 9 months ago
- Example project for ANTLR tutorial blog post.☆25Sep 29, 2011Updated 14 years ago
- ☆17Updated this week
- Distributed scaffolding framework (summarized and organized)☆15Aug 27, 2015Updated 10 years ago
- Stream your data from any source. Build projections and aggregations. Parallelize. Distribute. Replay.☆17Dec 7, 2016Updated 9 years ago
- .NET controls that display multiple sub-controls without creating a unique window handle for each child. Instead each child is drawn usin…☆23Apr 6, 2023Updated 2 years ago
- modular NL platform for dialogue agents☆17Oct 26, 2017Updated 8 years ago
- Storm / Solr Integration☆19Feb 2, 2024Updated 2 years ago
- Fork of http://nlg.isi.edu/software/nplm/ for threadsafety and efficiency.☆18Nov 7, 2013Updated 12 years ago
- A lightweight Java configuration library☆54Nov 19, 2022Updated 3 years ago
- A 5 node zookeeper ensemble that runs in Docker☆17Dec 2, 2014Updated 11 years ago
- A DSL to build Lucene text queries in Python.☆38Jan 5, 2017Updated 9 years ago
- Transform unstructured document collections to structured Linked Data☆29Sep 12, 2025Updated 5 months ago
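- A programming framework abstracted for Java systems that implement complex business logic; it borrows design methods from domain modeling to make development cleaner and friendlier and to greatly improve the code's long-term maintainability☆24Aug 3, 2014Updated 11 years ago
- jw, short for java web: a simple, usable Java web framework modeled on Spring☆20Jun 20, 2022Updated 3 years ago
- WinForms development framework for the efwplus platform☆26Apr 24, 2016Updated 9 years ago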