commoncrawl/nutch

Common Crawl fork of Apache Nutch

☆42

Alternatives and similar repositories for nutch

Users interested in nutch are comparing it to the libraries listed below.

commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Updated this week
markovi / LiDR
Library for Distributed Retrieval
☆15 · Feb 21, 2014 · Updated 12 years ago
hopsparser / hopsparser
A neural dependency parser that does its best
☆17 · Mar 6, 2026 · Updated 4 months ago
logui-framework / server
The server component of LogUI, a framework-agnostic JavaScript library for logging user interactions on webpages.
☆17 · Feb 3, 2022 · Updated 4 years ago
ayoubfaouzi / workspider
Automate job application
☆12 · Apr 14, 2017 · Updated 9 years ago
biggorilla-gh / koko
Extracting Entities with Limited Evidence
☆16 · Dec 26, 2022 · Updated 3 years ago
ellej16 / SumMe
An Abstractive summarizer for online news articles.
☆18 · Mar 25, 2015 · Updated 11 years ago
gsh199449 / DistributedCrawler
Maven version of DistributeCrawler
☆10 · Jun 20, 2022 · Updated 4 years ago
Bookworm-project / Docs
Documentation for Bookworm: particularly focusing on creation aspects -
☆10 · Aug 26, 2016 · Updated 9 years ago
USCDataScience / AgePredictor
Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum
☆18 · Jul 1, 2022 · Updated 4 years ago
iai-group / nordlys
Nordlys: Toolkit for entity-oriented and semantic search
☆31 · Mar 23, 2021 · Updated 5 years ago
informagi / GeeseDB
Graph Engine for Exploration and Search
☆42 · Jan 26, 2024 · Updated 2 years ago
jiayun / akka_samples
☆10 · Feb 26, 2019 · Updated 7 years ago
mitre / callisto
☆16 · Feb 5, 2014 · Updated 12 years ago
kbirken / xtendency
A collection of neat tools related to the Xtend language.
☆10 · Feb 16, 2015 · Updated 11 years ago
caarlos0-graveyard / github-vacations
Automagically ignore all notifications related to work when you are on vacations
☆21 · Aug 21, 2020 · Updated 5 years ago
speedment / speedment-code-samples
Code samples for the Speedment ORM
☆13 · Jun 21, 2022 · Updated 4 years ago
GalkonLtd / JProxyChecker
A free multithreaded proxy checking program written in Java. Load a proxy list and check each proxy to verify it's alive to create a new …
☆11 · Nov 5, 2015 · Updated 10 years ago
rjagerman / shoelace
Neural Learning to Rank using Chainer
☆31 · Jun 29, 2020 · Updated 6 years ago
kermitt2 / arxiv_harvester
Poor man's simple harvester for arXiv resources
☆14 · Jul 14, 2023 · Updated 3 years ago
dgleich / libbvg
A C implementation of a Boldi-Vigna graph decompressor
☆17 · Jul 5, 2016 · Updated 10 years ago
dgnsrekt / requests-whaor
For the filthiest web scrapers that have no time for rate-limits.
☆19 · Oct 11, 2020 · Updated 5 years ago
BBC-archive / enzyme-adapter-inferno
Inferno enzyme adapter
☆16 · Oct 21, 2018 · Updated 7 years ago
ianozsvald / pycon2013_applied_parallel_computing
Applied Parallel Computing tutorial material for PyCon 2013 (Minesh Amin, Ian Ozsvald)
☆17 · Apr 2, 2013 · Updated 13 years ago
scrapinghub / arche
Analyze scraped data
☆47 · Dec 9, 2019 · Updated 6 years ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆86 · Jul 13, 2026 · Updated last week
romankierzkowski / langner
Langner - Programing Language for Expressing Strategies
☆16 · Oct 5, 2016 · Updated 9 years ago
commoncrawl / cc-webgraph
Tools to construct and process Common Crawl webgraphs
☆111 · Updated this week
izerui / spring-boot-actuator-monitor
Monitoring platform based on Spring Boot
☆11 · Jun 17, 2015 · Updated 11 years ago
vthib / tlsh
Rust port of TLSH
☆14 · Oct 12, 2025 · Updated 9 months ago
RitikMody / Coursearch
A one stop solution to navigate the endless sea of online courses.
☆10 · Oct 17, 2021 · Updated 4 years ago
mariuszs / hessian-boot-example
Spring Boot Web with Hessian
☆11 · Jul 2, 2014 · Updated 12 years ago
oaqa / cse-framework
Configuration Space Exploration Framework
☆16 · Oct 13, 2020 · Updated 5 years ago
momer / nutch-selenium-grid-plugin
A Nutch 2.2.1 plugin which allows users to shuffle off the responsibility for retrieving pages to a selenium hub/node spoke system. This …
☆16 · Jun 9, 2016 · Updated 10 years ago
GaoleMeng / ActiveLearningAnnotationTool
An active annotation tool based on brat (https://github.com/nlplab/brat)
☆19 · Aug 22, 2017 · Updated 8 years ago
cyberisltd / ProxyDetect
Perl script to detect the existence of transparent proxies
☆20 · Jun 24, 2013 · Updated 13 years ago
alviano / python
My collection of Python tools!
☆11 · Jan 27, 2026 · Updated 5 months ago
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago