crawler-commons/url-frontier

API definition, resources and reference implementation of URL Frontiers

☆63

Alternatives and similar repositories for url-frontier

Users that are interested in url-frontier are comparing it to the libraries listed below.

DigitalPebble / stormcrawler-docker
Resources for running StormCrawler with Docker services
☆10 • Nov 10, 2024 • Updated last year
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆259 • Jul 2, 2026 • Updated 2 weeks ago
apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆986 • Updated this week
GiovanniTRA / UDCG
Code and Data of the paper: "Redefining Retrieval Evaluation in the Era of LLMs"
☆15 • Oct 27, 2025 • Updated 8 months ago
internetarchive / Sparkling
Internet Archive's Sparkling Data Processing Library
☆17 • May 4, 2026 • Updated 2 months ago
RovoMe / JIRLbot
Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and D…
☆17 • May 25, 2017 • Updated 9 years ago
harvard-lil / gitspoke
Download GitHub repositories
☆13 • May 10, 2025 • Updated last year
tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆22 • Nov 15, 2022 • Updated 3 years ago
chfoo / huhhttp
An evil web server.
☆13 • May 9, 2015 • Updated 11 years ago
DigitalPebble / TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and …
☆48 • Sep 24, 2021 • Updated 4 years ago
apache / opennlp-models
Apache OpenNLP Models
☆16 • Updated this week
k-int / gokb-phase1
Original GOKb repo - Moving to https://github.com/openlibraryenvironment/gokb
☆11 • Jan 23, 2018 • Updated 8 years ago
thakur-nandan / income
INCOME: An Easy Repository for Training and Evaluation of Index Compression Methods in Dense Retrieval. Includes BPR and JPQ.
☆24 • Sep 24, 2023 • Updated 2 years ago
internetarchive / trough
Trough: Big data, small databases.
☆43 • Jul 25, 2024 • Updated last year
ablwr / media-collection-viewer
visualizations/charts for media collections, based on mediainfo
☆14 • Sep 15, 2022 • Updated 3 years ago
MicroAIInc / MicroAI-Security-and-Monitoring
☆59 • Dec 11, 2025 • Updated 7 months ago
lucidworks / storm-solr
Storm / Solr Integration
☆19 • Feb 2, 2024 • Updated 2 years ago
georgi / kontrol
Kontrol is a small web framework written in Ruby, which runs directly on Rack.
☆19 • Jan 26, 2013 • Updated 13 years ago
commoncrawl / cc-citations
Scientific articles using or citing Common Crawl data
☆29 • Jul 8, 2026 • Updated last week
Scicrop / MftReader
MftReader is a Command-Line interface (CLI) program which reads the Master File Table (MFT) from NTFS volume. (C# Implementation with PIn…
☆14 • Sep 13, 2018 • Updated 7 years ago
Laz4rz / matryoshka
Implementation of "Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions"
☆25 • Aug 27, 2024 • Updated last year
tomitribe / jackknife
A Maven plugin for inspecting, decompiling, and instrumenting Java jar dependencies
☆22 • Mar 27, 2026 • Updated 3 months ago
apache / logging-chainsaw
Apache Chainsaw is a GUI log viewer
☆24 • May 19, 2026 • Updated 2 months ago
joshua-decoder / thrax
Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation
☆15 • Dec 2, 2016 • Updated 9 years ago
RichardLitt / Quick-tips-for-making-your-software-outlive-your-job
The paper repository for "10 quick tips for making your software outlive your job"
☆20 • Oct 28, 2025 • Updated 8 months ago
WebarchivCZ / Seeder
Seeder - Czech webarchive curating tool and public site
☆17 • Feb 12, 2026 • Updated 5 months ago
commoncrawl / language-detection-cld2
Natural language detection, Java bindings for CLD2
☆17 • Feb 26, 2026 • Updated 4 months ago
iipc / webarchive-commons
Common web archive utility code.
☆65 • Jul 3, 2026 • Updated 2 weeks ago
pipauwel / ifcParserLib
ifcParserLib is a set of reusable Java components that implement functionality for IFC file parsing.
☆10 • Oct 14, 2020 • Updated 5 years ago
cvanweelden / sequence_labeling_example
A bidirectional LSTM example for sequence labeling.
☆13 • May 23, 2018 • Updated 8 years ago
lucidimagination / Prism
Solr and LucidWorks search user interface
☆16 • Apr 29, 2013 • Updated 13 years ago
dodeeric / omeka-s-docker
Omeka-S in Docker containers.
☆20 • Jan 18, 2022 • Updated 4 years ago
PREMIS-OWL-Revision-Team / premis-owl
Repository for revision of PREMIS OWL ontology group
☆13 • May 12, 2022 • Updated 4 years ago
dragnet-org / dragnet_data
code and data used to build a training dataset for dragnet models
☆10 • Nov 29, 2020 • Updated 5 years ago
lintool / IR-Reproducibility
Open-Source Information Retrieval Reproducibility Challenge
☆51 • Jan 11, 2016 • Updated 10 years ago
HPI-Information-Systems / Mondrian
Code repository for Mondrian, a project for multiregion template recognition in spreadsheets.
☆14 • May 25, 2022 • Updated 4 years ago
DataResponsibly / MirrorDataGenerator
MirrorDataGenerator is a python tool that generates synthetic data based on user-specified causal relations among features in the data. I…
☆25 • Jun 22, 2022 • Updated 4 years ago
manics / jupyter-notebookparams
Takes query parameters from a url to create the first cell of a jupyter notebook.
☆17 • Nov 13, 2024 • Updated last year
inveniosoftware-contrib / citadel-search
Citadel: Enterprise Search
☆15 • May 2, 2023 • Updated 3 years ago