ept/warc-hadoop

WARC (Web Archive) Input and Output Formats for Hadoop

☆38

Alternatives and similar repositories for warc-hadoop

Users that are interested in warc-hadoop are comparing it to the libraries listed below.

Zarkonnen / Longan
A flexible pure-Java OCR implementation. Eventually.
☆20 · Jan 2, 2015 · Updated 11 years ago
ianmilligan1 / Historian-WARC-1
The Historian's WARC Toolkit
☆16 · May 14, 2015 · Updated 11 years ago
g-farrow / boto3_batch_utils
A Python library to simplify batch requests to AWS Services
☆12 · Apr 25, 2020 · Updated 6 years ago
aurbroszniowski / Rainfall-core
Rainfall is an extensible Java framework to implement custom DSL-based stress and performance tests
☆12 · Mar 31, 2026 · Updated 3 months ago
perlancar / perl-Org-Parser
☆25 · Jul 17, 2024 · Updated 2 years ago
OlivierBlanvillain / crawler
Blog crawler for the blogforever project.
☆23 · Jan 31, 2014 · Updated 12 years ago
AnthonyMRios / Sentiment-Classification-Example
IPython Notebook for Sentiment Classification
☆10 · Nov 12, 2014 · Updated 11 years ago
joshlong-attic / activiti-examples
☆19 · Feb 7, 2016 · Updated 10 years ago
trec-core / 2017
TREC Core track
☆11 · Jul 5, 2017 · Updated 9 years ago
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆259 · Jul 2, 2026 · Updated 2 weeks ago
marcoscoffier / lua---opencv
bindings to some parts of opencv to lua+torch
☆15 · Feb 14, 2013 · Updated 13 years ago
thesurlydev / cdk-kotlin-example
A simple CDK app written in Kotlin using Gradle DSL
☆12 · Dec 28, 2018 · Updated 7 years ago
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 · Feb 15, 2017 · Updated 9 years ago
kboom / iga-adi-sm
The shared memory version of the Alternating Directions Implicit Solver for Isogeometric Analysis
☆10 · Jan 26, 2019 · Updated 7 years ago
vitchyr / torch-rl
A reinforcement learning package implemented in Torch
☆11 · Jan 24, 2016 · Updated 10 years ago
thammegowda / tika-ner-corenlp
Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser
☆13 · Feb 26, 2022 · Updated 4 years ago
altfatterz / spring-cloud-dataflow-streaming-example
Spring Cloud Data Flow Streaming Example
☆10 · Mar 17, 2018 · Updated 8 years ago
mndrix / jolog
Concurrent and distributed Prolog via join patterns (join calculus)
☆12 · Mar 10, 2015 · Updated 11 years ago
spring-tips / reactive-mysql-with-jasync-and-r2dbc
Hi Spring fans! Welcome to another super short mid-season interregnum installment of Spring Tips in which I introduce a *super* prelimina…
☆12 · Mar 21, 2019 · Updated 7 years ago
suzaku-io / arteria
Arteria is a high performance message channel system for IPC and network communication
☆12 · Jun 21, 2017 · Updated 9 years ago
graphcommons / gc-instagram
Generate a graph on Graph Commons from Instagram activity
☆10 · Jan 25, 2016 · Updated 10 years ago
lintool / warcbase
Warcbase is an open-source platform for managing and analyzing web archives
☆162 · Dec 8, 2017 · Updated 8 years ago
horsfieldsa / exif-extractor
Lambda Function to extract EXIF data from images uploaded to an S3 bucket and store it in DynamoDB.
☆15 · Aug 17, 2018 · Updated 7 years ago
trecrts / trecrts-eval
TREC Real-Time Summarization Tools
☆15 · Jul 19, 2017 · Updated 9 years ago
spring-attic / spring-data-aerospike
Spring Data Aerospike
☆37 · Jan 30, 2020 · Updated 6 years ago
treasure-data / Lead-List-from-CrunchBase-
☆10 · Aug 21, 2015 · Updated 10 years ago
daveoncode / django-easy-currencies
Simple app to manage currency conversion in Django using the openexchangerates.org service.
☆10 · Nov 17, 2014 · Updated 11 years ago
SerezD / ffcv_pytorch_lightning
[FFCV-PL] manage fast data loading with ffcv and pytorch lightning
☆16 · Jul 17, 2023 · Updated 3 years ago
pietvandongen / pure-bliss-with-pure-java-functions
This is the source code accompanying my blog post explaining the upside of using pure functions in Java.
☆11 · Nov 5, 2020 · Updated 5 years ago
mllite / pytorch2sql
Deep Learning (PyTorch) Models Deployment using SQL databases
☆10 · Jul 25, 2021 · Updated 4 years ago
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago
trec-web / trec-web-2014
☆16 · Aug 8, 2014 · Updated 11 years ago
th3sys / capsule
TWS Market Data Adapter
☆20 · May 10, 2018 · Updated 8 years ago
tomheon / kafkafs
A FUSE module to expose a Kafka cluster in the filesystem.
☆19 · Feb 3, 2014 · Updated 12 years ago
ximagination80 / Comparator
Matcher for JSON and JSON templates. Can help you with testing of REST APIs, databases, third-party systems, etc.
☆15 · Apr 24, 2021 · Updated 5 years ago
graalvm / graal-js-archetype
☆24 · Jul 13, 2022 · Updated 4 years ago
fnp / pylucene
PyLucene with our patches
☆15 · Apr 11, 2012 · Updated 14 years ago
r-spark / sparkwarc
Load WARC files into Apache Spark with sparklyr
☆12 · Jan 11, 2022 · Updated 4 years ago
axel22 / scalacheck-tutorial
A short ScalaCheck tutorial for the Programming Principles course
☆14 · Oct 4, 2021 · Updated 4 years ago