centic9/CommonCrawlDocumentDownload


A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.
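As a rough illustration of the idea described above, the sketch below filters URL-index records by MIME type or file extension and derives the byte range that would be requested for each hit. It is not the tool's own code. It assumes index lines in the form `<urlkey> <timestamp> <json>` with JSON fields such as `url`, `mime`, `mime-detected`, `filename`, `offset` and `length`. The sample lines, the WARC file name and the function names are hypothetical.

```python
import json

# Hypothetical sample index lines (assumed layout: "<urlkey> <timestamp> <json>").
SAMPLE = [
    'com,example)/report.pdf 20240101000000 {"url": "https://example.com/report.pdf", '
    '"mime": "application/pdf", "status": "200", '
    '"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}',
    'com,example)/ 20240101000001 {"url": "https://example.com/", '
    '"mime": "text/html", "status": "200", '
    '"filename": "crawl-data/example.warc.gz", "offset": "150", "length": "30"}',
]


def parse_index_line(line):
    """Split one index line into its URL key, timestamp and JSON payload."""
    urlkey, timestamp, payload = line.split(" ", 2)
    record = json.loads(payload)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


def select_documents(lines, wanted_mimes=(), wanted_extensions=()):
    """Yield records whose MIME type or URL file extension is wanted."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = parse_index_line(line)
        # Prefer the detected MIME type when the record carries one.
        mime = record.get("mime-detected") or record.get("mime", "")
        # Ignore any query string when checking the extension.
        path = record.get("url", "").split("?", 1)[0].lower()
        if mime in wanted_mimes or path.endswith(tuple(wanted_extensions)):
            yield record


def byte_range_header(record):
    """Build an HTTP Range header covering exactly one archived record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range ends are inclusive
    return {"Range": f"bytes={start}-{end}"}


hits = list(select_documents(SAMPLE, wanted_mimes=("application/pdf",)))
for hit in hits:
    print(hit["url"], hit["filename"], byte_range_header(hit))
```

Each selected record names the archive file and the byte range to fetch, so only the matching documents need to be downloaded rather than whole archive files.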

☆74

Alternatives and similar repositories for CommonCrawlDocumentDownload

Users who are interested in CommonCrawlDocumentDownload are comparing it to the libraries listed below.

tballison / commoncrawl-fetcher-lite
Simplified version of a common crawl fetcher
☆16 · Dec 24, 2025 · Updated 6 months ago
chrismattmann / etllib
This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading …
☆18 · Jan 27, 2024 · Updated 2 years ago
file / file-tests
File-tests is a test suite for the File tool. Previous home: https://fedorahosted.org/file-tests/
☆21 · Jun 3, 2026 · Updated last month
edygert / runsc
runsc loads 32/64 bit shellcode (depending on how runsc is compiled) in a way that makes it easy to load in a debugger. This code is base…
☆39 · Dec 12, 2022 · Updated 3 years ago
gvwilson / webonomicon
An introduction to Web Programming for the Cautious and Weary
☆13 · May 10, 2026 · Updated 2 months ago
yotsubo / o-checker
☆11 · Feb 8, 2026 · Updated 5 months ago
mcburton / lc-python-training
A repository containing the materials for a Python workshop at the Library of Congress in May 2019
☆12 · May 23, 2019 · Updated 7 years ago
delimitry / compressed_rtf
Compressed Rich Text Format (RTF) compression and decompression in Python
☆25 · Jun 29, 2025 · Updated last year
idiom / OLEPackagerFormat
OLE Package Format Documentation
☆23 · Jun 13, 2020 · Updated 6 years ago
shexSpec / schemas
ShEx schemas for common vocabularies and use cases.
☆13 · Oct 7, 2019 · Updated 6 years ago
ukwa / opendata
Repository of documentation about the open datasets published by the UK Web Archive.
☆15 · Jun 21, 2019 · Updated 7 years ago
bagit-profiles / bagit-profiles-specification
☆35 · Nov 2, 2023 · Updated 2 years ago
NYPL / digarch_scripts
☆12 · May 13, 2026 · Updated 2 months ago
umd-mith / ndnp_iiif
Convert NDNP data to IIIF
☆12 · Jun 7, 2016 · Updated 10 years ago
jeremydmoore / coding4ch
Utilities useful in cultural heritage imaging and mass digitization projects
☆17 · Sep 10, 2020 · Updated 5 years ago
mitre / rhapsode
Advanced desktop search/corpus exploration prototype
☆21 · Jun 23, 2021 · Updated 5 years ago
thammegowda / tika-ner-corenlp
Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser
☆13 · Feb 26, 2022 · Updated 4 years ago
tklengyel / guestrace
Unofficial mirror of
☆12 · Feb 2, 2018 · Updated 8 years ago
libyal / libfole
Library for Object Linking and Embedding (OLE) data types
☆12 · Jun 24, 2026 · Updated 3 weeks ago
tech-at-arl / Digital-Scholarship-Institute
Repository for course materials and related resources for the ARL Digital Scholarship Institute.
☆23 · Aug 3, 2021 · Updated 4 years ago
joshua-decoder / thrax
Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation
☆15 · Dec 2, 2016 · Updated 9 years ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
riusksk / rp
rp++ is a full-cpp written tool that aims to find ROP sequences in PE/Elf/Mach-O x86/x64 binaries. It is open-source and has been tested …
☆11 · Apr 2, 2016 · Updated 10 years ago
GaurangPohankar / Python-Google-Places-Extraction
Python script to scrape data from Google Places, including reviews, website, name, total reviews, phone number, etc., and stores it …
☆11 · Aug 27, 2019 · Updated 6 years ago
bioteam / minio-irods-gateway
🔬 Experimental Minio (S3) Gateway for iRODS 💾
☆12 · Aug 13, 2019 · Updated 6 years ago
Tathagatd96 / Deep-Autoencoder-using-Tensorflow
☆11 · Jan 16, 2021 · Updated 5 years ago
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆461 · Jun 10, 2026 · Updated last month
nasa-jpl-memex / topic_space
Topic modeling web application
☆40 · Jul 23, 2015 · Updated 10 years ago
pdf-association / pdf-corpora
An index of PDF-centric corpora
☆182 · Jun 29, 2026 · Updated 3 weeks ago
pcodding / hadoop_ctakes
Hadoop integration code for working with Apache cTAKES
☆10 · Feb 11, 2014 · Updated 12 years ago
joesecurity / DocBleachShell
DocBleachShell is the integration of the great DocBleach, https://github.com/docbleach/DocBleach Content Disarm and Reconstruction tool i…
☆21 · Jan 15, 2022 · Updated 4 years ago
MITLibraries / archivesspace-api-python-scripts
Scripts for performing various tasks with the ArchivesSpace API
☆15 · Jun 27, 2024 · Updated 2 years ago
umd-mith / irads
Internet Research Agency Facebook ads as structured data
☆22 · Dec 10, 2019 · Updated 6 years ago
mouse-reeve / fruit
☆19 · Jan 17, 2020 · Updated 6 years ago
DLFMetadataAssessment / DLFMetadataAssessment.github.io
DLF AIG Metadata Assessment Working Group Site
☆16 · Nov 15, 2025 · Updated 8 months ago
wragge / omeka_s_tools
☆17 · Jun 20, 2024 · Updated 2 years ago
MIT-Informatics / PreservationSimulation
Code for preservation simulation/modeling project
☆10 · Aug 24, 2021 · Updated 4 years ago
marhop / literate-binary
Integrate handcrafted binary and documentation
☆36 · Oct 20, 2025 · Updated 9 months ago
acikyazilimagi / deduplication
deduplication
☆15 · Feb 20, 2023 · Updated 3 years ago