A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.
☆74 · Jun 26, 2026 · Updated this week
Alternatives and similar repositories for CommonCrawlDocumentDownload
Users that are interested in CommonCrawlDocumentDownload are comparing it to the libraries listed below.
- A DropWizard wrapper around Apache Tika. ☆10 · Dec 22, 2016 · Updated 9 years ago
- Single server/laptop grade file-observatory ☆10 · Mar 30, 2023 · Updated 3 years ago
- Simplified version of a common crawl fetcher ☆16 · Dec 24, 2025 · Updated 6 months ago
- File-tests is a test suite for the File tool. Previous home: https://fedorahosted.org/file-tests/ ☆21 · Jun 3, 2026 · Updated 3 weeks ago
- Efficient indexing and retrieval of OCR bounding boxes in Solr ☆22 · Mar 13, 2019 · Updated 7 years ago
- Collection of my Python Scripts ☆41 · Aug 14, 2020 · Updated 5 years ago
- This project has been archived and is no longer being developed or supported. The Curator's Workbench is an extensible digital collectio… ☆24 · Jun 25, 2020 · Updated 6 years ago
- ☆11 · Feb 8, 2026 · Updated 4 months ago
- ☆13 · Oct 21, 2022 · Updated 3 years ago
- Compressed Rich Text Format (RTF) compression and decompression in Python ☆25 · Jun 29, 2025 · Updated 11 months ago
- ShEx schemas for common vocabularies and use cases. ☆12 · Oct 7, 2019 · Updated 6 years ago
- Repository of documentation about the open datasets published by the UK Web Archive. ☆15 · Jun 21, 2019 · Updated 7 years ago
- convert NDNP data to IIIF ☆12 · Jun 7, 2016 · Updated 10 years ago
- Colors in Library of Congress digital images. ☆32 · Jan 8, 2018 · Updated 8 years ago
- Library for Object Linking and Embedding (OLE) data types ☆12 · Updated this week
- A framework to fuzz Word Quick Fields ☆20 · Jul 15, 2018 · Updated 7 years ago
- Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser ☆13 · Feb 26, 2022 · Updated 4 years ago
- It's like DocBleach, but in your browser ☆18 · Oct 24, 2019 · Updated 6 years ago
- Parse paths (local paths, urls: ssh/git/etc) ☆21 · Apr 15, 2025 · Updated last year
- Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation ☆15 · Dec 2, 2016 · Updated 9 years ago
- Vizlinc ☆15 · Jan 14, 2016 · Updated 10 years ago
- Index Common Crawl archives in tabular format ☆131 · Updated this week
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- Codemeta paper. ☆10 · Jul 10, 2017 · Updated 8 years ago
- Utilities useful in cultural heritage imaging and mass digitization projects ☆17 · Sep 10, 2020 · Updated 5 years ago
- Streaming WARC/ARC library for fast web archive IO ☆459 · Jun 10, 2026 · Updated 2 weeks ago
- Hadoop integration code for working with Apache cTAKES ☆10 · Feb 11, 2014 · Updated 12 years ago
- An index of PDF-centric corpora ☆181 · Jul 4, 2025 · Updated 11 months ago
- Topic modeling web application ☆40 · Jul 23, 2015 · Updated 10 years ago
- Code for preservation simulation/modeling project ☆10 · Aug 24, 2021 · Updated 4 years ago
- API implementation, User Interface, and more modules of the IPTC EXTRA project ☆13 · Feb 14, 2022 · Updated 4 years ago
- Scripts for performing various tasks with the ArchivesSpace API ☆15 · Jun 27, 2024 · Updated 2 years ago
- Viewers for statistics and dashboarding of Domain Search Engine data ☆128 · Jan 19, 2016 · Updated 10 years ago
- TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS ☆16 · Sep 26, 2017 · Updated 8 years ago
- OSS2017 - Open Science for Synthesis: Gulf Research Program ☆10 · May 12, 2019 · Updated 7 years ago
- MEMEX Weapons Pilot for the illegal weapons domain. ☆15 · May 20, 2016 · Updated 10 years ago
- ☆19 · Jan 17, 2020 · Updated 6 years ago
- Internet Research Agency Facebook ads as structured data ☆22 · Dec 10, 2019 · Updated 6 years ago
- utility to fetch provenance information from Internet Archive's Wayback Machine ☆15 · Feb 5, 2026 · Updated 4 months ago