A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing of frameworks like Apache POI and Apache Tika.
☆73 · Jan 16, 2026 · Updated last month
Alternatives and similar repositories for CommonCrawlDocumentDownload
Users who are interested in CommonCrawlDocumentDownload are comparing it to the libraries listed below.
- This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading … ☆18 · Jan 27, 2024 · Updated 2 years ago
- A DropWizard wrapper around Apache Tika. ☆10 · Dec 22, 2016 · Updated 9 years ago
- ☆11 · Feb 8, 2026 · Updated 3 weeks ago
- A tool for detecting viruses and NSFW material in WARC files ☆17 · Dec 16, 2025 · Updated 2 months ago
- Simplified version of a common crawl fetcher ☆17 · Dec 24, 2025 · Updated 2 months ago
- It's like DocBleach, but in your browser ☆18 · Oct 24, 2019 · Updated 6 years ago
- File-tests is a test suite for the File tool. Previous home: https://fedorahosted.org/file-tests/ ☆21 · Dec 18, 2025 · Updated 2 months ago
- Compressed Rich Text Format (RTF) compression and decompression in Python ☆23 · Jun 29, 2025 · Updated 8 months ago
- OLE Package Format Documentation ☆23 · Jun 13, 2020 · Updated 5 years ago
- CDXJ Indexing of WARC/ARCs ☆33 · Dec 10, 2024 · Updated last year
- This project has been archived and is no longer being developed or supported. The Curator's Workbench is an extensible digital collectio… ☆24 · Jun 25, 2020 · Updated 5 years ago
- Streaming WARC/ARC library for fast web archive IO ☆451 · Dec 10, 2024 · Updated last year
- Integrate handcrafted binary and documentation ☆36 · Oct 20, 2025 · Updated 4 months ago
- ☆17 · Feb 20, 2026 · Updated last week
- ☆12 · Aug 4, 2018 · Updated 7 years ago
- ☆10 · Dec 30, 2020 · Updated 5 years ago
- Description of file formats ☆11 · Feb 4, 2022 · Updated 4 years ago
- PowerShell script to disable NetBIOS on Windows ☆12 · Jul 19, 2021 · Updated 4 years ago
- ☆14 · Jan 3, 2024 · Updated 2 years ago
- CalDav Parser ☆11 · Nov 30, 2024 · Updated last year
- Tool support for literature review subtle process. ☆18 · May 5, 2017 · Updated 8 years ago
- OpenPGP in Python using Sequoia PGP ☆18 · Feb 25, 2026 · Updated last week
- A crate to help you fetch and serve WebFinger resources ☆10 · Nov 13, 2022 · Updated 3 years ago
- Tools to cluster visually similar images into groups in an image dataset ☆11 · Jul 29, 2022 · Updated 3 years ago
- Automatically generate tests for your website by using LLM models ☆17 · Aug 7, 2023 · Updated 2 years ago
- Hadoop-based tool for extraction of large scale synchronous grammars for paraphrasing and machine translation ☆15 · Dec 2, 2016 · Updated 9 years ago
- Application which supports the UNC Libraries' Digital Collections Repository ☆12 · Feb 25, 2026 · Updated last week
- SIARD (Software Independent Archiving of Relational Databases) - an open file format for the long-term archiving of relational databases ☆12 · Nov 14, 2024 · Updated last year
- OSS2017 - Open Science for Synthesis: Gulf Research Program ☆10 · May 12, 2019 · Updated 6 years ago
- Code for extracting the shellcode from a maldoc. ☆10 · Jul 17, 2023 · Updated 2 years ago
- A fork of the disktype disk and disk image format detection tool ☆11 · Nov 16, 2016 · Updated 9 years ago
- Cookiecutter template for creating Ansible roles. Includes tests for TravisCI using Molecule. ☆13 · Dec 14, 2021 · Updated 4 years ago
- ☆44 · Mar 29, 2023 · Updated 2 years ago
- Download a demo version of Open Network Insight, which can be run standalone on a Windows laptop using Winpython https://sourceforge.net/… ☆10 · Feb 1, 2017 · Updated 9 years ago
- Add IIP layering support to the Leaflet library ☆14 · Jul 28, 2016 · Updated 9 years ago
- A small POC using Caddy as a TLS-terminating MQTT proxy ☆12 · Aug 31, 2022 · Updated 3 years ago
- Library for Object Linking and Embedding (OLE) data types ☆12 · Nov 27, 2025 · Updated 3 months ago
- Rewrapping FieryIceStickie's Deobfuscation Tools ☆11 · Feb 2, 2026 · Updated last month
- Library for the Test-based Calibration Error (TCE) metric to quantify the degree of classifier calibration. ☆13 · Sep 15, 2023 · Updated 2 years ago