centic9 / CommonCrawlDocumentDownload
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.
☆64 · Updated last month
Alternatives and similar repositories for CommonCrawlDocumentDownload:
Users who are interested in CommonCrawlDocumentDownload are comparing it to the libraries listed below:
- Simplified version of a common crawl fetcher ☆13 · Updated this week
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆50 · Updated 4 years ago
- Advanced desktop search/corpus exploration prototype ☆21 · Updated 3 years ago
- Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser ☆13 · Updated 2 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Updated 7 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆43 · Updated 7 years ago
- Trying to generate name synonyms from wikidata ☆32 · Updated 4 years ago
- Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data. ☆58 · Updated 2 weeks ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 · Updated 4 years ago
- ☆21 · Updated 6 years ago
- An openly-licensed corpus of small example files, covering a wide range of formats and creation tools. ☆188 · Updated last year
- Extraction Toolkit ☆82 · Updated 3 years ago
- Common Crawl Index Server ☆65 · Updated this week
- A Named-Entity Recogniser based on Grobid. ☆50 · Updated 4 months ago
- GROBID extension for identifying and normalizing physical quantities. ☆77 · Updated 4 months ago
- Efficient indexing and retrieval of OCR bounding boxes in Solr ☆22 · Updated 5 years ago
- Common web archive utility code. ☆52 · Updated last month
- Index URLs in Common Crawl ☆194 · Updated 7 years ago
- An open relation extraction system ☆46 · Updated 3 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 · Updated 2 years ago
- Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features. ☆108 · Updated 10 months ago
- This is the facade for installation and access to the individual components ☆16 · Updated 6 years ago
- PST extraction and analytic pipeline ☆37 · Updated 6 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆33 · Updated 3 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆85 · Updated 3 years ago
- A Utility Library for Wikipedia dumps ☆33 · Updated 7 years ago
- Python bindings for Apache Tika ☆22 · Updated 4 years ago
- All that entity matching, resolution, normalization, enhancement and reconciliation madness, but with a focus on data, not platforms. ☆24 · Updated 2 years ago
- Solr Query Segmenter for structuring unstructured queries ☆21 · Updated 3 years ago
- This repository contains the Domain Discovery Tool (DDT) project. DDT is an interactive system that helps users explore and better unders… ☆46 · Updated 3 years ago