cocrawler / cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

☆189

Alternatives and similar repositories for cdx_toolkit

Users who are interested in cdx_toolkit compare it to the libraries listed below.

webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆441 Updated 11 months ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆124 Updated 3 weeks ago
ikreymer / cdx-index-client
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
☆205 Updated 7 years ago
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆448 Updated 3 weeks ago
adbar / htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
☆142 Updated last month
ukwa / webarchive-discovery
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…
☆131 Updated 2 weeks ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 Updated 8 years ago
commoncrawl / cc-webgraph
Tools to construct and process Common Crawl webgraphs
☆102 Updated this week
scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆339 Updated 2 months ago
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆191 Updated 3 years ago
mediacloud / date_guesser
A library to extract a publication date from a web page, along with a measure of the accuracy.
☆41 Updated 6 years ago
internetarchive / warc
Python library for reading and writing warc files
☆245 Updated 3 years ago
helgeho / ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…
☆154 Updated last month
kensho-technologies / qwikidata
Python tools for interacting with Wikidata
☆158 Updated 2 years ago
commoncrawl / cc-mrjob
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
☆168 Updated 3 years ago
kermitt2 / entity-fishing
A machine learning tool for fishing entities
☆265 Updated 6 months ago
jmriebold / BoilerPy3
Python port of Boilerpipe library
☆96 Updated last year
MartinoMensio / spacy-dbpedia-spotlight
A spaCy wrapper for DBpedia Spotlight
☆112 Updated 2 years ago
TeamHG-Memex / html-text
Extract text from HTML
☆135 Updated 5 years ago
UB-Mannheim / spacyopentapioca
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata
☆94 Updated 2 years ago
ikreymer / cc-index-server
Deployment of pywb as a CommonCrawl Index Server
☆21 Updated 8 years ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆359 Updated 9 months ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆809 Updated 9 months ago
Lucaterre / spacyfishing
A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata
☆169 Updated 3 years ago
TheScienceMuseum / elastic-wikidata
CLI for loading Wikidata subsets (or all of it) into Elasticsearch
☆70 Updated 3 years ago
AlonEirew / wikipedia-to-elastic
Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo…
☆47 Updated 2 years ago
opensanctions / fingerprints
Now included in rigour
☆152 Updated last week
usc-isi-i2 / etk
Extraction Toolkit
☆83 Updated 4 years ago
internetarchive / surt
Sort-friendly URI Reordering Transform (SURT) python module
☆44 Updated 2 months ago
jpotts18 / stylometry
A Stylometry Library for Python
☆146 Updated 2 years ago