An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors
☆35 · Mar 19, 2015 · Updated 11 years ago
Alternatives and similar repositories for crawl-to-the-future
Users that are interested in crawl-to-the-future are comparing it to the libraries listed below.
- An exercise in unsupervised machine learning: Extract Article's Text in HTML documents. ☆431 · Jan 16, 2026 · Updated 2 months ago
- Analysis related to article on FOIA Online Database. ☆11 · Feb 2, 2017 · Updated 9 years ago
- Investigative tool for extracting relevant areas from many documents ☆14 · Nov 17, 2015 · Updated 10 years ago
- Fast structured perceptron sequential labeler ☆15 · Dec 8, 2015 · Updated 10 years ago
- Autocomplete - light-weight, next-word prediction Python utility ☆451 · Jan 16, 2026 · Updated 2 months ago
- Extract data from websites using basic statistical magic ☆506 · Oct 2, 2020 · Updated 5 years ago
- Parse live video and extract Chyron text ☆20 · Aug 17, 2017 · Updated 8 years ago
- A Lit web-component for viewing a Whisper JSON transcript file ☆14 · Feb 12, 2026 · Updated last month
- This project deals with hierarchical classification of web pages based on dmoz dataset. ☆14 · Apr 10, 2014 · Updated 11 years ago
- A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data. ☆13 · Feb 12, 2016 · Updated 10 years ago
- pythonic processes ☆11 · Jun 12, 2015 · Updated 10 years ago
- A how-to do a mass collection of FEC data using the command-line and regular expressions ☆29 · Mar 18, 2016 · Updated 10 years ago
- Manage and load dataprotocols.org Data Packages ☆27 · Sep 17, 2015 · Updated 10 years ago
- Tweets annotated with coarse-grained sense labels (supersenses) ☆13 · Jun 13, 2014 · Updated 11 years ago
- Failover AWS Spot Instances ☆11 · Dec 8, 2017 · Updated 8 years ago
- A Flask+Elasticsearch UI for exploring the DC Inbox dataset from http://web.stevens.edu/dcinbox/Home.html ☆16 · Jan 21, 2022 · Updated 4 years ago
- A semantic web crawler ☆20 · Sep 20, 2010 · Updated 15 years ago
- System for mining Wikipedia Usage data to read our collective mind ☆20 · Sep 28, 2014 · Updated 11 years ago
- Content-based Recommendation Generator ☆13 · Jan 21, 2015 · Updated 11 years ago
- Linked SDMX ☆17 · Oct 26, 2014 · Updated 11 years ago
- Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum ☆18 · Jul 1, 2022 · Updated 3 years ago
- mltk - Moz Language Tool Kit ☆12 · Mar 6, 2015 · Updated 11 years ago
- Data science tools from Moz ☆23 · Jan 11, 2017 · Updated 9 years ago
- Collects multimedia content shared through social networks. ☆19 · Feb 18, 2015 · Updated 11 years ago
- Data for our analysis of Amtrak 188 derailment. ☆10 · May 14, 2015 · Updated 10 years ago
- Fast links parser for Python & Humans ☆11 · Dec 27, 2012 · Updated 13 years ago
- Tools for web page segmentation. In development ☆17 · Nov 7, 2018 · Updated 7 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format) ☆65 · Aug 5, 2016 · Updated 9 years ago
- Linum is yet another Linux enumeration script written in shell script. ☆11 · Oct 20, 2020 · Updated 5 years ago
- A Cython interface to FLANN ☆24 · Nov 25, 2020 · Updated 5 years ago
- transform a datapoint from a website into a CSV time-series dataset using the wayback machine ☆12 · May 24, 2023 · Updated 2 years ago
- Vizlinc ☆15 · Jan 14, 2016 · Updated 10 years ago
- A one-page cheat sheet for VisiData, available in multiple languages. ☆29 · Mar 8, 2024 · Updated 2 years ago
- How to use curl and other Bash tools to make a mirror of dmv.ca.gov "Report of Traffic Accident Involving an Autonomous Vehicle (OL 316)"… ☆11 · Jun 22, 2017 · Updated 8 years ago
- Set of scripts to aid in the download of the GDELT data files from www.gdeltproject.org ☆12 · May 17, 2014 · Updated 11 years ago
- Semanticizest: dump parser and client ☆20 · May 11, 2016 · Updated 9 years ago
- A simple app to add OAuth-based authentication in front of an S3 bucket-based static website. ☆11 · Dec 8, 2022 · Updated 3 years ago
- MOTHBALLED: See README note. ☆10 · Jul 11, 2022 · Updated 3 years ago
- rapid nlp prototyping ☆71 · Sep 30, 2022 · Updated 3 years ago