rodricios/crawl-to-the-future

An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors

☆35

Alternatives and similar repositories for crawl-to-the-future

Users who are interested in crawl-to-the-future compare it to the libraries listed below.

coastalcph / rungsted
Fast structured perceptron sequential labeler
☆15 · Dec 8, 2015 · Updated 10 years ago
charlescharles / mixcoin
An implementation of the Mixcoin mixing protocol
☆13 · Nov 12, 2014 · Updated 11 years ago
rashidakamal / foia-online
Analysis related to article on FOIA Online Database.
☆11 · Feb 2, 2017 · Updated 9 years ago
bhavishya235 / Web-Classification
This project deals with hierarchical classification of web pages based on dmoz dataset.
☆14 · Apr 10, 2014 · Updated 11 years ago
cx-lukas-salkauskas-x / FastLinks
Fast links parser for Python & Humans
☆11 · Dec 27, 2012 · Updated 13 years ago
LEMS / pylems
LEMS interpreter implemented in Python
☆12 · Nov 26, 2025 · Updated 3 months ago
newslynx / zuckup
get facebook data
☆10 · Sep 14, 2014 · Updated 11 years ago
rodricios / eatiht
An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
☆431 · Jan 16, 2026 · Updated last month
sburns / pycon-2015-compendium
Presenters, titles & links
☆10 · Apr 13, 2015 · Updated 10 years ago
apache / usergrid-javascript
Mirror of Apache usergrid JavaScript SDK
☆15 · Apr 28, 2017 · Updated 8 years ago
numercfd / aws-fasi
Failover AWS Spot Instances
☆11 · Dec 8, 2017 · Updated 8 years ago
trickvi / datapackage
Manage and load dataprotocols.org Data Packages
☆27 · Sep 17, 2015 · Updated 10 years ago
mitll / graph-qube
Pattern-of-Behavior Search Tool
☆11 · Jun 20, 2022 · Updated 3 years ago
mitll / vizlinc
Vizlinc
☆15 · Jan 14, 2016 · Updated 10 years ago
paulhoule / telepath
System for mining Wikipedia Usage data to read our collective mind
☆20 · Sep 28, 2014 · Updated 11 years ago
kmi / iserve
iServe is what we refer to as service warehouse which unifies service publication, analysis, and discovery through the use of lightweigh…
☆24 · Feb 18, 2016 · Updated 10 years ago
thoughtpolice / vacuum
DEPRECATED: Use ghc-heap, ghc-heap-view in GHC 8.x instead.
☆18 · Sep 17, 2016 · Updated 9 years ago
zygmuntz / stardose
A recommender system for GitHub repositories
☆14 · Jun 21, 2014 · Updated 11 years ago
blaze / datafabric
A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data.
☆13 · Feb 12, 2016 · Updated 10 years ago
nournia / wikifier
Links parts of input text to Wikipedia articles
☆16 · Sep 9, 2012 · Updated 13 years ago
edsu / whisper-transcript
A Lit web-component for viewing a Whisper JSON transcript file
☆14 · Feb 12, 2026 · Updated 3 weeks ago
csirtfoundry / BulkWhois
Python interfaces to popular bulk WHOIS servers such as Shadowserver and Team Cymru.
☆21 · Sep 12, 2011 · Updated 14 years ago
technomancy / lein-tar
Create tarballs from Leiningen projects.
☆25 · Oct 20, 2014 · Updated 11 years ago
commoncrawl / commoncrawl-examples
A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)
☆65 · Aug 5, 2016 · Updated 9 years ago
coastalcph / supersense-data-twitter
Tweets annotated with coarse-grained sense labels (supersenses)
☆13 · Jun 13, 2014 · Updated 11 years ago
asanoja / segmentations
Tools for web page segmentation. In development
☆17 · Nov 7, 2018 · Updated 7 years ago
reapp / reapp-pack
Webpack config generator for React apps
☆12 · May 5, 2016 · Updated 9 years ago
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15 · Feb 9, 2014 · Updated 12 years ago
draperlaboratory / user-ale
The User Activity Logging Engine, or User-ALE, is a logging mechanism used to quantitatively assess the behavioural and cognitive state o…
☆13 · Aug 26, 2016 · Updated 9 years ago
TellMeFirst / tellmefirst
TellMeFirst is a tool for classifying and enriching textual documents via Linked Open Data.
☆25 · Sep 1, 2022 · Updated 3 years ago
the-pudding / last-two-minute-report
☆16 · Jun 7, 2018 · Updated 7 years ago
idiap / cbrec
Content-based Recommendation Generator
☆13 · Jan 21, 2015 · Updated 11 years ago
blalpert / best_ex
Replication files for the March 2, 2015 Barron's story "The Little Guy Wins!," measuring market makers' trade execution quality.
☆13 · Mar 12, 2015 · Updated 10 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 · Feb 18, 2015 · Updated 11 years ago
seagatesoft / sde
Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm…
☆49 · Jun 9, 2012 · Updated 13 years ago
cnwzhjs / python.erl
Python interpreter written in pure Erlang.
☆60 · Jan 10, 2013 · Updated 13 years ago
philippesaade-wmde / WikidataTextEmbedding
Convert Wikidata Items to vector embeddings
☆37 · Feb 25, 2026 · Updated last week
valpackett / octohipster
[UNMAINTAINED] A hypermedia REST HTTP API library for Clojure
☆76 · Jul 12, 2015 · Updated 10 years ago
Sotera / Datawake
Browser add-on and web server to support collection and analysis of web browsing data.
☆14 · Mar 9, 2016 · Updated 10 years ago