An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors
☆35 · Mar 19, 2015 · Updated 10 years ago
Alternatives and similar repositories for crawl-to-the-future
Users that are interested in crawl-to-the-future are comparing it to the libraries listed below
- Fast structured perceptron sequential labeler ☆15 · Dec 8, 2015 · Updated 10 years ago
- An implementation of the Mixcoin mixing protocol ☆13 · Nov 12, 2014 · Updated 11 years ago
- Analysis related to article on FOIA Online Database. ☆11 · Feb 2, 2017 · Updated 9 years ago
- This project deals with hierarchical classification of web pages based on the DMOZ dataset. ☆14 · Apr 10, 2014 · Updated 11 years ago
- Fast links parser for Python & Humans ☆11 · Dec 27, 2012 · Updated 13 years ago
- LEMS interpreter implemented in Python ☆12 · Nov 26, 2025 · Updated 3 months ago
- get facebook data ☆10 · Sep 14, 2014 · Updated 11 years ago
- An exercise in unsupervised machine learning: extract an article's text from HTML documents. ☆431 · Jan 16, 2026 · Updated last month
- Presenters, titles & links ☆10 · Apr 13, 2015 · Updated 10 years ago
- Mirror of Apache usergrid JavaScript SDK ☆15 · Apr 28, 2017 · Updated 8 years ago
- Failover AWS Spot Instances ☆11 · Dec 8, 2017 · Updated 8 years ago
- Manage and load dataprotocols.org Data Packages ☆27 · Sep 17, 2015 · Updated 10 years ago
- Pattern-of-Behavior Search Tool ☆11 · Jun 20, 2022 · Updated 3 years ago
- Vizlinc ☆15 · Jan 14, 2016 · Updated 10 years ago
- System for mining Wikipedia Usage data to read our collective mind ☆20 · Sep 28, 2014 · Updated 11 years ago
- iServe is what we refer to as service warehouse which unifies service publication, analysis, and discovery through the use of lightweigh… ☆24 · Feb 18, 2016 · Updated 10 years ago
- DEPRECATED: Use ghc-heap, ghc-heap-view in GHC 8.x instead. ☆18 · Sep 17, 2016 · Updated 9 years ago
- A recommender system for GitHub repositories ☆14 · Jun 21, 2014 · Updated 11 years ago
- A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data. ☆13 · Feb 12, 2016 · Updated 10 years ago
- Links parts of input text to Wikipedia articles ☆16 · Sep 9, 2012 · Updated 13 years ago
- A Lit web-component for viewing a Whisper JSON transcript file ☆14 · Feb 12, 2026 · Updated 3 weeks ago
- Python interfaces to popular bulk WHOIS servers such as Shadowserver and Team Cymru. ☆21 · Sep 12, 2011 · Updated 14 years ago
- Create tarballs from Leiningen projects. ☆25 · Oct 20, 2014 · Updated 11 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format) ☆65 · Aug 5, 2016 · Updated 9 years ago
- Tweets annotated with coarse-grained sense labels (supersenses) ☆13 · Jun 13, 2014 · Updated 11 years ago
- Tools for web page segmentation. In development ☆17 · Nov 7, 2018 · Updated 7 years ago
- Webpack config generator for React apps ☆12 · May 5, 2016 · Updated 9 years ago
- A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.