pjlsergeant / ziprip
Extract postal addresses from the DOM
☆66Updated 12 years ago
Alternatives and similar repositories for ziprip:
Users who are interested in ziprip are comparing it to the libraries listed below:
- Launch AWS Elastic MapReduce jobs that process Common Crawl data.☆49Updated 8 years ago
- Index URLs in Common Crawl☆193Updated 7 years ago
- A node.js library for extracting data from scanned forms.☆117Updated 2 years ago
- ☆24Updated 9 years ago
- conceptnet 4 bridge☆71Updated 10 years ago
- mltk - Moz Language Tool Kit☆12Updated 10 years ago
- Curated synonym files and Helpers for Elasticsearch Synonym Token Filter☆64Updated last year
- Dedupe/batch geocode addresses and venues around the world with libpostal☆83Updated 3 years ago
- Model Training tool for MITIE☆79Updated 9 years ago
- Nodejs wrapper for Stanford Classifier.☆47Updated 4 years ago
- Free & ready-to-use geocoder☆57Updated 8 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆166Updated 2 years ago
- An attempt at creating a silver/gold standard dataset for backtesting yesterday & today's content-extractors☆34Updated 10 years ago
- Vocabulary using n-grams☆16Updated 6 years ago
- Algorithms for URL Classification☆19Updated 9 years ago
- A Python library to detect and extract listing data from HTML pages.☆108Updated 7 years ago
- Wrapper to pocketsphinx phoneme labeling tools☆18Updated 8 years ago
- A simple algorithm for clustering web pages, suitable for crawlers☆34Updated 8 years ago
- email dataset for email signature parsing☆55Updated 8 years ago
- XTractor is an algorithmic text extractor from web pages written in Java. It builds upon the "commonly used web design practices" approac…☆43Updated 9 years ago
- A statistics extension for Google Refine.☆33Updated 13 years ago
- Client for Stanford Named Entity Recognition☆27Updated 6 years ago
- Text classification using Naive Bayes and Elasticsearch☆154Updated 8 years ago
- Open Source implementation of Summly☆47Updated 8 years ago
- A space for code and projects around analysing news content☆23Updated 7 years ago
- ☆21Updated 6 years ago
- Rewrite text in linear time.☆81Updated 2 years ago
- A company/project name generator for Python. Uses NLTK and diverse techniques derived from existing corporate etymologies and naming agen…☆49Updated 8 years ago
- Traptor -- A distributed Twitter feed☆26Updated 2 years ago
- Tools to manipulate and extract data from wikipedia dumps☆46Updated 11 years ago