pmandera / duometer
Near-duplicate detection tool
☆24Updated 9 years ago
Alternatives and similar repositories for duometer
Users who are interested in duometer are comparing it to the repositories listed below.
- A queue-controlled browser automation tool for improving web crawl quality☆64Updated 4 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆47Updated 8 years ago
- A platform for collecting, analyzing, and visualizing social media data.☆12Updated 5 years ago
- Collects multimedia content shared through social networks.☆19Updated 10 years ago
- Bigram/trigram analysis of Wikipedia; mainly mutual information☆22Updated 13 years ago
- General Architecture for Text Engineering☆49Updated 9 years ago
- A project for clustering text streams using locality-sensitive hashing (LSH) in Python☆26Updated 14 years ago
- Topic modeling web application☆40Updated 10 years ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop.☆12Updated 4 years ago
- Navigating around a grid of cells like XPath for spreadsheets; supports Python 3.5+☆48Updated 2 years ago
- Want to learn more about Free Law Project technologies, policies and thinking? Get the literature here.☆25Updated 4 years ago
- Algorithms for "schema matching"☆26Updated 9 years ago
- Schemas to convert common fixed-width file formats into CSV using in2csv.☆125Updated 4 years ago
- A pipeline for crawling of RSS feeds and the associated content. Demo at newsfeed.ijs.si.☆21Updated 13 years ago
- Raw Wikipedia counts for entity linking☆19Updated 8 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess…☆17Updated 10 years ago
- Sauna - a social news reader and curation tool☆55Updated 11 years ago
- Detecting near duplicates using Moses Charikar's algorithm☆20Updated 11 years ago
- Command line tool to convert spreadsheets to databases, made for the UK's Office for National Statistics.☆80Updated 2 years ago
- ScraperWiki Python library for scraping and saving data; in maintenance mode☆158Updated this week
- Simple taxonomy management tool and document classifier.☆57Updated 5 years ago
- Vizlinc☆15Updated 9 years ago
- Wikipedia API wrapper for humans and elk. (en.wikipedia.org/w/api.php, get it?)☆38Updated 11 years ago
- Easily identify and label sentence intervals using various taggers.☆16Updated 8 years ago
- Stylometric framework in Python☆17Updated 10 years ago
- Virtual patent marking crawler at iproduct.epfl.ch☆15Updated 8 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Updated 5 years ago
- A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.☆15Updated 11 years ago