pmandera / duometer
Near-duplicate detection tool
☆24 Updated 9 years ago
Alternatives and similar repositories for duometer
Users who are interested in duometer are comparing it to the repositories listed below.
- A queue-controlled browser automation tool for improving web crawl quality ☆64 Updated 5 months ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 Updated 4 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess… ☆17 Updated 10 years ago
- Simple taxonomy management tool and document classifier. ☆57 Updated 6 years ago
- Tools for tracking stories on news homepages ☆48 Updated 6 years ago
- Easily identify and label sentence intervals using various taggers. ☆16 Updated 9 years ago
- A platform for collecting, analyzing, and visualizing social media data. ☆13 Updated 5 years ago
- bigram / trigram analysis of wikipedia; mainly mutual info ☆22 Updated 13 years ago
- Topic modeling web application ☆40 Updated 10 years ago
- A project for clustering text streams using locality-sensitive hashing (LSH) in Python ☆26 Updated 14 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 Updated 8 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture ☆38 Updated last year
- Homebase of the IPTC EXTRA project about rule-based text categorization ☆13 Updated 8 years ago
- Knowledge-based Semantic Role Labeling ☆16 Updated last year
- Want to learn more about Free Law Project technologies, policies and thinking? Get the literature here. ☆25 Updated 4 years ago
- Collects multimedia content shared through social networks. ☆19 Updated 10 years ago
- Just like on ScraperWiki Classic; now a part of QuickCode. ☆38 Updated 9 years ago
- A tool for semantic relation extraction. The program finds pairs of semantically related words based on the text definitions coming from … ☆26 Updated 11 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 Updated 7 years ago
- Command-line tool to extract a ranked list of relevant keywords from a corpus with the option of using either topic modeling or tf-idf sc… ☆41 Updated 8 years ago
- Document clustering based on Latent Semantic Analysis ☆96 Updated 15 years ago
- Raw Wikipedia counts for entity linking ☆19 Updated 8 years ago
- Extracts character names from a text file and performs analysis of text sentences containing the names. ☆55 Updated 2 years ago
- General Architecture for Text Engineering ☆49 Updated 9 years ago
- An attempt at creating a gold standard dataset for backtesting yesterday & today's content-extractors ☆35 Updated 10 years ago
- Json Wikipedia, contains code to convert the Wikipedia xml dump into a json dump. Questions? https://gitter.im/idio-opensource/Lobby ☆17 Updated 3 years ago
- Python library and command line tool for converting data from one format to another ☆99 Updated 5 years ago
- A semantic analysis tool to generate synonym.txt files for Solr. [RETIRED] ☆25 Updated 9 years ago
- Stanford Tregex-inspired language for rule-based dependency tree manipulation. ☆21 Updated 8 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 Updated 4 years ago