pmandera / duometer
Near-duplicate detection tool
☆24 Updated 8 years ago
Alternatives and similar repositories for duometer
Users who are interested in duometer are comparing it to the libraries listed below.
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess… ☆16 Updated 10 years ago
- General Architecture for Text Engineering ☆49 Updated 9 years ago
- A queue-controlled browser automation tool for improving web crawl quality ☆62 Updated last month
- Tools for tracking stories on news homepages ☆48 Updated 5 years ago
- A platform for collecting, analyzing, and visualizing social media data. ☆12 Updated 4 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- Topic modeling web application ☆40 Updated 10 years ago
- bigram / trigram analysis of wikipedia; mainly mutual info ☆22 Updated 13 years ago
- Navigating around a grid of cells like XPath for spreadsheets; supports Python 3.5+ ☆48 Updated 2 years ago
- framework for scraping legislative/government data ☆88 Updated last year
- Scraper built with Scrapy. ☆18 Updated last year
- A pipeline for crawling of RSS feeds and the associated content. Demo at newsfeed.ijs.si. ☆21 Updated 12 years ago
- Simple taxonomy management tool and document classifier. ☆56 Updated 5 years ago
- Collects multimedia content shared through social networks. ☆19 Updated 10 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 Updated 3 years ago
- Record Linkage ToolKit (Find and link entities) ☆109 Updated 2 years ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆15 Updated 8 years ago
- stav text annotation visualiser ☆34 Updated 13 years ago
- ☆48 Updated 11 years ago
- Check out https://github.com/webrecorder/webrecorder for newer version matching https://webrecorder.io ☆38 Updated 9 years ago
- Raw Wikipedia counts for entity linking ☆19 Updated 8 years ago
- A library for extracting tables from PDF files ☆89 Updated 12 years ago
- Take streaming tweets, extract hashtags & usernames, create graph, export graphml for Gephi visualisation ☆38 Updated 12 years ago
- Advanced similarity and duplicate source code proof of concept for our research efforts. ☆52 Updated 3 years ago
- Wandora is a general purpose information extraction, management and publishing application based on Topic Maps and Java. ☆133 Updated 2 years ago
- Quickly analyze and explore email with advanced analytics and visualization. ☆56 Updated 4 years ago
- The news homepage archive ☆80 Updated 4 years ago
- Serapis is a sentence identifier and modeling pipeline / built for Wordnik ☆24 Updated 9 years ago
- Schemas to convert common fixed-width file formats into CSV using in2csv. ☆125 Updated 4 years ago
- Contains the implementation of algorithms that estimate the geographic location of media content based on their content and metadata. It … ☆15 Updated 8 years ago