openpreserve / pagelyzer
Suite of tools for detecting changes in web pages and their rendering
☆54 · Updated last year
Alternatives and similar repositories for pagelyzer
Users who are interested in pagelyzer are comparing it to the repositories listed below.
- Tools for web page segmentation. In development ☆17 · Updated 6 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX) ☆25 · Updated 7 years ago
- A Python library to detect and extract listing data from HTML pages. ☆108 · Updated 8 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts ☆59 · Updated 12 years ago
- Implementation of Vision Based Page Segmentation algorithm in Java ☆102 · Updated 5 years ago
- Python API for Various DB-Backed Simhash Clusters ☆64 · Updated 8 years ago
- Web page segmentation and noise removal ☆55 · Updated last year
- A toolkit for clustering web pages based on various similarity measures. ☆33 · Updated 3 years ago
- Automatic Item List Extraction ☆87 · Updated 9 years ago
- General Architecture for Text Engineering ☆50 · Updated 9 years ago
- ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest 10s of millions of files (image… ☆95 · Updated 6 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 · Updated 4 years ago
- Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe… ☆55 · Updated last year
- Results for intent classification benchmark (Botfuel, DialogFlow, Luis, Watson, RASA, Recast, Snips) ☆11 · Updated 7 years ago
- An automatic news summary generator ☆24 · Updated 8 years ago
- Dmoz RDF parser ☆28 · Updated 9 years ago
- Keyword Extraction system using Brown Clustering (this version is trained to extract keywords from job listings) ☆18 · Updated 10 years ago
- Common Crawl Index Server ☆68 · Updated 3 months ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 · Updated 3 years ago
- Pipeline for distributed Natural Language Processing, made in Python ☆65 · Updated 8 years ago
- Reduction is a Python script which automatically summarizes a text by extracting the sentences which are deemed to be most important. ☆55 · Updated 10 years ago
- A simple algorithm for clustering web pages, suitable for crawlers ☆34 · Updated 8 years ago
- Automated NLP sentiment predictions: batteries included, or use your own data ☆18 · Updated 7 years ago
- An open source search engine for corporate data and websites. ☆106 · Updated 7 years ago
- Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum ☆17 · Updated 2 years ago
- Additional opennlp mapping type for elasticsearch in order to perform named entity recognition ☆136 · Updated 9 years ago
- A tool for semantic relation extraction. The program finds pairs of semantically related words based on the text definitions coming from … ☆26 · Updated 10 years ago