mediacloud / metadata-lib
How Media Cloud approaches extracting metadata from online news stories
☆13 Updated 5 months ago
Alternatives and similar repositories for metadata-lib
Users who are interested in metadata-lib are comparing it to the libraries listed below.
- Fast and robust date extraction from web pages, with Python or on the command-line ☆127 Updated 5 months ago
- A helper library full of URL-related heuristics. ☆69 Updated 2 months ago
- Index Common Crawl archives in tabular format ☆120 Updated 2 weeks ago
- Ultimate Website Sitemap Parser ☆212 Updated last month
- A list of over 5000 US news domains and their social media accounts ☆45 Updated 2 years ago
- Python port of Boilerpipe library ☆88 Updated 9 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆138 Updated 5 months ago
- Tools to construct and process Common Crawl webgraphs ☆90 Updated this week
- Article extraction benchmark: dataset and evaluation scripts ☆315 Updated last year
- A web scraper for TikTok using Playwright ☆91 Updated last month
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆173 Updated 5 months ago
- Python wrapper for Google people-also-ask ☆107 Updated 8 months ago
- Common crawl extractor ☆75 Updated last year
- Newsfeed based on GDELT Project ☆26 Updated last year
- A Python scraper for the Facebook Ad Library, using the official Facebook Ad Library API. ☆119 Updated 5 years ago
- Cross-platform GUI Client for Computer Vision APIs (Google Vision, Microsoft Cognitive Services, Clarifai and Keras' open source models) ☆22 Updated 2 years ago
- Measure the readability of a given text using surface characteristics ☆79 Updated 4 months ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- A Directory of Online Newspaper Sources for 70+ Languages ☆32 Updated 4 years ago
- Keyword spaCy is a spaCy pipeline component for extracting keywords from text using cosine similarity. ☆11 Updated last year
- Extract dates from text ☆64 Updated 4 years ago
- This repository contains an implementation of a US address parser built using spaCy NLP library. ☆37 Updated last year
- Scrapers for U.S. county court sites. ☆67 Updated 2 years ago
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆161 Updated 2 years ago
- ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of diff… ☆88 Updated 3 years ago
- Tag news stories based on models trained on the NYT corpus. ☆42 Updated 2 years ago
- This repository provides usage examples for the Python module Newspaper3k. ☆147 Updated last year
- Norwegian Speech Transformer Models ☆18 Updated 6 months ago
- Faster, modernized fork of the language identification tool langid.py ☆56 Updated 6 months ago
- How can we improve name matching in screening tools? ☆12 Updated 4 months ago