medialab / ural
A helper library full of URL-related heuristics.
☆56 · Updated last week
Related projects:
- Extract networks of entities from journalistic reporting ☆46 · Updated last year
- Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation. ☆142 · Updated 7 months ago
- Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data. ☆54 · Updated last month
- API client for Aleph, supports bulk entity and document upload. ☆27 · Updated last month
- Web interface for network analysis. ☆20 · Updated last year
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 · Updated 5 years ago
- A base library for building web scrapers for statistical data, and a helper ontology for (primarily Swedish) statistical data. ☆13 · Updated last year
- Alternative robots parser module for Python ☆16 · Updated this week
- Python-based Wikidata framework for easy dataframe extraction ☆39 · Updated 9 months ago
- ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of diff… ☆87 · Updated 2 years ago
- Web scraping Page Objects core library ☆93 · Updated 2 months ago
- Scraper for Facebook, Gab, Google and TikTok ☆22 · Updated 2 months ago
- Inspect a URL and estimate if it contains a news story ☆39 · Updated 3 weeks ago
- An alpha project combining beneficial ownership and contracting data ☆13 · Updated 3 years ago
- ETL pipeline, graphical explorer and general toolbox for investigations with follow the money data ☆13 · Updated 8 months ago
- A pure-Python robots.txt parser with support for modern conventions. ☆54 · Updated 3 months ago
- Adds a reconciliation API endpoint to Datasette, based on the Reconciliation Service API specification. ☆22 · Updated 7 months ago
- Extract text from HTML ☆129 · Updated 4 years ago
- A webmining CLI tool & library for Python. ☆277 · Updated last week
- A browser user interface for manual labeling of record pairs. ☆41 · Updated last year
- Detect and visualize text reuse ☆115 · Updated 2 weeks ago
- The most advanced debugging and testing tool for Scrapy ☆16 · Updated last year
- Find RSS, Atom, XML, and RDF feeds on webpages ☆30 · Updated last year
- Trying to generate name synonyms from Wikidata ☆33 · Updated 4 years ago
- Simple tool to pull posts and users from Gab ☆15 · Updated 2 months ago
- A maximum-strength name parser for record linkage. ☆29 · Updated last month
- Specialized & performant CSV readers, writers and enrichers for Python. ☆12 · Updated 7 months ago
- A Python library for defining rule-based overrides on messy data ☆11 · Updated 8 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆118 · Updated 2 weeks ago
- DocumentCloud's back end source code - Please report bugs, issues and feature requests to info@documentcloud.org ☆32 · Updated this week