lovasoa / wikipedia-externallinks-fast-extraction
Fast extraction of all external links from Wikipedia
☆10 · Updated 6 years ago
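The repository's own implementation isn't described on this page, but for context, extracting every external link from Wikipedia at scale usually means streaming the `externallinks` SQL dump from dumps.wikimedia.org rather than crawling pages. Below is a minimal illustrative sketch of that general technique; the dump filename, the `INSERT INTO` line format, and the simple quoting assumptions in the regex are assumptions about the standard dump layout, not this repo's code.

```python
# Illustrative sketch (not this repository's actual code): stream the
# gzipped externallinks SQL dump and pull http(s) URLs out of the
# INSERT statements with a regex. Assumes a dump such as
# enwiki-latest-externallinks.sql.gz from https://dumps.wikimedia.org/.
import gzip
import re

# Matches single-quoted http(s) URLs inside INSERT statements.
# Simplification: ignores SQL-escaped quotes, which are rare in URLs.
URL_RE = re.compile(rb"'(https?://[^']+)'")

def extract_external_links(dump_path):
    """Yield external-link URLs from a gzipped externallinks SQL dump."""
    with gzip.open(dump_path, "rb") as f:
        for line in f:
            # Data rows live on (very long) lines starting with INSERT INTO.
            if not line.startswith(b"INSERT INTO"):
                continue
            for match in URL_RE.finditer(line):
                yield match.group(1).decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Quick sanity check: print the first ten extracted links.
    for i, url in enumerate(extract_external_links("enwiki-latest-externallinks.sql.gz")):
        print(url)
        if i >= 9:
            break
```

Streaming the compressed dump line by line keeps memory flat regardless of dump size, which is what makes this kind of extraction fast compared with parsing page wikitext.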
Alternatives and similar repositories for wikipedia-externallinks-fast-extraction:
Users interested in wikipedia-externallinks-fast-extraction are comparing it to the libraries listed below.
- Web Page Inspection Tool UI. Google SERP Preview, Sentiment Analysis, Keyword Extraction, Named Entity Recognition & Spell Check ☆24 · Updated 2 years ago
- Command-line tool to filter expiring domains by configurable criteria ☆17 · Updated 2 years ago
- Wikipedia citation tool for Google Books, New York Times, ISBN, DOI and more ☆22 · Updated 8 years ago
- A semantic analysis tool to generate synonym.txt files for Solr. [RETIRED] ☆24 · Updated 8 years ago
- A library to parse the Wayback Machine (archive.org) to get historical views of web pages. It is a useful tool for research on the evolutio… ☆20 · Updated 6 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 · Updated last year
- Tools for tracking stories on news homepages ☆48 · Updated 5 years ago
- A Google Trends Analytics Package ☆13 · Updated 9 months ago
- A distributed system for mining Common Crawl using SQS, AWS EC2 and S3 ☆18 · Updated 10 years ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆14 · Updated 7 years ago
- Statistical WHOIS parser ☆10 · Updated 7 years ago
- Demo of the Newspaper article extraction library ☆29 · Updated 10 years ago
- A simple web crawler for stackshare.io using Scrapy ☆9 · Updated 6 years ago
- Matrix-based News Aggregation to Explore Media Bias ☆20 · Updated 6 years ago
- How hard is it to get a list of all local news sites in the United States (LOL) ☆8 · Updated 4 years ago
- Whit is an open-source SMS service that allows you to query CrunchBase, Wikipedia, and several other data APIs ☆198 · Updated 11 years ago
- Scraping Amazon reviews using headless Chrome and Selenium ☆10 · Updated 6 years ago
- Dump of generated texts from GPT-2 trained on /r/legaladvice subreddit titles ☆23 · Updated 5 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 · Updated 3 years ago
- Small set of utilities to simplify writing Scrapy spiders ☆49 · Updated 9 years ago
- Train a neural network optimized for generating Reddit subreddit posts ☆28 · Updated 6 years ago
- An easy-to-use and highly customizable crawler that enables you to create your own little web archives (WARC/CDX) ☆25 · Updated 7 years ago
- Find RSS, Atom, XML, and RDF feeds on webpages ☆30 · Updated 5 months ago
- Extract the difference between two HTML pages ☆32 · Updated 6 years ago
- A financial disclosure data extraction tool ☆13 · Updated last year
- Source real estate prices from the Common Crawl ☆27 · Updated 6 years ago
- API to extract a list of keywords from a text ☆18 · Updated 7 years ago
- Trough: Big data, small databases ☆40 · Updated 7 months ago
- Bot for operating snscrape in #archivebot on EFnet ☆10 · Updated 5 years ago
- A base library for building web scrapers for statistical data, and a helper ontology for (primarily Swedish) statistical data ☆13 · Updated 2 weeks ago