GateNLP/ultimate-sitemap-parser

Ultimate Website Sitemap Parser

☆255

Alternatives and similar repositories for ultimate-sitemap-parser

Users who are interested in ultimate-sitemap-parser are comparing it to the libraries listed below.

mediacloud / date_guesser
A library to extract a publication date from a web page, along with a measure of the accuracy.
☆41 · Aug 13, 2019 · Updated 6 years ago
mediacloud / feed_seeker
Find rss, atom, xml, and rdf feeds on webpages
☆31 · Nov 6, 2025 · Updated 8 months ago
the-real-tokai / grablinks
A simple and streamlined Python script to extract and filter links from a remote HTML resource.
☆24 · Jan 12, 2025 · Updated last year
ThomasAitken / Scrapy-Testmaster
The most advanced debugging and testing tool for Scrapy
☆16 · Apr 19, 2023 · Updated 3 years ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
malthe / skipdict
A skip dict is a Python dictionary which is permanently sorted by value.
☆19 · Sep 25, 2014 · Updated 11 years ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆966 · Apr 1, 2026 · Updated 3 months ago
CasAndreu / ldaRobust
This is a package to implement the Robust Latent Dirichlet Approach in R.
☆10 · Apr 25, 2019 · Updated 7 years ago
mediacloud / metadata-lib
How Media Cloud approaches extracting metadata from online news stories
☆17 · Apr 15, 2026 · Updated 3 months ago
simonw / datasette-insert
Datasette plugin for inserting and updating data
☆20 · Mar 29, 2024 · Updated 2 years ago
bisohns / search-engine-parser
Lightweight package to query popular search engines and scrape for result titles, links and descriptions
☆489 · Jun 23, 2026 · Updated 3 weeks ago
BenjaminDHorne / Language-Features-for-News
Language features used in the NELA Toolkit and other news studies
☆13 · Oct 14, 2020 · Updated 5 years ago
TeamHG-Memex / html-text
Extract text from HTML
☆135 · Apr 8, 2026 · Updated 3 months ago
Mondego / spacetime-crawler4py
Yet another Web crawler
☆14 · May 1, 2024 · Updated 2 years ago
MrDebugger / bs2json
A python3 module that converts your bs4 Tag into json object (dict)
☆16 · Mar 17, 2026 · Updated 4 months ago
koheiw / newsmap
Semi-supervised algorithm for geographical document classification
☆66 · May 14, 2026 · Updated 2 months ago
povilasb / scrapy-html-storage
Scrapy downloader middleware that stores response HTMLs to disk.
☆18 · Apr 14, 2026 · Updated 3 months ago
freedmand / interpogate
A visual tool to interpret and understand PyTorch machine learning models
☆17 · Feb 11, 2024 · Updated 2 years ago
scrapy / itemadapter
Common interface for data container classes
☆70 · Jul 12, 2026 · Updated last week
scrapy / itemloaders
Library to populate items using XPath and CSS with a convenient API
☆49 · Updated this week
NicolasLM / atoma
Atom, RSS and JSON feed parser for Python 3
☆117 · Oct 28, 2022 · Updated 3 years ago
freedmand / stepfunction-visualizer
A toolkit to debug and visualize local AWS step functions
☆15 · Oct 3, 2023 · Updated 2 years ago
scrapinghub / autopager
Detect and classify pagination links
☆15 · Sep 9, 2020 · Updated 5 years ago
Koshqua / scrapio
Simple and easy-to-use scraper and crawler in Go.
☆12 · May 4, 2020 · Updated 6 years ago
c4software / python-sitemap
Mini website crawler to make sitemap from a website.
☆378 · May 2, 2024 · Updated 2 years ago
kasperwelbers / RNewsflow
☆38 · Apr 3, 2024 · Updated 2 years ago
ushahidi / geograpy
Extract countries, regions and cities from a URL or text
☆216 · Sep 10, 2020 · Updated 5 years ago
reanalytics-databoutique / advanced-scrapy-proxies
Scrapy rotation proxy package with advanced functions
☆94 · Jul 4, 2022 · Updated 4 years ago
jandix / mediacloudr
API Wrapper for the mediacloud.org API
☆16 · Aug 20, 2019 · Updated 6 years ago
russomi-labs / appengine-python-flask-travis-ci
Skeleton repo for setting up flask + travis-ci + unittests + db migrations with Google App Engine!
☆11 · May 12, 2015 · Updated 11 years ago
jaeyk / tidyethnicnews
R package for turning Ethnic NewsWatch search results into tidyverse-ready dataframes
☆11 · Dec 7, 2021 · Updated 4 years ago
hybridtheory / floc-simhash
A fast python implementation of the SimHash algorithm.
☆27 · Oct 27, 2021 · Updated 4 years ago
buriy / python-readability
fast python port of arc90's readability tool, updated to match latest readability.js!
☆2,894 · Jan 26, 2026 · Updated 5 months ago
simonw / datasette-write
Datasette plugin providing a UI for executing SQL writes against the database
☆12 · Nov 11, 2025 · Updated 8 months ago
jexp / chagpt-coding
Coding with ChatGPT 4
☆12 · Jun 15, 2023 · Updated 3 years ago
impshum / simple-twitter-fact-bot
Simple facts bot (includes bs4 scraper example)
☆10 · Feb 24, 2017 · Updated 9 years ago
opentestimonials / opentestimonials
☆14 · Mar 15, 2024 · Updated 2 years ago
EducationalTestingService / match
Match tokenized words and phrases within the original, untokenized, often messy, text.
☆19 · Apr 11, 2023 · Updated 3 years ago
Suleman-Elahi / D1py
A very simple wrapper for Cloudflare D1 Databases' REST API in using Python for Python xD
☆13 · Jul 7, 2026 · Updated 2 weeks ago