TeamHG-Memex/MaybeDont

A component that tries to avoid downloading duplicate content

☆28

Alternatives and similar repositories for MaybeDont

Users who are interested in MaybeDont are comparing it to the libraries listed below.

TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / autopager
Detect and classify pagination links
☆107 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / extract-html-diff
extract difference between two html pages
☆33 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆121 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆65 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / undercrawler
A generic crawler
☆81 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / domain-discovery-crawler
Broad crawler for domain discovery
☆20 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / tor-proxy
a tor socks proxy docker image
☆12 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / scrapy-mosquitera
Restrict crawl and scraping scope using matchers.
☆26 · Jun 8, 2016 · Updated 10 years ago
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆24 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / autologin
A project to attempt to automatically login to a website given a single seed
☆129 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / scrapy-dockerhub
[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.
☆12 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / scrapy-poet
Page Object pattern for Scrapy
☆127 · Jun 8, 2026 · Updated last month
scrapinghub / andi
Library for annotation-based dependency injection
☆24 · Updated this week
commonsearch / gumbocy
Python binding for gumbo-parser using Cython
☆14 · Aug 16, 2016 · Updated 9 years ago
TeamHG-Memex / scrapy-crawl-once
Scrapy middleware which allows to crawl only new content
☆80 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
ArturGaspar / scrapy-qtwebkit
☆13 · Dec 4, 2019 · Updated 6 years ago
croqaz / Stones
🗿Stones: Persistent key-value containers, compatible with Python dict
☆17 · Jul 15, 2024 · Updated 2 years ago
rmax / databrewer-recipes
DataBrewer Recipes Repository.
☆21 · Jul 5, 2016 · Updated 10 years ago
RohanGautam / rust-aws-lambda
Make a rust executable that runs on AWS lambda
☆10 · Mar 2, 2021 · Updated 5 years ago
SimonSapin / html5ever-python
Python bindings for html5ever, using CFFI
☆39 · Nov 9, 2017 · Updated 8 years ago
TeamHG-Memex / aquarium
Splash + HAProxy + Docker Compose
☆195 · Apr 8, 2026 · Updated 3 months ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 · Oct 26, 2017 · Updated 8 years ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
sethmlarson / whatwg-url
Python implementation of WHATWG URL Living Standard
☆20 · Jun 20, 2024 · Updated 2 years ago
rmax / scrapydo
Crochet-based blocking API for Scrapy.
☆47 · Feb 24, 2017 · Updated 9 years ago
scrapy / itemadapter
Common interface for data container classes
☆70 · Updated this week
lopuhin / scrapy-pyppeteer
Use pyppeteer from a Scrapy spider
☆59 · Feb 5, 2020 · Updated 6 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
cdunklau / fbemissary
A bot framework for the Facebook Messenger platform, built on asyncio and aiohttp
☆30 · May 31, 2017 · Updated 9 years ago
ericflo / django-session-user
A simple piece of middleware that can be added to your Django project which will store and retrieve the logged-in user's information from…
☆25 · Jun 9, 2011 · Updated 15 years ago
rmax / scrapy-boilerplate
Small set of utilities to simplify writing Scrapy spiders.
☆50 · Jul 24, 2015 · Updated 11 years ago
scrapinghub / aduana
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆54 · May 21, 2024 · Updated 2 years ago
jvanz / libwarc
C++ library to parse WARC files
☆11 · Jan 27, 2019 · Updated 7 years ago
scrapy / pypydispatcher
A fork of http://pydispatcher.sourceforge.net/ with PyPy support
☆16 · Jul 3, 2017 · Updated 9 years ago
Chennaipy / website
Chennaipy's website at chennaipy.org
☆13 · Jun 27, 2026 · Updated 3 weeks ago
msegala / Kaggle-National_Data_Science_Bowl
☆15 · Mar 28, 2016 · Updated 10 years ago