karust / gogetcrawl
Extract web archive data using Wayback Machine and Common Crawl
☆156 · Updated 6 months ago
Alternatives and similar repositories for gogetcrawl
Users interested in gogetcrawl are comparing it to the libraries listed below.
- Common crawl extractor ☆75 · Updated 11 months ago
- Yet another googlesearch - A Python library for executing intelligent, realistic-looking, and tunable Google searches. ☆275 · Updated last year
- Easy to deploy API for transcribing and translating audio / video using OpenAI's whisper model. ☆67 · Updated last year
- The Architecture of a Web Crawler: Building a Google-Inspired Distributed Web Crawler ☆115 · Updated 5 months ago
- Retrieves archived tweets from Wayback Machine in HTML, CSV, and JSON ☆104 · Updated 3 weeks ago
- Curated list of categorized User Agents ☆91 · Updated this week
- A UserScript to detect GPT generated comments on Hackernews. ☆14 · Updated 2 years ago
- Drill into WARC web archives ☆138 · Updated 6 months ago
- An open source investigation tool to collect and analyse public VK community wall posts ☆36 · Updated 2 years ago
- DomainsProject.org HTTP worker ☆23 · Updated 2 years ago
- ☆21 · Updated 7 months ago
- This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API. ☆166 · Updated 3 weeks ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆178 · Updated this week
- A fast GitHub stargazers information gathering tool ☆73 · Updated 3 years ago
- Run a base query (plus optional add-ons) through ask, bing, brave, duck duck go, yahoo, and yandex. ☆22 · Updated 2 years ago
- go-trafilatura is a Go port of the trafilatura Python library. ☆61 · Updated 6 months ago
- CLI utility to scrape emails from websites ☆161 · Updated last year
- A tool for searching common variations of a human name ☆47 · Updated 7 months ago
- A Rumble, BitChute, and YouTube scraper ☆42 · Updated 2 years ago
- Search in Google Lens in lingo! Multi language search of image with export in HTML report ☆77 · Updated last year
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆137 · Updated 4 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆171 · Updated 4 months ago
- 📝 This repository contains dumps of the monthly "Chrome UX Report" (CrUX) datasets. ☆42 · Updated last month
- Pivot from a Twitter profile to Medium, Product Hunt, Mastodon, and more with OSINT ☆37 · Updated last year
- Given a subreddit name and a keyword, this program returns all top (by default) posts that contain the specified keyword. ☆90 · Updated last year
- Reverse Engineered Twitter's API ☆76 · Updated last year
- Community curated list of search queries for various products across multiple search engines. ☆175 · Updated last week
- Guide to searching in different file types (documents, breaches, databases, etc.) ☆52 · Updated last year
- A selection of useful Custom Search Engines for OSINT. ☆64 · Updated 3 months ago
- Index Common Crawl archives in tabular format ☆119 · Updated this week