A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
☆298 · May 19, 2025 · Updated last year
Alternatives and similar repositories for extractnet
Users that are interested in extractnet are comparing it to the libraries listed below.
- Article extraction benchmark: dataset and evaluation scripts · ☆375 · May 29, 2026 · Updated last month
- Just the facts -- web page content extraction · ☆1,275 · Jul 8, 2025 · Updated 11 months ago
- Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM… · ☆6,203 · Updated this week
- AI based web-wrapper for web-content-extraction · ☆102 · Feb 6, 2023 · Updated 3 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters · ☆175 · Updated this week
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18 · ☆169 · Oct 28, 2021 · Updated 4 years ago
- Heuristic based boilerplate removal tool · ☆819 · Feb 25, 2025 · Updated last year
- ☆16 · Apr 10, 2026 · Updated 2 months ago
- code and data used to build a training dataset for dragnet models · ☆10 · Nov 29, 2020 · Updated 5 years ago
- Scrapyd on container infrastructure · ☆16 · May 29, 2026 · Updated last month
- quadipy is a python package to help transform structured data into RDF graph format · ☆19 · Apr 14, 2023 · Updated 3 years ago
- A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot! · ☆94 · Feb 25, 2025 · Updated last year
- news-please - an integrated web crawler and information extractor for news that just works · ☆2,456 · Apr 14, 2026 · Updated 2 months ago
- 🤖 Scrape data from HTML websites automatically by just providing examples · ☆1,385 · Mar 17, 2024 · Updated 2 years ago
- Detect and classify pagination links · ☆107 · Apr 8, 2026 · Updated 2 months ago
- Parallelized automatic corpus collection for ASR. Forked from https://github.com/EgorLakomkin/KTSpeechCrawler · ☆23 · Mar 21, 2021 · Updated 5 years ago
- Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages · ☆542 · Jul 17, 2021 · Updated 4 years ago
- YouTube-Based Multimodal Recipe Recommender · ☆14 · Jul 11, 2024 · Updated last year
- A utility package for getting image dimensions without loading files into memory. No dependencies! · ☆16 · May 23, 2023 · Updated 3 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. · ☆41 · Aug 13, 2019 · Updated 6 years ago
- ☆26 · Apr 14, 2026 · Updated 2 months ago
- This is a proof-of-concept of using an LLM to find and extract meaningful data without parsing the html too much. · ☆30 · Apr 18, 2023 · Updated 3 years ago
- A crowdsourced list of public datasets on the topic of Food · ☆26 · Jan 22, 2018 · Updated 8 years ago
- Turn any webpage into structured data using LLMs · ☆6,822 · Jun 15, 2026 · Updated 2 weeks ago
- Simplified DOM Trees for Transferable Attribute Extraction from the Web · ☆43 · Sep 27, 2024 · Updated last year
- LD-Explorer is the missing tool for exploring, federating and querying linked data resources directly from the browser · ☆22 · Jun 22, 2026 · Updated last week
- fast python port of arc90's readability tool, updated to match latest readability.js! · ☆2,896 · Jan 26, 2026 · Updated 5 months ago
- Targeted Data Generation with Large Language Models · ☆19 · Jun 25, 2024 · Updated 2 years ago
- 📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites. · ☆1,119 · Apr 30, 2026 · Updated 2 months ago
- Python infrastructure to train paths selectors for symbolic execution engines. · ☆15 · Updated this week
- Trigger an LLM in your CI/CD to auto-complete your work · ☆11 · Apr 5, 2023 · Updated 3 years ago
- spaCy module for linking text to Wikidata items · ☆244 · Mar 9, 2023 · Updated 3 years ago
- NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment · ☆16 · Apr 13, 2022 · Updated 4 years ago
- A TUI for Managing and Searching with Meilisearch · ☆20 · Aug 26, 2025 · Updated 10 months ago
- Scraping assistant tool. Editing and maintaining CSS/XPath selectors across webpages. · ☆129 · May 19, 2018 · Updated 8 years ago
- A graph query engine · ☆27 · May 29, 2026 · Updated last month
- A collection of pre-built speech synthesis settings used to convey emotion · ☆11 · Jul 9, 2019 · Updated 6 years ago
- Python scraper based on AI · ☆27,473 · Jun 23, 2026 · Updated last week
- Scripts to automatically sync Claude Code generated TODO to TaskWarrior · ☆17 · Jun 22, 2025 · Updated last year