adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
★4,205 · Updated last month
Alternatives and similar repositories for trafilatura:
Users who are interested in trafilatura are comparing it to the libraries listed below.
- A language for constraint-guided and efficient LLM programming. ★3,922 · Updated 11 months ago
- LLM ★1,669 · Updated 3 months ago
- Article extraction benchmark: dataset and evaluation scripts ★315 · Updated last year
- 💡 All-in-one open-source AI framework for semantic search, LLM orchestration and language model workflows ★10,876 · Updated this week
- A Repo For Document AI ★2,810 · Updated 3 weeks ago
- 🦙 Integrating LLMs into structured NLP pipelines ★1,240 · Updated 4 months ago
- 📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites. ★752 · Updated last month
- fast python port of arc90's readability tool, updated to match latest readability.js! ★2,768 · Updated this week
- Chat language model that can use tools and interpret the results ★1,548 · Updated this week
- Modular Python framework for AI agents and workflows with chain-of-thought reasoning, tools, and memory. ★2,287 · Updated this week
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode. ★303 · Updated 5 months ago
- PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents. ★7,089 · Updated last week
- Heuristic based boilerplate removal tool ★769 · Updated 2 months ago
- A Bulletproof Way to Generate Structured JSON from Language Models ★4,715 · Updated last year
- A fork of Dragnet that also extract author, headline, date, keywords from context, as well as built in metadata extraction all in one pac… ★276 · Updated last year
- [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings ★1,940 · Updated 3 months ago
- 👻 Experimental library for scraping websites using OpenAI's GPT API. ★1,433 · Updated 6 months ago
- Enforce the output format (JSON Schema, Regex etc) of a language model ★1,796 · Updated 2 months ago
- A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html ★863 · Updated 4 months ago
- Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines. ★11,059 · Updated last week
- A blazing fast inference solution for text embeddings models ★3,505 · Updated last week
- Structured Text Generation ★11,525 · Updated last week
- structured outputs for llms ★10,366 · Updated this week
- A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain ★3,480 · Updated last year
- Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way. ★998 · Updated last month
- Superfast AI decision making and intelligent processing of multi-modal data. ★2,569 · Updated 2 weeks ago
- Improved file parsing for LLM's ★2,943 · Updated 5 months ago
- Just the facts -- web page content extraction ★1,263 · Updated 10 months ago
- Seamlessly integrate LLMs as Python functions ★2,289 · Updated last week
- Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Tra… ★1,299 · Updated last year