adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
☆4,015 Updated 3 weeks ago
Alternatives and similar repositories for trafilatura:
Users who are interested in trafilatura are comparing it to the libraries listed below.
- Article extraction benchmark: dataset and evaluation scripts ☆305 Updated 10 months ago
- A Bulletproof Way to Generate Structured JSON from Language Models ☆4,632 Updated last year
- Improved file parsing for LLMs ☆2,849 Updated 4 months ago
- Structured Text Generation ☆11,020 Updated this week
- 💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows ☆10,539 Updated this week
- A Repo For Document AI ☆2,744 Updated this week
- 📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites. ☆663 Updated this week
- Seamlessly integrate LLMs as Python functions ☆2,221 Updated last week
- An easy way to extract information from documents ☆1,739 Updated last year
- Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines. ☆10,450 Updated this week
- Fast Python port of arc90's readability tool, updated to match the latest readability.js! ☆2,744 Updated last month
- This repo provides the server-side code for the llmsherpa API to connect to. It includes parsers for various file formats. ☆1,195 Updated 5 months ago
- ✨ AI agents that spark joy ☆5,552 Updated this week
- Developer APIs to Accelerate LLM Projects ☆1,597 Updated 4 months ago
- Heuristic-based boilerplate removal tool ☆758 Updated 2 weeks ago
- Large Language Model Text Generation Inference ☆9,877 Updated this week
- Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Ge… ☆5,789 Updated this week
- Numbers every LLM developer should know ☆4,190 Updated last year
- 🦙 Integrating LLMs into structured NLP pipelines ☆1,210 Updated 2 months ago
- A blazing fast inference solution for text embeddings models ☆3,280 Updated this week
- A language for constraint-guided and efficient LLM programming. ☆3,851 Updated 9 months ago
- Prompt Engineering | Prompt Versioning | Use GPT or other prompt based models to get structured output. Join our discord for Prompt-Engin… ☆3,429 Updated last month
- Adding guardrails to large language models. ☆4,593 Updated last week
- Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-… ☆3,309 Updated last month
- Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors). ☆1,226 Updated 2 weeks ago
- The LLM Evaluation Framework ☆5,484 Updated this week
- SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 15+ clouds). Get unified execution, cost savings, and high GPU availability v… ☆7,495 Updated this week
- [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which ach… ☆4,929 Updated this week
- Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali ☆1,896 Updated last month
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,029 Updated this week