adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
☆4,069 · Updated last week
Alternatives and similar repositories for trafilatura:
Users who are interested in trafilatura are comparing it to the libraries listed below:
- All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows ☆10,631 · Updated this week
- The most accurate natural language detection library for Python, suitable for short text and mixed-language text ☆1,294 · Updated last week
- Structured Text Generation ☆11,152 · Updated this week
- A language for constraint-guided and efficient LLM programming. ☆3,871 · Updated 9 months ago
- LLM ☆1,661 · Updated last month
- Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites. ☆687 · Updated 2 weeks ago
- Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets ☆4,397 · Updated this week
- Article extraction benchmark: dataset and evaluation scripts ☆309 · Updated 11 months ago
- Improved file parsing for LLM's ☆2,877 · Updated 4 months ago
- A Bulletproof Way to Generate Structured JSON from Language Models ☆4,662 · Updated last year
- Seamlessly integrate LLMs as Python functions ☆2,233 · Updated 3 weeks ago
- Integrating LLMs into structured NLP pipelines ☆1,217 · Updated 2 months ago
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,064 · Updated last week
- Go ahead and axolotl questions ☆8,960 · Updated this week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,322 · Updated this week
- ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23) ☆3,296 · Updated 4 months ago
- [ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings ☆1,925 · Updated 2 months ago
- Prompt Engineering | Prompt Versioning | Use GPT or other prompt based models to get structured output. Join our discord for Prompt-Engin… ☆3,450 · Updated last month
- Rapid fuzzy string matching in Python using various string metrics ☆2,970 · Updated last week
- OpenChat: Advancing Open-source Language Models with Imperfect Data ☆5,318 · Updated 6 months ago
- structured outputs for llms ☆9,919 · Updated this week
- Adding guardrails to large language models. ☆4,692 · Updated 2 weeks ago
- Large Language Model Text Generation Inference ☆9,922 · Updated this week
- Heuristic based boilerplate removal tool ☆764 · Updated last month
- ☆2,892 · Updated 6 months ago
- A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain ☆3,472 · Updated last year
- Easy token price estimates for 400+ LLMs. TokenOps. ☆1,609 · Updated this week
- 20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale. ☆11,878 · Updated this week
- [EMNLP'23, ACL'24] To speed up LLMs' inference and enhance LLM's perceive of key information, compress the prompt and KV-Cache, which ach… ☆4,969 · Updated 2 weeks ago
- Python package for easily interfacing with chat apps, with robust features and minimal code complexity. ☆3,503 · Updated 8 months ago