Webhose / free-news-datasets
Weekly free datasets from global news sites
☆33 · Updated last week
Alternatives and similar repositories for free-news-datasets
Users who are interested in free-news-datasets are comparing it to the libraries listed below.
- Chrome Extension for exploring Hugging Face datasets ☆48 · Updated last year
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆26 · Updated 10 months ago
- Common crawl extractor ☆84 · Updated last year
- Tools to construct and process Common Crawl webgraphs ☆103 · Updated 3 weeks ago
- A pipeline using LLMs for Knowledge Engineering, combining knowledge probing and Wikidata entity mapping. ☆38 · Updated last year
- Automated Document Intelligence Workflow ☆34 · Updated last month
- Small python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- VerifAI initiative to build open-source easy-to-deploy generative question-answering engine that can reference and verify answers for cor… ☆77 · Updated 3 months ago
- Newsfeed based on GDELT Project ☆31 · Updated last week
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph ☆25 · Updated last year
- Scripts to load the GDELT data set into MongoDB ☆14 · Updated 3 years ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆207 · Updated this week
- The Official NewsCatcher News API V2 SDK for Python ☆20 · Updated last year
- A News Article Collection Library ☆22 · Updated 2 years ago
- Newsdata.io Official Python Client ☆14 · Updated last month
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆65 · Updated last week
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction ☆83 · Updated last year
- Interactive visual tool for the demonstration of topic evolution ☆42 · Updated 4 years ago
- Pathik - High-Performance Web Crawler ☆31 · Updated 9 months ago
- Pivotal Token Search ☆142 · Updated 3 weeks ago
- A public repo that contains integrations for Argilla and LlamaIndex. ☆17 · Updated last year
- Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper. ☆58 · Updated 10 months ago
- Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Bui… ☆15 · Updated last year
- Automated Qualitative Analysis of LLMs (ICLR 2025) ☆53 · Updated 6 months ago
- Strwythura: construct an entity-resolved knowledge graph from structured data sources and unstructured content sources, implementing an o… ☆190 · Updated this week
- LLM plugin for clustering embeddings ☆82 · Updated last year
- Various Jupyter notebooks about Common Crawl data ☆61 · Updated last month
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆125 · Updated this week
- This repository provides various Python methods for finding and aggregating synonyms for an individual word or a list of words. ☆35 · Updated 2 years ago
- ☆62 · Updated last year