LukasKriesch / CommonCrawlNewsDataSet
This repository contains code to download, extract, filter, and geocode news articles from the Common Crawl News Dataset.
☆24Updated 8 months ago
Alternatives and similar repositories for CommonCrawlNewsDataSet
Users who are interested in CommonCrawlNewsDataSet are comparing it to the repositories listed below.
- Python-based Wikidata framework for easy dataframe extraction☆45Updated 2 years ago
- A web application that interfaces between openalex.org and Gephi☆11Updated 5 months ago
- Citation Classification using hybrid neural network model for Wikipedia References☆31Updated 3 years ago
- Archiving and transforming official Italian General Election text-only polls into machine readable data using Large Language Models☆16Updated 2 weeks ago
- Libraries, Archives and Museums (LAM)☆88Updated 3 years ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …☆69Updated last month
- Linked SDMX☆17Updated 11 years ago
- Collection de romans français du dix-huitième siècle (1751-1800) / Collection of Eighteenth-Century French Novels (1751-1800)☆23Updated last year
- Generic platform for large scale collaborative planning☆17Updated 4 months ago
- Imports Wiktionary's grammatical data into Wikidata☆18Updated 6 years ago
- Repository for the Procedural Knowledge Ontology (PKO)☆27Updated 2 months ago
- Natural language processing on 12k+ country lyrics🍺☆30Updated 7 years ago
- RDF Community Discussions. Ask anything here!☆13Updated last year
- ☆40Updated 7 years ago
- Extract networks of entities from journalistic reporting☆49Updated 2 years ago
- ☆10Updated 9 years ago
- Ontologies of Linguistic Annotation. Machine-readable tagsets and annotation schemata for more than 100 languages.☆22Updated last week
- A repository of datasets for learning and mastering Gephi☆25Updated last month
- A Python module to manipulate data on a Wikibase instance (like Wikidata) through the MediaWiki Wikibase API and the Wikibase SPARQL endp…☆87Updated 2 weeks ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.☆152Updated 2 months ago
- Fixed and optimized OMI polygons from Agenzia delle Entrate☆29Updated 7 months ago
- Lehigh University Benchmark (LUBM).☆10Updated 5 years ago
- Bagpipes spaCy is a collection of custom spaCy pipeline components designed to enhance text processing capabilities.☆21Updated last year
- A library for working with prompt templates locally or on the Hugging Face Hub.☆55Updated 11 months ago
- Homebase of the IPTC EXTRA project about rule-based text categorization☆13Updated 8 years ago
- ETL pipeline, graphical explorer and general toolbox for investigations with Follow the Money data☆25Updated 6 months ago
- Wikidata authority file mapping tool☆11Updated 7 years ago
- Samples of Entando applications☆12Updated 3 years ago
- This repository hosts materials from the CLiC-IT 2023 tutorial☆30Updated last year
- Neo4j powered web application for multimedia collections: bring graph-based exploration and crowd-based indexation.☆24Updated 5 years ago