Scripts for building a geo-located web corpus using Common Crawl data
☆11Jan 18, 2026Updated last month
Alternatives and similar repositories for common_crawl_corpus
Users that are interested in common_crawl_corpus are comparing it to the libraries listed below
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages☆11Feb 6, 2024Updated 2 years ago
- Here are all of the PowerPoint presentations that I have ever created and presented.☆12Dec 28, 2020Updated 5 years ago
- Super simple, zero config options, <2kb declarative tooltip library with no dependencies.☆17Jun 2, 2023Updated 2 years ago
- The platform behind nous le peuple. Fork of Pligg.☆10Sep 29, 2015Updated 10 years ago
- Curated list of awesome datasets for various table understanding tasks☆18Sep 5, 2025Updated 5 months ago
- Crawler based on a modified browser to detect online tracking.☆11Jul 19, 2023Updated 2 years ago
- Residual Quantization Autoencoder, used for interpreting LLMs☆14Jan 1, 2025Updated last year
- Multitaper R package available on CRAN☆10Jul 17, 2024Updated last year
- My OpenCode and Oh-My-OpenCode configuration files with API proxy setup documentation☆32Jan 5, 2026Updated last month
- A repository for resources relating to NLP in the Balochi language☆19Jun 3, 2023Updated 2 years ago
- Utilities to gather software metrics from tools (SONAR, etc) and store them into ElasticSearch for later display using Kibana.☆11Dec 31, 2017Updated 8 years ago
- Persian Datasets including: Wikipedia, Twitter, Hamshahri, Hellokish, NSURL'19, Peyma, Text_mining.ir☆11Oct 6, 2023Updated 2 years ago
- FamilyTool benchmark☆12Sep 10, 2025Updated 5 months ago
- Command-line corpus tools☆12May 15, 2017Updated 8 years ago
- ☆10Oct 15, 2020Updated 5 years ago
- Morfessor EM+Prune☆10Jul 22, 2020Updated 5 years ago
- convert subtitles to raw text☆10Nov 23, 2016Updated 9 years ago
- Use Python to Automate the PowerPoint Update☆15May 28, 2023Updated 2 years ago
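- Building a small LLM to learn the LLM modeling and training process. A small model based on the DeepSeek-MOE architecture, for personal study, starting from scratch, with every statement explained☆14Mar 28, 2025Updated 11 months ago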
- A smart distributed crawler that infers navigation models of structured websites, used to cluster pages based on their structure and extr…☆10Aug 17, 2025Updated 6 months ago
- AWS Sample for extracting sensor data and detecting scenes from autonomous driving data collected in ROS bag files.☆12Sep 27, 2021Updated 4 years ago
- Go through the list of accepted papers for ICLR in terminal and add them to your reading list.☆13Jan 30, 2021Updated 5 years ago
- A Python package for learning, evaluating, annotating, and extracting vector representations of construction grammars☆43Oct 17, 2024Updated last year
- Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging (AACL'22)☆11Aug 25, 2023Updated 2 years ago
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks☆12Nov 9, 2021Updated 4 years ago
- My presentation at RStudio::conf(2019), Austin, Tx - "The lazy and easily distracted report writer"☆17Feb 5, 2019Updated 7 years ago
- This repository contains a series of 4 jupyter notebooks demonstrating how AWS AI Services like Amazon Rekognition, Amazon Transcribe and…☆13Nov 26, 2021Updated 4 years ago
- ☆22Feb 3, 2026Updated last month
- a light-weight database application designed to standardize and simplify data entry of archaeological or historical artifacts.☆13Updated this week
- An OSINT tool to find data leaks on a targeted website☆17Mar 30, 2021Updated 4 years ago
- a fast implementation of BM25☆10Sep 15, 2022Updated 3 years ago
- Library and examples to interface a HPGL plotter such as HP7550a to processing.☆10Jan 15, 2015Updated 11 years ago
- Code associated with the project http://predimportance.mit.edu/☆12Aug 7, 2020Updated 5 years ago
- ☆12Jun 25, 2018Updated 7 years ago
- FreeBSD Bash script to run a rclone backup to Backblaze B2☆13Aug 22, 2025Updated 6 months ago
- Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, Kafka Stream API and Hazelcast Jet☆10Apr 3, 2024Updated last year
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Updated this week
- Regular Expressions for finding wrong punctuation before publishing.☆10May 5, 2017Updated 8 years ago
- SQL and Bash scripts to import the official Stack Overflow data dump and the SOTorrent data set, to retrieve Stack Overflow references fro…☆15Sep 14, 2025Updated 5 months ago