ilinguistics/common_crawl_corpus

Scripts for building a geo-located web corpus using Common Crawl data

☆11

Alternatives and similar repositories for common_crawl_corpus

Users who are interested in common_crawl_corpus are comparing it to the repositories listed below.

transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆11 · Feb 6, 2024 · Updated 2 years ago
archiedb / archie
a light-weight database application designed to standardize and simplify data entry of archaeological or historical artifacts.
☆13 · May 28, 2026 · Updated last month
cjohnson318 / fractal_interpolation
This implements a technique for curve fitting by fractal interpolation found in a paper by Manousopoulos, Drakopoulos, and Theoharis, fou…
☆17 · Feb 27, 2014 · Updated 12 years ago
recurve-inc / flexvalue
A Python library to calculate avoided costs
☆16 · Aug 19, 2025 · Updated 11 months ago
MitchMilam / PowerPoints
Here are all of the PowerPoint presentations that I have ever created and presented.
☆12 · Dec 28, 2020 · Updated 5 years ago
SamwitAdhikary / WhatsApp-ChatBot
It is a basic Rule Base ChatBot purely made in Python(Flask) and run on server hosted on PythonAnywhere and works with the help of Twilio…
☆18 · Aug 13, 2024 · Updated last year
cydalytics / Python_PowerPoint_Automation
Use Python to Automate the PowerPoint Update
☆15 · May 28, 2023 · Updated 3 years ago
ilinguistics / c2xg
A Python package for learning, evaluating, annotating, and extracting vector representations of construction grammars
☆43 · Oct 17, 2024 · Updated last year
pplcc / ubuntu-tensorflow-pytorch-setup
Setup of TensorFlow and PyTorch on Ubuntu 18.04 -- the easy way!
☆31 · May 14, 2020 · Updated 6 years ago
carlbordum / common-crawl-subdomains
subdomain list based on Common Crawl data, sorted by popularity
☆18 · Nov 19, 2019 · Updated 6 years ago
maxrousseau / pynoter
Convert powerpoint (pptx) files into raw text org or LaTeX files
☆15 · Aug 28, 2018 · Updated 7 years ago
krahim / multitaper
Multitaper R package available on CRAN
☆10 · Jul 17, 2024 · Updated 2 years ago
wartortell / Trollette
Automated generation of powerpoint slides for fun and profit
☆13 · Oct 18, 2017 · Updated 8 years ago
eric-guerin / powerpoint-progressbar
Automation of the creation of a progress bar in powerpoint, and an overview of the sections on each slide
☆13 · Nov 14, 2017 · Updated 8 years ago
sebschu / Submiterator
Python script to streamline the process of posting external HITs to Amazon's Mechanical Turk crowdsourcing website.
☆11 · Oct 20, 2020 · Updated 5 years ago
a-nap / Digital-Research-Toolkit
Digital Research Toolkit for Linguists course materials
☆12 · Jul 23, 2025 · Updated last year
DavidNemeskey / cc_corpus
Tools for compiling corpora from Common Crawl
☆14 · Nov 24, 2024 · Updated last year
ericjang / pptx-export-notes
Exports plaintext speaker notes from Microsoft Powerpoint .pptx files
☆20 · Feb 28, 2018 · Updated 8 years ago
jaeyk / validated_names
☆15 · Dec 23, 2024 · Updated last year
Smerity / cs205_ga
How deep does Google Analytics go? Efficiently tackling Common Crawl using AWS & MapReduce
☆17 · Feb 5, 2014 · Updated 12 years ago
martinctc / PowerPoint-VBA
Save yourself from 'Death by PowerPoint'
☆15 · Feb 18, 2020 · Updated 6 years ago
jkmcnk / mansplain
A mansplaining tool for bourne-like shells
☆11 · Feb 2, 2020 · Updated 6 years ago
LeoVarnet / fastACI
fastACI toolbox: the MATLAB toolbox for investigating auditory perception using reverse correlation.
☆16 · Apr 16, 2026 · Updated 3 months ago
matt-dray / r.oguelike
R package: a tile-based roguelike toy for R's console, featuring procedural dungeons and enemy pathfinding
☆18 · Jan 3, 2023 · Updated 3 years ago
pewresearch / pewplots-nicar
Materials from NICAR session "Customizing ggplot for yourself or your organization"
☆16 · Feb 2, 2026 · Updated 5 months ago
harish-kamath / rqae
Residual Quantization Autoencoder, used for interpreting LLMs
☆14 · Jan 1, 2025 · Updated last year
fpdetective / modCrawler
Crawler based on a modified browser to detect online tracking.
☆11 · Jul 19, 2023 · Updated 3 years ago
yobibyte / iclr-viewer
Go through the list of accepted papers for ICLR in terminal and add them to your reading list.
☆13 · Jan 30, 2021 · Updated 5 years ago
sglebs / kibana-software-metrics
Utilities to gather software metrics from tools (SONAR, etc) and store them into ElasticSearch for later display using Kibana.
☆11 · Dec 31, 2017 · Updated 8 years ago
fongandrew / pptx-note-remover
Python script to remove notes from PPTX Powerpoint files
☆17 · Nov 18, 2022 · Updated 3 years ago
MasonPhonLab / MAPS
Mason-Alberta Phonetic Segmenter
☆15 · Feb 24, 2026 · Updated 5 months ago
brodieG / ggbg
Miscellaneous Ggplot2 Extensions
☆23 · Oct 3, 2018 · Updated 7 years ago
ropensci / phonfieldwork
R package for phonetic research and experimenting
☆20 · Jul 29, 2024 · Updated last year
LuisaMaerz / KnowMAN
KnowMAN: Weakly Supervised Multinomial Adversarial Networks
☆12 · Nov 9, 2021 · Updated 4 years ago
alvations / SeedLing
Building and Using A Seed Corpus for the Human Language Project
☆11 · Feb 9, 2018 · Updated 8 years ago
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
byvlstr / ContextDots
A PowerPoint Macro to see the presentation's progress
☆22 · Sep 11, 2017 · Updated 8 years ago
sney2002 / PPTExtractor
Extract images from PowerPoint files
☆17 · Dec 1, 2011 · Updated 14 years ago
networkdynamics / geoinference
☆32 · Oct 20, 2015 · Updated 10 years ago