commoncrawl/ia-web-commons

commoncrawl / ia-web-commons

Web archiving utility library

☆11

Alternatives and similar repositories for ia-web-commons

Users who are interested in ia-web-commons are comparing it to the repositories listed below.

lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
commoncrawl / language-detection-cld2
Natural language detection, Java bindings for CLD2
☆17 · Feb 26, 2026 · Updated 4 months ago
modelpredict / language-identification-survey
Live survey of off-the-shelf language identification tools for Python
☆27 · Apr 13, 2022 · Updated 4 years ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
gabrielpetersson / simple-variational-auto-encoder
a simple variational auto encoder with some exploration
☆13 · Nov 22, 2024 · Updated last year
jpoles1 / cq_style
☆13 · Sep 1, 2024 · Updated last year
cmu-flame / FLAME-MoE
Official repository for FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
☆42 · Sep 19, 2025 · Updated 10 months ago
thu-pacman / kaiyuan-mindformers
The training framework of PCMind-2.1-Kaiyuan-2B built on MindFormers
☆15 · Dec 9, 2025 · Updated 7 months ago
ZhihaoPENG-CityU / TIP22---AASSC-Net
This paper has been accepted by IEEE Transactions on Image Processing.
☆10 · Feb 24, 2023 · Updated 3 years ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago
commoncrawl / nutch
Common Crawl fork of Apache Nutch
☆42 · Updated this week
ARTME3D / Desktop-Filament-Extruder-MK2.5
☆17 · Jul 9, 2024 · Updated 2 years ago
sethfischer / cq-electronics
Pure CadQuery models of various electronic boards and components.
☆20 · Jul 22, 2025 · Updated last year
OpenMatch / Augmentation-Adapted-Retriever
[ACL 2023] This is the code repo for our ACL'23 paper "Augmentation-Adapted Retriever Improves Generalization of Language Models as Gener…
☆60 · Jul 12, 2024 · Updated 2 years ago
thunlp / Seq2Seq-Prompt
Source code for COLING 2022 paper "Automatic Label Sequence Generation for Prompting Sequence-to-sequence Models"
☆24 · Sep 21, 2022 · Updated 3 years ago
syllogismos / facebook-link-prediction
Facebook link prediction Kaggle challenge.
☆15 · Aug 10, 2014 · Updated 11 years ago
cxcscmu / General-AgentBench
Benchmark Test-Time Scaling of General LLM Agents
☆20 · Apr 14, 2026 · Updated 3 months ago
henryzhao5852 / BeamDR
☆15 · Oct 10, 2021 · Updated 4 years ago
CoderPat / text-generation-inference
An Apache 2.0 fork of HuggingFace's Large Language Model Text Generation Inference
☆19 · Mar 10, 2024 · Updated 2 years ago
arkhn-oss / FHIR2Dataset
Query FHIR APIs with SQL, for analytics and ML.
☆25 · Jun 15, 2021 · Updated 5 years ago
bigscience-workshop / metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
☆29 · Jun 12, 2023 · Updated 3 years ago
adaptyvbio / automancer
Automancer is a software application that enables researchers to design, automate, and manage their experiments.
☆27 · Jul 26, 2023 · Updated 2 years ago
EleutherAI / pilev2
☆13 · Jan 20, 2023 · Updated 3 years ago
yaacoo / graphRagSqlator
LLM graph-RAG SQL generator for large databases with poor documentation
☆19 · Sep 12, 2024 · Updated last year
Kurogoma4D / file-search-mcp
A specialized Model Context Protocol (MCP) server for full-text search within a filesystem.
☆20 · Apr 8, 2025 · Updated last year
HugoTini / DebugTShow
Visualize PyTorch tensors from VS Code debugger
☆18 · Nov 28, 2023 · Updated 2 years ago
violet-zct / group-conditional-DRO
Group-conditional DRO to alleviate spurious correlations
☆15 · Jul 15, 2021 · Updated 5 years ago
allenai / bff
☆39 · Apr 17, 2024 · Updated 2 years ago
sinaahmadi / PersoArabicLID
PALI: Language identification for Perso-Arabic Scripts
☆11 · Jul 11, 2023 · Updated 3 years ago
applicaai / CCpdf
Index of URLs to PDF files all over the internet and scripts
☆25 · May 2, 2023 · Updated 3 years ago
seonghyeonye / RoSPr
[EMNLP 2023 Findings] Efficiently Enhancing Zero-Shot Performance of Instruction Following Model via Retrieval of Soft Prompt
☆20 · Nov 2, 2023 · Updated 2 years ago
leo-liuzy / probe-across-time
☆22 · Aug 31, 2021 · Updated 4 years ago
gonglinyuan / metro_t0
Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023)
☆22 · Nov 1, 2023 · Updated 2 years ago
ruotianluo / lmdbdict
A simple wrapper for lmdb. Support dict-like operations.
☆23 · Apr 20, 2023 · Updated 3 years ago
castorini / TREC-COVID
TREC-COVID results - this is a mirror of data on the TREC website in a more convenient format.
☆15 · Aug 31, 2020 · Updated 5 years ago
polm / ipadic-py
IPAdic packaged for easy use from Python.
☆24 · Oct 31, 2021 · Updated 4 years ago
vthib / tlsh
Rust port of TLSH
☆14 · Oct 12, 2025 · Updated 9 months ago
ProfFan / Snap2LaTeX
A minimalist macOS app to convert a snap of Equation to LaTeX without paying
☆18 · Jun 9, 2026 · Updated last month
danielsc / dogbreeds
☆21 · Feb 19, 2019 · Updated 7 years ago