AI4Bharat/webcorpus

Generate large textual corpora for almost any language by crawling the web

☆13

Alternatives and similar repositories for webcorpus

Users who are interested in webcorpus are comparing it to the libraries listed below.

Open-Speech-EkStep / data-acquisition-pipeline
☆18 · Apr 28, 2021 · Updated 5 years ago
masakhane-io / masakhane-reading-group
Agile reading group that works
☆13 · Feb 2, 2022 · Updated 4 years ago
Open-Speech-EkStep / indic-punct
☆45 · Dec 15, 2022 · Updated 3 years ago
AI4Bharat / FBI
FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists
☆31 · Aug 14, 2025 · Updated 11 months ago
AI4Bharat / DocSim
Synthetically generate random text document images with ground-truth
☆14 · Jul 20, 2021 · Updated 5 years ago
alejandro-g-m / DetExt
Detection of malicious data exfiltration over DNS using Machine Learning techniques
☆13 · Jul 8, 2020 · Updated 6 years ago
in-rolls / parse_searchable_rolls
Parse Searchable Electoral Rolls
☆13 · Apr 20, 2025 · Updated last year
microsoft / Lightweight-Low-Resource-NMT
Official code for "Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Reso…
☆18 · Oct 9, 2025 · Updated 9 months ago
AI4Bharat / IndicWav2Vec
Pretraining, fine-tuning and evaluation scripts for Indic-Wav2Vec2
☆117 · Aug 28, 2025 · Updated 10 months ago
RocketChat / Apps.Rasa
Integration between Rocket.Chat and the RASA Chatbot platform
☆17 · Jul 31, 2023 · Updated 2 years ago
microsoft / MMLMCalibration
Code for EMNLP 2022 Paper: On the Calibration of Massively Multilingual Language Models
☆15 · Jun 12, 2023 · Updated 3 years ago
skit-ai / slu-prosody
Code repository for the paper "Improving End-to-End SLU performance with Prosodic Attention and Distillation" accepted at Interspeech 202…
☆27 · May 17, 2023 · Updated 3 years ago
microsoft / Litmus
AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems
☆48 · Aug 19, 2022 · Updated 3 years ago
VarunGumma / IndicTransToolkit
A simple, consistent and extendable toolkit for IndicTrans2. (Pypi: https://pypi.org/project/indictranstoolkit)
☆39 · Apr 30, 2026 · Updated 2 months ago
raj-sutariya / indic-num2words
Python library for converting numbers to words for all Indian Languages.
☆38 · May 23, 2025 · Updated last year
AI4Bharat / Indic-TTS
Text-to-Speech for languages of India
☆378 · Nov 8, 2024 · Updated last year
sumanthd17 / Face-Recognition
Face Recognition based attendance system for classroom environment. Developed a python API which recognizes the people in a picture(of a …
☆14 · Dec 8, 2022 · Updated 3 years ago
Open-Speech-EkStep / vakyansh-tts
Text to Speech for Indic languages
☆53 · Mar 23, 2022 · Updated 4 years ago
zuhairmhtb / AudioClassification
This software is a demonstration of Audio Signal Processing and Machine Learning using Python and Tensorflow. The software contains a GU…
☆12 · Dec 7, 2023 · Updated 2 years ago
melanieshi0120 / COVID-19_global_time_series_panel_data
☆18 · Jan 15, 2021 · Updated 5 years ago
zhunyoung / clingoTutorial
☆12 · Feb 6, 2023 · Updated 3 years ago
holdenk / diversity-analytics
Analytics on Apache Projects for Diversity
☆18 · Jun 18, 2019 · Updated 7 years ago
Open-Speech-EkStep / audio-to-speech-pipeline
This will hold the data pipeline to convert raw audio data to speech which will act as input dataset for speech-to-text pipeline
☆33 · Feb 15, 2023 · Updated 3 years ago
bitextor / bifixer
Tool to fix bitexts and tag near-duplicates for removal
☆35 · Sep 4, 2025 · Updated 10 months ago
PacktPublishing / -Hands-on-Python-3.x-GUI-Programming
Hands-on Python 3.x GUI Programming, Published by Packt
☆13 · Jan 18, 2021 · Updated 5 years ago
McGill-NLP / latent-translation
Code for the paper "Modelling Latent Translations for Cross-Lingual Transfer"
☆17 · Nov 22, 2021 · Updated 4 years ago
scaleracademy / react-exclusive-bootcamp-14-jul
☆40 · Jul 14, 2022 · Updated 4 years ago
cycloneintensity / CrossKnotHacks-Cyclonet
CycloNet is a Deep Learning based web-app for Cyclone intensity computation using INSAT-3D Cyclone Imagery
☆13 · Sep 17, 2023 · Updated 2 years ago
project-anuvaad / anuvaad-parallel-corpus
☆24 · May 5, 2022 · Updated 4 years ago
kaushal0494 / ZmBART
☆11 · Mar 19, 2023 · Updated 3 years ago
datanizing / oreilly-open-source-llm
☆41 · May 21, 2026 · Updated 2 months ago
adijo / gpt3-alchemy
GPT-3 attempts to predict & balance chemical reactions
☆13 · Aug 2, 2020 · Updated 5 years ago
amrrs / custom-ner-with-spacy3
Custom Named Entity Recognition with Spacy3
☆31 · Dec 30, 2021 · Updated 4 years ago
pomber / docusaurus-mdx-2
A Docusaurus theme to add support for MDX v2
☆28 · Jul 20, 2022 · Updated 4 years ago
priyanshu2103 / Sanskrit-Hindi-Machine-Translation
Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning
☆20 · Jan 16, 2021 · Updated 5 years ago
bastibe / PySoundFile
DEPRECATED version of SoundFile
☆14 · May 26, 2020 · Updated 6 years ago
vaguenebula / AlpacaDataReflect
An experiment to see if chatgpt can improve the output of the stanford alpaca dataset
☆12 · Mar 29, 2023 · Updated 3 years ago
halolimat / SpExtor
SpExtor: Sparse Entity Extractor
☆11 · Feb 10, 2020 · Updated 6 years ago
llyx97 / Rosita
[AAAI 2021] "ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques", Yuanxin Liu, Zheng Lin, Fengcheng Yuan
☆14 · Oct 18, 2022 · Updated 3 years ago