AI4Bharat/setu

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.

☆16

Alternatives and similar repositories for setu

Users who are interested in setu are comparing it to the repositories listed below.

Shaikhershad / Bulk-Image-Downloader-Free
bulk image downloader freeware, reddit bulk image downloader, bulk image downloader extension, bulk image downloader from url, bulk image…
☆26 · Feb 19, 2026 · Updated 5 months ago
mireshghallah / ft-memorization
☆13 · Oct 20, 2022 · Updated 3 years ago
HKUST-KnowComp / PrivaCI-Bench
☆23 · Apr 23, 2025 · Updated last year
cretz / esgopeta
Go implementation of the Gun distributed graph database
☆11 · Feb 26, 2019 · Updated 7 years ago
google / BEGIN-dataset
A benchmark dataset for evaluating dialog system and natural language generation metrics.
☆39 · Jun 13, 2022 · Updated 4 years ago
singhpratyush / index-search-query
Inverted Index, Query Formulation and Ranking from Scratch in Python
☆10 · Apr 24, 2018 · Updated 8 years ago
pires / nats-sniffer
A simple sniffer for NATS, the cloud native messaging system. https://nats.io
☆11 · Feb 15, 2016 · Updated 10 years ago
minaandrawos / golangnews
Golang news aggregator mobile application written in React Native (source:www.golangnews.com)
☆13 · Jun 23, 2026 · Updated last month
hemerajs / go-hemera
🔬Writing reliable & fault-tolerant microservices with https://nats.io
☆16 · Mar 27, 2018 · Updated 8 years ago
imperial-aisp / mia_llms_benchmark
Benchmarking MIAs against LLMs.
☆30 · Oct 8, 2024 · Updated last year
FieteO / doccano_spacy
Doccano annotation server together with a Spacy backend
☆11 · Apr 5, 2023 · Updated 3 years ago
agentsea / taskara
Task management for AI agents
☆17 · Jun 25, 2025 · Updated last year
cbruyndoncx / crewAI-xls
Gradio UI to load crewAI configuration from excel xls and generate the python code. The source of the crews is in the xls. It allows for …
☆10 · Oct 17, 2025 · Updated 9 months ago
plivo / actions-sms
Plivo github actions
☆14 · Nov 29, 2022 · Updated 3 years ago
Matthew17-21 / go-polymarket-real-time-data-client
A Go client to receive real-time data messages from Polymarket
☆12 · Jun 25, 2025 · Updated last year
artefactory / artefactual
☆43 · Updated this week
NorskRegnesentral / text-anonymization-benchmark
Annotated corpus + evaluation metrics for text anonymisation
☆77 · Jan 19, 2026 · Updated 6 months ago
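FishPiOffical / fishpi-pyclient
Python client for the FishPi (摸鱼派) chat room | 🥷 multiple simultaneous account logins | 💗 password-free login | ❤️ rich set of shortcut commands | 🌈 custom font colors | 🧧 red-packet scripts | config export
☆10 · May 7, 2025 · Updated last year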
Uvacoder / uvacodernotes-4261
☆12 · Mar 20, 2023 · Updated 3 years ago
The-Swarm-Corporation / Research-Paper-Writer-Swarm
Automate the creation of high quality research papers in latex. Powered by Swarms 🤖
☆11 · Dec 1, 2025 · Updated 7 months ago
allenai / infinigram-api
☆102 · Jul 16, 2026 · Updated last week
canack / sobot
Automated social media post sharing
☆12 · Jan 5, 2022 · Updated 4 years ago
golangbot / mysqltutorial
☆11 · Aug 22, 2023 · Updated 2 years ago
ottowg / gsap-ner
☆10 · Oct 2, 2024 · Updated last year
yarikama / Agentic-Advanced-RAG
Building a multi-agent RAG system with advanced RAG methods
☆13 · Jan 12, 2025 · Updated last year
kzhekov / GPTNeo-Horoscopes
Using the new open-sourced GPT Neo and horoscopes scraped and cleaned from various sources, fine-tune the model to generate realistic hor…
☆14 · Feb 8, 2022 · Updated 4 years ago
clearsitedesigns / crewai-custom-tools
Custom tools for agent based crewAI langchain solutions
☆10 · May 27, 2024 · Updated 2 years ago
ruvnet / ruv-engineer
rUv-Engineer - let's you describe UI using your imagination, then see it rendered live.
☆13 · Sep 28, 2024 · Updated last year
SkalskiP / YOLO-World
Real-Time Open-Vocabulary Object Detection
☆12 · Feb 7, 2024 · Updated 2 years ago
Sven-Bo / streamit-css-styling-demo
Demo App
☆11 · Jan 27, 2026 · Updated 5 months ago
ipython / ipython.github.com
Auto-generated sphinx version of the IPython website. Since this is an auto-generated directory, do *not* submit pull requests against th…
☆11 · Jun 2, 2026 · Updated last month
maurodelazeri / orderbook
Orderbook implementation in go with red black tree
☆14 · Sep 4, 2018 · Updated 7 years ago
romanornr / AtomicOTCswap
Easy to use Atomic Swap for bitcoin & altcoins
☆17 · Mar 26, 2019 · Updated 7 years ago
kangoo13 / proxy-checker
Check for a set of proxies different conditions, is the proxy working, does the proxy bypass cloudflare and so on.
☆13 · Mar 8, 2020 · Updated 6 years ago
mannaandpoem / AGIDreamFactory
Building a more intelligent world.
☆11 · Apr 29, 2024 · Updated 2 years ago
Josh-Hicks / scraped-tutorials
☆11 · Apr 15, 2022 · Updated 4 years ago
PacktPublishing / Building-Responsible-AI-with-Python
Machine Learning Data Fairness and Bias
☆15 · Apr 29, 2026 · Updated 2 months ago
chr1st1ank / narrow-down
Fast fuzzy text search
☆12 · May 16, 2023 · Updated 3 years ago
Tyler-Churchill / Embeddable-Widget-Preact-Typescript
Create embeddable widgets with Preact and Typescript
☆14 · Jul 8, 2026 · Updated 2 weeks ago