Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆16 · May 17, 2024 · Updated last year
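The four stages named in the description can be pictured with a minimal sketch. This is not Setu's code and it does not use Apache Spark: it is a plain-Python illustration, and every function name, the `min_words` threshold, and the exact-hash deduplication strategy are assumptions made for the example.

```python
import hashlib
import re


def prepare(raw_docs):
    """Stage 1 (document preparation): wrap raw strings into records."""
    return [{"id": i, "text": text} for i, text in enumerate(raw_docs)]


def clean_and_analyze(doc):
    """Stage 2 (cleaning and analysis): collapse whitespace, count words."""
    text = re.sub(r"\s+", " ", doc["text"]).strip()
    return {**doc, "text": text, "word_count": len(text.split())}


def flag_and_filter(docs, min_words=3):
    """Stage 3 (flagging and filtering): drop documents that are too short.

    The min_words threshold is a hypothetical filter chosen for illustration.
    """
    return [d for d in docs if d["word_count"] >= min_words]


def deduplicate(docs):
    """Stage 4 (deduplication): keep the first document per text hash.

    Exact-match hashing of lowercased text is an assumed strategy here.
    """
    seen, unique = set(), []
    for d in docs:
        digest = hashlib.sha256(d["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def run_pipeline(raw_docs, min_words=3):
    """Chain the four stages in the order the description lists them."""
    docs = prepare(raw_docs)
    docs = [clean_and_analyze(d) for d in docs]
    docs = flag_and_filter(docs, min_words)
    return deduplicate(docs)


if __name__ == "__main__":
    sample = [
        "Hello   world  again",
        "hello world again",
        "too short",
        "A different   document here",
    ]
    for doc in run_pipeline(sample):
        print(doc["id"], doc["text"])
```

In a Spark-based pipeline each stage would presumably operate on distributed DataFrames rather than Python lists, but the ordering of the stages would be the same.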
Alternatives and similar repositories for setu
Users who are interested in setu are comparing it to the libraries listed below
- ☆47 · Feb 10, 2026 · Updated 3 weeks ago
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆392 · Oct 7, 2024 · Updated last year
- AI-powered cryptocurrency trading bot built using deep reinforcement learning (DRL). The bot is designed as a research platform for devel… ☆10 · Jan 18, 2025 · Updated last year
- Jonas Schmedtmann - The Complete JavaScript Course 2018 ☆12 · Jan 21, 2019 · Updated 7 years ago
- FakeChecker is a part of my Engineering thesis project at the Warsaw University of Technology. Its aim is to detect fake reviews on Google Ma… ☆12 · Jun 13, 2023 · Updated 2 years ago
- Conversion of audio files to text using whisper from OpenAI with a simple tkinter GUI ☆10 · Apr 13, 2023 · Updated 2 years ago
- ☆10 · Oct 2, 2024 · Updated last year
- A repository showing how to implement a basic RAG pipeline using LangChain in Supabase Edge Functions. ☆11 · May 22, 2024 · Updated last year
- Parr(B)ot is a Telegram bot framework built on top of Echotron ☆10 · Jan 15, 2023 · Updated 3 years ago
- Project to generate Fake Reviews using Tensorflow's word RNN model with text smoothing technique ☆10 · Jun 6, 2018 · Updated 7 years ago
- A curated collection of 650+ AI tools for productivity, creativity, and innovation. Contribute via pull requests to join the community! E… ☆15 · Jun 25, 2025 · Updated 8 months ago
- Automated social media post sharing ☆11 · Jan 5, 2022 · Updated 4 years ago
- Demo App ☆11 · Jan 27, 2026 · Updated last month
- LLM Building Blocks for Python Course ☆16 · Nov 17, 2025 · Updated 3 months ago
- A blog starter project ☆11 · Oct 29, 2018 · Updated 7 years ago
- Inverted Index, Query Formulation and Ranking from Scratch in Python ☆10 · Apr 24, 2018 · Updated 7 years ago
- FinanceCrew is an AI-powered tool that helps day traders analyze markets, develop strategies, and manage risks using CrewAI. ☆13 · Aug 9, 2024 · Updated last year
- Slop Scoring to Stop Slop ☆48 · Updated this week
- The official evaluation suite and dynamic data release for MixEval. ☆11 · Sep 23, 2024 · Updated last year
- AI Agent using Reasoning and Actions (ReAct) Multi AI Agent system for Cisco Systems ☆11 · Oct 5, 2024 · Updated last year
- Doccano annotation server together with a Spacy backend ☆11 · Apr 5, 2023 · Updated 2 years ago
- Romeo GPT is an AI assistant designed to provide a suite of services which range from document management and analysis to multifaceted AI… ☆14 · Jun 12, 2023 · Updated 2 years ago
- Building a more intelligent world. ☆11 · Apr 29, 2024 · Updated last year
- Detect potential insider trading activity on Polymarket prediction markets by tracking suspicious wallet behavior patterns - fresh wallet… ☆38 · Jan 5, 2026 · Updated 2 months ago
- ☆12 · Feb 16, 2026 · Updated 2 weeks ago
- A TypeScript implementation of OpenAI swarm generated with o1-mini ☆10 · Apr 27, 2025 · Updated 10 months ago
- Python client for the 摸鱼派 chat room | 🥷 multiple accounts | 💗 password-free login | ❤️ rich shortcut commands | 🌈 custom font colors | 🧧 red-packet script | config export ☆10 · May 7, 2025 · Updated 9 months ago
- Radix Primitives Cheatsheet ☆12 · Mar 11, 2022 · Updated 3 years ago
- Using the new open-sourced GPT Neo and horoscopes scraped and cleaned from various sources, fine-tune the model to generate realistic hor… ☆14 · Feb 8, 2022 · Updated 4 years ago
- A Cryptocurrency Dashboard built with Vue JS, PWA enabled, Binance Websocket API for realtime price, amChart for displaying historical ch… ☆11 · Jan 5, 2023 · Updated 3 years ago
- Auto-generated sphinx version of the IPython website. Since this is an auto-generated directory, do *not* submit pull requests against th… ☆11 · Jan 3, 2026 · Updated 2 months ago
- ☆11 · Feb 25, 2025 · Updated last year
- LegalEaseAI simplifies legal topics with a document analyzer, legal counsel chatbot, and lawyer fee estimator. Powered by large language … ☆11 · Nov 4, 2023 · Updated 2 years ago
- An open source in-memory Graph Database for Social Networks ☆10 · Sep 20, 2022 · Updated 3 years ago
- A personal version of automatingosint that lets you scan onion sites automatically with onionscan ☆11 · Aug 20, 2020 · Updated 5 years ago
- ☆14 · Oct 8, 2025 · Updated 4 months ago
- Main Panax Documentation ☆11 · Feb 12, 2016 · Updated 10 years ago
- A simple sniffer for NATS, the cloud native messaging system. https://nats.io ☆11 · Feb 15, 2016 · Updated 10 years ago
- Augmented Dickey-Fuller implementation in Go ☆12 · Mar 15, 2019 · Updated 6 years ago