AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
★15 · Updated last year
Alternatives and similar repositories for setu
Users who are interested in setu are comparing it to the repositories listed below.
- Data extraction with LLM on CPU ★68 · Updated 2 years ago
- Unstructured Data Connectors for Haystack 2.0 ★17 · Updated 2 years ago
- Writing Blog Posts with Generative Feedback Loops! ★50 · Updated last year
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ★69 · Updated last month
- Recipes and resources for building, deploying, and fine-tuning generative AI with Fireworks. ★131 · Updated 3 weeks ago
- Create a music review RAG application with Neo4j ★22 · Updated last year
- A personal knowledge base that I can dump information to and help me learn ★24 · Updated 7 months ago
- Simple customizable evaluation for text retrieval performance of Sentence Transformers embedders on PDFs ★30 · Updated 11 months ago
- Dataset Viber is your chill repo for data collection, annotation and vibe checks. ★45 · Updated last year
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ★38 · Updated last year
- This repo is the central repo for all the RAG Evaluation reference material and partner workshop ★78 · Updated 8 months ago
- ★21 · Updated last year
- Median is an open-source flashcard application that leverages the power of spaced repetition and artificial intelligence to transform the… ★22 · Updated last year
- ★30 · Updated last year
- ★125 · Updated 10 months ago
- Agentic RAG to help you build a startup ★55 · Updated 9 months ago
- Mistral + Haystack: build RAG pipelines that rock ★106 · Updated last year
- Solving data for LLMs - Create quality synthetic datasets! ★151 · Updated 11 months ago
- Repository for fine-tuning gemma models using unsloth for indic languages ★97 · Updated last year
- Your local personalised AI agent ★42 · Updated last year
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ★81 · Updated last year
- Medical Mixture of Experts LLM using Mergekit. ★20 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ★115 · Updated 9 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ★51 · Updated last year
- Docutron Toolkit: detection and segmentation analysis for legal data extraction over documents. ★26 · Updated 2 years ago
- The purpose of this repo is to implement LLMOps as shared in the DeepLearning.AI course ★47 · Updated 2 weeks ago
- ★53 · Updated 11 months ago
- Low latency, High Accuracy, Custom Query routers for Humans and Agents. Built by Prithivi Da ★119 · Updated 9 months ago
- ★38 · Updated last year
- Data extraction with LLM on CPU ★112 · Updated 2 years ago