AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆9 · Updated 4 months ago
Related projects:
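The four stages named in the description can be pictured as a chain of functions, each consuming the previous stage's output. The sketch below is only an illustration in plain Python, not Setu's actual API and not Spark code: every function name, the `min_words` threshold, and the exact-match hashing used for deduplication are assumptions made for this example.

```python
import hashlib
import re


def prepare(raw_docs):
    """Stage 1 (document preparation): wrap raw strings in records."""
    return [{"id": i, "text": t} for i, t in enumerate(raw_docs)]


def clean_and_analyze(docs):
    """Stage 2 (cleaning and analysis): normalize whitespace, count words."""
    out = []
    for d in docs:
        text = re.sub(r"\s+", " ", d["text"]).strip()
        out.append({**d, "text": text, "n_words": len(text.split())})
    return out


def flag_and_filter(docs, min_words=3):
    """Stage 3 (flagging and filtering): drop documents below a
    hypothetical minimum word count."""
    return [d for d in docs if d["n_words"] >= min_words]


def deduplicate(docs):
    """Stage 4 (deduplication): keep the first document for each
    case-insensitive exact text match (an assumed, simplified criterion)."""
    seen = set()
    unique = []
    for d in docs:
        digest = hashlib.sha256(d["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def run_pipeline(raw_docs):
    """Run the four stages in the order the description lists them."""
    return deduplicate(flag_and_filter(clean_and_analyze(prepare(raw_docs))))


raw = [
    "Hello   world, this is Setu.",
    "hello world, this is setu.",
    "too short",
    "  Another   clean document here. ",
]
result = run_pipeline(raw)
for doc in result:
    print(doc["id"], doc["text"])
```

A real Spark-based pipeline would express each stage as a distributed transformation rather than a Python loop; the point here is only the ordering of the stages.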
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆88 · Updated last month
- Solving data for LLMs - Create quality synthetic datasets! ☆32 · Updated this week
- Repository for fine-tuning Gemma models using Unsloth for Indic languages ☆80 · Updated 6 months ago
- This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resultin… ☆23 · Updated 8 months ago
- Code repository for "Introducing Airavata: Hindi Instruction-tuned LLM" ☆52 · Updated last month
- Code related to training/fine-tuning Hindi/Hinglish models. ☆47 · Updated 8 months ago
- Fun project: LLM-powered RAG Discord Bot that works seamlessly on CPU ☆30 · Updated 10 months ago
- Supervised instruction finetuning for LLM with HF trainer and DeepSpeed ☆32 · Updated last year
- ☆59 · Updated last week
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks, aiding in performance asse… ☆31 · Updated 3 months ago
- End-to-End LLM Guide ☆91 · Updated 2 months ago
- End-to-End Local-First Text-to-SQL Pipelines ☆59 · Updated this week
- Data extraction with LLM on CPU ☆62 · Updated 10 months ago
- A simple, consistent and extendable toolkit for IndicTrans2 ☆16 · Updated 3 weeks ago
- Lightweight wrapper for the independent implementation of SPLADE++ models for search & retrieval pipelines. Models and Library created by… ☆27 · Updated 3 weeks ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 · Updated 2 months ago
- Mistral + Haystack: build RAG pipelines that rock 🤘 ☆99 · Updated 7 months ago
- Question answering on codebase ☆22 · Updated 3 months ago
- StructuredRAG Benchmarker ☆85 · Updated last week
- An LLM cookbook, for building your own from scratch, all the way from gathering data to training a model ☆120 · Updated 2 months ago
- Shoonya - Platform to annotate and label data at scale. ☆48 · Updated 2 weeks ago
- Material for the series of seminars on Large Language Models ☆24 · Updated 4 months ago
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers. ☆31 · Updated 2 weeks ago
- Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) ☆72 · Updated last week
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆58 · Updated 2 weeks ago
- zero-to-lightning ☆27 · Updated 4 months ago
- Text to Python Objects via an LLM Function Call ☆55 · Updated 5 months ago
- Large Language Model (LLM) Inference API and Chatbot ☆123 · Updated 5 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀 ☆45 · Updated 3 months ago
- A tiny vectorstore implementation built with numpy. ☆50 · Updated 4 months ago