AI4Bharat / setu

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
16 · Updated 11 months ago
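
The four stages named in the description can be pictured with a small, plain-Python sketch. This is not Setu's code or API: the `Doc` class, the function names, the `min_words` threshold, and the exact-hash deduplication are all hypothetical stand-ins chosen for illustration, and the sketch runs on in-memory lists rather than on Apache Spark.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical document record; not a Setu type.
    doc_id: str
    text: str
    stats: dict = field(default_factory=dict)
    flags: set = field(default_factory=set)


def prepare(raw_records):
    """Stage 1 (document preparation): wrap raw (id, text) pairs."""
    return [Doc(doc_id=i, text=t) for i, t in raw_records]


def clean_and_analyze(doc):
    """Stage 2 (cleaning and analysis): collapse whitespace, count words."""
    doc.text = re.sub(r"\s+", " ", doc.text).strip()
    doc.stats["word_count"] = len(doc.text.split())
    return doc


def flag_and_filter(docs, min_words=3):
    """Stage 3 (flagging and filtering): flag short docs, keep unflagged ones."""
    kept = []
    for doc in docs:
        if doc.stats["word_count"] < min_words:
            doc.flags.add("too_short")
        if not doc.flags:
            kept.append(doc)
    return kept


def deduplicate(docs):
    """Stage 4 (deduplication): exact match on a hash of the lowercased text."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def run_pipeline(raw_records):
    docs = [clean_and_analyze(d) for d in prepare(raw_records)]
    return deduplicate(flag_and_filter(docs))


raw = [
    ("a", "Setu  cleans   web data."),
    ("b", "setu cleans web data."),
    ("c", "too short"),
]
result = run_pipeline(raw)
print([d.doc_id for d in result])
```

In this toy run, document `b` is dropped as a case-insensitive duplicate of `a`, and `c` is filtered out for falling below the assumed word threshold. A real pipeline of this kind would typically use fuzzier deduplication and richer filters than shown here.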

Alternatives and similar repositories for setu

Users who are interested in setu are comparing it to the libraries listed below.
