PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them
☆79 · Apr 27, 2025 · Updated 10 months ago
Alternatives and similar repositories for spark-pdf
Users that are interested in spark-pdf are comparing it to the libraries listed below
- ScaleDP is an Open-Source extension of Apache Spark for Document Processing ☆17 · Dec 2, 2025 · Updated 3 months ago
- The Lightning Catalog is an open-source data catalog designed for preparing data at any scale in ad-hoc analytics, data virtualization, … ☆37 · Feb 5, 2026 · Updated last month
- ☆20 · Jan 31, 2026 · Updated last month
- Tool for visualizing Apache Oozie pipelines ☆12 · Feb 15, 2016 · Updated 10 years ago
- Minutely clientside OpenStreetMap changeset streams ☆19 · Apr 15, 2023 · Updated 2 years ago
- Notebook Discovery Tool for Databricks notebooks ☆19 · Jul 14, 2022 · Updated 3 years ago
- Magic to help Spark pipelines upgrade ☆34 · Sep 29, 2024 · Updated last year
- Lahinch surf predictions with Hopsworks ☆15 · May 21, 2025 · Updated 9 months ago
- A Spark connector for the Azure Common Data Model ☆15 · May 31, 2023 · Updated 2 years ago
- Delta Lake helper methods in PySpark ☆327 · Jan 19, 2026 · Updated last month
- Notebooks for querying Fabric APIs and storing data in Fabric Lakehouses ☆25 · May 20, 2024 · Updated last year
- A Chrome extension that takes an image and turns it into a CSV ☆45 · Aug 31, 2025 · Updated 6 months ago
- Collection of NiFi-related stuff ☆24 · Oct 27, 2022 · Updated 3 years ago
- Custom PySpark Connectors ☆89 · Updated this week
- Tools for Microsoft Fabric ☆25 · Jul 17, 2025 · Updated 7 months ago
- SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine. ☆26 · Feb 22, 2025 · Updated last year
- Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary! ☆235 · Jan 24, 2025 · Updated last year
- Civilian Topographic Map (CTM) product ☆16 · Feb 28, 2025 · Updated last year
- ☆28 · Oct 14, 2024 · Updated last year
- Example of a Microsoft Fabric solution ☆32 · Dec 28, 2025 · Updated 2 months ago
- ☆10 · Jul 1, 2022 · Updated 3 years ago
- Apache Polaris Tools, additional tooling for Apache Polaris ☆25 · Updated this week
- Implementation of core-expansion algorithm ☆11 · Jan 26, 2026 · Updated last month
- ☆14 · Nov 10, 2025 · Updated 3 months ago
- Docker Compose files for running a NiFi cluster on Docker. ☆35 · May 29, 2017 · Updated 8 years ago
- Tools for MLflow ☆41 · Jan 31, 2024 · Updated 2 years ago
- Stock-keeping-oriented Prediction Error Costs (SPEC) ☆12 · Jul 3, 2020 · Updated 5 years ago
- CONFSEC's ComputeNode component of the OpenPCC standard ☆17 · Dec 15, 2025 · Updated 2 months ago
- An SBT Plugin that acts as a light wrapper around Buf. ☆10 · Oct 29, 2024 · Updated last year
- Analyzing the most strategic words to guess on Wordle, based on letter frequency distributions ☆11 · Feb 20, 2022 · Updated 4 years ago
- End-to-end proof of concept showing core MLOps practices to develop, deploy and monitor a machine learning model for an employee retentio… ☆15 · May 28, 2024 · Updated last year
- Sample scripts to use with Agentic Document Extraction (ADE). ☆34 · Updated this week
- CLI app / pre-commit hook to clean Jupyter Notebooks metadata, execution_count and optionally output. ☆11 · Mar 3, 2025 · Updated last year
- Reproducible Research in Finse ☆10 · Aug 5, 2020 · Updated 5 years ago
- Everything which has to do with Data Integration. Templates for Azure Data Factory and Azure Synapse Analytics ☆10 · Jan 29, 2022 · Updated 4 years ago
- Android application to connect to DLNA servers, with full English documentation ☆11 · Jan 13, 2017 · Updated 9 years ago
- Zabbix Template (>2.4) and resources useful to monitor zfs on linux (zpool) ☆13 · Jan 26, 2017 · Updated 9 years ago
- A Spark datasource for the HadoopOffice library ☆36 · Sep 29, 2025 · Updated 5 months ago
- A Python CLI application that demonstrates how you can access AWS services, such as Amazon S3 and Amazon Athena, using trusted identity p… ☆12 · Mar 11, 2025 · Updated 11 months ago