Wittline / wbz
A parallel implementation of the bzip2 data compressor in Python. The compression pipeline uses algorithms such as the Burrows–Wheeler transform (BWT) and move-to-front (MTF) to improve Huffman compression. For now, the tool focuses only on compressing .csv files and other files in tabular format.
☆13Updated 2 years ago
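To illustrate the two preprocessing stages named in the description, here is a minimal sketch of BWT followed by MTF in Python. It is not taken from the wbz code base: the function names, the `\x00` sentinel, and the naive rotation-sorting approach are all assumptions chosen for readability, and the sketch is neither parallel nor efficient.

```python
from __future__ import annotations

SENTINEL = "\x00"  # assumed end-of-text marker; must not occur in the input


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if SENTINEL in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last_column: str) -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(SENTINEL))
    return row[:-1]


def mtf_encode(data: str, alphabet: list[str]) -> list[int]:
    """Move-to-front: emit each symbol's current index, then move it to the front."""
    symbols = list(alphabet)  # copy so the caller's list is not mutated
    codes = []
    for ch in data:
        idx = symbols.index(ch)
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return codes


def mtf_decode(codes: list[int], alphabet: list[str]) -> str:
    """Reverse move-to-front using the same starting alphabet."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt("banana")
    alphabet = sorted(set(transformed))
    codes = mtf_encode(transformed, alphabet)
    # Round trip back to the original text.
    assert inverse_bwt(mtf_decode(codes, alphabet)) == "banana"
```

BWT tends to group identical symbols into runs, and MTF turns those runs into small integers (mostly zeros), which gives a skewed distribution that an entropy coder such as Huffman can encode in fewer bits.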
Alternatives and similar repositories for wbz:
Users who are interested in wbz are comparing it to the libraries listed below.
- Genomic BigData Warehousing with Apache Spark and LakeHouse Architecture☆11Updated 2 years ago
- Demo on how to use Prefect with Docker☆25Updated 2 years ago
- A Probabilistic Programming Language in 70 lines of Python. Code for the blog post https://mrandri19.github.io/2022/01/12/a-PPL-in-70-lin…☆17Updated 3 years ago
- Demo of DuckDB Spark API implements. Same Pyspark code, but DuckDB under the hood☆13Updated last year
- ☆11Updated 2 years ago
- ☆20Updated 2 years ago
- Demo of Hydra☆18Updated 3 years ago
- ☆21Updated 2 years ago
- A repo of Flyte-related conference talks☆14Updated last year
- Python implementation of Age-Partitioned Bloom Filter with S3 periodic backup support.☆11Updated 2 months ago
- PyCon Talks 2022 by Antoine Toubhans☆23Updated 2 years ago
- A Python library for reading and manipulating genetic data.☆22Updated 5 months ago
- A library to create lore plots (logistic regression of the prevalence of a categorical variable in function of a continuous feature)☆16Updated 3 weeks ago
- A tutorial that helps Big Data Engineers ramp up faster by getting familiar with PySpark dataframes and functions. It also covers topics …☆20Updated 3 years ago
- DuckDB Extension for working with bioinformatic data.☆15Updated last year
- ☆11Updated 3 years ago
- SparkBLAST is a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computat…☆9Updated 7 years ago
- Operations Research Algorithms☆17Updated last year
- Palantir Python SDK☆38Updated 3 months ago
- Workshop about DVC VSCode Extension☆14Updated 6 months ago
- Pandas ExtensionDtypes for dealing with genomics data☆47Updated 4 months ago
- Distance computations with Dask (akin to scipy.spatial.distance)☆8Updated 7 years ago
- Learn Kubeflow with Arrikto☆15Updated 3 years ago
- 🚕 Self-contained demo using Redpanda, Materialize, River, Redis, and Streamlit to predict taxi trip durations☆47Updated 2 years ago
- CPU and GPU deterministic and therefore fully reproducible machine learning pipelines using MLflow.☆45Updated 2 years ago
- Implementation of LSTM for detecting regions of Neanderthal introgression in modern human genomes☆9Updated 5 years ago
- JumpSpark - A modern cookiecutter template for pyspark projects with batteries included.☆10Updated last year
- Demo repository to lambda-fy your dbt runs☆11Updated last year
- GitHub Action for CML setup☆26Updated 10 months ago
- Exon is an OLAP query engine specifically for biology and life science applications.☆59Updated last week