StatCan / SLICEmyPDF
This project uses the SLICE algorithm to extract information from a text-based PDF page containing financial statements (tabular data). It can also be used to extract regular tables, but the output will then contain all text on the page.
★64 Updated 3 years ago
Alternatives and similar repositories for SLICEmyPDF:
Users who are interested in SLICEmyPDF are comparing it to the libraries listed below:
- An open-source XBRL processor for business rules, rendering and custom data reporting. See https://xbrl.us/xule for documentation and htt… ★30 Updated 2 weeks ago
- Fuzzy Name Matching with Machine Learning ★264 Updated 10 months ago
- test ★23 Updated 4 years ago
- Record linking package that fuzzy matches two Python pandas dataframes using sqlite3 fts4 ★283 Updated 2 years ago
- Group thousands of similar spreadsheet or database text entries in seconds ★155 Updated last year
- Custom recipe and utilities for document processing ★199 Updated 2 years ago
- demo using FuzzyWuzzy matching company names ★75 Updated 3 years ago
- Using ML to extract campaign finance data from messy forms for journalism ★76 Updated 2 years ago
- Simplifies use of the Dedupe library via Pandas ★136 Updated 2 years ago
- A Python library for reading XBRL reports ★32 Updated 3 weeks ago
- Scripts and results from our OCR roundup, available on Source ★150 Updated 6 years ago
- Google Colab Demo of CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents ★46 Updated 3 years ago
- Application and python script to identify, remove, and/or recode personally identifiable information (PII) from field experiment datasets… ★45 Updated 3 years ago
- sidetable builds simple but useful summary tables of your data ★389 Updated 2 years ago
- Super Fast String Matching in Python ★367 Updated last month
- Public runnable examples of using John Snow Labs' OCR for Apache Spark. ★90 Updated this week
- Python API for PDF documents ★121 Updated 8 months ago
- Abydos NLP/IR library for Python ★185 Updated 2 years ago
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ★68 Updated last month
- Fuzzy matching for companies' names ★9 Updated 5 years ago
- Python tools for Tesseract OCR training ★25 Updated 3 years ago
- Turn images of tables into CSV data. Detect tables from images and run OCR on the cells. ★521 Updated 4 years ago
- A Flexible Deep Learning Approach to Fuzzy String Matching ★145 Updated 6 months ago
- Examples for using the dedupe library ★411 Updated 9 months ago
- Package that returns a company embedding given a company name ★45 Updated 4 years ago
- Parallel and distributed training with spaCy and Ray ★54 Updated last year
- A comprehensive and scalable set of string tokenizers and similarity measures in Python ★138 Updated 9 months ago
- Parsing pdf tables using YOLOV3 ★116 Updated 4 years ago
- SimFin's open source PDF crawler ★125 Updated 5 years ago
- Logical structure analysis for visually structured documents ★89 Updated 2 years ago