Convert PDFs of tables to Excel and CSV tables.
OpenCV (Python or Java) / Tesseract OCR v4 / .NET / any other language
Want a GUI or command-line batch processing.
A set of PDF files (in Indian regional languages) will be provided as input. It is important not to optimize the solution for these specific tables. The solution must be generic and will be tested against other PDF files.
It is a priority to handle regular tables with high precision.
1. Analyze the PDF using OpenCV or any other technology to determine the table cells (rows and columns).
2. Slice the input image into multiple images, one per cell.
3. Use Tesseract 4 to OCR the text from each cell.
4. Output the data to CSV / Excel, or as shown in the attached file below.
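The steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than a required design: it assumes the page has already been rasterized and binarized into a grid of 0/1 pixels (1 = ink) and that the table has full ruling lines, and it finds them with row/column projection profiles instead of OpenCV. The names `find_line_runs`, `detect_cells`, `slice_cell`, `table_to_csv` and the `fill_ratio` parameter are hypothetical. The `ocr` argument is any callable that turns a cell image into text; in a real solution it would presumably wrap Tesseract 4 (for example through `pytesseract.image_to_string`).

```python
import csv
import io


def find_line_runs(profile, threshold):
    """Return (start, end) index runs where the profile reaches the threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def detect_cells(binary, fill_ratio=0.8):
    """Step 1: locate cells as the gaps between ruling lines.

    binary is a list of rows of 0/1 pixels. A row (or column) counts as a
    ruling line when at least fill_ratio of its pixels are ink (assumption).
    Returns a list of table rows, each a list of (top, bottom, left, right).
    """
    height, width = len(binary), len(binary[0])
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(binary[y][x] for y in range(height)) for x in range(width)]
    h_lines = find_line_runs(row_profile, fill_ratio * width)
    v_lines = find_line_runs(col_profile, fill_ratio * height)
    cells = []
    for (_, top), (bottom, _) in zip(h_lines, h_lines[1:]):
        cells.append([(top, bottom, left, right)
                      for (_, left), (right, _) in zip(v_lines, v_lines[1:])])
    return cells


def slice_cell(binary, box):
    """Step 2: cut one cell out of the page image."""
    top, bottom, left, right = box
    return [row[left:right] for row in binary[top:bottom]]


def table_to_csv(binary, ocr):
    """Steps 3 and 4: OCR every cell and emit the table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in detect_cells(binary):
        writer.writerow([ocr(slice_cell(binary, box)).strip() for box in row])
    return buffer.getvalue()
```

Projection profiles only work for regular, fully ruled tables that are not skewed; borderless or rotated tables would need deskewing and a different cell detector (for instance morphological line extraction in OpenCV).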
- Conversion is at least 95% accurate on our test set. The test set consists of standard tables but is not provided, to avoid overfitting.
- A function / script / API that takes a PDF and outputs Excel, formatted and unformatted.
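The brief does not say how the 95% figure is measured. One plausible reading, shown here purely as an assumption, is the share of expected cells whose text is reproduced exactly (ignoring surrounding whitespace) at the same row and column. The names `read_cells` and `cell_accuracy` are hypothetical.

```python
import csv
import io


def read_cells(csv_text):
    """Parse CSV text into a list of rows, each a list of cell strings."""
    return list(csv.reader(io.StringIO(csv_text)))


def cell_accuracy(expected_csv, actual_csv):
    """Fraction of expected cells matched exactly at the same position.

    Cells missing from the actual output count as wrong; extra cells in the
    actual output are ignored (an assumption of this sketch).
    """
    expected = read_cells(expected_csv)
    actual = read_cells(actual_csv)
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("expected table has no cells")
    correct = 0
    for r, row in enumerate(expected):
        for c, value in enumerate(row):
            found = r < len(actual) and c < len(actual[r])
            if found and actual[r][c].strip() == value.strip():
                correct += 1
    return correct / total
```

For example, `cell_accuracy("a,b\nc,d\n", "a,b\nc,x\n")` gives 0.75, which would fall short of the 95% target. A character-level metric such as edit distance would be more forgiving of single-glyph OCR errors in regional-language text.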
Readings / Links:
Finding text blocks in an image using OpenCV:
Table analysis using a histogram:
Docker OpenCV Image: