Easier and Efficient Table Content Extraction from PDF/Image
Shriram KV
Bengaluru, Karnataka
This system reads table content from an image or a PDF. It does so with a small amount of code, without a complicated setup or reliance on external APIs. In our experience it retrieves table contents from a PDF or image in a very short time and with high accuracy (almost 100 percent). The software is also very lightweight. A complete working demo is available at https://youtu.be/H1k2aqCQ1u4. The complete code and guidelines are available on GitHub at https://github.com/strangest-quark/TableExtraction
Project status: Published/In Market
Intel Technologies
Other
Overview / Usage
It is important to retrieve table contents efficiently and quickly. Complex code tends to be computationally intensive and slow, and the task becomes harder still when the input file is an image. We tried this approach with simple Python code and minimal use of APIs. Have a look!
Methodology / Approach
For the PDF:
1. Loading the PDF using the pyPDF module
2. Page segmentation and data logging using the pyPDF module
3. Iterating over all PDFs and extracting tables using tabula-py
4. Output data logging and visualisation
5. Complete sample output
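The PDF steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the `pypdf` package (a maintained successor of pyPDF) and `tabula-py` (which needs a Java runtime), it iterates over pages of one file rather than over pre-split PDFs, and the `output_name` helper, its naming scheme, and the `generated_output` default folder are hypothetical.

```python
from pathlib import Path


def output_name(pdf_path, page, index):
    """Build a CSV file name such as report_p2_t1.csv (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_p{page}_t{index}.csv"


def extract_tables(pdf_path, out_dir="generated_output"):
    """Extract every table on every page of a PDF and save each one as CSV."""
    # Third-party imports are local so the helper above works without them.
    from pypdf import PdfReader  # assumption: pypdf stands in for pyPDF
    import tabula                # tabula-py

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Steps 1-2: load the PDF and count its pages.
    n_pages = len(PdfReader(pdf_path).pages)

    written = []
    for page in range(1, n_pages + 1):  # tabula page numbers are 1-based
        # Step 3: read all tables on this page as a list of DataFrames.
        tables = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)
        for index, table in enumerate(tables, start=1):
            # Step 4: log each table to its own CSV file.
            target = out / output_name(pdf_path, page, index)
            table.to_csv(target, index=False)
            written.append(target)
    return written
```

Calling `extract_tables("report.pdf")` would return the list of CSV paths written, one per detected table.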
For the image:
- Generate a searchable PDF from the image using OCR
- Generate XML from the searchable PDF
- Cluster the lines in the image and generate a CSV
- Output images and CSV files are written to the Image/generated_output folder
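The image steps above can be sketched in a similar way. The code below is only an illustration under stated assumptions: it uses `pytesseract` and Pillow for the OCR step, skips the XML step because the tool used for it is not named here, and assumes word positions are already available as `(x, y, text)` tuples. The `cluster_rows` and `rows_to_csv` helpers and the 5-pixel tolerance are hypothetical, and treating each word as one cell is a simplification.

```python
import csv
import io


def ocr_to_searchable_pdf(image_path, pdf_path):
    """Write a searchable PDF for an image using Tesseract via pytesseract."""
    # Imported here so the pure helpers below work without these packages.
    import pytesseract
    from PIL import Image

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(
        Image.open(image_path), extension="pdf"
    )
    with open(pdf_path, "wb") as fh:
        fh.write(pdf_bytes)


def cluster_rows(words, tolerance=5):
    """Group (x, y, text) words into rows by vertical position.

    A word joins the current row when its y is within `tolerance` pixels
    of the first word placed in that row; otherwise it starts a new row.
    """
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order the cells of each row from left to right.
    return [
        [text for _, text in sorted(row["cells"], key=lambda c: c[0])]
        for row in rows
    ]


def rows_to_csv(rows):
    """Serialise clustered rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

For example, `rows_to_csv(cluster_rows(words))` turns a list of positioned words into CSV text that can then be saved to the output folder.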
Technologies Used
Python, Tesseract, AI
Repository
https://github.com/strangest-quark/TableExtraction