Easier and Efficient Table Content Extraction from PDF/Image

Shriram KV

Bengaluru, Karnataka

We built a system that reads table content from an image or a PDF. It works entirely in software, without relying on external APIs or complicated tooling. The system retrieves the table contents from a PDF or image in a very short time and with high accuracy (close to 100 percent). The software is also very lightweight. A complete working demo is available at https://youtu.be/H1k2aqCQ1u4 and the complete code and guidelines are on GitHub at https://github.com/strangest-quark/TableExtraction

Project status: Published/In Market

Artificial Intelligence

Intel Technologies
Other

Overview / Usage

Retrieving the contents of a table quickly and efficiently is important. Complex code makes the task computationally intensive and slow, and the problem gets harder when the input file is an image. We tried an approach that uses simple Python code and relies on APIs as little as possible. Have a look!

Methodology / Approach

For the PDF:

  1. Load the PDF using the pyPDF module
  2. Segment the PDF into pages and log the data using the pyPDF module
  3. Iterate over all the PDFs and extract the tables using tabula-py
  4. Log and visualise the output data
  5. Produce the complete sample output
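
The PDF steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: it assumes the PyPDF2 package (PdfReader, PdfWriter) stands in for the "pyPDF module", that tabula-py and a Java runtime are installed, and that one CSV per table is an acceptable form of output logging. The helper names (page_file_name, split_pdf, extract_tables) and the output naming scheme are hypothetical.

```python
from pathlib import Path


def page_file_name(pdf_path, page_number):
    """Build a per-page file name, e.g. report.pdf -> report_page_3.pdf."""
    p = Path(pdf_path)
    return f"{p.stem}_page_{page_number}{p.suffix}"


def split_pdf(pdf_path, out_dir):
    """Steps 1-2: load the PDF and write each page to its own file."""
    # Assumed library: PyPDF2, imported lazily so the helper above works without it.
    from PyPDF2 import PdfReader, PdfWriter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(str(pdf_path))
    page_paths = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out_dir / page_file_name(pdf_path, number)
        with open(target, "wb") as handle:
            writer.write(handle)
        # Simple progress log for each segmented page.
        print(f"page {number}/{len(reader.pages)} -> {target}")
        page_paths.append(target)
    return page_paths


def extract_tables(page_paths, out_dir):
    """Steps 3-4: run tabula-py on every page file and save each table as CSV."""
    import tabula  # tabula-py; requires a Java runtime

    out_dir = Path(out_dir)
    csv_paths = []
    for page_path in page_paths:
        # read_pdf returns a list of pandas DataFrames, one per detected table.
        tables = tabula.read_pdf(str(page_path), pages="all", multiple_tables=True)
        for index, table in enumerate(tables, start=1):
            target = out_dir / f"{Path(page_path).stem}_table_{index}.csv"
            table.to_csv(target, index=False)
            csv_paths.append(target)
    return csv_paths
```

With a hypothetical input file, the two stages could be chained as extract_tables(split_pdf("input.pdf", "out"), "out").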

For the image:

  1. Generate Searchable PDF from image using OCR
  2. Generate XML from Searchable PDF
  3. Cluster lines in Image and generate CSV
  4. Write the generated output images and CSV files to the Image/generated_output folder
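
As an illustration of steps 1 and 3 for images, here is a small sketch. It rests on several assumptions: pytesseract is used as the Python wrapper around tesseract for the OCR step, the words recovered from the XML are already available as (text, x, y) tuples, and rows are formed by grouping words whose y coordinates lie within a fixed tolerance. The function names, the tuple layout, and the tolerance value are hypothetical, and the clustering in the repository may work differently.

```python
import csv
import io


def image_to_searchable_pdf(image_path, pdf_path):
    """Step 1: OCR an image into a searchable PDF (assumes pytesseract is installed)."""
    import pytesseract  # imported lazily; also needs the tesseract binary

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")
    with open(pdf_path, "wb") as handle:
        handle.write(pdf_bytes)


def cluster_rows(words, tolerance=5):
    """Step 3: group (text, x, y) words into table rows by vertical position.

    A word joins the current row when its y value is within `tolerance`
    of the first word of that row; otherwise it starts a new row.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda word: word[2]):
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order the cells of each row from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_csv(rows):
    """Serialise the clustered rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical word positions, standing in for values parsed from the XML.
sample_words = [("Qty", 120, 11), ("Item", 10, 10), ("2", 120, 41), ("Pen", 10, 40)]
sample_csv = rows_to_csv(cluster_rows(sample_words))
```

A fixed tolerance keeps the sketch short; slightly skewed scans would likely need a tolerance derived from the text height instead.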

Technologies Used

Python, tesseract, AI

Repository

https://youtu.be/H1k2aqCQ1u4
