A Preprocessing and Analyzing Method of Images in PDF Documents for Mathematical Expression Retrieval

Indonesian Journal of Electrical Engineering and Computer Science

Abstract

PDF documents are an important information resource for mathematical expression retrieval systems. Image objects, a major component of PDF documents, must first be converted to coded form by character recognition and document analysis before content-based searching is possible. The quality of these images is therefore the key factor determining the correctness of the conversion. Considering the characteristics of PDF images and mathematical expressions, this paper proposes a preprocessing and analysis method comprising modules for PDF image extraction, graying, binarization, denoising, skew correction, and layout parameter detection. The features of mathematical expressions are taken into account to avoid information loss during image conversion and to keep formulas from interfering with the analysis and correction steps. Experimental results show that the method is effective in improving the accuracy and efficiency of document image recognition, analysis, and retrieval.

DOI: http://dx.doi.org/10.11591/telkomnika.v12i6.5440
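To make the graying and binarization stages concrete, the following is a minimal sketch in plain Python. The abstract names the stages but not the algorithms behind them, so the choices here are assumptions made for illustration: BT.601 luminance weights for graying and Otsu's global threshold for binarization. The function names `to_gray`, `otsu_threshold`, and `binarize` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only. The algorithms below (BT.601 luminance weights,
# Otsu's global threshold) are assumptions, not the paper's stated method.

def to_gray(rgb_rows):
    """Convert rows of (R, G, B) tuples to 8-bit gray levels."""
    return [[min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b)))
             for r, g, b in row]
            for row in rgb_rows]

def otsu_threshold(gray_rows):
    """Return the gray level t that maximizes the between-class variance
    of the two classes {v <= t} and {v > t}."""
    hist = [0] * 256
    for row in gray_rows:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]              # pixel count of the dark class
        sum0 += t * hist[t]        # intensity sum of the dark class
        w1 = total - w0            # pixel count of the bright class
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2   # proportional to between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray_rows, threshold):
    """Map dark pixels (ink) to 1 and bright pixels (background) to 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray_rows]

if __name__ == "__main__":
    black, white = (0, 0, 0), (255, 255, 255)
    page = [[black, white], [white, black]]
    gray = to_gray(page)
    print(binarize(gray, otsu_threshold(gray)))
```

A single global threshold is the simplest choice. Thin strokes such as fraction bars, subscripts, and superscripts are exactly the kind of formula detail the abstract warns about losing, so a practical system might prefer a locally adaptive threshold instead.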
