This article is part of our “Data Science Digest” series. With this series, we will help you keep up with the developments in data science, show you the potential of data science techniques, and give you a sneak peek into some of the exciting things we’ve been working on with the Data & AI team at Eraneos. In this article, we will talk about Optical Character Recognition (OCR) and how it can be used to turn scanned PDF documents into usable and searchable data with the help of Google Tesseract.
OCR, or optical character recognition, is the technology that recognizes text and characters in images (both handwritten and scanned) and converts them into machine-readable text. The technique has been in practical use for years and has various use cases, such as license plate recognition, reading passports and IDs, marketing campaigns (scanning voucher codes, for example), payments by scanning IBANs, etc.
In this article, we will focus on another application of OCR technology: making scanned PDF documents searchable and integrating them into business processes. Although it might not sound revolutionary compared to some of the other data science use cases we will be talking about in this series, on a bigger scale it can prove to be a game changer for many organizations. Think of companies that make extensive use of documentation, or governmental agencies: basically anywhere that documents still have to be scanned and processed by hand. A solution such as this can save a great deal of time and resources (FTEs) that can then be used for other tasks. So, how’s it done?
Everything starts with assessing the data: loading the files that contain information and using standard tooling to find out whether there is usable data that can be extracted. Tesseract, an open-source OCR engine whose development was long sponsored by Google, is a widely used tool that handles this task well. After a sample batch of data is assessed, more advanced techniques convert the PDFs into images, apply image processing techniques, and subsequently convert the images to text so that the parts that are actually needed can be filtered out (e.g., you can filter by names, addresses, serial numbers, landmarks, invoices, etc.).
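To make these steps concrete, here is a minimal sketch in Python, not the exact pipeline we run. It assumes the `pdf2image`, `Pillow`, and `pytesseract` packages (plus the Poppler and Tesseract binaries) are installed. The function names, the `INV-` invoice format, and the simplified IBAN pattern are hypothetical choices made for illustration.

```python
import re


def pdf_to_text(pdf_path, dpi=300, lang="eng"):
    """OCR every page of a scanned PDF and return one string per page."""
    # Imported lazily so the filtering helper below works without them.
    from pdf2image import convert_from_path  # requires Poppler
    from PIL import ImageOps
    import pytesseract  # requires the Tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    texts = []
    for page in pages:
        # Basic image processing: grayscale, then stretch the contrast.
        cleaned = ImageOps.autocontrast(ImageOps.grayscale(page))
        texts.append(pytesseract.image_to_string(cleaned, lang=lang))
    return texts


# Hypothetical patterns; real documents need patterns tuned to their layout.
PATTERNS = {
    "invoice_number": re.compile(r"\bINV-\d{4,}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),  # simplified
}


def extract_fields(text, patterns=PATTERNS):
    """Return every match of each pattern found in the OCR text."""
    return {name: pattern.findall(text) for name, pattern in patterns.items()}
```

Calling `extract_fields` on each string returned by `pdf_to_text("scan.pdf")` (a placeholder file name) would give, per page, the fields that matched. In practice the image processing step usually needs more than grayscale and contrast, depending on the quality of the scans.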
A solution like this combines both data science and big data engineering techniques. With the former, you extract the data and create a model (and pipeline) that can continue doing this on its own, while on the engineering side a “product” is created that stores the information, along with a search engine that retrieves the data matching the queries. The model can also have a certain accuracy threshold, based on the main goal but also on the quality of the data. Every new document added to the system is converted automatically and, if it doesn’t reach that threshold, is sent back to be verified by a person. The self-learning part, in turn, is possible thanks to more modern techniques and tooling, such as neural networks. This allows the model to improve on its own, but also to apply other neat tricks such as recognizing misspelled words (and fixing them!), dealing with rotated images, and correcting for different fonts.
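The threshold check could look roughly like the sketch below. It assumes the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, which holds parallel `text` and `conf` lists, and it uses Tesseract's own word confidence as a stand-in for accuracy, which is only a proxy. The function names and the default threshold of 80 are hypothetical.

```python
def ocr_with_confidence(image, lang="eng"):
    """Run Tesseract and return per-word text and confidence values."""
    import pytesseract  # imported lazily; requires the Tesseract binary
    from pytesseract import Output

    return pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)


def mean_confidence(ocr_data):
    """Average confidence (0-100) over the recognized words."""
    scores = []
    for text, conf in zip(ocr_data["text"], ocr_data["conf"]):
        conf = float(conf)  # some versions return strings
        # A confidence of -1 marks layout boxes that contain no word.
        if conf >= 0 and text.strip():
            scores.append(conf)
    return sum(scores) / len(scores) if scores else 0.0


def route_document(ocr_data, threshold=80.0):
    """Accept confident results; send the rest to a person for review."""
    if mean_confidence(ocr_data) >= threshold:
        return "accept"
    return "manual_review"
```

A document with no recognized words gets a confidence of 0 and therefore always goes to manual review. The right threshold depends on how costly a wrong extraction is for the use case at hand.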
Converting a PDF to searchable text can also open the door to natural language processing techniques, such as sentiment analysis of the texts and named entity recognition to extract things like brand names and locations.
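As a rough illustration of the entity extraction idea, the sketch below looks up known names in the OCR text. This is a deliberately simple dictionary lookup and not a trained named entity recognition model, which would also find names it has never seen. The entity lists and the function name are hypothetical.

```python
import re

# Hypothetical lookup lists; a real system would use a trained NER model.
GAZETTEER = {
    "BRAND": ["Eraneos", "Google"],
    "LOCATION": ["Amsterdam", "New York"],
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, names in gazetteer.items():
        for name in names:
            # Whole-word, case-insensitive match, since OCR casing varies.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), label))
    return [(surface, label) for _, surface, label in sorted(found)]
```

The surface form is returned exactly as it appears in the text, so OCR output with inconsistent casing is still matched and can be normalized in a later step.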