This week, we will discuss another important topic that many of us have had in mind: extracting text from scanned, formatted forms. This use case is instrumental when we need to process information prefilled by a person or a process.
To make things easier, I’ve packaged my entire solution & published it as a PyPI package after a long time. But, even before I start, why don’t we see the demo & then discuss it in detail?

Architecture:
Let us understand the architecture flow –

From the above diagram, one can understand the overall flow of this process. We’ll be using our second PyPI package, which scans the scanned copy of a formatted page & then tries to extract the relevant information.
Python Packages:
Apart from the dependent packages that we created, the following are the key Python packages that we need:
cmake==3.22.1
dlib==19.19.0
imutils==0.5.3
jsonschema==4.4.0
numpy==1.23.2
oauthlib==3.1.1
opencv-contrib-python==4.6.0.66
opencv-contrib-python-headless==4.4.0.46
opencv-python==4.6.0.66
opencv-python-headless==4.5.5.62
pandas==1.4.3
python-dateutil==2.8.2
pytesseract==0.3.10
requests==2.27.1
requests-oauthlib==1.3.0
And the newly created package –
ReadingFilledForm==0.0.7
To know more about this, please visit the following PyPi link.
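If you want to confirm that your environment matches the pinned versions above, a small check like the following can help. This is only a sketch: the helper name find_version_mismatches is hypothetical and not part of the ReadingFilledForm package; it relies on the standard library's importlib.metadata.

```python
from importlib import metadata


def find_version_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    Hypothetical helper for illustration; not part of ReadingFilledForm.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # A few of the pins listed in this post.
    pins = {
        "ReadingFilledForm": "0.0.7",
        "pytesseract": "0.3.10",
        "numpy": "1.23.2",
    }
    for name, (expected, installed) in find_version_mismatches(pins).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package is installed at the expected version.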
CODE:
Let us now understand the code. For this use case, we will only discuss three python scripts. The application needs more than these three, but we have already discussed the others in some of the earlier posts. Hence, we will skip them here.
- clsConfigClient.py (This is the configuration class of the python script that will extract the text from the preformatted scanned copy.)
################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
#### Modified On: 18-Sep-2022               ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### text extraction via image scanning.    ####
####                                        ####
################################################
import os
import platform as pl

my_dict = {}

class clsConfigClient(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    conf = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'SRC_PATH': Curr_Path + sep + 'data' + sep,
        'FINAL_PATH': Curr_Path + sep + 'Target' + sep,
        'IMAGE_PATH': Curr_Path + sep + 'Scans' + sep,
        'TEMPLATE_PATH': Curr_Path + sep + 'Template' + sep,
        'APP_DESC_1': 'Text Extraction from Video!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'SUBDIR': 'data',
        'WIDTH': 320,
        'HEIGHT': 320,
        'PADDING': 0.1,
        'SEP': sep,
        'MIN_CONFIDENCE': 0.5,
        'GPU': -1,
        'FILE_NAME': 'FilledUp.jpeg',
        'TEMPLATE_FILE_NAME': 'Template.jpeg',
        'TITLE': "Text Reading!",
        'ORIG_TITLE': "Camera Source!",
        'LANG': "en",
        'OEM_VAL': 1,
        'PSM_VAL': 7,
        'DRAW_TAG': (0, 0, 255),
        'LAYER_DET': [
            "feature_fusion/Conv_7/Sigmoid",
            "feature_fusion/concat_3"],
        "CACHE_LIM": 1,
        'ASCII_RANGE': 128,
        'SUBTRACT_PARAM': (123.68, 116.78, 103.94),
        'MY_DICT': {
            "atrib_1": {"id": "FileNo", "bbox": (425, 60, 92, 34), "filter_keywords": tuple(["FILE", "DEPT"])},
            "atrib_2": {"id": "DeptNo", "bbox": (545, 60, 87, 40), "filter_keywords": tuple(["DEPT", "CLOCK"])},
            "atrib_3": {"id": "ClockNo", "bbox": (673, 60, 75, 36), "filter_keywords": tuple(["CLOCK","VCHR.","NO."])},
            "atrib_4": {"id": "VCHRNo", "bbox": (785, 60, 136, 40), "filter_keywords": tuple(["VCHR.","NO."])},
            "atrib_5": {"id": "DigitNo", "bbox": (949, 60, 50, 38), "filter_keywords": tuple(["VCHR.","NO.", "056"])},
            "atrib_6": {"id": "CompanyName", "bbox": (326, 140, 621, 187), "filter_keywords": tuple(["COMPANY","FILE"])},
            "atrib_7": {"id": "StartDate", "bbox": (1264, 143, 539, 44), "filter_keywords": tuple(["Period", "Beginning:"])},
            "atrib_8": {"id": "EndDate", "bbox": (1264, 193, 539, 44), "filter_keywords": tuple(["Period", "Ending:"])},
            "atrib_9": {"id": "PayDate", "bbox": (1264, 233, 539, 44), "filter_keywords": tuple(["Pay", "Date:"])},
        }
    }
The most important part of this configuration is the following:
'MY_DICT': {
    "atrib_1": {"id": "FileNo", "bbox": (425, 60, 92, 34), "filter_keywords": tuple(["FILE", "DEPT"])},
    "atrib_2": {"id": "DeptNo", "bbox": (545, 60, 87, 40), "filter_keywords": tuple(["DEPT", "CLOCK"])},
    "atrib_3": {"id": "ClockNo", "bbox": (673, 60, 75, 36), "filter_keywords": tuple(["CLOCK","VCHR.","NO."])},
    "atrib_4": {"id": "VCHRNo", "bbox": (785, 60, 136, 40), "filter_keywords": tuple(["VCHR.","NO."])},
    "atrib_5": {"id": "DigitNo", "bbox": (949, 60, 50, 38), "filter_keywords": tuple(["VCHR.","NO.", "056"])},
    "atrib_6": {"id": "CompanyName", "bbox": (326, 140, 621, 187), "filter_keywords": tuple(["COMPANY","FILE"])},
    "atrib_7": {"id": "StartDate", "bbox": (1264, 143, 539, 44), "filter_keywords": tuple(["Period", "Beginning:"])},
    "atrib_8": {"id": "EndDate", "bbox": (1264, 193, 539, 44), "filter_keywords": tuple(["Period", "Ending:"])},
    "atrib_9": {"id": "PayDate", "bbox": (1264, 233, 539, 44), "filter_keywords": tuple(["Pay", "Date:"])},
}
Let us understand this part, as it is critical for the entire package.
We need to define, in terms of pixel positions, each area that we want to extract. Hence, every entry follows this pattern:
"atrib_": {"id": , "bbox": (x-coordinate, y-coordinate, width, height), "filter_keywords": tuple(["Mention the overlapping printed text that you don't want to capture. Make sure you follow the exact case for proper detection."])}
You can easily find the pixel position of each intended text area by using any photo editor.
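To make the bbox convention concrete, here is a minimal sketch of how an (x, y, width, height) tuple maps onto an image array. It assumes the scanned page is held as a NumPy array, as OpenCV does; the helper name crop_bbox is hypothetical, and the package's internal cropping logic is not shown in this post.

```python
import numpy as np


def crop_bbox(image, bbox):
    """Crop an (x, y, width, height) box out of a row-major image array.

    Hypothetical helper for illustration only.
    """
    x, y, w, h = bbox
    # NumPy images are indexed [row, column], i.e. [y, x].
    return image[y:y + h, x:x + w]


# Stand-in for a scanned page: 400 rows x 1100 columns, grayscale.
page = np.zeros((400, 1100), dtype=np.uint8)

# The "FileNo" bbox from the configuration above.
file_no_region = crop_bbox(page, (425, 60, 92, 34))
print(file_no_region.shape)  # (34, 92): height first, then width
```

Note the order: the configuration lists x before y, while the array is sliced with rows (y) first.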
Still not clear how to select?
Let’s watch the next video –

The above demo should explain what we are trying to achieve. Also, you need to understand that when two values sit extremely close to each other, the selected area may pick up the printed labels next to them. In that case, we put both of the undesired labels under the filter keywords to ensure that only the correct value is extracted.
For example, on the top left side, where the values are very close, we’re putting both nearby labels as filter keywords. One such example is as follows –
"filter_keywords": tuple(["FILE", "DEPT"])
The same logic applies to the other labels as well.
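The idea behind the filter keywords can be sketched as below. This is an assumed illustration of the concept, not the package's actual implementation: the function name drop_filter_keywords and the sample OCR text are hypothetical.

```python
def drop_filter_keywords(raw_text, filter_keywords):
    """Remove printed label tokens from OCR output.

    The comparison is case-sensitive, which is why the keywords in the
    configuration must match the printed labels exactly.
    """
    tokens = raw_text.split()
    kept = [token for token in tokens if token not in filter_keywords]
    return " ".join(kept)


# Hypothetical OCR result where the neighbouring labels leaked into the box.
print(drop_filter_keywords("FILE 000123 DEPT", ("FILE", "DEPT")))  # 000123
```

Because the match is exact, a keyword written as "file" would not remove the printed label "FILE".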
- readingFormLib.py (This is the main calling python script that will extract the text from the preformatted scanned copy.)
#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 22-Jul-2022                     ####
#### Modified On 18-Sep-2022                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### clsReadForm class to initiate               ####
#### the reading capability in real-time         ####
#### & display text from a formatted forms.      ####
#####################################################
# We keep the setup code in a different class as shown below.
from ReadingFilledForm import clsReadForm as rf
from clsConfigClient import clsConfigClient as cf
import datetime
import logging

###############################################
###           Global Section                ###
###############################################
# Instantiating all the main class
scannedImagePath = str(cf.conf['IMAGE_PATH']) + str(cf.conf['FILE_NAME'])
templatePath = str(cf.conf['TEMPLATE_PATH']) + str(cf.conf['TEMPLATE_FILE_NAME'])
x1 = rf.clsReadForm(scannedImagePath, templatePath)
###############################################
###    End of Global Section                ###
###############################################

def main():
    try:
        # Other useful variables
        debugInd = 'Y'
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        var1 = datetime.datetime.now()
        print('Start Time: ', str(var))
        # End of useful variables

        # Initiating Log Class
        general_log_path = str(cf.conf['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'readingForm.log', level=logging.INFO)

        print('Started extracting text from formatted forms!')

        # Getting the dictionary
        my_dict = cf.conf['MY_DICT']

        # Execute all the pass
        r1 = x1.startProcess(debugInd, var, my_dict)

        if (r1 == 0):
            print('Successfully extracted text from the formatted forms!')
        else:
            print('Failed to extract the text from the formatted forms!')

        var2 = datetime.datetime.now()
        c = var2 - var1
        minutes = c.total_seconds() / 60
        print('Total difference in minutes: ', str(minutes))

        # Print the end timestamp (var2), not the start timestamp.
        print('End Time: ', str(var2))

    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()
Key snippets from the above script –
# We keep the setup code in a different class as shown below.
from ReadingFilledForm import clsReadForm as rf
from clsConfigClient import clsConfigClient as cf
The above lines import the newly created PyPI package, along with the configuration class, into memory.
###############################################
###           Global Section                ###
###############################################
# Instantiating all the main class
scannedImagePath = str(cf.conf['IMAGE_PATH']) + str(cf.conf['FILE_NAME'])
templatePath = str(cf.conf['TEMPLATE_PATH']) + str(cf.conf['TEMPLATE_FILE_NAME'])
x1 = rf.clsReadForm(scannedImagePath, templatePath)
###############################################
###    End of Global Section                ###
###############################################
Now, the application builds the paths of both the template copy & the intended scanned copy & passes them to the clsReadForm class.
# Getting the dictionary
my_dict = cf.conf['MY_DICT']
After this, the application fetches the focus area dictionary, which indicates the areas of particular interest.
# Execute all the pass
r1 = x1.startProcess(debugInd, var, my_dict)
Finally, the application passes the dictionary to the new package to get the extracted values. A return value of 0 indicates success.
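Since the focus area dictionary drives the whole extraction, it can be worth checking its shape before handing it to the package. The following is a minimal sketch under my own assumptions: validate_focus_areas is a hypothetical helper, and the rules it enforces (three keys, four non-negative integers, string keywords) are inferred from the configuration shown above rather than taken from the package.

```python
def validate_focus_areas(focus_areas):
    """Return a list of problems found in a MY_DICT-style mapping.

    Hypothetical helper; the rules are inferred from the sample config.
    """
    problems = []
    required = {"id", "bbox", "filter_keywords"}
    for key, spec in focus_areas.items():
        missing = required - set(spec)
        if missing:
            problems.append(f"{key}: missing {sorted(missing)}")
            continue
        bbox = spec["bbox"]
        if len(bbox) != 4 or not all(isinstance(v, int) and v >= 0 for v in bbox):
            problems.append(f"{key}: bbox must be four non-negative integers")
        elif bbox[2] == 0 or bbox[3] == 0:
            problems.append(f"{key}: bbox width and height must be positive")
        if not all(isinstance(word, str) for word in spec["filter_keywords"]):
            problems.append(f"{key}: filter_keywords must contain only strings")
    return problems


sample = {
    "atrib_1": {"id": "FileNo", "bbox": (425, 60, 92, 34),
                "filter_keywords": tuple(["FILE", "DEPT"])},
}
print(validate_focus_areas(sample))  # []
```

An empty list means the dictionary has the expected shape; anything else is a human-readable description of what to fix in the configuration.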
FOLDER STRUCTURE:
Here is the folder structure that contains all the files & directories on macOS:

Similar structures are present in the Windows environment as well.
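If you are setting this up from scratch, the directories referenced by the configuration class can be created with a short script. This is a sketch only: the folder names are taken from the paths in clsConfigClient.conf, while the helper name ensure_layout is hypothetical and the exact layout of the original project may contain more than this.

```python
from pathlib import Path

# Folder names taken from the paths defined in clsConfigClient.conf.
EXPECTED_DIRS = ("arch", "profile", "log", "report",
                 "data", "Target", "Scans", "Template")


def ensure_layout(base_dir):
    """Create any missing project folders and return the names created.

    Hypothetical helper for illustration only.
    """
    base = Path(base_dir)
    created = []
    for name in EXPECTED_DIRS:
        target = base / name
        if not target.exists():
            target.mkdir(parents=True)
            created.append(name)
    return created


if __name__ == "__main__":
    # Run from the folder that holds clsConfigClient.py.
    print(ensure_layout("."))
```

Using pathlib keeps the script independent of the path separator, so the same code works on macOS and Windows.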
You will find the complete calling codebase at the following GitHub link.
I’ll bring some more exciting topics in the coming days from the Python verse. Please share & subscribe to my post & let me know your feedback.
Till then, Happy Avenging! 🙂
Note: All the data & scenarios posted here are representational, available over the internet & for educational purposes only. There is always room for improvement, especially in the prediction quality.