Today, we’ll explore an advanced use of Pandas & NumPy: aggregating data to produce meaningful insights for your business that can make an impact on your overall profit.
First, let us understand the complexity of the problem & what we’re looking to achieve here. For that, you need to review the source data, the lookup data & how you want to process them.
Source Data:
The above picture is a sample data set from a bank (data available on a U.S. public forum), which captures each customer’s current account balance. Let’s look at the sample data from the look-up files –
First File:
Second File:
So, one can clearly see that the bank is trying to derive a number of stories from its existing data.
Challenges:
The first lookup file stores its data in such a way that each column of our source file appears as a row there. Hence, you need to reshape the source data to match the lookup file in order to pick up the other relevant information & then join that with the second lookup file to bring in all the data points for your storyline.
Look-Up Configuration:
In order to match the look-up data with our source data, we’ll be adding two new columns, which will help the application pick the correct row out of the entries provided in look-up file 1.
As you can see from the above picture, two new columns, i.e. Category & Stat, have been added in this context. Here, Category contains metadata information. If a column has a significant number of unique values, we mark it as ‘D‘ in the category. In this case, the bank doesn’t offer any scheme based on the customer’s name; hence, those fields are marked with ‘I‘. The Gender column has a small number of unique values, i.e. either ‘Male‘ or ‘Female‘; as a result, we provided two corresponding entries. Remember, DateJoined is a key column here. Even though we marked its category as ‘I‘, which denotes that no transformation is required, ‘K‘ denotes that it is the driving column, apart from the surrogate key [PKEY] that we’ll be generating during our application’s transformation process. I’ll discuss that in the respective snippet discussion.
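Since the actual look-up picture isn’t reproduced here, below is a rough, hypothetical sketch of how such rows might be laid out in the first look-up file. All of the values are assumptions built only from the Category rules just described, not a copy of the real DataLookUp file –

import pandas as p

# Illustrative rows only - these values are assumptions based on the
# Category rules described above, not the real DataLookUp.csv content.
df_lkp_demo = p.DataFrame([
    {'TableName': 'CustomerBase', 'ColumnOrder': 1, 'ColumnName': 'Name',
     'MappedColumnName': 'Name', 'Category': 'I', 'Stat': '', 'Group': '', 'BankContribution': 0},
    {'TableName': 'CustomerBase', 'ColumnOrder': 2, 'ColumnName': 'Gender',
     'MappedColumnName': 'Gender', 'Category': 'Male', 'Stat': 'Y', 'Group': '1', 'BankContribution': 100},
    {'TableName': 'CustomerBase', 'ColumnOrder': 2, 'ColumnName': 'Gender',
     'MappedColumnName': 'Gender', 'Category': 'Female', 'Stat': 'Y', 'Group': '2', 'BankContribution': 100},
])
print(df_lkp_demo)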
Our Goal:
Based on the source data, we need to derive the following stories & publish each of them in a separate Excel sheet.
- The country & gender-wise bank’s contribution.
- The country & job-wise bank’s contribution.
- The country & age-range-wise savings trends & bank’s contribution.
A little note on Bank’s Contribution:
Let us explain what exactly is meant by the bank’s contribution. Sometimes, a bank wants to encourage an individual client to save, based on all the available factors. So, let’s assume that the bank contributes $1 for every $150 a person saves. Again, this $1 may vary based on age range & gender to promote a specific group. Also, when someone opens a savings account, the bank by default contributes a sum of $100 at account-opening time for a short period as part of its promotion strategy. You will get these details from the first lookup file. The second lookup file contains the age-range category based on the Group available in the first lookup file.
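To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch. The $150 ratio & the $100 opening bonus are the assumed promotional figures from the paragraph above –

def bank_contribution(balance, per_ratio=150.0, unit=1.0, opening_bonus=100.0):
    # $1 (unit) for every $150 (per_ratio) of savings, plus the one-off
    # $100 opening bonus - all figures are the assumed promo values above.
    return (balance // per_ratio) * unit + opening_bonus

print(bank_contribution(3000))   # 3000 / 150 = 20 units -> 20 + 100 = 120.0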
Python Scripts:
In this installment, we’ll be reusing the following python scripts, which were already discussed in my earlier post –
- clsFindFile.py
- clsL.py
So, I’m not going to discuss these scripts.
1. clsParam.py (This script holds all the configuration parameters – the directory paths, lookup file names & the debug indicator – used across the application. Hence, the name comes into the picture.)
###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 04-Apr-2019       ########
###########################################

import os
import platform as pl

class clsParam(object):

    os_det = pl.system()
    dir_sep = ''

    if os_det == "Windows":
        dir_sep = "\\"
    else:
        dir_sep = '/'

    config = {
        'MAX_RETRY': 5,
        'PATH': os.path.dirname(os.path.realpath(__file__)) + dir_sep,
        'SRC_DIR': os.path.dirname(os.path.realpath(__file__)) + dir_sep + 'src_files' + dir_sep,
        'FIN_DIR': os.path.dirname(os.path.realpath(__file__)) + dir_sep + 'finished' + dir_sep,
        'LKP_DIR': os.path.dirname(os.path.realpath(__file__)) + dir_sep + 'lkp_files' + dir_sep,
        'LOG_DIR': os.path.dirname(os.path.realpath(__file__)) + dir_sep + 'log' + dir_sep,
        'LKP_FILE': 'DataLookUp',
        'LKP_CATG_FILE': 'CategoryLookUp',
        'LKP_FILE_DIR_NM': 'lkp_files',
        'SRC_FILE_DIR_NM': 'src_files',
        'FIN_FILE_DIR_NM': 'finished',
        'LOG_FILE_DIR_NM': 'log',
        'DEBUG_IND': 'Y'
    }
2. clsLookUpDataRead.py (This script reads the lookup files & generates the combined lookup result, as we have two different lookup files. Hence, the name comes into the picture.)
###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 04-Apr-2019       ########
###########################################

import pandas as p
import clsFindFile as c
import clsL as log
from clsParam import clsParam as cf
import datetime

# Disabling Warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

class clsLookUpDataRead(object):

    def __init__(self, lkpFilename):
        self.lkpFilename = lkpFilename
        self.lkpCatgFilename = cf.config['LKP_CATG_FILE']
        self.path = cf.config['PATH']
        self.subdir = str(cf.config['LOG_FILE_DIR_NM'])
        # To disable logging info
        self.Ind = cf.config['DEBUG_IND']
        self.var = datetime.datetime.now().strftime(".%H.%M.%S")

    def getNaN2Null(self, row):
        try:
            str_val = ''
            str_val = str(row['Group']).replace('nan', '').replace('NaN', '')
            return str_val
        except:
            str_val = ''
            return str_val

    def ReadTable(self):
        # Assigning Logging Info
        lkpF = []
        lkpF_2 = []
        var = self.var
        Ind = self.Ind
        subdir = self.subdir

        # Initiating Logging Instances
        clog = log.clsL()

        try:
            # Assigning Lookup file name
            lkpFilename = self.lkpFilename

            # Fetching the actual look-up file name
            f = c.clsFindFile(lkpFilename, str(cf.config['LKP_FILE_DIR_NM']))
            lkp_file_list = list(f.find_file())

            # Ideally look-up will be only one file
            # Later it will be converted to table
            for i in range(len(lkp_file_list)):
                lkpF = lkp_file_list[i]

            # Fetching the content of the look-up file
            df_lkpF = p.read_csv(lkpF, index_col=False)

            # Fetching Category LookUp File
            LkpCatgFileName = self.lkpCatgFilename

            f1 = c.clsFindFile(LkpCatgFileName, str(cf.config['LKP_FILE_DIR_NM']))
            lkp_file_list_2 = list(f1.find_file())

            # Ideally look-up will be only one file
            # Later it will be converted to table
            for j in range(len(lkp_file_list_2)):
                lkpF_2 = lkp_file_list_2[j]

            # Fetching the content of the look-up file
            df_lkpF_2 = p.read_csv(lkpF_2, index_col=False)

            # Changing both the column data type as same type
            df_lkpF['Group_1'] = df_lkpF['Group'].astype(str)
            df_lkpF_2['Group_1'] = df_lkpF_2['Group'].astype(str)

            # Dropping the old column
            df_lkpF.drop(['Group'], axis=1, inplace=True)
            df_lkpF_2.drop(['Group'], axis=1, inplace=True)

            # Renaming the changed data type column with the old column name
            df_lkpF.rename(columns={'Group_1': 'Group'}, inplace=True)
            df_lkpF_2.rename(columns={'Group_1': 'Group'}, inplace=True)

            # Merging two lookup dataframes to form Final Consolidated Dataframe
            df_Lkp_Merge = p.merge(
                df_lkpF[['TableName', 'ColumnOrder', 'ColumnName', 'MappedColumnName',
                         'Category', 'Stat', 'Group', 'BankContribution']],
                df_lkpF_2[['StartAgeRange', 'EndAgeRange', 'Group']],
                on=['Group'], how='left')

            # Converting NaN to Null or empty string
            df_Lkp_Merge['GroupNew'] = df_Lkp_Merge.apply(lambda row: self.getNaN2Null(row), axis=1)

            # Dropping the old column & renaming the new column
            df_Lkp_Merge.drop(['Group'], axis=1, inplace=True)
            df_Lkp_Merge.rename(columns={'GroupNew': 'Group'}, inplace=True)

            clog.logr('1.df_Lkp_Merge' + var + '.csv', Ind, df_Lkp_Merge, subdir)

            return df_Lkp_Merge

        except (FileNotFoundError, IOError) as s:
            y = str(s)
            print(y)

            # Declaring Empty Dataframe
            df_error = p.DataFrame()

            return df_error
        except Exception as e:
            x = str(e)
            print(x)

            # Declaring Empty Dataframe
            df_error = p.DataFrame()

            return df_error
Key lines from this script –
# Fetching the actual look-up file name
f = c.clsFindFile(lkpFilename, str(cf.config['LKP_FILE_DIR_NM']))
lkp_file_list = list(f.find_file())
# Ideally look-up will be only one file
# Later it will be converted to table
for i in range(len(lkp_file_list)):
    lkpF = lkp_file_list[i]
# Fetching the content of the look-up file
df_lkpF = p.read_csv(lkpF, index_col=False)
Here, the application tries to locate the lookup file based on the file-name pattern & directory path, and then loads the data into a dataframe.
# Fetching Category LookUp File
LkpCatgFileName = self.lkpCatgFilename
f1 = c.clsFindFile(LkpCatgFileName, str(cf.config['LKP_FILE_DIR_NM']))
lkp_file_list_2 = list(f1.find_file())
# Ideally look-up will be only one file
# Later it will be converted to table
for j in range(len(lkp_file_list_2)):
    lkpF_2 = lkp_file_list_2[j]
# Fetching the content of the look-up file
df_lkpF_2 = p.read_csv(lkpF_2, index_col=False)
In this step, the second lookup file will be loaded into the second dataframe.
# Changing both the column data type as same type
df_lkpF['Group_1'] = df_lkpF['Group'].astype(str)
df_lkpF_2['Group_1'] = df_lkpF_2['Group'].astype(str)
# Dropping the old column
df_lkpF.drop(['Group'], axis=1, inplace=True)
df_lkpF_2.drop(['Group'], axis=1, inplace=True)
# Renaming the changed data type column with the old column name
df_lkpF.rename(columns={'Group_1':'Group'}, inplace=True)
df_lkpF_2.rename(columns={'Group_1': 'Group'}, inplace=True)
It is always better to cast the columns that will form part of the join key to the same data type. The above snippet does exactly that.
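As a side note, the same effect can be achieved without the temporary column, the drop & the rename; a minimal equivalent sketch –

# Equivalent shortcut: overwrite the join-key column in place with the
# cast value, skipping the temporary Group_1 column entirely.
df_lkpF['Group'] = df_lkpF['Group'].astype(str)
df_lkpF_2['Group'] = df_lkpF_2['Group'].astype(str)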
# Merging two lookup dataframes to form Final Consolidated Dataframe
df_Lkp_Merge = p.merge(
    df_lkpF[['TableName', 'ColumnOrder', 'ColumnName', 'MappedColumnName',
             'Category', 'Stat', 'Group', 'BankContribution']],
    df_lkpF_2[['StartAgeRange', 'EndAgeRange', 'Group']],
    on=['Group'], how='left')
In this step, the first lookup file is left-joined with the second lookup file on the Group column.
# Converting NaN to Nul or empty string
df_Lkp_Merge['GroupNew'] = df_Lkp_Merge.apply(lambda row: self.getNaN2Null(row), axis=1)
# Dropping the old column & renaming the new column
df_Lkp_Merge.drop(['Group'], axis=1, inplace=True)
df_Lkp_Merge.rename(columns={'GroupNew': 'Group'}, inplace=True)
Once the merge is done, the ‘NaN’ values in the key column need to be suppressed to empty strings for smoother downstream processing.
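For what it’s worth, the same suppression can be done with a vectorized call instead of the row-wise apply. Because Group was cast with astype(str) earlier, missing values arrive as the literal string ‘nan’, so a straight replace handles them in one shot –

# Vectorized alternative to the row-wise getNaN2Null apply.
df_Lkp_Merge['Group'] = df_Lkp_Merge['Group'].replace(['nan', 'NaN'], '')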
3. clsPivotLookUp.py (This script contains the main logic: it processes & merges the data between the source & lookup files, creates the grouped data & from that produces the data points captured in the Excel output. Hence, the name comes into the picture.)
###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 04-Apr-2019       ########
###########################################

import pandas as p
import numpy as np
import clsFindFile as c
import clsL as log
import datetime
from clsParam import clsParam as cf
from pandas import ExcelWriter

# Disabling Warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

class clsPivotLookUp(object):

    def __init__(self, srcFilename, tgtFileName, df_lkpF):
        self.srcFilename = srcFilename
        self.tgtFileName = tgtFileName
        self.df_lkpF = df_lkpF
        self.lkpCatgFilename = cf.config['LKP_CATG_FILE']
        self.path = cf.config['PATH']
        self.subdir = str(cf.config['LOG_FILE_DIR_NM'])
        self.subdir_2 = str(cf.config['FIN_FILE_DIR_NM'])
        # To disable logging info
        self.Ind = cf.config['DEBUG_IND']
        self.report_path = cf.config['FIN_DIR']

    def dfs_tabs(self, df_list, sheet_list, file_name):
        try:
            cnt = 0
            number_rows = 0

            writer = p.ExcelWriter(file_name, engine='xlsxwriter')

            for dataframe, sheet in zip(df_list, sheet_list):
                number_rows = int(dataframe.shape[0])
                number_cols = int(dataframe.shape[1])

                if cnt == 0:
                    dataframe.to_excel(writer, sheet_name=sheet, startrow=7, startcol=5)
                else:
                    dataframe.to_excel(writer, sheet_name=sheet, startrow=5, startcol=0)

                # Get the xlsxwriter workbook & worksheet objects
                workbook = writer.book
                worksheet = writer.sheets[sheet]

                worksheet.set_zoom(90)

                if cnt == 0:
                    worksheet.set_column('A:E', 4)
                    worksheet.set_column('F:F', 20)
                    worksheet.set_column('G:G', 10)
                    worksheet.set_column('H:J', 20)

                    # Insert an Image
                    worksheet.insert_image('E1', 'Logo.png', {'x_scale': 0.6, 'y_scale': 0.8})

                    # Add a number format for cells with money.
                    money_fmt = workbook.add_format({'num_format': '$#,##0', 'border': 1})
                    worksheet.set_column('H:H', 20, money_fmt)

                    # Define our range for color formatting
                    color_range = "F9:F{}".format(number_rows * 2 + 1)

                    # Add a format. Red fill with the dark red text
                    red_format = workbook.add_format({'bg_color': '#FEC7CE', 'font_color': '#0E0E08', 'border': 1})

                    # Add a format. Green fill with the dark green text
                    green_format = workbook.add_format({'bg_color': '#D0FCA4', 'font_color': '#0E0E08', 'border': 1})

                    # Add a format. Cyan fill with the dark green text
                    mid_format = workbook.add_format({'bg_color': '#6FC2D8', 'font_color': '#0E0E08', 'border': 1})

                    # Add a format. Other fill with the dark green text
                    oth_format = workbook.add_format({'bg_color': '#AFC2D8', 'font_color': '#0E0E08', 'border': 1})

                    worksheet.conditional_format(color_range, {'type': 'cell',
                                                               'criteria': 'equal to',
                                                               'value': '"England"',
                                                               'format': green_format})
                    worksheet.conditional_format(color_range, {'type': 'cell',
                                                               'criteria': 'equal to',
                                                               'value': '"Northern Ireland"',
                                                               'format': mid_format})
                    worksheet.conditional_format(color_range, {'type': 'cell',
                                                               'criteria': 'equal to',
                                                               'value': '"Scotland"',
                                                               'format': red_format})
                    worksheet.conditional_format(color_range, {'type': 'cell',
                                                               'criteria': 'equal to',
                                                               'value': '"Wales"',
                                                               'format': oth_format})
                else:
                    first_row = 5
                    first_col = 0
                    last_row = first_row + (number_rows * 2)
                    last_col = number_cols - 1

                    if cnt == 1:
                        worksheet.set_column('A:D', 20)
                    else:
                        worksheet.set_column('A:E', 20)

                    worksheet.set_column('F:F', 20)

                    # Add a number format for cells with money.
                    # money_fmt = workbook.add_format({'num_format': '$#,##0', 'bold': True, 'border': 1})
                    money_fmt = workbook.add_format({'num_format': '$#,##0', 'border': 1})

                    # Amount columns
                    if cnt == 1:
                        worksheet.set_row(6, 0, money_fmt)
                        worksheet.set_column('C:C', 20, money_fmt)
                    else:
                        worksheet.set_row(6, 0, money_fmt)
                        worksheet.set_column('D:F', 20, money_fmt)

                    # Insert an Image
                    worksheet.insert_image('B1', 'Logo.png', {'x_scale': 0.5, 'y_scale': 0.5})

                    # Add a format. Red fill with the dark red text
                    red_format = workbook.add_format({'bg_color': '#FEC7CE', 'font_color': '#0E0E08'})

                    # Add a format. Green fill with the dark green text
                    green_format = workbook.add_format({'bg_color': '#D0FCA4', 'font_color': '#0E0E08'})

                    # Add a format. Cyan fill with the dark green text
                    mid_format = workbook.add_format({'bg_color': '#6FC2D8', 'font_color': '#0E0E08'})

                    # Add a format. Other fill with the dark green text
                    oth_format = workbook.add_format({'bg_color': '#AFC2D8', 'font_color': '#0E0E08'})

                    # Fill colour based on formula
                    worksheet.conditional_format(first_row, first_col, last_row, last_col,
                                                 {'type': 'formula',
                                                  'criteria': '=INDIRECT("A"&ROW())="England"',
                                                  'format': green_format})
                    worksheet.conditional_format(first_row, first_col, last_row, last_col,
                                                 {'type': 'formula',
                                                  'criteria': '=INDIRECT("A"&ROW())="Northern Ireland"',
                                                  'format': mid_format})
                    worksheet.conditional_format(first_row, first_col, last_row, last_col,
                                                 {'type': 'formula',
                                                  'criteria': '=INDIRECT("A"&ROW())="Scotland"',
                                                  'format': red_format})
                    worksheet.conditional_format(first_row, first_col, last_row, last_col,
                                                 {'type': 'formula',
                                                  'criteria': '=INDIRECT("A"&ROW())="Wales"',
                                                  'format': oth_format})

                cnt += 1

            writer.save()
            writer.close()

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

    def getIntVal(self, row):
        try:
            int_val = 0
            int_val = int(row['MCategory'])
            return int_val
        except:
            int_val = 0
            return int_val

    def getSavingsAmount(self, row):
        try:
            savings = 0.0
            savings = float(row['Balance']) - float(row['BankContribution'])
            return savings
        except:
            savings = 0
            return savings

    def getNaN2Zero_StartAgeRange(self, row):
        try:
            int_AgeRange = 0
            str_StartAgeRange = ''
            str_StartAgeRange = str(row['StartAgeRange']).replace('nan', '').replace('NaN', '')
            if (len(str_StartAgeRange) > 0):
                int_AgeRange = int(float(str_StartAgeRange))
            else:
                int_AgeRange = 0
            return int_AgeRange
        except:
            int_AgeRange = 0
            return int_AgeRange

    def getNaN2Zero_EndAgeRange(self, row):
        try:
            int_AgeRange = 0
            str_EndAgeRange = ''
            str_EndAgeRange = str(row['EndAgeRange']).replace('nan', '').replace('NaN', '')
            if (len(str_EndAgeRange) > 0):
                int_AgeRange = int(float(str_EndAgeRange))
            else:
                int_AgeRange = 0
            return int_AgeRange
        except:
            int_AgeRange = 0
            return int_AgeRange

    def parse_and_write_csv(self):
        # Assigning Logging Info
        Ind = self.Ind
        subdir = self.subdir
        subdir_2 = self.subdir_2
        lkpF = []
        lkpF_2 = []
        report_path = self.report_path

        # Initiating Logging Instances
        clog = log.clsL()

        if Ind == 'Y':
            print('Logging Enabled....')
        else:
            print('Logging Not Enabled....')

        # Assigning Source File Basic Name
        srcFileInit = self.srcFilename
        tgtFileName = self.tgtFileName
        df_lkpF = self.df_lkpF

        try:
            # Fetching the actual source file name
            d = c.clsFindFile(self.srcFilename, str(cf.config['SRC_FILE_DIR_NM']))
            src_file_list = d.find_file()

            # Ideally look-up will be only one file
            # Later it will be converted to table
            for i in range(len(src_file_list)):
                # Handling Multiple source files
                var = datetime.datetime.now().strftime(".%H.%M.%S")
                print('Target File Extension will contain the following:: ', var)

                srcF = src_file_list[i]

                # Reading Source File
                df = p.read_csv(srcF, index_col=False)

                # Adding a new surrogate key to the existing records
                df = df.assign(PKEY=[1 + i for i in range(len(df))])[['PKEY'] + df.columns.tolist()]

                clog.logr('2.DF_Assign' + var + '.csv', Ind, df, subdir)

                # Fetching only relevant rows from the Look-up Files
                # based on Filters with 'I' or No Token
                # 'K' for Key columns with No Token
                # 'D' for Single column Token
                df_lkpFile = df_lkpF[(df_lkpF['TableName'] == srcFileInit) &
                                     ((df_lkpF['Category'] == 'I') | (df_lkpF['Category'] == 'K'))]

                # Fetching the unique records from Look-up table
                id_list1 = list(df_lkpFile['ColumnName'].drop_duplicates())
                id_list2 = ['PKEY']
                id_list = id_list2 + id_list1

                # Pivoting part of the source file data to be join for merge
                df_melt = df.melt(id_vars=id_list, var_name='ColumnName')

                # Changing the generated column Value to Category for upcoming Merge
                # df_melt = df_tmp_melt.rename_by_col_index(idx_np, 'Category')
                # df_melt.rename(columns={'value': 'Category'}, inplace=True)
                df_melt.rename(columns={'value': 'MCategory'}, inplace=True)

                # df_melt.to_csv(path + '1.DF_Melt.csv')
                clog.logr('3.DF_Melt' + var + '.csv', Ind, df_melt, subdir)

                # Now fetching look-up file one more time
                # filtering with the only Table Name
                # For merge with our temporary df_melt
                # to get the relevant lookup information
                df_lkpFinFile = df_lkpF[(df_lkpF['TableName'] == srcFileInit) &
                                        ((df_lkpF['Category'] == 'D') | (df_lkpF['Category'] == 'Male') |
                                         (df_lkpF['Category'] == 'K') | (df_lkpF['Category'] == 'Female'))]

                clog.logr('4.DF_Finlkp' + var + '.csv', Ind, df_lkpFinFile, subdir)

                # Merging two files based on Keys
                # df_fin = df_melt.merge(df_lkpFinFile, on=['ColumnName', 'Category'], how='left')
                df_fin = df_melt.merge(df_lkpFinFile, on=['ColumnName'], how='left')

                clog.logr('5.DF_FIN_Basic_Merge' + var + '.csv', Ind, df_fin, subdir)

                df_fin2 = df_fin[((df_fin['MCategory'] == 'I') & (df_fin['Category'] == df_fin['MCategory'])) |
                                 ((df_fin['MCategory'] == 'Male') & (df_fin['Category'] == df_fin['MCategory'])) |
                                 ((df_fin['MCategory'] == 'Female') & (df_fin['Category'] == df_fin['MCategory'])) |
                                 (df_fin['MCategory'] == 'NaN') |
                                 (df_fin['MCategory'] == 'D') |
                                 (
                                     (df_fin['MCategory'] != 'I') & (df_fin['MCategory'] != 'Male') &
                                     (df_fin['MCategory'] != 'Female') & (df_fin['MCategory'] != 'D') &
                                     (df_fin['MCategory'] != 'NaN')
                                 )]

                clog.logr('6.Merge_After_Filter' + var + '.csv', Ind, df_fin2, subdir)

                # Identifying Integer Column for next step
                df_fin2['Catg'] = df_fin2.apply(lambda row: self.getIntVal(row), axis=1)
                df_fin2['StAge'] = df_fin2.apply(lambda row: self.getNaN2Zero_StartAgeRange(row), axis=1)
                df_fin2['EnAge'] = df_fin2.apply(lambda row: self.getNaN2Zero_EndAgeRange(row), axis=1)

                # Dropping the old Columns
                df_fin2.drop(['Category'], axis=1, inplace=True)
                df_fin2.drop(['StartAgeRange'], axis=1, inplace=True)
                df_fin2.drop(['EndAgeRange'], axis=1, inplace=True)

                # Renaming the new columns
                df_fin2.rename(columns={'Catg': 'Category'}, inplace=True)
                df_fin2.rename(columns={'StAge': 'StartAgeRange'}, inplace=True)
                df_fin2.rename(columns={'EnAge': 'EndAgeRange'}, inplace=True)

                clog.logr('7.Catg' + var + '.csv', Ind, df_fin2, subdir)

                # Handling special cases when Category from source & lookup file won't match
                # alternative way to implement left outer join due to specific data scenarios
                df_fin2['Flag'] = np.where(((df_fin2.StartAgeRange == 0) | (df_fin2.EndAgeRange == 0)) |
                                           (((df_fin2.StartAgeRange > 0) & (df_fin2.EndAgeRange > 0)) &
                                            ((df_fin2.Category >= df_fin2.StartAgeRange) &
                                             (df_fin2.Category <= df_fin2.EndAgeRange))), 'Y', 'N')

                clog.logr('8.After_Special_Filter' + var + '.csv', Ind, df_fin2, subdir)

                # Keeping only the data where Flag is set to Y
                newDF = df_fin2[(df_fin2['Flag'] == 'Y')]

                clog.logr('9.Flag_Filter' + var + '.csv', Ind, newDF, subdir)

                # Dropping the columns that are no longer needed
                newDF.drop(['TableName'], axis=1, inplace=True)
                newDF.drop(['ColumnOrder'], axis=1, inplace=True)
                newDF.drop(['ColumnName'], axis=1, inplace=True)
                newDF.drop(['Category'], axis=1, inplace=True)
                newDF.drop(['Flag'], axis=1, inplace=True)
                newDF.drop(['Group'], axis=1, inplace=True)

                # Need to rename MappedColumnName to ColumnName
                newDF.rename(columns={'MappedColumnName': 'ColumnName'}, inplace=True)

                clog.logr('10.newDF' + var + '.csv', Ind, newDF, subdir)

                df_short = newDF[['PKEY', 'BankContribution', 'StartAgeRange', 'EndAgeRange']]

                clog.logr('11.df_short' + var + '.csv', Ind, df_short, subdir)

                # Aggregating information
                grouped = df_short.groupby(['PKEY'])
                dfGroup = grouped.aggregate(np.sum)

                clog.logr('12.dfGroup' + var + '.csv', Ind, dfGroup, subdir)

                # Let's merge to get everything in row level
                df_rowlvl = df.merge(dfGroup, on=['PKEY'], how='inner')

                clog.logr('13.Rowlvl_Merge' + var + '.csv', Ind, df_rowlvl, subdir)

                # Dropping PKEY & Unnamed columns from the csv
                df_rowlvl.drop(['PKEY'], axis=1, inplace=True)

                clog.logr('14.Final_DF' + var + '.csv', Ind, df_rowlvl, subdir)

                ##############################################################
                ####   Country & Gender wise Bank's Contribution         ####
                ##############################################################
                dfCountryGender = df_rowlvl[['Region', 'Gender', 'BankContribution']]

                grouped_CG = dfCountryGender.groupby(['Region', 'Gender'])
                dCountryGen = grouped_CG.aggregate(np.sum)

                print("-" * 60)
                print("Country & Gender wise Bank's Contribution")
                print("-" * 60)
                print(dCountryGen)

                clog.logr('15.dCountryGen' + var + '.csv', Ind, dCountryGen, subdir)

                ###############################################################
                ###### End Of Country & Gender wise Bank's Contribution ######
                ###############################################################

                ##############################################################
                ####   Country & Job wise Bank's Contribution            ####
                ##############################################################
                dfCountryJob = df_rowlvl[['Region', 'Job Classification', 'BankContribution']]

                grouped_CJ = dfCountryJob.groupby(['Region', 'Job Classification'])
                dCountryJob = grouped_CJ.aggregate(np.sum)

                print("-" * 60)
                print("Country & Job wise Bank's Contribution")
                print("-" * 60)
                print(dCountryJob)

                clog.logr('16.dCountryJob' + var + '.csv', Ind, dCountryJob, subdir)

                ###############################################################
                ###### End Of Country & Job wise Bank's Contribution    ######
                ###############################################################

                ##############################################################
                ####   Country & Age wise Savings & Bank's Contribution  ####
                ##############################################################
                dfCountryAge = df_rowlvl[['Region', 'StartAgeRange', 'EndAgeRange', 'Balance', 'BankContribution']]
                dfCountryAge['SavingsAmount'] = dfCountryAge.apply(lambda row: self.getSavingsAmount(row), axis=1)

                grouped_CA = dfCountryAge.groupby(['Region', 'StartAgeRange', 'EndAgeRange'])
                dCountryAge = grouped_CA.aggregate(np.sum)

                print("-" * 60)
                print("Country & Age wise Savings & Bank's Contribution")
                print("-" * 60)
                print(dCountryAge)

                clog.logr('17.dCountryAge' + var + '.csv', Ind, dCountryAge, subdir)

                ##############################################################
                ####  End Of Country & Age wise Savings & Bank's         ####
                ####  Contribution                                       ####
                ##############################################################

                print('Writing to file!!')

                # Avoiding Index column of dataframe while copying to csv
                # df_token.to_csv(tgtFileName, index=False)
                # For Target File Ind should be always Yes/Y
                Ind = 'Y'
                FtgtFileName = tgtFileName + var + '.csv'
                clog.logr(FtgtFileName, Ind, df_rowlvl, subdir_2)

                ##############################################################
                ##### Writing to Excel File with Different Tabular Sheet #####
                ##############################################################
                dfs = [dCountryGen, dCountryJob, dCountryAge]
                sheets = ['Country-Gender-Stats', 'Country-Job-Stats', 'Country-Age-Stats']

                x = self.dfs_tabs(dfs, sheets, report_path + tgtFileName + var + '.xlsx')

                ##############################################################
                #####            End Of Excel Sheet Writing              #####
                ##############################################################

                # Resetting the Filename after every iteration
                # in case Multiple source files exist
                FtgtFileName = ""

            return 0

        except Exception as e:
            x = str(e)
            print(x)

            return 9
Key snippets from this script –
# Adding a new surrogate key to the existing records
df = df.assign(PKEY=[1 + i for i in range(len(df))])[['PKEY'] + df.columns.tolist()]
This is extremely crucial, as the application creates its own unique key irrespective of the data files; this key is used in most places throughout the data processing.
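A slightly more direct way to build the same surrogate key, in case you prefer it (same outcome, arguably easier to read) –

# Insert PKEY as the first column with values 1..len(df), instead of
# assign + manual column reordering.
df.insert(0, 'PKEY', range(1, len(df) + 1))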
df_lkpFile = df_lkpF[(df_lkpF['TableName'] == srcFileInit) &
                     ((df_lkpF['Category'] == 'I') | (df_lkpF['Category'] == 'K'))]
# Fetching the unique records from Look-up table
id_list1 = list(df_lkpFile['ColumnName'].drop_duplicates())
id_list2 = ['PKEY']
id_list = id_list2 + id_list1
This step captures PKEY together with all the source columns marked ‘I’ or ‘K’ in the lookup file. These act as identifier columns in the upcoming melt; the remaining columns are the ones converted from columns to rows & then joined with our look-up table.
# Pivoting part of the source file data to be join for merge
df_melt = df.melt(id_vars=id_list, var_name='ColumnName')
As shown in the above step, the application unpivots the remaining (non-identifier) columns of our source file into rows.
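If melt is new to you, here is a tiny self-contained illustration of the wide-to-long reshape being performed (toy column names, not the real source schema) –

import pandas as p

# Toy frame: PKEY is the identifier; Gender & Age get unpivoted into rows.
df_toy = p.DataFrame({'PKEY': [1, 2], 'Gender': ['Male', 'Female'], 'Age': [34, 29]})
df_long = df_toy.melt(id_vars=['PKEY'], var_name='ColumnName')
print(df_long)
#    PKEY ColumnName   value
# 0     1     Gender    Male
# 1     2     Gender  Female
# 2     1        Age      34
# 3     2        Age      29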
df_lkpFinFile = df_lkpF[(df_lkpF['TableName'] == srcFileInit) &
                        ((df_lkpF['Category'] == 'D') | (df_lkpF['Category'] == 'Male') |
                         (df_lkpF['Category'] == 'K') | (df_lkpF['Category'] == 'Female'))]
In this step, the application picks up all the lookup rows that match the source file name & carry the categories of interest; these rows will be used for the lookup join.
df_fin = df_melt.merge(df_lkpFinFile, on=['ColumnName'], how='left')
In this step, the application joins the transformed source data with our lookup data.
df_fin2 = df_fin[((df_fin['MCategory'] == 'I') & (df_fin['Category'] == df_fin['MCategory'])) |
                 ((df_fin['MCategory'] == 'Male') & (df_fin['Category'] == df_fin['MCategory'])) |
                 ((df_fin['MCategory'] == 'Female') & (df_fin['Category'] == df_fin['MCategory'])) |
                 (df_fin['MCategory'] == 'NaN') |
                 (df_fin['MCategory'] == 'D') |
                 (
                     (df_fin['MCategory'] != 'I') & (df_fin['MCategory'] != 'Male') &
                     (df_fin['MCategory'] != 'Female') & (df_fin['MCategory'] != 'D') &
                     (df_fin['MCategory'] != 'NaN')
                 )]
After this step, the data will look like this –
# Identifying Integer Column for next step
df_fin2['Catg'] = df_fin2.apply(lambda row: self.getIntVal(row), axis=1)
df_fin2['StAge'] = df_fin2.apply(lambda row: self.getNaN2Zero_StartAgeRange(row), axis=1)
df_fin2['EnAge'] = df_fin2.apply(lambda row: self.getNaN2Zero_EndAgeRange(row), axis=1)
# Dropping the old Columns
df_fin2.drop(['Category'], axis=1, inplace=True)
df_fin2.drop(['StartAgeRange'], axis=1, inplace=True)
df_fin2.drop(['EndAgeRange'], axis=1, inplace=True)
# Renaming the new columns
df_fin2.rename(columns={'Catg': 'Category'}, inplace=True)
df_fin2.rename(columns={'StAge': 'StartAgeRange'}, inplace=True)
df_fin2.rename(columns={'EnAge': 'EndAgeRange'}, inplace=True)
Now, the application removes NaN from these key columns in preparation for the important upcoming step.
After this step, the new data looks like –
So now it becomes easier to filter the data by matching the customer’s age against the age range in the next step, as follows –
# Handling special cases when Category from source & lookup file won't match
# alternative way to implement left outer join due to specific data scenarios
df_fin2['Flag'] = np.where(((df_fin2.StartAgeRange == 0) | (df_fin2.EndAgeRange == 0)) |
                           (((df_fin2.StartAgeRange > 0) & (df_fin2.EndAgeRange > 0)) &
                            ((df_fin2.Category >= df_fin2.StartAgeRange) &
                             (df_fin2.Category <= df_fin2.EndAgeRange))), 'Y', 'N')
After this, the new data looks like –
Finally, only the records flagged ‘Y’ are kept. And the data looks as follows –
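To see the flag logic in isolation, here is a minimal sketch with made-up rows: a row survives if it carries no age range at all (both bounds 0), or if the customer’s age falls inside the range –

import numpy as np
import pandas as p

# Made-up rows: Category holds the customer's age after the earlier cast.
t = p.DataFrame({'Category': [34, 34, 52],
                 'StartAgeRange': [0, 30, 30],
                 'EndAgeRange': [0, 40, 40]})
t['Flag'] = np.where(((t.StartAgeRange == 0) | (t.EndAgeRange == 0)) |
                     (((t.StartAgeRange > 0) & (t.EndAgeRange > 0)) &
                      ((t.Category >= t.StartAgeRange) & (t.Category <= t.EndAgeRange))),
                     'Y', 'N')
print(t[t['Flag'] == 'Y'])   # keeps row 0 (no range) & row 1 (34 lies in 30-40)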
Now, the application needs to consolidate the Bank Contribution, Start & End Age Range & re-pivot the data back to a single row per customer. The data should look like this –
Once this is done, our application is ready for all the aggregated data points.
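For reference, the consolidation just described boils down to the groupby over PKEY already shown inside the full script; a condensed sketch of that re-pivot step –

# Collapse the melted rows back to one row per customer: sum the numeric
# lookup attributes per PKEY, then join them back onto the original frame.
df_short = newDF[['PKEY', 'BankContribution', 'StartAgeRange', 'EndAgeRange']]
dfGroup = df_short.groupby(['PKEY']).sum()
df_rowlvl = df.merge(dfGroup, on=['PKEY'], how='inner')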
Hence, the three different categories of data transformation below are self-explanatory –
Data Point – 1:
##############################################################
#### Country & Gender wise Bank's Contribution #####
##############################################################
dfCountryGender = df_rowlvl[['Region', 'Gender', 'BankContribution']]
grouped_CG = dfCountryGender.groupby(['Region', 'Gender'])
dCountryGen = grouped_CG.aggregate(np.sum)
print("-" * 60)
print("Country & Gender wise Bank's Contribution")
print("-" * 60)
print(dCountryGen)
clog.logr('15.dCountryGen' + var + '.csv', Ind, dCountryGen, subdir)
###############################################################
###### End Of Country & Gender wise Bank's Contribution ######
###############################################################
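One small portability note: newer pandas releases warn when a NumPy callable such as np.sum is passed to aggregate; on a recent version, the equivalent spelling uses the built-in reducer –

# Same aggregation without passing the NumPy callable.
dCountryGen = dfCountryGender.groupby(['Region', 'Gender']).sum()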
Data Point – 2:
##############################################################
#### Country & Job wise Bank's Contribution #####
##############################################################
dfCountryJob = df_rowlvl[['Region', 'Job Classification', 'BankContribution']]
grouped_CJ = dfCountryJob.groupby(['Region', 'Job Classification'])
dCountryJob = grouped_CJ.aggregate(np.sum)
print("-" * 60)
print("Country & Job wise Bank's Contribution")
print("-" * 60)
print(dCountryJob)
clog.logr('16.dCountryJob' + var + '.csv', Ind, dCountryJob, subdir)
###############################################################
###### End Of Country & Job wise Bank's Contribution ######
###############################################################
Data Point – 3:
##############################################################
#### Country & Age wise Savings & Bank's Contribution #####
##############################################################
dfCountryAge = df_rowlvl[['Region', 'StartAgeRange', 'EndAgeRange', 'Balance', 'BankContribution']]
dfCountryAge['SavingsAmount'] = dfCountryAge.apply(lambda row: self.getSavingsAmount(row), axis=1)
grouped_CA = dfCountryAge.groupby(['Region', 'StartAgeRange', 'EndAgeRange'])
dCountryAge = grouped_CA.aggregate(np.sum)
print("-" * 60)
print("Country & Job wise Bank's Contribution")
print("-" * 60)
print(dCountryAge)
clog.logr('17.dCountryAge' + var + '.csv', Ind, dCountryAge, subdir)
##############################################################
#### End Of Country & Age wise Savings & Bank's #####
#### Contribution #####
##############################################################
Finally, these datasets invoke an Excel generator function that captures all of them in different sheets & beautifies the report, as follows –
##############################################################
##### Writing to Excel File with Different Tabular Sheet #####
##############################################################
dfs = [dCountryGen, dCountryJob, dCountryAge]
sheets = ['Country-Gender-Stats', 'Country-Job-Stats', 'Country-Age-Stats']
x = self.dfs_tabs(dfs, sheets, report_path+tgtFileName + var + '.xlsx')
##############################################################
##### End Of Excel Sheet Writing #####
##############################################################
Key snippets from this function –
writer = p.ExcelWriter(file_name, engine='xlsxwriter')
This step initializes the Excel writer with the xlsxwriter engine.
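A quick compatibility note: in pandas 2.x, writer.save() has been removed in favour of close(), & a with block handles both automatically; a minimal modern equivalent of this setup (reusing the same file_name, dataframe & sheet variables from the function above) –

# The context manager saves & closes the workbook on exit, so no explicit
# save()/close() pair is needed on newer pandas.
with p.ExcelWriter(file_name, engine='xlsxwriter') as writer:
    dataframe.to_excel(writer, sheet_name=sheet)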
for dataframe, sheet in zip(df_list, sheet_list):
    number_rows = int(dataframe.shape[0])
    number_cols = int(dataframe.shape[1])
In this step, the application iterates over the dataframe/sheet-name pairs, writing one sheet per dataframe & recording each frame’s row & column counts.
if cnt == 0:
    dataframe.to_excel(writer, sheet_name=sheet, startrow=7, startcol=5)
else:
    dataframe.to_excel(writer, sheet_name=sheet, startrow=5, startcol=0)
In this step, the first sheet’s data is written starting at startrow=7 (zero-indexed, i.e. row 8 in Excel), whereas the remaining two sheets capture data from startrow=5 (row 6 in Excel).
worksheet.set_column('A:E', 4)
worksheet.set_column('F:F', 20)
worksheet.set_column('G:G', 10)
worksheet.set_column('H:J', 20)
This sets the width of these columns.
# Insert an Image
worksheet.insert_image('E1', 'Logo.png', {'x_scale':0.6, 'y_scale':0.8})
In this case, the application inserts my blog logo at the top of every sheet of this Excel file.
# Add a number format for cells with money.
money_fmt = workbook.add_format({'num_format': '$#,##0', 'border': 1})
worksheet.set_column('H:H', 20, money_fmt)
Also, for the column holding monetary information, it applies a specific currency format.
# Define our range for color formatting
color_range = "F9:F{}".format(number_rows * 2 + 1)
# Add a format. Red fill with the dark red text
red_format = workbook.add_format({'bg_color':'#FEC7CE', 'font_color':'#0E0E08', 'border':1})
# Add a format. Green fill with the dark green text
green_format = workbook.add_format({'bg_color': '#D0FCA4', 'font_color': '#0E0E08', 'border': 1})
# Add a format. Cyan fill with the dark green text
mid_format = workbook.add_format({'bg_color': '#6FC2D8', 'font_color': '#0E0E08', 'border': 1})
# Add a format. Other fill with the dark green text
oth_format = workbook.add_format({'bg_color': '#AFC2D8', 'font_color': '#0E0E08', 'border': 1})
worksheet.conditional_format(color_range, {'type':'cell',
'criteria':'equal to',
'value':'"England"',
'format': green_format})
worksheet.conditional_format(color_range, {'type': 'cell',
'criteria': 'equal to',
'value': '"Northern Ireland"',
'format': mid_format})
worksheet.conditional_format(color_range, {'type': 'cell',
'criteria': 'equal to',
'value': '"Scotland"',
'format': red_format})
worksheet.conditional_format(color_range, {'type': 'cell',
'criteria': 'equal to',
'value': '"Wales"',
'format': oth_format})
In this step, the application color-codes the individual region cells to highlight each specific category for better visual decision-making.
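If you want to experiment with this pattern outside the full application, here is a small self-contained sketch (toy data & a hypothetical demo.xlsx file name) exercising the same cell-based conditional formatting –

import pandas as p

# Toy demo of xlsxwriter conditional formatting on a region column.
df_demo = p.DataFrame({'Region': ['England', 'Scotland', 'Wales'],
                       'Amount': [120, 80, 95]})
with p.ExcelWriter('demo.xlsx', engine='xlsxwriter') as writer:
    df_demo.to_excel(writer, sheet_name='Demo', index=False)
    wb, ws = writer.book, writer.sheets['Demo']
    green = wb.add_format({'bg_color': '#D0FCA4'})
    ws.conditional_format('A2:A4', {'type': 'cell', 'criteria': 'equal to',
                                    'value': '"England"', 'format': green})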
4. callPivotLookUp.py (This script calls the main pivot script & processes the data as per the business requirement. Hence, the name comes into the picture.)
#####################################################
### Objective: Purpose of this Library is to call ###
### the parse_and_write_csv method to produce the ###
### tokenized columns based on the look-up file.  ###
###                                               ###
### Arguments are as follows:                     ###
### Source File, Target File & Lookup Files.      ###
###                                               ###
#####################################################

import clsPivotLookUp as ct
from clsParam import clsParam as cf
import sys
import pandas as p
import clsLookUpDataRead as cl

def main():
    print("Calling the custom Package..")

    cnt_lkp = 0

    try:
        # Default Look up table
        Lkp_Filename = cf.config['LKP_FILE']

        # Adding New DB Table for Lookup
        x = cl.clsLookUpDataRead(Lkp_Filename)
        df_lkpF = x.ReadTable()

        cnt_lkp = df_lkpF.shape[0]

        if cnt_lkp > 0:
            df_lkpF_copy = df_lkpF.copy()

            # Getting all the unique file names
            df_list_F1 = list(df_lkpF_copy['TableName'].drop_duplicates())

            # File list which has Tokenization
            df_lkpF_Int = df_lkpF[(df_lkpF['Group'].str.len() >= 1)]
            df_list_F2 = list(df_lkpF_Int['TableName'].drop_duplicates())

            for i in df_list_F1:
                if i in df_list_F2:
                    try:
                        inputFile = i

                        print("*" * 30)
                        print("Reading from " + inputFile + ".csv")
                        print("*" * 30)

                        srcFileName = inputFile
                        tarFileName = srcFileName + '_processed'

                        x = ct.clsPivotLookUp(srcFileName, tarFileName, df_lkpF)
                        ret_val = x.parse_and_write_csv()

                        if ret_val == 0:
                            print("Writing to file -> (" + tarFileName + ".csv) Status: ", ret_val)
                        else:
                            if ret_val == 5:
                                print("File IO Error! Please check your directory whether the file exists with data!")
                            else:
                                print("Data Processing Issue!")

                        print("*" * 30)
                        print("Operation done for " + srcFileName + "!")
                        print("*" * 30)
                    except Exception as e:
                        x = str(e)
                        srcFileName = inputFile
                        print('Check the status of ' + srcFileName + ' ' + x)
                else:
                    pass
        else:
            print("No Matching Data to process!")

    except Exception as e:
        x = str(e)
        print(x)
        print("No Matching Data to process!")

if __name__ == "__main__":
    main()
And, the key snippet from here –
# Getting all the unique file names
df_list_F1 = list(df_lkpF_copy['TableName'].drop_duplicates())
# File list which has Tokenization
df_lkpF_Int = df_lkpF[(df_lkpF['Group'].str.len() >= 1)]
df_list_F2 = list(df_lkpF_Int['TableName'].drop_duplicates())
This identifies all the source files that share this kind of lookup configuration & processes them one by one.
x = ct.clsPivotLookUp(srcFileName, tarFileName, df_lkpF)
ret_val = x.parse_and_write_csv()
if ret_val == 0:
    print("Writing to file -> (" + tarFileName + ".csv) Status: ", ret_val)
else:
    if ret_val == 5:
        print("File IO Error! Please check your directory whether the file exists with data!")
    else:
        print("Data Processing Issue!")
This calls the main application class &, based on the returned result, captures the success or failure status.
Let’s check the directories on both Windows & MAC.
Windows:
MAC:
Let’s check the run process –
Windows:
MAC:
Let’s see how it looks in Excel –
Windows:
MAC:
So, finally, we’ve achieved our target.
Hooray! We’ve done it! 😀
I hope you’ll like this effort.
Wait for the next installment. Till then, Happy Avenging. 🙂
[Note: All the sample data are available in the public domain for research & study.]