Hi Guys!
Today, our main objective is to extend our last post by blending two different kinds of data using Python.
Please refer to the earlier post if you haven't gone through it – "Building Azure cosmos application."
What is the Objective?
In this post, our objective is to combine a traditional RDBMS in the cloud with Azure's NoSQL offering, which, in this case, is Cosmos DB, and then produce some kind of blended information, which can be aggregated further.
Examining Source Data.
NoSQL Data from Cosmos:
Let’s check the NoSQL data created in our last post one more time.

In total, we created 6 records in our last post.

As you can see in the red-marked areas, the Items view shows the total number of records created. You can also filter out specific records using the Edit Filter button (highlighted with the blue box), where you need to provide the WHERE clause.
Azure SQL DB:
Let’s create some data in Azure SQL DB.
But, before that, you need to create a SQL DB in the Azure cloud. You can refer to the official Microsoft documentation on creating a DB in Azure.
I won’t discuss the detailed steps of creating DB here.
From Azure portal, it looks like –

Let’s see how the data looks in Azure SQL DB. For our case, we’ll be using the hrMaster DB.
Let’s create the table & some sample data aligned as per our cosmos data.

We will join both datasets on subscriberId & then extract the required columns in our final output.
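To make that join concrete before we get to the full script, here is a minimal sketch of an inner merge on subscriberId with pandas, using a few hypothetical rows that mirror the two schemas (the sample values are mine, not from the actual DBs):

```python
import pandas as pd

# Hypothetical Cosmos DB records (mirrors the RealtimeEmail collection shape)
df_cosmos = pd.DataFrame({
    'subscriberId': ['S001', 'S002', 'S003'],
    'orderNo': [101, 102, 103],
    'sender': ['a@x.com', 'b@x.com', 'c@x.com']
})

# Hypothetical Azure SQL DB rows (mirrors dbo.onboardCustomer)
df_sql = pd.DataFrame({
    'subscriberId': ['S001', 'S002', 'S004'],
    'state': ['WA', 'CA', 'NY'],
    'country': ['USA', 'USA', 'USA'],
    'customerType': ['Gold', 'Silver', 'Gold']
})

# Inner join keeps only the subscribers present on both sides
df_blend = pd.merge(df_cosmos, df_sql, on='subscriberId', how='inner')
print(df_blend[['subscriberId', 'orderNo', 'sender', 'state', 'country', 'customerType']])
```

Switching how to 'left' would instead retain Cosmos records that have no matching onboardCustomer row.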

Good. Now, we’re ready for python scripts.
Python Scripts:
In this installment, we’ll be reusing the following python scripts, which were already discussed in my earlier post –
- clsL.py
- clsColMgmt.py
- clsCosmosDBDet.py
So, I’m not going to discuss these scripts.
Before we discuss our scripts, let’s look at the directory structure –

Here is the detailed directory structure between the Windows & MAC O/S.
1. clsConfig.py (This script contains all the configuration entries, including the Azure SQL DB & Cosmos DB details used by the application. Hence, the name comes into the picture.)
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 25-May-2019              ####
#### Updated On: 02-Jun-2019              ####
####                                      ####
#### Objective: This script is a config   ####
#### file, contains all the keys for      ####
#### azure cosmos db. Application will    ####
#### process these information & perform  ####
#### various CRUD operation on Cosmos DB. ####
##############################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    db_name = 'rnd-de01-usw2-vfa-cdb'
    db_link = 'dbs/' + db_name
    CONTAINER1 = "RealtimeEmail"
    CONTAINER2 = "RealtimeTwitterFeedback"
    CONTAINER3 = "RealtimeHR"

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'SERVER': 'xxxx-xxx.database.windows.net',
        'DATABASE_1': 'SalesForceMaster',
        'DATABASE_2': 'hrMaster',
        'DATABASE_3': 'statMaster',
        'USERNAME': 'admin_poc_dev',
        'PASSWORD': 'xxxxx',
        'DRIVER': '{ODBC Driver 17 for SQL Server}',
        'ENV': 'pocdev-saty',
        'ENCRYPT_FLAG': "yes",
        'TRUST_FLAG': "no",
        'TIMEOUT_LIMIT': "30",
        'PROCSTAT': "'Y'",
        'APP_ID': 1,
        'EMAIL_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcEmail.json',
        'TWITTER_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcTwitter.json',
        'HR_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcHR.json',
        'COSMOSDB_ENDPOINT': 'https://rnd-de01-usw2-vfa-cdb.documents.azure.com:443/',
        'CONFIG_TABLE': 'ETL_CONFIG_TAB',
        'COSMOS_PRIMARYKEY': "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXIsI00AxKXXXXXgg==",
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'COSMOSDB': db_name,
        'COSMOS_CONTAINER1': CONTAINER1,
        'COSMOS_CONTAINER2': CONTAINER2,
        'COSMOS_CONTAINER3': CONTAINER3,
        'CONFIG_ORIG': 'Config_orig.csv',
        'ENCRYPT_CSV': 'Encrypt_Config.csv',
        'DECRYPT_CSV': 'Decrypt_Config.csv',
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'APP_DESC_1': 'Feedback Communication',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'SQL_QRY_1': "SELECT c.subscriberId, c.sender, c.orderNo, c.orderDate, c.items.orderQty FROM RealtimeEmail c",
        'SQL_QRY_2': "SELECT c.twitterId, c.Twit, c.DateCreated, c.Country FROM RealtimeTwitterFeedback c WHERE c.twitterId=@CrVal",
        'DB_QRY': "SELECT * FROM c",
        'AZURE_SQL_1': "SELECT DISTINCT subscriberId, state, country, annualIncome, customerType FROM dbo.onboardCustomer",
        'COLLECTION_QRY': "SELECT * FROM r",
        'database_link': db_link,
        'collection_link_1': db_link + '/colls/' + CONTAINER1,
        'collection_link_2': db_link + '/colls/' + CONTAINER2,
        'collection_link_3': db_link + '/colls/' + CONTAINER3,
        'options': {
            'offerThroughput': 1000,
            'enableCrossPartitionQuery': True,
            'maxItemCount': 2
        }
    }
Here, we’ve added a couple more entries compared to last time, which point to the detailed configuration for Azure SQL DB.
'SERVER': 'xxxx-xxx.database.windows.net',
'DATABASE_1': 'SalesForceMaster',
'DATABASE_2': 'hrMaster',
'DATABASE_3': 'statMaster',
'USERNAME': 'admin_poc_dev',
'PASSWORD': 'xxxxx',
'DRIVER': '{ODBC Driver 17 for SQL Server}',
'ENV': 'pocdev-saty',
'ENCRYPT_FLAG': "yes",
'TRUST_FLAG': "no",
'TIMEOUT_LIMIT': "30",
'PROCSTAT': "'Y'",
Here, you need to supply your DB credentials accordingly.
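For reference, here is a minimal sketch of how these entries get assembled into the pyodbc connection string inside clsDBLookup (the credential values are the same placeholders as above):

```python
# Placeholder credentials, matching the config entries shown above
config = {
    'SERVER': 'xxxx-xxx.database.windows.net',
    'DATABASE_2': 'hrMaster',
    'USERNAME': 'admin_poc_dev',
    'PASSWORD': 'xxxxx',
    'DRIVER': '{ODBC Driver 17 for SQL Server}',
    'ENV': 'pocdev-saty',
    'ENCRYPT_FLAG': 'yes',
    'TRUST_FLAG': 'no',
    'TIMEOUT_LIMIT': '30',
}

# Assembling the ODBC connection string key by key, the same way
# clsDBLookup does before calling pyodbc.connect(str_conn)
str_conn = (
    'Driver=' + config['DRIVER'] +
    ';Server=tcp:' + config['SERVER'] + ',1433' +
    ';Database=' + config['DATABASE_2'] +
    ';Uid=' + config['USERNAME'] + '@' + config['ENV'] +
    ';Pwd=' + config['PASSWORD'] +
    ';Encrypt=' + config['ENCRYPT_FLAG'] +
    ';TrustServerCertificate=' + config['TRUST_FLAG'] +
    ';Connection Timeout=' + config['TIMEOUT_LIMIT'] + ';'
)
print(str_conn)
```

Note the Uid format of user@server-name, which Azure SQL DB expects.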
2. clsDBLookup.py (This script will look into the Azure SQL DB & fetch data from the traditional RDBMS of Azure environment.)
#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 25-May-2019                     ####
####                                             ####
#### Objective: This script will check &         ####
#### test the connection with the Azure          ####
#### SQL DB & it will fetch all the records      ####
#### residing under a table of the same DB.      ####
#####################################################

import pyodbc as py
import pandas as p
from clsConfig import clsConfig as cdc

class clsDBLookup(object):
    def __init__(self, lkpTableName = ''):
        self.server = cdc.config['SERVER']
        self.database = cdc.config['DATABASE_1']
        self.database1 = cdc.config['DATABASE_2']
        self.database2 = cdc.config['DATABASE_3']
        self.username = cdc.config['USERNAME']
        self.password = cdc.config['PASSWORD']
        self.driver = cdc.config['DRIVER']
        self.env = cdc.config['ENV']
        self.encrypt_flg = cdc.config['ENCRYPT_FLAG']
        self.trust_flg = cdc.config['TRUST_FLAG']
        self.timeout_limit = cdc.config['TIMEOUT_LIMIT']
        self.lkpTableName = cdc.config['CONFIG_TABLE']
        self.ProcStat = cdc.config['PROCSTAT']
        self.AppId = cdc.config['APP_ID']

    def LookUpData(self):
        try:
            # Assigning all the required values
            server = self.server
            database = self.database1
            username = self.username
            password = self.password
            driver = self.driver
            env = self.env
            encrypt_flg = self.encrypt_flg
            trust_flg = self.trust_flg
            timout_limit = self.timeout_limit
            lkpTableName = self.lkpTableName
            ProcStat = self.ProcStat
            AppId = self.AppId

            # Creating secure connection
            str_conn = 'Driver=' + driver + ';Server=tcp:' + server + ',1433;' \
                       'Database=' + database + ';Uid=' + username + '@' + env + ';' \
                       'Pwd=' + password + ';Encrypt=' + encrypt_flg + ';' \
                       'TrustServerCertificate=' + trust_flg + ';Connection Timeout=' + timout_limit + ';'

            db_con_azure = py.connect(str_conn)

            # Note the [dbo].[...] qualifier & the str() cast on the numeric AppId
            query = " SELECT [ruleId] as ruleId, [ruleName] as ruleName, [ruleSQL] as ruleSQL, " \
                    " [ruleFlag] as ruleFlag, [appId] as appId, [DBType] as DBType, " \
                    " [DBName] as DBName FROM [dbo].[" + lkpTableName + "] WHERE ruleFlag = " + ProcStat + " " \
                    " and appId = " + str(AppId) + " ORDER BY ruleId "

            df = p.read_sql(query, db_con_azure)

            # Closing the connection
            db_con_azure.close()

            return df
        except Exception as e:
            x = str(e)
            print(x)
            df = p.DataFrame()
            return df

    def azure_sqldb_read(self, sql):
        try:
            # Assigning all the required values
            server = self.server
            database = self.database1
            username = self.username
            password = self.password
            driver = self.driver
            env = self.env
            encrypt_flg = self.encrypt_flg
            trust_flg = self.trust_flg
            timout_limit = self.timeout_limit

            # Creating secure connection
            str_conn = 'Driver=' + driver + ';Server=tcp:' + server + ',1433;' \
                       'Database=' + database + ';Uid=' + username + '@' + env + ';' \
                       'Pwd=' + password + ';Encrypt=' + encrypt_flg + ';' \
                       'TrustServerCertificate=' + trust_flg + ';Connection Timeout=' + timout_limit + ';'

            # print("Connection Details:: ", str_conn)
            db_con_azure = py.connect(str_conn)
            query = sql

            df = p.read_sql(query, db_con_azure)

            # Closing the connection
            db_con_azure.close()

            return df
        except Exception as e:
            x = str(e)
            print(x)
            df = p.DataFrame()
            return df
Major lines to discuss –
azure_sqldb_read(self, sql):
Getting the source SQL supplied from the configuration script.
db_con_azure = py.connect(str_conn)
query = sql

df = p.read_sql(query, db_con_azure)
After creating a successful connection, our application will execute the SQL, fetch the data into a pandas dataframe & return the output to the primary calling function.
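The same read-into-dataframe pattern can be tried locally without any Azure credentials; here is a sketch that substitutes an in-memory SQLite DB for Azure SQL DB (the table & rows are hypothetical):

```python
import sqlite3
import pandas as p

# In-memory SQLite stands in for the Azure SQL DB connection
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE onboardCustomer (subscriberId TEXT, state TEXT)")
con.executemany("INSERT INTO onboardCustomer VALUES (?, ?)",
                [('S001', 'WA'), ('S002', 'CA')])

# The same pattern azure_sqldb_read uses: pass the SQL & the live connection
df = p.read_sql("SELECT subscriberId, state FROM onboardCustomer ORDER BY subscriberId", con)
con.close()
print(df)
```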
3. callCosmosAPI.py (This is the main script, which will call all the methods to blend the data. Hence, the name comes into the picture.)
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 25-May-2019              ####
#### Modified On 02-Jun-2019              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

import clsColMgmt as cm
import clsCosmosDBDet as cmdb
from clsConfig import clsConfig as cf
import pandas as p
import clsLog as cl
import logging
import datetime
import json
import clsDBLookup as dbcon

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

def getDate(row):
    try:
        d1 = row['orderDate']
        d1_str = str(d1)
        d1_dt_part, sec = d1_str.split('.')
        dt_part1 = d1_dt_part.replace('T', ' ')
        return dt_part1
    except Exception as e:
        x = str(e)
        print(x)
        dt_part1 = ''
        return dt_part1

# Lookup functions from
# Azure cloud SQL DB
var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        df_ret = p.DataFrame()
        df_ret_2 = p.DataFrame()
        df_ret_2_Mod = p.DataFrame()

        debug_ind = 'Y'

        # Initiating Log Class
        l = cl.clsLog()

        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'consolidated.log', level=logging.INFO)

        # Moving previous day log files to archive directory
        arch_dir = cf.config['ARCH_DIR']
        log_dir = cf.config['LOG_PATH']

        print("Archive Directory:: ", arch_dir)
        print("Log Directory::", log_dir)

        print("*" * 157)
        print("Testing COSMOS DB Connection!")
        print("*" * 157)

        # Checking Cosmos DB Azure
        y = cmdb.clsCosmosDBDet()
        ret_val = y.test_db_con()

        if ret_val == 0:
            print()
            print("Cosmos DB Connection Successful!")
            print("*" * 157)
        else:
            print()
            print("Cosmos DB Connection Failure!")
            print("*" * 157)
            raise Exception

        print("*" * 157)

        # Accessing from Azure SQL DB
        x1 = dbcon.clsDBLookup()
        act_df = x1.azure_sqldb_read(cf.config['AZURE_SQL_1'])

        print("Azure SQL DB::")
        print(act_df)
        print()

        print("-" * 157)

        # Calling the function 1
        print("RealtimeEmail::")

        # Fetching First collection data to dataframe
        print("Fetching Cosmos Collection Data!")

        sql_qry_1 = cf.config['SQL_QRY_1']
        msg = "Documents generated based on unique key"
        collection_flg = 1

        x = cm.clsColMgmt()
        df_ret = x.fetch_data(sql_qry_1, msg, collection_flg)

        l.logr('1.EmailFeedback_' + var + '.csv', debug_ind, df_ret, 'log')
        print('RealtimeEmail Data::')
        print(df_ret)
        print()

        # Checking execution status
        ret_val = int(df_ret.shape[0])

        if ret_val == 0:
            print("Cosmos DB hasn't returned any rows. Please check your queries!")
            print("*" * 157)
        else:
            print("Successfully fetched!")
            print("*" * 157)

        # Calling the 2nd Collection
        print("RealtimeTwitterFeedback::")

        # Fetching Second collection data to dataframe
        print("Fetching Cosmos Collection Data!")

        # Query using parameters
        sql_qry_2 = cf.config['SQL_QRY_2']
        msg_2 = "Documents generated based on RealtimeTwitterFeedback feed!"
        collection_flg = 2

        val = 'crazyGo'
        param_det = [{"name": "@CrVal", "value": val}]
        add_param = 2

        x1 = cm.clsColMgmt()
        df_ret_2 = x1.fetch_data(sql_qry_2, msg_2, collection_flg, add_param, param_det)

        l.logr('2.TwitterFeedback_' + var + '.csv', debug_ind, df_ret_2, 'log')
        print('Realtime Twitter Data:: ')
        print(df_ret_2)
        print()

        # Checking execution status
        ret_val_2 = int(df_ret_2.shape[0])

        if ret_val_2 == 0:
            print("Cosmos DB hasn't returned any rows. Please check your queries!")
            print("*" * 157)
        else:
            print("Successfully fetched rows!")
            print("*" * 157)

        # Merging NoSQL Data (Cosmos DB) with Relational DB (Azure SQL DB)
        df_Fin_temp = p.merge(df_ret, act_df, on='subscriberId', how='inner')

        df_fin = df_Fin_temp[['orderDate', 'orderNo', 'sender', 'state', 'country', 'customerType']]

        print("Initial Combined Data (From Cosmos & Azure SQL DB) :: ")
        print(df_fin)

        l.logr('3.InitCombine_' + var + '.csv', debug_ind, df_fin, 'log')

        # Transforming the orderDate as per standard format
        df_fin['orderDateM'] = df_fin.apply(lambda row: getDate(row), axis=1)

        # Dropping the old column & renaming the new column to old column
        df_fin.drop(columns=['orderDate'], inplace=True)
        df_fin.rename(columns={'orderDateM': 'orderDate'}, inplace=True)

        print("*" * 157)
        print()
        print("Final Combined & Transformed result:: ")
        print(df_fin)

        l.logr('4.Final_Combine_' + var + '.csv', debug_ind, df_fin, 'log')
        print("*" * 157)

    except ValueError:
        print("No relevant data to proceed!")

    except Exception as e:
        # e.message doesn't exist in Python 3; str(e) is the safe alternative
        print("Top level Error: args:{0}, message:{1}".format(e.args, str(e)))

if __name__ == "__main__":
    main()
The key lines from this script –
def getDate(row):
    try:
        d1 = row['orderDate']
        d1_str = str(d1)
        d1_dt_part, sec = d1_str.split('.')
        dt_part1 = d1_dt_part.replace('T', ' ')
        return dt_part1
    except Exception as e:
        x = str(e)
        print(x)
        dt_part1 = ''
        return dt_part1
This function converts the NoSQL date data type into a more familiar format.
NoSQL Date:

Transformed Date:

# Accessing from Azure SQL DB
x1 = dbcon.clsDBLookup()
act_df = x1.azure_sqldb_read(cf.config['AZURE_SQL_1'])

print("Azure SQL DB::")
print(act_df)
print()
The above lines call the Azure SQL DB method to retrieve the RDBMS data into our dataframe.
# Merging NoSQL Data (Cosmos DB) with Relational DB (Azure SQL DB)
df_Fin_temp = p.merge(df_ret, act_df, on='subscriberId', how='inner')

df_fin = df_Fin_temp[['orderDate', 'orderNo', 'sender', 'state', 'country', 'customerType']]
In the above lines, we’re joining the data retrieved from two different kinds of databases to prepare our initial combined dataframe. Also, we’ve picked only the desired columns, which will be useful for us.
# Transforming the orderDate as per standard format
df_fin['orderDateM'] = df_fin.apply(lambda row: getDate(row), axis=1)

# Dropping the old column & renaming the new column to old column
df_fin.drop(columns=['orderDate'], inplace=True)
df_fin.rename(columns={'orderDateM': 'orderDate'}, inplace=True)
In the above lines, we’re transforming our date field, as shown above in one of our previous images by calling the getDate method.
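As a side note, the same transformation can also be done in a vectorized way with pd.to_datetime instead of a row-wise apply; here is a sketch with made-up sample timestamps:

```python
import pandas as pd

# Hypothetical frame with Cosmos-style timestamps
df = pd.DataFrame({'orderDate': ['2019-06-02T11:35:12.1234567',
                                 '2019-06-03T09:10:45.0000001']})

# Parse the ISO timestamps once, then format them in a single pass
df['orderDate'] = (pd.to_datetime(df['orderDate'])
                     .dt.strftime('%Y-%m-%d %H:%M:%S'))
print(df['orderDate'].tolist())
```

For larger frames, this avoids the per-row Python overhead of apply.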
Let’s see the directory structure of our program –

Let’s see how it looks when it runs –
Windows:


MAC:


So, finally, we’ve successfully blended the data & made a more meaningful data projection.
The following python packages are required to run this application –
pip install azure
pip install azure-cosmos
pip install pandas
pip install requests
pip install pyodbc
This application was tested on Python 3.7.1 & Python 3.7.2 as well. As per Microsoft, their officially supported version is Python 3.5.
I hope you’ll like this effort.
Wait for the next installment. Till then, Happy Avenging. 😀
[Note: All the sample data are available/prepared in the public domain for research & study.]