Pandas Archives

Building an Azure Cosmos DB solution using Python & Pandas (A crossover of Space Stone, Reality Stone, Soul Stone & Time Stone)

Posted on May 28, 2019 (updated January 2, 2021) by SatyakiDe in api, call, Data Science, data warehouse, function, member function, Pandas, Python, sql, Technology

Hi Guys,

Here is the latest installment from the Python verse. For the first time, we’ll be using Python with the Azure cloud, with help from Pandas & json.

Why post on this topic?

I always try to post something based on real use cases, which might be useful in real-life scenarios. On top of that, I haven’t found many significant posts on Azure dealing with Python. So, I thought of sharing some first use cases, which will encourage others to join this club & use more Python-based applications on the Azure platform.

First, let us check the complexity of today’s post & our objective.

What is the objective?

Today, our objective is to load a couple of JSON payloads, store them in multiple Cosmos containers & finally fetch the data from Cosmos DB & store the output in our log files, apart from printing the same on the terminal screen.

Before we start discussing our post, let us explain some basic terminology of Azure Cosmos DB, so that the next time we refer to these terms, it will be easier for you to understand them.

Learning basic Azure terminology.

Since this is a schema-free (NoSQL) DB, all the data will be stored in the following fashion:

Azure Cosmos DB -> Container -> Items

Let’s simplify this in words. Each Azure Cosmos DB database may have multiple containers, which you can compare with the tables of a conventional RDBMS. Under each container, you will have multiple items, which represent the rows of an RDBMS table. The only difference is that each item may have a different number of elements, which are equivalent to the columns of a traditional RDBMS table. A traditional table always has a fixed number of columns.
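
To make the container/item idea concrete, here is a minimal sketch in plain Python. The two items and their field names are hypothetical (they are not the real payloads from this post), and column_union is a helper made up for this illustration. It shows that items in one container can carry different fields, whereas a table would need the union of all the columns for every row.

```python
# Two hypothetical items that could live in the same container.
# The field names are illustrative, not taken from the real payloads.
items = [
    {"id": "1", "sender": "a@example.com", "orderNo": "ORD-1"},
    {"id": "2", "sender": "b@example.com", "orderNo": "ORD-2", "orderDate": "2019-05-25"},
]

def column_union(rows):
    """Return every field name seen across the items, in first-seen order."""
    seen = []
    for row in rows:
        for key in row:
            if key not in seen:
                seen.append(key)
    return seen

columns = column_union(items)

# An RDBMS table would need all of these columns for every row;
# missing values would have to be stored as NULL (None here).
table_rows = [[row.get(col) for col in columns] for row in items]

print(columns)
for r in table_rows:
    print(r)
```

The first item simply has no orderDate, so the table-style view has to pad it with None.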

Input Payload:

Let’s review three different payloads, which we’ll be loading into three separate containers.

srcEmail.json

As you can see in the items, the first sub-row has 3 elements, whereas the second one has 4. In a traditional RDBMS, the table will always have the same number of columns.

srcTwitter.json

srcHR.json

So, from the above three sample payloads, our application will try to collect users’ feedback & consolidate it in a single place for better product forecasts.

Azure Portal:

Let’s look into the Azure portal & identify a couple of crucial pieces of information, which the Python scripts will require for authentication. But, before that, I’ll show how to get those details, step by step:

As highlighted in red, click Azure Cosmos DB. You will find the following screen:

If you click this, you will find all the collections/containers that are part of the same DB, as follows:

After that, we’ll extract the Cosmos key & the endpoint/URI from the portal. Without these, the Python application won’t be able to interact with Azure Cosmos DB. This is sensitive information. So, I’ll be providing some dummy details here just to show how to extract it. Never share these details with anyone outside of your project or group.
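
Since the key & the endpoint are sensitive, one common option is to keep them out of the source code altogether. The sketch below is only a suggestion and not how the scripts in this post work: the environment variable names COSMOSDB_ENDPOINT & COSMOS_PRIMARYKEY are assumptions chosen to mirror the config keys, and load_cosmos_credentials & mask are hypothetical helpers.

```python
import os

def load_cosmos_credentials(env=os.environ):
    """Read the endpoint and key from environment variables.

    COSMOSDB_ENDPOINT and COSMOS_PRIMARYKEY are hypothetical variable
    names chosen to mirror the config keys used in this post.
    """
    endpoint = env.get("COSMOSDB_ENDPOINT")
    primary_key = env.get("COSMOS_PRIMARYKEY")
    if not endpoint or not primary_key:
        raise RuntimeError("Set COSMOSDB_ENDPOINT and COSMOS_PRIMARYKEY first.")
    return endpoint, primary_key

def mask(secret, visible=4):
    """Hide all but the last few characters before printing or logging."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]

# Demo with a dummy mapping instead of the real environment
demo_env = {
    "COSMOSDB_ENDPOINT": "https://example-account.documents.azure.com:443/",
    "COSMOS_PRIMARYKEY": "dummy-key-0000",
}
endpoint, key = load_cosmos_credentials(demo_env)
print(endpoint, mask(key))
```

In a real run, you would call load_cosmos_credentials() without arguments, so that the values come from the environment of the machine.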

Good. Now, we’re ready for python scripts.

Python Scripts:

In this installment, we’ll be reusing the following Python script, which has already been discussed in my earlier post:

clsL.py

So, I’m not going to discuss this script.

Before we discuss our scripts, let’s look at the directory structure:

1. clsConfig.py (This script holds all the configuration for the application: the Cosmos DB endpoint & key, the container names, the source file paths & the queries. Hence, the name comes into the picture.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 25-May-2019              ####
####                                      ####
#### Objective: This script is a config   ####
#### file, contains all the keys for      ####
#### azure cosmos db. Application will    ####
#### process these information & perform  ####
#### various CRUD operation on Cosmos DB. ####
##############################################
import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))
    db_name = 'rnd-de01-usw2-vfa-cdb'
    db_link = 'dbs/' + db_name
    CONTAINER1 = "RealtimeEmail"
    CONTAINER2 = "RealtimeTwitterFeedback"
    CONTAINER3 = "RealtimeHR"

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'EMAIL_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcEmail.json',
        'TWITTER_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcTwitter.json',
        'HR_SRC_JSON_FILE': Curr_Path + sep + 'src_file' + sep + 'srcHR.json',
        'COSMOSDB_ENDPOINT': 'https://rnd-de01-usw2-vfa-cdb.documents.azure.com:443/',
        'COSMOS_PRIMARYKEY': "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXIsI00AxKXXXXXgg==",
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'COSMOSDB': db_name,
        'COSMOS_CONTAINER1': CONTAINER1,
        'COSMOS_CONTAINER2': CONTAINER2,
        'COSMOS_CONTAINER3': CONTAINER3,
        'CONFIG_ORIG': 'Config_orig.csv',
        'ENCRYPT_CSV': 'Encrypt_Config.csv',
        'DECRYPT_CSV': 'Decrypt_Config.csv',
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'APP_DESC_1': 'Feedback Communication',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'SQL_QRY_1': "SELECT c.subscriberId, c.sender, c.orderNo, c.orderDate, c.items.orderQty  FROM RealtimeEmail c",
        'SQL_QRY_2': "SELECT c.twitterId, c.Twit, c.DateCreated, c.Country FROM RealtimeTwitterFeedback c WHERE c.twitterId=@CrVal",
        'DB_QRY': "SELECT * FROM c",
        'COLLECTION_QRY': "SELECT * FROM r",
        'database_link': db_link,
        'collection_link_1': db_link + '/colls/' + CONTAINER1,
        'collection_link_2': db_link + '/colls/' + CONTAINER2,
        'collection_link_3': db_link + '/colls/' + CONTAINER3,
        'options': {
            'offerThroughput': 1000,
            'enableCrossPartitionQuery': True,
            'maxItemCount': 2
        }
    }
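
The config above picks the path separator by checking the operating system. As an alternative sketch (a suggestion, not the author’s method), os.path.join from the standard library can build the same kind of paths without that check. build_paths is a hypothetical helper, and only two of the config keys are reproduced here.

```python
import os

def build_paths(base_dir):
    """Build the file and folder paths with os.path.join.

    os.path.join picks the right separator for the current OS, so no
    manual check of platform.system() is needed.
    """
    return {
        'EMAIL_SRC_JSON_FILE': os.path.join(base_dir, 'src_file', 'srcEmail.json'),
        # The trailing '' keeps the final separator, like Curr_Path + sep + 'log' + sep
        'LOG_PATH': os.path.join(base_dir, 'log', ''),
    }

paths = build_paths(os.getcwd())
print(paths['EMAIL_SRC_JSON_FILE'])
print(paths['LOG_PATH'])
```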

2. clsCosmosDBDet.py (This script will test the connection to Azure Cosmos DB from the Python application. If it is successful, it will fetch the details of all the collections/containers that reside under the same DB. Hence, the name comes into the picture.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 25-May-2019              ####
####                                      ####
#### Objective: This script will check &  ####
#### test the connection with the Cosmos  ####
#### & it will fetch all the collection   ####
#### names residing under the same DB.    ####
##############################################

import azure.cosmos.cosmos_client as cosmos_client
import azure.cosmos.errors as errors

from clsConfig import clsConfig as cf

class IDisposable(cosmos_client.CosmosClient):
    def __init__(self, obj):
        self.obj = obj

    def __enter__(self):
        return self.obj

    def __exit__(self, exception_type, exception_val, trace):
        self = None

class clsCosmosDBDet:
    def __init__(self):
        self.endpoint = cf.config['COSMOSDB_ENDPOINT']
        self.primarykey = cf.config['COSMOS_PRIMARYKEY']
        self.db = cf.config['COSMOSDB']
        self.cont_1 = cf.config['COSMOS_CONTAINER1']
        self.cont_2 = cf.config['COSMOS_CONTAINER2']
        self.cont_3 = cf.config['COSMOS_CONTAINER3']
        self.database_link = cf.config['database_link']
        self.collection_link_1 = cf.config['collection_link_1']
        self.collection_link_2 = cf.config['collection_link_2']
        self.collection_link_3 = cf.config['collection_link_3']
        self.options = cf.config['options']
        self.db_qry = cf.config['DB_QRY']
        self.collection_qry = cf.config['COLLECTION_QRY']

    def list_Containers(self, client):
        try:
            database_link = self.database_link
            collection_qry = self.collection_qry
            print("1. Query for collection!")
            print()

            collections = list(client.QueryContainers(database_link, {"query": collection_qry}))

            if not collections:
                return

            for collection in collections:
                print(collection['id'])

            print()

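        except errors.HTTPFailure as e:
            if e.status_code == 404:
                print("*" * 157)
                # Report the database link that was queried (the original passed the built-in id)
                print('No collection found under \'{0}\''.format(database_link))
                print("*" * 157)
            else:
                # Re-raise the original error so that its details are preserved
                raise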

    def test_db_con(self):
        endpoint = self.endpoint
        primarykey = self.primarykey
        options_1 = self.options
        db_qry = self.db_qry

        with IDisposable(cosmos_client.CosmosClient(url_connection=endpoint, auth={'masterKey': primarykey})) as client:
            try:
                try:
                    options = {}
                    query = {"query": db_qry}
                    options = options_1

                    print("-" * 157)
                    print('Options:: ', options)
                    print()
                    print("Database details:: ")

                    result_iterable = client.QueryDatabases(query, options)

                    for item in iter(result_iterable):
                        print(item)

                    print("-" * 157)

                except errors.HTTPFailure as e:
                    if e.status_code == 409:
                        pass
                    else:
                        raise errors.HTTPFailure(e.status_code)

                self.list_Containers(client)

                return 0

            except errors.HTTPFailure as e:
                print("Application has caught an error. {0}".format(str(e)))

                return 1

            finally:
                print("Application successfully completed!")

Key lines from the above scripts are –

with IDisposable(cosmos_client.CosmosClient(url_connection=endpoint, auth={'masterKey': primarykey})) as client:

In this step, the python application is building the connection object.
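
The IDisposable wrapper simply hands the wrapped client back to the with block. The same idea can be sketched with contextlib from the standard library. FakeClient below is a stand-in so that the example runs without Azure, & the closed flag is an assumption added only to make the clean-up step visible.

```python
from contextlib import contextmanager

class FakeClient:
    """Stand-in for the Cosmos client so the sketch runs without Azure."""
    def __init__(self, url):
        self.url = url
        self.closed = False

@contextmanager
def disposable(obj):
    """Hand the wrapped object to the 'with' block, then mark it as released."""
    try:
        yield obj
    finally:
        # Runs even if the block raises; a real client may need no clean-up at all
        obj.closed = True

with disposable(FakeClient("https://example.documents.azure.com:443/")) as client:
    inside = client.closed   # still usable inside the block
    print(client.url)

print(inside, client.closed)
```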

# Refer the entry in our config file
self.db_qry = cf.config['DB_QRY']
..
query = {"query": db_qry}
options = options_1
..

result_iterable = client.QueryDatabases(query, options)

Based on the supplied value from our configuration python script, this will extract the cosmos DB information.

self.list_Containers(client)

This is a function that will identify all the collections under this DB.

def list_Containers(self, client):

..
collections = list(client.QueryContainers(database_link, {"query": collection_qry}))

if not collections:
    return

for collection in collections:
    print(collection['id'])

In these above lines, our application will actually fetch the containers that are associated with this DB.

3. clsColMgmt.py (This script manages the data in the Cosmos containers: it creates new items & fetches the data back, either as the entire dataset or based on specific filters. Hence, the name comes into the picture.)

################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 25-May-2019                ####
####                                        ####
#### Objective: This script has multiple    ####
#### features. You can create new items     ####
#### in azure cosmos db. Apart from that    ####
#### you can retrieve data from Cosmos just ####
#### for viewing purpose. You can display   ####
#### data based on specific filters or the  ####
#### entire dataset. Hence, three different ####
#### methods provided here to support this. ####
################################################

import azure.cosmos.cosmos_client as cosmos_client
import azure.cosmos.errors as errors
import pandas as p
import json

from clsConfig import clsConfig as cf

class IDisposable(cosmos_client.CosmosClient):
    def __init__(self, obj):
        self.obj = obj

    def __enter__(self):
        return self.obj

    def __exit__(self, exception_type, exception_val, trace):
        self = None

class clsColMgmt:
    def __init__(self):
        self.endpoint = cf.config['COSMOSDB_ENDPOINT']
        self.primarykey = cf.config['COSMOS_PRIMARYKEY']
        self.db = cf.config['COSMOSDB']
        self.cont_1 = cf.config['COSMOS_CONTAINER1']
        self.cont_2 = cf.config['COSMOS_CONTAINER2']
        self.cont_3 = cf.config['COSMOS_CONTAINER3']
        self.database_link = cf.config['database_link']
        self.collection_link_1 = cf.config['collection_link_1']
        self.collection_link_2 = cf.config['collection_link_2']
        self.collection_link_3 = cf.config['collection_link_3']
        self.options = cf.config['options']
        self.db_qry = cf.config['DB_QRY']
        self.collection_qry = cf.config['COLLECTION_QRY']

    # Creating cosmos items in container
    def CreateDocuments(self, inputJson, collection_flg = 1):
        try:
            # Declaring variable
            endpoint = self.endpoint
            primarykey = self.primarykey

            print('Creating Documents')

            with IDisposable(cosmos_client.CosmosClient(url_connection=endpoint, auth={'masterKey': primarykey})) as client:
                try:
                    if collection_flg == 1:
                        collection_link = self.collection_link_1
                    elif collection_flg == 2:
                        collection_link = self.collection_link_2
                    else:
                        collection_link = self.collection_link_3

                    container = client.ReadContainer(collection_link)

                    # The payload can have nested properties and various types, including numbers, dates and strings.
                    # It can be saved as JSON as is, without converting it into rows/columns.
                    print('Input Json:: ', str(inputJson))
                    nSon = json.dumps(inputJson)
                    json_rec = json.loads(nSon)

                    client.CreateItem(container['_self'], json_rec)

                except errors.HTTPFailure as e:
                    print("Application has caught an error. {0}".format(e.status_code))

                finally:
                    print("Application successfully completed!")

            return 0
        except Exception as e:
            x = str(e)
            print(x)
            return 1

    def CosmosDBCustomQuery_PandasCSVWithParam(self, client, collection_link, query_with_optional_parameters, message="Documents found by query: ", options_sql = {}):
        try:
            # Reading data by SQL & converting it to a Pandas DataFrame
            results = list(client.QueryItems(collection_link, query_with_optional_parameters, options_sql))
            cnt = 0

            dfSrc = p.DataFrame()
            dfRes = p.DataFrame()
            dfSrc2 = p.DataFrame()
            json_data = ''

            for doc in results:
                cnt += 1

            dfSrc = p.io.json.json_normalize(results)
            dfSrc.columns = dfSrc.columns.map(lambda x: x.split(".")[-1])
            dfRes = dfSrc

            print("Total records fetched: ", cnt)
            print("*" * 157)

            return dfRes
        except errors.HTTPFailure as e:
            Df_Fin = p.DataFrame()
            if e.status_code == 404:
                print("*" * 157)
                print("Document doesn't exist")
                print("*" * 157)
                return Df_Fin
            elif e.status_code == 400:
                print("*" * 157)
                print("Bad request exception occurred: ", e)
                print("*" * 157)
                return Df_Fin
            else:
                return Df_Fin
        finally:
            print()

    def CosmosDBCustomQuery_PandasCSV(self, client, collection_link, query_with_optional_parameters, message="Documents found by query: ", options_sql = {}):
        try:
            # Reading data by SQL & converting it to a Pandas DataFrame
            results = list(client.QueryItems(collection_link, query_with_optional_parameters, options_sql))
            cnt = 0

            dfSrc = p.DataFrame()
            dfRes = p.DataFrame()
            dfSrc2 = p.DataFrame()
            json_data = ''

            for doc in results:
                cnt += 1

            dfSrc = p.io.json.json_normalize(results)
            dfSrc.columns = dfSrc.columns.map(lambda x: x.split(".")[-1])
            dfRes = dfSrc

            print("Total records fetched: ", cnt)
            print("*" * 157)

            return dfRes
        except errors.HTTPFailure as e:
            Df_Fin = p.DataFrame()
            if e.status_code == 404:
                print("*" * 157)
                print("Document doesn't exist")
                print("*" * 157)
                return Df_Fin
            elif e.status_code == 400:
                print("*" * 157)
                print("Bad request exception occurred: ", e)
                print("*" * 157)
                return Df_Fin
            else:
                return Df_Fin
        finally:
            print()

    def fetch_data(self, sql_qry, msg="", collection_flg = 1, additional_params = 1, param_det=[]):
        endpoint = self.endpoint
        primarykey = self.primarykey
        options_1 = self.options

        with IDisposable(cosmos_client.CosmosClient(url_connection=endpoint, auth={'masterKey': primarykey})) as client:
            try:
                if collection_flg == 1:
                    collection_link = self.collection_link_1
                elif collection_flg == 2:
                    collection_link = self.collection_link_2
                else:
                    collection_link = self.collection_link_3

                print("Additional parameters: ", additional_params)

                message = msg
                options = options_1

                if additional_params == 1:
                    query = {"query": sql_qry}
                    df_Fin = self.CosmosDBCustomQuery_PandasCSV(client, collection_link, query, message, options)
                else:
                    query = {"query": sql_qry, "parameters": param_det}
                    df_Fin = self.CosmosDBCustomQuery_PandasCSVWithParam(client, collection_link, query, message, options)

                return df_Fin
            except errors.HTTPFailure as e:
                print("Application has caught an error. {0}".format(str(e)))

            finally:
                print("Application successfully completed!")

Key lines from the above script –

def CosmosDBCustomQuery_PandasCSV(self, client, collection_link, query_with_optional_parameters, message="Documents found by query: ", options_sql = {}):

This method is generic. It will fetch all the records of a cosmos container.

results = list(client.QueryItems(collection_link, query_with_optional_parameters, options_sql))
..
for doc in results:
    cnt += 1

dfSrc = p.io.json.json_normalize(results)
dfSrc.columns = dfSrc.columns.map(lambda x: x.split(".")[-1])
dfRes = dfSrc

In this step, the application fetches the data in the form of JSON, flattens the nested structure & finally stores the result in a pandas dataframe, which it returns.

The function CosmosDBCustomQuery_PandasCSVWithParam is the same as the previous function. The only difference is that it can process parameters to filter the data.
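
The flattening & column renaming described above can be sketched without pandas. The flatten function below is a simplified stand-in for what json_normalize does with nested objects (it is not the pandas implementation), & the document is hypothetical. The point worth noticing is that keeping only the last segment of each dotted name, as the split(".")[-1] line does, can produce duplicate column names.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}.

    This is a simplified stand-in for what json_normalize does with
    nested objects; lists are left untouched.
    """
    flat = {}
    for key, value in doc.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical document; the field names are made up for illustration.
doc = {"orderNo": "ORD-1", "customer": {"id": "C1"}, "shipping": {"id": "S9", "city": "Pune"}}

flat = flatten(doc)
short_names = [name.split(".")[-1] for name in flat]

print(list(flat))     # dotted names
print(short_names)    # names after keeping only the last segment
```

Here both customer.id & shipping.id end up as id. So, if your documents repeat a field name at different nesting levels, it may be safer to keep the dotted names.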

def fetch_data(self, sql_qry, msg="", collection_flg = 1, additional_params = 1, param_det=[]):

This is the primary calling function. Let us find out the key lines –

if collection_flg == 1:
    collection_link = self.collection_link_1
elif collection_flg == 2:
    collection_link = self.collection_link_2
else:
    collection_link = self.collection_link_3

Based on the collection_flg supplied from the main script, our application identifies the collection where we need to process/load our data.
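
As the number of containers grows, the if/elif chain can be replaced by a dictionary lookup. This is a suggested alternative, not the original code. The links follow the same layout as in clsConfig, & pick_collection_link is a hypothetical helper.

```python
# Same link layout as in clsConfig; the dictionary lookup is a suggested alternative.
db_link = 'dbs/' + 'rnd-de01-usw2-vfa-cdb'
COLLECTION_LINKS = {
    1: db_link + '/colls/RealtimeEmail',
    2: db_link + '/colls/RealtimeTwitterFeedback',
    3: db_link + '/colls/RealtimeHR',
}

def pick_collection_link(collection_flg=1):
    """Return the link for the flag; unknown flags fall back to container 3,
    which matches the 'else' branch of the if/elif chain."""
    return COLLECTION_LINKS.get(collection_flg, COLLECTION_LINKS[3])

print(pick_collection_link(2))
```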

if additional_params == 1:
    query = {"query": sql_qry}
    df_Fin = self.CosmosDBCustomQuery_PandasCSV(client, collection_link, query, message, options)
else:
    query = {"query": sql_qry, "parameters": param_det}
    df_Fin = self.CosmosDBCustomQuery_PandasCSVWithParam(client, collection_link, query, message, options)

Based on the supplied additional_params value, the Python application decides whether the query needs filter parameters & invokes the corresponding function.
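
Since the two branches differ only in the optional "parameters" entry, the query dictionary can also be built in one place. build_query below is a hypothetical helper, not part of the original scripts. It only prepares the dictionary & does not talk to Cosmos DB.

```python
def build_query(sql_qry, param_det=None):
    """Build the query dictionary that the script passes to QueryItems.

    build_query is a hypothetical helper, not part of the original scripts.
    """
    query = {"query": sql_qry}
    if param_det:
        # Each parameter is a {"name": "@Name", "value": ...} pair
        query["parameters"] = list(param_det)
    return query

plain = build_query("SELECT * FROM c")
filtered = build_query(
    "SELECT c.twitterId, c.Twit FROM RealtimeTwitterFeedback c WHERE c.twitterId=@CrVal",
    [{"name": "@CrVal", "value": "crazyGo"}],
)
print(plain)
print(filtered)
```

Passing the value as a parameter keeps it out of the SQL text, so the same query string can be reused with different values.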

def CreateDocuments(self, inputJson, collection_flg = 1):

This is the primary method for creating items/rows.

if collection_flg == 1:
    collection_link = self.collection_link_1
elif collection_flg == 2:
    collection_link = self.collection_link_2
else:
    collection_link = self.collection_link_3

container = client.ReadContainer(collection_link)

Based on the collection flag, our application points to a specific container & reads its details from Cosmos DB.

nSon = json.dumps(inputJson)
json_rec = json.loads(nSon)

client.CreateItem(container['_self'], json_rec)

Once the application receives the input payload, it converts it into a valid JSON payload & then sends it to the create item method to insert the records.
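
The json.dumps & json.loads round trip acts as a quick check that the payload contains only JSON-compatible values. Here is a minimal sketch of that behaviour with made-up payloads. to_json_record is a hypothetical wrapper around the same two calls.

```python
import datetime
import json

def to_json_record(payload):
    """Round-trip through json.dumps/json.loads, as CreateDocuments does.

    The result contains only JSON types; a TypeError is raised if the
    payload holds something JSON cannot represent.
    """
    return json.loads(json.dumps(payload))

ok = to_json_record({"id": "1", "qty": (1, 2)})   # the tuple becomes a list
print(ok)

try:
    to_json_record({"id": "2", "orderDate": datetime.date(2019, 5, 25)})
    failed = False
except TypeError as exc:
    failed = True
    print("Not JSON serializable:", exc)
```

So, values such as dates need to be converted to strings before the payload reaches this step.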

4. callCosmosAPI.py (This script is the main calling function. Hence, the name comes into the picture.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 25-May-2019              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

import clsColMgmt as cm
import clsCosmosDBDet as cmdb
from clsConfig import clsConfig as cf
import pandas as p
import clsL as cl
import logging
import datetime
import json

# Disabling Warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# Lookup functions from
# Azure cloud SQL DB


def main():
    try:
        df_ret = p.DataFrame()
        df_ret_2 = p.DataFrame()
        df_ret_2_Mod = p.DataFrame()

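        debug_ind = 'Y'

        # Timestamp used as a suffix in the log file names below.
        # The exact format is an assumption; 'var' was left undefined in this listing.
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")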

        # Initiating Log Class
        l = cl.clsL()

        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'consolidated.log', level=logging.INFO)

        # Moving previous day log files to archive directory
        arch_dir = cf.config['ARCH_DIR']
        log_dir = cf.config['LOG_PATH']

        print("Archive Directory:: ", arch_dir)
        print("Log Directory::", log_dir)

        print("*" * 157)
        print("Testing COSMOS DB Connection!")
        print("*" * 157)

        # Checking Cosmos DB Azure
        y = cmdb.clsCosmosDBDet()
        ret_val = y.test_db_con()

        if ret_val == 0:
            print()
            print("Cosmos DB Connection Successful!")
            print("*" * 157)
        else:
            print()
            print("Cosmos DB Connection Failure!")
            print("*" * 157)
            raise Exception

        print("*" * 157)

        # Creating Data in Cosmos DB
        print()
        print('Fetching data from Json!')
        print('Creating data for Email..')
        print("-" * 157)

        emailFile = cf.config['EMAIL_SRC_JSON_FILE']
        flg = 1

        with open(emailFile) as json_file:
            dataEmail = json.load(json_file)

        # Creating documents
        a1 = cm.clsColMgmt()
        ret_cr_val1 = a1.CreateDocuments(dataEmail, flg)

        if ret_cr_val1 == 0:
            print('Successful data creation!')
        else:
            print('Failed create data!')

        print("-" * 157)

        print()
        print('Creating data for Twitter..')
        print("-" * 157)

        twitFile = cf.config['TWITTER_SRC_JSON_FILE']
        flg = 2

        with open(twitFile) as json_file:
            dataTwitter = json.load(json_file)

        # Creating documents
        a2 = cm.clsColMgmt()
        ret_cr_val2 = a2.CreateDocuments(dataTwitter, flg)

        if ret_cr_val2 == 0:
            print('Successful data creation!')
        else:
            print('Failed create data!')

        print("-" * 157)

        print()
        print('Creating data for HR..')
        print("-" * 157)

        hrFile = cf.config['HR_SRC_JSON_FILE']
        flg = 3

        with open(hrFile) as json_file:
            hrTwitter = json.load(json_file)

        # Creating documents
        a3 = cm.clsColMgmt()
        ret_cr_val3 = a3.CreateDocuments(hrTwitter, flg)

        if ret_cr_val3 == 0:
            print('Successful data creation!')
        else:
            print('Failed create data!')

        print("-" * 157)

        # Calling the function 1
        print("RealtimeEmail::")

        # Fetching First collection data to dataframe
        print("Fetching Cosmos Collection Data!")

        sql_qry_1 = cf.config['SQL_QRY_1']
        msg = "Documents generated based on unique key"
        collection_flg = 1

        x = cm.clsColMgmt()
        df_ret = x.fetch_data(sql_qry_1, msg, collection_flg)

        l.logr('1.EmailFeedback_' + var + '.csv', debug_ind, df_ret, 'log')
        print('RealtimeEmail Data::')
        print(df_ret)
        print()

        # Checking execution status
        ret_val = int(df_ret.shape[0])

        if ret_val == 0:
            print("Cosmos DB hasn't returned any rows. Please check your queries!")
            print("*" * 157)
        else:
            print("Successfully fetched!")
            print("*" * 157)

        # Calling the 2nd Collection
        print("RealtimeTwitterFeedback::")

        # Fetching second collection data to dataframe
        print("Fetching Cosmos Collection Data!")

        # Query using parameters
        sql_qry_2 = cf.config['SQL_QRY_2']
        msg_2 = "Documents generated based on RealtimeTwitterFeedback feed!"
        collection_flg = 2

        val = 'crazyGo'
        param_det = [{"name": "@CrVal", "value": val}]
        add_param = 2

        x1 = cm.clsColMgmt()
        df_ret_2 = x1.fetch_data(sql_qry_2, msg_2, collection_flg, add_param, param_det)

        l.logr('2.TwitterFeedback_' + var + '.csv', debug_ind, df_ret_2, 'log')
        print('Realtime Twitter Data:: ')
        print(df_ret_2)
        print()

        # Checking execution status
        ret_val_2 = int(df_ret_2.shape[0])

        if ret_val_2 == 0:
            print("Cosmos DB hasn't returned any rows. Please check your queries!")
            print("*" * 157)
        else:
            print("Successfully fetched the rows!")
            print("*" * 157)

    except ValueError:
        print("No relevant data to proceed!")

    except Exception as e:
        print("Top level Error: args:{0}, message:{1}".format(e.args, str(e)))

if __name__ == "__main__":
    main()

Key lines from the above script –

with open(twitFile) as json_file:
    dataTwitter = json.load(json_file)

Reading a json file.

val = 'crazyGo'
param_det = [{"name": "@CrVal", "value": val}]
add_param = 2

Passing a specific parameter value to filter out the record, while fetching it from the Cosmos DB.

Now, let’s look at the runtime stats.

Windows:

MAC:

Let’s compare the output log directory –

Windows:

MAC:

Let’s verify the data from Cosmos DB.

Here, a subscriberId starting with ‘M’ denotes data inserted from the MAC environment. The other one was inserted through Windows.

Let’s see one more example from Cosmos –

So, I guess we’ve achieved our final goal here. We successfully inserted data into Azure Cosmos DB from the Python application & retrieved it.

The following Python packages are required to run this application:

pip install azure
pip install azure-cosmos
pip install pandas
pip install requests

This application was tested on Python 3.7.1 & Python 3.7.2. As per Microsoft, their officially supported version is Python 3.5.

I hope you’ll like this effort.

Wait for the next installment. Till then, Happy Avenging. 😀

[Note: All the sample data are available/prepared in the public domain for research & study.]

Improvement of Pandas data processing performance using Multi-threading with the Queue (Another crossover of Space Stone, Reality Stone & Power Stone)

Posted on March 24, 2019 (updated September 23, 2019) by SatyakiDe in Data Science, function, member function, operating system, Pandas, Python

Today, we’ll discuss how to improve your pandas data processing power using multi-threading. Note that we are not going to use any third-party Python package. Also, we’ll be using a couple of Python scripts, which we’ve already discussed in our previous posts. Hence, this time, I won’t post them here.

Please refer to the following scripts:

a. callClient.py
b. callRunServer.py
c. clsConfigServer.py
d. clsEnDec.py
e. clsFlask.py
f. clsL.py
g. clsParam.py
h. clsSerial.py
i. clsWeb.py

You can find the above scripts described in detail in my earlier posts.

So, today, we’ll be looking into how multi-threading really helps the application gain performance over the serial approach.

Let’s go through our existing sample file:

We have four columns that are applicable for encryption. This file contains 10K records. That means the application will make 40K calls to the server (10K rows × 4 columns), with a different kind of encryption for each column.

Now, the serial approach, which I’ve already discussed here, will take significant time for data processing. However, we could club a few rows together as one block & in this way create multiple blocks out of our data csv, like this:

As you can see, the blocks are marked with different colors. So, if you now send each block of data for encryption in parallel, ideally, you will be able to process the data much faster than the usual serial process. And, this is what we are looking for with the help of Python’s multi-threading & queue. Without the queue, this program won’t be possible, as the queue maintains the data & process integrity.
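
The block idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: split_into_blocks is a hypothetical helper, a list of ten numbers stands in for the csv records, & the block size of 4 is arbitrary.

```python
def split_into_blocks(rows, block_size):
    """Yield consecutive slices of at most block_size rows."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]

rows = list(range(10))          # stand-in for 10 csv records
blocks = list(split_into_blocks(rows, 4))
print(blocks)

# Call count from the text: 10K rows x 4 encrypted columns
total_calls = 10_000 * 4
print(total_calls)
```

The last block is simply shorter when the number of rows is not a multiple of the block size.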

One more thing we would like to explain here. Whenever this application sends a block of data, it packs it into a (key, value) dictionary. The key will be the thread name. The reason is that the processed data might arrive back in some random order, also wrapped in a dictionary. Once the application has received all the dictionaries with dataframes containing the encrypted/decrypted data, the data will be rearranged based on the key & then joined back with the rest of the data.

Let’s see one sample way of sending & receiving random threads:

On the left-hand side, the application is splitting the recordset into small chunks. Once those chunks are created, the application uses Python multi-threading to push them into the queue for the producer to produce the encrypted/decrypted values. In a similar way, after processing, the application will push the final product into the queue for consuming the final output.

This is the pictorial representation of the dictionary ordering based on the key & then the application will extract the entire data to form the target csv file.
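
Here is a minimal sketch of that flow using only threading & queue from the standard library. It is not the clsParallel script: fake_encrypt is a placeholder for the real API call, small lists stand in for dataframe blocks, & the key is the block’s position instead of the thread name (an assumption made to keep the reordering easy to follow).

```python
import threading
from queue import Queue

def fake_encrypt(value):
    """Placeholder for the real API call; it just reverses the text."""
    return value[::-1]

def worker(in_q, out_q):
    """Take (key, block) pairs, process them, and emit (key, result) pairs."""
    while True:
        item = in_q.get()
        if item is None:          # sentinel: no more work
            in_q.task_done()
            break
        key, block = item
        out_q.put((key, [fake_encrypt(v) for v in block]))
        in_q.task_done()

blocks = [["ab", "cd"], ["ef", "gh"], ["ij"]]
in_q, out_q = Queue(), Queue()
num_threads = 2

threads = [threading.Thread(target=worker, args=(in_q, out_q)) for _ in range(num_threads)]
for t in threads:
    t.start()

# The key records the original position of each block
for idx, block in enumerate(blocks):
    in_q.put((idx, block))
for _ in threads:
    in_q.put(None)                # one sentinel per worker

in_q.join()
for t in threads:
    t.join()

# Results may arrive in any order; collect them into a dictionary...
results = {}
while not out_q.empty():
    key, processed = out_q.get()
    results[key] = processed

# ...and rebuild the original order by sorting on the key
ordered = [v for k in sorted(results) for v in results[k]]
print(ordered)
```

Whichever thread finishes first, sorting on the key puts the blocks back in their original order before they are joined.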

Let’s explore the script –

1. clsParallel.py (This script will consume the split csv files & send the data blocks in the form of the dictionary using multi-threading to the API for encryption in parallel. Hence, the name comes into the picture.)

import pandas as p
import clsWeb as cw
import datetime
from clsParam import clsParam as cf
import threading
from queue import Queue
import gc
import signal
import time
import os

# Declaring Global Variable
q = Queue()
m = Queue()
tLock = threading.Lock()
threads = []

fin_dict = {}
fin_dict_1 = {}
stopping = threading.Event()

# Disabling Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

class clsParallel(object):
    def __init__(self):
        self.path = cf.config['PATH']
        self.EncryptMode = str(cf.config['ENCRYPT_MODE'])
        self.DecryptMode = str(cf.config['DECRYPT_MODE'])
        self.num_worker_threads = int(cf.config['NUM_OF_THREAD'])
        

    # Lookup Methods for Encryption
    def encrypt_acctNbr(self, row):
        # Declaring Local Variable
        en_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctNbr = x.getResponse(EncryptMode)
        else:
            en_AcctNbr = ''

        return en_AcctNbr

    def encrypt_Name(self, row):
        # Declaring Local Variable
        en_AcctName = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctName = x.getResponse(EncryptMode)
        else:
            en_AcctName = ''

        return en_AcctName

    def encrypt_Phone(self, row):
        # Declaring Local Variable
        en_Phone = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Phone = x.getResponse(EncryptMode)
        else:
            en_Phone = ''

        return en_Phone

    def encrypt_Email(self, row):
        # Declaring Local Variable
        en_Email = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Email = x.getResponse(EncryptMode)
        else:
            en_Email = ''

        return en_Email

    # Lookup Methods for Decryption
    def decrypt_acctNbr(self, row):
        # Declaring Local Variable
        de_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctNbr = x.getResponse(EncryptMode)
        else:
            de_AcctNbr = ''

        return de_AcctNbr

    def decrypt_Name(self, row):
        # Declaring Local Variable
        de_AcctName = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctName = x.getResponse(EncryptMode)
        else:
            de_AcctName = ''

        return de_AcctName

    def decrypt_Phone(self, row):
        # Declaring Local Variable
        de_Phone = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Phone = x.getResponse(EncryptMode)
        else:
            de_Phone = ''

        return de_Phone

    def decrypt_Email(self, row):
        # Declaring Local Variable
        de_Email = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Email = x.getResponse(EncryptMode)
        else:
            de_Email = ''

        return de_Email

    def getEncrypt(self, df_dict):
        try:
            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            for k, v in df_dict.items():
                Thread_Name = k
                df_input = v

            # Checking total count of rows
            count_row = int(df_input.shape[0])
            # print('Part number of records to process:: ', count_row)

            if count_row > 0:

                # Deriving rows
                df_input['Encrypt_Acct_Nbr'] = df_input.apply(lambda row: self.encrypt_acctNbr(row), axis=1)
                df_input['Encrypt_Name'] = df_input.apply(lambda row: self.encrypt_Name(row), axis=1)
                df_input['Encrypt_Phone'] = df_input.apply(lambda row: self.encrypt_Phone(row), axis=1)
                df_input['Encrypt_Email'] = df_input.apply(lambda row: self.encrypt_Email(row), axis=1)

                # Dropping original columns
                df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

                # Renaming new columns with the old column names
                df_input.rename(columns={'Encrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
                df_input.rename(columns={'Encrypt_Name': 'Name'}, inplace=True)
                df_input.rename(columns={'Encrypt_Phone': 'Phone'}, inplace=True)
                df_input.rename(columns={'Encrypt_Email': 'Email'}, inplace=True)

                # New Column List Orders
                column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email', 'Serial_No']
                df_fin = df_input.reindex(column_order, axis=1)

                fin_dict[Thread_Name] = df_fin

            return 0
        except Exception as e:
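            # Values are wrapped in lists, as pandas rejects a dictionary of bare scalars without an index
            df_error = p.DataFrame({'Acct_Nbr': [str(e)], 'Name': [''], 'Acct_Addr_1': [''], 'Acct_Addr_2': [''], 'Phone': [''], 'Email': [''], 'Serial_No': ['']})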
            fin_dict[Thread_Name] = df_error

            return 1

    def getEncryptWQ(self):
        item_dict = {}
        item = ''

        while True:
            try:
                #item_dict = q.get()
                item_dict = q.get_nowait()

                for k, v in item_dict.items():
                    # Assigning Target File Basic Name
                    item = str(k)

                if ((item == 'TEND') | (item == '')):
                    break

                if ((item != 'TEND') | (item != '')):
                    self.getEncrypt(item_dict)

                q.task_done()
            except Exception:
                break

    def getEncryptParallel(self, df_payload):
        start_pos = 0
        end_pos = 0
        l_dict = {}
        c_dict = {}
        min_val_list = {}
        cnt = 0
        num_worker_threads = self.num_worker_threads
        split_df = p.DataFrame()
        df_ret = p.DataFrame()

        # Assigning Target File Basic Name
        df_input = df_payload

        # Checking total count of rows
        count_row = df_input.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_threads) + 1
        actual_worker_task = int(count_row / interval) + 1

        for i in range(actual_worker_task):
            t = threading.Thread(target=self.getEncryptWQ)
            t.start()
            threads.append(t)
            name = str(t.getName())

            if ((start_pos + interval) < count_row):
                end_pos = start_pos + interval
            else:
                end_pos = start_pos + (count_row - start_pos)

            split_df = df_input.iloc[start_pos:end_pos]
            l_dict[name] = split_df

            if ((start_pos > count_row) | (start_pos == count_row)):
                break
            else:
                start_pos = start_pos + interval

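            # Queue a fresh single-entry dictionary, so that threads never share one growing dictionary
            q.put({name: split_df})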
            cnt += 1

        # block until all tasks are done
        q.join()

        # stop workers
        for i in range(actual_worker_task):
            c_dict['TEND'] = p.DataFrame()
            q.put(c_dict)

        for t in threads:
            t.join()

        for k, v in fin_dict.items():
            min_val_list[int(k.replace('Thread-',''))] = v

        min_val = min(min_val_list, key=int)

        for k, v in sorted(fin_dict.items(), key=lambda k:int(k[0].replace('Thread-',''))):
            if int(k.replace('Thread-','')) == min_val:
                df_ret = fin_dict[k]
            else:
                d_frames = [df_ret, fin_dict[k]]
                df_ret = p.concat(d_frames)

        # Releasing Memory
        del[[split_df]]
        gc.collect()

        return df_ret

    def getDecrypt(self, df_encrypted_dict):
        try:
            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            for k, v in df_encrypted_dict.items():
                Thread_Name = k
                df_input = v

            # Checking total count of rows
            count_row = int(df_input.shape[0])

            if count_row > 0:

                # Deriving rows
                df_input['Decrypt_Acct_Nbr'] = df_input.apply(lambda row: self.decrypt_acctNbr(row), axis=1)
                df_input['Decrypt_Name'] = df_input.apply(lambda row: self.decrypt_Name(row), axis=1)
                df_input['Decrypt_Phone'] = df_input.apply(lambda row: self.decrypt_Phone(row), axis=1)
                df_input['Decrypt_Email'] = df_input.apply(lambda row: self.decrypt_Email(row), axis=1)

                # Dropping original columns
                df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

                # Renaming new columns with the old column names
                df_input.rename(columns={'Decrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
                df_input.rename(columns={'Decrypt_Name': 'Name'}, inplace=True)
                df_input.rename(columns={'Decrypt_Phone': 'Phone'}, inplace=True)
                df_input.rename(columns={'Decrypt_Email': 'Email'}, inplace=True)

                # New Column List Orders
                column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email']
                df_fin = df_input.reindex(column_order, axis=1)

                fin_dict_1[Thread_Name] = df_fin

            return 0

        except Exception as e:
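            # Values are wrapped in lists, as pandas rejects a dictionary of bare scalars without an index
            df_error = p.DataFrame({'Acct_Nbr': [str(e)], 'Name': [''], 'Acct_Addr_1': [''], 'Acct_Addr_2': [''], 'Phone': [''], 'Email': ['']})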
            fin_dict_1[Thread_Name] = df_error

            return 1

    def getDecryptWQ(self):
        item_dict = {}
        item = ''

        while True:
            try:
                #item_dict = q.get()
                item_dict = m.get_nowait()

                for k, v in item_dict.items():
                    # Assigning Target File Basic Name
                    item = str(k)

                if ((item == 'TEND') | (item == '')):
                    return True
                    #break

                if ((item != 'TEND') | (item != '')):
                    self.getDecrypt(item_dict)

                m.task_done()
            except Exception:
                break


    def getDecryptParallel(self, df_payload):
        start_pos = 0
        end_pos = 0
        l_dict_1 = {}
        c_dict_1 = {}
        cnt = 0
        num_worker_threads = self.num_worker_threads
        split_df = p.DataFrame()
        df_ret_1 = p.DataFrame()

        min_val_list = {}

        # Assigning Target File Basic Name
        df_input_1 = df_payload

        # Checking total count of rows
        count_row = df_input_1.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_threads) + 1
        actual_worker_task = int(count_row / interval) + 1

        for i in range(actual_worker_task):
            t_1 = threading.Thread(target=self.getDecryptWQ)
            t_1.start()
            threads.append(t_1)
            name = str(t_1.getName())

            if ((start_pos + interval) < count_row):
                end_pos = start_pos + interval
            else:
                end_pos = start_pos + (count_row - start_pos)

            split_df = df_input_1.iloc[start_pos:end_pos]
            l_dict_1[name] = split_df

            if ((start_pos > count_row) | (start_pos == count_row)):
                break
            else:
                start_pos = start_pos + interval

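            # Queue a fresh single-entry dictionary, so that threads never share one growing dictionary
            m.put({name: split_df})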
            cnt += 1

        # block until all tasks are done
        m.join()

        # stop workers
        for i in range(actual_worker_task):
            c_dict_1['TEND'] = p.DataFrame()
            m.put(c_dict_1)

        for t_1 in threads:
            t_1.join()

        for k, v in fin_dict_1.items():
            min_val_list[int(k.replace('Thread-',''))] = v

        min_val = min(min_val_list, key=int)

        for k, v in sorted(fin_dict_1.items(), key=lambda k:int(k[0].replace('Thread-',''))):
            if int(k.replace('Thread-','')) == min_val:
                df_ret_1 = fin_dict_1[k]
            else:
                d_frames = [df_ret_1, fin_dict_1[k]]
                df_ret_1 = p.concat(d_frames)

        # Releasing Memory
        del[[split_df]]
        gc.collect()

        return df_ret_1

Let’s explain the key snippets from the code. For your information, we’re not going to describe all the encryption & decryption methods, such as –

# Encryption Method
encrypt_acctNbr
encrypt_Name
encrypt_Phone
encrypt_Email

# Decryption Method
decrypt_acctNbr
decrypt_Name
decrypt_Phone
decrypt_Email

We’ve already described the logic of these methods in our previous post.
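
One detail of these methods is still worth a quick look. They build the request body by concatenating strings, so a value containing a double quote or a backslash would produce invalid JSON. A minimal sketch of an alternative, assuming a hypothetical helper named build_payload that is not part of the original scripts, lets json.dumps handle the escaping:

```python
import json

def build_payload(data, data_template, data_group="GrDet"):
    """Return the JSON request body for one field value."""
    return json.dumps({
        "dataGroup": data_group,
        "data": data,
        "dataTemplate": data_template,
    })

# A value with embedded double quotes still yields valid JSON
payload = build_payload('Ann "AJ" Lee', "subGrName")
print(payload)
```

The three keys & the template names are the ones used in the script; only the helper itself is an assumption.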

# Checking total count of rows
count_row = df_input.shape[0]
print('Total number of records to process:: ', count_row)

interval = int(count_row / num_worker_threads) + 1
actual_worker_task = int(count_row / interval) + 1

Here, the application fetches the total number of rows from the dataframe. Based on the row count, it derives the chunk size (interval) & the actual number of threads that will be used for parallelism.
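
To make the arithmetic concrete, here is a small sketch that reproduces the same two formulas & lists the row ranges that would end up in the queue. The helper name chunk_bounds is an assumption for illustration; the script itself performs these steps inline.

```python
def chunk_bounds(count_row, num_worker_threads):
    """Return the (start_pos, end_pos) pairs produced by the two formulas."""
    interval = int(count_row / num_worker_threads) + 1
    actual_worker_task = int(count_row / interval) + 1

    bounds = []
    start_pos = 0
    for _ in range(actual_worker_task):
        if start_pos >= count_row:      # nothing left, so no chunk is queued
            break
        end_pos = min(start_pos + interval, count_row)
        bounds.append((start_pos, end_pos))
        start_pos += interval
    return bounds

print(chunk_bounds(10, 4))
print(chunk_bounds(12, 5))
```

With 10 rows & 4 threads, the interval is 3 & four chunks are produced, the last one holding a single row. With 12 rows & 5 threads, the formulas allow five iterations, but the fifth one finds no rows left, so only four chunks are queued.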

for i in range(actual_worker_task):
    t = threading.Thread(target=self.getEncryptWQ)
    t.start()
    threads.append(t)
    name = str(t.getName())

    if ((start_pos + interval) < count_row):
        end_pos = start_pos + interval
    else:
        end_pos = start_pos + (count_row - start_pos)

    split_df = df_input.iloc[start_pos:end_pos]
    l_dict[name] = split_df

    if ((start_pos > count_row) | (start_pos == count_row)):
        break
    else:
        start_pos = start_pos + interval

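    # Queue a fresh single-entry dictionary, so that threads never share one growing dictionary
    q.put({name: split_df})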
    cnt += 1

Here, the application splits the data into multiple groups of smaller data packs, wraps each of them in a (key, value) dictionary keyed by the thread name & finally places them into the queue.

# block until all tasks are done
q.join()

This blocks the main thread until every item that was put into the queue has been fetched & marked as done through task_done(). It ensures that the queue is free after the data has been consumed.
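
The relationship between join() & task_done() can be seen in a tiny standalone sketch. The names work, squares & worker are assumptions for this example, & a single daemon thread stands in for the pool of workers.

```python
import threading
from queue import Queue

work = Queue()
squares = []

def worker():
    while True:
        n = work.get()          # blocks until an item arrives
        squares.append(n * n)
        work.task_done()        # tells the queue this item is finished

# daemon=True lets the interpreter exit even though the loop never ends
threading.Thread(target=worker, daemon=True).start()

for n in range(5):
    work.put(n)

work.join()                     # returns only after the fifth task_done() call
print(squares)
```

If a worker stops without calling task_done() for an item it fetched, join() keeps waiting, which is why every fetched item needs a matching task_done().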

# stop workers
for i in range(actual_worker_task):
    c_dict['TEND'] = p.DataFrame()
    q.put(c_dict)

for t in threads:
    t.join()

The above lines are essential, as they help the worker threads identify that no more data is left in the queue. The main thread will then wait until all the threads are done.
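
The same stop signal is often written with a blocking get() & one sentinel per worker. The following is a minimal sketch of that variant & not the code of this post: STOP plays the role of the TEND key, & run_workers is a hypothetical name.

```python
import threading
from queue import Queue

STOP = object()                 # hypothetical sentinel, plays the role of 'TEND'

def run_workers(items, num_workers):
    tasks = Queue()
    seen = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()  # blocking get: waits for work or for STOP
            if item is STOP:
                return
            with lock:
                seen.append(item)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:           # one sentinel per worker, queued after the data
        tasks.put(STOP)
    for t in threads:
        t.join()
    return sorted(seen)

print(run_workers([3, 1, 2], num_workers=2))
```

Because each worker consumes exactly one sentinel before returning, the number of sentinels has to match the number of workers.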

for k, v in fin_dict.items():
    min_val_list[int(k.replace('Thread-',''))] = v

min_val = min(min_val_list, key=int)

Once all the jobs are done, the application finds the minimum thread number. Based on that, we can sequence all the data chunks, as explained in our previous image, & finally club them together to form the complete csv.

for k, v in sorted(fin_dict.items(), key=lambda k:int(k[0].replace('Thread-',''))):
    if int(k.replace('Thread-','')) == min_val:
        df_ret = fin_dict[k]
    else:
        d_frames = [df_ret, fin_dict[k]]
        df_ret = p.concat(d_frames)

As already explained, starting from the first element of our data dictionary, the application clubs the data back into the main csv.
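
The conversion to int in the sort key matters, since thread names compared as plain strings would place Thread-10 before Thread-2. A short sketch with made-up data (the chunks dictionary & the helper thread_number are assumptions for illustration):

```python
def thread_number(name):
    """Extract the numeric suffix from a name such as 'Thread-12'."""
    return int(name.replace('Thread-', ''))

# Hypothetical results keyed by thread name, inserted out of order
chunks = {'Thread-10': ['j'], 'Thread-2': ['b'], 'Thread-1': ['a']}

plain_order = sorted(chunks)                        # string comparison
numeric_order = sorted(chunks, key=thread_number)   # numeric comparison

merged = []
for name in numeric_order:
    merged.extend(chunks[name])
print(plain_order, numeric_order, merged)
```

On Python 3.10 & later, the default name of a thread created with a target also carries the target’s name in parentheses, so the parsing step may need adjusting there.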

The next method that we’ll be explaining is –

getEncryptWQ

Please find the key lines –

while True:
    try:
        #item_dict = q.get()
        item_dict = q.get_nowait()

        for k, v in item_dict.items():
            # Assigning Target File Basic Name
            item = str(k)

        if ((item == 'TEND') | (item == '')):
            break

        if ((item != 'TEND') | (item != '')):
            self.getEncrypt(item_dict)

        q.task_done()
    except Exception:
        break

This method consumes the data & processes it for encryption (getDecryptWQ does the same for decryption). It continues to do the work until it receives TEND as the key or the queue is empty.

Let’s compare the statistics between Windows & MAC.

Let’s see the file structure first –

Windows (16 GB – Core 2) Vs Mac (10 GB – Core 2):

Windows (16 GB – Core 2):

Mac (10 GB – Core 2):

Find the complete directory listing from both machines.

Windows (16 GB – Core 2):

Mac (10 GB – Core 2):

Here is the final output –

So, we’ve achieved our target goal.

Let me know how you like this post. Please share your suggestions & comments.

I’ll be back with another installment from the Python verse.

Till then – Happy Avenging!

Pandas with Encryption/Decryption along with the JSON – (Client API Access) along with Data Queue (A crossover between Space stone, Reality Stone & Power Stone)

Posted on March 10, 2019 (updated September 23, 2019) by SatyakiDe in api, Data Science, function, numpy, operating system, Pandas, Python, regular expression, String Manipulation, Technology, Uncategorized

Today, we’ll be discussing a new cross-over between API, JSON, Encryption along with data distribution through Queue.

The primary objective here is to distribute one csv file through an API service & access our previously deployed Encryption/Decryption methods through parallel calls via a Queue. In this case, we want to allow asynchronous calls to the Queue for data distribution; at this point, we’re not really looking for performance improvement. Instead, our goal is simply to achieve the target.

My upcoming posts will discuss the improvement of performance using Parallel calls.

Let’s discuss it now.

The structure of our Windows & MAC directories is as follows –

We’re not going to discuss any scripts that we’ve already discussed in my previous posts. Please refer to the relevant earlier posts on my blog.

1. clsL.py (This script will create the split csv files or the final merged file after the corresponding process. However, it can be used for the usual verbose debug logging as well. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 25-Jan-2019       ########
####                               ########
#### Objective: Log File           ########
###########################################
import pandas as p
import platform as pl
from clsParam import clsParam as cf

class clsL(object):
    def __init__(self):
        self.path = cf.config['PATH']

    def logr(self, Filename, Ind, df, subdir=None):
        try:
            x = p.DataFrame()
            x = df
            sd = subdir

            os_det = pl.system()

            if sd == None:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + Filename
                else:
                    fullFileName = self.path + '/' + Filename
            else:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + sd + "\\" + Filename
                else:
                    fullFileName = self.path + '/' + sd + "/" + Filename

            if Ind == 'Y':
                x.to_csv(fullFileName, index=False)

            return 0

        except Exception as e:
            y = str(e)
            print(y)
            return 3
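
The separate Windows & non-Windows branches in logr can also be expressed with os.path.join, which picks the separator of the current platform. A minimal sketch, assuming a hypothetical helper named build_full_name that mirrors the path logic only:

```python
import os

def build_full_name(base_path, filename, subdir=None):
    """Join the path parts with the separator of the current platform."""
    if subdir is None:
        return os.path.join(base_path, filename)
    return os.path.join(base_path, subdir, filename)

# Folder & file names below are placeholders
print(build_full_name('log', 'enc.csv'))
print(build_full_name('log', 'enc.csv', subdir='target'))
```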

2. callRunServer.py (This script will create an instance of a server. Once it is running, it will emulate the Server API functionalities. Hence, the name comes into the picture.)

############################################
#### Written By: SATYAKI DE             ####
#### Written On: 10-Feb-2019            ####
#### Package Flask package needs to     ####
#### install in order to run this       ####
#### script.                            ####
####                                    ####
#### Objective: This script will        ####
#### initiate the encrypt/decrypt class ####
#### based on client supplied data.     ####
#### Also, this will create an instance ####
#### of the server & create an endpoint ####
#### or API using flask framework.      ####
############################################

from flask import Flask
from flask import jsonify
from flask import request
from flask import abort
from clsConfigServer import clsConfigServer as csf
import clsFlask as clf

app = Flask(__name__)

@app.route('/process/getEncrypt', methods=['POST'])
def getEncrypt():
    try:
        # If the server application doesn't have
        # valid json, it will throw 400 error
        if not request.is_json:
            abort(400)

        # Capturing the individual element
        content = request.get_json()

        dGroup = content['dataGroup']
        input_data = content['data']
        dTemplate = content['dataTemplate']

        # For debug purpose only
        print("-" * 157)
        print("Group: ", dGroup)
        print("Data: ", input_data)
        print("Template: ", dTemplate)
        print("-" * 157)

        ret_val = ''

        if ((dGroup != '') & (dTemplate != '')):
            y = clf.clsFlask()
            ret_val = y.getEncryptProcess(dGroup, input_data, dTemplate)
        else:
            abort(500)

        return jsonify({'status': 'success', 'encrypt_val': ret_val})
    except Exception as e:
        x = str(e)
        return jsonify({'status': 'error', 'detail': x})


@app.route('/process/getDecrypt', methods=['POST'])
def getDecrypt():
    try:
        # If the server application doesn't have
        # valid json, it will throw 400 error
        if not request.is_json:
            abort(400)

        # Capturing the individual element
        content = request.get_json()

        dGroup = content['dataGroup']
        input_data = content['data']
        dTemplate = content['dataTemplate']

        # For debug purpose only
        print("-" * 157)
        print("Group: ", dGroup)
        print("Data: ", input_data)
        print("Template: ", dTemplate)
        print("-" * 157)

        ret_val = ''

        if ((dGroup != '') & (dTemplate != '')):
            y = clf.clsFlask()
            ret_val = y.getDecryptProcess(dGroup, input_data, dTemplate)
        else:
            abort(500)

        return jsonify({'status': 'success', 'decrypt_val': ret_val})
    except Exception as e:
        x = str(e)
        return jsonify({'status': 'error', 'detail': x})


def main():
    try:
        print('Starting Encrypt/Decrypt Application!')

        # Calling Server Start-Up Script
        app.run(debug=True, host=str(csf.config['HOST_IP_ADDR']))
        ret_val = 0

        if ret_val == 0:
            print("Finished Returning Message!")
        else:
            raise IOError
    except Exception as e:
        print("Server Failed To Start!")

if __name__ == '__main__':
    main()
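
From the client side, a call to this endpoint is a POST with a JSON body holding the three keys read by the server, & the reply carries a status along with the encrypted value. The sketch below only prepares the request & parses a reply; it does not contact any server. The base URL assumes Flask’s default port 5000 on the local machine, & both helper names are hypothetical.

```python
import json
from urllib.request import Request

# Assumed address: Flask's default port 5000 on the local machine
BASE_URL = 'http://127.0.0.1:5000'

def build_encrypt_request(data, data_template, data_group='GrDet'):
    """Prepare (but do not send) a POST request for /process/getEncrypt."""
    body = json.dumps({
        'dataGroup': data_group,
        'data': data,
        'dataTemplate': data_template,
    }).encode('utf8')
    return Request(
        BASE_URL + '/process/getEncrypt',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def read_encrypt_response(raw_body):
    """Return the encrypted value, or '' when the server reports an error."""
    reply = json.loads(raw_body)
    if reply.get('status') == 'success':
        return reply.get('encrypt_val', '')
    return ''

req = build_encrypt_request('1234567890', 'subGrPhone')
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it once the server is running.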

3. clsFlask.py (This script is part of the server process, which will categorize the encryption logic based on different groups. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE            ####
#### Written On: 25-Jan-2019           ####
#### Package Flask package needs to    ####
#### install in order to run this      ####
#### script.                           ####
####                                   ####
#### Objective: This script will       ####
#### encrypt/decrypt based on the      ####
#### supplied salt value. Also,        ####
#### this will capture the individual  ####
#### element & stored them into JSON   ####
#### variables using flask framework.  ####
###########################################

from clsConfigServer import clsConfigServer as csf
import clsEnDecAuth as cen

class clsFlask(object):
    def __init__(self):
        self.xtoken = str(csf.config['DEF_SALT'])

    def getEncryptProcess(self, dGroup, input_data, dTemplate):
        try:
            # It is sending default salt value
            xtoken = self.xtoken

            # Capturing the individual element
            dGroup = dGroup
            input_data = input_data
            dTemplate = dTemplate

            # This will check the mandatory json elements
            if ((dGroup != '') & (dTemplate != '')):

                # Based on the Group & Element it will fetch the salt
                # Based on the specific salt it will encrypt the data
                if ((dGroup == 'GrDet') & (dTemplate == 'subGrAcct_Nbr')):
                    xtoken = str(csf.config['ACCT_NBR_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.encrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrName')):
                    xtoken = str(csf.config['NAME_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.encrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrPhone')):
                    xtoken = str(csf.config['PHONE_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.encrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrEmail')):
                    xtoken = str(csf.config['EMAIL_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.encrypt_str(input_data)
                else:
                    ret_val = ''
            else:
                ret_val = ''

            # Return value
            return ret_val

        except Exception as e:
            ret_val = ''
            # Return the valid json Error Response
            return ret_val

    def getDecryptProcess(self, dGroup, input_data, dTemplate):
        try:
            xtoken = self.xtoken

            # Capturing the individual element
            dGroup = dGroup
            input_data = input_data
            dTemplate = dTemplate

            # This will check the mandatory json elements
            if ((dGroup != '') & (dTemplate != '')):

                # Based on the Group & Element it will fetch the salt
                # Based on the specific salt it will decrypt the data
                if ((dGroup == 'GrDet') & (dTemplate == 'subGrAcct_Nbr')):
                    xtoken = str(csf.config['ACCT_NBR_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.decrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrName')):
                    xtoken = str(csf.config['NAME_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.decrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrPhone')):
                    xtoken = str(csf.config['PHONE_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.decrypt_str(input_data)
                elif ((dGroup == 'GrDet') & (dTemplate == 'subGrEmail')):
                    xtoken = str(csf.config['EMAIL_SALT'])
                    print("xtoken: ", xtoken)
                    print("Flask Input Data: ", input_data)
                    x = cen.clsEnDec(xtoken)
                    ret_val = x.decrypt_str(input_data)
                else:
                    ret_val = ''
            else:
                ret_val = ''

            # Return value
            return ret_val

        except Exception as e:
            ret_val = ''
            # Return the valid Error Response
            return ret_val
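
Since every branch above differs only in the config key of the salt, the selection can also be written as a dictionary lookup. This is a sketch of that idea & not the author’s implementation; SALT_KEYS, find_salt & the demo values are assumptions, while the group, template & config key names come from the script.

```python
# Maps (dataGroup, dataTemplate) to the config key that holds the salt
SALT_KEYS = {
    ('GrDet', 'subGrAcct_Nbr'): 'ACCT_NBR_SALT',
    ('GrDet', 'subGrName'): 'NAME_SALT',
    ('GrDet', 'subGrPhone'): 'PHONE_SALT',
    ('GrDet', 'subGrEmail'): 'EMAIL_SALT',
}

def find_salt(config, d_group, d_template):
    """Return the salt for the pair, or None when the pair is unknown."""
    key = SALT_KEYS.get((d_group, d_template))
    return None if key is None else str(config[key])

# Placeholder values for illustration only, not real keys
demo_config = {'ACCT_NBR_SALT': 'salt-a', 'NAME_SALT': 'salt-n',
               'PHONE_SALT': 'salt-p', 'EMAIL_SALT': 'salt-e'}
print(find_salt(demo_config, 'GrDet', 'subGrPhone'))
```

Adding a new template then means adding one entry to the mapping instead of another elif branch.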

4. clsEnDec.py (This script will encrypt or decrypt the supplied string based on the salt passed for the corresponding group. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 25-Jan-2019       ########
#### Package Cryptography needs to ########
#### install in order to run this  ########
#### script.                       ########
####                               ########
#### Objective: This script will   ########
#### encrypt/decrypt based on the  ########
#### hidden supplied salt value.   ########
###########################################

from cryptography.fernet import Fernet

class clsEnDec(object):

    def __init__(self, token):
        # Calculating Key
        self.token = token

    def encrypt_str(self, data):
        try:
            # Capturing the Salt Information
            salt = self.token

            # Building the cipher from the salt & encrypting the data
            cipher = Fernet(salt)
            encr_val = cipher.encrypt(bytes(data,'utf8')).decode('utf8')

            return encr_val

        except Exception as e:
            x = str(e)
            print(x)
            encr_val = ''

            return encr_val

    def decrypt_str(self, data):
        try:
            # Capturing the Salt Information
            salt = self.token

            # Building the cipher from the salt & decrypting the data
            cipher = Fernet(salt)
            decr_val = cipher.decrypt(bytes(data,'utf8')).decode('utf8')

            return decr_val

        except Exception as e:
            x = str(e)
            print(x)
            decr_val = ''

            return decr_val
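
To see the round trip in isolation, here is a minimal sketch. It assumes the cryptography package is installed and uses a throwaway key from Fernet.generate_key() instead of the salts from the config file. The sample value is made up.

```python
from cryptography.fernet import Fernet

# A throwaway key for illustration only; the post reads its keys from the config.
key = Fernet.generate_key()
cipher = Fernet(key)

plain_text = "O'Brien"  # made-up sample value containing an apostrophe

# encrypt() takes bytes & returns a URL-safe base64 token as bytes
token = cipher.encrypt(plain_text.encode('utf8'))
token_str = token.decode('utf8')

# decrypt() also works on bytes, so the text token is encoded again first
restored = cipher.decrypt(token_str.encode('utf8')).decode('utf8')

print(restored == plain_text)
```

Decoding the bytes with decode('utf8') keeps characters such as apostrophes intact in the restored value.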

5. clsConfigServer.py (This script contains all the main parameter details of your emulated API server. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 10-Feb-2019       ########
####                               ########
#### Objective: Parameter File     ########
###########################################

import os
import platform as pl

# Checking with O/S system
os_det = pl.system()

class clsConfigServer(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    if os_det == "Windows":
        config = {
            'FILE': 'acct_addr_20180112.csv',
            'SRC_FILE_PATH': Curr_Path + '\\' + 'src_file\\',
            'PROFILE_FILE_PATH': Curr_Path + '\\' + 'profile\\',
            'HOST_IP_ADDR': '0.0.0.0',
            'DEF_SALT': 'iooquzKtqLwUwXG3rModqj_fIl409vemWg9PekcKh2o=',
            'ACCT_NBR_SALT': 'iooquzKtqLwUwXG3rModqj_fIlpp1vemWg9PekcKh2o=',
            'NAME_SALT': 'iooquzKtqLwUwXG3rModqj_fIlpp1026Wg9PekcKh2o=',
            'PHONE_SALT': 'iooquzKtqLwUwXG3rMM0F5_fIlpp1026Wg9PekcKh2o=',
            'EMAIL_SALT': 'iooquzKtqLwU0653rMM0F5_fIlpp1026Wg9PekcKh2o='
        }
    else:
        config = {
            'FILE': 'acct_addr_20180112.csv',
            'SRC_FILE_PATH': Curr_Path + '/' + 'src_file/',
            'PROFILE_FILE_PATH': Curr_Path + '/' + 'profile/',
            'HOST_IP_ADDR': '0.0.0.0',
            'DEF_SALT': 'iooquzKtqLwUwXG3rModqj_fIl409vemWg9PekcKh2o=',
            'ACCT_NBR_SALT': 'iooquzKtqLwUwXG3rModqj_fIlpp1vemWg9PekcKh2o=',
            'NAME_SALT': 'iooquzKtqLwUwXG3rModqj_fIlpp1026Wg9PekcKh2o=',
            'PHONE_SALT': 'iooquzKtqLwUwXG3rMM0F5_fIlpp1026Wg9PekcKh2o=',
            'EMAIL_SALT': 'iooquzKtqLwU0653rMM0F5_fIlpp1026Wg9PekcKh2o='
        }
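
The two config blocks above differ only in the path separator. As a hedged alternative, the sketch below builds the same two directory settings with os.path.join, which picks the separator for the running platform. The function name build_paths is hypothetical, and the current working directory stands in for Curr_Path.

```python
import os


def build_paths(base_dir):
    """Build both directory settings with a trailing separator on any O/S."""
    # Joining with an empty last component appends the trailing separator.
    return {
        'SRC_FILE_PATH': os.path.join(base_dir, 'src_file', ''),
        'PROFILE_FILE_PATH': os.path.join(base_dir, 'profile', ''),
    }


paths = build_paths(os.getcwd())
print(paths['SRC_FILE_PATH'])
print(paths['PROFILE_FILE_PATH'])
```

This way, a single config dictionary could serve both Windows & non-Windows systems.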

6. clsWeb.py (This script will receive the JSON payload built from the input Pandas dataframe, send it to our Flask API Server for encryption/decryption & return the value from the response. Hence, the name comes into the picture.)

############################################
#### Written By: SATYAKI DE             ####
#### Written On: 09-Mar-2019            ####
#### Package Flask package needs to     ####
#### install in order to run this       ####
#### script.                            ####
####                                    ####
#### Objective: This script will        ####
#### initiate API based JSON requests   ####
#### at the server & receive the        ####
#### response from it & transform it    ####
#### back to the data-frame.            ####
############################################

import json
import requests
import datetime
import time
import ssl
import os
from clsParam import clsParam as cf

class clsWeb(object):
    def __init__(self, payload):
        self.payload = payload
        self.path = str(cf.config['PATH'])
        self.max_retries = int(cf.config['MAX_RETRY'])
        self.encrypt_ulr = str(cf.config['ENCRYPT_URL'])
        self.decrypt_ulr = str(cf.config['DECRYPT_URL'])

    def getResponse(self, mode):

        # Assigning Logging Info
        max_retries = self.max_retries
        encrypt_ulr = self.encrypt_ulr
        decrypt_ulr = self.decrypt_ulr
        En_Dec_Mode = mode

        try:

            # Bypassing SSL Authentication
            try:
                _create_unverified_https_context = ssl._create_unverified_context
            except AttributeError:
                # Legacy python that doesn't verify HTTPS certificates by default
                pass
            else:
                # Handle target environment that doesn't support HTTPS verification
                ssl._create_default_https_context = _create_unverified_https_context

            # Providing the url
            if En_Dec_Mode == 'En':
                url = encrypt_ulr
            else:
                url = decrypt_ulr

            print("URL::", url)

            # Capturing the payload
            data = self.payload

            # Converting String to Json
            # json_data = json.loads(data)
            json_data = json.loads(data)

            print("JSON:::::::", str(json_data))

            headers = {"Content-type": "application/json"}
            param = headers

            var1 = datetime.datetime.now().strftime("%H:%M:%S")
            print('Json Fetch Start Time:', var1)

            retries = 1
            success = False

            while not success:
                # Getting response from web service
                # response = requests.post(url, params=param, json=data, auth=(login, password), verify=False)
                response = requests.post(url, headers=headers, json=json_data, verify=False)
                print("Complete Return Code:: ", str(response.status_code))
                print("Return Code Initial::", str(response.status_code)[:1])

                if str(response.status_code)[:1] == '2':
                    # response = s.post(url, params=param, json=json_data, verify=False)
                    success = True
                else:
                    wait = retries * 2
                    print("Retry fails! Waiting " + str(wait) + " seconds and retrying.")
                    time.sleep(wait)
                    retries += 1
                    # print('Return Service::')

                # Checking Maximum Retries
                if retries == max_retries:
                    success = True
                    raise ValueError

            # Runs only after the retry loop has finished
            print("JSON RESPONSE:::", response.text)

            var2 = datetime.datetime.now().strftime("%H:%M:%S")
            print('Json Fetch End Time:', var2)

            # Capturing the response json from Web Service
            response_json = response.text
            load_val = json.loads(response_json)

            # Based on the mode application will send the return value
            if En_Dec_Mode == 'En':
                encrypt_ele = str(load_val['encrypt_val'])
                return_ele = encrypt_ele
            else:
                decrypt_ele = str(load_val['decrypt_val'])
                return_ele = decrypt_ele

            return return_ele

        except ValueError as v:
            raise ValueError

        except Exception as e:
            x = str(e)
            print(x)

            return 'Error'

Let’s discuss the key lines –

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

If you are running in a secure environment, your proxy or firewall may sometimes block you from accessing the API server, for example when the two sit on different networks. This snippet works around that by switching off HTTPS certificate verification. However, it is advisable not to use this in the Prod environment, since an unverified connection is open to interception.
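
To make the effect of the bypass visible, the sketch below compares the default SSL context with the unverified one, using only the standard library. It prints the settings & does not open any connection. Note that ssl._create_unverified_context is a private helper of the ssl module.

```python
import ssl

# Default context: certificates & host names are checked.
verified = ssl.create_default_context()

# The private helper used in the snippet builds a context with both checks off.
unverified = ssl._create_unverified_context()

print("Default verifies certificates:", verified.verify_mode == ssl.CERT_REQUIRED)
print("Default checks host name:", verified.check_hostname)
print("Unverified skips certificates:", unverified.verify_mode == ssl.CERT_NONE)
print("Unverified checks host name:", unverified.check_hostname)
```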

# Capturing the payload
data = self.payload

# Converting String to Json
json_data = json.loads(data)

This snippet will convert your JSON string payload into a Python dictionary, which is then sent to the API server as the JSON body of the request.
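
A small illustration of that conversion, using a payload in the same shape as the ones built by the client. The account number 12345 is a made-up sample value.

```python
import json

# Same shape as the client payloads; the data value is a made-up sample.
payload = '{"dataGroup":"GrDet","data":"12345","dataTemplate":"subGrAcct_Nbr"}'

json_data = json.loads(payload)

print(type(json_data).__name__)
print(json_data['dataTemplate'])
```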

response = requests.post(url, headers=headers, json=json_data, verify=False)
print("Complete Return Code:: ", str(response.status_code))
print("Return Code Initial::", str(response.status_code)[:1])

if str(response.status_code)[:1] == '2':
    # response = s.post(url, params=param, json=json_data, verify=False)
    success = True
else:
    wait = retries * 2
    print("Retry fails! Waiting " + str(wait) + " seconds and retrying.")
    time.sleep(wait)
    retries += 1
    # print('Return Service::')

# Checking Maximum Retries
if retries == max_retries:
    success = True
    raise ValueError

In the first 3 lines, the application sends the JSON request to the API Server & prints the status code of the response it receives.

The next block checks the status code. Any code starting with 2 counts as a success. Otherwise, the application waits for a period that grows with every attempt & then retries the request.

The last 3 lines say that if the application reaches the maximum allowable number of retries, it will terminate the process by raising an error.
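
The retry idea can be sketched on its own, without a real server. In the sketch below, post_with_retry & its send argument are hypothetical names. The send callable stands in for the requests.post call & only needs to return an object with a status_code attribute. Unlike the listing above, this version makes exactly max_retries attempts.

```python
import time


def post_with_retry(send, max_retries, sleep=time.sleep):
    """Call send() until it returns a 2xx status code or attempts run out.

    send is a hypothetical zero-argument callable that returns an object
    with a status_code attribute (for example, a requests response).
    """
    for attempt in range(1, max_retries + 1):
        response = send()
        if 200 <= response.status_code < 300:
            return response
        if attempt < max_retries:
            wait = attempt * 2  # linear back-off: 2, 4, 6, ... seconds
            sleep(wait)
    raise ValueError('No successful response after %d attempts' % max_retries)


class FakeResponse(object):
    """Stand-in for a server response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code


# Two failures followed by a success; waits are recorded instead of slept.
codes = iter([500, 503, 200])
waits = []
result = post_with_retry(lambda: FakeResponse(next(codes)), 5, sleep=waits.append)

print(result.status_code, waits)
```

Passing the sleep function as an argument keeps the demonstration fast, since the waits are collected in a list instead of pausing the program.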

# Capturing the response json from Web Service
response_json = response.text
load_val = json.loads(response_json)

Once it receives a valid response, the application will convert the response text into a Python dictionary, pick the encrypted or decrypted value from it & send that value back to the calling methods.
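
Here is a minimal sketch of that last step. The helper name extract_value is hypothetical & the sample replies are made up. The keys encrypt_val & decrypt_val are the ones used in the listing above.

```python
import json


def extract_value(response_text, mode):
    """Return the encrypted or decrypted string from the server's JSON reply."""
    load_val = json.loads(response_text)
    key = 'encrypt_val' if mode == 'En' else 'decrypt_val'
    return str(load_val[key])


# Made-up replies in the shape the client expects
sample_encrypt_reply = '{"encrypt_val": "gAAAAAB-sample-token"}'
sample_decrypt_reply = '{"decrypt_val": "12345"}'

print(extract_value(sample_encrypt_reply, 'En'))
print(extract_value(sample_decrypt_reply, 'De'))
```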

7. clsParam.py (This script contains the fundamental parameter values to run your client application. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 20-Jan-2019       ########
###########################################

import os

class clsParam(object):

    config = {
        'MAX_RETRY' : 5,
        'ENCRYPT_MODE' : 'En',
        'DECRYPT_MODE': 'De',
        'PATH' : os.path.dirname(os.path.realpath(__file__)),
        'SRC_DIR' : os.path.dirname(os.path.realpath(__file__)) + '/' + 'src_files/',
        'FIN_DIR': os.path.dirname(os.path.realpath(__file__)) + '/' + 'finished/',
        'ENCRYPT_URL': "http://192.168.0.13:5000/process/getEncrypt",
        'DECRYPT_URL': "http://192.168.0.13:5000/process/getDecrypt",
        'NUM_OF_THREAD': 20
    }

8. clsSerial.py (This script will show the usual or serial way to encrypt your data & then decrypt it & store the results in two separate csv files. Hence, the name comes into the picture.)

############################################
#### Written By: SATYAKI DE             ####
#### Written On: 10-Feb-2019            ####
#### Package Flask package needs to     ####
#### install in order to run this       ####
#### script.                            ####
####                                    ####
#### Objective: This script will        ####
#### initiate the encrypt/decrypt class ####
#### based on client supplied data      ####
#### using serial mode operation.       ####
############################################

import pandas as p
import clsWeb as cw
import datetime
from clsParam import clsParam as cf

# Disabling Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

class clsSerial(object):
    def __init__(self):
        self.path = cf.config['PATH']
        self.EncryptMode = str(cf.config['ENCRYPT_MODE'])
        self.DecryptMode = str(cf.config['DECRYPT_MODE'])

    # Lookup Methods for Encryption
    def encrypt_acctNbr(self, row):
        # Declaring Local Variable
        en_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctNbr = x.getResponse(EncryptMode)
        else:
            en_AcctNbr = ''

        fil_acct_nbr = ''

        return en_AcctNbr

    def encrypt_Name(self, row):
        # Declaring Local Variable
        en_AcctName = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctName = x.getResponse(EncryptMode)
        else:
            en_AcctName = ''

        return en_AcctName

    def encrypt_Phone(self, row):
        # Declaring Local Variable
        en_Phone = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Phone = x.getResponse(EncryptMode)
        else:
            en_Phone = ''

        return en_Phone

    def encrypt_Email(self, row):
        # Declaring Local Variable
        en_Email = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Email = x.getResponse(EncryptMode)
        else:
            en_Email = ''

        return en_Email

    # Lookup Methods for Decryption
    def decrypt_acctNbr(self, row):
        # Declaring Local Variable
        de_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctNbr = x.getResponse(EncryptMode)
        else:
            de_AcctNbr = ''

        return de_AcctNbr

    def decrypt_Name(self, row):
        # Declaring Local Variable
        de_AcctName = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctName = x.getResponse(EncryptMode)
        else:
            de_AcctName = ''

        return de_AcctName

    def decrypt_Phone(self, row):
        # Declaring Local Variable
        de_Phone = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Phone = x.getResponse(EncryptMode)
        else:
            de_Phone = ''

        return de_Phone

    def decrypt_Email(self, row):
        # Declaring Local Variable
        de_Email = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Email = x.getResponse(EncryptMode)
        else:
            de_Email = ''

        return de_Email

    def getEncrypt(self, df_payload):
        try:
            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            df_input = df_payload

            # Checking total count of rows
            count_row = df_input.shape[0]
            print('Total number of records to process:: ', count_row)

            # Deriving rows
            df_input['Encrypt_Acct_Nbr'] = df_input.apply(lambda row: self.encrypt_acctNbr(row), axis=1)
            df_input['Encrypt_Name'] = df_input.apply(lambda row: self.encrypt_Name(row), axis=1)
            df_input['Encrypt_Phone'] = df_input.apply(lambda row: self.encrypt_Phone(row), axis=1)
            df_input['Encrypt_Email'] = df_input.apply(lambda row: self.encrypt_Email(row), axis=1)

            # Dropping original columns
            df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

            # Renaming new columns with the old column names
            df_input.rename(columns={'Encrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
            df_input.rename(columns={'Encrypt_Name': 'Name'}, inplace=True)
            df_input.rename(columns={'Encrypt_Phone': 'Phone'}, inplace=True)
            df_input.rename(columns={'Encrypt_Email': 'Email'}, inplace=True)

            # New Column List Orders
            column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email', 'Serial_No']
            df_fin = df_input.reindex(column_order, axis=1)

            return df_fin
        except Exception as e:
            # A list with one dict builds a single-row dataframe from scalar values
            df_error = p.DataFrame([{'Acct_Nbr':str(e), 'Name':'', 'Acct_Addr_1':'', 'Acct_Addr_2':'', 'Phone':'', 'Email':'', 'Serial_No':''}])

            return df_error


    def getDecrypt(self, df_encrypted_payload):
        try:
            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            df_input = df_encrypted_payload

            # Checking total count of rows
            count_row = df_input.shape[0]
            print('Total number of records to process:: ', count_row)


            # Deriving rows
            df_input['Decrypt_Acct_Nbr'] = df_input.apply(lambda row: self.decrypt_acctNbr(row), axis=1)
            df_input['Decrypt_Name'] = df_input.apply(lambda row: self.decrypt_Name(row), axis=1)
            df_input['Decrypt_Phone'] = df_input.apply(lambda row: self.decrypt_Phone(row), axis=1)
            df_input['Decrypt_Email'] = df_input.apply(lambda row: self.decrypt_Email(row), axis=1)

            # Dropping original columns
            df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

            # Renaming new columns with the old column names
            df_input.rename(columns={'Decrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
            df_input.rename(columns={'Decrypt_Name': 'Name'}, inplace=True)
            df_input.rename(columns={'Decrypt_Phone': 'Phone'}, inplace=True)
            df_input.rename(columns={'Decrypt_Email': 'Email'}, inplace=True)

            # New Column List Orders
            column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email']
            df_fin = df_input.reindex(column_order, axis=1)

            return df_fin
        except Exception as e:
            # A list with one dict builds a single-row dataframe from scalar values
            df_error = p.DataFrame([{'Acct_Nbr':str(e), 'Name':'', 'Acct_Addr_1':'', 'Acct_Addr_2':'', 'Phone':'', 'Email':''}])

            return df_error

Key lines to discuss –

We’ll be looking into two main methods –

a. getEncrypt

b. getDecrypt

These two functions are identical in construction. One is for encryption & the other one is for decryption.

# Deriving rows
df_input['Encrypt_Acct_Nbr'] = df_input.apply(lambda row: self.encrypt_acctNbr(row), axis=1)
df_input['Encrypt_Name'] = df_input.apply(lambda row: self.encrypt_Name(row), axis=1)
df_input['Encrypt_Phone'] = df_input.apply(lambda row: self.encrypt_Phone(row), axis=1)
df_input['Encrypt_Email'] = df_input.apply(lambda row: self.encrypt_Email(row), axis=1)

As you can see, the application is processing row-by-row & column-by-column data transformations using look-up functions.

# Dropping original columns
df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

As the comment suggests, the application is dropping all the unencrypted source columns.

# Renaming new columns with the old column names
df_input.rename(columns={'Encrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
df_input.rename(columns={'Encrypt_Name': 'Name'}, inplace=True)
df_input.rename(columns={'Encrypt_Phone': 'Phone'}, inplace=True)
df_input.rename(columns={'Encrypt_Email': 'Email'}, inplace=True)

Once the application drops all the source columns, it will rename the new columns back to the old column names. As a result, the encrypted values sit under the original names alongside the rest of the data from the source csv.

# New Column List Orders
column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email', 'Serial_No']
df_fin = df_input.reindex(column_order, axis=1)

Once the application has finished all these transformations, it will re-sequence the columns, so that the output has the same column order as its source csv files.

Similar logic is applicable for the decryption as well.
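
The whole derive, drop, rename & re-order sequence can be tried on a tiny dataframe. The sketch below assumes pandas is installed. The function fake_encrypt is a hypothetical stand-in for the API call (it only reverses the text), & the sample rows are made up.

```python
import pandas as pd


def fake_encrypt(value):
    """Stand-in for the API call; it only reverses the text."""
    return str(value).strip()[::-1]


# Made-up sample rows with a subset of the columns used in the post
df = pd.DataFrame({
    'Acct_Nbr': [1001, 1002],
    'Name': ['Alice', 'Bob'],
    'Acct_Addr_1': ['1 Main St', '2 Side St'],
})

# Deriving new columns row by row, as the look-up methods do
df['Encrypt_Acct_Nbr'] = df.apply(lambda row: fake_encrypt(row['Acct_Nbr']), axis=1)
df['Encrypt_Name'] = df.apply(lambda row: fake_encrypt(row['Name']), axis=1)

# Dropping the originals, renaming the derived columns & restoring the order
df = df.drop(['Acct_Nbr', 'Name'], axis=1)
df = df.rename(columns={'Encrypt_Acct_Nbr': 'Acct_Nbr', 'Encrypt_Name': 'Name'})
df_fin = df.reindex(['Acct_Nbr', 'Name', 'Acct_Addr_1'], axis=1)

print(df_fin)
```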

As we know, many look-up methods take part in this process.

encrypt_acctNbr, encrypt_Name, encrypt_Phone, encrypt_Email
decrypt_acctNbr, decrypt_Name, decrypt_Phone, decrypt_Email

We’ll discuss only one method, as the others follow exactly the same pattern.

# Capturing essential values
EncryptMode = self.EncryptMode
lkp_acctNbr = row['Acct_Nbr']
str_acct_nbr = str(lkp_acctNbr)
fil_acct_nbr = str_acct_nbr.strip()

From the row, our application extracts the relevant column. In this case, it is Acct_Nbr. Then, it converts the value to a string & removes any leading or trailing white space from it.

# Forming JSON String for this field
json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

Once extracted, the application will build the target JSON string as per column data.
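
Building the JSON text by string concatenation works as long as the value contains no double quotes or backslashes. As a hedged alternative, the sketch below lets json.dumps do the escaping. The helper name build_payload is hypothetical & the sample name is made up.

```python
import json


def build_payload(value, template):
    """Serialise one field into the JSON string expected by the API."""
    return json.dumps({
        'dataGroup': 'GrDet',
        'data': value,
        'dataTemplate': template,
    })


# A made-up value with embedded double quotes
payload = build_payload('Dwayne "The Rock" Johnson', 'subGrName')

print(payload)
print(json.loads(payload)['data'])
```

The resulting string can be parsed back with json.loads without any error, even though the value contains quotes.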

# Identifying Length of the field
len_acct_nbr = len(fil_acct_nbr)

# This will trigger the service if it has valid data
if len_acct_nbr > 0:
    x = cw.clsWeb(json_source_str)
    en_AcctNbr = x.getResponse(EncryptMode)
else:
    en_AcctNbr = ''

Based on the length of the extracted value, our application will trigger the individual JSON request & will receive the encrypted value as a string in response. An empty value skips the call & returns an empty string.
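
Since every look-up method follows the same pattern, the pattern itself can be sketched as one generic function. The names lookup, call_service & fake_service are hypothetical. The call_service argument stands in for the clsWeb call, & a plain dictionary stands in for the dataframe row.

```python
def lookup(row, column, call_service):
    """Strip one field & call the service only when the value is non-empty.

    call_service is a hypothetical callable standing in for the
    clsWeb(...).getResponse(mode) call.
    """
    value = str(row[column]).strip()
    if len(value) > 0:
        return call_service(value)
    return ''


calls = []


def fake_service(value):
    """Records the value & returns a marked-up copy instead of calling an API."""
    calls.append(value)
    return 'enc(' + value + ')'


print(lookup({'Phone': ' 555-0100 '}, 'Phone', fake_service))
print(repr(lookup({'Phone': '   '}, 'Phone', fake_service)))
print(calls)
```

The second call never reaches the service, because the value is empty after stripping the white space.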

9. clsParallel.py (This script will use the queue to make asynchronous calls & perform the same encryption & decryption. Hence, the name comes into the picture.)

############################################
#### Written By: SATYAKI DE             ####
#### Written On: 10-Feb-2019            ####
#### Package Flask package needs to     ####
#### install in order to run this       ####
#### script.                            ####
####                                    ####
#### Objective: This script will        ####
#### initiate the encrypt/decrypt class ####
#### based on client supplied data.     ####
#### This script will use the advance   ####
#### queue & asynchronous calls to the  ####
#### API Server to process Encryption & ####
#### Decryption on our csv files.       ####
############################################
import pandas as p
import clsWeb as cw
import datetime
from clsParam import clsParam as cf
from multiprocessing import Lock, Process, Queue, freeze_support, JoinableQueue
import gc
import signal
import time
import os
import queue
import asyncio

# Declaring Global Variable
q = Queue()
lock = Lock()

finished_task = JoinableQueue()
pending_task = JoinableQueue()

sp_fin_dict = {}
dp_fin_dict = {}

# Disabling Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

class clsParallel(object):
    def __init__(self):
        self.path = cf.config['PATH']
        self.EncryptMode = str(cf.config['ENCRYPT_MODE'])
        self.DecryptMode = str(cf.config['DECRYPT_MODE'])
        self.num_worker_process = int(cf.config['NUM_OF_THREAD'])
        self.lock = Lock()

    # Lookup Methods for Encryption
    def encrypt_acctNbr(self, row):
        # Declaring Local Variable
        en_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctNbr = x.getResponse(EncryptMode)
        else:
            en_AcctNbr = ''

        fil_acct_nbr = ''

        return en_AcctNbr

    def encrypt_Name(self, row):
        # Declaring Local Variable
        en_AcctName = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_AcctName = x.getResponse(EncryptMode)
        else:
            en_AcctName = ''

        return en_AcctName

    def encrypt_Phone(self, row):
        # Declaring Local Variable
        en_Phone = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Phone = x.getResponse(EncryptMode)
        else:
            en_Phone = ''

        return en_Phone

    def encrypt_Email(self, row):
        # Declaring Local Variable
        en_Email = ''

        # Capturing essential values
        EncryptMode = self.EncryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            en_Email = x.getResponse(EncryptMode)
        else:
            en_Email = ''

        return en_Email

    # Lookup Methods for Decryption
    def decrypt_acctNbr(self, row):
        # Declaring Local Variable
        de_AcctNbr = ''
        json_source_str = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctNbr = row['Acct_Nbr']
        str_acct_nbr = str(lkp_acctNbr)
        fil_acct_nbr = str_acct_nbr.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_nbr + '","dataTemplate":"subGrAcct_Nbr"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_nbr)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctNbr = x.getResponse(EncryptMode)
        else:
            de_AcctNbr = ''

        return de_AcctNbr

    def decrypt_Name(self, row):
        # Declaring Local Variable
        de_AcctName = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_acctName = row['Name']
        str_acct_name = str(lkp_acctName)
        fil_acct_name = str_acct_name.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_acct_name + '","dataTemplate":"subGrName"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_acct_name)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_AcctName = x.getResponse(EncryptMode)
        else:
            de_AcctName = ''

        return de_AcctName

    def decrypt_Phone(self, row):
        # Declaring Local Variable
        de_Phone = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_phone = row['Phone']
        str_phone = str(lkp_phone)
        fil_phone = str_phone.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_phone + '","dataTemplate":"subGrPhone"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_phone)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Phone = x.getResponse(EncryptMode)
        else:
            de_Phone = ''

        return de_Phone

    def decrypt_Email(self, row):
        # Declaring Local Variable
        de_Email = ''

        # Capturing essential values
        EncryptMode = self.DecryptMode
        lkp_email = row['Email']
        str_email = str(lkp_email)
        fil_email = str_email.strip()

        # Forming JSON String for this field
        json_source_str = '{"dataGroup":"GrDet","data":"' + fil_email + '","dataTemplate":"subGrEmail"}'

        # Identifying Length of the field
        len_acct_nbr = len(fil_email)

        # This will trigger the service if it has valid data
        if len_acct_nbr > 0:
            x = cw.clsWeb(json_source_str)
            de_Email = x.getResponse(EncryptMode)
        else:
            de_Email = ''

        return de_Email

    def getEncrypt(self, df_dict):
        try:
            en_fin_dict = {}

            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            for k, v in df_dict.items():
                Process_Name = k
                df_input = v

            # Checking total count of rows
            count_row = int(df_input.shape[0])
            print('Part number of records to process:: ', count_row)

            if count_row > 0:

                # Deriving rows
                df_input['Encrypt_Acct_Nbr'] = df_input.apply(lambda row: self.encrypt_acctNbr(row), axis=1)
                df_input['Encrypt_Name'] = df_input.apply(lambda row: self.encrypt_Name(row), axis=1)
                df_input['Encrypt_Phone'] = df_input.apply(lambda row: self.encrypt_Phone(row), axis=1)
                df_input['Encrypt_Email'] = df_input.apply(lambda row: self.encrypt_Email(row), axis=1)

                # Dropping original columns
                df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

                # Renaming new columns with the old column names
                df_input.rename(columns={'Encrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
                df_input.rename(columns={'Encrypt_Name': 'Name'}, inplace=True)
                df_input.rename(columns={'Encrypt_Phone': 'Phone'}, inplace=True)
                df_input.rename(columns={'Encrypt_Email': 'Email'}, inplace=True)

                # New Column List Orders
                column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email', 'Serial_No']
                df_fin = df_input.reindex(column_order, axis=1)

                sp_fin_dict[Process_Name] = df_fin

            return sp_fin_dict
        except Exception as e:
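            # Wrapping each value in a list, as pandas cannot build a frame from scalars alone
            df_error = p.DataFrame({'Acct_Nbr': [str(e)], 'Name': [''], 'Acct_Addr_1': [''], 'Acct_Addr_2': [''], 'Phone': [''], 'Email': [''], 'Serial_No': ['']})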
            sp_fin_dict[Process_Name] = df_error

            return sp_fin_dict

    async def produceEncr(self, queue, l_dict):

        m_dict = {}

        m_dict = self.getEncrypt(l_dict)

        for k, v in m_dict.items():
            item = k
            print('producing {}...'.format(item))

        await queue.put(m_dict)


    async def consumeEncr(self, queue):
        result_dict = {}

        while True:
            # wait for an item from the producer
            sp_fin_dict.update(await queue.get())

            # process the item
            for k, v in sp_fin_dict.items():
                item = k
                print('consuming {}...'.format(item))

            # Notify the queue that the item has been processed
            queue.task_done()


    async def runEncrypt(self, n, df_input):
        l_dict = {}

        queue = asyncio.Queue()
        # schedule the consumer
        consumer = asyncio.ensure_future(self.consumeEncr(queue))

        start_pos = 0
        end_pos = 0

        num_worker_process = n

        count_row = df_input.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_process) + 1
        actual_worker_task = int(count_row / interval) + 1

        for i in range(actual_worker_task):
            name = 'Task-' + str(i)

            if ((start_pos + interval) < count_row):
                end_pos = start_pos + interval
            else:
                end_pos = start_pos + (count_row - start_pos)

            print("start_pos: ", start_pos)
            print("end_pos: ", end_pos)

            split_df = df_input.iloc[start_pos:end_pos]
            l_dict[name] = split_df

            if ((start_pos > count_row) | (start_pos == count_row)):
                break
            else:
                start_pos = start_pos + interval

            # run the producer and wait for completion
            await self.produceEncr(queue, l_dict)
            # wait until the consumer has processed all items
            await queue.join()

        # the consumer is still awaiting for an item, cancel it
        consumer.cancel()

        return sp_fin_dict


    def getEncryptParallel(self, df_payload):

        l_dict = {}
        data_dict = {}
        min_val_list = {}
        cnt = 1
        num_worker_process = self.num_worker_process
        actual_worker_task = 0
        number_of_processes = 4

        processes = []

        split_df = p.DataFrame()
        df_ret = p.DataFrame()
        dummy_df = p.DataFrame()

        # Assigning Target File Basic Name
        df_input = df_payload

        # Checking total count of rows
        count_row = df_input.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_process) + 1
        actual_worker_task = int(count_row/interval) + 1

        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.runEncrypt(actual_worker_task, df_input))
        loop.close()

        for k, v in sp_fin_dict.items():
            min_val_list[int(k.replace('Task-', ''))] = v

        min_val = min(min_val_list, key=int)
        print("Minimum Index Value: ", min_val)

        for k, v in sorted(sp_fin_dict.items(), key=lambda k: int(k[0].replace('Task-', ''))):
            if int(k.replace('Task-', '')) == min_val:
                df_ret = sp_fin_dict[k]
            else:
                d_frames = [df_ret, sp_fin_dict[k]]
                df_ret = p.concat(d_frames)

        return df_ret

    def getDecrypt(self, df_encrypted_dict):
        try:
            de_fin_dict = {}

            df_input = p.DataFrame()
            df_fin = p.DataFrame()

            # Assigning Target File Basic Name
            for k, v in df_encrypted_dict.items():
                Process_Name = k
                df_input = v

            # Checking total count of rows
            count_row = int(df_input.shape[0])
            print('Part number of records to process:: ', count_row)

            if count_row > 0:

                # Deriving rows
                df_input['Decrypt_Acct_Nbr'] = df_input.apply(lambda row: self.decrypt_acctNbr(row), axis=1)
                df_input['Decrypt_Name'] = df_input.apply(lambda row: self.decrypt_Name(row), axis=1)
                df_input['Decrypt_Phone'] = df_input.apply(lambda row: self.decrypt_Phone(row), axis=1)
                df_input['Decrypt_Email'] = df_input.apply(lambda row: self.decrypt_Email(row), axis=1)

                # Dropping original columns
                df_input.drop(['Acct_Nbr', 'Name', 'Phone', 'Email'], axis=1, inplace=True)

                # Renaming new columns with the old column names
                df_input.rename(columns={'Decrypt_Acct_Nbr':'Acct_Nbr'}, inplace=True)
                df_input.rename(columns={'Decrypt_Name': 'Name'}, inplace=True)
                df_input.rename(columns={'Decrypt_Phone': 'Phone'}, inplace=True)
                df_input.rename(columns={'Decrypt_Email': 'Email'}, inplace=True)

                # New Column List Orders
                column_order = ['Acct_Nbr', 'Name', 'Acct_Addr_1', 'Acct_Addr_2', 'Phone', 'Email', 'Serial_No']
                df_fin = df_input.reindex(column_order, axis=1)

                de_fin_dict[Process_Name] = df_fin

            return de_fin_dict

        except Exception as e:
            # Wrapping each value in a list, as pandas cannot build a frame from scalars alone
            df_error = p.DataFrame({'Acct_Nbr': [str(e)], 'Name': [''], 'Acct_Addr_1': [''], 'Acct_Addr_2': [''], 'Phone': [''], 'Email': [''], 'Serial_No': ['']})
            de_fin_dict[Process_Name] = df_error

            return de_fin_dict

    async def produceDecr(self, queue, l_dict):

        m_dict = {}

        m_dict = self.getDecrypt(l_dict)

        for k, v in m_dict.items():
            item = k
            print('producing {}...'.format(item))

        await queue.put(m_dict)


    async def consumeDecr(self, queue):
        result_dict = {}

        while True:
            # wait for an item from the producer
            dp_fin_dict.update(await queue.get())

            # process the item
            for k, v in dp_fin_dict.items():
                item = k
                print('consuming {}...'.format(item))

            # Notify the queue that the item has been processed
            queue.task_done()


    async def runDecrypt(self, n, df_input):
        l_dict = {}

        queue = asyncio.Queue()
        # schedule the consumer
        consumerDe = asyncio.ensure_future(self.consumeDecr(queue))

        start_pos = 0
        end_pos = 0

        num_worker_process = n

        count_row = df_input.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_process) + 1
        actual_worker_task = int(count_row / interval) + 1

        for i in range(actual_worker_task):
            name = 'Task-' + str(i)

            if ((start_pos + interval) < count_row):
                end_pos = start_pos + interval
            else:
                end_pos = start_pos + (count_row - start_pos)

            print("start_pos: ", start_pos)
            print("end_pos: ", end_pos)

            split_df = df_input.iloc[start_pos:end_pos]
            l_dict[name] = split_df

            if ((start_pos > count_row) | (start_pos == count_row)):
                break
            else:
                start_pos = start_pos + interval

            # run the producer and wait for completion
            await self.produceDecr(queue, l_dict)
            # wait until the consumer has processed all items
            await queue.join()

        # the consumer is still awaiting for an item, cancel it
        consumerDe.cancel()

        return dp_fin_dict


    def getDecryptParallel(self, df_payload):

        l_dict = {}
        data_dict = {}
        min_val_list = {}
        cnt = 1
        num_worker_process = self.num_worker_process
        actual_worker_task = 0
        number_of_processes = 4

        processes = []

        split_df = p.DataFrame()
        df_ret_1 = p.DataFrame()
        dummy_df = p.DataFrame()

        # Assigning Target File Basic Name
        df_input = df_payload

        # Checking total count of rows
        count_row = df_input.shape[0]
        print('Total number of records to process:: ', count_row)

        interval = int(count_row / num_worker_process) + 1
        actual_worker_task = int(count_row/interval) + 1

        # Creating a single fresh event loop & registering it as the current one
        loop_1 = asyncio.new_event_loop()
        asyncio.set_event_loop(loop_1)
        loop_2 = asyncio.get_event_loop()
        loop_2.run_until_complete(self.runDecrypt(actual_worker_task, df_input))
        loop_2.close()

        for k, v in dp_fin_dict.items():
            min_val_list[int(k.replace('Task-', ''))] = v

        min_val = min(min_val_list, key=int)
        print("Minimum Index Value: ", min_val)

        for k, v in sorted(dp_fin_dict.items(), key=lambda k: int(k[0].replace('Task-', ''))):
            if int(k.replace('Task-', '')) == min_val:
                df_ret_1 = dp_fin_dict[k]
            else:
                d_frames = [df_ret_1, dp_fin_dict[k]]
                df_ret_1 = p.concat(d_frames)

        return df_ret_1

I won’t discuss the remaining look-up methods, as the post is already pretty big. I’ll only address a few critical lines.

Under getEncryptParallel, the following lines are essential –

# Checking total count of rows
count_row = df_input.shape[0]
print('Total number of records to process:: ', count_row)

interval = int(count_row / num_worker_process) + 1
actual_worker_task = int(count_row/interval) + 1

Based on the total number of records in the dataframe, our application will split the main dataframe into several sub-dataframes & then pass them along through asynchronous queue calls.

loop = asyncio.get_event_loop()
loop.run_until_complete(self.runEncrypt(actual_worker_task, df_input))
loop.close()

These lines start the event loop, initiate our queue methods & pass our dataframe to them.
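A closed event loop cannot be reused, which is why the decrypt path later creates a new loop before running. Here is a minimal sketch of that idea. The names work & run_in_fresh_loop are hypothetical helpers for illustration & not part of the original scripts.

```python
import asyncio

async def work(label):
    # Stand-in for runEncrypt / runDecrypt (hypothetical coroutine)
    await asyncio.sleep(0)
    return label + " done"

def run_in_fresh_loop(coro):
    # Every call gets its own loop, so closing it never affects the next call
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()

first = run_in_fresh_loop(work("encrypt"))
second = run_in_fresh_loop(work("decrypt"))
print(first, "|", second)
```

With this pattern, the encrypt & decrypt calls do not depend on the order in which they are invoked.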

for k, v in sorted(sp_fin_dict.items(), key=lambda k: int(k[0].replace('Task-', ''))):
    if int(k.replace('Task-', '')) == min_val:
        df_ret = sp_fin_dict[k]
    else:
        d_frames = [df_ret, sp_fin_dict[k]]
        df_ret = p.concat(d_frames)

Our application sends & receives data using a dictionary. The reason is that we can’t expect the data to come back from our server in sequence; it may arrive in any order. Hence, using the keys, we maintain the final sequence, which ensures that the processed rows can be joined back to the correct sets of source data, i.e., the columns that are not candidates for encryption/decryption.
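To see why the numeric part of the key matters, here is a small sketch using plain lists in place of dataframes. The helper ordered_parts & the sample data are assumptions made only for this illustration.

```python
def ordered_parts(results):
    """Return the chunk values ordered by the numeric suffix of 'Task-N' keys."""
    keys = sorted(results, key=lambda k: int(k.replace('Task-', '')))
    return [results[k] for k in keys]

# Chunks arriving out of order (plain lists stand in for sub-dataframes)
results = {'Task-10': ['k'], 'Task-2': ['c'], 'Task-0': ['a']}

merged = [row for part in ordered_parts(results) for row in part]
print(merged)           # ['a', 'c', 'k']
print(sorted(results))  # plain string sort: ['Task-0', 'Task-10', 'Task-2']
```

A plain string sort would place Task-10 before Task-2, so converting the suffix to an integer is what keeps the rows in their source order.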

Let’s discuss the runEncrypt method.

for i in range(actual_worker_task):
    name = 'Task-' + str(i)

    if ((start_pos + interval) < count_row):
        end_pos = start_pos + interval
    else:
        end_pos = start_pos + (count_row - start_pos)

    print("start_pos: ", start_pos)
    print("end_pos: ", end_pos)

    split_df = df_input.iloc[start_pos:end_pos]
    l_dict[name] = split_df

    if ((start_pos > count_row) | (start_pos == count_row)):
        break
    else:
        start_pos = start_pos + interval

Here, our application splits the source dataframe into multiple sub-dataframes, which can then be processed in parallel using queues.
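The boundary arithmetic can be checked on its own, without any dataframe. The following is a minimal sketch that uses the same interval formula as the post. The function name chunk_bounds is hypothetical.

```python
def chunk_bounds(count_row, num_workers):
    """Yield (start_pos, end_pos) pairs that together cover count_row rows."""
    interval = count_row // num_workers + 1  # same formula as in the post
    for start_pos in range(0, count_row, interval):
        # The last chunk is clipped so that it never runs past the final row
        yield start_pos, min(start_pos + interval, count_row)

bounds = list(chunk_bounds(10, 3))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can be fed straight into df_input.iloc[start_pos:end_pos] to get the corresponding sub-dataframe.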

# run the producer and wait for completion
await self.produceEncr(queue, l_dict)
# wait until the consumer has processed all items
await queue.join()

These lines invoke the encryption/decryption process using queues. The last line is significant: queue.join() blocks until every item placed into the queue has been consumed & marked as done. Hence, your main program will wait until it has processed all the records of your dataframe.

The two methods named produceEncr & consumeEncr do the queue work: produceEncr encrypts a chunk & places the result inside the queue, & consumeEncr then retrieves it from the queue.

A few important lines from both methods are –

#produceEncr
await queue.put(m_dict)

#consumeEncr
# wait for an item from the producer
sp_fin_dict.update(await queue.get())

# Notify the queue that the item has been processed
queue.task_done()

From the first two lines, one can see that the application places its item into the queue. The rest of the lines come from the other method: our application pours the data into the dictionary, which will be returned to the calling method. The last line is essential. Without the task_done() call, queue.join() will keep waiting for the item to be marked as processed, which results in an infinite wait, i.e., a deadlock.
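The interplay of put, get, task_done & join can be seen in a stripped-down sketch. This is not the original class; the names producer, consumer, run & process are assumptions, & upper-casing a string stands in for the encryption call.

```python
import asyncio
import contextlib

async def producer(queue, items):
    for item in items:
        await queue.put(item)

async def consumer(queue, results):
    while True:
        item = await queue.get()      # wait for an item from the producer
        results.append(item.upper())  # stand-in for the real processing
        queue.task_done()             # without this, queue.join() never returns

async def run(items):
    queue = asyncio.Queue()
    results = []
    worker = asyncio.create_task(consumer(queue, results))
    await producer(queue, items)
    await queue.join()                # blocks until every item is marked done
    worker.cancel()                   # the consumer loops forever, so stop it
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return results

def process(items):
    return asyncio.run(run(items))

print(process(['a', 'b']))  # ['A', 'B']
```

If you comment out the task_done() line, the program hangs at queue.join(), which is exactly the infinite wait described above.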

10. callClient.py (This script will trigger both the serial & parallel process of encryption one by one & finally capture some statistics. Hence, the name comes into the picture.)

############################################
#### Written By: SATYAKI DE             ####
#### Written On: 10-Feb-2019            ####
#### Package Flask package needs to     ####
#### install in order to run this       ####
#### script.                            ####
####                                    ####
#### Objective: This script will        ####
#### initiate the encrypt/decrypt class ####
#### based on client supplied data.     ####
############################################
import pandas as p
import clsSerial as cs
import time
import datetime
from clsParam import clsParam as cf
import clsParallel as cp
import sys

def main():
    source_df = p.DataFrame()
    encrypted_df = p.DataFrame()
    source_encrypted_df = p.DataFrame()
    decrypted_df = p.DataFrame()
    encrypted_parallel_df = p.DataFrame()
    source_encrypted_parallel_df = p.DataFrame()
    decrypted_parallel_df = p.DataFrame()

    ###############################################################################
    #####                Start Of Serial Encryption Methods                  ######
    ###############################################################################

    print("-" * 157)

    startEnTime = time.time()
    srcFile = 'acct_addr_20180106'
    srcFileWithPath = str(cf.config['SRC_DIR']) + srcFile + '.csv'

    print("Calling Serial Process to Encrypt!")

    # Reading source file
    source_df = p.read_csv(srcFileWithPath, index_col=False)

    # Calling Encrypt Methods
    x = cs.clsSerial()
    encrypted_df = x.getEncrypt(source_df)

    # Handling Multiple source files
    var = datetime.datetime.now().strftime("%H.%M.%S")
    print('Target File Extension will contain the following:: ', var)

    targetFile = srcFile + '_Serial_'
    taregetFileWithPath = str(cf.config['FIN_DIR']) + targetFile + var + '.csv'

    # Finally Storing them into csv
    encrypted_df.to_csv(taregetFileWithPath, index=False)

    endEnTime = time.time()
    z1 = str(endEnTime - startEnTime)
    print("Over All Encrypt Process Time:", z1)

    time.sleep(20)

    ###############################################################################
    #####                Start Of Serial Decryption Methods                  ######
    ###############################################################################

    print("-" * 157)

    startDeTime = time.time()
    srcFileWithPath = taregetFileWithPath

    print("Calling Serial Process to Decrypt!")

    # Reading source file
    source_encrypted_df = p.read_csv(srcFileWithPath, index_col=False)

    # Calling Decrypt Methods
    x = cs.clsSerial()
    decrypted_df = x.getDecrypt(source_encrypted_df)

    targetFile = srcFile + '_restored_'
    taregetFileWithPath = str(cf.config['FIN_DIR']) + targetFile + var + '.csv'

    # Finally Storing them into csv
    decrypted_df.to_csv(taregetFileWithPath, index=False)

    endDeTime = time.time()
    z2 = str(endDeTime - startDeTime)
    print("Over All Decrypt Process Time:", z2)

    print("-" * 157)

    ###############################################################################
    #####        End Of Serial Encryption/Decryption Methods                 ######
    ###############################################################################

    time.sleep(20)

    ###############################################################################
    #####                Start Of Parallel Encryption Methods                ######
    ###############################################################################

    print("-" * 157)

    startEnTime = time.time()
    srcFileWithPath = str(cf.config['SRC_DIR']) + srcFile + '.csv'

    print("Calling Parallel Process to Encrypt!")

    # Reading source file
    source_df = p.read_csv(srcFileWithPath, index_col=False)

    # Calling Encrypt Methods
    x = cp.clsParallel()
    encrypted_parallel_df = x.getEncryptParallel(source_df)

    # Handling Multiple source files
    var = datetime.datetime.now().strftime("%H.%M.%S")
    print('Target File Extension will contain the following:: ', var)

    targetFile = srcFile + '_Parallel_'
    taregetFileWithPath = str(cf.config['FIN_DIR']) + targetFile + var + '.csv'

    # Finally Storing them into csv
    encrypted_parallel_df.to_csv(taregetFileWithPath, index=False)

    endEnTime = time.time()
    z3 = str(endEnTime - startEnTime)
    print("Over All Encrypt Process Time:", z3)

    time.sleep(20)

    ###############################################################################
    #####                Start Of Parallel Decryption Methods                ######
    ###############################################################################

    print("-" * 157)

    startDeTime = time.time()
    srcFileWithPath = taregetFileWithPath

    print("Calling Parallel Process to Decrypt!")

    # Reading source file
    source_encrypted_parallel_df = p.read_csv(srcFileWithPath, index_col=False)

    # Calling Decrypt Methods
    x = cp.clsParallel()
    decrypted_parallel_df = x.getDecryptParallel(source_encrypted_parallel_df)

    targetFile = srcFile + '_restored_'
    taregetFileWithPath = str(cf.config['FIN_DIR']) + targetFile + var + '.csv'

    # Finally Storing them into csv
    decrypted_parallel_df.to_csv(taregetFileWithPath, index=False)

    endDeTime = time.time()
    z4 = str(endDeTime - startDeTime)
    print("Over All Decrypt Process Time:", z4)

    print("-" * 157)

    ###############################################################################
    #####        End Of Parallel Encryption/Decryption Methods               ######
    ###############################################################################

    ###############################################################################
    #####    Final Statistics between Serial & Parallel loading.             ######
    ###############################################################################

    print("-" * 157)
    print("Serial Encryption:: ", z1)
    print("Serial Decryption:: ", z2)
    print("-" * 157)
    print("Parallel Encryption:: ", z3)
    print("Parallel Decryption:: ", z4)
    print("-" * 157)


if __name__ == '__main__':
    main()

As you can see, we’ve triggered both the serial & parallel applications from the main callable script.
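The start/end timing pattern repeats four times in the script. As a sketch only, it could be folded into one small helper. The name timed is hypothetical, & sum stands in for calls such as getEncrypt or getDecryptParallel.

```python
import time

def timed(func, *args, **kwargs):
    """Run func & return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# sum() is only a placeholder for the real encrypt/decrypt call
result, elapsed = timed(sum, range(1000))
print("Over All Process Time:", elapsed)
```

time.perf_counter() is used here because it is meant for measuring short durations, whereas time.time() reports wall-clock time.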

Let’s explore the output –

Windows:

Mac:

Note that you have to open two different Windows command prompts or MAC terminals. One will run the server & the other will run the client to simulate this.

Server:

Clients:

Win:

MAC:

So, finally, we’ve achieved our goal. Today we’ve covered a somewhat long but beneficial & advanced crossover of stones from our Python verse. 🙂

A lot more innovative posts are coming.

Till then – Happy Avenging!

Pandas, Numpy, Encryption/Decryption, Hidden Files In Python (Crossover between Space Stone, Reality Stone & Mind Stone of Python-Verse)

Posted on February 3, 2019 (updated September 23, 2019) by SatyakiDe in Data Science, exposure, filepermission, function, numpy, objects, operating system, Pandas, Python, read, replace, Technology

So, here we come up with another crossover of Space Stone, Reality Stone & Mind Stone of Python-Verse. It is indeed exciting & I cannot wait to explore that part further. Today, in this post, we’ll see how one application can integrate all these key ingredients in Python to serve the purpose. Our key focus will be on popular packages like Pandas & Numpy, along with popular encryption/decryption techniques, which involve some hidden files as well.

So, our objective here is to proceed with the encryption & decryption technique. But, there is a catch. We need to store some salt or tokenized value inside a hidden file. Our application will extract the salt value from it & then based on that it will perform Encrypt/Decrypt on the data.

Why do we need this approach?

The answer is simple. On many occasions, we don’t want to store our actual credentials in configuration files. Also, we don’t want to leave our keys open to other developers. There are many ways you can achieve this kind of security. Today, I’ll be showing a different approach to achieve the same.

Let’s explore.

As usual, I’ll provide the solution, which is tested on Windows & MAC, along with the scripts. Also, I’ll explain the critical lines of those scripts so that they can be understood from a layman’s point of view. And, I won’t explain any script that I’ve already explained in my earlier posts. So, you’ll have to refer to my old posts for that.

To encrypt & decrypt, we need the following file, which contains credentials in a csv. Please find the sample data –

Config_orig.csv

Please see the file, which will be hidden by the application process.

As you can see, this column contains the salt, which will be used in our Encryption/Decryption.

1. clsL.py (This script will create the csv files or any intermediate debug csv file after the corresponding process. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 25-Jan-2019       ########
####                               ########
#### Objective: Log File           ########
###########################################
import pandas as p
import platform as pl
from clsParam import clsParam as cf

class clsL(object):
    def __init__(self):
        self.path = cf.config['PATH']

    def logr(self, Filename, Ind, df, subdir=None):
        try:
            x = p.DataFrame()
            x = df
            sd = subdir

            os_det = pl.system()

            if sd is None:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + Filename
                else:
                    fullFileName = self.path + '/' + Filename
            else:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + sd + "\\" + Filename
                else:
                    fullFileName = self.path + '/' + sd + "/" + Filename

            if Ind == 'Y':
                x.to_csv(fullFileName, index=False)

            return 0

        except Exception as e:
            y = str(e)
            print(y)
            return 3
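As a side note, the per-OS separator branches above can be avoided, since os.path.join picks the right separator for the running platform. The following is only a sketch; build_log_path is a hypothetical helper & not part of clsL.py.

```python
import os

def build_log_path(base_path, filename, subdir=None):
    """Join the pieces with the separator of the current OS."""
    if subdir is None:
        return os.path.join(base_path, filename)
    return os.path.join(base_path, subdir, filename)

print(build_log_path('logs', 'debug.csv'))
print(build_log_path('logs', 'debug.csv', subdir='stage1'))
```

The same call works unchanged on Windows & MAC, so the platform check is no longer needed for building the file name.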

2. clsParam.py (This is the script that will be used as a parameter file & will be used in other python scripts.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 25-Jan-2019       ########
#### Objective: Parameter File     ########
###########################################

import os
import platform as pl

class clsParam(object):

    config = {
        'FILENAME' : 'test.amca',
        'OSX_MOD_FILE_NM': '.test.amca',
        'CURR_PATH': os.path.dirname(os.path.realpath(__file__)),
        'NORMAL_FLAG': 32,
        'HIDDEN_FLAG': 34,
        'OS_DET': pl.system()
    }

3. clsWinHide.py (This script contains the core logic of hiding/unhiding a file under Windows OS. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE          ######
#### Written On: 25-Jan-2019         ######
####                                 ######
#### This script will hide or Unhide ######
#### Files in Windows.               ######
###########################################

import win32file
import win32con
from clsParam import clsParam as cp

class clsWinHide(object):
    def __init__(self):
        self.path = cp.config['CURR_PATH']
        self.FileName = cp.config['FILENAME']
        self.normal_file_flag = cp.config['NORMAL_FLAG']

    def doit(self):
        try:
            path = self.path
            FileName = self.FileName

            FileNameWithPath = path + '\\' + FileName
            flags = win32file.GetFileAttributesW(FileNameWithPath)
            win32file.SetFileAttributes(FileNameWithPath,win32con.FILE_ATTRIBUTE_HIDDEN | flags)

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

    def undoit(self):
        try:
            path = self.path
            FileName = self.FileName
            normal_file_flag = self.normal_file_flag

            FileNameWithPath = path + '\\' + FileName
            win32file.SetFileAttributes(FileNameWithPath,win32con.FILE_ATTRIBUTE_NORMAL | int(normal_file_flag))

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

Key lines that we would like to explore are as follows –

def doit()

flags = win32file.GetFileAttributesW(FileNameWithPath)
win32file.SetFileAttributes(FileNameWithPath,win32con.FILE_ATTRIBUTE_HIDDEN | flags)

The above two lines under the doit() function read the file’s current attributes & switch the file to hidden mode in Windows OS by adding the FILE_ATTRIBUTE_HIDDEN flag.

def undoit()

normal_file_flag = self.normal_file_flag

FileNameWithPath = path + '\\' + FileName
win32file.SetFileAttributes(FileNameWithPath,win32con.FILE_ATTRIBUTE_NORMAL | int(normal_file_flag))

As the script suggests, the application sets the file attribute of the hidden file to FILE_ATTRIBUTE_NORMAL combined with the flag from the parameter file, which makes the file appear as a normal Windows file again.
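File attributes on Windows are bit flags, so hiding & unhiding boils down to setting & clearing one bit. The sketch below shows only that arithmetic, using the constants from Python’s standard stat module instead of win32con, so it runs on any OS. It does not touch a real file, & the helper names are hypothetical.

```python
import stat

def hide_flags(flags):
    # Set the hidden bit & keep every other attribute untouched
    return flags | stat.FILE_ATTRIBUTE_HIDDEN

def unhide_flags(flags):
    # Clear only the hidden bit
    return flags & ~stat.FILE_ATTRIBUTE_HIDDEN

archive = stat.FILE_ATTRIBUTE_ARCHIVE  # 32
hidden = hide_flags(archive)           # 32 | 2
print(archive, hidden, unhide_flags(hidden))  # 32 34 32
```

The values 32 & 34 are the same numbers that appear as NORMAL_FLAG & HIDDEN_FLAG in clsParam.py.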

4. clsOSXHide.py (This script contains the core logic of hiding/unhiding a file under OSX, i.e., MAC OS. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE           #####
#### Written On: 25-Jan-2019          #####
####                                  #####
#### Objective: This script will hide #####
#### or Unhide the file in OSX.       #####
###########################################

import os
from clsParam import clsParam as cp

class clsOSXHide(object):
    def __init__(self):
        self.path = cp.config['CURR_PATH']
        self.FileName = cp.config['FILENAME']
        self.OSX_Mod_FileName = cp.config['OSX_MOD_FILE_NM']
        self.normal_file_flag = cp.config['NORMAL_FLAG']

    def doit(self):
        try:
            path = self.path
            FileName = self.FileName

            FileNameWithPath = path + '/' + FileName
            os.rename(FileNameWithPath, os.path.join(os.path.dirname(FileNameWithPath),'.'
                                                     + os.path.basename(FileNameWithPath)))

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

    def undoit(self):
        try:
            path = self.path
            FileName = self.FileName
            OSX_Mod_FileName = self.OSX_Mod_FileName

            FileNameWithPath = path + '/' + FileName
            os.rename(OSX_Mod_FileName, FileNameWithPath)

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

The key lines that we’ll be exploring here are as follows –

def doit()

FileNameWithPath = path + '/' + FileName
os.rename(FileNameWithPath, os.path.join(os.path.dirname(FileNameWithPath),'.'
                                         + os.path.basename(FileNameWithPath)))

In MAC or Linux, any file whose name starts with ‘.’ is treated as a hidden file. Hence, we’re hiding the file by renaming it this way.

def undoit()

OSX_Mod_FileName = self.OSX_Mod_FileName

FileNameWithPath = path + '/' + FileName
os.rename(OSX_Mod_FileName, FileNameWithPath)

In this case, our application simply renames the file back to its original name so that it appears as a normal file again.

Note that in Linux or MAC, you have a lot of other ways to restrict access to files, as these systems offer much more granular access control. But, I thought, why not take a slightly different & fun way to achieve the same. After all, we’re building an Infinity War for the Python verse. A little bit of fun will certainly make some sense. 🙂
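The rename round trip can be tried safely inside a temporary directory. This is a minimal sketch & not the original class; hide & unhide are hypothetical helpers that always work with full paths, so they do not depend on the current working directory.

```python
import os
import tempfile

def hide(path):
    """Rename 'name' to '.name' in the same directory."""
    directory, name = os.path.split(path)
    hidden_path = os.path.join(directory, '.' + name)
    os.rename(path, hidden_path)
    return hidden_path

def unhide(hidden_path):
    """Rename '.name' back to 'name' in the same directory."""
    directory, name = os.path.split(hidden_path)
    visible_path = os.path.join(directory, name.lstrip('.'))
    os.rename(hidden_path, visible_path)
    return visible_path

with tempfile.TemporaryDirectory() as tmp_dir:
    original = os.path.join(tmp_dir, 'test.amca')
    open(original, 'w').close()          # create an empty sample file
    hidden = hide(original)
    hidden_name = os.path.basename(hidden)
    restored = unhide(hidden)
    round_trip_ok = os.path.exists(restored) and restored == original

print(hidden_name, round_trip_ok)
```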

5. clsProcess.py (This script will invoke any of the hide scripts, i.e. clsWinHide.py or clsOSXHide.py based on the OS platform. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE          ######
#### Written On: 25-Jan-2019         ######
####                                 ######
#### Objective: Based on the OS, this######
#### script calls the actual script. ######
###########################################

from clsParam import clsParam as cp

plat_det = cp.config['OS_DET']

# Based on the platform
# Application is loading subprocess
# in order to avoid library missing
# case against cross platform

if plat_det == "Windows":
    import clsWinHide as win
else:
    import clsOSXHide as osx

# End of conditional class load

class clsProcess(object):
    def __init__(self):
        self.os_det = plat_det

    def doit(self):
        try:

            os_det = self.os_det
            print("OS Info: ", os_det)

            if os_det == "Windows":
                win_doit = win.clsWinHide()
                ret_val = win_doit.doit()
            else:
                osx_doit = osx.clsOSXHide()
                ret_val = osx_doit.doit()

            return ret_val
        except Exception as e:
            x = str(e)
            print(x)

            return 1

    def undoit(self):
        try:

            os_det = self.os_det
            print("OS Info: ", os_det)

            if os_det == "Windows":
                win_doit = win.clsWinHide()
                ret_val = win_doit.undoit()
            else:
                osx_doit = osx.clsOSXHide()
                ret_val = osx_doit.undoit()

            return ret_val
        except Exception as e:
            x = str(e)
            print(x)

            return 1

Key lines to explore are as follows –

from clsParam import clsParam as cp

plat_det = cp.config['OS_DET']

# Based on the platform
# Application is loading subprocess
# in order to avoid library missing
# case against cross platform

if plat_det == "Windows":
    import clsWinHide as win
else:
    import clsOSXHide as osx

This step is essential to run the same python scripts in both environments, i.e. Mac & Windows in this case.

So, based on the platform details, which the application gets from the clsParam class, it loads only the class that matches the platform. Why is this so important?

Under Windows, the application will work even if you load both the classes. But, under Mac, it will fail, because Python tries to import every library a module needs as soon as that module is loaded, & the pywin32/pypiwin32 package may not be available on a Mac. You are not even using that package there. So, this conditional class loading is significant.
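The same idea can be written as a tiny loader function. This is a hedged sketch & not part of the original scripts: the helper names hider_module_name & load_hider are hypothetical, & it falls back to platform.system() instead of reading OS_DET from clsParam.

```python
import importlib
import platform


def hider_module_name(os_name):
    """Map an OS name (as returned by platform.system()) to a module name."""
    return 'clsWinHide' if os_name == 'Windows' else 'clsOSXHide'


def load_hider(os_name=None):
    """Import only the module that matches the OS.

    The other module is never imported, so its platform-specific
    dependencies do not need to be installed.
    """
    os_name = os_name or platform.system()
    return importlib.import_module(hider_module_name(os_name))
```

Calling load_hider() needs clsWinHide.py or clsOSXHide.py on the import path, so only the name mapping can be tried in isolation.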

os_det = self.os_det
print("OS Info: ", os_det)

if os_det == "Windows":
    win_doit = win.clsWinHide()
    ret_val = win_doit.doit()
else:
    osx_doit = osx.clsOSXHide()
    ret_val = osx_doit.doit()

As you can see, based on the OS, it invokes the matching method of the corresponding class.

6. clsEnDec.py (This script will read the credentials from a csv file & then based on the salt captured from the hidden file, it will either encrypt or decrypt the content. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 25-Jan-2019       ########
#### Package Cryptography needs to ########
#### install in order to run this  ########
#### script.                       ########
####                               ########
#### Objective: This script will   ########
#### encrypt/decrypt based on the  ########
#### hidden supplied salt value.   ########
###########################################

import pandas as p
from cryptography.fernet import Fernet

class clsEnDec(object):

    def __init__(self, token):
        # Calculating Key
        self.token = token

    def encrypt_str(self):
        try:
            # Capturing the Salt Information
            salt = self.token
            # Fetching the content of lookup file
            df_orig = p.read_csv('Config_orig.csv', index_col=False)

            # Checking Individual Types inside the Dataframe
            cipher = Fernet(salt)

            df_orig['User'] = df_orig['User'].apply(lambda x1: cipher.encrypt(bytes(x1,'utf8')))
            df_orig['Pwd'] = df_orig['Pwd'].apply(lambda x2: cipher.encrypt(bytes(x2,'utf8')))

            # Writing to the File
            df_orig.to_csv('Encrypt_Config.csv', index=False)

            return 0
        except Exception as e:
            x = str(e)
            print(x)
            return 1

    def decrypt_str(self):
        try:
            # Capturing the Salt Information
            salt = self.token
            # Checking Individual Types inside the Dataframe
            cipher = Fernet(salt)

            # Fetching the Encrypted csv file
            df_orig = p.read_csv('Encrypt_Config.csv', index_col=False)

            df_orig['User'] = df_orig['User'].apply(lambda x1: str(cipher.decrypt(bytes(x1[2:-1],'utf8'))).replace("b'","").replace("'",""))
            df_orig['Pwd'] = df_orig['Pwd'].apply(lambda x2: str(cipher.decrypt(bytes(x2[2:-1],'utf8'))).replace("b'","").replace("'",""))

            # Writing to the file
            df_orig.to_csv('Decrypt_Config.csv', index=False)

            return 0
        except Exception as e:
            x = str(e)
            print(x)
            return 1

Key lines from this script are as follows –

def encrypt_str()

# Checking Individual Types inside the Dataframe
cipher = Fernet(salt)

df_orig['User'] = df_orig['User'].apply(lambda x1: cipher.encrypt(bytes(x1,'utf8')))
df_orig['Pwd'] = df_orig['Pwd'].apply(lambda x2: cipher.encrypt(bytes(x2,'utf8')))

So, once you have captured the salt from that hidden file, the application receives that value here & uses it as the key for the Fernet cipher. Based on that, both the fields will be encrypted. Note that the cryptography package is required for this. Also, encrypt() accepts only bytes. Hence, we’ve used the bytes() function over here.

def decrypt_str()

cipher = Fernet(salt)

# Fetching the Encrypted csv file
df_orig = p.read_csv('Encrypt_Config.csv', index_col=False)

df_orig['User'] = df_orig['User'].apply(lambda x1: str(cipher.decrypt(bytes(x1[2:-1],'utf8'))).replace("b'","").replace("'",""))
df_orig['Pwd'] = df_orig['Pwd'].apply(lambda x2: str(cipher.decrypt(bytes(x2[2:-1],'utf8'))).replace("b'","").replace("'",""))

Again, in this step, our application picks up the salt, retrieves the encrypted values of the corresponding fields & applies the decryption logic on top of them. Note that decrypt() also expects bytes & returns bytes. Hence, once converted with str(), your output will be wrapped as b’xxxxx’. To strip that, we’ve used the replace() functions. You can use a regular expression with pattern matching as well.
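As an alternative to str() followed by replace(), the bytes can be decoded directly, which keeps any quote characters inside the value intact. The sketch below uses only the standard library; the helper names cell_to_bytes & bytes_to_text & the sample values are assumptions for illustration. With Fernet, the same decode('utf8') call would be applied to the bytes that decrypt() returns.

```python
import ast


def cell_to_bytes(cell):
    """Turn the text "b'...'" stored in the csv back into a bytes object."""
    value = ast.literal_eval(cell)
    if not isinstance(value, bytes):
        raise ValueError('expected a bytes literal')
    return value


def bytes_to_text(raw):
    """Decode bytes to str without touching the characters inside."""
    return raw.decode('utf8')


# What str() + replace() does to a value that contains a quote.
plain = b"pa'ss"
naive = str(plain).replace("b'", "").replace("'", "")
clean = bytes_to_text(plain)

# A csv cell holding the printed form of a bytes token (made-up value).
token = cell_to_bytes("b'gAAAAABc'")
```

Here, clean keeps the quote inside the password, while naive silently loses it.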

7. callEnDec.py (This is the main calling script. It unhides the salt file, reads the salt, calls the encryption & decryption methods, and finally hides the salt file again. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE           #####
#### Written On: 25-Jan-2019          #####
####                                  #####
#### Objective: Main calling function #####
###########################################

import clsEnDec as ed
import clsProcess as h
from clsParam import clsParam as cp
import time as t
import pandas as p

def main():
    print("")
    print("#" * 60)
    print("Calling (Encryption/Decryption) Package!!")
    print("#" * 60)
    print("")

    # Unhiding the file
    x = h.clsProcess()
    ret_val_unhide = x.undoit()

    if ret_val_unhide == 0:
        print("Successfully Unhide the file!")
    else:
        print("Unsuccessful to Unhide the file!")

    # To See the Unhide file
    t.sleep(10)

    print("*" * 60)
    print("Proceeding with Encryption...")
    print("*" * 60)

    # Getting Salt Value from the hidden files
    # by temporarily making it available
    FileName = cp.config['FILENAME']
    df = p.read_csv(FileName, index_col=False)
    salt = str(df.iloc[0]['Token_Salt'])
    print("-" * 60)
    print("Salt: ", salt)
    print("-" * 60)

    # Calling the Encryption Method
    x = ed.clsEnDec(salt)
    ret_val = x.encrypt_str()

    if ret_val == 0:
        print("Encryption Successful!")
    else:
        print("Encryption Failure!")

    print("")
    print("*" * 60)
    print("Checking Decryption Now...")
    print("*" * 60)

    # Calling the Decryption Method
    ret_val1 = x.decrypt_str()

    if ret_val1 == 0:
        print("Decryption Successful!")
    else:
        print("Decryption Failure!")

    # Hiding the salt file
    x = h.clsProcess()
    ret_val_hide = x.doit()

    if ret_val_hide == 0:
        print("Successfully Hide the file!")
    else:
        print("Unsuccessful to Hide the file!")

    print("*" * 60)
    print("Operation Done!")
    print("*" * 60)

if __name__ == '__main__':
    main()

And, here comes the final calling script.

The key lines that we would like to discuss –

# Getting Salt Value from the hidden files
# by temporarily making it available
FileName = cp.config['FILENAME']
df = p.read_csv(FileName, index_col=False)
salt = str(df.iloc[0]['Token_Salt'])

As shown earlier, our hidden file contains only 1 row & 1 column. To extract that specific value, we’ve used iloc with the row number as 0 along with the column name, i.e. Token_Salt.
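Here is a minimal, self-contained version of that lookup. The csv text & the value abc123= are made up for the example; only the column name Token_Salt comes from the script above.

```python
import io

import pandas as pd

# Stand-in for the hidden file: one column, one row (made-up value).
csv_text = "Token_Salt\nabc123=\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=False)

by_position = str(df.iloc[0]['Token_Salt'])   # row 0, then column by name
by_label = str(df.at[0, 'Token_Salt'])        # same cell, label-based lookup
```

Both expressions return the same cell; at is a convenient option when you only need a single value.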

Now, let’s see how it runs –

Windows (64 bit):

Mac (32 bit):

So, from the screenshot, we can see our desired output & you can calculate the aggregated value based on our sample provided in the previous screenshot.

Let’s check the Encrypted & Decrypted values –

Encrypted Values (Encrypt_Config.csv):

Decrypted Values (Decrypt_Config.csv):

So, finally, we’ve achieved our target.

I hope this gives you some more insight into the Python verse. Let me know what you think about this post.

Till then – Happy Avenging!

Pandas, Numpy, JSON & SSL (Crossover of Space Stone & Reality Stone in Python Verse)

Posted on January 29, 2019 (updated September 23, 2019) by SatyakiDe in code, Data Science, operating system, Pandas, Python

In our last installment, we’ve shown pandas & numpy based on a specific situation. If that was our Space Stone installment of the Python verse, then this one is a more interesting crossover of the Space Stone & the Reality Stone. Yes. You are right. We’ll be discussing one requirement, where we need many of these packages in a single task.

Let’s dive into it!

Let’s assume that we have a source csv file which has the following data –

Now, the requirement is – we need to send requests built from this data to a third-party web service to get the City & State, & based on that, we need to find the total number of items sold against each State & City.

Let’s look into our third-party API site –

Please find the third-party API Link

As per the agreement with this website, any developer can test 10 calls per day for free. After that, it will send the response with encrypted values, e.g. Classified. But, we don’t need more than 10 calls to test it.

Here, we’ll be dealing with 4 python scripts. Among them, I’ve already described one in my previous post. So, for that one, I’ll just mention the file & post the script.

Please find the directory structure in both the OS –

1. clsL.py (This script will create the split csv files or final merge file after the corresponding process. However, this can be used as usual verbose debug logging as well. Hence, the name comes into the picture.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 20-Jan-2019       ########
###########################################
import pandas as p
import os
import platform as pl
from clsParam import clsParam as cf

class clsL(object):
    def __init__(self):
        self.path = cf.config['PATH']

    def logr(self, Filename, Ind, df, subdir=None):
        try:
            x = p.DataFrame()
            x = df
            sd = subdir

            os_det = pl.system()

            if sd == None:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + Filename
                else:
                    fullFileName = self.path + '/' + Filename
            else:
                if os_det == "Windows":
                    fullFileName = self.path + '\\' + sd + "\\" + Filename
                else:
                    fullFileName = self.path + '/' + sd + "/" + Filename

            if Ind == 'Y':
                x.to_csv(fullFileName, index=False)

            return 0

        except Exception as e:
            y = str(e)
            print(y)
            return 3

2. clsParam.py (This script contains the parameter entries in the form of dictionary & later this can be used in all the relevant python scripts as configuration parameters.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 20-Jan-2019       ########
###########################################

import os

class clsParam(object):

    config = {
        'MAX_RETRY' : 5,
        'API_KEY' : 'HtWTVS86n8xoGXahyg1tPYH0HwngPqH2YFICzRCtlLbCtfNdya8L1UwRvH90PeMF',
        'PATH' : os.path.dirname(os.path.realpath(__file__)),
        'SUBDIR' : 'data'
    }

As you can see from this script, we’ve declared all the necessary parameters here as a dictionary object & later we’ll be referring to these parameters in the corresponding python scripts.

'API_KEY' : 'HtWTVS86n8xoGXahyg1tPYH0HwngPqH2YFICzRCtlLbCtfNdya8L1UwRvH90PeMF'

One crucial line we’ll look into: API_KEY will be used while sending the request to the third-party web service. We’ll get this API_KEY from the highlighted (in yellow) picture posted above.

3. clsWeb.py (This is the main script, which first picks up the zip codes from the pandas data frame & sends the API requests as per the third-party site. It then captures each JSON response, normalizes the data & pours it back into a data frame for further processing.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 20-Jan-2019       ########
###########################################

import json
import requests
import datetime
import time
import ssl
from urllib.request import urlopen
import pandas as p
import numpy as np
import os
import gc
from clsParam import clsParam as cp

class clsWeb(object):
    def __init__(self, payload, format, unit):
        self.payload = payload
        self.path = cp.config['PATH']
        # To disable logging info
        self.max_retries = cp.config['MAX_RETRY']
        self.api_key = cp.config['API_KEY']
        self.unit = unit
        self.format = format

    def get_response(self):
        # Assigning Logging Info
        max_retries = self.max_retries
        api_key = self.api_key
        unit = self.unit
        format = self.format
        df_conv = p.DataFrame()
        cnt = 0

        try:
            # Bypassing SSL Authentication
            try:
                _create_unverified_https_context = ssl._create_unverified_context
            except AttributeError:
                # Legacy python that doesn't verify HTTPS certificates by default
                pass
            else:
                # Handle target environment that doesn't support HTTPS verification
                ssl._create_default_https_context = _create_unverified_https_context

            # Capturing the payload
            data_df = self.payload
            temp_df = data_df[['zipcode']]

            list_of_rec = temp_df['zipcode'].values.tolist()

            print(list_of_rec)

            for i in list_of_rec:
                zip = i

                # Providing the url
                url_part = 'http://www.zipcodeapi.com/rest/'
                url = url_part + api_key + '/' + 'info.' + format + '/' + str(zip) + '/' + unit

                headers = {"Content-type": "application/json"}
                param = headers

                var1 = datetime.datetime.now().strftime("%H:%M:%S")
                print('Json Fetch Start Time:', var1)

                retries = 1
                success = False

                while not success:
                    # Getting response from web service
                    response = requests.get(url, params=param, verify=False)
                    # print("Complete Error:: ", str(response.status_code))
                    # print("Error First::", str(response.status_code)[:1])

                    if str(response.status_code)[:1] == '2':
                        # response = s.post(url, params=param, json=json_data, verify=False)
                        success=True
                    else:
                        wait = retries * 2
                        print("Retry fails! Waiting " + str(wait) + " seconds and retrying.")
                        time.sleep(wait)
                        retries += 1

                    # Checking Maximum Retries
                    if retries == max_retries:
                        success=True
                        raise ValueError

                # print(response.text)

                var2 = datetime.datetime.now().strftime("%H:%M:%S")
                print('Json Fetch End Time:', var2)

                print("-" * 90)

                # Capturing the response json from Web Service
                df_response_json = response.text
                string_to_json = json.loads(df_response_json)

                # Converting the response json to Dataframe
                # df_Int_Rec = p.read_json(string_to_json, orient='records')
                df_Int_Rec = p.io.json.json_normalize(string_to_json)
                df_Int_Rec.columns = df_Int_Rec.columns.map(lambda x: x.split(".")[-1])

                if cnt == 0:
                    df_conv = df_Int_Rec
                else:
                    d_frames = [df_conv, df_Int_Rec]
                    df_conv = p.concat(d_frames)

                cnt += 1

            # Deleting temporary dataframes & Releasing memories
            del [[df_Int_Rec]]
            gc.collect()

            # Resetting the Index Value
            df_conv.reset_index(drop=True, inplace=True)

            # Merging two data side ways maintaining the orders
            df_add = p.concat([data_df, df_conv], axis=1)

            del [[df_conv]]
            gc.collect()

            # Dropping unwanted column
            df_add.drop(['acceptable_city_names'], axis=1, inplace=True)

            return df_add

        except ValueError as v:
            print(response.text)
            x = str(v)
            print(x)

            # Return Empty Dataframe
            df = p.DataFrame()
            return df

        except Exception as e:
            print(response.text)
            x = str(e)
            print(x)

            # Return Empty Dataframe
            df = p.DataFrame()
            return df

Let’s look at the key lines to discuss –

def __init__(self, payload, format, unit):

    self.payload = payload
    self.path = cp.config['PATH']

    # To disable logging info
    self.max_retries = cp.config['MAX_RETRY']
    self.api_key = cp.config['API_KEY']
    self.unit = unit
    self.format = format

The first block runs as soon as you instantiate the class. Note that we’ve imported our parameter class here as cp & we refer to its elements as & when required. The other parameters are supplied by the invoking script, which we’ll discuss later in this post.

# Bypassing SSL Authentication
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

Sometimes, your firewall or proxy might block your web service request due to a certificate error. This snippet bypasses certificate verification. However, it is always advised to use proper SSL certificates in the Production environment.
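For a Production-style setup, one possible approach is to keep verification switched on & only point Python at the right CA bundle. The helper below is a sketch under that assumption; make_ssl_context & its cafile argument are hypothetical names & not part of the original script.

```python
import ssl


def make_ssl_context(verify=True, cafile=None):
    """Return a context that verifies certificates unless verify is False."""
    if verify:
        # cafile may point at a corporate CA bundle (hypothetical path);
        # None means the system default certificates are used.
        return ssl.create_default_context(cafile=cafile)
    # Same private helper the snippet above relies on; for testing only.
    return ssl._create_unverified_context()


verified = make_ssl_context()
unverified = make_ssl_context(verify=False)
```

A context like this can be passed to urllib.request.urlopen through its context argument. With requests, the verify argument similarly accepts the path of a CA bundle instead of False.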

# Capturing the payload
data_df = self.payload
temp_df = data_df[['zipcode']]

list_of_rec = temp_df['zipcode'].values.tolist()

In this snippet, we’re capturing the zip codes from our source data frame & converting them into a list. Each entry of this list will be passed to the API as part of our request.

for i in list_of_rec:
    zip = i

    # Providing the url
    url_part = 'http://www.zipcodeapi.com/rest/'
    url = url_part + api_key + '/' + 'info.' + format + '/' + str(zip) + '/' + unit

    headers = {"Content-type": "application/json"}
    param = headers

Once we’ve extracted our zip codes, we pass them one by one & form the request URL along with the header.
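The URL assembly can also be wrapped in a small function, which makes it easy to check the result without calling the API. The function name build_info_url & the key DEMO_KEY are placeholders for this sketch; the base address & the order of the path pieces are taken from the snippet above.

```python
from urllib.parse import quote

BASE_URL = 'http://www.zipcodeapi.com/rest/'  # base address used in the script


def build_info_url(api_key, fmt, zip_code, unit):
    """Join the path pieces in the order the script uses."""
    parts = [api_key, 'info.' + fmt, str(zip_code), unit]
    # quote() escapes any character that is not safe inside a URL path
    return BASE_URL + '/'.join(quote(part, safe='') for part in parts)


url = build_info_url('DEMO_KEY', 'json', 90210, 'degree')
```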

retries = 1
success = False

while not success:
    # Getting response from web service
    response = requests.get(url, params=param, verify=False)
    # print("Complete Error:: ", str(response.status_code))
    # print("Error First::", str(response.status_code)[:1])

    if str(response.status_code)[:1] == '2':
        # response = s.post(url, params=param, json=json_data, verify=False)
        success=True
    else:
        wait = retries * 2
        print("Retry fails! Waiting " + str(wait) + " seconds and retrying.")
        time.sleep(wait)
        retries += 1

    # Checking Maximum Retries
    if retries == max_retries:
        success=True
        raise ValueError

In this section, we’re sending our request & waiting for the response from the third-party API. If we receive a success response (any 2xx status, e.g. 200), we proceed with the next zip code. However, if we don’t receive a success response, we wait & retry the request until it reaches the maximum retry limit. If the application still has no valid answer at that point, it exits from the loop & raises an error to the main application.
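The retry logic can be isolated into a helper that takes the actual call as an argument, so it can be tried without any network access. This is a simplified sketch & not the exact loop from the script: fetch_with_retry & FakeResponse are hypothetical names, & the attempt counting is slightly different (it allows exactly max_retries attempts).

```python
import time


def fetch_with_retry(fetch, max_retries, sleep=time.sleep):
    """Call fetch() until it returns a 2xx status or attempts run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    for attempt in range(1, max_retries + 1):
        response = fetch()
        if 200 <= response.status_code < 300:
            return response
        if attempt < max_retries:
            sleep(attempt * 2)  # wait 2, 4, 6... seconds between attempts
    raise ValueError('No successful response after %d attempts' % max_retries)


class FakeResponse:
    """Tiny stand-in so the sketch runs without a network call."""

    def __init__(self, status_code):
        self.status_code = status_code


# Two failures followed by a success; waits are recorded instead of slept.
statuses = iter([500, 503, 200])
waits = []
result = fetch_with_retry(lambda: FakeResponse(next(statuses)), 5, sleep=waits.append)
```

In the real script, fetch could be something like lambda: requests.get(url, headers=headers), assuming the requests package is installed.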

# Capturing the response json from Web Service
df_response_json = response.text
string_to_json = json.loads(df_response_json)

# Converting the response json to Dataframe
# df_Int_Rec = p.read_json(string_to_json, orient='records')
df_Int_Rec = p.io.json.json_normalize(string_to_json)
df_Int_Rec.columns = df_Int_Rec.columns.map(lambda x: x.split(".")[-1])

This snippet extracts the desired response from the API & converts it back into a Pandas data frame. The last two lines normalize the data received from the API & keep only the last part of each dotted column name for further processing. These are critical steps, as they let us extract the City & State from our API response.
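To see what the normalization does, here is a small example with a made-up response. The field names in response_text are assumptions & may not match the real API. Also note that newer pandas releases expose this function at the top level as pandas.json_normalize, which is what the sketch uses.

```python
import json

import pandas as pd

# Made-up response text with one nested field.
response_text = (
    '{"zip_code": "90210", "city": "Beverly Hills", "state": "CA", '
    '"timezone": {"timezone_abbr": "PST"}}'
)

record = json.loads(response_text)

# Nested keys are flattened into dotted names, e.g. 'timezone.timezone_abbr'
df = pd.json_normalize(record)
flattened_names = list(df.columns)

# Keep only the last part of each dotted name
df.columns = df.columns.map(lambda name: name.split('.')[-1])
```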

# Merging two data side ways maintaining the orders
df_add = p.concat([data_df, df_conv], axis=1)

Once we have structured data, we can merge it back with our source data frame for our next step.
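The side-by-side merge depends on both data frames sharing the same row index, which is why the script resets the index first. A minimal sketch with made-up data:

```python
import pandas as pd

source = pd.DataFrame({'zipcode': [90210, 10001], 'total': [120, 80]})

# Per-zip lookups stacked with concat keep their own index labels (0, 0).
lookups = pd.concat([
    pd.DataFrame({'city': ['Beverly Hills'], 'state': ['CA']}),
    pd.DataFrame({'city': ['New York'], 'state': ['NY']}),
])

# After the reset, the labels are 0 and 1 and line up with the source rows.
lookups = lookups.reset_index(drop=True)

merged = pd.concat([source, lookups], axis=1)
```

Since concat with axis=1 aligns rows by index label & not by position, the order of the lookups must match the order of the source rows.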

4. callWebservice.py (This script will call the API script & also process the data to create an aggregate report for our task.)

#####################################################
### Objective: Purpose of this Library is to call ###
### the Web Service method to capture the city,   ###
### & the state as a json response & update them  ###
### in the dataframe & finally produce the summary###
### of Total Sales & Item Counts based on the City###
### & the State.                                  ###
###                                               ###
### Arguments are as follows:                     ###
### Mentioned the Exception Part. First time dry  ###
### run the program without providing any args.   ###
### It will show all the mandatory params.        ###
###                                               ###
#####################################################
#####################################################
#### Written By: SATYAKI DE                       ###
#### Written On: 20-Jan-2019                      ###
#####################################################

import clsWeb as cw
import sys
import pandas as p
import os
import platform as pl
import clsLog as log
import datetime
import numpy as np
from clsParam import clsParam as cp

# Disabling Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

def main():
    print("Calling the custom Package..")

    try:
        if len(sys.argv) == 4:
            inputFile = str(sys.argv[1])
            format = str(sys.argv[2])
            unit = str(sys.argv[3])
        else:
            raise Exception

        # Checking whether the format contains
        # allowable choices or not
        if (format == 'JSON'):
            format = 'json'
        elif (format == 'CSV'):
            format = 'csv'
        elif (format == 'XML'):
            format = 'xml'
        else:
            raise Exception

        # Checking whether the unit contains
        # allowable choices or not
        if (unit == 'DEGREE'):
            unit = 'degree'
        elif (unit == 'RADIANS'):
            unit = 'radians'
        else:
            raise Exception

        print("*" * 170)
        print("Reading from " + str(inputFile))
        print("*" * 170)

        # Initiating Logging Instances
        clog = log.clsLog()

        path = cp.config['PATH']
        subdir = cp.config['SUBDIR']

        os_det = pl.system()

        if os_det == "Windows":
            src_path = path + '\\' + 'data\\'
        else:
            src_path = path + '/' + 'data/'

        # Reading source data csv file
        df_Payload = p.read_csv(src_path+inputFile, index_col=False, skipinitialspace=True)

        x = cw.clsWeb(df_Payload, format, unit)
        retDf = x.get_response()

        # Total Number of rows fetched
        count_row = retDf.shape[0]

        if count_row == 0:
            print("Data Processing Issue!")
        else:
            print("Writing to file -> (" + str(inputFile) + "_modified.csv) Status: Success")

        FileName, FileExtn = inputFile.split(".")

        # Writing to the file
        clog.logr(FileName + '_modified.' + FileExtn, 'Y', retDf, subdir)
        print("*" * 170)

        # Performing group by operation to get the desired result
        # State & City-wise total Sales & Item Sales
        df_src = p.DataFrame()
        df_src = retDf[['city', 'state', 'total', 'item_count']]

        # Converting values to Integer
        df_src['city_1'] = retDf['city'].astype(str)
        df_src['state_1'] = retDf['state'].astype(str)
        df_src['total_1'] = retDf['total'].astype(int)
        df_src['item_count_1'] = retDf['item_count'].astype(int)

        # Dropping the old Dtype Columns
        df_src.drop(['city'], axis=1, inplace=True)
        df_src.drop(['state'], axis=1, inplace=True)
        df_src.drop(['total'], axis=1, inplace=True)
        df_src.drop(['item_count'], axis=1, inplace=True)

        # Renaming the new columns to as per Old Column Name
        df_src.rename(columns={'city_1': 'city'}, inplace=True)
        df_src.rename(columns={'state_1': 'state'}, inplace=True)
        df_src.rename(columns={'total_1': 'total'}, inplace=True)
        df_src.rename(columns={'item_count_1': 'item_count'}, inplace=True)

        # Performing Group By Operation
        grouped = df_src.groupby(['state', 'city'])
        res_1 = grouped.aggregate(np.sum)

        print("DF:")
        print(res_1)

        FileName1 = 'StateCityWiseReport'
        # Handling Multiple source files
        var = datetime.datetime.now().strftime(".%H.%M.%S")
        print('Target File Extension will contain the following:: ', var)

        # Writing to the file
        clog.logr(FileName1 + var + '.' + FileExtn, 'Y', df_src, subdir)

        print("*" * 170)
        print("Operation done for " + str(inputFile) + "!")
        print("*" * 170)
    except Exception as e:
        x = str(e)
        print(x)
        print("*" * 170)
        print('Current order would be - <' + str(sys.argv[0]) + '> <Csv File Name> <JSON/CSV/XML> <DEGREE/RADIANS>')
        print('Make sure last two params should be in CAPS only!')
        print("*" * 170)

if __name__ == "__main__":
    main()

Let’s look at some vital code snippets in this main script –

# Reading source data csv file
df_Payload = p.read_csv(src_path+inputFile, index_col=False, skipinitialspace=True)

x = cw.clsWeb(df_Payload, format, unit)
retDf = x.get_response()

In this snippet, we’re reading our data from the source csv & then calling our Web API class to get the State & City information.

# Converting values to Integer
df_src['city_1'] = retDf['city'].astype(str)
df_src['state_1'] = retDf['state'].astype(str)
df_src['total_1'] = retDf['total'].astype(int)
df_src['item_count_1'] = retDf['item_count'].astype(int)

Here, we convert each column to the appropriate data type. In Pandas, it is always advisable to set the data types of your columns explicitly to avoid unforeseen scenarios.

# Dropping the old Dtype Columns
df_src.drop(['city'], axis=1, inplace=True)
df_src.drop(['state'], axis=1, inplace=True)
df_src.drop(['total'], axis=1, inplace=True)
df_src.drop(['item_count'], axis=1, inplace=True)

# Renaming the new columns to as per Old Column Name
df_src.rename(columns={'city_1': 'city'}, inplace=True)
df_src.rename(columns={'state_1': 'state'}, inplace=True)
df_src.rename(columns={'total_1': 'total'}, inplace=True)
df_src.rename(columns={'item_count_1': 'item_count'}, inplace=True)

Now, we drop the old columns & rename the new columns to get the same column names with the correct data types. I personally like this way as it is a very clean way to do this task. You can also debug it easily.
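If you prefer a shorter route, astype also accepts a dictionary of column names & types, which gives the same result in one step. This is an alternative sketch with made-up data & not the approach used in the script.

```python
import pandas as pd

raw = pd.DataFrame({
    'city': ['Beverly Hills', 'New York'],
    'state': ['CA', 'NY'],
    'total': ['120', '80'],        # numbers that arrived as text
    'item_count': ['3', '2'],
})

# One call converts every listed column & returns a new data frame
typed = raw.astype({'city': str, 'state': str, 'total': int, 'item_count': int})
```

Because astype returns a new data frame, the original one stays untouched, which avoids the chained-assignment warnings that in-place column edits on a slice can trigger.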

# Performing Group By Operation
grouped = df_src.groupby(['state', 'city'])
res_1 = grouped.aggregate(np.sum)

And, finally, using the Pandas group-by method, we’re grouping the rows by State & City & then using numpy’s sum to generate the totals against each group.
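A small worked example of the same aggregation, with made-up sales rows. The as_index=False option is an addition in this sketch; it keeps state & city as normal columns.

```python
import pandas as pd

sales = pd.DataFrame({
    'state': ['CA', 'CA', 'NY'],
    'city': ['Beverly Hills', 'Beverly Hills', 'New York'],
    'total': [120, 30, 80],
    'item_count': [3, 1, 2],
})

# as_index=False keeps state & city as ordinary columns,
# so they survive a later to_csv(index=False).
report = sales.groupby(['state', 'city'], as_index=False)[['total', 'item_count']].sum()
```

With the default index-based result, writing the report through to_csv(index=False) would drop the State & City labels, so this form is handy when the aggregate itself needs to go to a file.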

Please check the first consolidated output –

From this screenshot, you can see how we have the desired intermediate data of City & State to proceed for the next level.

Let’s see how it runs –

Windows (64 bit):

Mac (32 bit):

So, from the screenshot, we can see our desired output & you can calculate the aggregated value based on our sample provided in the previous screenshot.

Let’s check how the data directory looks after the run –

Windows:

MAC:

So, finally, we’ve achieved our target.

I hope this gives you some more insight into the Python verse. Let me know what you think about this post.

Till then – Happy Avenging!

Pandas & Numpy (Space Stone of Programming World)

Posted on January 15, 2019 (updated September 20, 2019) by SatyakiDe in Data Science, data warehouse, function, Pandas, Python, read

Today, we’ll demonstrate a different application of Pandas. In this case, we’ll be exploring the possibilities of reading a large CSV file & splitting it into sets of smaller, more manageable csv files.

And, after creating them, another process will merge them together. This is especially useful when you need to transform a large volume of data without running into any kind of memory error. Moreover, the developer has more control over failed cases & can resume the load without restarting it from the beginning of the files.

In this case, I’ll be using one more custom method to create the csv file instead of directly using the to_csv method of pandas.

But, before that let’s prepare the virtual environment & proceed from there –

Windows 10 (64 bit):

Commands:

python -m venv --copies .env

.env\Scripts\activate.bat

Screenshot:

Mac OS (64 bit):

Commands:

python -m venv env

source env/bin/activate

Screenshot:

So, the Python version on both Windows & Mac is 3.7 & we’re going to explore our task in the following sections.

After creating this virtual environment, you need to install only the pandas package for this task, as shown below for both Windows & Mac OS –

Windows:

Mac:

The rest of the packages come by default with Python 3.7.

Please find the GUI screenshots from WinSCP software comparing both the directory structures (Mac & Windows) as given below –

From the above screenshot, you can see that our directory structures are not exactly identical up to the blog directory. However, our program will take care of this difference.

Let’s check the scripts one-by-one,

1. clsL.py (This script will create the split csv files or final merge file after the corresponding process. However, this can be used as normal verbose debug logging as well. Hence, the name comes into the picture.)

#############################################
#### Written By: Satyaki De              ####
#############################################
import pandas as p
import os
import platform as pl

class clsL(object):
    def __init__(self):
        self.path = os.path.dirname(os.path.realpath(__file__))

    def logr(self, Filename, Ind, df, subdir=None):
        try:
            x = p.DataFrame()
            x = df

            sd = subdir
            os_det = pl.system()

            if os_det == "Windows":
                if sd is None:
                    fullFileName = self.path + "\\" + Filename
                else:
                    fullFileName = self.path + "\\" + sd + "\\" + Filename
            else:
                if sd is None:
                    fullFileName = self.path + "/" + Filename
                else:
                    fullFileName = self.path + "/" + sd + "/" + Filename


            if Ind == 'Y':
                x.to_csv(fullFileName, index=False)

            return 0

        except Exception as e:
            y = str(e)
            print(y)
            return 3

From the above script, you can see that the behavior depends on the Indicator, whose value can be either ‘Y’ or ‘N’. When it is ‘Y’, the script generates the csv file from the pandas data frame using the to_csv method available in pandas.

Key snippet to notice –

self.path = os.path.dirname(os.path.realpath(__file__))

Here, when an instance of the class is created, it initializes the path with the directory that contains this script file.

x = p.DataFrame()
x = df

The first line declares a pandas data frame variable. The second line assigns the data frame passed into the method to that variable.

os_det = pl.system()

This will identify the operating system on which your application is running. Based on that, your path will be dynamically configured & passed. Hence, your application will be ready to handle multiple operating systems from the beginning.
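As a minimal sketch of the same idea, os.path.join picks the right separator for the running OS, so the explicit platform check is not strictly needed. The function name build_path & the sample directory names below are assumptions for illustration only; they are not part of the original script.

```python
import os

def build_path(base_dir, filename, subdir=None):
    """Join base_dir, an optional subdir, and filename with the OS separator."""
    if subdir is None:
        return os.path.join(base_dir, filename)
    # Strip stray separators so values like "/temp/" or "\\temp\\" join cleanly
    return os.path.join(base_dir, subdir.strip("/\\"), filename)

# Hypothetical values, only to show the call
print(build_path("blog", "sample.csv"))
print(build_path("blog", "sample.csv", subdir="/temp/"))
```

The script in this post keeps the explicit Windows check; this variant only shows that the standard library can make the same decision for you.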

x.to_csv(fullFileName, index=False)

Finally, to_csv will generate the final csv file when the supplied Indicator value is ‘Y’. Also, notice that we’ve added one more parameter (index=False). By default, pandas adds a row index to every data frame & writes it as an extra first column in the csv file.

index_val

As you can see, the first column does not come from our source files. Rather, it is generated by the pandas package in python. Hence, we exclude it from our final file by specifying the (index=False) option.
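To see the effect of index=False without touching any file, here is a small sketch. The data frame & its column names are made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({"acct": [101, 102], "city": ["Pune", "Kolkata"]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header holds only the source columns
```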

2. clsSplitFl.py (This script will create the split csv files. It brings the data into memory chunk by chunk & processes the large file.)

#############################################
#### Written By: Satyaki De              ####
#############################################
import os
import pandas as p
import clsL as log
import gc
import csv

class clsSplitFl(object):
    def __init__(self, srcFileName, path, subdir):
        self.srcFileName = srcFileName
        self.path = path
        self.subdir = subdir

        # Maximum Number of rows in CSV
        # in order to avoid Memory Error
        self.max_num_rows = 30000
        self.networked_directory = 'src_file'
        self.Ind = 'Y'

    def split_files(self):
        try:
            src_dir = self.path
            subdir = self.subdir
            networked_directory = self.networked_directory

            # Initiate Logging Instances
            clog = log.clsL()

            # Setting up values
            srcFileName = self.srcFileName

            First_part, Last_part = str(srcFileName).split(".")

            num_rows = self.max_num_rows
            dest_path = self.path
            remote_src_path = src_dir + networked_directory
            Ind = self.Ind
            interval = num_rows

            # Changing work directory location to source file
            # directory at remote server
            os.chdir(remote_src_path)

            src_fil_itr_no = 1

            # Split logic here
            for df2 in p.read_csv(srcFileName, index_col=False, error_bad_lines=False, chunksize=interval):
                # Changing the target directory path
                os.chdir(dest_path)

                # Calling custom file generation method
                # to generate splitted files
                clog.logr(str(src_fil_itr_no) + '__' + First_part + '_' + '_splitted_.' + Last_part, Ind, df2, subdir)

                del [[df2]]
                gc.collect()

                src_fil_itr_no += 1

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

In this script, we’re splitting the source file into files of at most 30,000 records each. A source file with fewer records than that produces a single split file.

Important lines to be noticed –

self.max_num_rows = 30000

As already explained, this value sets the maximum number of rows in each split file.

First_part, Last_part = str(srcFileName).split(".")

This will split the source file name into two parts, i.e. one part contains only the file name & the other part contains only the extension. Note that this assumes the file name contains exactly one dot.
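If a file name has more than one dot, unpacking the result of split(".") into two names raises a ValueError. A hedged alternative is os.path.splitext, which splits only at the last dot. The helper name split_name & the second sample file name are assumptions for illustration.

```python
import os

def split_name(file_name):
    """Return (base, extension) by splitting at the last dot only."""
    base, ext = os.path.splitext(file_name)
    return base, ext.lstrip(".")

print(split_name("acct_addr_20180112.csv"))
print(split_name("acct.addr.20180112.csv"))
```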

for df2 in p.read_csv(srcFileName, index_col=False, error_bad_lines=False, chunksize=interval):

As you can see, the application reads the large source csv chunk by chunk (set by chunksize=interval). And, if the source file has any bad rows, it will skip them because of the following option -> (error_bad_lines=False). Note that pandas 1.3 deprecated this option in favor of on_bad_lines='skip', & pandas 2.0 removed it.
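The chunked read can be tried on a tiny in-memory csv before pointing it at a large file. This is only a sketch: the sample data & the chunk size of 3 are assumptions, & the bad-line option is left out so that the snippet runs on both old & new pandas versions.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with 7 data rows, standing in for a large file
csv_text = "id,amount\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 8))

chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    # Each chunk is an ordinary DataFrame holding at most 3 rows
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
```

Only the last chunk can be smaller than the chunk size, which is why the split files hold an equal or smaller number of rows.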

clog.logr(str(src_fil_itr_no) + '__' + First_part + '_' + '_splitted_.' + Last_part, Ind, df2, subdir)

This dynamically generates the split files in the specified subdirectory with a modified name. Since each chunk gets its own number, the split files won’t overwrite one another. Remember that src_fil_itr_no plays an important role while merging them back into one, as this number represents the current file’s split number.
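The naming pattern can be checked in isolation. The sketch below mirrors the string concatenation from the script; the two helper names are hypothetical, & rsplit is used here instead of the script’s split so that only the last dot is treated as the extension separator.

```python
def split_file_name(part_no, src_file_name):
    """Build the name of one split file, mirroring the pattern in the script."""
    first_part, last_part = src_file_name.rsplit(".", 1)
    return str(part_no) + "__" + first_part + "_" + "_splitted_." + last_part

def part_number(split_file):
    """Recover the numeric prefix that the merge step sorts on."""
    return int(split_file.split("__")[0])

name = split_file_name(3, "acct_addr_20180112.csv")
print(name)
print(part_number(name))
```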

del [[df2]]
gc.collect()

Once you have processed that part, delete the data frame & free the memory, so that you won’t encounter a memory error or a similar issue.

And, the split file will look like this –

3. clsMergeFl.py (This script will combine all the split csv files into one big csv file. It reads the split files one by one into memory & generates the large file.)

#############################################
#### Written By: Satyaki De              ####
#############################################
import os
import platform as pl
import pandas as p
import gc
import clsL as log
import re

class clsMergeFl(object):

    def __init__(self, srcFilename):
        self.srcFilename = srcFilename
        self.subdir = 'finished'
        self.Ind = 'Y'

    def merge_file(self):
        try:
            # Initiating Logging Instances
            clog = log.clsL()
            df_W = p.DataFrame()
            df_M = p.DataFrame()
            f = {}

            subdir = self.subdir
            srcFilename = self.srcFilename
            Ind = self.Ind
            cnt = 0

            os_det = pl.system()

            if os_det == "Windows":
                proc_dir = "\\temp\\"
                gen_dir = "\\process\\"
            else:
                proc_dir = "/temp/"
                gen_dir = "/process/"

            # Current Directory where application presents
            path = os.path.dirname(os.path.realpath(__file__)) + proc_dir

            print("Path: ", path)
            print("Source File Initial Name: ", srcFilename)

            for fname in os.listdir(path):
                if fname.__contains__(srcFilename) and fname.endswith('_splitted_.csv'):
                    key = int(re.split('__', str(fname))[0])
                    f[key] = str(fname)

            for k in sorted(f):
                print(k)
                print(f[k])
                print("-"*30)

                df_W = p.read_csv(path+f[k], index_col=False)

                if cnt == 0:
                    df_M = df_W
                else:
                    d_frames = [df_M, df_W]
                    df_M = p.concat(d_frames)

                cnt += 1

                print("-"*30)
                print("Total Records in this Iteration: ", df_M.shape[0])

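            # Use the last split file processed; fname may point to any
            # other file that os.listdir() happened to return last
            FtgtFileName = f[k].replace('_splitted_', '')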
            first, FinalFileName = re.split("__", FtgtFileName)

            clog.logr(FinalFileName, Ind, df_M, gen_dir)

            del [[df_W], [df_M]]
            gc.collect()

            return 0
        except Exception as e:
            x = str(e)
            print(x)

            return 1

In this script, we’re merging the smaller files into one large file. Following are the key snippets that we’ll explore –

for fname in os.listdir(path):
    if fname.__contains__(srcFilename) and fname.endswith('_splitted_.csv'):
        key = int(re.split('__', str(fname))[0])
        f[key] = str(fname)

In this section, the application checks whether the specified path has files whose names end with “_splitted_.csv” & contain the initial part of the source file name, i.e. if you have a source file named acct_addr_20180112.csv, then the name should contain “acct_addr” & end with “_splitted_.csv”. For each match, it stores the file name in a dictionary keyed by the split number. Then it starts the merge process by taking the files one by one & merging them using a pandas data frame, as shown below –

for k in sorted(f):
    print(k)
    print(f[k])
    print("-"*30)

    df_W = p.read_csv(path+f[k], index_col=False)

    if cnt == 0:
        df_M = df_W
    else:
        d_frames = [df_M, df_W]
        df_M = p.concat(d_frames)

    cnt += 1

Note that, here f is a dictionary that maps each split number to its file name. The split number comes from the first part of the split file name. That way, sorting the keys lets the merge put the files back into one large file in the original order.
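The reason for converting the prefix to an integer shows up once you have ten or more split files. The sketch below uses a made-up directory listing (the file names are assumptions that follow the naming pattern of this post) to compare numeric ordering with plain string ordering.

```python
import re

# Hypothetical directory listing; os.listdir returns names in arbitrary order
names = [
    "10__acct_addr_20180112__splitted_.csv",
    "2__acct_addr_20180112__splitted_.csv",
    "1__acct_addr_20180112__splitted_.csv",
    "notes.txt",
]

f = {}
for fname in names:
    if "acct_addr" in fname and fname.endswith("_splitted_.csv"):
        key = int(re.split("__", fname)[0])  # numeric split number
        f[key] = fname

ordered = [f[k] for k in sorted(f)]
print(ordered)

# Sorting the raw strings instead would put "10__..." ahead of "2__..."
print(sorted(f.values()))
```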

Here, also notice the concat function provided by pandas. In this step, the application merges two data frames: the rows merged so far & the file that was just read.
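A common variation, which is not what the script above does, is to collect the data frames in a list & call concat once at the end. The sketch below uses small made-up data frames in place of the split files.

```python
import pandas as pd

# Hypothetical chunks standing in for the split files read back from disk
chunks = [
    pd.DataFrame({"id": [1, 2], "amount": [10, 20]}),
    pd.DataFrame({"id": [3, 4], "amount": [30, 40]}),
    pd.DataFrame({"id": [5], "amount": [50]}),
]

# Concatenate once at the end; ignore_index=True renumbers rows 0..n-1
merged = pd.concat(chunks, ignore_index=True)

print(merged.shape[0])
print(list(merged.index))
```

Since the final file is written with index=False, the renumbered index does not change the output; it only keeps the in-memory data frame tidy.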

Finally, the main python script, from where we’ll call it –

4. callSplitMergeFl.py

#############################################
#### Written By: Satyaki De              ####
#############################################
import clsSplitFl as t
import clsMergeFl as cm
import re
import platform as pl
import os

def main():
    print("Calling the custom Package for large file splitting..")
    os_det = pl.system()

    print("Running on :", os_det)

    ###############################################################
    ###### User Input based on Windows OS                  ########
    ###############################################################

    srcF = str(input("Please enter the file name with extension:"))
    base_name = re.sub(r'[0-9]','', srcF)
    srcFileInit = base_name[:-5]

    if os_det == "Windows":
        subdir = "\\temp\\"
        path = os.path.dirname(os.path.realpath(__file__)) + "\\"
    else:
        subdir = "/temp/"
        path = os.path.dirname(os.path.realpath(__file__)) + '/'

    ###############################################################
    ###### End Of User Input                                 ######
    ###############################################################

    x = t.clsSplitFl(srcF, path, subdir)

    ret_val = x.split_files()

    if ret_val == 0:
        print("Splitting Successful!")
    else:
        print("Splitting Failure!")

    print("-"*30)

    print("Finally, Merging small splitted files to make the same big file!")

    y = cm.clsMergeFl(srcFileInit)

    ret_val1 = y.merge_file()

    if ret_val1 == 0:
        print("Merge Successful!")
    else:
        print("Merge Failure!")

    print("-"*30)



if __name__ == "__main__":
    main()

Following are the key sections that we can check –

import clsSplitFl as t
import clsMergeFl as cm

Like any other standard python package, we’re importing our own classes into our main callable script.

x = t.clsSplitFl(srcF, path, subdir)
ret_val = x.split_files()

Or,

y = cm.clsMergeFl(srcFileInit)
ret_val1 = y.merge_file()

In this section, we’ve instantiated the class & then we’re calling its method. And, based on the return value, we’re printing the status of our application’s last run.

The final run of this application looks like ->

Windows:

Mac:

And, the final file should look like this –

Windows:

MAC:

The left-hand side represents the final processed/output file on Windows, whereas the right-hand side represents the one on Mac.

Hope this will give you some idea about how we can use pandas in various cases apart from conventional data computing.

In this post, I skipped the exception part intentionally. I’ll publish one bonus post once my series is complete.

Let me know what you think.

Till then, Happy Avenging!

Satyaki De

Python Verse – Universe of Avengers in Computer Language World!

Posted on January 12, 2019 (updated March 25, 2019) by SatyakiDe in Data Science, function, numpy, objects, operating system, Pandas, pattern matching, Python, snippet, String Manipulation, Technology

Over the last couple of years, I’ve been working on various technologies. And, one of the interesting languages that I came across is Python. It is extremely flexible & lets developers learn it quickly & build applications with very few lines of code compared to other languages. I have worked with several major versions of python. Among them, python 2.7 & the current python 3.7.1 are very popular with developers & are my personal favorites.

There are many useful packages available to reduce the burden on developers. Among them, packages like “pandas”, “numpy”, “json”, “AES”, “threading” etc. are extremely useful & one can do lots of work with them.

I personally prefer the Ubuntu or Mac version of python. However, I’ve worked on the Windows version as well & developed python based frameworks & applications that work on all the major operating systems. If you take care of a few things from the beginning, then you don’t have to make many changes to your python application for it to work on all the major operating systems. 🙂

To me, the Python Universe is nothing short of Marvel’s Universe of Avengers. In order to beat the Supreme Villain Thanos (That Challenging & Complex Product with an extremely tight timeline), you have got to have the 6 infinity stones.

Space Stone ( Pandas & Numpy )
Reality Stone ( Json, SSL & Encryption/Decryption )
Power Stone ( Multi-Threading/Multi-Processing )
Mind Stone ( OS, Database, Directories & Files )
Soul Stone ( Logging & Exception )
Time Stone ( Cloud Interaction & Framework )

I’ll release a series of python based posts in the coming days, which might be useful for many peers or information seekers. Hopefully, this installment is just the beginning, so please follow my posts. I hope you will get many such useful posts very soon.

You can get the latest version of Python from the official site given below –

Python Link (3.7.1)

Make sure you install the pip package along with python. I’m not going into the details of how one should install python on Windows, Mac or Linux.

I’m just showing you how to install individual python packages.

Windows:

pip install pandas

Linux/Mac:

sudo python3.7 -m pip install pandas

From the second example, you can see that you can install packages for a specific python version in case you have multiple versions of python.

Note: There might be slight variations across different versions of Linux. Make sure you are using the correct syntax for your flavor.

You can find plenty of good sites where the detailed step-by-step process is shared for each operating system.

Till then – Happy Avenging!


Tag: Pandas

Improvement of Pandas data processing performance using Multi-threading with the Queue (Another crossover of Space Stone, Reality Stone & Power Stone)

Pandas with Encryption/Decryption along with the JSON – (Client API Access) along with Data Queue (A crossover between Space stone, Reality Stone & Power Stone)

Pandas, Numpy, Encryption/Decryption, Hidden Files In Python (Crossover between Space Stone, Reality Stone & Mind Stone of Python-Verse)

Pandas, Numpy, JSON & SSL (Crossover of Space Stone & Reality Stone in Python Verse)

Pandas & Numpy (Space Stone of Programming World)
