Live visual reading using a Convolutional Neural Network (CNN) through a Python-based machine-learning application.

This week we’re going to touch on an exciting topic: visually reading characters from a WebCAM & predicting the letters using a CNN. Before we dig deep, why don’t we see the demo run first?

Demo

Isn’t it fascinating? As we can see, the computer can capture the feed and read characters much like humans do, thanks to the brilliant packages available in Python that help us predict the correct letter out of an image.


What do we need to test it out?

  1. Preferably an external WebCAM.
  2. A moderate or good laptop to test this out.
  3. Python 
  4. And a few other packages that we’ll mention in the next section.

What Python packages do we need?

Some of the critical packages that we need to test out this application are –

cmake==3.22.1
dlib==19.19.0
face-recognition==1.3.0
face-recognition-models==0.3.0
imutils==0.5.3
jsonschema==4.4.0
keras==2.7.0
Keras-Preprocessing==1.1.2
matplotlib==3.5.1
matplotlib-inline==0.1.3
oauthlib==3.1.1
opencv-contrib-python==4.1.2.30
opencv-contrib-python-headless==4.4.0.46
opencv-python==4.5.5.62
opencv-python-headless==4.5.5.62
pickleshare==0.7.5
Pillow==9.0.0
python-dateutil==2.8.2
requests==2.27.1
requests-oauthlib==1.3.0
scikit-image==0.19.1
scikit-learn==1.0.2
tensorboard==2.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.7.0
tensorflow-estimator==2.7.0
tensorflow-io-gcs-filesystem==0.23.1
tqdm==4.62.3

What is CNN?

In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural networks most commonly applied to analyze visual imagery.

Different Steps of CNN

We can understand from the above picture that a CNN generally takes an image as input. The network analyzes small neighbourhoods of pixels, and the weights and biases of the model are then tweaked to detect the desired letters (in our use case) from the image. Like other algorithms, the data also has to pass through a pre-processing stage. However, a CNN needs relatively little pre-processing compared to most other deep-learning algorithms.
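To make the pixel-level idea concrete, here is a tiny sketch (not part of the project code, and assuming NumPy & SciPy are available) of what a single 3×3 convolution does to an image –

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)          # a hypothetical 28x28 grayscale image
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])  # a simple edge-detection kernel

# Slide the 3x3 kernel over the image; each output value combines a
# neighbourhood of pixels into one feature value.
feature_map = convolve2d(image, edge_kernel, mode='valid')
print(feature_map.shape)                # (26, 26)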

If you want to know more about this, there is an excellent article on CNN with some on-point animations explaining this concept. Please read it here.

Where do we get the data sets for our testing?

For testing, we are fortunate enough to have Kaggle. It hosts a wide variety of sample data, which you can get from here.


Our use-case:

Architecture

From the above diagram, one can see that the Python application will consume a live video feed of random letters (both printed & handwritten) & predict the characters using the machine-learning model that we trained.


Code:

  1. clsConfig.py (Configuration file for the entire application.)


################################################
#### Written By: SATYAKI DE ####
#### Written On: 15-May-2020 ####
#### Modified On: 28-Dec-2021 ####
#### ####
#### Objective: This script is a config ####
#### file, contains all the keys for ####
#### Machine-Learning & streaming dashboard.####
#### ####
################################################
import os
import platform as pl
class clsConfig(object):
Curr_Path = os.path.dirname(os.path.realpath(__file__))
os_det = pl.system()
if os_det == "Windows":
sep = '\\'
else:
sep = '/'
conf = {
'APP_ID': 1,
'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
'LOG_PATH': Curr_Path + sep + 'log' + sep,
'REPORT_PATH': Curr_Path + sep + 'report',
'FILE_NAME': Curr_Path + sep + 'Data' + sep + 'A_Z_Handwritten_Data.csv',
'SRC_PATH': Curr_Path + sep + 'data' + sep,
'APP_DESC_1': 'Old Video Enhancement!',
'DEBUG_IND': 'N',
'INIT_PATH': Curr_Path,
'SUBDIR': 'data',
'SEP': sep,
'testRatio':0.2,
'valRatio':0.2,
'epochsVal':8,
'activationType':'relu',
'activationType2':'softmax',
'numOfClasses':26,
'kernelSize':(3, 3),
'poolSize':(2, 2),
'filterVal1':32,
'filterVal2':64,
'filterVal3':128,
'stridesVal':2,
'monitorVal':'val_loss',
'paddingVal1':'same',
'paddingVal2':'valid',
'reshapeVal':28,
'reshapeVal1':(28,28),
'patienceVal1':1,
'patienceVal2':2,
'sleepTime':3,
'sleepTime1':6,
'factorVal':0.2,
'learningRateVal':0.001,
'minDeltaVal':0,
'minLrVal':0.0001,
'verboseFlag':0,
'modeInd':'auto',
'shuffleVal':100,
'DenkseVal1':26,
'DenkseVal2':64,
'DenkseVal3':128,
'predParam':9,
'word_dict':{0:'A',1:'B',2:'C',3:'D',4:'E',5:'F',6:'G',7:'H',8:'I',9:'J',10:'K',11:'L',12:'M',13:'N',14:'O',15:'P',16:'Q',17:'R',18:'S',19:'T',20:'U',21:'V',22:'W',23:'X', 24:'Y',25:'Z'},
'width':640,
'height':480,
'imgSize': (32,32),
'threshold': 0.45,
'imgDimension': (400, 440),
'imgSmallDim': (7, 7),
'imgMidDim': (28, 28),
'reshapeParam1':1,
'reshapeParam2':28,
'colorFeed':(0,0,130),
'colorPredict':(0,25,255)
}


Important parameters that we need to follow from the above snippets are –

'testRatio':0.2,
'valRatio':0.2,
'epochsVal':8,
'activationType':'relu',
'activationType2':'softmax',
'numOfClasses':26,
'kernelSize':(3, 3),
'poolSize':(2, 2),
'word_dict':{0:'A',1:'B',2:'C',3:'D',4:'E',5:'F',6:'G',7:'H',8:'I',9:'J',10:'K',11:'L',12:'M',13:'N',14:'O',15:'P',16:'Q',17:'R',18:'S',19:'T',20:'U',21:'V',22:'W',23:'X', 24:'Y',25:'Z'},

Since we have 26 letters, we have set numOfClasses to 26.

Since we are dealing with characters, we had to come up with a process of identifying each character as a number & then building our entire logic around that. Hence, the parameter named word_dict stores all the characters in a Python dictionary. The application then translates the final numeric output back to the appropriate character as the prediction.
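As a quick illustration (with made-up probability values, not the project’s actual output), this is how a softmax output maps back to a letter via word_dict –

import numpy as np

word_dict = {0: 'A', 1: 'B', 2: 'C'}          # trimmed to three classes for brevity
probabilities = np.array([0.05, 0.90, 0.05])  # hypothetical softmax output
predicted_letter = word_dict[int(np.argmax(probabilities))]
print(predicted_letter)                       # -> 'B'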

2. clsAlphabetReading.py (Main training class to teach the model to predict alphabets from visual reader.)


###############################################
#### Written By: SATYAKI DE ####
#### Written On: 17-Jan-2022 ####
#### Modified On 17-Jan-2022 ####
#### ####
#### Objective: This python script will ####
#### teach & perfect the model to read ####
#### visual alphabets using Convolutional ####
#### Neural Network (CNN). ####
###############################################
from keras.datasets import mnist
import matplotlib.pyplot as plt
import cv2
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPool2D, Dropout
from tensorflow.keras.optimizers import SGD, Adam
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
from keras.utils.np_utils import to_categorical
import pandas as p
import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import np_utils
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from sklearn.utils import shuffle
import pickle
import os
import platform as pl
from clsConfig import clsConfig as cf
class clsAlphabetReading:
def __init__(self):
self.sep = str(cf.conf['SEP'])
self.Curr_Path = str(cf.conf['INIT_PATH'])
self.fileName = str(cf.conf['FILE_NAME'])
self.testRatio = float(cf.conf['testRatio'])
self.valRatio = float(cf.conf['valRatio'])
self.epochsVal = int(cf.conf['epochsVal'])
self.activationType = str(cf.conf['activationType'])
self.activationType2 = str(cf.conf['activationType2'])
self.numOfClasses = int(cf.conf['numOfClasses'])
self.kernelSize = cf.conf['kernelSize']
self.poolSize = cf.conf['poolSize']
self.filterVal1 = int(cf.conf['filterVal1'])
self.filterVal2 = int(cf.conf['filterVal2'])
self.filterVal3 = int(cf.conf['filterVal3'])
self.stridesVal = int(cf.conf['stridesVal'])
self.monitorVal = str(cf.conf['monitorVal'])
self.paddingVal1 = str(cf.conf['paddingVal1'])
self.paddingVal2 = str(cf.conf['paddingVal2'])
self.reshapeVal = int(cf.conf['reshapeVal'])
self.reshapeVal1 = cf.conf['reshapeVal1']
self.patienceVal1 = int(cf.conf['patienceVal1'])
self.patienceVal2 = int(cf.conf['patienceVal2'])
self.sleepTime = int(cf.conf['sleepTime'])
self.sleepTime1 = int(cf.conf['sleepTime1'])
self.factorVal = float(cf.conf['factorVal'])
self.learningRateVal = float(cf.conf['learningRateVal'])
self.minDeltaVal = int(cf.conf['minDeltaVal'])
self.minLrVal = float(cf.conf['minLrVal'])
self.verboseFlag = int(cf.conf['verboseFlag'])
self.modeInd = str(cf.conf['modeInd'])
self.shuffleVal = int(cf.conf['shuffleVal'])
self.DenkseVal1 = int(cf.conf['DenkseVal1'])
self.DenkseVal2 = int(cf.conf['DenkseVal2'])
self.DenkseVal3 = int(cf.conf['DenkseVal3'])
self.predParam = int(cf.conf['predParam'])
self.word_dict = cf.conf['word_dict']
def applyCNN(self, X_Train, Y_Train_Catg, X_Validation, Y_Validation_Catg):
try:
testRatio = self.testRatio
epochsVal = self.epochsVal
activationType = self.activationType
activationType2 = self.activationType2
numOfClasses = self.numOfClasses
kernelSize = self.kernelSize
poolSize = self.poolSize
filterVal1 = self.filterVal1
filterVal2 = self.filterVal2
filterVal3 = self.filterVal3
stridesVal = self.stridesVal
monitorVal = self.monitorVal
paddingVal1 = self.paddingVal1
paddingVal2 = self.paddingVal2
reshapeVal = self.reshapeVal
patienceVal1 = self.patienceVal1
patienceVal2 = self.patienceVal2
sleepTime = self.sleepTime
sleepTime1 = self.sleepTime1
factorVal = self.factorVal
learningRateVal = self.learningRateVal
minDeltaVal = self.minDeltaVal
minLrVal = self.minLrVal
verboseFlag = self.verboseFlag
modeInd = self.modeInd
shuffleVal = self.shuffleVal
DenkseVal1 = self.DenkseVal1
DenkseVal2 = self.DenkseVal2
DenkseVal3 = self.DenkseVal3
model = Sequential()
model.add(Conv2D(filters=filterVal1, kernel_size=kernelSize, activation=activationType, input_shape=(28,28,1)))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))
model.add(Conv2D(filters=filterVal2, kernel_size=kernelSize, activation=activationType, padding = paddingVal1))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))
model.add(Conv2D(filters=filterVal3, kernel_size=kernelSize, activation=activationType, padding = paddingVal2))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))
model.add(Flatten())
model.add(Dense(DenkseVal2,activation = activationType))
model.add(Dense(DenkseVal3,activation = activationType))
model.add(Dense(DenkseVal1,activation = activationType2))
model.compile(optimizer = Adam(learning_rate=learningRateVal), loss='categorical_crossentropy', metrics=['accuracy'])
reduce_lr = ReduceLROnPlateau(monitor=monitorVal, factor=factorVal, patience=patienceVal1, min_lr=minLrVal)
early_stop = EarlyStopping(monitor=monitorVal, min_delta=minDeltaVal, patience=patienceVal2, verbose=verboseFlag, mode=modeInd)
fittedModel = model.fit(X_Train, Y_Train_Catg, epochs=epochsVal, callbacks=[reduce_lr, early_stop], validation_data = (X_Validation,Y_Validation_Catg))
return (model, fittedModel)
except Exception as e:
x = str(e)
model = Sequential()
print('Error: ', x)
return (model, model)
def trainModel(self, debugInd, var):
try:
sep = self.sep
Curr_Path = self.Curr_Path
fileName = self.fileName
epochsVal = self.epochsVal
valRatio = self.valRatio
predParam = self.predParam
testRatio = self.testRatio
reshapeVal = self.reshapeVal
numOfClasses = self.numOfClasses
sleepTime = self.sleepTime
sleepTime1 = self.sleepTime1
shuffleVal = self.shuffleVal
reshapeVal1 = self.reshapeVal1
# Dictionary for getting characters from index values
word_dict = self.word_dict
print('File Name: ', str(fileName))
# Read the data
df_HW_Alphabet = p.read_csv(fileName).astype('float32')
# Sample Data
print('Sample Data: ')
print(df_HW_Alphabet.head())
# Split the data into (x – our data) & (y – the label to predict)
x = df_HW_Alphabet.drop('0',axis = 1)
y = df_HW_Alphabet['0']
# Reshaping the data in csv file to display as an image
X_Train, X_Test, Y_Train, Y_Test = train_test_split(x, y, test_size = testRatio)
X_Train, X_Validation, Y_Train, Y_Validation = train_test_split(X_Train, Y_Train, test_size = valRatio)
X_Train = np.reshape(X_Train.values, (X_Train.shape[0], reshapeVal, reshapeVal))
X_Test = np.reshape(X_Test.values, (X_Test.shape[0], reshapeVal, reshapeVal))
X_Validation = np.reshape(X_Validation.values, (X_Validation.shape[0], reshapeVal, reshapeVal))
print("Train Data Shape: ", X_Train.shape)
print("Test Data Shape: ", X_Test.shape)
print("Validation Data shape: ", X_Validation.shape)
# Plotting the number of alphabets in the dataset
Y_Train_Num = np.int0(y)
count = np.zeros(numOfClasses, dtype='int')
for i in Y_Train_Num:
count[i] +=1
alphabets = []
for i in word_dict.values():
alphabets.append(i)
fig, ax = plt.subplots(1,1, figsize=(7,7))
ax.barh(alphabets, count)
plt.xlabel("Number of elements ")
plt.ylabel("Alphabets")
plt.grid()
plt.show(block=False)
plt.pause(sleepTime)
plt.close()
# Shuffling the data
shuff = shuffle(X_Train[:shuffleVal])
# Model reshaping the training & test dataset
X_Train = X_Train.reshape(X_Train.shape[0],X_Train.shape[1],X_Train.shape[2],1)
print("Shape of Train Data: ", X_Train.shape)
X_Test = X_Test.reshape(X_Test.shape[0], X_Test.shape[1], X_Test.shape[2],1)
print("Shape of Test Data: ", X_Test.shape)
X_Validation = X_Validation.reshape(X_Validation.shape[0], X_Validation.shape[1], X_Validation.shape[2],1)
print("Shape of Validation data: ", X_Validation.shape)
# Converting the labels to categorical values
Y_Train_Catg = to_categorical(Y_Train, num_classes = numOfClasses, dtype='int')
print("Shape of Train Labels: ", Y_Train_Catg.shape)
Y_Test_Catg = to_categorical(Y_Test, num_classes = numOfClasses, dtype='int')
print("Shape of Test Labels: ", Y_Test_Catg.shape)
Y_Validation_Catg = to_categorical(Y_Validation, num_classes = numOfClasses, dtype='int')
print("Shape of validation labels: ", Y_Validation_Catg.shape)
model, history = self.applyCNN(X_Train, Y_Train_Catg, X_Validation, Y_Validation_Catg)
print('Model Summary: ')
print(model.summary())
# Displaying the accuracies & losses for train & validation set
print("Validation Accuracy :", history.history['val_accuracy'])
print("Training Accuracy :", history.history['accuracy'])
print("Validation Loss :", history.history['val_loss'])
print("Training Loss :", history.history['loss'])
# Displaying the Loss Graph
plt.figure(1)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['training','validation'])
plt.title('Loss')
plt.xlabel('epoch')
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()
# Displaying the Accuracy Graph
plt.figure(2)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training','validation'])
plt.title('Accuracy')
plt.xlabel('epoch')
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()
# Making the model to predict
pred = model.predict(X_Test[:predParam])
print('Test Details::')
print('X_Test: ', X_Test.shape)
print('Y_Test_Catg: ', Y_Test_Catg.shape)
try:
score = model.evaluate(X_Test, Y_Test_Catg, verbose=0)
print('Test Score = ', score[0])
print('Test Accuracy = ', score[1])
except Exception as e:
x = str(e)
print('Error: ', x)
# Displaying some of the test images & their predicted labels
fig, ax = plt.subplots(3,3, figsize=(8,9))
axes = ax.flatten()
for i in range(9):
axes[i].imshow(np.reshape(X_Test[i], reshapeVal1), cmap="Greys")
pred = word_dict[np.argmax(Y_Test_Catg[i])]
print('Prediction: ', pred)
axes[i].set_title("Test Prediction: " + pred)
axes[i].grid()
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()
fileName = Curr_Path + sep + 'Model' + sep + 'model_trained_' + str(epochsVal) + '.p'
print('Model Name: ', str(fileName))
pickle_out = open(fileName, 'wb')
pickle.dump(model, pickle_out)
pickle_out.close()
return 0
except Exception as e:
x = str(e)
print('Error: ', x)
return 1

Some of the key snippets from the above scripts are –

x = df_HW_Alphabet.drop('0',axis = 1)
y = df_HW_Alphabet['0']

In the above snippet, we have split the data into images & their corresponding labels.

X_Train, X_Test, Y_Train, Y_Test = train_test_split(x, y, test_size = testRatio)
X_Train, X_Validation, Y_Train, Y_Validation = train_test_split(X_Train, Y_Train, test_size = valRatio)

X_Train = np.reshape(X_Train.values, (X_Train.shape[0], reshapeVal, reshapeVal))
X_Test = np.reshape(X_Test.values, (X_Test.shape[0], reshapeVal, reshapeVal))
X_Validation = np.reshape(X_Validation.values, (X_Validation.shape[0], reshapeVal, reshapeVal))


print("Train Data Shape: ", X_Train.shape)
print("Test Data Shape: ", X_Test.shape)
print("Validation Data shape: ", X_Validation.shape)

We are splitting the data into Train, Test & Validation sets to get more accurate predictions, and reshaping the raw data by converting the 784 data columns into 28×28-pixel images.
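For clarity, here is a minimal sketch, using random values instead of the Kaggle CSV, of how a flat 784-column row becomes a 28×28 image –

import numpy as np

flat_rows = np.random.rand(5, 784)            # 5 hypothetical samples, 784 columns each
images = np.reshape(flat_rows, (flat_rows.shape[0], 28, 28))
print(images.shape)                           # (5, 28, 28)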

The following snippet will plot the count of each character in a matplotlib chart & showcase the overall distribution after splitting.

Y_Train_Num = np.int0(y)
count = np.zeros(numOfClasses, dtype='int')
for i in Y_Train_Num:
    count[i] +=1

alphabets = []
for i in word_dict.values():
    alphabets.append(i)

fig, ax = plt.subplots(1,1, figsize=(7,7))
ax.barh(alphabets, count)

plt.xlabel("Number of elements ")
plt.ylabel("Alphabets")
plt.grid()
plt.show(block=False)
plt.pause(sleepTime)
plt.close()

Note that we have tweaked plt.show with (block=False). This will enable the script to continue execution without human intervention after the initial pause.

# Model reshaping the training & test dataset
X_Train = X_Train.reshape(X_Train.shape[0],X_Train.shape[1],X_Train.shape[2],1)
print("Shape of Train Data: ", X_Train.shape)

X_Test = X_Test.reshape(X_Test.shape[0], X_Test.shape[1], X_Test.shape[2],1)
print("Shape of Test Data: ", X_Test.shape)

X_Validation = X_Validation.reshape(X_Validation.shape[0], X_Validation.shape[1], X_Validation.shape[2],1)
print("Shape of Validation data: ", X_Validation.shape)

# Converting the labels to categorical values
Y_Train_Catg = to_categorical(Y_Train, num_classes = numOfClasses, dtype='int')
print("Shape of Train Labels: ", Y_Train_Catg.shape)

Y_Test_Catg = to_categorical(Y_Test, num_classes = numOfClasses, dtype='int')
print("Shape of Test Labels: ", Y_Test_Catg.shape)

Y_Validation_Catg = to_categorical(Y_Validation, num_classes = numOfClasses, dtype='int')
print("Shape of validation labels: ", Y_Validation_Catg.shape)

In the above snippet, the application reshapes all three data sets & one-hot encodes the labels before calling the primary CNN function.
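If you want to see what to_categorical does in isolation, here is a small illustration (not taken from the training script; it assumes the TensorFlow-bundled Keras utility) –

import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 3, 25])                          # 'A', 'D', 'Z'
one_hot = to_categorical(labels, num_classes=26, dtype='int')
print(one_hot.shape)                                   # (3, 26)
print(one_hot[1])                                      # 1 at position 3, 0 elsewhere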

model = Sequential()

model.add(Conv2D(filters=filterVal1, kernel_size=kernelSize, activation=activationType, input_shape=(28,28,1)))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))

model.add(Conv2D(filters=filterVal2, kernel_size=kernelSize, activation=activationType, padding = paddingVal1))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))

model.add(Conv2D(filters=filterVal3, kernel_size=kernelSize, activation=activationType, padding = paddingVal2))
model.add(MaxPool2D(pool_size=poolSize, strides=stridesVal))

model.add(Flatten())

model.add(Dense(DenkseVal2,activation = activationType))
model.add(Dense(DenkseVal3,activation = activationType))

model.add(Dense(DenkseVal1,activation = activationType2))

model.compile(optimizer = Adam(learning_rate=learningRateVal), loss='categorical_crossentropy', metrics=['accuracy'])
reduce_lr = ReduceLROnPlateau(monitor=monitorVal, factor=factorVal, patience=patienceVal1, min_lr=minLrVal)
early_stop = EarlyStopping(monitor=monitorVal, min_delta=minDeltaVal, patience=patienceVal2, verbose=verboseFlag, mode=modeInd)


fittedModel = model.fit(X_Train, Y_Train_Catg, epochs=epochsVal, callbacks=[reduce_lr, early_stop],  validation_data = (X_Validation,Y_Validation_Catg))

return (model, fittedModel)

In the above snippet, the convolution layers are followed by maxpool layers, which reduce the spatial size of the extracted features. The output of the final maxpool layer is flattened into a one-dimensional vector and supplied as input to the Dense layers. The compiled CNN model is then trained using the training dataset.

We have used the Adam optimizer (other options such as SGD are also imported) & trained the model for eight epochs for better accuracy & predictions.

# Displaying the accuracies & losses for train & validation set
print("Validation Accuracy :", history.history['val_accuracy'])
print("Training Accuracy :", history.history['accuracy'])
print("Validation Loss :", history.history['val_loss'])
print("Training Loss :", history.history['loss'])

# Displaying the Loss Graph
plt.figure(1)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['training','validation'])
plt.title('Loss')
plt.xlabel('epoch')
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()

# Displaying the Accuracy Graph
plt.figure(2)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['training','validation'])
plt.title('Accuracy')
plt.xlabel('epoch')
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()

Also, we have captured the training & validation accuracy & loss & plotted them as two separate graphs for better understanding.

try:
    score = model.evaluate(X_Test, Y_Test_Catg, verbose=0)
    print('Test Score = ', score[0])
    print('Test Accuracy = ', score[1])
except Exception as e:
    x = str(e)
    print('Error: ', x)

Here, the application evaluates the accuracy of the model that we trained & validated earlier. This time we use the held-out test data to obtain the score.

# Displaying some of the test images & their predicted labels
fig, ax = plt.subplots(3,3, figsize=(8,9))
axes = ax.flatten()

for i in range(9):
    axes[i].imshow(np.reshape(X_Test[i], reshapeVal1), cmap="Greys")
    pred = word_dict[np.argmax(Y_Test_Catg[i])]
    print('Prediction: ', pred)
    axes[i].set_title("Test Prediction: " + pred)
    axes[i].grid()
plt.show(block=False)
plt.pause(sleepTime1)
plt.close()

Finally, the application tests with some random test data & plots the output alongside the prediction assessment.

Testing with Random Test Data
fileName = Curr_Path + sep + 'Model' + sep + 'model_trained_' + str(epochsVal) + '.p'
print('Model Name: ', str(fileName))

pickle_out = open(fileName, 'wb')
pickle.dump(model, pickle_out)
pickle_out.close()

As the last step, the application serializes the trained model using the pickle package & saves it under a specific location, which the reader application will use.
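For reference, here is a quick sketch of how the saved artefact can be read back (the reader script below does the same); the file path here is only an assumption for illustration –

import pickle

# Hypothetical path; the training script writes model_trained_<epochs>.p
with open('Model/model_trained_8.p', 'rb') as pickle_in:
    model = pickle.load(pickle_in)

print(type(model))   # the trained Keras Sequential model

Note that Keras’s own model.save() / load_model() is the more common way to persist a trained network; pickle works in this setup but ties the file to the installed Keras version.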

3. trainingVisualDataRead.py (Main application that will invoke the training class to predict alphabet through WebCam using Convolutional Neural Network (CNN).)


###############################################
#### Written By: SATYAKI DE ####
#### Written On: 17-Jan-2022 ####
#### Modified On 17-Jan-2022 ####
#### ####
#### Objective: This is the main calling ####
#### python script that will invoke the ####
#### clsAlphabetReading class to initiate ####
#### teach & perfect the model to read ####
#### visual alphabets using Convolutional ####
#### Neural Network (CNN). ####
###############################################
# We keep the setup code in a different class as shown below.
import clsAlphabetReading as ar
from clsConfig import clsConfig as cf
import datetime
import logging
###############################################
### Global Section ###
###############################################
# Instantiating all the three classes
x1 = ar.clsAlphabetReading()
###############################################
### End of Global Section ###
###############################################
def main():
try:
# Other useful variables
debugInd = 'Y'
var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
var1 = datetime.datetime.now()
print('Start Time: ', str(var))
# End of useful variables
# Initiating Log Class
general_log_path = str(cf.conf['LOG_PATH'])
# Enabling Logging Info
logging.basicConfig(filename=general_log_path + 'restoreVideo.log', level=logging.INFO)
print('Started Transformation!')
# Execute all the pass
r1 = x1.trainModel(debugInd, var)
if (r1 == 0):
print('Successfully Visual Alphabet Training Completed!')
else:
print('Failed to complete the Visual Alphabet Training!')
var2 = datetime.datetime.now()
c = var2 - var1
minutes = c.total_seconds() / 60
print('Total difference in minutes: ', str(minutes))
print('End Time: ', str(var2))
except Exception as e:
x = str(e)
print('Error: ', x)
if __name__ == "__main__":
main()

And the core snippet from the above script is –

x1 = ar.clsAlphabetReading()

Instantiate the main class.

r1 = x1.trainModel(debugInd, var)

The Python application will invoke the training method & capture the returned value inside the r1 variable.

4. readingVisualData.py (Reading the model to predict Alphabet using WebCAM.)


###############################################
#### Written By: SATYAKI DE ####
#### Written On: 18-Jan-2022 ####
#### Modified On 18-Jan-2022 ####
#### ####
#### Objective: This python script will ####
#### scan the live video feed from the ####
#### web-cam & predict the alphabet that ####
#### read it. ####
###############################################
# We keep the setup code in a different class as shown below.
from clsConfig import clsConfig as cf
import datetime
import logging
import cv2
import pickle
import numpy as np
###############################################
### Global Section ###
###############################################
sep = str(cf.conf['SEP'])
Curr_Path = str(cf.conf['INIT_PATH'])
fileName = str(cf.conf['FILE_NAME'])
epochsVal = int(cf.conf['epochsVal'])
numOfClasses = int(cf.conf['numOfClasses'])
word_dict = cf.conf['word_dict']
width = int(cf.conf['width'])
height = int(cf.conf['height'])
imgSize = cf.conf['imgSize']
threshold = float(cf.conf['threshold'])
imgDimension = cf.conf['imgDimension']
imgSmallDim = cf.conf['imgSmallDim']
imgMidDim = cf.conf['imgMidDim']
reshapeParam1 = int(cf.conf['reshapeParam1'])
reshapeParam2 = int(cf.conf['reshapeParam2'])
colorFeed = cf.conf['colorFeed']
colorPredict = cf.conf['colorPredict']
###############################################
### End of Global Section ###
###############################################
def main():
try:
# Other useful variables
debugInd = 'Y'
var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
var1 = datetime.datetime.now()
print('Start Time: ', str(var))
# End of useful variables
# Initiating Log Class
general_log_path = str(cf.conf['LOG_PATH'])
# Enabling Logging Info
logging.basicConfig(filename=general_log_path + 'restoreVideo.log', level=logging.INFO)
print('Started Live Streaming!')
cap = cv2.VideoCapture(0)
cap.set(3, width)
cap.set(4, height)
fileName = Curr_Path + sep + 'Model' + sep + 'model_trained_' + str(epochsVal) + '.p'
print('Model Name: ', str(fileName))
pickle_in = open(fileName, 'rb')
model = pickle.load(pickle_in)
# Default the return indicator so it exists even if the stream ends early
r1 = 1
while True:
status, img = cap.read()
if status == False:
break
img_copy = img.copy()
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, imgDimension)
img_copy = cv2.GaussianBlur(img_copy, imgSmallDim, 0)
img_gray = cv2.cvtColor(img_copy, cv2.COLOR_BGR2GRAY)
bin, img_thresh = cv2.threshold(img_gray, 100, 255, cv2.THRESH_BINARY_INV)
img_final = cv2.resize(img_thresh, imgMidDim)
img_final = np.reshape(img_final, (reshapeParam1,reshapeParam2,reshapeParam2,reshapeParam1))
img_pred = word_dict[np.argmax(model.predict(img_final))]
# Extracting Probability Values
Predict_X = model.predict(img_final)
probVal = round(np.amax(Predict_X) * 100)
cv2.putText(img, "Live Feed : (" + str(probVal) + "%) ", (20,25), cv2.FONT_HERSHEY_TRIPLEX, 0.7, color = colorFeed)
cv2.putText(img, "Prediction: " + img_pred, (20,410), cv2.FONT_HERSHEY_DUPLEX, 1.3, color = colorPredict)
cv2.imshow("Original Image", img)
if cv2.waitKey(1) & 0xFF == ord('q'):
r1=0
break
if (r1 == 0):
print('Successfully Alphabets predicted!')
else:
print('Failed to predict alphabet!')
var2 = datetime.datetime.now()
c = var2 - var1
minutes = c.total_seconds() / 60
print('Total Run Time in minutes: ', str(minutes))
print('End Time: ', str(var2))
except Exception as e:
x = str(e)
print('Error: ', x)
if __name__ == "__main__":
main()

And the key snippet from the above code is –

cap = cv2.VideoCapture(0)
cap.set(3, width)
cap.set(4, height)

The application reads the live video data from the WebCAM & sets the height & width for the video output.

fileName = Curr_Path + sep + 'Model' + sep + 'model_trained_' + str(epochsVal) + '.p'
print('Model Name: ', str(fileName))

pickle_in = open(fileName, 'rb')
model = pickle.load(pickle_in)

The application reads the model output generated as part of the previous script using the pickle package.

while True:
    status, img = cap.read()

    if status == False:
        break

The application reads frames from the WebCAM & exits if the video transmission ends or a corrupt video frame arrives.

img_copy = img.copy()

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, imgDimension)

img_copy = cv2.GaussianBlur(img_copy, imgSmallDim, 0)
img_gray = cv2.cvtColor(img_copy, cv2.COLOR_BGR2GRAY)
bin, img_thresh = cv2.threshold(img_gray, 100, 255, cv2.THRESH_BINARY_INV)

img_final = cv2.resize(img_thresh, imgMidDim)
img_final = np.reshape(img_final, (reshapeParam1,reshapeParam2,reshapeParam2,reshapeParam1))


img_pred = word_dict[np.argmax(model.predict(img_final))]

We initially clone the original video frame & then convert the copy from BGR to grayscale while applying an inverse binary threshold to it for better prediction outcomes. The image is then resized & reshaped for model input. Finally, the np.argmax function extracts the class index with the highest predicted probability, which is translated via the word_dict dictionary to an alphabet & displayed on top of the Live View.
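The same preprocessing chain can be summarised as a small helper; this is only a sketch (the function name prepareFrame is introduced here for illustration), assuming a BGR frame from OpenCV –

import cv2
import numpy as np

def prepareFrame(frame):
    # blur -> grayscale -> inverse threshold -> resize -> (1, 28, 28, 1) tensor
    blurred = cv2.GaussianBlur(frame, (7, 7), 0)
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 100, 255, cv2.THRESH_BINARY_INV)
    small = cv2.resize(thresh, (28, 28))
    return np.reshape(small, (1, 28, 28, 1))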

# Extracting Probability Values
Predict_X = model.predict(img_final)
probVal = round(np.amax(Predict_X) * 100)

Also, the application derives the confidence score from that probability & displays it on top of the Live View.

if cv2.waitKey(1) & 0xFF == ord('q'):
    r1=0
    break

The above code lets the developer exit from this application by pressing the “q”-key on the keyboard, after which the program terminates.


So, we’ve done it.

You will get the complete codebase in the following Github link.

I’ll bring some more exciting topics from the Python verse in the coming days. Please share & subscribe to my post & let me know your feedback.

Till then, Happy Avenging! 😀

Note: All the data & scenarios posted here are representational, available over the internet & intended for educational purposes only. Some of the images (except my photo) that we’ve used are available over the net; we don’t claim ownership of these images. There is always room for improvement, especially in the prediction quality of the alphabet.

Predicting real-time Covid-19 forecast by analyzing time-series data using Facebook machine-learning API

Hello Guys,

Today, I’ll share an important post on predicting data using Facebook’s relatively new machine-learning-based API. I find this API interesting in how it builds the model & anticipates the outcome.

We’ll be using one of the most widely accepted API-based sources for Covid-19 data & I’ll be sharing the link over here.

We’ll be using the prophet-API developed by Facebook to predict the data. You will get the details from this link.

Architecture

Now, let’s explore the architecture shared above.

As you can see, the application will consume the data from the third-party API named “about-corona,” & the Python application will clean & transform it. The application will then send the clean data to the Facebook API (Prophet), built on a machine-learning algorithm. This API is another effective time-series analysis platform given to the data-science community.

Once the application receives the predictions from the model, it will visualize them using plotly & matplotlib.


Please check the demo of this application for your reference.

Demo Run

We’ll do a time series analysis. Let us understand the basic concept of time series.

A time series is a series of data points indexed (or listed or graphed) in time order.

Therefore, data organized by relatively deterministic timestamps, potentially compared with random sample data, contains additional information that we can leverage in our business use case to make better decisions.

To use the Prophet API, one needs to clean & transform their data so that it contains exactly two fields (ds & y).
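Here is a tiny sketch with synthetic numbers (not the Covid data) showing the two-column frame Prophet expects & a minimal fit/predict cycle –

import pandas as pd
from prophet import Prophet

# 30 days of made-up daily counts in the two-column shape Prophet expects.
df = pd.DataFrame({
    'ds': pd.date_range('2021-07-01', periods=30, freq='D'),
    'y': [100 + 5 * i for i in range(30)],
})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=7)     # forecast one extra week
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())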

Let’s check one such use case, since our source has plenty of good data points. We have daily data of newly infected Covid patients per country, as shown below –

Covid Cases

And, our clean class will transform the data into two fields –

Transformed Data

Once we fit the data into the prophet model, it will generate some additional columns, which will be used for prediction as shown below –

Generated data from prophet-api

And, a sample prediction based on a similar kind of data would be identical to this –

Sample Prediction

Let us understand what packages we need to install to prepare this application –

Installing Dependency Packages – I
Installing Dependency Packages – II

And here are the packages –

pip install pandas
pip install matplotlib
pip install prophet

Let us now revisit the code –

1. clsConfig.py ( This native Python script contains the configuration entries. )


#####################################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Jul-2021 ####
#### ####
#### Objective: This script is a config ####
#### file, contains all the keys for ####
#### for Prophet API. Application will ####
#### process these information & perform ####
#### the call to our newly developed with ####
#### APIs developed by Facebook & a open-source ####
#### project called "About-Corona". ####
#####################################################
import os
import platform as pl
class clsConfig(object):
Curr_Path = os.path.dirname(os.path.realpath(__file__))
os_det = pl.system()
if os_det == "Windows":
sep = '\\'
else:
sep = '/'
conf = {
'APP_ID': 1,
"URL":"https://corona-api.com/countries/",
"appType":"application/json",
"conType":"keep-alive",
"limRec": 10,
"CACHE":"no-cache",
"coList": "DE, IN, US, CA, GB, ID, BR",
"LOG_PATH":Curr_Path + sep + 'log' + sep,
"MAX_RETRY": 3,
"FNC": "NewConfirmed",
"TMS": "ReportedDate",
"FND": "NewDeaths"
}


We’re not going to discuss anything specific to this script.

2. clsL.py ( This native Python script logs the application. )


#####################################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Jul-2021 ####
#### ####
#### Objective: This script is a log ####
#### file, that is useful for debugging purpose. ####
#### ####
#####################################################
import pandas as p
import os
import platform as pl
class clsL(object):
def __init__(self):
self.path = os.path.dirname(os.path.realpath(__file__))
def logr(self, Filename, Ind, df, subdir=None, write_mode='w', with_index='N'):
try:
x = p.DataFrame()
x = df
sd = subdir
os_det = pl.system()
if sd == None:
if os_det == "windows":
fullFileName = self.path + '\\' + Filename
else:
fullFileName = self.path + '/' + Filename
else:
if os_det == "windows":
fullFileName = self.path + '\\' + sd + '\\' + Filename
else:
fullFileName = self.path + '/' + sd + '/' + Filename
if (with_index == 'N'):
if ((Ind == 'Y') & (write_mode == 'w')):
x.to_csv(fullFileName, index=False)
else:
x.to_csv(fullFileName, index=False, mode=write_mode, header=None)
else:
if ((Ind == 'Y') & (write_mode == 'w')):
x.to_csv(fullFileName)
else:
x.to_csv(fullFileName, mode=write_mode, header=None)
return 0
except Exception as e:
y = str(e)
print(y)
return 3


Based on the operating system, the log class will capture useful information under the “log” directory in the form of CSV files for later reference.

3. clsForecast.py ( This native Python script will clean & transform the data. )


##############################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Jul-2021 ####
#### Modified On 26-Jul-2021 ####
#### ####
#### Objective: Calling Data Cleaning API ####
##############################################
import json
from clsConfig import clsConfig as cf
import requests
import logging
import time
import pandas as p
import clsL as cl
from prophet import Prophet
class clsForecast:
def __init__(self):
self.fnc = cf.conf['FNC']
self.fnd = cf.conf['FND']
self.tms = cf.conf['TMS']
def forecastNewConfirmed(self, srcDF, debugInd, varVa):
try:
fnc = self.fnc
tms = self.tms
var = varVa
debug_ind = debugInd
countryISO = ''
df_M = p.DataFrame()
dfWork = srcDF
# Initiating Log class
l = cl.clsL()
#Extracting the unique country name
unqCountry = dfWork['CountryCode'].unique()
for i in unqCountry:
countryISO = i.strip()
print('Country Name: ' + countryISO)
df_Comm = dfWork[[tms, fnc]]
l.logr('13.df_Comm_' + var + '.csv', debug_ind, df_Comm, 'log')
# Aligning as per Prophet naming convention
df_Comm.columns = ['ds', 'y']
l.logr('14.df_Comm_Mod_' + var + '.csv', debug_ind, df_Comm, 'log')
return df_Comm
except Exception as e:
x = str(e)
print(x)
logging.info(x)
df = p.DataFrame()
return df
def forecastNewDead(self, srcDF, debugInd, varVa):
try:
fnd = self.fnd
tms = self.tms
var = varVa
debug_ind = debugInd
countryISO = ''
df_M = p.DataFrame()
dfWork = srcDF
# Initiating Log class
l = cl.clsL()
#Extracting the unique country name
unqCountry = dfWork['CountryCode'].unique()
for i in unqCountry:
countryISO = i.strip()
print('Country Name: ' + countryISO)
df_Comm = dfWork[[tms, fnd]]
l.logr('17.df_Comm_' + var + '.csv', debug_ind, df_Comm, 'log')
# Aligning as per Prophet naming convention
df_Comm.columns = ['ds', 'y']
l.logr('18.df_Comm_Mod_' + var + '.csv', debug_ind, df_Comm, 'log')
return df_Comm
except Exception as e:
x = str(e)
print(x)
logging.info(x)
df = p.DataFrame()
return df


Let’s explore the critical snippet out of this script –

df_Comm = dfWork[[tms, fnc]]

Now, the application will extract only the relevant columns to proceed.

df_Comm.columns = ['ds', 'y']

It then assigns the specific column names that the Prophet API requires.

4. clsCovidAPI.py ( This native Python script will call the Covid-19 API. )


##############################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Jul-2021 ####
#### Modified On 26-Jul-2021 ####
#### ####
#### Objective: Calling Covid-19 API ####
##############################################
import json
from clsConfig import clsConfig as cf
import requests
import logging
import time
import pandas as p
import clsL as cl
class clsCovidAPI:
def __init__(self):
self.url = cf.conf['URL']
self.azure_cache = cf.conf['CACHE']
self.azure_con = cf.conf['conType']
self.type = cf.conf['appType']
self.typVal = cf.conf['coList']
self.max_retries = cf.conf['MAX_RETRY']
def searchQry(self, varVa, debugInd):
try:
url = self.url
api_cache = self.azure_cache
api_con = self.azure_con
type = self.type
typVal = self.typVal
max_retries = self.max_retries
var = varVa
debug_ind = debugInd
cnt = 0
df_M = p.DataFrame()
# Initiating Log class
l = cl.clsL()
payload = {}
strMsg = 'Input Countries: ' + str(typVal)
logging.info(strMsg)
headers = {}
countryList = typVal.split(',')
for i in countryList:
# Failed case retry
retries = 1
success = False
val = ''
try:
while not success:
# Getting response from web service
try:
df_conv = p.DataFrame()
strCountryUrl = url + str(i).strip()
print('Country: ' + str(i).strip())
print('Url: ' + str(strCountryUrl))
str1 = 'Url: ' + str(strCountryUrl)
logging.info(str1)
response = requests.request("GET", strCountryUrl, headers=headers, params=payload)
ResJson = response.text
#jdata = json.dumps(ResJson)
RJson = json.loads(ResJson)
df_conv = p.io.json.json_normalize(RJson)
df_conv.drop(['data.timeline'], axis=1, inplace=True)
df_conv['DummyKey'] = 1
df_conv.set_index('DummyKey')
l.logr('1.df_conv_' + var + '.csv', debug_ind, df_conv, 'log')
# Extracting timeline part separately
Rjson_1 = RJson['data']['timeline']
df_conv2 = p.io.json.json_normalize(Rjson_1)
df_conv2['DummyKey'] = 1
df_conv2.set_index('DummyKey')
l.logr('2.df_conv_timeline_' + var + '.csv', debug_ind, df_conv2, 'log')
# Doing Cross Join
df_fin = df_conv.merge(df_conv2, on='DummyKey', how='outer')
l.logr('3.df_fin_' + var + '.csv', debug_ind, df_fin, 'log')
# Merging with the previous Country Code data
if cnt == 0:
df_M = df_fin
else:
d_frames = [df_M, df_fin]
df_M = p.concat(d_frames)
cnt += 1
strCountryUrl = ''
if str(response.status_code)[:1] == '2':
success = True
else:
wait = retries * 2
print("retries Fail! Waiting " + str(wait) + " seconds and retrying!")
str_R1 = "retries Fail! Waiting " + str(wait) + " seconds and retrying!"
logging.info(str_R1)
time.sleep(wait)
retries += 1
# Checking maximum retries
if retries == max_retries:
success = True
raise Exception
except Exception as e:
x = str(e)
print(x)
logging.info(x)
pass
except Exception as e:
pass
l.logr('4.df_M_' + var + '.csv', debug_ind, df_M, 'log')
return df_M
except Exception as e:
x = str(e)
print(x)
logging.info(x)
df = p.DataFrame()
return df


Let us explore the key snippet –

countryList = typVal.split(',')

The application splits the configured country list from the configuration script into individual country codes.

response = requests.request("GET", strCountryUrl, headers=headers, params=payload)
ResJson = response.text

RJson = json.loads(ResJson)

df_conv = p.io.json.json_normalize(RJson)
df_conv.drop(['data.timeline'], axis=1, inplace=True)
df_conv['DummyKey'] = 1
df_conv.set_index('DummyKey')

The application extracts the elements, normalizes the JSON & converts it into a pandas dataframe. It also adds one dummy column, which will later be used to merge the data with another set.

# Extracting timeline part separately
Rjson_1 = RJson['data']['timeline']

df_conv2 = p.io.json.json_normalize(Rjson_1)
df_conv2['DummyKey'] = 1
df_conv2.set_index('DummyKey')

Now, the application takes the nested timeline element & normalizes it at the granular level. It also adds the dummy column so both data sets can be joined together.

# Doing Cross Join
df_fin = df_conv.merge(df_conv2, on='DummyKey', how='outer')

The application merges both data sets to get the complete denormalized data for our use case.
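A toy illustration of the dummy-key trick (with made-up values): both frames get the same constant key, so an outer merge behaves like a cross join between the country header & its timeline rows –

import pandas as pd

header = pd.DataFrame({'country': ['IN'], 'population': [1380000000]})
timeline = pd.DataFrame({'date': ['2021-07-25', '2021-07-26'],
                         'new_confirmed': [39000, 41000]})

# Same constant key on both sides turns the outer merge into a cross join.
header['DummyKey'] = 1
timeline['DummyKey'] = 1

combined = header.merge(timeline, on='DummyKey', how='outer')
print(combined)   # one row per timeline entry, each carrying the header columns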

# Merging with the previous Country Code data
if cnt == 0:
    df_M = df_fin
else:
    d_frames = [df_M, df_fin]
    df_M = p.concat(d_frames)

This entire deserializing execution happens per country. Hence, the above snippet creates an individual sub-set per country & later unions all the sets together.

if str(response.status_code)[:1] == '2':
    success = True
else:
    wait = retries * 2
    print("retries Fail! Waiting " + str(wait) + " seconds and retrying!")
    str_R1 = "retries Fail! Waiting " + str(wait) + " seconds and retrying!"
    logging.info(str_R1)
    time.sleep(wait)
    retries += 1

# Checking maximum retries
if retries == max_retries:
    success = True
    raise  Exception

If any call to the source API fails, the application will retry after waiting for a specific time, until it reaches the maximum retry count.
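The same retry idea can be condensed into a small helper; this is only a sketch (fetchWithRetry is a name introduced here, not part of the project code) with waits growing on every failed attempt –

import time
import requests

def fetchWithRetry(url, max_retries=3):
    retries = 1
    while retries <= max_retries:
        response = requests.get(url)
        # Any 2xx status code counts as success.
        if str(response.status_code).startswith('2'):
            return response
        wait = retries * 2
        print('Retry failed! Waiting ' + str(wait) + ' seconds and retrying!')
        time.sleep(wait)
        retries += 1
    raise Exception('Maximum retries reached for ' + url)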

5. callPredictCovidAnalysis.py ( This native Python script is the main one to predict the Covid. )


##############################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Jul-2021 ####
#### Modified On 26-Jul-2021 ####
#### ####
#### Objective: Calling multiple API's ####
#### that including Prophet-API developed ####
#### by Facebook for future prediction of ####
#### Covid-19 situations in upcoming days ####
#### for world's major hotspots. ####
##############################################
import json
import clsCovidAPI as ca
from clsConfig import clsConfig as cf
import datetime
import logging
import clsL as cl
import clsForecast as f
from prophet import Prophet
from prophet.plot import plot_plotly, plot_components_plotly
import matplotlib.pyplot as plt
import pandas as p
# Disabling Warning
def warn(*args, **kwargs):
pass
import warnings
warnings.warn = warn
# Initiating Log class
l = cl.clsL()
# Helper Function that removes underscores
def countryDet(inputCD):
try:
countryCD = inputCD
if str(countryCD) == 'DE':
cntCD = 'Germany'
elif str(countryCD) == 'BR':
cntCD = 'Brazil'
elif str(countryCD) == 'GB':
cntCD = 'United Kingdom'
elif str(countryCD) == 'US':
cntCD = 'United States'
elif str(countryCD) == 'IN':
cntCD = 'India'
elif str(countryCD) == 'CA':
cntCD = 'Canada'
elif str(countryCD) == 'ID':
cntCD = 'Indonesia'
else:
cntCD = 'N/A'
return cntCD
except:
cntCD = 'N/A'
return cntCD
def plot_picture(inputDF, debug_ind, var, countryCD, stat):
try:
iDF = inputDF
# Lowercase the column names
iDF.columns = [c.lower() for c in iDF.columns]
# Determine which is Y axis
y_col = [c for c in iDF.columns if c.startswith('y')][0]
# Determine which is X axis
x_col = [c for c in iDF.columns if c.startswith('ds')][0]
# Data Conversion
iDF['y'] = iDF[y_col].astype('float')
iDF['ds'] = iDF[x_col].astype('datetime64[ns]')
# Forecast calculations
# Decreasing the changepoint_prior_scale to 0.001 to make the trend less flexible
m = Prophet(n_changepoints=20, yearly_seasonality=True, changepoint_prior_scale=0.001)
m.fit(iDF)
forecastDF = m.make_future_dataframe(periods=365)
forecastDF = m.predict(forecastDF)
l.logr('15.forecastDF_' + var + '_' + countryCD + '.csv', debug_ind, forecastDF, 'log')
df_M = forecastDF[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
l.logr('16.df_M_' + var + '_' + countryCD + '.csv', debug_ind, df_M, 'log')
#m.plot_components(df_M)
# Getting Full Country Name
cntCD = countryDet(countryCD)
# Draw forecast results
lbl = str(cntCD) + ' – Covid – ' + stat
m.plot(df_M, xlabel = 'Date', ylabel = lbl)
# Combine all graps in the same page
plt.title(f'Covid Forecasting')
plt.title(lbl)
plt.ylabel('Millions')
plt.show()
return 0
except Exception as e:
x = str(e)
print(x)
return 1
def countrySpecificDF(counryDF, val):
try:
countryName = val
df = counryDF
df_lkpFile = df[(df['CountryCode'] == val)]
return df_lkpFile
except:
df = p.DataFrame()
return df
def main():
try:
var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
print('*' *60)
DInd = 'Y'
NC = 'New Confirmed'
ND = 'New Dead'
SM = 'data process Successful!'
FM = 'data process Failure!'
print("Calling the custom Package for large file splitting..")
print('Start Time: ' + str(var1))
countryList = str(cf.conf['coList']).split(',')
# Initiating Log Class
general_log_path = str(cf.conf['LOG_PATH'])
# Enabling Logging Info
logging.basicConfig(filename=general_log_path + 'CovidAPI.log', level=logging.INFO)
# Create the instance of the Covid API Class
x1 = ca.clsCovidAPI()
# Let's pass this to our map section
retDF = x1.searchQry(var1, DInd)
retVal = int(retDF.shape[0])
if retVal > 0:
print('Successfully Covid Data Extracted from the API-source.')
else:
print('Something wrong with your API-source!')
# Extracting Skeleton Data
df = retDF[['data.code', 'date', 'deaths', 'confirmed', 'recovered', 'new_confirmed', 'new_recovered', 'new_deaths', 'active']]
df.columns = ['CountryCode', 'ReportedDate', 'TotalReportedDead', 'TotalConfirmedCase', 'TotalRecovered', 'NewConfirmed', 'NewRecovered', 'NewDeaths', 'ActiveCaases']
df.dropna()
print('Returned Skeleton Data Frame: ')
print(df)
l.logr('5.df_' + var1 + '.csv', DInd, df, 'log')
# Working with forecast
# Create the instance of the Forecast API Class
x2 = f.clsForecast()
# Fetching each country name & then get the details
cnt = 6
for i in countryList:
try:
cntryIndiv = i.strip()
print('Country Processing: ' + str(cntryIndiv))
# Creating dataframe for each country
# Germany Main DataFrame
dfCountry = countrySpecificDF(df, cntryIndiv)
l.logr(str(cnt) + '.df_' + cntryIndiv + '_' + var1 + '.csv', DInd, dfCountry, 'log')
# Let's pass this to our map section
retDFGenNC = x2.forecastNewConfirmed(dfCountry, DInd, var1)
statVal = str(NC)
a1 = plot_picture(retDFGenNC, DInd, var1, cntryIndiv, statVal)
retDFGenNC_D = x2.forecastNewDead(dfCountry, DInd, var1)
statVal = str(ND)
a2 = plot_picture(retDFGenNC_D, DInd, var1, cntryIndiv, statVal)
cntryFullName = countryDet(cntryIndiv)
if (a1 + a2) == 0:
oprMsg = cntryFullName + ' ' + SM
print(oprMsg)
else:
oprMsg = cntryFullName + ' ' + FM
print(oprMsg)
# Resetting the dataframe value for the next iteration
dfCountry = p.DataFrame()
cntryIndiv = ''
oprMsg = ''
cntryFullName = ''
a1 = 0
a2 = 0
statVal = ''
cnt += 1
except Exception as e:
x = str(e)
print(x)
var2 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
print('End Time: ' + str(var2))
print('*' *60)
except Exception as e:
x = str(e)
if __name__ == "__main__":
main()

Let us explore the key snippet –

def countryDet(inputCD):
    try:
        countryCD = inputCD

        if str(countryCD) == 'DE':
            cntCD = 'Germany'
        elif str(countryCD) == 'BR':
            cntCD = 'Brazil'
        elif str(countryCD) == 'GB':
            cntCD = 'United Kingdom'
        elif str(countryCD) == 'US':
            cntCD = 'United States'
        elif str(countryCD) == 'IN':
            cntCD = 'India'
        elif str(countryCD) == 'CA':
            cntCD = 'Canada'
        elif str(countryCD) == 'ID':
            cntCD = 'Indonesia'
        else:
            cntCD = 'N/A'

        return cntCD
    except:
        cntCD = 'N/A'

        return cntCD

The application derives the full country name from the ISO country code.
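As a design note, the same lookup can be sketched with a plain dictionary, where dict.get() provides the ‘N/A’ fallback without the elif chain –

ISO_TO_NAME = {
    'DE': 'Germany', 'BR': 'Brazil', 'GB': 'United Kingdom',
    'US': 'United States', 'IN': 'India', 'CA': 'Canada', 'ID': 'Indonesia',
}

def countryName(inputCD):
    # dict.get() falls back to 'N/A' for any unknown code.
    return ISO_TO_NAME.get(str(inputCD).strip(), 'N/A')

print(countryName('IN'))   # India
print(countryName('XX'))   # N/A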

# Lowercase the column names
iDF.columns = [c.lower() for c in iDF.columns]
# Determine which is Y axis
y_col = [c for c in iDF.columns if c.startswith('y')][0]
# Determine which is X axis
x_col = [c for c in iDF.columns if c.startswith('ds')][0]

# Data Conversion
iDF['y'] = iDF[y_col].astype('float')
iDF['ds'] = iDF[x_col].astype('datetime64[ns]')

The above script converts all the column names to lowercase & then casts the two key columns to the appropriate data types.

# Forecast calculations
# Decreasing the changepoint_prior_scale to 0.001 to make the trend less flexible
m = Prophet(n_changepoints=20, yearly_seasonality=True, changepoint_prior_scale=0.001)
m.fit(iDF)

forecastDF = m.make_future_dataframe(periods=365)

forecastDF = m.predict(forecastDF)

l.logr('15.forecastDF_' + var + '_' + countryCD + '.csv', debug_ind, forecastDF, 'log')

df_M = forecastDF[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]

l.logr('16.df_M_' + var + '_' + countryCD + '.csv', debug_ind, df_M, 'log')

The above snippet uses the machine-learning-driven Prophet API, where the application fits the model to the existing data & then predicts one year into the future. Also, we’ve specified the number of changepoints. By default, the Prophet API places 25 changepoints in the initial 80% of the data set.

Prophet allows you to adjust the trend in case of overfitting or underfitting. changepoint_prior_scale controls how flexible the trend is; decreasing it to 0.001 makes the trend less flexible.
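A hedged illustration of these two knobs (the dataframe df is assumed to be a Prophet-ready frame with ‘ds’ & ‘y’ columns) –

from prophet import Prophet

# Stiffer trend: fewer allowed changepoints & a smaller prior scale,
# matching the settings used in plot_picture() above.
m_stiff = Prophet(n_changepoints=20, yearly_seasonality=True,
                  changepoint_prior_scale=0.001)

# More flexible trend: Prophet's defaults (25 changepoints, scale 0.05).
m_flexible = Prophet()

# m_stiff.fit(df); m_flexible.fit(df)   # then compare the forecasts visually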

def countrySpecificDF(counryDF, val):
    try:
        countryName = val
        df = counryDF

        df_lkpFile = df[(df['CountryCode'] == val)]

        return df_lkpFile
    except:
        df = p.DataFrame()

        return df

The application filters the master dataframe to create a country-specific dataframe.

for i in countryList:
    try:
        cntryIndiv = i.strip()

        print('Country Processing: ' + str(cntryIndiv))

        # Creating dataframe for each country
        # Germany Main DataFrame
        dfCountry = countrySpecificDF(df, cntryIndiv)
        l.logr(str(cnt) + '.df_' + cntryIndiv + '_' + var1 + '.csv', DInd, dfCountry, 'log')

        # Let's pass this to our map section
        retDFGenNC = x2.forecastNewConfirmed(dfCountry, DInd, var1)

        statVal = str(NC)

        a1 = plot_picture(retDFGenNC, DInd, var1, cntryIndiv, statVal)

        retDFGenNC_D = x2.forecastNewDead(dfCountry, DInd, var1)

        statVal = str(ND)

        a2 = plot_picture(retDFGenNC_D, DInd, var1, cntryIndiv, statVal)

        cntryFullName = countryDet(cntryIndiv)

        if (a1 + a2) == 0:
            oprMsg = cntryFullName + ' ' + SM
            print(oprMsg)
        else:
            oprMsg = cntryFullName + ' ' + FM
            print(oprMsg)

        # Resetting the dataframe value for the next iteration
        dfCountry = p.DataFrame()
        cntryIndiv = ''
        oprMsg = ''
        cntryFullName = ''
        a1 = 0
        a2 = 0
        statVal = ''

        cnt += 1
    except Exception as e:
        x = str(e)
        print(x)

The above snippet calls the forecast functions for each country & then plots the visual representation of the predicted data points.


Let us run the application –

Application Run

And, it will generate the visual representation as follows –

Application Run – Continue

And, here is the folder structure –

Directory Structure

Let’s explore the comparison study & try to find out the outcome –

Option – 1
Option – 2
Option – 3
Option -4

Let us analyze the above visual data points.


Conclusion:

Let’s explore the comparison study & try to find out the outcome –

  1. India may see a rise in new Covid cases & might cross the 400,000 mark during June 2022, which would be the highest among the countries we’ve considered here (India, Indonesia, Germany, the US, the UK, Canada & Brazil). The second worst-affected country might be the US during the same period, followed by Indonesia.
  2. Canada will be the least affected country during June 2022. The figure should be within 12,000.
  3. However, in terms of death cases, India is not the only leading country. The US, India & Brazil may each see almost 4,000 or slightly over the 4,000 mark.

So, we’ve done it.


You will get the complete codebase in the following Github link.

I’ll bring some more exciting topics from the Python verse in the coming days.

Till then, Happy Avenging! 😀


Note: All the data & scenarios posted here are representational, available over the internet & intended for educational purposes only.

One more thing to understand is that this prediction is based on limited data points. The actual events may unfold differently. Ideally, countries take a cue from this kind of analysis & initiate appropriate measures to avoid the high curve. And that is one of the main objectives of time-series analysis.

There is always room for improvement in this kind of model & the solution associated with it. I’ve shown the basic way to achieve it for educational purposes only.

Canada’s Covid19 analysis based on Logistic Regression

Hi Guys,

Today, I’ll be demonstrating some scenarios based on open-source data from Canada. In this post, I will only explain the significant parts of the code, not the entire set of scripts.

Let’s explore a couple of sample source data –

2. Sample Input Data

I would like to explore how much of an impact this disease had on the elderly in Canada.

Let’s explore the source directory structure –

3. Source Directory Structures

For this, you need to install the following packages –

pip install pandas

pip install seaborn

Please find the PyPi link given below –

In this case, we’ve downloaded the data from Canada’s site. However, they have also created an API, so you can consume the data that way as well. Since the volume is a little large, I decided to download it as CSV & then use that for my analysis.

Before I start, let me explain a couple of critical assumptions that I had to make due to data impurities or availability.

  • If there is no data available for a specific case, my application will consider that patient as COVID-Active.
  • We will consider the patient affected through community spreading unless the data indicates otherwise.
  • If there is no data available for gender, we’re marking these records as “Other,” placing them in the category of patients who don’t want to disclose their gender.
  • If we don’t have any data, then by default, the application is considering the patient is alive.
  • Lastly, my application considers the middle point of the age range data for all the categories, i.e., the patient’s age between 20 & 30 will be considered as 25.

1. clsCovidAnalysisByCountryAdv (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 01-Jun-2020              ####
#### Modified On 01-Jun-2020              ####
####                                      ####
#### Objective: Main scripts for Logistic ####
#### Regression.                          ####
##############################################

import pandas as p
import clsL as log
import datetime

import matplotlib.pyplot as plt
import seaborn as sns
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsCovidAnalysisByCountryAdv:
    def __init__(self):
        self.fileName_1 = cf.config['FILE_NAME_1']
        self.fileName_2 = cf.config['FILE_NAME_2']
        self.Ind = cf.config['DEBUG_IND']
        self.subdir = str(cf.config['LOG_DIR_NAME'])

    def setDefaultActiveCases(self, row):
        try:
            str_status = str(row['case_status'])

            if str_status == 'Not Reported':
                return 'Active'
            else:
                return str_status
        except:
            return 'Active'

    def setDefaultExposure(self, row):
        try:
            str_exposure = str(row['exposure'])

            if str_exposure == 'Not Reported':
                return 'Community'
            else:
                return str_exposure
        except:
            return 'Community'

    def setGender(self, row):
        try:
            str_gender = str(row['gender'])

            if str_gender == 'Not Reported':
                return 'Other'
            else:
                return str_gender
        except:
            return 'Other'

    def setSurviveStatus(self, row):
        try:
            # 0 - Deceased
            # 1 - Alive
            str_active = str(row['ActiveCases'])

            if str_active == 'Deceased':
                return 0
            else:
                return 1
        except:
            return 1

    def getAgeFromGroup(self, row):
        try:
            # We'll take the middle of the Age group.
            # If the age range falls below 20, we'll
            # consider it as 10.
            # Similarly, an age group between 20 & 30
            # should reflect as 25.
            # Anything above 80 will be considered as
            # 85.

            str_age_group = str(row['AgeGroup'])

            if str_age_group == '<20':
                return 10
            elif str_age_group == '20-29':
                return 25
            elif str_age_group == '30-39':
                return 35
            elif str_age_group == '40-49':
                return 45
            elif str_age_group == '50-59':
                return 55
            elif str_age_group == '60-69':
                return 65
            elif str_age_group == '70-79':
                return 75
            else:
                return 85
        except:
            return 100

    def predictResult(self):
        try:
            
            # Initiating Logging Instances
            clog = log.clsL()

            # Important variables
            var = datetime.datetime.now().strftime(".%H.%M.%S")
            print('Target File Extension will contain the following:: ', var)
            Ind = self.Ind
            subdir = self.subdir

            #######################################
            #                                     #
            # Using Logistic Regression to        #
            # Identify the following scenarios -  #
            #                                     #
            # Age wise Infection Vs Deaths        #
            #                                     #
            #######################################
            inputFileName_2 = self.fileName_2

            # Reading from Input File
            df_2 = p.read_csv(inputFileName_2)

            # Fetching only relevant columns
            df_2_Mod = df_2[['date_reported','age_group','gender','exposure','case_status']]
            df_2_Mod['State'] = df_2['province_abbr']

            print()
            print('Projecting 2nd file sample rows: ')
            print(df_2_Mod.head())

            print()
            x_row_1 = df_2_Mod.shape[0]
            x_col_1 = df_2_Mod.shape[1]

            print('Total Number of Rows: ', x_row_1)
            print('Total Number of columns: ', x_col_1)

            #########################################################################################
            # Few Assumptions                                                                       #
            #########################################################################################
            # By default, if there is no data on exposure - We'll treat that as community spreading #
            # By default, if there is no data on case_status - We'll consider this as active        #
            # By default, if there is no data on gender - We'll put that under a separate Gender    #
            # category marked as the "Other". This includes someone who doesn't want to identify    #
            # his/her gender or wants to be part of LGBT community in a generic term.               #
            #                                                                                       #
            # We'll transform our data accordingly based on the above logic.                        #
            #########################################################################################
            df_2_Mod['ActiveCases'] = df_2_Mod.apply(lambda row: self.setDefaultActiveCases(row), axis=1)
            df_2_Mod['ExposureStatus'] = df_2_Mod.apply(lambda row: self.setDefaultExposure(row), axis=1)
            df_2_Mod['Gender'] = df_2_Mod.apply(lambda row: self.setGender(row), axis=1)

            # Filtering all other records where we don't get any relevant information
            # Fetching Data for
            df_3 = df_2_Mod[(df_2_Mod['age_group'] != 'Not Reported')]

            # Dropping unwanted columns
            df_3.drop(columns=['exposure'], inplace=True)
            df_3.drop(columns=['case_status'], inplace=True)
            df_3.drop(columns=['date_reported'], inplace=True)
            df_3.drop(columns=['gender'], inplace=True)

            # Renaming one existing column
            df_3.rename(columns={"age_group": "AgeGroup"}, inplace=True)

            # Creating important feature
            # 0 - Deceased
            # 1 - Alive
            df_3['Survived'] = df_3.apply(lambda row: self.setSurviveStatus(row), axis=1)

            clog.logr('2.df_3' + var + '.csv', Ind, df_3, subdir)

            print()
            print('Projecting Filter sample rows: ')
            print(df_3.head())

            print()
            x_row_2 = df_3.shape[0]
            x_col_2 = df_3.shape[1]

            print('Total Number of Rows: ', x_row_2)
            print('Total Number of columns: ', x_col_2)

            # Let's do some basic checks
            sns.set_style('whitegrid')
            #sns.countplot(x='Survived', hue='Gender', data=df_3, palette='RdBu_r')

            # Fixing Gender Column
            # This will check & indicate yellow for missing entries
            #sns.heatmap(df_3.isnull(), yticklabels=False, cbar=False, cmap='viridis')

            #sex = p.get_dummies(df_3['Gender'], drop_first=True)
            sex = p.get_dummies(df_3['Gender'])
            df_4 = p.concat([df_3, sex], axis=1)

            print('After New addition of columns: ')
            print(df_4.head())

            clog.logr('3.df_4' + var + '.csv', Ind, df_4, subdir)

            # Dropping unwanted columns for our Machine Learning
            df_4.drop(columns=['Gender'], inplace=True)
            df_4.drop(columns=['ActiveCases'], inplace=True)
            df_4.drop(columns=['Male','Other','Transgender'], inplace=True)

            clog.logr('4.df_4_Mod' + var + '.csv', Ind, df_4, subdir)

            # Fixing Spread Columns
            spread = p.get_dummies(df_4['ExposureStatus'], drop_first=True)
            df_5 = p.concat([df_4, spread], axis=1)

            print('After Spread columns:')
            print(df_5.head())

            clog.logr('5.df_5' + var + '.csv', Ind, df_5, subdir)

            # Dropping unwanted columns for our Machine Learning
            df_5.drop(columns=['ExposureStatus'], inplace=True)

            clog.logr('6.df_5_Mod' + var + '.csv', Ind, df_5, subdir)

            # Fixing Age Columns
            df_5['Age'] = df_5.apply(lambda row: self.getAgeFromGroup(row), axis=1)
            df_5.drop(columns=["AgeGroup"], inplace=True)

            clog.logr('7.df_6' + var + '.csv', Ind, df_5, subdir)

            # Fixing Dummy Columns Name
            # Renaming one existing column Travel-Related with Travel_Related
            df_5.rename(columns={"Travel-Related": "TravelRelated"}, inplace=True)

            clog.logr('8.df_7' + var + '.csv', Ind, df_5, subdir)

            # Removing state for temporary basis
            df_5.drop(columns=['State'], inplace=True)
            # df_5.drop(columns=['State','Other','Transgender','Pending','TravelRelated','Male'], inplace=True)

            # Casting this entire dataframe into Integer
            # df_5_temp.apply(p.to_numeric)

            print('Info::')
            print(df_5.info())
            print("*" * 60)
            print(df_5.describe())
            print("*" * 60)

            clog.logr('9.df_8' + var + '.csv', Ind, df_5, subdir)

            print('Intermediate Sample Dataframe for Age::')
            print(df_5.head())

            # Plotting it to Graph
            sns.jointplot(x="Age", y='Survived', data=df_5)
            sns.jointplot(x="Age", y='Survived', data=df_5, kind='kde', color='red')
            plt.xlabel("Age")
            plt.ylabel("Data Point (0 - Died   Vs    1 - Alive)")

            # Another check with Age Group
            sns.countplot(x='Survived', hue='Age', data=df_5, palette='RdBu_r')
            plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
            plt.ylabel("Total No Of Patient")

            df_6 = df_5.drop(columns=['Survived'], axis=1)

            clog.logr('10.df_9' + var + '.csv', Ind, df_6, subdir)

            # Train & Split Data
            x_1 = df_6
            y_1 = df_5['Survived']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(x_1, y_1, test_size=0.3, random_state=101)

            # Importing Model
            from sklearn.linear_model import LogisticRegression

            logmodel = LogisticRegression()
            logmodel.fit(X_train_1, Y_train_1)

            # Adding Predictions to it
            predictions_1 = logmodel.predict(X_test_1)

            from sklearn.metrics import classification_report

            print('Classification Report:: ')
            print(classification_report(Y_test_1, predictions_1))

            from sklearn.metrics import confusion_matrix

            print('Confusion Matrix:: ')
            print(confusion_matrix(Y_test_1, predictions_1))

            # This is required when you are trying to print from a conventional
            # console front-end & not from a Jupyter notebook.
            plt.show()

            return 0

        except Exception  as e:
            x = str(e)
            print('Error : ', x)

            return 1

Key snippets from the above script –

df_2_Mod['ActiveCases'] = df_2_Mod.apply(lambda row: self.setDefaultActiveCases(row), axis=1)
df_2_Mod['ExposureStatus'] = df_2_Mod.apply(lambda row: self.setDefaultExposure(row), axis=1)
df_2_Mod['Gender'] = df_2_Mod.apply(lambda row: self.setGender(row), axis=1)

# Filtering all other records where we don't get any relevant information
# Fetching Data for
df_3 = df_2_Mod[(df_2_Mod['age_group'] != 'Not Reported')]

# Dropping unwanted columns
df_3.drop(columns=['exposure'], inplace=True)
df_3.drop(columns=['case_status'], inplace=True)
df_3.drop(columns=['date_reported'], inplace=True)
df_3.drop(columns=['gender'], inplace=True)

# Renaming one existing column
df_3.rename(columns={"age_group": "AgeGroup"}, inplace=True)

# Creating important feature
# 0 - Deceased
# 1 - Alive
df_3['Survived'] = df_3.apply(lambda row: self.setSurviveStatus(row), axis=1)

The above lines point to the critical transformation areas, where the application applies the defaulting rules (business logic) described earlier.

Let’s look at our sample data at this stage –

6. 4_4_mod

Let’s look into the following part –

# Fixing Spread Columns
spread = p.get_dummies(df_4['ExposureStatus'], drop_first=True)
df_5 = p.concat([df_4, spread], axis=1)

The above lines will transform the data into this –

7. 5_5_Mod

As you can see, we’ve transformed the categorical row values into columns with binary values (one-hot encoding). This transformation is beneficial because the model needs numeric features rather than raw text categories.
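If you want to see the effect of get_dummies in isolation, here is a tiny, self-contained illustration (the values are made up only for this example) –

import pandas as p

# A made-up exposure column, just to show the one-hot transformation
demo = p.DataFrame({'ExposureStatus': ['Community', 'Travel-Related', 'Community', 'Pending']})

# drop_first=True removes one redundant column, because the remaining
# binary columns already determine the dropped category
encoded = p.get_dummies(demo['ExposureStatus'], drop_first=True)
print(p.concat([demo, encoded], axis=1))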

# Plotting it to Graph
sns.jointplot(x="Age", y='Survived', data=df_5)
sns.jointplot(x="Age", y='Survived', data=df_5, kind='kde', color='red')
plt.xlabel("Age")
plt.ylabel("Data Point (0 - Died   Vs    1 - Alive)")

# Another check with Age Group
sns.countplot(x='Survived', hue='Age', data=df_5, palette='RdBu_r')
plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
plt.ylabel("Total No Of Patient")

The above lines process the data & visualize the results.

x_1 = df_6
y_1 = df_5['Survived']

In the above snippet, we’ve assigned the features & target variable for our final logistic regression model.

# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(x_1, y_1, test_size=0.3, random_state=101)

# Importing Model
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train_1, Y_train_1)

In the above snippet, we split the primary data into train & test sets. Once we have those, the application instantiates the logistic regression model &, finally, fits it on the training data.

# Adding Predictions to it
predictions_1 = logmodel.predict(X_test_1)

from sklearn.metrics import classification_report

print('Classification Report:: ')
print(classification_report(Y_test_1, predictions_1))

The above lines finally use the trained model to predict on our test data & print the classification report.
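If you are new to these reports, the confusion matrix can also be unpacked by hand. Here is a minimal sketch with made-up labels, just to show how the 2x2 matrix maps to accuracy, precision & recall –

from sklearn.metrics import confusion_matrix

# Made-up labels, purely to illustrate how the 2x2 matrix is read
y_true = [1, 1, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print('Accuracy :', (tp + tn) / (tp + tn + fp + fn))
print('Precision:', tp / (tp + fp))
print('Recall   :', tp / (tp + fn))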

Let’s see how it runs –

5.1.Run_Windows
5.2. Run_Windows

And, here is the log directory –

4. Logs

For better understanding, I’m clubbing both diagrams in one place; the final outcome is shown as follows –

1. MergeReport

So, from the above picture, we can see that the most vulnerable patients are those aged 80+. The next two categories that also suffered are 70+ & 60+.

Also, we’ve checked Female vs. Male in the following code –

sns.countplot(x='Survived', hue='Female', data=df_5, palette='RdBu_r')
plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
plt.ylabel("Female Vs Male (Including Other Genders)")

And, the analysis represents through this –

8. Female_Male

In this case, you have to consider that the non-Female side includes all the other genders, not just Male. Hence, I believe deaths among females would be higher compared to people who identified themselves as male.

So, finally, we’ve done it.

During this challenging time, I would request you to follow strict health guidelines & stay healthy.

N.B.: All the data that are used here can be found in the public domain. We use this solely for educational purposes. You can find the details here.

Predicting Flipkart business growth factor using Linear-Regression Machine Learning Model

Hi Guys,

Today, we’ll be exploring the potential business growth factors using the “Linear-Regression Machine Learning” model. We’ve prepared a set of dummy data & based on that, we’ll predict.

Let’s explore a few sample data –

1. Sample Data

So, based on this data, we would like to predict YearlyAmountSpent as a function of the following features, i.e. [ Time On App / Time On Website / Flipkart Membership Duration (In Year) ].

You need to install the following packages –

pip install pandas

pip install matplotlib

pip install scikit-learn

We’ll be discussing only the main calling script & the class script. However, we’ll post the parameter file without discussing it. And we won’t discuss clsL.py, as we’ve already covered that in our previous post.

1. clsConfig.py (This script contains all the parameter details.)

################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### Machine-Learning. Application will     ####
#### process these information & perform    ####
#### various analysis on Linear-Regression. ####
################################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'FILE_NAME': Curr_Path + sep + 'Data' + sep + 'FlipkartCustomers.csv',
        'SRC_PATH': Curr_Path + sep + 'Data' + sep,
        'APP_DESC_1': 'IBM Watson Language Understand!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path
    }

2. clsLinearRegression.py (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main scripts for Linear   ####
#### Regression.                          ####
##############################################

import pandas as p
import numpy as np
import regex as re

import matplotlib.pyplot as plt
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsLinearRegression:
    def __init__(self):
        self.fileName =  cf.config['FILE_NAME']

    def predictResult(self):
        try:

            inputFileName = self.fileName

            # Reading from Input File
            df = p.read_csv(inputFileName)

            print()
            print('Projecting sample rows: ')
            print(df.head())

            print()
            x_row = df.shape[0]
            x_col = df.shape[1]

            print('Total Number of Rows: ', x_row)
            print('Total Number of columns: ', x_col)

            # Adding Features
            x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

            # Target Variable - Trying to predict
            y = df['YearlyAmountSpent']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

            # Importing Model
            from sklearn.linear_model import LinearRegression

            # Creating an Instance
            lm = LinearRegression()

            # Train or Fit my model on Training Data
            lm.fit(X_train, Y_train)

            # Creating a prediction value
            flipKartSalePrediction = lm.predict(X_test)

            # Creating a scatter plot based on Actual Value & Predicted Value
            plt.scatter(Y_test, flipKartSalePrediction)

            # Adding meaningful Label
            plt.xlabel('Actual Values')
            plt.ylabel('Predicted Values')

            # Checking Individual Metrics
            from sklearn import metrics

            print()
            mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
            print('Mean Absolute Error (MAE): ', mae_val)

            mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
            print('Mean Square Error (MSE): ', mse_val)

            rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
            print('Root Mean Square Error (RMSE): ', rmse_val)

            print()

            # Check Variance Score - R^2 Value
            print('Variance Score:')
            var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
            print('Our Model is', var_score, '% accurate. ')
            print()

            # Finding Coefficients on X_train.columns
            print()
            print('Finding Coefficients: ')

            cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])
            print('Printing All the Factors: ')
            print(cedf)

            print()

            # Getting the Max Value from it
            cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

            # Filtering the max Value to identify the biggest Business factor
            dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

            # Dropping the derived column
            dfMax.drop(columns=['MaxFactorForBusiness'], inplace=True)
            dfMax = dfMax.reset_index()

            print(dfMax)

            # Extracting Actual Business Factor from Pandas dataframe
            str_factor_temp = str(dfMax.iloc[0]['index'])
            str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
            str_value = str(round(float(dfMax.iloc[0]['Coefficient']),2))

            print()
            print('*' * 80)
            print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
            print('*' * 80)
            print()

            # This is required when you are trying to print from a conventional
            # console front-end & not from a Jupyter notebook.
            plt.show()

            return 0

        except Exception  as e:
            x = str(e)
            print('Error : ', x)

            return 1

Key lines from the above snippet –

# Adding Features
x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

Our application creates a subset of the main dataframe, which contains all the features.

# Target Variable - Trying to predict
y = df['YearlyAmountSpent']

Now, the application sets the target variable as ‘y’.

# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

As per “Supervised Learning,” our application splits the dataset into two subsets: one to train the model & another to test the final model. However, for a larger dataset, you can divide the data into three sets (train, validation & test) to get more robust performance statistics. In our case, we don’t need that, as the data volume is quite small.
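For reference, a three-way split can be done with two calls to train_test_split. The sketch below reuses the x & y variables from the script above; the 60/20/20 ratio is just an example –

from sklearn.model_selection import train_test_split

# First carve out 20% as the final test set ...
X_rest, X_test_f, Y_rest, Y_test_f = train_test_split(x, y, test_size=0.2, random_state=101)

# ... then split the remainder into train & validation
# (0.25 of the remaining 80% gives a 60/20/20 overall split)
X_train_f, X_val_f, Y_train_f, Y_val_f = train_test_split(X_rest, Y_rest, test_size=0.25, random_state=101)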

# Train or Fit my model on Training Data
lm.fit(X_train, Y_train)

Our application now trains (fits) the model on the training data.
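Once the fit is complete, the learned parameters can be inspected directly on the estimator. A small sketch, assuming the lm & x names from the script above –

# The learned parameters live on the fitted estimator
print('Intercept:', lm.intercept_)

for feature, coefficient in zip(x.columns, lm.coef_):
    print(feature, '->', round(coefficient, 4))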

# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)

Our application plots the predicted values against the actual values in a scatterplot.
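One small optional refinement (not part of the original script) is to overlay a 45-degree reference line, so a perfect prediction would sit exactly on it –

import numpy as np

# Optional: y = x reference line on the Actual-vs-Predicted scatter plot.
# Points close to this line mean predictions close to the actual values.
refLine = np.linspace(Y_test.min(), Y_test.max(), 100)
plt.plot(refLine, refLine, color='red', linestyle='--', label='Perfect prediction')
plt.legend()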

Also, the following concepts are captured by our program. For more details, I’ve provided the external links for your reference –

  1. Mean Absolute Error (MAE)
  2. Mean Square Error (MSE)
  3. Root Mean Square Error (RMSE)

And the implementation is shown as –

mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
print('Mean Absolute Error (MAE): ', mae_val)

mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
print('Mean Square Error (MSE): ', mse_val)

rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
print('Root Mean Square Error (RMSE): ', rmse_val)
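To make the formulas behind these numbers concrete, here is a short sketch computing the same three metrics by hand with NumPy, reusing Y_test & flipKartSalePrediction from the script above –

import numpy as np

errors = np.asarray(Y_test) - np.asarray(flipKartSalePrediction)

mae_manual = np.mean(np.abs(errors))      # average absolute deviation
mse_manual = np.mean(errors ** 2)         # average squared deviation
rmse_manual = np.sqrt(mse_manual)         # square root brings it back to the original unit

print('Manual MAE :', mae_manual)
print('Manual MSE :', mse_manual)
print('Manual RMSE:', rmse_manual)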

At this moment, we would like to check the credibility of our model using the explained variance score, as follows –

var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
print('Our Model is', var_score, '% accurate. ')

Finally, we extract the coefficients to find out which particular feature will lead Flipkart to better sales & growth, by taking the maximum coefficient value among all the features, as shown below –

cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])

# Getting the Max Value from it
cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

# Filtering the max Value to identify the biggest Business factor
dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

# Dropping the derived column
dfMax.drop(columns=['MaxFactorForBusiness'], inplace=True)
dfMax = dfMax.reset_index()

Note that we’ve used a regular expression to split the camel-case column name of the winning feature & present it as a more readable label, without changing the underlying column name.

# Extracting Actual Business Factor from Pandas dataframe
str_factor_temp = str(dfMax.iloc[0]['index'])
str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
str_value = str(round(float(dfMax.iloc[0]['Coefficient']),2))

print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
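To see the regular-expression step in isolation, here is a tiny standalone demo; it behaves the same with Python’s built-in re module –

import re

# Insert a space wherever a lowercase letter is immediately followed by an uppercase one
print(re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", "FlipkartMembershipInYear"))
# Expected output: Flipkart Membership In Year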

3. callLinear.py (This is the first calling script.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsLinearRegression as cw

# Disabling Warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# Lookup functions from
# Azure cloud SQL DB

var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_1 = 0
        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'MachineLearning_LinearRegression.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']
        curr_ver =datetime.datetime.now().strftime("%Y-%m-%d")

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        print('Machine Learning - Linear Regression Prediction : ')
        print('-' * 200)

        # Create the instance of the Linear-Regression Class
        x2 = cw.clsLinearRegression()

        ret = x2.predictResult()

        if ret == 0:
            print('Successful Linear-Regression Prediction Generated!')
        else:
            print('Failed to generate Linear-Regression Prediction!')

        print("-" * 200)
        print()

        print('Finding Analysis points..')
        print("*" * 200)
        logging.info('Finding Analysis points..')
        logging.info(tmpR0)


        tmpR10 = 'End Time: ' + str(var)
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError as e:
        print(str(e))
        logging.info(str(e))

    except Exception as e:
        print("Top level Error: args:{0}, message{1}".format(e.args, e.message))

if __name__ == "__main__":
    main()

Key snippet from the above script –

# Create the instance of the Linear-Regression
x2 = cw.clsLinearRegression()

ret = x2.predictResult()

In the above snippet, our application first creates an instance of the main class & then invokes the “predictResult” method.

Let’s run our application –

Step 1:

First, if it runs successfully, the application will fetch the following sample rows from our source file –

2. Run_1

Step 2:

Then, it will create the following scatterplot by executing this snippet –

# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)
3. Run_2

Note that our model is reasonably accurate; the predicted values track the actual values closely along the diagonal.

Step 3:

Finally, it successfully projects the critical feature, as shown below –

4. Run_3

From the above picture, you can see that our model is pretty accurate (approx. 89% variance score).

Also, the highlighted red square identifies the key features & their scores &, finally, the winning feature is marked in green.

So, as per that, we’ve come to the conclusion that Flipkart’s business growth depends on the tenure of its subscribers, i.e., older members are prone to buy more than newer members.

Let’s look into our directory structure –

5. Win_Dir

So, we’ve done it.

I’ll be posting another new post in the coming days. Till then, Happy Avenging! 😀

Note: All the data posted here is representational, available over the internet, & for educational purposes only.