Hi Guys,
Today, we’ll explore a potential business growth factor using a linear regression machine learning model. We’ve prepared a set of dummy data & we’ll base our predictions on it.
Let’s look at a few sample rows –

So, based on this data, we would like to predict YearlyAmountSpent & find out which of the following features influences it the most, i.e. [ Time On App / Time On Website / Flipkart Membership Duration (In Year) ].
You need to install the following packages (the scikit-learn package is published on PyPI as scikit-learn, and the scripts also import numpy & regex) –
pip install pandas
pip install numpy
pip install matplotlib
pip install scikit-learn
pip install regex
We’ll discuss only the main calling script & the class script. We’ll post the parameter file without discussing it. And, we won’t discuss clsL.py, as we’ve already covered it in our previous post.
1. clsConfig.py (This script contains all the parameter details.)
################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### Machine-Learning. Application will     ####
#### process these information & perform    ####
#### various analysis on Linear-Regression. ####
################################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'FILE_NAME': Curr_Path + sep + 'Data' + sep + 'FlipkartCustomers.csv',
        'SRC_PATH': Curr_Path + sep + 'Data' + sep,
        'APP_DESC_1': 'IBM Watson Language Understand!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path
    }
2. clsLinearRegression.py (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main scripts for Linear   ####
#### Regression.                          ####
##############################################

import pandas as p
import numpy as np
import regex as re
import matplotlib.pyplot as plt
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsLinearRegression:
    def __init__(self):
        self.fileName = cf.config['FILE_NAME']

    def predictResult(self):
        try:
            inputFileName = self.fileName

            # Reading from Input File
            df = p.read_csv(inputFileName)

            print()
            print('Projecting sample rows: ')
            print(df.head())

            print()
            x_row = df.shape[0]
            x_col = df.shape[1]

            print('Total Number of Rows: ', x_row)
            print('Total Number of columns: ', x_col)

            # Adding Features
            x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

            # Target Variable - Trying to predict
            y = df['YearlyAmountSpent']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

            # Importing Model
            from sklearn.linear_model import LinearRegression

            # Creating an Instance
            lm = LinearRegression()

            # Train or Fit my model on Training Data
            lm.fit(X_train, Y_train)

            # Creating a prediction value
            flipKartSalePrediction = lm.predict(X_test)

            # Creating a scatter plot based on Actual Value & Predicted Value
            plt.scatter(Y_test, flipKartSalePrediction)

            # Adding meaningful Label
            plt.xlabel('Actual Values')
            plt.ylabel('Predicted Values')

            # Checking Individual Metrics
            from sklearn import metrics

            print()
            mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
            print('Mean Absolute Error (MAE): ', mae_val)

            mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
            print('Mean Square Error (MSE): ', mse_val)

            rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
            print('Square root Mean Square Error (RMSE): ', rmse_val)

            print()

            # Check Variance Score (explained variance, closely related to R^2)
            print('Variance Score:')
            var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
            print('Our Model is', var_score, '% accurate. ')
            print()

            # Finding Coefficient on X_train.columns
            print()
            print('Finding Coefficient: ')

            cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])
            print('Printing the All the Factors: ')
            print(cedf)

            print()

            # Getting the Max Value from it
            cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

            # Filtering the max Value to identify the biggest Business factor
            dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

            # Dropping the derived column (no inplace, to avoid modifying a slice of cedf)
            dfMax = dfMax.drop(columns=['MaxFactorForBusiness'])
            dfMax = dfMax.reset_index()

            print(dfMax)

            # Extracting Actual Business Factor from Pandas dataframe
            str_factor_temp = str(dfMax.iloc[0]['index'])
            # Raw strings keep the backslashes in \g<1> intact
            str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
            str_value = str(round(float(dfMax.iloc[0]['Coefficient']), 2))

            print()
            print('*' * 80)
            print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
            print('*' * 80)
            print()

            # This is required when you are trying to print from a conventional
            # front end & not using Jupyter notebook.
            plt.show()

            return 0

        except Exception as e:
            x = str(e)
            print('Error : ', x)

            return 1
Key lines from the above snippet –
# Adding Features
x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]
Our application creates a subset of the main DataFrame, which contains all the features.
# Target Variable - Trying to predict
y = df['YearlyAmountSpent']
Now, the application assigns the target variable to ‘y.’
# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)
As is usual in “Supervised Learning,” our application splits the dataset into two subsets. One is used to train the model & the other to test the final model. For a large dataset, you can instead divide the data into three sets, with the extra set used to check performance while tuning. In our case, we don’t need that, as this dataset is quite small.
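The three-way split mentioned above can be sketched by calling train_test_split twice. This is only an illustration under stated assumptions: the data is random stand-in data rather than the post's CSV, and the 60/20/20 proportions are a hypothetical choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features (not the post's CSV).
rng = np.random.default_rng(101)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# First, carve out 20% of the rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

# Then split the remaining 80% into 75/25, which gives
# 60% train and 20% validation of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=101)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

The validation set would be used while comparing or tuning models, so the test set stays untouched until the very end.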
# Train or Fit my model on Training Data
lm.fit(X_train, Y_train)
Our application now trains (fits) the model on the training data.
# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)
Our application plots the predicted values against the actual values in a scatterplot graph.
Also, our program computes the following metrics. For more details, I’ve provided the external link for your reference –
- Mean Absolute Error (MAE)
- Mean Square Error (MSE)
- Root Mean Square Error (RMSE)
And, the implementation is shown below –
mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
print('Mean Absolute Error (MAE): ', mae_val)

mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
print('Mean Square Error (MSE): ', mse_val)

rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
print('Square Root Mean Square Error (RMSE): ', rmse_val)
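To make the three error metrics concrete, here is a small worked example that computes the mean absolute error, mean squared error & root mean squared error by hand with NumPy. The actual & predicted values are hypothetical numbers chosen only for illustration; they are not taken from the post's dataset.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

errors = actual - predicted        # [-10, 10, -30, 0]

# Mean absolute error: average size of the errors, ignoring sign.
mae = np.mean(np.abs(errors))      # (10 + 10 + 30 + 0) / 4 = 12.5

# Mean squared error: average of the squared errors.
mse = np.mean(errors ** 2)         # (100 + 100 + 900 + 0) / 4 = 275.0

# Root mean squared error: square root of MSE, in the target's units.
rmse = np.sqrt(mse)                # about 16.58

print(mae, mse, rmse)
```

Because the errors are squared before averaging, the single 30-unit miss dominates MSE & RMSE far more than it dominates the mean absolute error.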
At this point, we would like to check the credibility of our model by using the explained variance score, as follows –
var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
print('Our Model is', var_score, '% accurate. ')
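The explained variance score is closely related to, but not the same as, the R^2 score: explained variance is 1 - Var(actual - predicted) / Var(actual), while R^2 is 1 - SS_res / SS_tot. The two agree only when the errors average to zero. A minimal sketch of both, computed by hand with NumPy on hypothetical values (not the post's data):

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only.
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 330.0, 400.0])

residuals = actual - predicted

# Explained variance: ignores a constant offset (bias) in the errors.
explained_variance = 1 - np.var(residuals) / np.var(actual)

# R^2: penalizes the bias as well, so it is never higher than
# the explained variance on the same data.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(explained_variance, 4), round(r2, 4))
```

In scikit-learn, the corresponding functions are metrics.explained_variance_score & metrics.r2_score.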
Finally, we extract the coefficients to find out which particular feature will lead Flipkart to better sales & growth, by taking the maximum coefficient value among all the features, as shown below –
cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])

# Getting the Max Value from it
cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

# Filtering the max Value to identify the biggest Business factor
dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

# Dropping the derived column (no inplace, to avoid modifying a slice of cedf)
dfMax = dfMax.drop(columns=['MaxFactorForBusiness'])
dfMax = dfMax.reset_index()
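The same "pick the feature with the largest coefficient" step can also be written more compactly with pandas' idxmax. This is a sketch of an alternative, not the post's method, and the coefficient values below are hypothetical placeholders rather than the model's real output.

```python
import pandas as pd

# Hypothetical coefficients, for illustration only (not the model's output).
coef = pd.Series(
    {
        'TimeOnApp': 38.5,
        'TimeOnWebsite': 0.4,
        'FlipkartMembershipInYear': 61.2,
    },
    name='Coefficient',
)

# idxmax returns the index label of the largest value,
# so no helper column or filtering step is needed.
top_feature = coef.idxmax()
top_value = coef.max()

print(top_feature, top_value)
```

With a fitted model, the Series could be built as pd.Series(lm.coef_, index=x.columns), assuming lm & x are defined as in the script above.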
Note that we’ve used a regular expression to split the camel-case column name of our feature & present it with a much more meaningful name without changing the column name itself.
# Extracting Actual Business Factor from Pandas dataframe
str_factor_temp = str(dfMax.iloc[0]['index'])
# Raw strings keep the backslashes in \g<1> intact
str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
str_value = str(round(float(dfMax.iloc[0]['Coefficient']), 2))

print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
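To see what the regular expression does in isolation, here is a small sketch using Python's standard re module (the script itself imports the third-party regex package, which accepts the same pattern). The helper name split_camel_case is hypothetical & introduced only for this example.

```python
import re

def split_camel_case(name: str) -> str:
    """Insert a space wherever a lowercase letter is followed by an uppercase one."""
    # Group 1 is the lowercase letter, group 2 the uppercase letter;
    # \g<1> and \g<2> put them back with a space in between.
    return re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", name)

print(split_camel_case("FlipkartMembershipInYear"))  # Flipkart Membership In Year
print(split_camel_case("TimeOnApp"))                 # Time On App
```

The raw-string prefix (r"...") matters here: without it, Python treats \g as an invalid escape sequence in an ordinary string literal & warns about it.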
3. callLinear.py (This is the first calling script.)
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsLinearRegression as cw

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# Start timestamp, captured once when the script loads
var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_1 = 0
        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'MachineLearning_LinearRegression.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']
        curr_ver = datetime.datetime.now().strftime("%Y-%m-%d")

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        print('Machine Learning - Linear Regression Prediction : ')
        print('-' * 200)

        # Create the instance of the Linear-Regression Class
        x2 = cw.clsLinearRegression()

        ret = x2.predictResult()

        if ret == 0:
            print('Successful Linear-Regression Prediction Generated!')
        else:
            print('Failed to generate Linear-Regression Prediction!')

        print("-" * 200)
        print()

        print('Finding Analysis points..')
        print("*" * 200)
        logging.info('Finding Analysis points..')
        logging.info(tmpR0)

        # Take a fresh timestamp so the end time differs from the start time
        tmpR10 = 'End Time: ' + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError as e:
        print(str(e))
        logging.info(str(e))

    except Exception as e:
        # Python 3 exceptions have no .message attribute, so use str(e)
        print("Top level Error: args:{0}, message:{1}".format(e.args, str(e)))

if __name__ == "__main__":
    main()
Key snippet from the above script –
# Create the instance of the Linear-Regression Class
x2 = cw.clsLinearRegression()

ret = x2.predictResult()
In the above snippet, our application first creates an instance of the main class & then invokes the “predictResult” method.
Let’s run our application –
Step 1:
First, if it reads the source file successfully, the application will print the following sample rows –

Step 2:
Then, it will create the following scatterplot by executing this snippet –
# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)

Note that our model is pretty accurate, i.e., the predicted values track the actual values closely.
Step 3:
Finally, it successfully projects the critical feature, as shown below –

From the above picture, you can see that our model is pretty accurate (an explained variance score of approx. 89%).
Also, the red square highlights the key features & their coefficients, & the winning feature is marked in green.
So, we’ve come to the conclusion that Flipkart’s business growth depends on the tenure of its subscribers, i.e., long-standing members tend to buy more than newer members.
Let’s look into our directory structure –

So, we’ve done it.
I’ll be posting another new post in the coming days. Till then, Happy Avenging! 😀
Note: All the data posted here are representational data & available over the internet & for educational purpose only.