Predicting Flipkart business growth factor using Linear-Regression Machine Learning Model

Hi Guys,

Today, we’ll explore a potential business growth factor using a Linear-Regression machine-learning model. We’ve prepared a set of dummy data, and based on that, we’ll make our prediction.

Let’s explore a few sample records –

1. Sample Data

So, based on this data, we would like to predict YearlyAmountSpent as a function of the following features, i.e. [ Time On App / Time On Website / Flipkart Membership Duration (In Year) ].
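For reference, here is a minimal sketch of what the dummy source file could look like. Only the column names are taken from the script below; the values themselves are purely illustrative assumptions –

import pandas as p

# Purely illustrative dummy rows; only the column names are assumed from the script below
sample = p.DataFrame({
    'TimeOnApp': [12.4, 11.2, 13.1],
    'TimeOnWebsite': [36.5, 38.1, 35.9],
    'FlipkartMembershipInYear': [3, 1, 5],
    'YearlyAmountSpent': [510.25, 430.10, 640.75]
})

# Writing it out the same way the application expects to read it back
sample.to_csv('FlipkartCustomers.csv', index=False)
print(sample.head())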

You need to install the following packages –

pip install pandas

pip install matplotlib

pip install scikit-learn

We’ll discuss only the main calling script & the class script. However, we’ll post the parameter file without discussing it. And, we won’t discuss clsL.py as we’ve already covered that in our previous post.

1. clsConfig.py (This script contains all the parameter details.)

################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### Machine-Learning. Application will     ####
#### process these information & perform    ####
#### various analysis on Linear-Regression. ####
################################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'FILE_NAME': Curr_Path + sep + 'Data' + sep + 'FlipkartCustomers.csv',
        'SRC_PATH': Curr_Path + sep + 'Data' + sep,
        'APP_DESC_1': 'IBM Watson Language Understand!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path
    }

2. clsLinearRegression.py (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main scripts for Linear   ####
#### Regression.                          ####
##############################################

import pandas as p
import numpy as np
import regex as re

import matplotlib.pyplot as plt
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsLinearRegression:
    def __init__(self):
        self.fileName =  cf.config['FILE_NAME']

    def predictResult(self):
        try:

            inputFileName = self.fileName

            # Reading from Input File
            df = p.read_csv(inputFileName)

            print()
            print('Projecting sample rows: ')
            print(df.head())

            print()
            x_row = df.shape[0]
            x_col = df.shape[1]

            print('Total Number of Rows: ', x_row)
            print('Total Number of columns: ', x_col)

            # Adding Features
            x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

            # Target Variable - Trying to predict
            y = df['YearlyAmountSpent']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

            # Importing Model
            from sklearn.linear_model import LinearRegression

            # Creating an Instance
            lm = LinearRegression()

            # Train or Fit my model on Training Data
            lm.fit(X_train, Y_train)

            # Creating a prediction value
            flipKartSalePrediction = lm.predict(X_test)

            # Creating a scatter plot based on Actual Value & Predicted Value
            plt.scatter(Y_test, flipKartSalePrediction)

            # Adding meaningful Label
            plt.xlabel('Actual Values')
            plt.ylabel('Predicted Values')

            # Checking Individual Metrics
            from sklearn import metrics

            print()
            mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
            print('Mean Absolute Error (MAE): ', mae_val)

            mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
            print('Mean Square Error (MSE): ', mse_val)

            rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
            print('Root Mean Square Error (RMSE): ', rmse_val)

            print()

            # Check Variance Score - R^2 Value
            print('Variance Score:')
            var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
            print('Our Model is', var_score, '% accurate. ')
            print()

            # Finding the Coefficient for each feature in X_train.columns
            print()
            print('Finding Coefficients: ')

            cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])
            print('Printing all the Factors: ')
            print(cedf)

            print()

            # Getting the Max Value from it
            cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

            # Filtering the max Value to identify the biggest Business factor
            dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

            # Dropping the derived column
            dfMax.drop(columns=['MaxFactorForBusiness'], inplace=True)
            dfMax = dfMax.reset_index()

            print(dfMax)

            # Extracting Actual Business Factor from Pandas dataframe
            str_factor_temp = str(dfMax.iloc[0]['index'])
            str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
            str_value = str(round(float(dfMax.iloc[0]['Coefficient']),2))

            print()
            print('*' * 80)
            print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
            print('*' * 80)
            print()

            # This is require when you are trying to print from conventional
            # front & not using Jupyter notebook.
            plt.show()

            return 0

        except Exception  as e:
            x = str(e)
            print('Error : ', x)

            return 1

Key lines from the above snippet –

# Adding Features
x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

Our application creates a subset of the main dataframe, which contains all the features.

# Target Variable - Trying to predict
y = df['YearlyAmountSpent']

Now, the application sets the target variable ‘y’ – the value we are trying to predict.

# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

As per supervised-learning practice, our application splits the dataset into two subsets: one to train the model & another to test the final model. For a larger dataset, you can divide the data into three sets (train, validation & test) to measure performance more reliably, as sketched below. In our case, we don’t need that as the data volume is quite small.
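A minimal sketch of such a three-way split, assuming the same x & y frames defined above and a 60/20/20 ratio (the ratio itself is only an assumption, not part of the original script) –

from sklearn.model_selection import train_test_split

# x & y are the feature / target frames created earlier in predictResult()
# First carve out 20% of the data as the final test set ...
X_temp, X_test, Y_temp, Y_test = train_test_split(x, y, test_size=0.2, random_state=101)

# ... then split the remainder again: 0.25 of the remaining 80% = 20% validation, 60% training
X_train, X_val, Y_train, Y_val = train_test_split(X_temp, Y_temp, test_size=0.25, random_state=101)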

# Train or Fit my model on Training Data
lm.fit(X_train, Y_train)

Our application now trains/fits the model on the training data.

# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)

Our application plots the predicted values against the actual values in a scatter plot.

Also, the following metrics are captured by our program. For more details, I’ve provided external links for your reference –

  1. Mean Absolute Error (MAE)
  2. Mean Square Error (MSE)
  3. Root Mean Square Error (RMSE)

And, the implementation is shown as –

mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
print('Mean Absolute Error (MAE): ', mae_val)

mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
print('Mean Square Error (MSE): ', mse_val)

rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
print('Root Mean Square Error (RMSE): ', rmse_val)
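For intuition, the same three metrics can also be computed by hand with NumPy. This is only a sketch of what the sklearn calls above are doing, using the same actual & predicted arrays –

import numpy as np

# Y_test & flipKartSalePrediction are the actual & predicted values from above
err = np.asarray(Y_test) - np.asarray(flipKartSalePrediction)

mae_manual  = np.mean(np.abs(err))   # Mean Absolute Error
mse_manual  = np.mean(err ** 2)      # Mean Square Error
rmse_manual = np.sqrt(mse_manual)    # Root Mean Square Error

print(mae_manual, mse_manual, rmse_manual)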

At this point, we would like to check the credibility of our model by using the variance score as follows –

var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
print('Our Model is', var_score, '% accurate. ')
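Note that the explained variance score is closely related to, but not identical to, the R² value. If you want R² itself, sklearn exposes it separately; here is a small sketch, assuming the same test & prediction arrays –

from sklearn import metrics

# R-Squared on the same hold-out data
r2_val = metrics.r2_score(Y_test, flipKartSalePrediction)
print('R-Squared Value: ', round(r2_val * 100, 2), '%')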

Finally, we extract the coefficients to find out which particular feature will lead Flipkart to better sales & growth, by taking the maximum coefficient value among all the features, as shown below –

cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])

# Getting the Max Value from it
cedf['MaxFactorForBusiness'] = cedf['Coefficient'].max()

# Filtering the max Value to identify the biggest Business factor
dfMax = cedf[(cedf['MaxFactorForBusiness'] == cedf['Coefficient'])]

# Dropping the derived column
dfMax.drop(columns=['MaxFactorForBusiness'], inplace=True)
dfMax = dfMax.reset_index()

Note that we’ve used a regular expression to split the camel-case column name of the winning feature & present it with a more readable name, without changing the underlying column name.
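For example, the same pattern splits a camel-case feature name like this (a standalone sketch using the standard re module) –

import re

# Inserting a space between a lower-case letter & the upper-case letter that follows it
str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", 'FlipkartMembershipInYear')
print(str_factor)   # Flipkart Membership In Year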

# Extracting Actual Business Factor from Pandas dataframe
str_factor_temp = str(dfMax.iloc[0]['index'])
str_factor = re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", str_factor_temp)
str_value = str(round(float(dfMax.iloc[0]['Coefficient']),2))

print('Major Business Activity - (', str_factor, ') - ', str_value, '%')

3. callLinear.py (This is the first calling script.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsLinearRegression as cw

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# Lookup functions from
# Azure cloud SQL DB

var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_1 = 0
        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'MachineLearning_LinearRegression.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']
        curr_ver =datetime.datetime.now().strftime("%Y-%m-%d")

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        print('Machine Learning - Linear Regression Prediction : ')
        print('-' * 200)

        # Create the instance of the Linear-Regression Class
        x2 = cw.clsLinearRegression()

        ret = x2.predictResult()

        if ret == 0:
            print('Successful Linear-Regression Prediction Generated!')
        else:
            print('Failed to generate Linear-Regression Prediction!')

        print("-" * 200)
        print()

        print('Finding Analysis points..')
        print("*" * 200)
        logging.info('Finding Analysis points..')
        logging.info(tmpR0)


        tmpR10 = 'End Time: ' + str(var)
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError as e:
        print(str(e))
        logging.info(str(e))

    except Exception as e:
        print("Top level Error: args:{0}, message{1}".format(e.args, e.message))

if __name__ == "__main__":
    main()

Key snippet from the above script –

# Create the instance of the Linear-Regression
x2 = cw.clsLinearRegression()

ret = x2.predictResult()

In the above snippet, our application first creates an instance of the main class & then invokes the “predictResult” method.

Let’s run our application –

Step 1:

First, the application will fetch & display the following sample rows from our source file, if successful.

2. Run_1

Step 2:

Then, it will create the following scatter plot by executing the following snippet –

# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)

3. Run_2

Note that the predicted values track the actual values closely, which indicates our model has a fairly balanced success rate.

Step 3:

Finally, it successfully projects the critical feature, as shown below –

4. Run_3

From the above picture, you can see that our model is pretty accurate (89% approx).

Also, the highlighted red square identifies the key features & their coefficient values, & finally, the winning feature is marked in green.

So, as per that, we’ve come to the conclusion that Flipkart’s business growth depends on the tenure of its subscribers, i.e., older members are prone to buy more than newer members.

Let’s look into our directory structure –

5. Win_Dir

So, we’ve done it.

I’ll be posting another new post in the coming days. Till then, Happy Avenging! 😀

Note: All the data posted here is representational, available over the internet, & for educational purposes only.

Password Validation Using Regular Expression In Teradata 14 & 15

Today, we’ll look at a new area where we can use a regular expression to achieve password validation without involving any kind of macro or stored procedure.

 

Let’s consider the following conditions to be implemented –

 

1. The password should be between 6 & 10 characters long.

2. At least one character should be a digit.

3. At least one character should be an upper-case letter.

4. There should be at least one special character.

 

Let’s check the Query & Output –

 

select seq_no,
       passwd,
       regexp_similar(passwd,'^(?=^([[:graph:]]{6,10})$)(?=.*([[:upper:]]{1,}))(?=.*([[:digit:]]{1,})).*$') as reg_test
from scott.login_det
order by 1;


SEQ_NO	PASSWD	 REG_TEST
-----   -------  --------------
1	hoti         0
2	hotimla	     0
3	hotImla	     0
4	hot@imla     0
5	hoT@imla     0
6	hoT@iml9a    1
7	hoT@iml9a66  0

 

Similarly, you can add conditions for lower-case characters & for the special-character rule if you want to make it stricter, as sketched below.
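A minimal sketch of such an extended pattern, assuming the same scott.login_det table – the extra look-aheads for [[:lower:]] & [[:punct:]] enforce the lower-case & special-character rules –

select seq_no,
       passwd,
       regexp_similar(passwd,'^(?=^([[:graph:]]{6,10})$)(?=.*([[:upper:]]{1,}))(?=.*([[:lower:]]{1,}))(?=.*([[:digit:]]{1,}))(?=.*([[:punct:]]{1,})).*$') as reg_test
from scott.login_det
order by 1;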

 

Hope this will give you another way to implement the same logic. 🙂

Reverse String using Regexp Functions

Hi,

 

Today, we’ll look at another area where we can use a regular expression to reverse a string without involving any kind of macro or stored procedure.

 

On many occasions, we may have to parse various kinds of strings. Assume that we need to read a string in reverse order. Before TD 14.0, there was no easy method to implement this. Of course, this new method also has its limits: it can only reverse a small, fixed number of characters. I have checked it with 9 characters, & it supports that.

 

So, if your strings have a known, limited length & you are using TD 14.0, you may try this solution. It will be handy.

 

Let’s check the Query & Output –

 

SELECT SEQ_NO,
                 SRC_STR,
                 regexp_replace(SRC_STR,'([[:alnum:]]{1,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})','\9\8\7\6\5\4\3\2\1') AS REV_SRC_STR
FROM WM_CONCAT_TAB
ORDER BY 1;


SEQ_NO  SRC_STR   REV_SRC_STR
------  --------  ------------
1       BIRESWAR  RAWSERIB
1       TRIDIB    BIDIRT
1       SUJAY     YAJUS
1       ANUPAM    MAPUNA
1       RAM       MAR
2       PAPU      UPAP
2       SAYAN     NAYAS
2       TUNKAI    IAKNUT
2       BABU      UBAB
3       ATBIS     SIBTA
3       SAPMUNDA  ADNUMPAS
3       IK        KI

 

Hope this will give you another way to implement the same logic. 🙂

String Manipulation Advanced Using Teradata 14.0 Regular Expression

Today, I’ll show a couple of very useful functions or pieces of logic implemented in Teradata using its regular-expression support.

There are two very popular demands that come from developers across different databases, regarding the following two cases –

1. How to split comma-separated values into individual rows

2. How to bind separate values into one row (just the opposite of case 1)

The 2nd option is in especially high demand, as cross-platform database professionals, particularly Oracle developers, look for this kind of implementation because Oracle has built-in functions to do the same: LISTAGG, WM_CONCAT & GROUP_CONCAT.

Let’s check the solution –

Case 1,

Let’s create the table & prepare some data –

 

CREATE MULTISET TABLE ETL_DATA.PARSE_STR
  (
     SEQ_NO       INTEGER,
     SRC_STR     VARCHAR(70)
  );
 
CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:01.864

 

Let’s insert some data –

 

INSERT INTO ETL_DATA.PARSE_STR VALUES(1,'RAM,TRIDIB,ANUPAM,BIRESWAR,SUJAY')
;INSERT INTO ETL_DATA.PARSE_STR VALUES(2,'TUNKAI,SAYAN,BABU,PAPU')
;INSERT INTO ETL_DATA.PARSE_STR VALUES(3,'IK,ATBIS,SAPMUNDA');

 

Let’s check the value –

 

SEQ_NO          SRC_STR
------  ----------------------------------
    1   RAM,TRIDIB,ANUPAM,BIRESWAR,SUJAY
    2   TUNKAI,SAYAN,BABU,PAPU
    3   IK,ATBIS,SAPMUNDA

 

Fine. Now our objective is to split these comma-separated values into separate rows. We’ll use sys_calendar.calendar purely as a number generator to drive the occurrence argument of regexp_substr.

 

SELECT b.SEQ_NO,
       regexp_substr(b.SRC_STR,'[^,]+',1,day_of_calendar) AS SRC_STR
FROM sys_calendar.calendar ,
     PARSE_STR b
WHERE day_of_calendar BETWEEN 1 AND  (LENGTH(b.SRC_STR) - LENGTH(regexp_replace(b.SRC_STR,'[^A-Z]+','',1,0,'i'))+1 )
ORDER BY 1,2;

 

And, let’s check the output –

 

SEQ_NO  SRC_STR
-----   ----------------------
1       ANUPAM
1       BIRESWAR
1       RAM
1       SUJAY
1       TRIDIB
2       BABU
2       PAPU
2       SAYAN
2       TUNKAI
3       ATBIS
3       IK
3       SAPMUNDA

 

Gr8! I guess, result is coming as per my expectation. 🙂

 

Case 2 (Substitute for LISTAGG, WM_CONCAT, GROUP_CONCAT in Oracle),

We’ll do this in two small steps for better understanding & performance.

First, let us create another table –

 

CREATE MULTISET TABLE ETL_DATA.WM_CONCAT_TAB
   (
      SEQ_NO   INTEGER,
      SRC_STR VARCHAR(20)
   );
    
CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:01.230

 

Good. Now we’ll populate some data into this table. We’ll take the data from Case 1, as that provides exactly the input test data we’re expecting for Case 2.

Let’s insert those data –

 

INSERT INTO ETL_DATA.WM_CONCAT_TAB
SELECT b.SEQ_NO,
       regexp_substr(b.SRC_STR,'[^,]+',1,day_of_calendar) AS SRC_STR
FROM sys_calendar.calendar ,
     PARSE_STR b
WHERE day_of_calendar BETWEEN 1 AND  (LENGTH(b.SRC_STR) - LENGTH(regexp_replace(b.SRC_STR,'[^A-Z]+','',1,0,'i'))+1 );

 

Let’s check the data –

SEQ_NO  SRC_STR
------  --------------------
1       ANUPAM
1       BIRESWAR
1       RAM
1       SUJAY
1       TRIDIB
2       BABU
2       PAPU
2       SAYAN
2       TUNKAI
3       ATBIS
3       IK
3       SAPMUNDA

 

As you know, in TD we have significant restrictions regarding hierarchical & recursive queries. So, in this step, we’ll build a relationship similar to the employee-manager pairing in the popular employee table. If we have that kind of relation, we can easily establish & fit it into the TD recursive model.

Let’s create this intermediate table. In this case, we’ll map each current row to its next row. This is also a very useful technique. In Oracle, there are LEAD & LAG functions to achieve the same, but here we have to work a little harder to achieve it.

CREATE MULTISET VOLATILE TABLE VT_SRC_ARRNG
AS
     (
            SELECT SEQ_NO,
                   SRC_STR,
                   MAX(SRC_STR) OVER(
                                        PARTITION BY SEQ_NO
                                        ORDER BY SEQ_NO, SRC_STR
                                        ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING 
                                    ) AS PREV_SRC_STR,
                   COUNT(*)  OVER(
                                    PARTITION BY SEQ_NO
                                 ) AS MAX_RECUR_CNT
            FROM WM_CONCAT_TAB
      )
WITH DATA
ON COMMIT
PRESERVE ROWS;
 
CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:01.102

 

Let’s look at the output –

SELECT *
FROM VT_SRC_ARRNG
ORDER BY 1,2;
 
 
 
 
SEQ_NO  SRC_STR  PREV_SRC_STR    MAX_RECUR_CNT
-----   -------  --------------- ---------------------
1       ANUPAM      BIRESWAR     5
1       BIRESWAR    RAM          5
1       RAM         SUJAY        5
1       SUJAY       TRIDIB       5
1       TRIDIB      ?            5
2       BABU        PAPU         4
2       PAPU        SAYAN        4
2       SAYAN       TUNKAI       4
2       TUNKAI      ?            4
3       ATBIS       IK           3
3       IK          SAPMUNDA     3
3       SAPMUNDA    ?            3

 

Fine. From the above VT, we can see that every source string has one PREV_SRC_STR paired with it. Also, via MAX_RECUR_CNT, we’ve noted down how many levels there are in each SEQ_NO window. We’ll use this column later.

Let’s move to the 2nd & final part –

Let’s aggregate the values based on SEQ_NO & club them together with commas –

WITH RECURSIVE WM_CONCAT(SEQ_NO, SRC_STR, PREV_SRC_STR, MAX_RECUR_CNT, LVL,  COMMA_SEP_STR)
AS
     (
        SELECT SEQ_NO,
               SRC_STR,
               PREV_SRC_STR,
               MAX_RECUR_CNT,
               1 AS LVL,
               CAST( '' AS VARCHAR(100)) AS COMMA_SEP_STR
       FROM VT_SRC_ARRNG
       WHERE  PREV_SRC_STR IS NULL
       UNION ALL
       SELECT  b.SEQ_NO,
               b.SRC_STR,
               b.PREV_SRC_STR,
               b.MAX_RECUR_CNT,
               c.LVL+1 AS LVL,
               c.COMMA_SEP_STR||b.SRC_STR||',' AS COMMA_SEP_STR
       FROM VT_SRC_ARRNG b,
               WM_CONCAT c
       WHERE c.SRC_STR =  b.PREV_SRC_STR
     )
SELECT k.SEQ_NO,
       k.AGGR_STR
FROM (               
    SELECT SEQ_NO,
           SRC_STR,
           LVL,
           MAX_RECUR_CNT,
           MIN(CASE
                 WHEN LVL = 1 THEN
                    SRC_STR
               ELSE
                  'ZZZZZ'
               END   ) OVER(
                                 PARTITION BY SEQ_NO
                                 ORDER BY LVL ASC
                           ) ROOT_SRC_STR,
           COMMA_SEP_STR||ROOT_SRC_STR AS AGGR_STR
    FROM WM_CONCAT
    )  k
WHERE k.LVL = k.MAX_RECUR_CNT
ORDER BY 1,2;

 

Let’s check the output –

SEQ_NO  AGGR_STR
------- ---------------------------
1       SUJAY,RAM,BIRESWAR,ANUPAM,TRIDIB
2       SAYAN,PAPU,BABU,TUNKAI
3       IK,ATBIS,SAPMUNDA

 

I guess, We’ve done it. 😀

So, You can achieve the same without writing any UDF.

 

Performance of Regular Expression in Teradata 14.0

Today, I’ll explain the performance impact of these regular expressions in Teradata.

These functions have been newly introduced. Hence, it is possible that they may take some time to settle; in other words, we may expect to see some patches before they can be considered stable & ready to use in TD.

Before we go through this, we must understand these functions & where we should use them properly. It is quite obvious that we would like to use them in places where a single SQL using Teradata’s older, stable functions cannot achieve the result, or where we would otherwise be looking at some kind of stored procedure to implement the business logic. Hence, it would be unfair to simply compare them against a simple solution. Rather, we should weigh the complex parsing logic & the total performance of those stored procedures or related processes against these functions. In those cases, regular expressions will be very handy, I believe.

Let’s consider one simple case –

Let’s consider the following string – “SANTA’S JOYFULL GIFT“.

I want to fetch the part of the string up to the first space character, i.e. as per the business logic, it should produce the following output – “SANTA’S“.

I’ll test that against a significant volume of data & compare the explain plans between the conventional approach & the regular expression.

Let’s check the explain plan for the SQL that uses conventional functions –

EXPLAIN SELECT C_KEY,
        C_CD,
        S_ORG_NM,
        SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

  1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM for
     access.
  2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM in view
     ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a condition
     of ("EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '")
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The input table will not be cached in memory, but it is eligible
     for synchronized scanning.  The size of Spool 1 is estimated with
     high confidence to be 38,212,793 rows (5,082,301,469 bytes).  The
     estimated time for this step is 40.02 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 40.02 seconds.

 

Now, let’s try the same with the Regular expression –

EXPLAIN SELECT C_KEY,
        C_CD,
        S_ORG_NM,
        regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

  1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM for
     access.
  2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM in view
     ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a condition
     of ("EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '")
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The input table will not be cached in memory, but it is eligible
     for synchronized scanning.  The size of Spool 1 is estimated with
     high confidence to be 38,212,793 rows (105,696,585,438 bytes).
     The estimated time for this step is 40.02 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 40.02 seconds.

 

So, from the above two plans, we really can’t find much difference except in the number of bytes transferred. In both cases, the estimated time shows 40.02 seconds.

So, now let’s check the actual time each approach takes.

First, let us create one volatile table & record the total creation time –

 

CREATE MULTISET VOLATILE TABLE VT1
AS
    (
         SELECT C_KEY,
				        C_CD,
				        S_ORG_NM,
				        SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
		 FROM MASTER_CLAIM
		 WHERE C_CD = '555'
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

 

And, the response is as follows –

--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:05.076

 

Let’s create another VT with the new approach –

CREATE MULTISET VOLATILE TABLE VT2
AS
    (
         SELECT C_KEY,
				        C_CD,
				        S_ORG_NM,
				        regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
		 FROM MASTER_CLAIM
		 WHERE C_CD = '555'
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

 

And, the response time –

--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:05.762

 

So, as you can see, there is not much difference between the old process & the new process.

And, here is the total number of records we tested this with –

 

SELECT COUNT(*)
FROM VT1;

40,781,904

 

So, from the above, you can see that we’ve tested this on a significant number of rows, which is very common in any TD system.

Let’s test whether both SQLs actually return the same values. To do that, we’ll create one more VT as follows –

CREATE MULTISET VOLATILE TABLE VT3
AS
    (
         SELECT a.C_KEY,
				        a.C_CD,
				        a.S_ORG_NM,
				        a.DER_S_ORG_NM AS OLD_PRCHSR_ORG_NM,
				        b.DER_S_ORG_NM AS NEW_PRCHSR_ORG_NM,
				        CHAR_LENGTH(a.DER_S_ORG_NM) AS OLD_PRCHSR_ORG_NM_LEN,
				        CHAR_LENGTH(b.DER_S_ORG_NM) AS NEW_PRCHSR_ORG_NM_LEN
		 FROM VT1 a,
		             VT2 b
		 WHERE a.C_KEY = b.C_KEY
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:06.864

 

Now, lets test the output –

 

SELECT *
FROM VT3
WHERE OLD_PRCHSR_ORG_NM <> NEW_PRCHSR_ORG_NM;

--SELECT completed. 0 rows returned. Elapsed Time =  00:00:01.763

 

So, as you can see from the above simulation, we can establish that the performance difference between the conventional SQL & the SQL using a regular expression is negligible.

But, again, I must clearly say – regular expressions are ideal where we would otherwise need multiple SQLs or procedural code, or where we need complex parsing that is difficult to implement in a single SQL.

Hope this will give you some clarity. 😀

Regular Expression on Teradata 14.0

I’ve been working for more than 8 years in Oracle 10g & 11g & have written significant queries using regular expressions in various scenarios. They are really handy if you know how to use them & can remove a lot of pain with a single SQL. And the performance will be better compared to the total effort of achieving the same functionality using multiple SQL queries or PL/SQL procedures.

For the last couple of years, I’ve been working on Teradata. And, on several occasions, I found myself wishing for features like these, where I could easily manipulate data with regular expressions. So I was pretty excited when I heard that Teradata introduced regular expressions from version 14.0.


As a result, I tried all the features that I think can be handy & useful for various scenarios, & the following are the successful queries I got. There were a couple of occasions where Teradata was only partially able to manipulate those strings. I’ve checked the latest Teradata manual; however, I was unable to find a solution. So, I’m expecting other forum members can contribute here in order to make this thread useful for every one of us. And, I’ll post here as soon as I get some answers on those partial conversions.

For better understanding, I’ve provided the actual column value & the post-transformation value of that column in the output. That should help us grasp it easily, I guess. 🙂


Case 1,

SELECT regexp_replace('SatyakiDe','([[:lower:]]{1,})([[:upper:]]{1,})','\1 \2') AS COL_VAL;

COLA COL_VAL
---------------- ----------------------------------------
SatyakiDe Satyaki De


Case 2,

select regexp_replace('919047242526','^([[:digit:]]{2})([[:digit:]]{10})','+\1 \2') COL_VAL;

COLA COL_VAL
------------ ---------------
919047255555 +91 9047255555



Case 3,

select regexp_replace('+++C','^([[:punct:]]{2})([[:punct:]]{1})(.*)$','\1\3') COL_VAL;

COLA COL_VAL
---- -----
+++C ++C



Case 4,

select initcap(regexp_replace(regexp_substr(' satyaki.de@mail.com','[^@]+'),'(.*)(\.)(.*)','\1 \3')) COL_VAL;

COLA COL_VAL
-------------------------------- --------------------------------------------------
satyaki.de@mail.com Satyaki De



Case 5,

select regexp_replace('100011001','([[:digit:]]{3})([[:digit:]]{2})([[:digit:]]{4})','XXX-XX-\3') as COL_VAL;

COLA COL_VAL
---------------- --------------------
100011001 XXX-XX-1001



Case 6,

select regexp_replace('123456789','([[:digit:]]{3})([[:digit:]]{3})([[:digit:]]{3})','\3.\2.\1') as COL_VAL;

COLA COL_VAL
--------- ---------------
123456789 789.456.123



Case 7,

SELECT regexp_replace('satyaki9de0loves3to8work2on2sql0and2bi6tools1','[^0-9]+','',1,0,'i') AS DER_VAL;

COLA DER_VAL
--------------------------------------------- ----------
satyaki1de0loves3to8work2on2sql0and2bi4tools1 1038220241




As you can see, all the non-numeric characters have been filtered out of the string & only the digits are kept. These sorts of queries are very useful in lots of different business scenarios as well.

So, any extra space may not produce the desired result, & we need to pay attention to these small details.

And, I’ve tested all these queries in the following two versions –

select * from dbcinfo;

InfoKey InfoData
-------- ------------------------
1 VERSION 14.10.00.02
2 RELEASE 14.10.00.02
3 LANGUAGE SUPPORT MODE Standard


select * from dbcinfo;

InfoKey InfoData
-------- ------------------------
1 VERSION 14.10.01.05
2 RELEASE 14.10.01.04
3 LANGUAGE SUPPORT MODE Standard



Hope, this will give you much more clarity. 🙂

One more thing, I would like to clarify here – my intention is to describe more features about these regexp_(similar/substr/instr/replace) functions.

I received a question on the Teradata forum, while posting the same article there, about whether these regexp functions are available in TD 13 or not.

And, here is my answer to that question –  

Regarding version 13,

Let us check whether they have these regexp functions or not –

select * from dbcinfo;

InfoKey InfoData
-------- ------------------------
1 VERSION 13.00.00.15
2 RELEASE 13.00.00.15
3 LANGUAGE SUPPORT MODE Standard


select * from dbcinfo;

InfoKey InfoData
-------- ------------------------
1 VERSION 13.10.07.12
2 RELEASE 13.10.07.12
3 LANGUAGE SUPPORT MODE Standard


select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;

select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;

select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*) $','\1 \2\3') AS COL_VAL;
$
*** Failure 3706 Syntax error: expected something between '(' and the string 'S' keyword.
Statement# 1, Info =35
*** Total elapsed time was 1 second.



Hope this gives adequate clarity on the answer to the above question.

Now, let’s see some other functionality.

REGEXP_SIMILAR provides functionality similar to REGEXP_LIKE in Oracle.

Let’s see couple of such cases –

Let’s prepare the table with some dummy data –


SELECT * FROM dbc.dbcinfo;

InfoKey InfoData
-------- -----------------------
1 VERSION 14.10.01.05
2 RELEASE 14.10.01.04
3 LANGUAGE SUPPORT MODE Standard


CREATE MULTISET VOLATILE TABLE TEST_T1
(
COL1 VARCHAR(10)
)
ON COMMIT
PRESERVE ROWS;

INSERT INTO TEST_T1 VALUES('456')
;INSERT INTO TEST_T1 VALUES('123x')
;INSERT INTO TEST_T1 VALUES('x123')
;INSERT INTO TEST_T1 VALUES('y')
;INSERT INTO TEST_T1 VALUES('+789')
;INSERT INTO TEST_T1 VALUES('-789')
;INSERT INTO TEST_T1 VALUES('159-')
;INSERT INTO TEST_T1 VALUES('-1-');


Let’s check the data now –

SELECT *
FROM TEST_T1;

COL1
1 123x
2 456
3 x123
4 +789
5 -789
6 y
7 159-
8 -1-



Let’s look into the various scenarios now –


Case 1 (Returns Mixed Numbers, Signed Numbers & Non Numbers),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=0;

COL1
-----
1 123x
2 x123
3 +789
4 -789
5 y
6 159-
7 -1-




Case 2 (Returns Only Unsigned Positive Numbers),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=1;

COL1
-----
456



Case 3 (Returns All Numbers including Positive, Negative & unsigned),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[+-]?[0-9]+[+-]?$','c')=1;

COL1
-----
456
+789
-789
159-
-1-



Case 4 (Returns Only Non Numbers i.e. Characters),

SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'[^0-9]+','c')=1;

COL1
----
y



Hope this will give you some additional idea. 🙂

My objective is to provide basic information to my friends, so that they can write better SQL in TD while migrating from other popular databases, & so that new TD developers can get a flavor of this powerful feature & apply it properly in all the positive aspects. 😀

Really appreciate your time to read this post.

Regards.

Satyaki De.