## Predicting Flipkart business growth factor using Linear-Regression Machine Learning Model

Hi Guys,

Today, we’ll be exploring the potential business growth factor using the “Linear-Regression” machine-learning model. We’ve prepared a set of dummy data & based on that, we’ll predict.

Let’s explore a few sample data –

So, based on these data, we would like to predict YearlyAmountSpent as dependent on any one of the following features, i.e. [ Time On App / Time On Website / Flipkart Membership Duration (In Year) ].

You need to install the following packages –

pip install pandas

pip install matplotlib

pip install scikit-learn

We’ll be discussing only the main calling script & the class script. However, we’ll be posting the parameters without discussing them. And, we won’t discuss clsL.py as we’ve already discussed it in our previous post.

1. clsConfig.py (This script contains all the parameter details.)

```
################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### Machine-Learning. Application will     ####
#### process these information & perform    ####
#### various analysis on Linear-Regression. ####
################################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'FILE_NAME': Curr_Path + sep + 'Data' + sep + 'FlipkartCustomers.csv',
        'SRC_PATH': Curr_Path + sep + 'Data' + sep,
        'APP_DESC_1': 'IBM Watson Language Understand!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path
    }
```

2. clsLinearRegression.py (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)

```
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main scripts for Linear   ####
#### Regression.                          ####
##############################################

import pandas as p
import numpy as np
import re

import matplotlib.pyplot as plt
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsLinearRegression:
    def __init__(self):
        self.fileName = cf.config['FILE_NAME']

    def predictResult(self):
        try:

            inputFileName = self.fileName

            # Reading the source CSV into a Pandas dataframe
            df = p.read_csv(inputFileName)

            print()
            print('Projecting sample rows: ')
            print(df.head())

            print()
            x_row = df.shape[0]
            x_col = df.shape[1]

            print('Total Number of Rows: ', x_row)
            print('Total Number of Columns: ', x_col)

            # Adding Features
            x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]

            # Target Variable - Trying to predict
            y = df['YearlyAmountSpent']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

            # Importing Model
            from sklearn.linear_model import LinearRegression

            # Creating an Instance
            lm = LinearRegression()

            # Train or Fit my model on Training Data
            lm.fit(X_train, Y_train)

            # Creating a prediction value
            flipKartSalePrediction = lm.predict(X_test)

            # Creating a scatter plot based on Actual Value & Predicted Value
            plt.scatter(Y_test, flipKartSalePrediction)

            plt.xlabel('Actual Values')
            plt.ylabel('Predicted Values')

            # Checking Individual Metrics
            from sklearn import metrics

            print()
            mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
            print('Mean Absolute Error (MAE): ', mae_val)

            mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
            print('Mean Square Error (MSE): ', mse_val)

            rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
            print('Root Mean Square Error (RMSE): ', rmse_val)

            print()

            # Check Variance Score - R^2 Value
            print('Variance Score:')
            var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
            print('Our Model is', var_score, '% accurate. ')
            print()

            # Finding Coefficient on X_train.columns
            print()
            print('Finding Coefficient: ')

            cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])
            print('Printing All the Factors: ')
            print(cedf)

            print()

            # Getting the Max Value from it
            max_cedf = cedf['Coefficient'].max()

            # Filtering the max Value to identify the biggest Business factor
            dfMax = cedf[cedf['Coefficient'] == max_cedf]

            # Dropping the derived column
            dfMax = dfMax.reset_index()

            print(dfMax)

            # Extracting Actual Business Factor from Pandas dataframe
            str_factor_temp = str(dfMax.iloc[0]['index'])
            str_factor = re.sub("([a-z])([A-Z])", "\g<1> \g<2>", str_factor_temp)
            str_value = str(round(float(dfMax.iloc[0]['Coefficient']), 2))

            print()
            print('*' * 80)
            print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
            print('*' * 80)
            print()

            # This is required when you are printing from a conventional
            # front-end & not using Jupyter notebook.
            plt.show()

            return 0

        except Exception as e:
            x = str(e)
            print('Error : ', x)

            return 1
```

Key lines from the above snippet –

```
# Adding Features
x = df[['TimeOnApp', 'TimeOnWebsite', 'FlipkartMembershipInYear']]
```

Our application creates a subset of the main dataframe, which contains all the features.

```
# Target Variable - Trying to predict
y = df['YearlyAmountSpent']
```

Now, the application sets the target variable into ‘y’.

```
# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)
```

As per “Supervised Learning,” our application splits the dataset into two subsets: one to train the model & another to test the final model. However, for a large dataset, you can divide the data into three sets, keeping one aside for performance statistics. In our case, we don’t need that as this data volume is significantly less.
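For intuition, here is a minimal, self-contained sketch of the same split on synthetic data; the tiny toy values below are assumptions, not the article’s actual dataset –

```python
# A minimal sketch of train_test_split on a synthetic dataframe;
# the column names mirror the article's dataset, the values are made up.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'TimeOnApp': range(10),
    'YearlyAmountSpent': range(0, 100, 10),
})

x = df[['TimeOnApp']]
y = df['YearlyAmountSpent']

# test_size=0.4 reserves 40% of the rows for testing;
# random_state pins the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.4, random_state=101)

print(len(X_train), len(X_test))  # 6 4
```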

```
# Train or Fit my model on Training Data
lm.fit(X_train, Y_train)
```

Our application now trains/fits the model on the training data.

```
# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)
```

Our application projects the outcome in a scatterplot of actual versus predicted values.

Also, the following concepts are captured by our program. For more details, I’ve provided the external links for your reference –

1. Mean Absolute Error (MAE)
2. Mean Square Error (MSE)
3. Root Mean Square Error (RMSE)

And, the implementation is shown as –

```
mae_val = metrics.mean_absolute_error(Y_test, flipKartSalePrediction)
print('Mean Absolute Error (MAE): ', mae_val)

mse_val = metrics.mean_squared_error(Y_test, flipKartSalePrediction)
print('Mean Square Error (MSE): ', mse_val)

rmse_val = np.sqrt(metrics.mean_squared_error(Y_test, flipKartSalePrediction))
print('Root Mean Square Error (RMSE): ', rmse_val)
```

At this moment, we would like to check the credibility of our model by using the variance score as follows –

```
var_score = str(round(metrics.explained_variance_score(Y_test, flipKartSalePrediction) * 100, 2)).strip()
print('Our Model is', var_score, '% accurate. ')
```

Finally, we extract the coefficients to find out which particular feature will lead Flipkart to better sales & growth, by taking the maximum coefficient value among all the features, as shown below –

```
cedf = p.DataFrame(lm.coef_, x.columns, columns=['Coefficient'])

# Getting the Max Value from it
max_cedf = cedf['Coefficient'].max()

# Filtering the max Value to identify the biggest Business factor
dfMax = cedf[cedf['Coefficient'] == max_cedf]

# Dropping the derived column
dfMax = dfMax.reset_index()
```

Note that we’ve used a regular expression to split the camel-case column name of our winning feature & present it with a much more meaningful name, without changing the column name itself.

```
# Extracting Actual Business Factor from Pandas dataframe
str_factor_temp = str(dfMax.iloc[0]['index'])
str_factor = re.sub("([a-z])([A-Z])", "\g<1> \g<2>", str_factor_temp)
str_value = str(round(float(dfMax.iloc[0]['Coefficient']), 2))

print('Major Business Activity - (', str_factor, ') - ', str_value, '%')
```
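The camel-case split can be tried standalone; the standard-library re module is sufficient for this pattern –

```python
# Inserting a space between every lower-case/upper-case boundary,
# turning a camel-case column name into readable words.
import re

str_factor = re.sub("([a-z])([A-Z])", "\\g<1> \\g<2>", "FlipkartMembershipInYear")
print(str_factor)  # Flipkart Membership In Year
```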

3. callLinear.py (This is the first calling script.)

```
##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 15-May-2020              ####
#### Modified On 15-May-2020              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsLinearRegression as cw

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# Lookup functions from
# Azure cloud SQL DB

var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_1 = 0
        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'MachineLearning_LinearRegression.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']
        curr_ver = datetime.datetime.now().strftime("%Y-%m-%d")

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        print('Machine Learning - Linear Regression Prediction : ')
        print('-' * 200)

        # Create the instance of the Linear-Regression Class
        x2 = cw.clsLinearRegression()

        ret = x2.predictResult()

        if ret == 0:
            print('Successful Linear-Regression Prediction Generated!')
        else:
            print('Failed to generate Linear-Regression Prediction!')

        print("-" * 200)
        print()

        print('Finding Analysis points..')
        print("*" * 200)
        logging.info('Finding Analysis points..')
        logging.info(tmpR0)

        tmpR10 = 'End Time: ' + str(var)
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError as e:
        print(str(e))
        logging.info(str(e))

    except Exception as e:
        print("Top level Error: args:{0}, message:{1}".format(e.args, str(e)))

if __name__ == "__main__":
    main()
```

Key snippet from the above script –

```
# Create the instance of the Linear-Regression Class
x2 = cw.clsLinearRegression()

ret = x2.predictResult()
```

In the above snippet, our application first creates an instance of the main class & then invokes the “predictResult” method.

Let’s run our application –

Step 1:

First, if successful, the application will fetch the following sample rows from our source file.

Step 2:

Then, it will create the following scatterplot by executing the following snippet –

```
# Creating a scatter plot based on Actual Value & Predicted Value
plt.scatter(Y_test, flipKartSalePrediction)
```

Note that our model is pretty accurate & it has a balanced success rate compared to our predicted numbers.

Step 3:

Finally, it is successfully able to project the critical feature, as shown below –

From the above picture, you can see that our model is pretty accurate (89% approx).

Also, the highlighted red square identifies the key features & their confidence scores & finally, the winning feature is marked in green.

So, as per that, we’ve come to the conclusion that Flipkart’s business growth depends on the tenure of its subscribers, i.e., old members are prone to buy more than newer members.

Let’s look into our directory structure –

So, we’ve done it.

I’ll be posting another new post in the coming days. Till then, Happy Avenging! 😀

Note: All the data posted here are representational data & available over the internet & for educational purpose only.

## Password Validation using Regexp Functions

Today, we’ll be checking one new area where we can implement regular expressions to achieve password validation without involving any kind of Macro or Stored-Proc.

Let’s consider the following conditions to be implemented –

1. Password should contain between 6 & 10 characters.

2. At least one character should be a digit.

3. At least one character should be an upper-case letter.

4. There should be at least one special character.
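Just for intuition, the same four rules can be sketched in Python with lookaheads; this mirrors the idea of the Teradata pattern below but is not the SQL itself –

```python
# A Python sketch of the four password rules using lookahead assertions.
import re

PASSWD_RE = re.compile(
    r'^(?=.{6,10}$)'        # 6 to 10 characters overall
    r'(?=.*\d)'             # at least one digit
    r'(?=.*[A-Z])'          # at least one upper-case letter
    r'(?=.*[^A-Za-z0-9])'   # at least one special character
    r'.*$'
)

def is_valid(passwd: str) -> bool:
    return PASSWD_RE.match(passwd) is not None

print(is_valid('hoT@iml9a'))    # True
print(is_valid('hotimla'))      # False - no digit, upper case or special char
print(is_valid('hoT@iml9a66'))  # False - 11 characters
```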

Let’s check the Query & Output –

```
select seq_no,
       passwd,
       regexp_similar(passwd,'^(?=^([[:graph:]]{6,10})$)(?=.*([[:upper:]]{1,}))(?=.*([[:digit:]]{1,})).*$') as reg_test
from scott.login_det
order by 1;

SEQ_NO PASSWD       REG_TEST
------ ------------ --------
     1 hoti                0
     2 hotimla             0
     3 hotImla             0
     4 hot@imla            0
     5 hoT@imla            0
     6 hoT@iml9a           1
     7 hoT@iml9a66         0
```

Similarly, you can add condition of lower case character if you want to make it more complex.

Hope, this will give you another way – to implement the same logic. 🙂

## Reverse String using Regexp Functions

Hi,

Today, we’ll be checking one new area where we can implement regular expression to achieve the same without involving any kind of Macro, Stored-Proc.

On many occasions, we may have to parse various kinds of strings. Assume that we need to parse a string in reverse order. Until TD 14.0, there was no easy method to implement this. Of course, this new method also has limits: it can only reverse a handful of characters. I have checked with 9 characters & it supports that.

So, if you have specific string lengths, then you may also try this solution if you are using TD 14.0. That will be handy.

Let’s check the Query & Output –

```
SELECT SEQ_NO,
       SRC_STR,
       regexp_replace(SRC_STR,
                      '([[:alnum:]]{1,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})([[:alnum:]]{0,1})',
                      '\9\8\7\6\5\4\3\2\1') AS REV_SRC_STR
FROM WM_CONCAT_TAB
ORDER BY 1;

SEQ_NO SRC_STR   REV_SRC_STR
------ --------- ------------
     1 BIRESWAR  RAWSERIB
     1 TRIDIB    BIDIRT
     1 SUJAY     YAJUS
     1 ANUPAM    MAPUNA
     1 RAM       MAR
     2 PAPU      UPAP
     2 SAYAN     NAYAS
     2 TUNKAI    IAKNUT
     2 BABU      UBAB
     3 ATBIS     SIBTA
     3 SAPMUNDA  ADNUMPAS
     3 IK        KI
```
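For intuition, the very same capture-group trick can be reproduced in Python; the pattern below is a re-write of the Teradata one, with the same nine-character limit –

```python
# One required alphanumeric character plus eight optional ones,
# emitted back in reverse group order - just like the TD pattern.
import re

pattern = '([0-9A-Za-z])' + '([0-9A-Za-z]?)' * 8
repl = r'\9\8\7\6\5\4\3\2\1'

print(re.sub(pattern, repl, 'BIRESWAR'))  # RAWSERIB
print(re.sub(pattern, repl, 'IK'))        # KI
```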

Hope, this will give you another way – to implement the same logic. 🙂

Today, I’ll show a couple of very useful functions or logic implemented in Teradata using its regular expressions.

There are two very popular demands from developers across different databases –

1. How to split comma-separated values into separate rows

2. How to bind separate values into one row (just the opposite of Step 1)

The 2nd option is in high demand, as cross-platform database professionals, especially Oracle developers, look for this kind of implementation since Oracle has built-in functions to do the same. Those functions are Listagg, wm_concat, group_concat.

Let’s check the solution –

Case 1,

Let’s create the table & prepare some data –

```
CREATE MULTISET TABLE ETL_DATA.PARSE_STR
(
    SEQ_NO  INTEGER,
    SRC_STR VARCHAR(70)
);

CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:01.864
```

Let’s insert some data –

```
INSERT INTO ETL_DATA.PARSE_STR VALUES(1,'RAM,TRIDIB,ANUPAM,BIRESWAR,SUJAY');
INSERT INTO ETL_DATA.PARSE_STR VALUES(2,'TUNKAI,SAYAN,BABU,PAPU');
INSERT INTO ETL_DATA.PARSE_STR VALUES(3,'IK,ATBIS,SAPMUNDA');
```

Let’s check the value –

```
SEQ_NO SRC_STR
------ ----------------------------------
     1 RAM,TRIDIB,ANUPAM,BIRESWAR,SUJAY
     2 TUNKAI,SAYAN,BABU,PAPU
     3 IK,ATBIS,SAPMUNDA
```

Fine, Now our objective will be split these comma separated values in each lines.

```
SELECT b.SEQ_NO,
       regexp_substr(b.SRC_STR,'[^,]+',1,day_of_calendar) AS SRC_STR
FROM sys_calendar.calendar,
     PARSE_STR b
WHERE day_of_calendar BETWEEN 1 AND (LENGTH(b.SRC_STR) - LENGTH(regexp_replace(b.SRC_STR,'[^A-Z]+','',1,0,'i'))+1)
ORDER BY 1,2;
```

And, let’s check the output –

```
SEQ_NO SRC_STR
------ ----------------------
     1 ANUPAM
     1 BIRESWAR
     1 RAM
     1 SUJAY
     1 TRIDIB
     2 BABU
     2 PAPU
     2 SAYAN
     2 TUNKAI
     3 ATBIS
     3 IK
     3 SAPMUNDA
```

Gr8! I guess, result is coming as per my expectation. 🙂
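As a cross-check, the same explode logic can be sketched in plain Python; the literals below are the sample rows from above –

```python
# Exploding comma-separated values into one (SEQ_NO, value) row per value,
# mirroring the regexp_substr output above.
rows = {
    1: 'RAM,TRIDIB,ANUPAM,BIRESWAR,SUJAY',
    2: 'TUNKAI,SAYAN,BABU,PAPU',
    3: 'IK,ATBIS,SAPMUNDA',
}

exploded = sorted(
    (seq_no, token)
    for seq_no, src_str in rows.items()
    for token in src_str.split(',')
)

for seq_no, token in exploded:
    print(seq_no, token)
```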

Case 2 (Substitute of Listagg, wm_concat, group_concat in Oracle),

We’ll do this in two small steps for better understanding & performance.

First, let us create another table –

```
CREATE MULTISET TABLE ETL_DATA.WM_CONCAT_TAB
(
    SEQ_NO  INTEGER,
    SRC_STR VARCHAR(20)
);

CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:01.230
```

Good. Now we’ll populate some data into this table. We’ll populate data from Step 1 as this will provide the exact data that we’re expecting as input test data for Case 2.

Let’s insert those data –

```
INSERT INTO ETL_DATA.WM_CONCAT_TAB
SELECT b.SEQ_NO,
       regexp_substr(b.SRC_STR,'[^,]+',1,day_of_calendar) AS SRC_STR
FROM sys_calendar.calendar,
     PARSE_STR b
WHERE day_of_calendar BETWEEN 1 AND (LENGTH(b.SRC_STR) - LENGTH(regexp_replace(b.SRC_STR,'[^A-Z]+','',1,0,'i'))+1);
```

Let’s check the data –

```
SEQ_NO SRC_STR
------ --------------------
     1 ANUPAM
     1 BIRESWAR
     1 RAM
     1 SUJAY
     1 TRIDIB
     2 BABU
     2 PAPU
     2 SAYAN
     2 TUNKAI
     3 ATBIS
     3 IK
     3 SAPMUNDA
```

As you know, in TD we have significant restrictions regarding hierarchical & recursive queries. So, in this step, we’ll build a relationship like employee & manager in the popular employee table. If we have that kind of relation, then we can easily establish & fit it in the TD model.

Let’s create this intermediate table. In this case, we’ll map the current row to the next row. This is also a very useful technique. In Oracle, there are LEAD & LAG functions to achieve the same, but here we’ve to work a little bit more to achieve it.

```
CREATE MULTISET VOLATILE TABLE VT_SRC_ARRNG
AS
(
    SELECT SEQ_NO,
           SRC_STR,
           MAX(SRC_STR) OVER(
                              PARTITION BY SEQ_NO
                              ORDER BY SEQ_NO, SRC_STR
                              ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING
                            ) AS PREV_SRC_STR,
           COUNT(*) OVER(
                          PARTITION BY SEQ_NO
                        ) AS MAX_RECUR_CNT
    FROM WM_CONCAT_TAB
)
WITH DATA
ON COMMIT PRESERVE ROWS;

CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:01.102
```

Let’s look the output –

```
SELECT * FROM VT_SRC_ARRNG ORDER BY 1,2;

SEQ_NO SRC_STR   PREV_SRC_STR MAX_RECUR_CNT
------ --------- ------------ -------------
     1 ANUPAM    BIRESWAR                 5
     1 BIRESWAR  RAM                      5
     1 RAM       SUJAY                    5
     1 SUJAY     TRIDIB                   5
     1 TRIDIB    ?                        5
     2 BABU      PAPU                     4
     2 PAPU      SAYAN                    4
     2 SAYAN     TUNKAI                   4
     2 TUNKAI    ?                        4
     3 ATBIS     IK                       3
     3 IK        SAPMUNDA                 3
     3 SAPMUNDA  ?                        3
```

Fine. From the above VT, we can see that every source string has one previous source string. Also, we’ve noted in MAX_RECUR_CNT how many levels each SEQ_NO window contains. We’ll use this column later.

Let’s move to the 2nd & final part –

Let’s aggregate the values based on SEQ_NO & club them with comma –

```
WITH RECURSIVE WM_CONCAT(SEQ_NO, SRC_STR, PREV_SRC_STR, MAX_RECUR_CNT, LVL, COMMA_SEP_STR)
AS
(
    SELECT SEQ_NO,
           SRC_STR,
           PREV_SRC_STR,
           MAX_RECUR_CNT,
           1 AS LVL,
           CAST('' AS VARCHAR(100)) AS COMMA_SEP_STR
    FROM VT_SRC_ARRNG
    WHERE PREV_SRC_STR IS NULL
    UNION ALL
    SELECT b.SEQ_NO,
           b.SRC_STR,
           b.PREV_SRC_STR,
           b.MAX_RECUR_CNT,
           c.LVL+1 AS LVL,
           c.COMMA_SEP_STR||b.SRC_STR||',' AS COMMA_SEP_STR
    FROM VT_SRC_ARRNG b,
         WM_CONCAT c
    WHERE c.SRC_STR = b.PREV_SRC_STR
)
SELECT k.SEQ_NO,
       k.AGGR_STR
FROM
(
    SELECT SEQ_NO,
           SRC_STR,
           LVL,
           MAX_RECUR_CNT,
           MIN(CASE WHEN LVL = 1 THEN SRC_STR ELSE 'ZZZZZ' END)
               OVER(
                     PARTITION BY SEQ_NO
                     ORDER BY LVL ASC
                   ) ROOT_SRC_STR,
           COMMA_SEP_STR||ROOT_SRC_STR AS AGGR_STR
    FROM WM_CONCAT
) k
WHERE k.LVL = k.MAX_RECUR_CNT
ORDER BY 1,2;
```

Let’s check the output –

```
SEQ_NO AGGR_STR
------ ---------------------------------
     1 SUJAY,RAM,BIRESWAR,ANUPAM,TRIDIB
     2 SAYAN,PAPU,BABU,TUNKAI
     3 IK,ATBIS,SAPMUNDA
```

I guess, We’ve done it. 😀

So, You can achieve the same without writing any UDF.
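For comparison, here is a minimal Python sketch of the same aggregation (a stand-in for Listagg/wm_concat, not the recursive SQL itself); note that the value ordering may differ from the recursive query’s output –

```python
# Grouping the exploded (SEQ_NO, value) rows back into one
# comma-separated string per SEQ_NO.
from collections import defaultdict

rows = [
    (1, 'ANUPAM'), (1, 'BIRESWAR'), (1, 'RAM'), (1, 'SUJAY'), (1, 'TRIDIB'),
    (2, 'BABU'), (2, 'PAPU'), (2, 'SAYAN'), (2, 'TUNKAI'),
    (3, 'ATBIS'), (3, 'IK'), (3, 'SAPMUNDA'),
]

grouped = defaultdict(list)
for seq_no, src_str in rows:
    grouped[seq_no].append(src_str)

aggr = {seq_no: ','.join(values) for seq_no, values in grouped.items()}

print(aggr[1])  # ANUPAM,BIRESWAR,RAM,SUJAY,TRIDIB
print(aggr[3])  # ATBIS,IK,SAPMUNDA
```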

## Performance of Regular Expression in Teradata 14.0

Today, I’ll explain the performance impact of these regular expressions in Teradata.

These functions are newly introduced. Hence, it is possible that they may take some time to settle or, in other words, we may expect some patches before they can be considered stable & ready to use in TD.

Before we go through this, we must understand these functions & where to use them properly. It is quite obvious that we would like to use them in places where Teradata’s old, stable functions cannot achieve the result in a single SQL, or where we would otherwise need some kind of Stored-Proc to implement the business logic. Hence, it would be unfair to compare them against a simple solution. Rather, we should weigh the complex parsing logic & the total cost of those Stored-Procs or relevant processes against these functions. In those cases, regular expressions will be very handy, I believe.

Let’s consider one simple case –

Let’s consider the following string – “SANTA’S JOYFULL GIFT“.

I want to fetch the part of the string up to the first space character, i.e. it should produce the following output as per the business logic – “SANTA’S“.
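The same first-token extraction can be sanity-checked in Python with an equivalent pattern –

```python
# [^ ]+ grabs everything up to the first space, like the
# regexp_substr call used in the SQL below.
import re

m = re.search(r"[^ ]+", "SANTA'S JOYFULL GIFT")
print(m.group(0))  # SANTA'S
```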

I’ll test that with significant volume of data & would like to compare the explain plan between the normal process & regular expression.

Let’s check the explain plan for the SQL that uses conventional functions –

```
EXPLAIN
SELECT C_KEY,
       C_CD,
       S_ORG_NM,
       SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM
   for access.
2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM
   in view ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a
   condition of ("EDW_CORE_DB.MASTER_CLAIM in view
   ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '") into Spool 1 (group_amps),
   which is built locally on the AMPs. The input table will not be
   cached in memory, but it is eligible for synchronized scanning.
   The size of Spool 1 is estimated with high confidence to be
   38,212,793 rows (5,082,301,469 bytes). The estimated time for this
   step is 40.02 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 40.02 seconds.
```

Now, let’s try the same with the Regular expression –

```
EXPLAIN
SELECT C_KEY,
       C_CD,
       S_ORG_NM,
       regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM
   for access.
2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM
   in view ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a
   condition of ("EDW_CORE_DB.MASTER_CLAIM in view
   ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '") into Spool 1 (group_amps),
   which is built locally on the AMPs. The input table will not be
   cached in memory, but it is eligible for synchronized scanning.
   The size of Spool 1 is estimated with high confidence to be
   38,212,793 rows (105,696,585,438 bytes). The estimated time for this
   step is 40.02 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 40.02 seconds.
```

So, from the above two, we really can’t find much difference in the plans except the number of bytes transferred. In both cases, the estimated time shows 40.02 seconds.

So, now we can check what will be the actual time it will take. Let’s see that also.

First, let us create one Volatile Table & try to record the total creation time –

```
CREATE MULTISET VOLATILE TABLE VT1
AS
(
    SELECT C_KEY,
           C_CD,
           S_ORG_NM,
           SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
    FROM MASTER_CLAIM
    WHERE C_CD = '555'
)
WITH DATA
ON COMMIT PRESERVE ROWS;
```

And, the response is as follows –

```
--CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:05.076
```

Let’s create another VT with the new approach –

```
CREATE MULTISET VOLATILE TABLE VT2
AS
(
    SELECT C_KEY,
           C_CD,
           S_ORG_NM,
           regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
    FROM MASTER_CLAIM
    WHERE C_CD = '555'
)
WITH DATA
ON COMMIT PRESERVE ROWS;
```

And, the response time –

```
--CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:05.762
```

So, as you can see there is not much difference between the old process & new process.

And, the total number of records we have test this –

```
SELECT COUNT(*)
FROM VT1;

40,781,904
```

So, from the above, you can see that we’ve tested this on a significant number of rows, which is very common in any TD system.

Let’s test whether both the SQLs actually return the same values. To do that, we’ll create one more VT as follows –

```
CREATE MULTISET VOLATILE TABLE VT3
AS
(
    SELECT a.C_KEY,
           a.C_CD,
           a.S_ORG_NM,
           a.DER_S_ORG_NM AS OLD_PRCHSR_ORG_NM,
           b.DER_S_ORG_NM AS NEW_PRCHSR_ORG_NM,
           CHAR_LENGTH(a.DER_S_ORG_NM) AS OLD_PRCHSR_ORG_NM_LEN,
           CHAR_LENGTH(b.DER_S_ORG_NM) AS NEW_PRCHSR_ORG_NM_LEN
    FROM VT1 a,
         VT2 b
    WHERE a.C_KEY = b.C_KEY
)
WITH DATA
ON COMMIT PRESERVE ROWS;

--CREATE TABLE completed. 0 rows processed. Elapsed Time = 00:00:06.864
```

Now, lets test the output –

```
SELECT *
FROM VT3
WHERE OLD_PRCHSR_ORG_NM <> NEW_PRCHSR_ORG_NM;

--SELECT completed. 0 rows returned. Elapsed Time = 00:00:01.763
```

So, from the above simulation, we can establish that the performance difference between the conventional SQL & the SQL using regular expressions is negligible.

But again, I must clearly say – regular expressions will be ideal where we would otherwise need multiple SQLs or PL/SQL, or where we need to implement one complex parsing step that is difficult to express in a single SQL.

Hope this will give you some clarity. 😀

## Regular Expression on Teradata 14.0

I’ve been working for more than 8 years in Oracle 10g & 11g & have written significant queries using regular expressions in various SQL scenarios. They are real handy if you know how to use them & can reduce lots of pain with a single SQL. And the performance will be better compared to the total effort of achieving the same functionality with multiple SQL queries or PL/SQL procedures.

For the last couple of years, I’ve been working on Teradata. And on some occasions, I was expecting features like these, where I could easily manipulate data with regular expressions. I was pretty excited when I heard that Teradata also introduced regular expressions from Version 14.0.

As a result, I tried all those features that I think can be handy & useful for various scenarios, & the following are the successful queries I got. There are two occasions where Teradata was only partially able to manipulate those strings. I’ve checked the latest Teradata manual but was unable to find a solution. So, I’m expecting other forum members to contribute here in order to make this thread useful for every one of us. And, I’ll post here as soon as I get some answers on those partial conversions.

For better understanding, I’ve provided the actual column value & after transformation value of that column in the output. That will help us to grasp it easily – I guess. 🙂

Case 1,

```
SELECT regexp_replace('SatyakiDe','([[:lower:]]{1,})([[:upper:]]{1,})','\1 \2') AS COL_VAL;

COLA       COL_VAL
---------- ----------
SatyakiDe  Satyaki De
```

Case 2,

```
select regexp_replace('919047242526','^([[:digit:]]{2})([[:digit:]]{10})','+\1 \2') COL_VAL;

COLA          COL_VAL
------------  ---------------
919047255555  +91 9047255555
```

Case 3,

```
select regexp_replace('+++C','^([[:punct:]]{2})([[:punct:]]{1})(.*)$','\1\3') COL_VAL;

COLA  COL_VAL
----  -------
+++C  ++C
```

Case 4,

```
select initcap(regexp_replace(regexp_substr(' satyaki.de@mail.com','[^@]+'),'(.*)(\.)(.*)','\1 \3')) COL_VAL;

COLA                 COL_VAL
-------------------  ----------
satyaki.de@mail.com  Satyaki De
```

Case 5,

```
select regexp_replace('100011001','([[:digit:]]{3})([[:digit:]]{2})([[:digit:]]{4})','XXX-XX-\3') as COL_VAL;

COLA       COL_VAL
---------  -----------
100011001  XXX-XX-1001
```

Case 6,

```
select regexp_replace('123456789','([[:digit:]]{3})([[:digit:]]{3})([[:digit:]]{3})','\3.\2.\1') as COL_VAL;

COLA       COL_VAL
---------  -----------
123456789  789.456.123
```

Case 7,

 `12345` `SELECT regexp_replace('satyaki9de0loves3to8work2on2sql0and2bi6tools1','[^0-9]+','',1,0,'i') AS DER_VAL;COLA DER_VAL--------------------------------------------- ----------satyaki1de0loves3to8work2on2sql0and2bi4tools1 1038220241`

As you can see, all the non-numeric characters have been filtered out from the string & only the numbers are kept. These sorts of queries are very useful in lots of different business scenarios as well.

So, any extra space may not produce the desired result & you need to pay attention to these small details.
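The same digit-filtering idea can be cross-checked in Python; the character class is identical, using the query’s literal input string –

```python
# Dropping every run of non-digit characters, keeping only the numbers.
import re

src = 'satyaki9de0loves3to8work2on2sql0and2bi6tools1'
digits_only = re.sub(r'[^0-9]+', '', src)
print(digits_only)  # 9038220261
```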

And, I’ve tested all these queries in the following two versions –

```
select * from dbcinfo;

InfoKey                InfoData
---------------------  ------------------------
VERSION                14.10.00.02
RELEASE                14.10.00.02
LANGUAGE SUPPORT MODE  Standard

select * from dbcinfo;

InfoKey                InfoData
---------------------  ------------------------
VERSION                14.10.01.05
RELEASE                14.10.01.04
LANGUAGE SUPPORT MODE  Standard
```

Hope, this will give you much more clarity. 🙂

One more thing, I would like to clarify here – my intention is to describe more features about these regexp_(similar/substr/instr/replace) functions.

I’ve received one question, while posting the same article in the Teradata forum, on whether these regexp functions are available in TD 13 or not.

And, here is my answer to that question –

Regarding version 13,

Let us check whether they have these regexp functions or not –

```
select * from dbcinfo;

InfoKey                InfoData
---------------------  ------------------------
VERSION                13.00.00.15
RELEASE                13.00.00.15
LANGUAGE SUPPORT MODE  Standard

select * from dbcinfo;

InfoKey                InfoData
---------------------  ------------------------
VERSION                13.10.07.12
RELEASE                13.10.07.12
LANGUAGE SUPPORT MODE  Standard
```

```
select regexp_replace('SatyakiDe','^(.*)([[:upper:]]{1,})(.*)$','\1 \2\3') AS COL_VAL;

 *** Failure 3706 Syntax error: expected something between '(' and the string 'S' keyword.
 Statement# 1, Info =35
 *** Total elapsed time was 1 second.
```

Hope this gives adequate clarity on the answer to the above question.

Now, let’s see some other functionality.

REGEXP_SIMILAR has similar functionality to REGEXP_LIKE in Oracle.
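Conceptually, it behaves like an anchored full-string match returning 1 or 0 – here is a rough Python analogue (an illustration only, not Teradata’s implementation) –

```python
# REGEXP_SIMILAR-style check: 1 if the whole value matches the
# pattern, 0 otherwise - like an anchored re.fullmatch.
import re

def regexp_similar(value: str, pattern: str) -> int:
    return 1 if re.fullmatch(pattern, value) else 0

print(regexp_similar('456', '[0-9]+'))   # 1
print(regexp_similar('123x', '[0-9]+'))  # 0
```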

Let’s see a couple of such cases –

Let’s prepare the table with some dummy data –

```
SELECT * FROM dbc.dbcinfo;

InfoKey                InfoData
---------------------  -----------------------
VERSION                14.10.01.05
RELEASE                14.10.01.04
LANGUAGE SUPPORT MODE  Standard

CREATE MULTISET VOLATILE TABLE TEST_T1
(
    COL1 VARCHAR(10)
)
ON COMMIT PRESERVE ROWS;

INSERT INTO TEST_T1 VALUES('456');
INSERT INTO TEST_T1 VALUES('123x');
INSERT INTO TEST_T1 VALUES('x123');
INSERT INTO TEST_T1 VALUES('y');
INSERT INTO TEST_T1 VALUES('+789');
INSERT INTO TEST_T1 VALUES('-789');
INSERT INTO TEST_T1 VALUES('159-');
INSERT INTO TEST_T1 VALUES('-1-');
```

Lets check the data now –

```
SELECT *
FROM TEST_T1;

COL1
-----
123x
456
x123
+789
-789
y
159-
-1-
```

Let’s look into the various scenarios now –

Case 1 (Returns Mixed Numbers, Signed Numbers & Non Numbers),

```
SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=0;

COL1
-----
123x
x123
+789
-789
y
159-
-1-
```

Case 2 (Returns Only Unsigned Positive Numbers),

```
SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[0-9]+$','c')=1;

COL1
-----
456
```

Case 3 (Returns All Numbers including Positive, Negative & unsigned),

```
SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'^[+-]?[0-9]+[+-]?$','c')=1;

COL1
-----
456
+789
-789
159-
-1-
```

Case 4 (Returns Only Non Numbers i.e. Characters),

```
SELECT *
FROM TEST_T1
WHERE REGEXP_SIMILAR(COL1,'[^0-9]+','c')=1;

COL1
-----
y
```

Hope this will give you some additional idea. 🙂

My objective is to provide basic information to my friends, so that they can write better SQL in TD while migrating from other popular databases, or so that new TD developers can get a flavor of this powerful feature, exploit it in all its positive aspects & apply it properly. 😀