Performance of Regular Expression in Teradata 14.0

Today I’ll explain about the performance impact of these Regular expressions in Teradata.

It is believed that these functions have newly introduced. Hence, it may possible that these function may take some time to settle or in other words we may expect to see some patches before they can be considered as stable & ready to use in TD.

Before, we can go through this – we must understood about these functions & where we should use them properly. It is quite obvious that we would like to use them in such places where using teradata’s old stable function cannot achieve using a single SQL or we are looking for some kind of Stored-Proc in order to implement this business logic. Hence, it would be unfair to simply compare a simple solution with this. Rather, we should consider those complex parsing logic & the total performance by those Stored-Proc or any relevant process with these functions. In those cases – Regular expression will be very handy – I believe.

Let’s consider one simple case –

Let’s consider the following string – “SANTA’S JOYFULL GIFT“.

I want to fetch the a part of the string till it encounters first space character i.e. it will provide the following output as per the business logic – “SANTA’S“.

I’ll test that with significant volume of data & would like to compare the explain plan between the normal process & regular expression.

Let’s check the explain plan for the SQL that uses conventional functions –

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
EXPLAIN SELECT C_KEY,
        C_CD,
        S_ORG_NM,
        SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

  1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM for
     access.
  2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM in view
     ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a condition
     of ("EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '")
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The input table will not be cached in memory, but it is eligible
     for synchronized scanning.  The size of Spool 1 is estimated with
     high confidence to be 38,212,793 rows (5,082,301,469 bytes).  The
     estimated time for this step is 40.02 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 40.02 seconds.

 

Now, let’s try the same with the Regular expression –

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
EXPLAIN SELECT C_KEY,
        C_CD,
        S_ORG_NM,
        regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
FROM MASTER_CLAIM
WHERE C_CD = '555';

  1) First, we lock EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM for
     access.
  2) Next, we do an all-AMPs RETRIEVE step from EDW_CORE_DB.MASTER_CLAIM in view
     ETL_VIEWS.MASTER_CLAIM by way of an all-rows scan with a condition
     of ("EDW_CORE_DB.MASTER_CLAIM in view ETL_VIEWS.MASTER_CLAIM.C_CD = '555 '")
     into Spool 1 (group_amps), which is built locally on the AMPs.
     The input table will not be cached in memory, but it is eligible
     for synchronized scanning.  The size of Spool 1 is estimated with
     high confidence to be 38,212,793 rows (105,696,585,438 bytes).
     The estimated time for this step is 40.02 seconds.
  3) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 40.02 seconds.

 

So, from the above two – we really can’t find much difference in plan except the number of bytes that transfers. But, in both the cases the estimated time shows 40.02 seconds only.

So, now we can check what will be the actual time it will take. Let’s see that also.

First, let us create one Virtual Table & try to record the total create time –

 

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
CREATE MULTISET VOLATILE TABLE VT1
AS
    (
         SELECT C_KEY,
				        C_CD,
				        S_ORG_NM,
				        SUBSTR(S_ORG_NM,1,POSITION(' ' IN S_ORG_NM||' ')) AS DER_S_ORG_NM
		 FROM MASTER_CLAIM
		 WHERE C_CD = '555'
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

 

And, the response is as follows –

1
--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:05.076

 

Let’s create another VT with the new approach –

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
CREATE MULTISET VOLATILE TABLE VT2
AS
    (
         SELECT C_KEY,
				        C_CD,
				        S_ORG_NM,
				        regexp_substr(S_ORG_NM,'[^ ]+') AS DER_S_ORG_NM
		 FROM MASTER_CLAIM
		 WHERE C_CD = '555'
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

 

And, the response time –

1
--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:05.762

 

So, as you can see there is not much difference between the old process & new process.

And, the total number of records we have test this –

 

1
2
3
4
SELECT COUNT(*)
FROM VT1;

40,781,904

 

So, from the above you can see that we’ve tested this on significant number of rows, which is very common in any TD system.

Let’s test whether both the SQLs actually returning same value. To do that – we’ll create one more VT are as follows –

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
CREATE MULTISET VOLATILE TABLE VT3
AS
    (
         SELECT a.C_KEY,
				        a.C_CD,
				        a.S_ORG_NM,
				        a.DER_S_ORG_NM AS OLD_PRCHSR_ORG_NM,
				        b.DER_S_ORG_NM AS NEW_PRCHSR_ORG_NM,
				        CHAR_LENGTH(a.DER_S_ORG_NM) AS OLD_PRCHSR_ORG_NM_LEN,
				        CHAR_LENGTH(b.DER_S_ORG_NM) AS NEW_PRCHSR_ORG_NM_LEN
		 FROM VT1 a,
		             VT2 b
		 WHERE a.C_KEY = b.C_KEY
    )
WITH DATA
ON COMMIT
PRESERVE ROWS;

--CREATE TABLE completed. 0 rows processed. Elapsed Time =  00:00:06.864

 

Now, lets test the output –

 

1
2
3
4
5
SELECT *
FROM VT3
WHERE OLD_PRCHSR_ORG_NM <> NEW_PRCHSR_ORG_NM;

--SELECT completed. 0 rows returned. Elapsed Time =  00:00:01.763

 

So, as you can see that from the above simulation – we can establish that the performance between the conventional SQL & SQL using Regular expression are negligible.

But, again I must clearly say – Regular expression will be ideal where we need multiple SQLs or PL/SQL to implement. Or, the place where you need to implement one complex parsing that is difficult to implement in a SQL.

Hope this will give you some clarity. 😀