Demonstration of GPT-3 model tuning using Python for an upcoming PyPI package

Today, I’m very excited to demonstrate an effortless & new way to fine-tune the GPT-3 model using Python with the help of my newly built (and not yet published) PyPI package. In this post, the tuned OpenAI model will return a custom website link from this site as part of its response, depending upon the user’s query.

In this post, I’ve directly subscribed to OpenAI & I’m not using OpenAI from Azure. However, I’ll explore that in the future as well.

Before I explain the process to invoke this new library, why not view the demo first & then discuss it?

Demo

Isn’t it exciting? Finally, we can efficiently handle your custom website URL using an OpenAI tuned model.


What is ChatGPT?

ChatGPT is an advanced artificial intelligence language model developed by OpenAI based on the GPT-4 architecture. As an AI model, it is designed to understand and generate human-like text based on the input it receives. ChatGPT can engage in various tasks, such as answering questions, providing recommendations, creating content, and simulating conversation. While it is highly advanced and versatile, it’s important to note that ChatGPT’s knowledge is limited to the data it was trained on, with a cutoff date of September 2021.

When to tune a GPT model?

Tuning a GPT or any AI model might be necessary for various reasons. Here are some common scenarios when you should consider adjusting or fine-tuning a GPT model:

  1. Domain-specific knowledge: If you need your model to have a deeper understanding of a specific domain or industry, you can fine-tune it with domain-specific data to improve its performance.
  2. New or updated data: If new or updated information is not part of the original training data, you should fine-tune the model to ensure it has the most accurate and up-to-date knowledge.
  3. Customization: If you require the model to have a specific style, tone, or focus, you can fine-tune it with data that reflects those characteristics.
  4. Ethical or safety considerations: To make the model safer and more aligned with human values, you should fine-tune it to reduce biased or harmful outputs.
  5. Improve performance: If the base model’s performance is unsatisfactory for a particular task or application, you can fine-tune it on a dataset more relevant to the job, often leading to better results.

Remember that tuning or fine-tuning a GPT model requires access to appropriate data and computational resources and an understanding of the model’s architecture and training techniques. Additionally, monitoring and evaluating the model’s performance after fine-tuning is essential to ensure that the desired improvements have been achieved.


FLOW OF EVENTS:

Let us look at the flow diagram as it captures the sequence of events that unfold as part of the process.

The initial Python-based client interacts with the tuned OpenAI models. This process enables it to get a precise response with custom data in a very convenient way, so that anyone can follow it.


SOURCE DATA:

Let us understand how to feed the source data as it will deal with your website URL link.

The first dataset that we are going to talk about is the one that contains the hyperlinks. Let us explore the sample here.

From the above diagram, one can easily understand that the application will interpret a unique hash number associated with a specific URL. This data will be used to look up the URL after the OpenAI response from the tuned model as a result of any user query.

Now, let us understand the actual source data.

If we closely check, we’ll see the source file contains two columns – prompt & completion. And the website reference is put inside the curly braces as shown – “{Hash Code that represents your URL}.”

During the response, the newly created library replaces the hash value with the correct URL after the successful lookup & presents the complete answer.
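The replacement step described above can be sketched in a few lines. This is a minimal illustration, assuming the lookup file has been loaded into a dictionary; the hash code, URL, and function name here are hypothetical, not the package’s actual internals:

```python
import re

# Hypothetical lookup table (hash code -> URL), as loaded from the lookup CSV
lkp = {"A1B2C3": "https://example.com/my-post"}

def expandHash(completionText, lkp):
    # Replace every "{HashCode}" token with its URL from the lookup table;
    # unknown hash codes are left untouched.
    return re.sub(r"\{(\w+)\}",
                  lambda m: lkp.get(m.group(1), m.group(0)),
                  completionText)

print(expandHash("Please visit {A1B2C3} for details.", lkp))
```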

CODE:

Why don’t we go through the code made accessible by this new library for this particular use case?

  • clsConfigClient.py (This is the main calling Python script for the input parameters.)


################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
#### Modified On: 21-Feb-2023               ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### OpenAI fine-tune projects.             ####
####                                        ####
################################################
import os
import platform as pl

class clsConfigClient(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    conf = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'DATA_PATH': Curr_Path + sep + 'data' + sep,
        'TEMP_PATH': Curr_Path + sep + 'temp' + sep,
        'MODEL_DIR': 'model',
        'APP_DESC_1': 'ChatGPT Training!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'FILE_NAME': '2023-4-14-WP.csv',
        'LKP_FILE_NAME': 'HyperDetails.csv',
        'TEMP_FILE_NAME': 'chatGPTData.jsonl',
        'TITLE': "GPT-3 Training!",
        'PATH': Curr_Path,
        'OUT_DIR': 'data',
        'OPEN_API_KEY': 'sk-hdhrujfrkfjfjfjfhjfjfisososT&jsdgL6KIxx',
        'MODEL_CD': 'davinci',
        'URL': 'https://api.openai.com/v1/fine-tunes/',
        'EPOCH': 10,
        'SUFFIX': 'py-saty',
        'EXIT_KEYWORD': 'bye'
    }

Some of the important entries that we will require later are as follows –

'FILE_NAME': '2023-4-14-WP.csv',
'LKP_FILE_NAME': 'HyperDetails.csv',
'OPEN_API_KEY': 'sk-hdhrujfrkfjfjfjfhjfjfisososT&jsdgL6KIxx',
'MODEL_CD':'davinci',
'URL': 'https://api.openai.com/v1/fine-tunes/',
'EXIT_KEYWORD': 'bye'

We’ll discuss these entries later.

  • trainChatGPTModel.py (This is the main calling Python script that will invoke the newly created fine-tune GPT-3 enabler.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 12-Feb-2023                     ####
#### Modified On 16-Feb-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### newly created fine-tune GPT-3 enabler.      ####
####                                             ####
#####################################################
import pandas as p
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime

import clsTrainModel3 as tm

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Get your global values        ####
######################################
debug_ind = 'Y'

#tModel = tm.clsTrainModel()
tModel = tm.clsTrainModel3()

# Initiating Logging Instances
clog = cl.clsL()

data_path = cf.conf['DATA_PATH']
data_file_name = cf.conf['FILE_NAME']

######################################
####         Global Flag      ########
######################################

######################################
### Wrapper functions to invoke   ###
### the desired class from newly  ###
### built class.                  ###
######################################

######################################
### End of wrapper functions.     ###
######################################

def main():
    try:
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('Start Time: ' + str(var))
        print('*'*120)

        FullFileName = data_path + data_file_name

        r1 = tModel.trainModel(FullFileName)

        if r1 == 0:
            print('Successfully Trained!')
        else:
            print('Failed to Train!')

        #clog.logr(OutPutFileName, debug_ind, df, subdir)

        print('*'*120)
        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('End Time: ' + str(var1))
    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

Following are the key snippets from the above script –

data_path = cf.conf['DATA_PATH']
data_file_name = cf.conf['FILE_NAME']

And, then –

tModel = tm.clsTrainModel3()
FullFileName = data_path + data_file_name
r1 = tModel.trainModel(FullFileName)

As one can see, the package needs only the source data file to fine-tune GPT-3 model.
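Internally, the prompt/completion CSV has to be converted into the JSONL layout that legacy OpenAI fine-tune jobs consume (the TEMP_FILE_NAME config entry above hints at this step). A minimal stand-in for that conversion, assuming the same two column names; the actual clsTrainModel3 internals are not shown in this post:

```python
import csv
import json

def csvToJsonl(srcCsv, outJsonl):
    # Convert a prompt/completion CSV into JSONL: one JSON object per line,
    # which is the input format the legacy OpenAI fine-tune jobs expect.
    with open(srcCsv, newline='') as fin, open(outJsonl, 'w') as fout:
        for row in csv.DictReader(fin):
            fout.write(json.dumps({"prompt": row["prompt"],
                                   "completion": row["completion"]}) + "\n")
```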

  • checkFineTuneChatGPTModelStat.py (This is the main Python script that will check the status of the tuned process that will happen inside the OpenAI-cloud environment.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 12-Feb-2023                     ####
#### Modified On 16-Feb-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### newly created fine-tune job status inside   ####
#### the OpenAI environment.                     ####
#####################################################
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime

import clsTestModel3 as tm

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Get your global values        ####
######################################
debug_ind = 'Y'

# Initiating Logging Instances
clog = cl.clsL()

tmodel = tm.clsTestModel3()

url_part = cf.conf['URL']
open_api_key = cf.conf['OPEN_API_KEY']

######################################
####         Global Flag      ########
######################################

def main():
    try:
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('Start Time: ' + str(var))
        print('*'*120)

        # Example usage
        input_text = str(input("Please provide the fine tune Id (Start with ft-*): "))
        url = url_part + input_text
        print('URL: ', url)

        r1 = tmodel.checkStat(url, open_api_key)

        if r1 == 0:
            print('Successfully checked the status of tuned GPT-3 model.')
        else:
            print('Failed to check the status of the tuned GPT-3 model.')

        print('*'*120)
        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('End Time: ' + str(var1))
    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

To check the status of the fine-tuned job inside the OpenAI environment, one needs to provide the fine-tune id, which generally starts with “ft-*.” One would get this value after the training script’s successful run.

Some of the other key snippets are –

tmodel = tm.clsTestModel3()

url_part = cf.conf['URL']
open_api_key = cf.conf['OPEN_API_KEY']

And, then –

input_text = str(input("Please provide the fine tune Id (Start with ft-*): "))
url = url_part + input_text
print('URL: ', url)

r1 = tmodel.checkStat(url, open_api_key)

The above snippet is self-explanatory as one is passing the fine tune id along with the OpenAI API key.
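Under the hood, such a status check is just an authenticated GET against the legacy fine-tunes endpoint. A minimal standard-library sketch follows; the helper names are mine, not the library’s:

```python
import json
import urllib.request

def buildStatusRequest(urlPart, fineTuneId, apiKey):
    # Compose the endpoint URL and the Bearer-auth header for the status call
    return urlPart + fineTuneId, {"Authorization": "Bearer " + apiKey}

def checkStat(urlPart, fineTuneId, apiKey):
    # One GET against the legacy fine-tunes endpoint; the JSON payload
    # carries a "status" field (pending / running / succeeded / failed).
    url, headers = buildStatusRequest(urlPart, fineTuneId, apiKey)
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        payload = json.loads(resp.read())
    print('Status:', payload.get("status"))
    return 0
```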

  • testChatGPTModel.py (This is the main testing Python script that will invoke the newly created fine-tune GPT-3 enabler to get a response with custom data.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 12-Feb-2023                     ####
#### Modified On 19-Apr-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### newly created class that will test the      ####
#### tuned model output.                         ####
#####################################################
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime
import pandas as p

import clsTestModel3 as tm

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Get your global values        ####
######################################
debug_ind = 'Y'

# Initiating Logging Instances
clog = cl.clsL()

tmodel = tm.clsTestModel3()

open_api_key = cf.conf['OPEN_API_KEY']
lkpDataPath = cf.conf['DATA_PATH']
lkpFileName = cf.conf['LKP_FILE_NAME']

######################################
####         Global Flag      ########
######################################

def main():
    try:
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*' * 120)
        print('Start Time: ' + str(var))
        print('*' * 120)

        LookUpFileName = lkpDataPath + lkpFileName

        r1 = tmodel.testModel(LookUpFileName, open_api_key)

        if r1 == 0:
            print('Successfully tested the tuned GPT-3 model.')
        else:
            print('Failed to test the tuned GPT-3 model.')

        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('End Time: ' + str(var1))
    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

Some of the key entries from the above snippet are as follows –

tmodel = tm.clsTestModel3()

open_api_key = cf.conf['OPEN_API_KEY']
lkpDataPath = cf.conf['DATA_PATH']
lkpFileName = cf.conf['LKP_FILE_NAME']

And, then –

LookUpFileName = lkpDataPath + lkpFileName
r1 = tmodel.testModel(LookUpFileName, open_api_key)

In the above lines, the application gets the correct URL value from the lookup file we’ve prepared for this specific use case.

  • deleteChatGPTModel.py (This is the main Python script that will delete the old intended tuned model, which is no longer needed.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 12-Feb-2023                     ####
#### Modified On 21-Feb-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### newly created delete model methods for      ####
#### OpenAI.                                     ####
#####################################################
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime

import clsTestModel3 as tm

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Get your global values        ####
######################################
debug_ind = 'Y'

# Initiating Logging Instances
clog = cl.clsL()

tmodel = tm.clsTestModel3()

open_api_key = cf.conf['OPEN_API_KEY']

######################################
####         Global Flag      ########
######################################

def main():
    try:
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*' * 120)
        print('Start Time: ' + str(var))
        print('*' * 120)

        r1 = tmodel.delOldModel(open_api_key)

        if r1 == 0:
            print('Successfully deleted the old tuned GPT-3 model.')
        else:
            print('Failed to delete the old tuned GPT-3 model.')

        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('End Time: ' + str(var1))
    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

Some of the key snippets from the above scripts are –

tmodel = tm.clsTestModel3()

open_api_key = cf.conf['OPEN_API_KEY']

And, then –

r1 = tmodel.delOldModel(open_api_key)

We’ve demonstrated that using a straightforward method, one can delete any old tuned model from OpenAI that is no longer required.
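For reference, deleting a fine-tuned model boils down to a DELETE call against OpenAI’s /v1/models endpoint. A minimal sketch using only the standard library; the helper name is hypothetical:

```python
import urllib.request

def buildDeleteRequest(modelId, apiKey):
    # DELETE /v1/models/{model} removes a fine-tuned model you own;
    # the API key again travels as a Bearer token.
    return urllib.request.Request(
        "https://api.openai.com/v1/models/" + modelId,
        headers={"Authorization": "Bearer " + apiKey},
        method="DELETE")

def delOldModel(modelId, apiKey):
    # Issue the DELETE and report success via a 0/1 return code
    with urllib.request.urlopen(buildDeleteRequest(modelId, apiKey)) as resp:
        return 0 if resp.status == 200 else 1
```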

KEY FEATURES TO CONSIDER DURING TUNING:

  • Data quality: Ensure that the data used for fine-tuning is clean, relevant, and representative of the domain you want the model to understand. Check for biases, inconsistencies, and errors in the dataset.
  • Overfitting: Be cautious of overfitting, which occurs when the model performs exceptionally well on the training data but poorly on unseen data. You can address overfitting by using regularization techniques, early stopping, or cross-validation.
  • Model size and resource requirements: GPT models can be resource-intensive. Be mindful of the hardware limitations and computational resources available when selecting the model size and the time and cost associated with training.
  • Hyperparameter tuning: Select appropriate hyperparameters for your fine-tuning processes, such as learning rate, batch size, and the number of epochs. Experiment with different combinations to achieve the best results without overfitting.
  • Evaluation metrics: Choose suitable evaluation metrics to assess the performance of your fine-tuned model. Consider using multiple metrics to understand your model’s performance comprehensively.
  • Ethical considerations: Be aware of potential biases in your dataset and how the model’s predictions might impact users. Address ethical concerns during the fine-tuning process and consider using techniques such as data augmentation or adversarial training to mitigate these biases.
  • Monitoring and maintenance: Continuously monitor the model’s performance after deployment, and be prepared to re-tune or update it as needed. Regular maintenance ensures that the model remains relevant and accurate.
  • Documentation: Document your tuning process, including the data used, model architecture, hyperparameters, and evaluation metrics. This factor will facilitate easier collaboration, replication, and model maintenance.
  • Cost: OpenAI fine-tuning can be extremely expensive, even for a small volume of data. Hence, organization-wise, one needs to be extremely careful while using this feature.

COST FACTOR:

Before we discuss the actual spending, let us understand the tested data volume to train & tune the model.

So, we’re talking about a total size of 500 KB (at max). And we did 10 epochs during the training, as you can see from the config file mentioned above.

So, it is pretty expensive. Use it wisely.
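A back-of-the-envelope estimate of that training spend, under two stated assumptions: roughly 4 characters per token, and davinci’s legacy fine-tune rate of about $0.03 per 1K training tokens (always verify against OpenAI’s current pricing page before relying on this):

```python
def estTrainCost(fileSizeKb, epochs, ratePer1kTok=0.03, charsPerTok=4):
    # Assumed: ~4 chars/token and $0.03 per 1K training tokens (legacy davinci).
    # Every epoch re-processes the full token count.
    tokens = fileSizeKb * 1024 / charsPerTok
    return tokens * epochs * ratePer1kTok / 1000

# 500 KB of training data, 10 epochs
print(round(estTrainCost(500, 10), 2))
```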


So, finally, we’ve done it.

I know that this post is relatively bigger than my earlier post. But, I think, you can get all the details once you go through it.

You will get the complete codebase in the following GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse. Please share & subscribe to my post & let me know your feedback.

Till then, Happy Avenging! 🙂

Note: All the data & scenarios posted here are representational, available over the internet & for educational purposes only. Some of the images (except my photo) we’ve used are available over the net. We don’t claim ownership of these images. There is always room for improvement, especially in the prediction quality.

Personal Virtual Assistant (SJ) implemented using Python-based OpenAI, Rev-AI & pyttsx3.

Today, I will discuss our virtual personal assistant (SJ), built with a combination of AI-driven APIs, which is now operational in Python. We will use three of the most potent APIs: OpenAI, Rev-AI & pyttsx3. Why don’t we see the demo first?

Great! Let us understand how we can leverage this by writing a tiny snippet using this new AI model.

Architecture:

Let us understand the flow of events –

The application first invokes the API to capture the audio spoken through the audio device & then translates it into text, which is then parsed & shared as input to OpenAI for a response to the posted queries. Once OpenAI shares the response, the Python-based engine takes the response & uses pyttsx3 to convert it to voice.
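That flow can be sketched as a single turn of the assistant, with the three engines passed in as plain callables so the wiring itself is clear; the function and parameter names here are illustrative, not the actual class methods:

```python
def assistantTurn(captureVoice, askOpenAI, speak):
    # One turn of the assistant:
    query = captureVoice()     # Rev-AI: audio -> text
    answer = askOpenAI(query)  # OpenAI: text -> response
    speak(answer)              # pyttsx3: response -> voice
    return answer

# Stubbed-out engines show the wiring without a microphone or API keys
print(assistantTurn(lambda: "hello",
                    lambda q: "echo: " + q,
                    lambda a: None))
```

Injecting the engines this way keeps each stage independently replaceable and testable.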


Python Packages:

Following are the Python packages that are necessary to develop this brilliant use case –

pip install openai==0.25.0
pip install PyAudio==0.2.13
pip install playsound==1.3.0
pip install pandas==1.5.2
pip install rev-ai==2.17.1
pip install six==1.16.0
pip install websocket-client==0.59.0

CODE:

Let us now understand the code. For this use case, we will only discuss three Python scripts, although the solution needs more than these three. However, we have already discussed the rest in some of my earlier posts, so we will skip them here.

  • clsConfigClient.py (Main configuration file)


################################################
#### Written By: SATYAKI DE                 ####
#### Written On: 15-May-2020                ####
#### Modified On: 31-Dec-2022               ####
####                                        ####
#### Objective: This script is a config     ####
#### file, contains all the keys for        ####
#### personal AI-driven voice assistant.    ####
####                                        ####
################################################
import os
import platform as pl

class clsConfigClient(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    conf = {
        'APP_ID': 1,
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'output' + sep,
        'REPORT_DIR': 'output',
        'SRC_PATH': Curr_Path + sep + 'data' + sep,
        'CODE_PATH': Curr_Path + sep + 'Code' + sep,
        'APP_DESC_1': 'Personal Voice Assistant (SJ)!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'TITLE': "Personal Voice Assistant (SJ)!",
        'PATH': Curr_Path,
        'OPENAI_API_KEY': "sk-aapwfMWDuFE5XXXUr2BH",
        'REVAI_API_KEY': "02ks6kFhEKjdhdure8474JJAJJ945958_h8P_DEKDNkK6DwNNNHU17aRtCw",
        'MODEL_NAME': "code-davinci-002",
        "speedSpeech": 170,
        "speedPitch": 0.8,
        "soundRate": 44100,
        "contentType": "audio/x-raw",
        "layout": "interleaved",
        "format": "S16LE",
        "channels": 1
    }

A few of the essential entries from the above snippet, which one should look for, are –

'OPENAI_API_KEY': "sk-aapwfMWDuFE5XXXUr2BH",
'REVAI_API_KEY': "02ks6kFhEKjdhdure8474JJAJJ945958_h8P_DEKDNkK6DwNNNHU17aRtCw",
'MODEL_NAME': "code-davinci-002",
"speedSpeech": 170,
"speedPitch": 0.8,
"soundRate": 44100,
"contentType": "audio/x-raw",
"layout": "interleaved",
"format": "S16LE",
"channels": 1

Note that the API keys shown are not real. You need to generate your own keys.

  • clsText2Voice.py (The python script that will convert text to voice)


###############################################
#### Written By: SATYAKI DE                ####
#### Written On: 27-Oct-2019               ####
#### Modified On 28-Jan-2023               ####
####                                       ####
#### Objective: Main class converting      ####
#### text to voice using third-party API.  ####
###############################################
import pyttsx3
from clsConfigClient import clsConfigClient as cf

class clsText2Voice:
    def __init__(self):
        self.speedSpeech = cf.conf['speedSpeech']
        self.speedPitch = cf.conf['speedPitch']

    def getAudio(self, srcString):
        try:
            speedSpeech = self.speedSpeech
            speedPitch = self.speedPitch

            engine = pyttsx3.init()

            # Set the speed of the speech (in words per minute)
            engine.setProperty('rate', speedSpeech)

            # Set the pitch of the speech (1.0 is default)
            engine.setProperty('pitch', speedPitch)

            # Converting to MP3
            engine.say(srcString)
            engine.runAndWait()

            return 0
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            return 1

Some of the important snippets are as follows –

def getAudio(self, srcString):
    try:
        speedSpeech = self.speedSpeech
        speedPitch = self.speedPitch
        
        engine = pyttsx3.init()

        # Set the speed of the speech (in words per minute)
        engine.setProperty('rate', speedSpeech)

        # Set the pitch of the speech (1.0 is default)
        engine.setProperty('pitch', speedPitch)

        # Converting to MP3
        engine.say(srcString)
        engine.runAndWait()

        return 0

The code is a function that generates speech audio from a given string using the Pyttsx3 library in Python. The function sets the speech rate and pitch using the “speedSpeech” and “speedPitch” properties of the calling object, initializes the Pyttsx3 engine, sets the speech rate and pitch on the engine, speaks the given string, and waits for the speech to finish. The function returns 0 after the speech is finished.


  • clsChatEngine.py (This python script will invoke the ChatGPT OpenAI class to initiate the response of the queries in python.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 26-Dec-2022                     ####
#### Modified On 28-Jan-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### ChatGPT OpenAI class to initiate the        ####
#### response of the queries in python.          ####
#####################################################
import os
import openai
import json
from clsConfigClient import clsConfigClient as cf
import sys
import errno

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

###############################################
###           Global Section                ###
###############################################

CODE_PATH = str(cf.conf['CODE_PATH'])
MODEL_NAME = str(cf.conf['MODEL_NAME'])

###############################################
###    End of Global Section                ###
###############################################

class clsChatEngine:
    def __init__(self):
        self.OPENAI_API_KEY = str(cf.conf['OPENAI_API_KEY'])

    def findFromSJ(self, text):
        try:
            OPENAI_API_KEY = self.OPENAI_API_KEY

            # ChatGPT API_KEY
            openai.api_key = OPENAI_API_KEY

            print('22'*60)

            try:
                # Getting response from ChatGPT
                response = openai.Completion.create(
                engine=MODEL_NAME,
                prompt=text,
                max_tokens=64,
                top_p=1.0,
                n=3,
                temperature=0,
                frequency_penalty=0.0,
                presence_penalty=0.0,
                stop=["\"\"\""]
                )
            except IOError as e:
                if e.errno == errno.EPIPE:
                    pass

            print('44'*60)
            res = response.choices[0].text

            return res

        except IOError as e:
            if e.errno == errno.EPIPE:
                pass

        except Exception as e:
            x = str(e)
            print(x)

            print('66'*60)

            return x

Key snippets from the above script are as follows –

def findFromSJ(self, text):
      try:
          OPENAI_API_KEY = self.OPENAI_API_KEY

          # ChatGPT API_KEY
          openai.api_key = OPENAI_API_KEY

          print('22'*60)

          try:
              # Getting response from ChatGPT
              response = openai.Completion.create(
              engine=MODEL_NAME,
              prompt=text,
              max_tokens=64,
              top_p=1.0,
              n=3,
              temperature=0,
              frequency_penalty=0.0,
              presence_penalty=0.0,
              stop=["\"\"\""]
              )
          except IOError as e:
              if e.errno == errno.EPIPE:
                  pass

          print('44'*60)
          res = response.choices[0].text

          return res

      except IOError as e:
          if e.errno == errno.EPIPE:
              pass

      except Exception as e:
          x = str(e)
          print(x)

          print('66'*60)

          return x

The code is a function that uses OpenAI’s ChatGPT model to generate text based on a given prompt text. The function takes the text to be completed as input and uses an API key stored in the OPENAI_API_KEY property of the calling object to send a request to OpenAI’s API. If the request is successful, the function returns the top completion generated by the model, as stored in the text field of the first item in the choices list of the API response.

The function includes error handling for IOError and Exception. If an IOError occurs, the function checks if the error number is errno.EPIPE and, if it is, returns without doing anything. If an Exception occurs, the function converts the error message to a string and prints it, then returns the string.


  • clsVoice2Text.py (This python script will invoke the Rev-AI class to initiate the transformation of audio into the text.)


#####################################################
#### Written By: SATYAKI DE                      ####
#### Written On: 26-Dec-2022                     ####
#### Modified On 28-Jan-2023                     ####
####                                             ####
#### Objective: This is the main calling         ####
#### python script that will invoke the          ####
#### Rev-AI class to initiate the transformation ####
#### of audio into the text.                     ####
#####################################################
import pyaudio
from rev_ai.models import MediaConfig
from rev_ai.streamingclient import RevAiStreamingClient
from six.moves import queue
import ssl
import json
import pandas as p
import clsMicrophoneStream as ms
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime

# Initiating Log class
l = cl.clsL()

# Bypassing SSL Authentication
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Insert your access token here ####
######################################
debug_ind = 'Y'

################################################################
### Sampling rate of your microphone and desired chunk size ####
################################################################

class clsVoice2Text:
    def __init__(self):
        self.OPENAI_API_KEY = str(cf.conf['OPENAI_API_KEY'])
        self.rate = cf.conf['soundRate']

    def processVoice(self, var):
        try:
            OPENAI_API_KEY = self.OPENAI_API_KEY
            accessToken = cf.conf['REVAI_API_KEY']
            rate = self.rate
            chunk = int(rate/10)

            ################################################################
            ### Creates a media config with the settings set for a raw  ####
            ### microphone input                                        ####
            ################################################################
            sampleMC = MediaConfig('audio/x-raw', 'interleaved', 44100, 'S16LE', 1)

            streamclient = RevAiStreamingClient(accessToken, sampleMC)

            #####################################################################
            ### Opens microphone input. The input will stop after a keyboard ####
            ### interrupt.                                                   ####
            #####################################################################
            with ms.clsMicrophoneStream(rate, chunk) as stream:

                #####################################################################
                ### Uses try method to enable users to manually close the stream ####
                #####################################################################
                try:
                    response_gen = ''
                    response = ''
                    finalText = ''

                    #########################################################################
                    ### Starts the server connection and thread sending microphone audio ####
                    #########################################################################
                    response_gen = streamclient.start(stream.generator())

                    ###################################################
                    ### Iterates through responses and prints them ####
                    ###################################################
                    for response in response_gen:
                        try:
                            print('JSON:')
                            print(response)

                            r = json.loads(response)

                            df = p.json_normalize(r["elements"])
                            l.logr('1.df_' + var + '.csv', debug_ind, df, 'log')
                            column_name = "confidence"

                            if column_name in df.columns:
                                print('DF:: ')
                                print(df)

                                finalText = "".join(df["value"])
                                print("TEXT:")
                                print(finalText)

                                df = p.DataFrame()

                                raise Exception

                        except Exception as e:
                            x = str(e)
                            break

                    streamclient.end()

                    return finalText

                except Exception as e:
                    x = str(e)

                    #######################################
                    ### Ends the WebSocket connection. ####
                    #######################################
                    streamclient.end()

                    return ''

        except Exception as e:
            x = str(e)
            print('Error: ', x)

            streamclient.end()

            return x

Here is the important snippet from the above code –

def processVoice(self, var):
      try:
          OPENAI_API_KEY = self.OPENAI_API_KEY
          accessToken = cf.conf['REVAI_API_KEY']
          rate = self.rate
          chunk = int(rate/10)

          ################################################################
          ### Creates a media config with the settings set for a raw  ####
          ### microphone input                                        ####
          ################################################################

          sampleMC = MediaConfig('audio/x-raw', 'interleaved', 44100, 'S16LE', 1)

          streamclient = RevAiStreamingClient(accessToken, sampleMC)

          #####################################################################
          ### Opens microphone input. The input will stop after a keyboard ####
          ### interrupt.                                                   ####
          #####################################################################

          with ms.clsMicrophoneStream(rate, chunk) as stream:

              #####################################################################
              ### Uses try method to enable users to manually close the stream ####
              #####################################################################

              try:
                  response_gen = ''
                  response = ''
                  finalText = ''
                  
                  ############################################
                  ### Starts the server connection        ####
                  ### and thread sending microphone audio #### 
                  ############################################

                  response_gen = streamclient.start(stream.generator())

                  ###################################################
                  ### Iterates through responses and prints them ####
                  ###################################################

                  for response in response_gen:
                      try:
                          print('JSON:')
                          print(response)

                          r = json.loads(response)

                          df = p.json_normalize(r["elements"])
                          l.logr('1.df_' + var + '.csv', debug_ind, df, 'log')
                          column_name = "confidence"

                          if column_name in df.columns:
                              print('DF:: ')
                              print(df)

                              finalText = "".join(df["value"])
                              print("TEXT:")
                              print(finalText)

                              df = p.DataFrame()

                              raise Exception

                      except Exception as e:
                          x = str(e)
                          break

                  streamclient.end()

                  return finalText

              except Exception as e:
                  x = str(e)
                  #######################################
                  ### Ends the WebSocket connection. ####
                  #######################################

                  streamclient.end()

                  return ''

      except Exception as e:
          x = str(e)
          print('Error: ', x)

          streamclient.end()

          return x

The code is a Python function called processVoice() that processes a user’s voice input using the Rev AI streaming API. The function takes one argument, “var,” which is only used to tag the log file name.

  1. Let us understand the code –
    • First, the function sets several variables, including the Rev AI API access token, the sample rate, and the chunk size for the audio input.
    • Then, it creates a media configuration object for raw microphone input.
    • A RevAiStreamingClient object is created using the access token and the media configuration.
    • The code opens the microphone input using a with statement and the microphone stream class.
    • Within the with block, the code starts the server connection and a thread that sends microphone audio to the server.
    • The code then iterates through the responses from the server, normalizing the JSON response and storing the values in a pandas data-frame.
    • If the “confidence” column exists in the data-frame, the code joins all the values to form the final text and deliberately raises an exception to break out of the response loop.
    • Once the loop ends, the WebSocket connection is closed, and the final text is returned.
    • If any error occurs, the WebSocket connection is also closed, and an empty string or the error message is returned.
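The response-parsing step above can be exercised without a microphone. The sketch below feeds a hypothetical, hand-made payload (shaped like the Rev AI “elements” responses handled in processVoice()) through the same json_normalize & join logic; the sample words and confidences are made up:

```python
import json
import pandas as p

# A hypothetical final-hypothesis payload, shaped like the Rev AI
# streaming responses handled in processVoice() above.
response = json.dumps({
    "type": "final",
    "elements": [
        {"type": "text", "value": "Hello", "confidence": 0.98},
        {"type": "punct", "value": " "},
        {"type": "text", "value": "world", "confidence": 0.95},
    ],
})

r = json.loads(response)

# Flatten the list of elements into a data-frame, as the function does.
df = p.json_normalize(r["elements"])

# Only final hypotheses carry a "confidence" column; partials do not.
if "confidence" in df.columns:
    finalText = "".join(df["value"])
    print(finalText)   # Hello world
```

The “confidence” check is what lets the function skip the partial hypotheses and only keep the finalized transcript.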

  • clsMicrophoneStream.py (This Python script invokes the rev_ai template to capture the chunked voice data & stream it to the service for text translation & return the response to the app.)


#####################################################
#### Modified By: SATYAKI DE ####
#### Modified On 28-Jan-2023 ####
#### ####
#### Objective: This is the main calling ####
#### python script that will invoke the ####
#### rev_ai template to capture the chunk voice ####
#### data & stream it to the service for text ####
#### translation & return the response to app. ####
#####################################################
import pyaudio
from rev_ai.models import MediaConfig
from six.moves import queue

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

class clsMicrophoneStream(object):
    #############################################
    ### Opens a recording stream as a        ####
    ### generator yielding the audio chunks. ####
    #############################################
    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        ##################################################
        ### Create a thread-safe buffer of audio data ####
        ##################################################

        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,

            #########################################################
            ### The API currently only supports 1-channel (mono) ####
            ### audio.                                           ####
            #########################################################

            channels=1, rate=self._rate,
            input=True, frames_per_buffer=self._chunk,

            ####################################################################
            ### Run the audio stream asynchronously to fill the buffer      ####
            ### object. This is necessary so that the input device's        ####
            ### buffer doesn't overflow while the calling thread makes      ####
            ### network requests, etc.                                      ####
            ####################################################################

            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True

        ###############################################################
        ### Signal the generator to terminate so that the client's ####
        ### streaming_recognize method will not block the process  ####
        ### termination.                                           ####
        ###############################################################

        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        ##############################################################
        ### Continuously collect data from the audio stream, into ####
        ### the buffer.                                           ####
        ##############################################################

        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            ######################################################################
            ### Use a blocking get() to ensure there's at least one chunk of  ####
            ### data, and stop iteration if the chunk is None, indicating the ####
            ### end of the audio stream.                                      ####
            ######################################################################

            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            ##########################################################
            ### Now consume whatever other data's still buffered. ####
            ##########################################################

            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b''.join(data)

The key snippets from the above script are as follows –

def __enter__(self):
    self._audio_interface = pyaudio.PyAudio()
    self._audio_stream = self._audio_interface.open(
        format=pyaudio.paInt16,

        #########################################################
        ### The API currently only supports 1-channel (mono) ####
        ### audio.                                           ####
        #########################################################

        channels=1, rate=self._rate,
        input=True, frames_per_buffer=self._chunk,

        ####################################################################
        ### Run the audio stream asynchronously to fill the buffer      ####
        ### object. This is necessary so that the input device's        ####
        ### buffer doesn't overflow while the calling thread makes      ####
        ### network requests, etc.                                      ####
        ####################################################################

        stream_callback=self._fill_buffer,
    )

    self.closed = False

    return self

This code is a part of a context manager class (clsMicrophoneStream) and implements the __enter__ method of the class. The method sets up a PyAudio object and opens an audio stream using the PyAudio object. The audio stream is configured to have the following properties:

  • Format: 16-bit integer (paInt16)
  • Channels: 1 (mono)
  • Rate: The rate specified in the instance of the ms.clsMicrophoneStream class.
  • Input: True, meaning the audio stream is an input stream, not an output stream.
  • Frames per buffer: The chunk specified in the instance of the ms.clsMicrophoneStream class.
  • Stream callback: The method self._fill_buffer will be called when the buffer needs more data.

The self.closed attribute is set to False to indicate that the stream is open. The method returns the instance of the class (self).
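The enter/exit protocol itself can be illustrated without any audio hardware. Below is a minimal, hypothetical stand-in class (clsDummyStream is not part of the project) that mirrors the open/close life-cycle of clsMicrophoneStream:

```python
class clsDummyStream(object):
    # Hypothetical stand-in mirroring the open/close life-cycle of
    # clsMicrophoneStream, without touching PyAudio.
    def __init__(self):
        self.closed = True

    def __enter__(self):
        # Acquire the "resource" & mark the stream as open.
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        # Release the "resource" & mark the stream as closed.
        self.closed = True

with clsDummyStream() as stream:
    wasOpen = not stream.closed

print('Open inside the with block:', wasOpen)      # True
print('Closed after the with block:', stream.closed)  # True
```

The with statement guarantees that __exit__ runs even if the body raises, which is exactly why the real class uses it to stop and close the PyAudio stream.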

def __exit__(self, type, value, traceback):
    self._audio_stream.stop_stream()
    self._audio_stream.close()
    self.closed = True

    ###############################################################
    ### Signal the generator to terminate so that the client's ####
    ### streaming_recognize method will not block the process  ####
    ### termination.                                           ####
    ###############################################################

    self._buff.put(None)
    self._audio_interface.terminate()

The __exit__ method implements the exit behavior of a Python context manager. It is automatically called when the with block is exited.

The method stops and closes the audio stream, sets the closed attribute to True, and places None in the buffer. The terminate method of the PyAudio interface is then called to release any resources used by the audio stream.

def _fill_buffer(self, in_data, frame_count, time_info, status_flags):

    ##############################################################
    ### Continuously collect data from the audio stream, into ####
    ### the buffer.                                           ####
    ##############################################################

    self._buff.put(in_data)
    return None, pyaudio.paContinue

The _fill_buffer method is a callback function that runs asynchronously to continuously collect data from the audio stream and add it to the buffer.

The _fill_buffer method takes four arguments:

  • in_data: the raw audio data collected from the audio stream.
  • frame_count: the number of frames of audio data that were collected.
  • time_info: information about the timing of the audio data.
  • status_flags: flags that indicate the status of the audio stream.

The method adds the collected in_data to the buffer using the put method of the buffer object. It returns a tuple of None and pyaudio.paContinue to indicate that the audio stream should continue.

def generator(self):
    while not self.closed:
        ######################################################################
        ### Use a blocking get() to ensure there's at least one chunk of  ####
        ### data, and stop iteration if the chunk is None, indicating the ####
        ### end of the audio stream.                                      ####
        ######################################################################

        chunk = self._buff.get()
        if chunk is None:
            return
        data = [chunk]

        ##########################################################
        ### Now consume whatever other data's still buffered. ####
        ##########################################################

        while True:
            try:
                chunk = self._buff.get(block=False)
                if chunk is None:
                    return
                data.append(chunk)
            except queue.Empty:
                break

        yield b''.join(data)

The logic of the code “def generator(self):” is as follows:

The function generator runs a loop for as long as self.closed is False. Within the loop, it uses the blocking get() method of the buffer object (self._buff) to retrieve a chunk of audio data. If the retrieved chunk is None, the end of the audio stream has been reached, and the function returns.

If the retrieved chunk is not None, it appends it to the data list. The function then enters another inner loop that continues to retrieve chunks from the buffer using the non-blocking get() method until there are no more chunks left. Finally, the function yields the concatenated chunks of data as a single-byte string.
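The same drain-the-buffer pattern can be exercised with an ordinary queue.Queue and no microphone. The helper below is a hypothetical stand-in for the generator above (it loops forever instead of checking a closed flag, but the batching logic is identical):

```python
from queue import Queue, Empty

def batched(buff):
    # Mirrors clsMicrophoneStream.generator(): block for the first
    # chunk, stop on the None sentinel, then drain whatever else is
    # already buffered & yield it all as one byte-string.
    while True:
        chunk = buff.get()
        if chunk is None:
            return
        data = [chunk]
        while True:
            try:
                chunk = buff.get(block=False)
                if chunk is None:
                    return
                data.append(chunk)
            except Empty:
                break
        yield b''.join(data)

buff = Queue()
buff.put(b'ab')
buff.put(b'cd')

gen = batched(buff)
first = next(gen)   # drains both buffered chunks into one batch
print(first)        # b'abcd'

buff.put(None)      # sentinel: end of stream
rest = list(gen)
print(rest)         # []
```

Batching everything already buffered into a single yield keeps the consumer from falling behind the audio callback, which keeps filling the queue from its own thread.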


  • SJVoiceAssistant.py (Main calling Python script)


#####################################################
#### Written By: SATYAKI DE ####
#### Written On: 26-Dec-2022 ####
#### Modified On 31-Jan-2023 ####
#### ####
#### Objective: This is the main calling ####
#### python script that will invoke the ####
#### multiple classes to initiate the ####
#### AI-enabled personal assistant, which would ####
#### display & answer the queries through voice. ####
#####################################################
import pyaudio
from six.moves import queue
import ssl
import json
import pandas as p

import clsMicrophoneStream as ms
import clsL as cl
from clsConfigClient import clsConfigClient as cf
import datetime
import clsChatEngine as ce
import clsText2Voice as tv
import clsVoice2Text as vt

#from signal import signal, SIGPIPE, SIG_DFL
#signal(SIGPIPE,SIG_DFL)

###################################################
##### Adding the Instantiating Global classes #####
###################################################

x2 = ce.clsChatEngine()
x3 = tv.clsText2Voice()
x4 = vt.clsVoice2Text()

# Initiating Log class
l = cl.clsL()

###################################################
#####        End of Global Classes          #######
###################################################

# Bypassing SSL Authentication
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

# Disabling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

######################################
### Insert your access token here ####
######################################

debug_ind = 'Y'

######################################
####        Global Flag       ########
######################################

def main():
    try:
        spFlag = True

        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('Start Time: ' + str(var))
        print('*'*120)

        exitComment = 'THANKS.'

        while True:
            try:
                finalText = ''

                if spFlag == True:
                    finalText = x4.processVoice(var)
                else:
                    pass

                val = finalText.upper().strip()

                print('Main Return: ', val)
                print('Exit Call: ', exitComment)
                print('Length of Main Return: ', len(val))
                print('Length of Exit Call: ', len(exitComment))

                if val == exitComment:
                    break
                elif finalText == '':
                    spFlag = True
                else:
                    print('spFlag::', spFlag)
                    print('Inside: ', finalText)
                    resVal = x2.findFromSJ(finalText)

                    print('ChatGPT Response:: ')
                    print(resVal)

                    resAud = x3.getAudio(resVal)
                    spFlag = False
            except Exception as e:
                pass

        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('End Time: ' + str(var1))
        print('SJ Voice Assistant exited successfully!')
        print('*'*120)

    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

And, the key snippet from the above script –

def main():
    try:
        spFlag = True

        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('Start Time: ' + str(var))
        print('*'*120)

        exitComment = 'THANKS.'

        while True:
            try:
                finalText = ''

                if spFlag == True:
                    finalText = x4.processVoice(var)
                else:
                    pass

                val = finalText.upper().strip()

                print('Main Return: ', val)
                print('Exit Call: ', exitComment)
                print('Length of Main Return: ', len(val))
                print('Length of Exit Call: ', len(exitComment))

                if val == exitComment:
                    break
                elif finalText == '':
                    spFlag = True
                else:
                    print('spFlag::',spFlag)
                    print('Inside: ', finalText)
                    resVal = x2.findFromSJ(finalText)

                    print('ChatGPT Response:: ')
                    print(resVal)

                    resAud = x3.getAudio(resVal)
                    spFlag = False
            except Exception as e:
                pass

        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('End Time: ' + str(var1))
        print('SJ Voice Assistant exited successfully!')
        print('*'*120)

    except Exception as e:
        x = str(e)
        print('Error: ', x)

The code is a Python script that implements a voice-based chatbot (likely named “SJ Voice Assistant”). The code performs the following operations:

  1. Initialize the string “exitComment” to “THANKS.” and set the “spFlag” to True.
  2. Start an infinite loop until a specific condition breaks the loop.
  3. In the loop, try to process the input voice with a function called “processVoice()” from an object “x4”. Store the result in “finalText.”
  4. Convert “finalText” to upper case, remove leading/trailing whitespaces, and store it in “val.” Print “Main Return” and “Exit Call” with their length.
  5. If “val” equals “exitComment,” break the loop. If “finalText” is an empty string, set “spFlag” to True. Otherwise, perform further processing: a. Call the function “findFromSJ()” from an object “x2” with the input “finalText.” Store the result in “resVal.” b. Call the function “getAudio()” from an object “x3” with the input “resVal.” Store the result in “resAud.” Set “spFlag” to False.
  6. If an exception occurs, catch it and pass (do nothing).
  7. Finally, the application will exit by displaying the following text – “SJ Voice Assistant exited successfully!”
  8. If an exception occurs outside the loop, catch it and print the error message.
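The control flow described above can be sketched with stubbed-out components. Everything in the snippet below is hypothetical: the fake transcripts stand in for x4.processVoice(), and the fake reply function stands in for x2.findFromSJ():

```python
# Hypothetical stand-ins for x4.processVoice() & x2.findFromSJ().
transcripts = iter(['', 'What is GPT?', 'THANKS.'])

def processVoice():
    # Simulates one round of listening: silence, a question, exit phrase.
    return next(transcripts)

def findFromSJ(text):
    # Simulates the tuned-model lookup.
    return 'Answer to: ' + text

answers = []
spFlag = True
exitComment = 'THANKS.'

while True:
    # Only listen when spFlag is set, exactly as in main().
    finalText = processVoice() if spFlag else ''
    val = finalText.upper().strip()

    if val == exitComment:
        break               # the user said the exit phrase
    elif finalText == '':
        spFlag = True       # nothing heard: listen again
    else:
        answers.append(findFromSJ(finalText))
        spFlag = False      # play the answer before listening again

print(answers)   # ['Answer to: What is GPT?']
```

Note how the spFlag handshake alternates between listening and answering, and how the loop only ends when the normalized transcript matches the exit phrase.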

So, finally, we’ve done it.

I know that this post is longer than my earlier ones. But I think you can get all the details once you go through it.

You will get the complete codebase in the following GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse. Please share & subscribe to my post & let me know your feedback.

Till then, Happy Avenging! 🙂

Note: All the data & scenarios posted here are representational data & scenarios & available over the internet & for educational purposes only. Some of the images (except my photo) we’ve used are available over the net. We don’t claim ownership of these images. There is always room for improvement & especially in the prediction quality.

Canada’s Covid19 analysis based on Logistic Regression

Hi Guys,

Today, I’ll be demonstrating some scenarios based on open-source data from Canada. In this post, I will only explain the significant parts of the code, not the entire range of scripts.

Let’s explore a couple of sample source data –

2. Sample Input Data

I would like to explore how much this disease caused an impact on the elderly in Canada.

Let’s explore the source directory structure –

3. Source Directory Structures

For this, you need to install the following packages –

pip install pandas

pip install seaborn

Please find the PyPi link given below –

In this case, we’ve downloaded the data from Canada’s site. However, they have also published an API, so you can consume the data that way as well. Since the volume is a little large, I decided to download it as a CSV & then use that for my analysis.

Before I start, let me explain a couple of critical assumptions that I had to make due to data impurities or availabilities.

  • If there is no data available for a specific case, my application will consider that patient as COVID-Active.
  • We will consider the patient affected through community spreading until we have data indicating otherwise.
  • If there is no data available for gender, we’re marking these records as “Other.” That way, patients who don’t want to disclose their gender fall into that category.
  • If we don’t have any data, then by default, the application considers the patient alive.
  • Lastly, my application takes the middle point of each age range, i.e., a patient aged between 20 & 30 will be considered as 25.
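The defaulting rules above can be sketched with a toy data-frame. The column names mirror the Canadian dataset used later; the rows themselves are made up for illustration:

```python
import pandas as p

# Hypothetical sample rows using the dataset's column names.
df = p.DataFrame({
    'case_status': ['Recovered', 'Not Reported', 'Deceased'],
    'exposure':    ['Travel-Related', 'Not Reported', 'Not Reported'],
    'gender':      ['Female', 'Not Reported', 'Male'],
})

# Apply the same "Not Reported" defaults described above.
df['ActiveCases'] = df['case_status'].replace('Not Reported', 'Active')
df['ExposureStatus'] = df['exposure'].replace('Not Reported', 'Community')
df['Gender'] = df['gender'].replace('Not Reported', 'Other')

print(df[['ActiveCases', 'ExposureStatus', 'Gender']])
```

The actual script applies the same rules row by row via apply() with small helper methods, which also lets it catch malformed rows in a try/except.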

1. clsCovidAnalysisByCountryAdv (This is the main script, which will invoke the Machine-Learning API & return 0 if successful.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 01-Jun-2020              ####
#### Modified On 01-Jun-2020              ####
####                                      ####
#### Objective: Main scripts for Logistic ####
#### Regression.                          ####
##############################################

import pandas as p
import clsL as log
import datetime

import matplotlib.pyplot as plt
import seaborn as sns
from clsConfig import clsConfig as cf

# %matplotlib inline -- for Jupyter Notebook
class clsCovidAnalysisByCountryAdv:
    def __init__(self):
        self.fileName_1 = cf.config['FILE_NAME_1']
        self.fileName_2 = cf.config['FILE_NAME_2']
        self.Ind = cf.config['DEBUG_IND']
        self.subdir = str(cf.config['LOG_DIR_NAME'])

    def setDefaultActiveCases(self, row):
        try:
            str_status = str(row['case_status'])

            if str_status == 'Not Reported':
                return 'Active'
            else:
                return str_status
        except:
            return 'Active'

    def setDefaultExposure(self, row):
        try:
            str_exposure = str(row['exposure'])

            if str_exposure == 'Not Reported':
                return 'Community'
            else:
                return str_exposure
        except:
            return 'Community'

    def setGender(self, row):
        try:
            str_gender = str(row['gender'])

            if str_gender == 'Not Reported':
                return 'Other'
            else:
                return str_gender
        except:
            return 'Other'

    def setSurviveStatus(self, row):
        try:
            # 0 - Deceased
            # 1 - Alive
            str_active = str(row['ActiveCases'])

            if str_active == 'Deceased':
                return 0
            else:
                return 1
        except:
            return 1

    def getAgeFromGroup(self, row):
        try:
            # We'll take the middle of the Age group
            # If a age range falls with 20, we'll
            # consider this as 10.
            # Similarly, a age group between 20 & 30,
            # should reflect by 25.
            # Anything above 80 will be considered as
            # 85

            str_age_group = str(row['AgeGroup'])

            if str_age_group == '<20':
                return 10
            elif str_age_group == '20-29':
                return 25
            elif str_age_group == '30-39':
                return 35
            elif str_age_group == '40-49':
                return 45
            elif str_age_group == '50-59':
                return 55
            elif str_age_group == '60-69':
                return 65
            elif str_age_group == '70-79':
                return 75
            else:
                return 85
        except:
            return 100

    def predictResult(self):
        try:
            
            # Initiating Logging Instances
            clog = log.clsL()

            # Important variables
            var = datetime.datetime.now().strftime(".%H.%M.%S")
            print('Target File Extension will contain the following:: ', var)
            Ind = self.Ind
            subdir = self.subdir

            #######################################
            #                                     #
            # Using Logistic Regression to        #
            # Identify the following scenarios -  #
            #                                     #
            # Age wise Infection Vs Deaths        #
            #                                     #
            #######################################
            inputFileName_2 = self.fileName_2

            # Reading from Input File
            df_2 = p.read_csv(inputFileName_2)

            # Fetching only relevant columns
            df_2_Mod = df_2[['date_reported','age_group','gender','exposure','case_status']]
            df_2_Mod['State'] = df_2['province_abbr']

            print()
            print('Projecting 2nd file sample rows: ')
            print(df_2_Mod.head())

            print()
            x_row_1 = df_2_Mod.shape[0]
            x_col_1 = df_2_Mod.shape[1]

            print('Total Number of Rows: ', x_row_1)
            print('Total Number of columns: ', x_col_1)

            #########################################################################################
            # Few Assumptions                                                                       #
            #########################################################################################
            # By default, if there is no data on exposure - We'll treat that as community spreading #
            # By default, if there is no data on case_status - We'll consider this as active        #
            # By default, if there is no data on gender - We'll put that under a separate Gender    #
            # category marked as the "Other". This includes someone who doesn't want to identify    #
            # his/her gender or wants to be part of LGBT community in a generic term.               #
            #                                                                                       #
            # We'll transform our data accordingly based on the above logic.                        #
            #########################################################################################
            df_2_Mod['ActiveCases'] = df_2_Mod.apply(lambda row: self.setDefaultActiveCases(row), axis=1)
            df_2_Mod['ExposureStatus'] = df_2_Mod.apply(lambda row: self.setDefaultExposure(row), axis=1)
            df_2_Mod['Gender'] = df_2_Mod.apply(lambda row: self.setGender(row), axis=1)

            # Filtering all other records where we don't get any relevant information
            # Fetching Data for
            df_3 = df_2_Mod[(df_2_Mod['age_group'] != 'Not Reported')]

            # Dropping unwanted columns
            df_3.drop(columns=['exposure'], inplace=True)
            df_3.drop(columns=['case_status'], inplace=True)
            df_3.drop(columns=['date_reported'], inplace=True)
            df_3.drop(columns=['gender'], inplace=True)

            # Renaming one existing column
            df_3.rename(columns={"age_group": "AgeGroup"}, inplace=True)

            # Creating important feature
            # 0 - Deceased
            # 1 - Alive
            df_3['Survived'] = df_3.apply(lambda row: self.setSurviveStatus(row), axis=1)

            clog.logr('2.df_3' + var + '.csv', Ind, df_3, subdir)

            print()
            print('Projecting Filter sample rows: ')
            print(df_3.head())

            print()
            x_row_2 = df_3.shape[0]
            x_col_2 = df_3.shape[1]

            print('Total Number of Rows: ', x_row_2)
            print('Total Number of columns: ', x_col_2)

            # Let's do some basic checkings
            sns.set_style('whitegrid')
            #sns.countplot(x='Survived', hue='Gender', data=df_3, palette='RdBu_r')

            # Fixing Gender Column
            # This will check & indicate yellow for missing entries
            #sns.heatmap(df_3.isnull(), yticklabels=False, cbar=False, cmap='viridis')

            #sex = p.get_dummies(df_3['Gender'], drop_first=True)
            sex = p.get_dummies(df_3['Gender'])
            df_4 = p.concat([df_3, sex], axis=1)

            print('After New addition of columns: ')
            print(df_4.head())

            clog.logr('3.df_4' + var + '.csv', Ind, df_4, subdir)

            # Dropping unwanted columns for our Machine Learning
            df_4.drop(columns=['Gender'], inplace=True)
            df_4.drop(columns=['ActiveCases'], inplace=True)
            df_4.drop(columns=['Male','Other','Transgender'], inplace=True)

            clog.logr('4.df_4_Mod' + var + '.csv', Ind, df_4, subdir)

            # Fixing Spread Columns
            spread = p.get_dummies(df_4['ExposureStatus'], drop_first=True)
            df_5 = p.concat([df_4, spread], axis=1)

            print('After Spread columns:')
            print(df_5.head())

            clog.logr('5.df_5' + var + '.csv', Ind, df_5, subdir)

            # Dropping unwanted columns for our Machine Learning
            df_5.drop(columns=['ExposureStatus'], inplace=True)

            clog.logr('6.df_5_Mod' + var + '.csv', Ind, df_5, subdir)

            # Fixing Age Columns
            df_5['Age'] = df_5.apply(lambda row: self.getAgeFromGroup(row), axis=1)
            df_5.drop(columns=["AgeGroup"], inplace=True)

            clog.logr('7.df_6' + var + '.csv', Ind, df_5, subdir)

            # Fixing Dummy Columns Name
            # Renaming one existing column Travel-Related with Travel_Related
            df_5.rename(columns={"Travel-Related": "TravelRelated"}, inplace=True)

            clog.logr('8.df_7' + var + '.csv', Ind, df_5, subdir)

            # Removing state for temporary basis
            df_5.drop(columns=['State'], inplace=True)
            # df_5.drop(columns=['State','Other','Transgender','Pending','TravelRelated','Male'], inplace=True)

            # Casting this entire dataframe into Integer
            # df_5_temp.apply(p.to_numeric)

            print('Info::')
            print(df_5.info())
            print("*" * 60)
            print(df_5.describe())
            print("*" * 60)

            clog.logr('9.df_8' + var + '.csv', Ind, df_5, subdir)

            print('Intermediate Sample Dataframe for Age::')
            print(df_5.head())

            # Plotting it to Graph
            sns.jointplot(x="Age", y='Survived', data=df_5)
            sns.jointplot(x="Age", y='Survived', data=df_5, kind='kde', color='red')
            plt.xlabel("Age")
            plt.ylabel("Data Point (0 - Died   Vs    1 - Alive)")

            # Another check with Age Group
            sns.countplot(x='Survived', hue='Age', data=df_5, palette='RdBu_r')
            plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
            plt.ylabel("Total No Of Patient")

            df_6 = df_5.drop(columns=['Survived'], axis=1)

            clog.logr('10.df_9' + var + '.csv', Ind, df_6, subdir)

            # Train & Split Data
            x_1 = df_6
            y_1 = df_5['Survived']

            # Now Train-Test Split of your source data
            from sklearn.model_selection import train_test_split

            # test_size => % of allocated data for your test cases
            # random_state => A specific set of random split on your data
            X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(x_1, y_1, test_size=0.3, random_state=101)

            # Importing Model
            from sklearn.linear_model import LogisticRegression

            logmodel = LogisticRegression()
            logmodel.fit(X_train_1, Y_train_1)

            # Adding Predictions to it
            predictions_1 = logmodel.predict(X_test_1)

            from sklearn.metrics import classification_report

            print('Classification Report:: ')
            print(classification_report(Y_test_1, predictions_1))

            from sklearn.metrics import confusion_matrix

            print('Confusion Matrix:: ')
            print(confusion_matrix(Y_test_1, predictions_1))

            # This is required when you are trying to print from a conventional
            # front-end & not from a Jupyter notebook.
            plt.show()

            return 0

        except Exception as e:
            x = str(e)
            print('Error : ', x)

            return 1

Key snippets from the above script –

df_2_Mod['ActiveCases'] = df_2_Mod.apply(lambda row: self.setDefaultActiveCases(row), axis=1)
df_2_Mod['ExposureStatus'] = df_2_Mod.apply(lambda row: self.setDefaultExposure(row), axis=1)
df_2_Mod['Gender'] = df_2_Mod.apply(lambda row: self.setGender(row), axis=1)

# Filtering all other records where we don't get any relevant information
# Fetching Data for
df_3 = df_2_Mod[(df_2_Mod['age_group'] != 'Not Reported')]

# Dropping unwanted columns
df_3.drop(columns=['exposure'], inplace=True)
df_3.drop(columns=['case_status'], inplace=True)
df_3.drop(columns=['date_reported'], inplace=True)
df_3.drop(columns=['gender'], inplace=True)

# Renaming one existing column
df_3.rename(columns={"age_group": "AgeGroup"}, inplace=True)

# Creating important feature
# 0 - Deceased
# 1 - Alive
df_3['Survived'] = df_3.apply(lambda row: self.setSurviveStatus(row), axis=1)

The above lines point to the critical transformation areas, where the application is invoking various essential business logic.

Let’s see at this moment our sample data –

6. 4_4_mod

Let’s look into the following part –

# Fixing Spread Columns
spread = p.get_dummies(df_4['ExposureStatus'], drop_first=True)
df_5 = p.concat([df_4, spread], axis=1)

The above lines will transform the data into this –

7. 5_5_Mod

As you can see, we’ve transformed the row values into columns with binary values. This kind of one-hot transformation is beneficial, as it lets the model consume categorical data as numeric features.
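The get_dummies encoding above can be reproduced on a tiny frame. Here is a minimal sketch, using a made-up ExposureStatus column (the values are illustrative, not the actual dataset):

```python
import pandas as pd

# A toy frame mimicking the ExposureStatus column (values are illustrative)
df = pd.DataFrame({'ExposureStatus': ['Contact', 'Travel', 'Unknown', 'Contact']})

# drop_first=True drops the first (alphabetical) level, since it is implied
# whenever every other dummy column is 0
spread = pd.get_dummies(df['ExposureStatus'], drop_first=True)

print(spread.columns.tolist())             # ['Travel', 'Unknown']
print(spread.astype(int).values.tolist())  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

With drop_first=True, the first category becomes the implicit baseline — a row of all zeros means ‘Contact’ here — which avoids feeding perfectly correlated columns to the model.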

# Plotting it to Graph
sns.jointplot(x="Age", y='Survived', data=df_5)
sns.jointplot(x="Age", y='Survived', data=df_5, kind='kde', color='red')
plt.xlabel("Age")
plt.ylabel("Data Point (0 - Died   Vs    1 - Alive)")

# Another check with Age Group
sns.countplot(x='Survived', hue='Age', data=df_5, palette='RdBu_r')
plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
plt.ylabel("Total No Of Patient")

The above lines process the data & visualize it accordingly.

x_1 = df_6
y_1 = df_5['Survived']

In the above snippet, we’ve assigned the features & target variable for our final logistic regression model.

# Now Train-Test Split of your source data
from sklearn.model_selection import train_test_split

# test_size => % of allocated data for your test cases
# random_state => A specific set of random split on your data
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(x_1, y_1, test_size=0.3, random_state=101)

# Importing Model
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train_1, Y_train_1)

In the above snippet, we split the primary data into a set of test & train data. Once we have both collections, the application instantiates the logistic regression model &, finally, fits it against the training data.
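The split proportions & the fitting step can be sanity-checked on a toy dataset. A small sketch, using made-up data rather than the actual COVID frame:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 10 synthetic rows with 2 features & a balanced binary target (made-up data)
X = [[i, i % 2] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# test_size=0.3 reserves 30% of the rows for evaluation;
# a fixed random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
print(len(X_train), len(X_test))  # 7 3

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
print(len(predictions))  # 3
```

Because random_state is pinned, rerunning the script always yields the same 7/3 partition — handy when you want comparable metrics across experiments.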

# Adding Predictions to it
predictions_1 = logmodel.predict(X_test_1)

from sklearn.metrics import classification_report

print('Classification Report:: ')
print(classification_report(Y_test_1, predictions_1))

The above lines finally use the fitted model to predict against our test data & print the classification report.
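The figures in the classification report all derive from the confusion matrix. A hand-computed sketch for a hypothetical 2×2 matrix (the counts below are made up, not our actual run):

```python
# A hypothetical 2x2 confusion matrix (rows = actual, columns = predicted)
tn, fp = 52, 8    # actual 0: predicted 0 / predicted 1
fn, tp = 11, 29   # actual 1: predicted 0 / predicted 1

# Precision: of everything predicted as 1, how much really was 1?
precision = tp / (tp + fp)

# Recall: of everything that actually was 1, how much did the model catch?
recall = tp / (tp + fn)

# Accuracy: overall fraction of correct predictions
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(precision, recall, accuracy)
```

Reading the report this way helps spot when a high accuracy hides a poor recall on the minority class.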

Let’s see how it runs –

5.1.Run_Windows
5.2. Run_Windows

And, here is the log directory –

4. Logs

For better understanding, I’m clubbing both diagrams in one place & the final outcome is shown as follows –

1. MergeReport

So, from the above picture, we can see that the most vulnerable patients are those aged 80+. The next two categories that also suffered heavily are the 70+ & 60+ groups.

Also, we’ve checked Female Vs. Male in the following code –

sns.countplot(x='Survived', hue='Female', data=df_5, palette='RdBu_r')
plt.xlabel("Survived(0 - Died   Vs    1 - Alive)")
plt.ylabel("Female Vs Male (Including Other Genders)")

And, the analysis is represented through this –

8. Female_Male

In this case, you have to consider that the Male part includes all the other genders apart from the actual Male category. Hence, I believe the death count for females would be even higher compared to people who identified themselves as male.

So, finally, we’ve done it.

During this challenging time, I would request you to follow strict health guidelines & stay healthy.

N.B.: All the data that are used here can be found in the public domain. We use this solely for educational purposes. You can find the details here.

Analyzing Language using IBM Watson using Python

Hi Guys,

Today, I’ll be discussing the following topic – “How to analyze text using IBM Watson implementing through Python.”

IBM has improved significantly in the field of Visual Image Analysis & Text language analysis with its IBM Watson cloud platform. In this particular topic, we’ll be exploring natural language only.

To access IBM API, we need to first create an IBM Cloud account from this site.

Let us quickly go through the steps to create the IBM Language Understanding service. Click the Catalog on top of your browser menu as shown in the below picture –

6. Creating an Instance for Watson

After that, click the AI option on your left-hand side of the panel marked in RED.

Click Watson-Studio & later choose the plan. In our case, we’ll select the “Lite” option, as IBM provides this platform for free for all developers to explore their cloud.

7. Choosing AI
8. Choosing Plan

Clicking the create option will lead to a blank page of Watson Studio as shown below –

9. Choosing Watson Studio

And now, we need to click the Get Started button to launch it. This will lead to the Create Project page, which can be completed using the following steps –

10. Create Project Initial Screen

Now, clicking “Create a project” will lead you to the next screen –

11. Create Project - Continue

You can choose either an empty project, or you can create it from a sample file. In this case, we’ll be selecting the first option & this will lead us to the below page –

12. Creating a Project

And, then you will click the “Create” option, which will lead you to the next screen –

13. Adding to project

Now, you need to click “Add to Project.” This will give you a variety of services that you can explore/use from the list. If you want to create your own natural language classifier, you can do that as follows –

14. Adding Natural Language Components from IBM Cloud

Once you click it, you need to select the associated service –

15. Adding Associte Service - Sound

Here, you need to click the hyperlink, which takes you to the next screen –

16. Choosing Associate Service - Sound

You need to check the price for both the Visual & the Natural Language Classifier, as they are pretty expensive. The Visual classifier has a Lite plan; however, it has limitations on output.

Clicking “Create” will take you to the next screen –

18. Selecting Region - Sound

After successful creation, you will be redirected to the following page –

19. Landing Page - Sound

Now, we’ll be adding our “Natural Language Understanding” service for our test –

29. Choosing Natural Language Understanding

This will prompt the next screen –

7. Choosing AI - Natural Language Understanding

Once it is successful, you will see the service registered as shown below –

3. Watson Services - Sound

If you click the service marked in RED, it will lead you to another page, where you will get the API Key & URL. You need both of these pieces of information in the Python application to access this API, as shown below –

4. Watson API Details - Sound

Now, we’re ready with the necessary cloud set-up. After this, we need to install the Python package for IBM Cloud as shown below –

1. Installing_Packages

We’ve noticed that IBM has recently launched an upgraded package. Hence, we installed that one as well. I would recommend installing this second package directly instead of the first one shown above –

2. Installing Latest IBM_Watson Package

Now, we’re done with our set-up.

Let’s see the directory structure –

31. Directory Structure

We’ll be discussing only the main calling script & the class script. However, we’ll be posting the parameters without discussing them. And, we won’t discuss clsL.py, as we’ve already discussed that in our previous post.

1. clsConfig.py (This script contains all the parameter details.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 04-Apr-2020              ####
####                                      ####
#### Objective: This script is a config   ####
#### file, containing all the keys for    ####
#### IBM Cloud API. The application will  ####
#### process this information & perform   ####
#### various analyses on IBM Watson cloud.####
##############################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'SERVICE_URL': "https://api.eu-gb.natural-language-understanding.watson.cloud.ibm.com/instances/xxxxxxxxxxxxxxXXXXXXXXXXxxxxxxxxxxxxxxxx",
        'API_KEY': "Xxxxxxxxxxxxxkdkdfifd984djddkkdkdkdsSSdkdkdd",
        'API_TYPE': "application/json",
        'CACHE': "no-cache",
        'CON': "keep-alive",
        'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
        'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
        'LOG_PATH': Curr_Path + sep + 'log' + sep,
        'REPORT_PATH': Curr_Path + sep + 'report',
        'SRC_PATH': Curr_Path + sep + 'Src_File' + sep,
        'APP_DESC_1': 'IBM Watson Language Understand!',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path
    }

Note that you will place your API_KEY & URL here, as shown in the configuration file.
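Hard-coding credentials in clsConfig.py is fine for a demo, but you may prefer to pull them from environment variables so that keys never land in source control. A minimal sketch — the variable names IBM_WATSON_API_KEY & IBM_WATSON_URL are illustrative, not part of the original script:

```python
import os

# Read the key from the environment, falling back to a placeholder
# so the demo still runs locally without the variable set
api_key = os.environ.get('IBM_WATSON_API_KEY', 'dummy-local-key')
service_url = os.environ.get('IBM_WATSON_URL',
                             'https://api.eu-gb.natural-language-understanding.watson.cloud.ibm.com')

print(len(api_key) > 0, service_url.startswith('https://'))
```

You could then populate the config dictionary from these values instead of literal strings.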

2. clsIBMWatson.py (This is the main script, which will invoke the IBM Watson API based on the input from the user & return 0 if successful.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 04-Apr-2020              ####
#### Modified On 04-Apr-2020              ####
####                                      ####
#### Objective: Main scripts to invoke    ####
#### IBM Watson Language Understand API.  ####
##############################################

import logging
from clsConfig import clsConfig as cf
import clsL as cl
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, SentimentOptions, CategoriesOptions, ConceptsOptions
from ibm_watson import ApiException

class clsIBMWatson:
    def __init__(self):
        self.api_key =  cf.config['API_KEY']
        self.service_url = cf.config['SERVICE_URL']

    def calculateExpressionFromUrl(self, inputUrl, inputVersion):
        try:
            api_key = self.api_key
            service_url = self.service_url
            print('-' * 60)
            print('Beginning of the IBM Watson for Input Url.')
            print('-' * 60)

            authenticator = IAMAuthenticator(api_key)

            # Authentication via service credentials provided in our config files
            service = NaturalLanguageUnderstandingV1(version=inputVersion, authenticator=authenticator)
            service.set_service_url(service_url)

            response = service.analyze(
                url=inputUrl,
                features=Features(entities=EntitiesOptions(),
                                  sentiment=SentimentOptions(),
                                  concepts=ConceptsOptions())).get_result()

            print(json.dumps(response, indent=2))

            return 0

        except ApiException as ex:
            print('-' * 60)
            print("Method failed for Url with status code " + str(ex.code) + ": " + ex.message)
            print('-' * 60)

            return 1

    def calculateExpressionFromText(self, inputText, inputVersion):
        try:
            api_key = self.api_key
            service_url = self.service_url
            print('-' * 60)
            print('Beginning of the IBM Watson for Input Text.')
            print('-' * 60)

            authenticator = IAMAuthenticator(api_key)

            # Authentication via service credentials provided in our config files
            service = NaturalLanguageUnderstandingV1(version=inputVersion, authenticator=authenticator)
            service.set_service_url(service_url)

            response = service.analyze(
                text=inputText,
                features=Features(entities=EntitiesOptions(),
                                  sentiment=SentimentOptions(),
                                  concepts=ConceptsOptions())).get_result()

            print(json.dumps(response, indent=2))

            return 0

        except ApiException as ex:
            print('-' * 60)
            print("Method failed for Text with status code " + str(ex.code) + ": " + ex.message)
            print('-' * 60)

            return 1

Some of the key lines from the above snippet –

authenticator = IAMAuthenticator(api_key)

# Authentication via service credentials provided in our config files
service = NaturalLanguageUnderstandingV1(version=inputVersion, authenticator=authenticator)
service.set_service_url(service_url)

By providing the API Key & Url, the application is initiating the service for Watson.

response = service.analyze(
    url=inputUrl,
    features=Features(entities=EntitiesOptions(),
                      sentiment=SentimentOptions(),
                      concepts=ConceptsOptions())).get_result()

Based on your type of input, it will bring back the entities, sentiment & concepts features here. Apart from these, you can additionally request the following features as well – Keywords & Categories.
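The dictionary returned by get_result() nests each requested feature under its own key. A sketch of pulling out the highlights — the payload below is an illustrative stand-in, not real API output:

```python
import json

# An illustrative response, shaped like the NLU v1 result for
# sentiment, entities & concepts (all values are made up)
response = {
    "sentiment": {"document": {"label": "positive", "score": 0.83}},
    "entities": [{"type": "Company", "text": "IBM", "relevance": 0.91}],
    "concepts": [{"text": "Natural language processing", "relevance": 0.95}]
}

# Overall document-level sentiment
label = response["sentiment"]["document"]["label"]

# Pick the most relevant entity & concept
top_entity = max(response["entities"], key=lambda e: e["relevance"])["text"]
top_concept = max(response["concepts"], key=lambda c: c["relevance"])["text"]

print(json.dumps({"label": label, "entity": top_entity, "concept": top_concept}))
```

Parsing the result this way lets you feed the highlights into a report or dashboard instead of dumping the raw JSON.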

3. callIBMWatsonAPI.py (This is the first calling script. Based on user choice, it will receive input either as a URL or as plain text & then analyze it.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 04-Apr-2020              ####
#### Modified On 04-Apr-2020              ####
####                                      ####
#### Objective: Main calling scripts.     ####
##############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsIBMWatson as cw

# Disabling Warnings
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn


var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_1 = 0
        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'IBMWatson_NaturalLanguageAnalysis.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']
        curr_ver = datetime.datetime.now().strftime("%Y-%m-%d")

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        print('Welcome to IBM Watson Language Understanding Calling Program: ')
        print('-' * 60)
        print('Please press 1 to understand the language from a URL.')
        print('Please press 2 to understand the language from your input text.')
        input_choice = int(input('Please provide your choice:'))

        # Create the instance of the IBM Watson Class
        x2 = cw.clsIBMWatson()

        # Let's pass this to our map section
        if input_choice == 1:
            textUrl = str(input('Please provide the complete input url:'))
            ret_1 = x2.calculateExpressionFromUrl(textUrl, curr_ver)
        elif input_choice == 2:
            inputText = str(input('Please provide the input text:'))
            ret_1 = x2.calculateExpressionFromText(inputText, curr_ver)
        else:
            print('Invalid options!')

        if ret_1 == 0:
            print('Successful IBM Watson Language Understanding Generated!')
        else:
            print('Failed to generate IBM Watson Language Understanding!')

        print("-" * 60)
        print()

        print('Finding Analysis points..')
        print("*" * 157)
        logging.info('Finding Analysis points..')
        logging.info(tmpR0)


        tmpR10 = 'End Time: ' + str(var)
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError as e:
        print(str(e))
        print("Invalid option!")
        logging.info("Invalid option!")

    except Exception as e:
        print("Top level Error: args:{0}, message:{1}".format(e.args, str(e)))

if __name__ == "__main__":
    main()

This script is pretty straightforward: it first creates an instance of the main class & then, based on the user input, calls the respective functions.

As of now, IBM Watson can work on a list of languages, which are available here.

If you want to start from scratch, please refer to the following link.

Please find the screenshot of our application run –

Case 1 (With Url): 

21. Win_Run_1_Url
23. Win_Run_3_Url

Case 2 (With Plain text):

25. Win_Run_1_InputText
26. Win_Run_2_InputText
27. Win_Run_3_InputText

Now, don’t forget to delete all the services from your IBM Cloud.

32. Delete Service

As you can see, you need to delete all the services one by one, as shown in the figure.

So, we’ve done it.

To explore my photography, you can visit the following link.

I’ll be posting another new post in the coming days. Till then, Happy Avenging! 😀

Note: All the data posted here are representational data & available over the internet & for educational purpose only.