audio Archives

AGENTIC AI IN THE ENTERPRISE: STRATEGY, ARCHITECTURE, AND IMPLEMENTATION – PART 5

Posted on September 6, 2025September 6, 2025 by SatyakiDe in ai, anthropic, audio, Azure, BOT, call, cloud, CPU, Data Science, design, escape, exposure, Fabric, faiss, GPU, grok, gui, HuggingFace, ibm, integration, IoT, json, LangChain, Langflow, Linear-Regression, llm, Logistic-regression, machine-learning, mcpprotocol, Microsoft, Model, mulesoft, natural-language, neural prophet, ngrok, objects, Open-CV, openai, Performance, prophet-api, rev_ai, Scikit-Learn, StabilityAI, StableDefussion, Technology, vector

This is a continuation of my previous post, which can be found here. This will be our last post of this series.

Let us recap the key takaways from our previous post –

Two cloud patterns show how MCP standardizes safe AI-to-system work. Azure “agent factory”: You ask in Teams; Azure AI Foundry dispatches a specialist agent (HR/Sales). The agent calls a specific MCP server (Functions/Logic Apps) for CRM, SharePoint, or SQL via API Management. Entra ID enforces access; Azure Monitor audits. AWS “composable serverless agents”: In Bedrock, domain agents (Financial/IT Ops) invoke Lambda-based MCP tools for DynamoDB, S3, or CloudWatch through API Gateway with IAM and optional VPC. In both, agents never hold credentials; tools map one-to-one to systems, improving security, clarity, scalability, and compliance.

In this post, we’ll discuss the GCP factory pattern.

Unified Workbench Pattern (GCP):

The GCP “unified workbench” pattern prioritizes a unified, data-centric platform for AI development, integrating seamlessly with Vertex AI and Google’s expertise in AI and data analytics. This approach is well-suited for AI-first companies and data-intensive organizations that want to build agents that leverage cutting-edge research tools.

Let’s explore the following diagram based on this –

Imagine Mia, a clinical operations lead, opens a simple app and asks: “Which clinics had the longest wait times this week? Give me a quick summary I can share.”

The app quietly sends Mia’s request to Vertex AI Agent Builder—think of it as the switchboard operator.
Vertex AI picks the Data Analysis agent (the “specialist” for questions like Mia’s).
That agent doesn’t go rummaging through databases. Instead, it uses a safe, preapproved tool—an MCP Server—to query BigQuery, where the data lives.
The tool fetches results and returns them to Mia—no passwords in the open, no risky shortcuts—just the answer, fast and safely.

Now meet Ravi, a developer who asks: “Show me the latest app metrics and confirm yesterday’s patch didn’t break the login table.”

The app routes Ravi’s request to Vertex AI.
Vertex AI chooses the Developer agent.
That agent calls a different tool—an MCP Server designed for Cloud SQL—to check the login table and run a safe query.
Results come back with guardrails intact. If the agent ever needs files, there’s also a Cloud Storage tool ready to fetch or store documents.

Let us understand how the underlying flow of activities took place –

User Interface:
- Entry point: Vertex AI console or a custom app.
- Sends a single request; no direct credentials or system access exposed to the user.
Orchestration: Vertex AI Agent Builder (MCP Host)
- Routes the request to the most suitable agent:
  - Agent A (Data Analysis) for analytics/BI-style questions.
  - Agent B (Developer) for application/data-ops tasks.
Tooling via MCP Servers on Cloud Run
- Each MCP Server is a purpose-built adapter with least-privilege access to exactly one service:
  - Server1 → BigQuery (analytics/warehouse) — used by Agent A in this diagram.
  - Server2 → Cloud Storage (GCS) (files/objects) — available when file I/O is needed.
  - Server3 → Cloud SQL (relational DB) — used by Agent B in this diagram.
- Agents never hold database credentials; they request actions from the right tool.
Enterprise Systems
- BigQuery, Cloud Storage, and Cloud SQL are the systems of record that the tools interact with.
Security, Networking, and Observability
- GCP IAM: AuthN/AuthZ for Vertex AI and each MCP Server (fine-grained roles, least privilege).
- GCP VPC: Private network paths for all Cloud Run MCP Servers (isolation, egress control).
- Cloud Monitoring: Metrics, logs, and alerts across agents and tools (auditability, SLOs).
Return Path
- Results flow back from the service → MCP Server → Agent → Vertex AI → UI.
- Policies and logs track who requested what, when, and how.

Why does this design work?

One entry point for questions.
Clear accountability: specialists (agents) act within guardrails.
Built-in safety (IAM/VPC) and visibility (Monitoring) for trust.
Separation of concerns: agents decide what to do; tools (MCP Servers) decide how to do it.
Scalable: add a new tool (e.g., Pub/Sub or Vertex AI Feature Store) without changing the UI or agents.
Auditable & maintainable: each tool maps to one service with explicit IAM and VPC controls.

So, we’ve concluded the series with the above post. I hope you like it.

I’ll bring some more exciting topics in the coming days from the new advanced world of technology.

Till then, Happy Avenging! 🙂

Note: All the data & scenarios posted here are representative of data & scenarios available on the internet for educational purposes only. There is always room for improvement in this kind of model & the solution associated with it. I’ve shown the basic ways to achieve the same for educational purposes only.

AGENTIC AI IN THE ENTERPRISE: STRATEGY, ARCHITECTURE, AND IMPLEMENTATION – PART 3

Posted on July 31, 2025August 31, 2025 by SatyakiDe in agents, ai, anthropic, api, audio, Azure, bharatgpt, BOT, call, circuitbreaker, clob, cloud, Computer-Vision, computing, CPU, Crossplatform, Data Science, deepseek, design, exposure, Fabric, faiss, features, function, gpt3, GPU, grok, gui, Haystack, HuggingFace, ibm, IoT, json, Keras, LangChain, Langflow, Linear-Regression, Listagg, llm, Logistic-regression, loop, machine-learning, mcpprotocol, Microsoft, mobile, Model, mulesoft, natural-language, neural prophet, ngrok, objects, Open-CV, openai, oracle-cloud, Performance, pl sql, Polars, prophet-api, React, Real-time, sarvam, Silicon, StabilityAI, StableDefussion, Technology, Tensorflow, Torch, video, voice, watson

This is a continuation of my previous post, which can be found here.

Let us recap the key takaways from our previous post –

Enterprise AI, utilizing the Model Context Protocol (MCP), leverages an open standard that enables AI systems to securely and consistently access enterprise data and tools. MCP replaces brittle “N×M” integrations between models and systems with a standardized client–server pattern: an MCP host (e.g., IDE or chatbot) runs an MCP client that communicates with lightweight MCP servers, which wrap external systems via JSON-RPC. Servers expose three assets—Resources (data), Tools (actions), and Prompts (templates)—behind permissions, access control, and auditability. This design enables real-time context, reduces hallucinations, supports model- and cloud-agnostic interoperability, and accelerates “build once, integrate everywhere” deployment. A typical flow (e.g., retrieving a customer’s latest order) encompasses intent parsing, authorized tool invocation, query translation/execution, and the return of a normalized JSON result to the model for natural-language delivery. Performance introduces modest overhead (RPC hops, JSON (de)serialization, network transit) and scale considerations (request volume, significant results, context-window pressure). Mitigations include in-memory/semantic caching, optimized SQL with indexing, pagination, and filtering, connection pooling, and horizontal scaling with load balancing. In practice, small latency costs are often outweighed by the benefits of higher accuracy, stronger governance, and a decoupled, scalable architecture.

How does MCP compare with other AI integration approaches?

Compared to other approaches, the Model Context Protocol (MCP) offers a uniquely standardized and secure framework for AI-tool integration, shifting from brittle, custom-coded connections to a universal plug-and-play model. It is not a replacement for underlying systems, such as APIs or databases, but instead acts as an intelligent, secure abstraction layer designed explicitly for AI agents.

MCP vs. Custom API integrations:

This approach was the traditional method for AI integration before standards like MCP emerged.

Custom API integrations (traditional): Each AI application requires a custom-built connector for every external system it needs to access, leading to an N x M integration problem (the number of connectors grows exponentially with the number of models and systems). This approach is resource-intensive, challenging to maintain, and prone to breaking when underlying APIs change.
MCP: The standardized protocol eliminates the N x M problem by creating a universal interface. Tool creators build a single MCP server for their system, and any MCP-compatible AI agent can instantly access it. This process decouples the AI model from the underlying implementation details, drastically reducing integration and maintenance costs.

For more detailed information, please refer to the following link.

MCP vs. Retrieval-Augmented Generation (RAG):

RAG is a technique that retrieves static documents to augment an LLM’s knowledge, while MCP focuses on live interactions. They are complementary, not competing.

RAG:
- Focus: Retrieving and summarizing static, unstructured data, such as documents, manuals, or knowledge bases.
- Best for: Providing background knowledge and general information, as in a policy lookup tool or customer service bot.
- Data type: Unstructured, static knowledge.
MCP:
- Focus: Accessing and acting on real-time, structured, and dynamic data from databases, APIs, and business systems.
- Best for: Agentic use cases involving real-world actions, like pulling live sales reports from a CRM or creating a ticket in a project management tool.
- Data type: Structured, real-time, and dynamic data.

MCP vs. LLM plugins and extensions:

Before MCP, platforms like OpenAI offered proprietary plugin systems to extend LLM capabilities.

LLM plugins:
- Proprietary: Tied to a specific AI vendor (e.g., OpenAI).
- Limited: Rely on the vendor’s API function-calling mechanism, which focuses on call formatting but not standardized execution.
- Centralized: Managed by the AI vendor, creating a risk of vendor lock-in.
MCP:
- Open standard: Based on a public, interoperable protocol (JSON-RPC 2.0), making it model-agnostic and usable across different platforms.
- Infrastructure layer: Provides a standardized infrastructure for agents to discover and use any compliant tool, regardless of the underlying LLM.
- Decentralized: Promotes a flexible ecosystem and reduces the risk of vendor lock-in.

How enterprise AI with MCP has opened up a specific Architecture pattern for Azure, AWS & GCP?

Microsoft Azure:

The “agent factory” pattern: Azure focuses on providing managed services for building and orchestrating AI agents, tightly integrated with its enterprise security and governance features. The MCP architecture is a core component of the Azure AI Foundry, serving as a secure, managed “agent factory.”

Azure architecture pattern with MCP:

AI orchestration layer: The Azure AI Agent Service, within Azure AI Foundry, acts as the central host and orchestrator. It provides the control plane for creating, deploying, and managing multiple specialized agents, and it natively supports the MCP standard.
AI model layer: Agents in the Foundry can be powered by various models, including those from Azure OpenAI Service, commercial models from partners, or open-source models.
MCP server and tool layer: MCP servers are deployed using serverless functions, such as Azure Functions or Azure Logic Apps, to wrap existing enterprise systems. These servers expose tools for interacting with enterprise data sources like SharePoint, Azure AI Search, and Azure Blob Storage.
Data and security layer: Data is secured using Microsoft Entra ID (formerly Azure AD) for authentication and access control, with robust security policies enforced via Azure API Management. Access to data sources, such as databases and storage, is managed securely through private networks and Managed Identity.

Amazon Web Services (AWS):

The “composable serverless agent” pattern: AWS emphasizes a modular, composable, and serverless approach, leveraging its extensive portfolio of services to build sophisticated, flexible, and scalable AI solutions. The MCP architecture here aligns with the principle of creating lightweight, event-driven services that AI agents can orchestrate.

AWS architecture pattern with MCP:

The AI orchestration layer, which includes Amazon Bedrock Agents or custom agent frameworks deployed via AWS Fargate or Lambda, acts as the MCP hosts. Bedrock Agents provide built-in orchestration, while custom agents offer greater flexibility and customization options.
AI model layer: The models are sourced from Amazon Bedrock, which provides a wide selection of foundation models.
MCP server and tool layer: MCP servers are deployed as serverless AWS Lambda functions. AWS offers pre-built MCP servers for many of its services, including the AWS Serverless MCP Server for managing serverless applications and the AWS Lambda Tool MCP Server for invoking existing Lambda functions as tools.
Data and security layer: Access is tightly controlled using AWS Identity and Access Management (IAM) roles and policies, with fine-grained permissions for each MCP server. Private data sources like databases (Amazon DynamoDB) and storage (Amazon S3) are accessed securely within a Virtual Private Cloud (VPC).

Google Cloud Platform (GCP):

The “unified workbench” pattern: GCP focuses on providing a unified, open, and data-centric platform for AI development. The MCP architecture on GCP integrates natively with the Vertex AI platform, treating MCP servers as first-class tools that can be dynamically discovered and used within a single workbench.

GCP architecture pattern with MCP:

AI orchestration layer: The Vertex AI Agent Builder serves as the central environment for building and managing conversational AI and other agents. It orchestrates workflows and manages tool invocation for agents.
AI model layer: Agents use foundation models available through the Vertex AI Model Garden or the Gemini API.
MCP server and tool layer: MCP servers are deployed as containerized microservices on Cloud Run or managed by services like App Engine. These servers contain tools that interact with GCP services, such as BigQuery, Cloud Storage, and Cloud SQL. GCP offers pre-built MCP server implementations, such as the GCP MCP Toolbox, for integration with its databases.
Data and security layer: Vertex AI Vector Search and other data sources are encapsulated within the MCP server tools to provide contextual information. Access to these services is managed by Identity and Access Management (IAM) and secured through virtual private clouds. The MCP server can leverage Vertex AI Context Caching for improved performance.

Note that all the native technology is referred to in each respective cloud. Hence, some of the better technologies can be used in place of the tool mentioned here. This is more of a concept-level comparison rather than industry-wise implementation approaches.

We’ll go ahead and conclude this post here & continue discussing on a further deep dive in the next post.

Till then, Happy Avenging! 🙂

Note: All the data & scenarios posted here are representational data & scenarios & available over the internet & for educational purposes only. There is always room for improvement in this kind of model & the solution associated with it. I’ve shown the basic ways to achieve the same for educational purposes only.

Agentic AI in the Enterprise: Strategy, Architecture, and Implementation – Part 1

Posted on May 28, 2025August 31, 2025 by SatyakiDe in agents, ai, anthropic, api, audio, Azure, bharatgpt, BOT, call, cloud, Computer-Vision, computing, Data Science, design, function, gpt3, GPU, grok, gui, HuggingFace, integration, IoT, LangChain, Langflow, Logistic-regression, machine-learning, meshapi, Microsoft, mobile, Model, mulesoft, natural-language, neural prophet, ngrok, objects, Open-CV, openai, oracle-cloud, Performance, Real-time, StabilityAI, StableDefussion, Technology, Tensorflow, vector, video, voice, watson

Today, we won’t be discussing any solutions. Today, we’ll be discussing the Agentic AI & its implementation in the Enterprise landscape in a series of upcoming posts.

So, hang tight! We’re about to launch a new venture as part of our knowledge drive.

What is Agentic AI?

Agentic AI refers to artificial intelligence systems that can act autonomously to achieve goals, making decisions and taking actions without constant human oversight. Unlike traditional AI, which responds to prompts, agentic AI can plan, reason about next steps, utilize tools, and work toward objectives over extended periods of time.

Key characteristics of agentic AI include:

Autonomy and Goal-Directed Behavior: These systems can pursue objectives independently, breaking down complex tasks into smaller steps and executing them sequentially.
Tool Use and Environment Interaction: Agentic AI can interact with external systems, APIs, databases, and software tools to gather information and perform actions in the real world.
Planning and Reasoning: They can develop multi-step strategies, adapt their approach based on feedback, and reason through problems to find solutions.
Persistence: Unlike single-interaction AI, agentic systems can maintain context and continue working on tasks across multiple interactions or sessions.
Decision Making: They can evaluate options, weigh trade-offs, and make choices about how to proceed when faced with uncertainty.

Foundational Elements of Agentic AI Architectures:

Agentic AI systems have several interconnected components that work together to enable intelligent behaviour. Each element plays a crucial role in the overall functioning of the AI system, and they must interact seamlessly to achieve desired outcomes. Let’s explore each of these components in more detail.

Sensing:

The sensing module serves as the AI’s eyes and ears, enabling it to understand its surroundings and make informed decisions. Think of it as the system that helps the AI “see” and “hear” the world around it, much like how humans use their senses.

Gathering Information: The system collects data from multiple sources, including cameras for visual information, microphones for audio, sensors for physical touch, and digital systems for data. This step provides the AI with a comprehensive understanding of what’s happening.
Making Sense of Data: Raw information from sensors can be messy and overwhelming. This component processes the data to identify the essential patterns and details that actually matter for making informed decisions.
Recognizing What’s Important: Utilizing advanced techniques such as computer vision (for images), natural language processing (for text and speech), and machine learning (for data patterns), the system identifies and understands objects, people, events, and situations within the environment.

This sensing capability enables AI systems to transition from merely following pre-programmed instructions to genuinely understanding their environment and making informed decisions based on real-world conditions. It’s the difference between a basic automated system and an intelligent agent that can adapt to changing situations.

Observation:

The observation module serves as the AI’s decision-making center, where it sets objectives, develops strategies, and selects the most effective actions to take. This step is where the AI transforms what it perceives into purposeful action, much like humans think through problems and devise plans.

Setting Clear Objectives: The system establishes specific goals and desired outcomes, giving the AI a clear sense of direction and purpose. This approach helps ensure all actions are working toward meaningful results rather than random activity.
Strategic Planning: Using information about its own capabilities and the current situation, the AI creates step-by-step plans to reach its goals. It considers potential obstacles, available resources, and different approaches to find the most effective path forward.
Intelligent Decision-Making: When faced with multiple options, the system evaluates each choice against the current circumstances, established goals, and potential outcomes. It then selects the action most likely to move the AI closer to achieving its objectives.

This observation capability is what transforms an AI from a simple tool that follows commands into an intelligent system that can work independently toward business goals. It enables the AI to handle complex, multi-step tasks and adapt its approach when conditions change, making it valuable for a wide range of applications, from customer service to project management.

Action:

The action module serves as the AI’s hands and voice, turning decisions into real-world results. This step is where the AI actually puts its thinking and planning into action, carrying out tasks that make a tangible difference in the environment.

Control Systems: The system utilizes various tools to interact with the world, including motors for physical movement, speakers for communication, network connections for digital tasks, and software interfaces for system operation. These serve as the AI’s means of reaching out and making adjustments.
Task Implementation: Once the cognitive module determines the action to take, this component executes the actual task. Whether it’s sending an email, moving a robotic arm, updating a database, or scheduling a meeting, this module handles the execution from start to finish.

This action capability is what makes AI systems truly useful in business environments. Without it, an AI could analyze data and make significant decisions, but it couldn’t help solve problems or complete tasks. The action module bridges the gap between artificial intelligence and real-world impact, enabling AI to automate processes, respond to customers, manage systems, and deliver measurable business value.

Technology that is primarily involved in the Agentic AI is as follows –

1. Machine Learning

2. Deep Learning

3. Computer Vision

4. Natural Language Processing (NLP)

5. Planning and Decision-Making

6. Uncertainty and Reasoning

7. Simulation and Modeling

Agentic AI at Scale: MCP + A2A:

In an enterprise setting, agentic AI systems utilize the Model Context Protocol (MCP) and the Agent-to-Agent (A2A) protocol as complementary, open standards to achieve autonomous, coordinated, and secure workflows. An MCP-enabled agent gains the ability to access and manipulate enterprise tools and data. At the same time, A2A allows a network of these agents to collaborate on complex tasks by delegating and exchanging information.

This combined approach allows enterprises to move from isolated AI experiments to strategic, scalable, and secure AI programs.

How do the protocols work together in an enterprise?

Protocol	Function in Agentic AI	Focus	Example use case
Model Context Protocol (MCP)	Equips a single AI agent with the tools and data it needs to perform a specific job.	Vertical integration: connecting agents to enterprise systems like databases, CRMs, and APIs.	A sales agent uses MCP to query the company CRM for a client’s recent purchase history.
Agent-to-Agent (A2A)	Enables multiple specialized agents to communicate, delegate tasks, and collaborate on a larger, multi-step goal.	Horizontal collaboration: allowing agents from different domains to work together seamlessly.	An orchestrating agent uses A2A to delegate parts of a complex workflow to specialized HR, IT, and sales agents.

Advantages for the enterprise:

End-to-end automation: Agents can handle tasks from start to finish, including complex, multi-step workflows, autonomously.
Greater agility and speed: Enterprise-wide adoption of these protocols reduces the cost and complexity of integrating AI, accelerating deployment timelines for new applications.
Enhanced security and governance: Enterprise AI platforms built on these open standards incorporate robust security policies, centralized access controls, and comprehensive audit trails.
Vendor neutrality and interoperability: As open standards, MCP and A2A allow AI agents to work together seamlessly, regardless of the underlying vendor or platform.
Adaptive problem-solving: Agents can dynamically adjust their strategies and collaborate based on real-time data and contextual changes, leading to more resilient and efficient systems.

We will discuss this topic further in our upcoming posts.

Till then, Happy Avenging! 🙂

Building the optimized Indic Language bot by using the Python-based Sarvam AI LLMs – Part 2

Posted on August 31, 2024August 31, 2024 by SatyakiDe in ai, api, audio, BOT, cloud, code, function, json, Performance, sarvam

As we discover in our previous post about the Sarvam AI basic capabilities & a glimpse of code review. Today, we’ll finish the rest of the part & some of the matrices comparing against other popular LLMs.

Before that, you can refer to the previous post for a recap, which is available her e.

Also, we’re providing the demo here –

Now, let us jump into the rest of the code –

Code:

clsSarvamAI.py (This script will capture the audio input in Indic languages & then provide an LLM response in the form of audio in Indic languages. In this post, we’ll discuss part of the code. In the next part, we’ll be discussing the next important methods. Note that we’re only going to discuss a few important functions here.)

def createWavFile(self, audio, output_filename="output.wav", target_sample_rate=16000):
      try:
          # Get the raw audio data as bytes
          audio_data = audio.get_raw_data()

          # Get the original sample rate
          original_sample_rate = audio.sample_rate

          # Open the output file in write mode
          with wave.open(output_filename, 'wb') as wf:
              # Set parameters: nchannels, sampwidth, framerate, nframes, comptype, compname
              wf.setnchannels(1)  # Assuming mono audio
              wf.setsampwidth(2)  # 16-bit audio (int16)
              wf.setframerate(original_sample_rate)

              # Write audio data in chunks
              chunk_size = 1024 * 10  # Chunk size (adjust based on memory constraints)
              for i in range(0, len(audio_data), chunk_size):
                  wf.writeframes(audio_data[i:i+chunk_size])

          # Log the current timestamp
          var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
          print('Audio Time: ', str(var))

          return 0

      except Exception as e:
          print('Error: <Wav File Creation>: ', str(e))
          return 1

createWavFile:

Purpose:

This method saves recorded audio data into a WAV file format.

What it Does:

Takes raw audio data and converts it into bytes.
Gets the original sample rate of the audio.
Opens a new WAV file in write mode.
Sets the parameters for the audio file (like the number of channels, sample width, and frame rate).
Writes the audio data into the file in small chunks to manage memory usage.
Logs the current time to keep track of when the audio was saved.
Returns 0 on success or 1 if there was an error.

The “createWavFile” method takes the recorded audio and saves it as a WAV file on your computer. It converts the audio into bytes and writes them into small file parts. If something goes wrong, it prints an error message.

def chunkBengaliResponse(self, text, max_length=500):
      try:
          chunks = []
          current_chunk = ""

          # Use regex to split on sentence-ending punctuation
          sentences = re.split(r'(।|\?|!)', text)

          for i in range(0, len(sentences), 2):
              sentence = sentences[i] + (sentences[i+1] if i+1 < len(sentences) else '')

              if len(current_chunk) + len(sentence) <= max_length:
                  current_chunk += sentence
              else:
                  if current_chunk:
                      chunks.append(current_chunk.strip())
                  current_chunk = sentence

          if current_chunk:
              chunks.append(current_chunk.strip())

          return chunks
      except Exception as e:
          x = str(e)
          print('Error: <<Chunking Bengali Response>>: ', x)

          return ''

chunkBengaliResponse:

Purpose:

This method breaks down a large piece of text (in Bengali) into smaller, manageable chunks.

What it Does:

Initializes an empty list to store the chunks of text.
It uses a regular expression to split the text based on punctuation marks like full stops (।), question marks (?), and exclamation points (!).
Iterates through the split sentences to form chunks that do not exceed a specified maximum length (max_length).
Adds each chunk to the list until the entire text is processed.
Returns the list of chunks or an empty string if an error occurs.

The chunkBengaliResponse method takes a long Bengali text and splits it into smaller, easier-to-handle parts. It uses punctuation marks to determine where to split. If there’s a problem while splitting, it prints an error message.

def playWav(self, audio_data):
      try:
          # Create a wav file object from the audio data
          WavFile = wave.open(io.BytesIO(audio_data), 'rb')

          # Extract audio parameters
          channels = WavFile.getnchannels()
          sample_width = WavFile.getsampwidth()
          framerate = WavFile.getframerate()
          n_frames = WavFile.getnframes()

          # Read the audio data
          audio = WavFile.readframes(n_frames)
          WavFile.close()

          # Convert audio data to numpy array
          dtype_map = {1: np.int8, 2: np.int16, 3: np.int32, 4: np.int32}
          audio_np = np.frombuffer(audio, dtype=dtype_map[sample_width])

          # Reshape audio if stereo
          if channels == 2:
              audio_np = audio_np.reshape(-1, 2)

          # Play the audio
          sd.play(audio_np, framerate)
          sd.wait()

          return 0
      except Exception as e:
          x = str(e)
          print('Error: <<Playing the Wav>>: ', x)

          return 1

playWav:

Purpose:

This method plays audio data stored in a WAV file format.

What it Does:

Reads the audio data from a WAV file object.
Extracts parameters like the number of channels, sample width, and frame rate.
Converts the audio data into a format that the sound device can process.
If the audio is stereo (two channels), it reshapes the data for playback.
Plays the audio through the speakers.
Returns 0 on success or 1 if there was an error.

The playWav method takes audio data from a WAV file and plays it through your computer’s speakers. It reads the data and converts it into a format your speakers can understand. If there’s an issue playing the audio, it prints an error message.

  def audioPlayerWorker(self, queue):
      try:
          while True:
              var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
              print('Response Audio Time: ', str(var))
              audio_bytes = queue.get()
              if audio_bytes is None:
                  break
              self.playWav(audio_bytes)
              queue.task_done()

          return 0
      except Exception as e:
          x = str(e)
          print('Error: <<Audio Player Worker>>: ', x)

          return 1

audioPlayerWorker:

Purpose:

This method continuously plays audio from a queue until there is no more audio to play.

What it Does:

It enters an infinite loop to keep checking for audio data in the queue.
Retrieves audio data from the queue and plays it using the “playWav”-method.
Logs the current time each time an audio response is played.
It breaks the loop if it encounters a None value, indicating no more audio to play.
Returns 0 on success or 1 if there was an error.

The audioPlayerWorker method keeps checking a queue for new audio to play. It plays each piece of audio as it comes in and stops when there’s no more audio. If there’s an error during playback, it prints an error message.

  async def processChunk(self, chText, url_3, headers):
      try:
          sarvamAPIKey = self.sarvamAPIKey
          model_1 = self.model_1
          langCode_1 = self.langCode_1
          speakerName = self.speakerName

          print()
          print('Chunk Response: ')
          vText = chText.replace('*','').replace(':',' , ')
          print(vText)

          payload_3 = {
              "inputs": [vText],
              "target_language_code": langCode_1,
              "speaker": speakerName,
              "pitch": 0.15,
              "pace": 0.95,
              "loudness": 2.1,
              "speech_sample_rate": 16000,
              "enable_preprocessing": True,
              "model": model_1
          }
          response_3 = requests.request("POST", url_3, json=payload_3, headers=headers)
          audio_data = response_3.text
          data = json.loads(audio_data)
          byte_data = data['audios'][0]
          audio_bytes = base64.b64decode(byte_data)

          return audio_bytes
      except Exception as e:
          x = str(e)
          print('Error: <<Process Chunk>>: ', x)
          audio_bytes = base64.b64decode('')

          return audio_bytes

processChunk:

Purpose:

This asynchronous method processes a chunk of text to generate audio using an external API.

What it Does:

Cleans up the text chunk by removing unwanted characters.
Prepares a payload with the cleaned text and other parameters required for text-to-speech conversion.
Sends a POST request to an external API to generate audio from the text.
Decodes the audio data received from the API (in base64 format) into raw audio bytes.
Returns the audio bytes or an empty byte string if there is an error.

The processChunk method takes a text, sends it to an external service to be converted into speech, and returns the audio data. If something goes wrong, it prints an error message.

  async def processAudio(self, audio):
      try:
          model_2 = self.model_2
          model_3 = self.model_3
          url_1 = self.url_1
          url_2 = self.url_2
          url_3 = self.url_3
          sarvamAPIKey = self.sarvamAPIKey
          audioFile = self.audioFile
          WavFile = self.WavFile
          langCode_1 = self.langCode_1
          langCode_2 = self.langCode_2
          speakerGender = self.speakerGender

          headers = {
              "api-subscription-key": sarvamAPIKey
          }

          audio_queue = Queue()
          data = {
              "model": model_2,
              "prompt": templateVal_1
          }
          files = {
              "file": (audioFile, open(WavFile, "rb"), "audio/wav")
          }

          response_1 = requests.post(url_1, headers=headers, data=data, files=files)
          tempDert = json.loads(response_1.text)
          regionalT = tempDert['transcript']
          langCd = tempDert['language_code']
          statusCd = response_1.status_code
          payload_2 = {
              "input": regionalT,
              "source_language_code": langCode_2,
              "target_language_code": langCode_1,
              "speaker_gender": speakerGender,
              "mode": "formal",
              "model": model_3,
              "enable_preprocessing": True
          }

          response_2 = requests.request("POST", url_2, json=payload_2, headers=headers)
          regionalT_2 = response_2.text
          data_ = json.loads(regionalT_2)
          regionalText = data_['translated_text']
          chunked_response = self.chunkBengaliResponse(regionalText)

          audio_thread = Thread(target=self.audioPlayerWorker, args=(audio_queue,))
          audio_thread.start()

          for chText in chunked_response:
              audio_bytes = await self.processChunk(chText, url_3, headers)
              audio_queue.put(audio_bytes)

          audio_queue.join()
          audio_queue.put(None)
          audio_thread.join()

          var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
          print('Retrieval Time: ', str(var))

          return 0

      except Exception as e:
          x = str(e)
          print('Error: <<Processing Audio>>: ', x)

          return 1

processAudio:

Purpose:

This asynchronous method handles the complete audio processing workflow, including speech recognition, translation, and audio playback.

What it Does:

Initializes various configurations and headers required for processing.
Sends the recorded audio to an API to get the transcript and detected language.
Translates the transcript into another language using another API.
Splits the translated text into smaller chunks using the chunkBengaliResponse method.
Starts an audio playback thread to play each processed audio chunk.
Sends each text chunk to the processChunk method to convert to speech and adds the audio data to the queue for playback.
Waits for all audio chunks to be processed and played before finishing.
Logs the current time when the process is complete.
Returns 0 on success or 1 if there was an error.

The “processAudio”-method takes recorded audio, recognizes what was said, translates it into another language, splits the translated text into parts, converts each part into speech, and plays it back. It uses different services to do this; if there’s a problem at any step, it prints an error message.

And, here is the performance stats (Captured from Sarvam AI website) –

So, finally, we’ve done it. You can view the complete code in this GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse.

Till then, Happy Avenging! 🙂

Enable OpenAI chatbot with the selected YouTube video content using LangChain, FAISS & YouTube data-API.

Posted on May 28, 2023May 29, 2023 by SatyakiDe in ai, api, audio, BOT, chaining, cloud, code, computing, Data Science, design, faiss, function, human, IoT, json, machine-learning, mobile, Model, natural-language, objects, openai, Python, Real-time, Technology, video, voice, youtubedataapi

Today, I’m very excited to demonstrate an effortless & new way to extract the transcript from YouTube videos & then answer the questions based on the topics selected by the users. In this post, I plan to deal with the user inputs to consider the case first & then it can summarize the video content through useful advanced analytics with the help of the LangChain & OpenAI-based model.

In this post, I’ve directly subscribed to OpenAI & I’m not using OpenAI from Azure. However, I’ll explore that in the future as well.
Before I explain the process to invoke this new library, why not view the demo first & then discuss it?

Demo

Isn’t it very exciting? This will lead to a whole new ballgame, where one can get critical decision-making information from these human sources along with their traditional advanced analytical data.

How will it help?

Let’s say as per your historical data & analytics, the dashboard is recommending prod-A, prod-B & prod-C as the top three products for potential top-performing brands. Whereas, you are getting some alerts from the TV news on prod-B due to the recent incidents. So, in that case, you don’t want to continue with the prod-B investment. You may find a new product named prod-Z. That may reduce the risk of your investment.

What is LangChain?

LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model but will also be:

Data-aware: connect a language model to other sources of data
Agentic: allow a language model to interact with its environment

The LangChain framework works around these principles.

To know more about this, please click the following link.

As you can see, this is one of the critical components in our solution, which will bind the OpenAI bot & it will feed the necessary data to provide the correct response.

What is FAISS?

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that do not fit in RAM. It also has supporting code for evaluation and parameter tuning.

Faiss developed using C++ with complete wrappers for Python—some of the most beneficial algorithms available both on CPU & in GPU as well. Facebook AI Research develops it.

To know more about this, please click the following link.

FLOW OF EVENTS:

Let us look at the flow diagram as it captures the sequence of events that unfold as part of the process.

Here are the steps that will follow in sequence –

The application will first get the topic on which it needs to look from YouTube & find the top 5 videos using the YouTube data-API.

Once the application returns a list of websites from the above step, LangChain will drive the application will extract the transcripts from the video & then optimize the response size in smaller chunks to address the costly OpenAI calls. During this time, it will invoke FAISS to create document DBs.

Finally, it will send those chunks to OpenAI for the best response based on your supplied template that performs the final analysis with small data required for your query & gets the appropriate response with fewer costs.

CODE:

Why don’t we go through the code made accessible due to this new library for this particular use case?

clsConfigClient.py (This is the main calling Python script for the input parameters.)

	################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 15-May-2020 ####
	#### Modified On: 28-May-2023 ####
	#### ####
	#### Objective: This script is a config ####
	#### file, contains all the keys for ####
	#### personal OpenAI-based video content ####
	#### enable bot. ####
	#### ####
	################################################

	import os
	import platform as pl

	class clsConfigClient(object):
	Curr_Path = os.path.dirname(os.path.realpath(__file__))

	os_det = pl.system()
	if os_det == "Windows":
	sep = '\\'
	else:
	sep = '/'

	conf = {
	'APP_ID': 1,
	'ARCH_DIR': Curr_Path + sep + 'arch' + sep,
	'PROFILE_PATH': Curr_Path + sep + 'profile' + sep,
	'LOG_PATH': Curr_Path + sep + 'log' + sep,
	'DATA_PATH': Curr_Path + sep + 'data' + sep,
	'MODEL_PATH': Curr_Path + sep + 'model' + sep,
	'TEMP_PATH': Curr_Path + sep + 'temp' + sep,
	'MODEL_DIR': 'model',
	'APP_DESC_1': 'LangChain Demo!',
	'DEBUG_IND': 'N',
	'INIT_PATH': Curr_Path,
	'FILE_NAME': 'Output.csv',
	'MODEL_NAME': 'gpt-3.5-turbo',
	'OPEN_AI_KEY': "sk-kfrjfijdrkidjkfjd9474nbfjfkfjfhfhf84i84hnfhjdbv6Bgvv",
	'YOUTUBE_KEY': "AIjfjfUYGe64hHJ-LOFO5u-mkso9pPOJGFU",
	'TITLE': "LangChain Demo!",
	'TEMP_VAL': 0.2,
	'PATH' : Curr_Path,
	'MAX_CNT' : 5,
	'OUT_DIR': 'data'
	}

view raw

clsConfigClient.py

hosted with ❤ by GitHub

Some of the key entries from the above scripts are as follows –

'MODEL_NAME': 'gpt-3.5-turbo',
'OPEN_AI_KEY': "sk-kfrjfijdrkidjkfjd9474nbfjfkfjfhfhf84i84hnfhjdbv6Bgvv",
'YOUTUBE_KEY': "AIjfjfUYGe64hHJ-LOFO5u-mkso9pPOJGFU",
'TEMP_VAL': 0.2,

From the above code snippet, one can understand that we need both the API keys for YouTube & OpenAI. And they have separate costs & usage, which I’ll share later in the post. Also, notice that the temperature sets to 0.2 ( range between 0 to 1). That means our AI bot will be consistent in response. And our application will use the GPT-3.5-turbo model for its analytic response.

clsTemplate.py (Contains all the templates for OpenAI.)

	################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 27-May-2023 ####
	#### Modified On: 28-May-2023 ####
	#### ####
	#### Objective: This script is a config ####
	#### file, contains all the template for ####
	#### OpenAI prompts to get the correct ####
	#### response. ####
	#### ####
	################################################

	# Template to use for the system message prompt
	templateVal_1 = """
	You are a helpful assistant that that can answer questions about youtube videos
	based on the video's transcript: {docs}

	Only use the factual information from the transcript to answer the question.

	If you feel like you don't have enough information to answer the question, say "I don't know".

	Your answers should be verbose and detailed.
	"""

view raw

clsTemplate.py

hosted with ❤ by GitHub

The above code is self-explanatory. Here, we’re keeping the correct instructions for our OpenAI to respond within these guidelines.

clsVideoContentScrapper.py (Main class to extract the transcript from the YouTube videos & then answer the questions based on the topics selected by the users.)

	#####################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 27-May-2023 ####
	#### Modified On 28-May-2023 ####
	#### ####
	#### Objective: This is the main calling ####
	#### python class that will invoke the ####
	#### LangChain of package to extract ####
	#### the transcript from the YouTube videos & ####
	#### then answer the questions based on the ####
	#### topics selected by the users. ####
	#### ####
	#####################################################

	from langchain.document_loaders import YoutubeLoader
	from langchain.text_splitter import RecursiveCharacterTextSplitter
	from langchain.embeddings.openai import OpenAIEmbeddings
	from langchain.vectorstores import FAISS
	from langchain.chat_models import ChatOpenAI
	from langchain.chains import LLMChain

	from langchain.prompts.chat import (
	ChatPromptTemplate,
	SystemMessagePromptTemplate,
	HumanMessagePromptTemplate,
	)

	from googleapiclient.discovery import build

	import clsTemplate as ct
	from clsConfigClient import clsConfigClient as cf

	import os

	###############################################
	### Global Section ###
	###############################################
	open_ai_Key = cf.conf['OPEN_AI_KEY']
	os.environ["OPENAI_API_KEY"] = open_ai_Key
	embeddings = OpenAIEmbeddings(openai_api_key=open_ai_Key)

	YouTube_Key = cf.conf['YOUTUBE_KEY']
	youtube = build('youtube', 'v3', developerKey=YouTube_Key)

	# Disbling Warning
	def warn(args, *kwargs):
	pass

	import warnings
	warnings.warn = warn

	###############################################
	### End of Global Section ###
	###############################################

	class clsVideoContentScrapper:
	def __init__(self):
	self.model_name = cf.conf['MODEL_NAME']
	self.temp_val = cf.conf['TEMP_VAL']
	self.max_cnt = int(cf.conf['MAX_CNT'])

	def createDBFromYoutubeVideoUrl(self, video_url):
	try:
	loader = YoutubeLoader.from_youtube_url(video_url)
	transcript = loader.load()

	text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
	docs = text_splitter.split_documents(transcript)

	db = FAISS.from_documents(docs, embeddings)
	return db

	except Exception as e:
	x = str(e)
	print('Error: ', x)
	return ''


	def getResponseFromQuery(self, db, query, k=4):
	try:
	"""
	gpt-3.5-turbo can handle up to 4097 tokens. Setting the chunksize to 1000 and k to 4 maximizes
	the number of tokens to analyze.
	"""

	mod_name = self.model_name
	temp_val = self.temp_val

	docs = db.similarity_search(query, k=k)
	docs_page_content = " ".join([d.page_content for d in docs])

	chat = ChatOpenAI(model_name=mod_name, temperature=temp_val)

	# Template to use for the system message prompt
	template = ct.templateVal_1

	system_message_prompt = SystemMessagePromptTemplate.from_template(template)

	# Human question prompt
	human_template = "Answer the following question: {question}"
	human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

	chat_prompt = ChatPromptTemplate.from_messages(
	[system_message_prompt, human_message_prompt]
	)

	chain = LLMChain(llm=chat, prompt=chat_prompt)

	response = chain.run(question=query, docs=docs_page_content)
	response = response.replace("\n", "")
	return response, docs

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return '', ''

	def topFiveURLFromYouTube(self, service, **kwargs):
	try:
	video_urls = []
	channel_list = []
	results = service.search().list(**kwargs).execute()

	for item in results['items']:
	print("Title: ", item['snippet']['title'])
	print("Description: ", item['snippet']['description'])
	channel = item['snippet']['channelId']
	print("Channel Id: ", channel)

	# Fetch the channel name using the channel ID
	channel_response = service.channels().list(part='snippet',id=item['snippet']['channelId']).execute()
	channel_title = channel_response['items'][0]['snippet']['title']
	print("Channel Title: ", channel_title)
	channel_list.append(channel_title)

	print("Video Id: ", item['id']['videoId'])
	vidURL = "https://www.youtube.com/watch?v=" + item['id']['videoId']
	print("Video URL: " + vidURL)
	video_urls.append(vidURL)
	print("\n")

	return video_urls, channel_list

	except Exception as e:
	video_urls = []
	channel_list = []
	x = str(e)
	print('Error: ', x)

	return video_urls, channel_list

	def extractContentInText(self, topic, query):
	try:
	discussedTopic = []
	strKeyText = ''
	cnt = 0
	max_cnt = self.max_cnt

	urlList, channelList = self.topFiveURLFromYouTube(youtube, q=topic, part='id,snippet',maxResults=max_cnt,type='video')
	print('Returned List: ')
	print(urlList)
	print()

	for video_url in urlList:
	print('Processing Video: ')
	print(video_url)
	db = self.createDBFromYoutubeVideoUrl(video_url)

	response, docs = self.getResponseFromQuery(db, query)

	if len(response) > 0:
	strKeyText = 'As per the topic discussed in ' + channelList[cnt] + ', '
	discussedTopic.append(strKeyText + response)

	cnt += 1

	return discussedTopic
	except Exception as e:
	discussedTopic = []
	x = str(e)
	print('Error: ', x)

	return discussedTopic

view raw

clsVideoContentScrapper.py

hosted with ❤ by GitHub

Let us understand the key methods step by step in detail –

def topFiveURLFromYouTube(self, service, **kwargs):
    try:
        video_urls = []
        channel_list = []
        results = service.search().list(**kwargs).execute()

        for item in results['items']:
            print("Title: ", item['snippet']['title'])
            print("Description: ", item['snippet']['description'])
            channel = item['snippet']['channelId']
            print("Channel Id: ", channel)

            # Fetch the channel name using the channel ID
            channel_response = service.channels().list(part='snippet',id=item['snippet']['channelId']).execute()
            channel_title = channel_response['items'][0]['snippet']['title']
            print("Channel Title: ", channel_title)
            channel_list.append(channel_title)

            print("Video Id: ", item['id']['videoId'])
            vidURL = "https://www.youtube.com/watch?v=" + item['id']['videoId']
            print("Video URL: " + vidURL)
            video_urls.append(vidURL)
            print("\n")

        return video_urls, channel_list

    except Exception as e:
        video_urls = []
        channel_list = []
        x = str(e)
        print('Error: ', x)

        return video_urls, channel_list

The above code will fetch the most relevant YouTube URLs & bind them into a list along with the channel names & then share the lists with the main functions.

def createDBFromYoutubeVideoUrl(self, video_url):
    try:
        loader = YoutubeLoader.from_youtube_url(video_url)
        transcript = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        docs = text_splitter.split_documents(transcript)

        db = FAISS.from_documents(docs, embeddings)
        return db

    except Exception as e:
        x = str(e)
        print('Error: ', x)
        return ''

The provided Python code defines a function createDBFromYoutubeVideoUrl which appears to create a database of text documents from the transcript of a YouTube video. Here’s the explanation in simple English:

The function createDBFromYoutubeVideoUrl has defined with one argument: video_url.
The function uses a try-except block to handle any potential exceptions or errors that may occur.
Inside the try block, the following steps are going to perform:

First, it creates a YoutubeLoader object from the provided video_url. This object is likely responsible for interacting with the YouTube video specified by the URL.
The loader object then loads the transcript of the video. This object is the text version of everything spoken in the video.
It then creates a RecursiveCharacterTextSplitter object with a specified chunk_size of 1000 and chunk_overlap of 100. This object may split the transcript into smaller chunks (documents) of text for easier processing or analysis. Each piece will be around 1000 characters long, and there will overlap of 100 characters between consecutive chunks.
The split_documents method of the text_splitter object will split the transcript into smaller documents. These documents are stored in the docs variable.
The FAISS.from_documents method is then called with docs and embeddings as arguments to create a FAISS (Facebook AI Similarity Search) index. This index is a database used for efficient similarity search and clustering of high-dimensional vectors, which in this case, are the embeddings of the documents. The FAISS index is stored in the db variable.
Finally, the db variable is returned, representing the created database from the video transcript.

4. If an exception occurs during the execution of the try block, the code execution moves to the except block:

Here, it first converts the exception e to a string x.
Then it prints an error message.
Finally, it returns an empty string as an indication of the error.

def getResponseFromQuery(self, db, query, k=4):
      try:
          """
          gpt-3.5-turbo can handle up to 4097 tokens. Setting the chunksize to 1000 and k to 4 maximizes
          the number of tokens to analyze.
          """

          mod_name = self.model_name
          temp_val = self.temp_val

          docs = db.similarity_search(query, k=k)
          docs_page_content = " ".join([d.page_content for d in docs])

          chat = ChatOpenAI(model_name=mod_name, temperature=temp_val)

          # Template to use for the system message prompt
          template = ct.templateVal_1

          system_message_prompt = SystemMessagePromptTemplate.from_template(template)

          # Human question prompt
          human_template = "Answer the following question: {question}"
          human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

          chat_prompt = ChatPromptTemplate.from_messages(
              [system_message_prompt, human_message_prompt]
          )

          chain = LLMChain(llm=chat, prompt=chat_prompt)

          response = chain.run(question=query, docs=docs_page_content)
          response = response.replace("\n", "")
          return response, docs

      except Exception as e:
          x = str(e)
          print('Error: ', x)

          return '', ''

The Python function getResponseFromQuery is designed to search a given database (db) for a specific query and then generate a response using a language model (possibly GPT-3.5-turbo). The answer is based on the content found and the particular question. Here is a simple English summary:

The function getResponseFromQuery takes three parameters: db, query, and k. The k parameter is optional and defaults to 4 if not provided. db is the database to search, the query is the question or prompts to analyze, and k is the number of similar items to return.
The function initiates a try-except block for handling any errors that might occur.
Inside the try block:

The function retrieves the model name and temperature value from the instance of the class this function is a part of.
The function then searches the db database for documents similar to the query and saves these in docs.
It concatenates the content of the returned documents into a single string docs_page_content.
It creates a ChatOpenAI object with the model name and temperature value.
It creates a system message prompt from a predefined template.
It creates a human message prompt, which is the query.
It combines these two prompts to form a chat prompt.
An LLMChain object is then created using the ChatOpenAI object and the chat prompt.
This LLMChain object is used to generate a response to the query using the content of the documents found in the database. The answer is then formatted by replacing all newline characters with empty strings.
Finally, the function returns this response along with the original documents.

If any error occurs during these operations, the function goes to the except block where:

The error message is printed.
The function returns two empty strings to indicate an error occurred, and no response or documents could be produced.

def extractContentInText(self, topic, query):
    try:
        discussedTopic = []
        strKeyText = ''
        cnt = 0
        max_cnt = self.max_cnt

        urlList, channelList = self.topFiveURLFromYouTube(youtube, q=topic, part='id,snippet',maxResults=max_cnt,type='video')
        print('Returned List: ')
        print(urlList)
        print()

        for video_url in urlList:
            print('Processing Video: ')
            print(video_url)
            db = self.createDBFromYoutubeVideoUrl(video_url)

            response, docs = self.getResponseFromQuery(db, query)

            if len(response) > 0:
                strKeyText = 'As per the topic discussed in ' + channelList[cnt] + ', '
                discussedTopic.append(strKeyText + response)

            cnt += 1

        return discussedTopic
    except Exception as e:
        discussedTopic = []
        x = str(e)
        print('Error: ', x)

        return discussedTopic

This Python function, extractContentInText, is aimed to extract relevant content from the transcripts of top YouTube videos on a specific topic and generate responses to a given query. Here’s a simple English translation:

The function extractContentInText is defined with topic and query as parameters.
It begins with a try-except block to catch and handle any possible exceptions.
In the try block:

It initializes several variables: an empty list discussedTopic to store the extracted information, an empty string strKeyText to keep specific parts of the content, a counter cnt initialized at 0, and max_cnt retrieved from the self-object to specify the maximum number of YouTube videos to consider.
It calls the topFiveURLFromYouTube function (defined previously) to get the URLs of the top videos on the given topic from YouTube. It also retrieves the list of channel names associated with these videos.
It prints the returned list of URLs.
Then, it starts a loop over each URL in the urlList.
- For each URL, it prints the URL, then creates a database from the transcript of the YouTube video using the function createDBFromYoutubeVideoUrl.
- It then uses the getResponseFromQuery function to get a response to the query based on the content of the database.
- If the length of the response is greater than 0 (meaning there is a response), it forms a string strKeyText to indicate the channel that the topic was discussed on and then appends the answer to this string. This entire string is then added to the discussedTopic list.
- It increments the counter cnt by one after each iteration.
- Finally, it returns the discussedTopic list, which now contains relevant content extracted from the videos.

If any error occurs during these operations, the function goes into the except block:

It first resets discussedTopic to an empty list.
Then it converts the exception e to a string and prints the error message.
Lastly, it returns the empty discussedTopic list, indicating that no content could be extracted due to the error.

testLangChain.py (Main Python script to extract the transcript from the YouTube videos & then answer the questions based on the topics selected by the users.)

	#####################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 27-May-2023 ####
	#### Modified On 28-May-2023 ####
	#### ####
	#### Objective: This is the main calling ####
	#### python script that will invoke the ####
	#### clsVideoContentScrapper class to extract ####
	#### the transcript from the YouTube videos. ####
	#### ####
	#####################################################

	import clsL as cl
	from clsConfigClient import clsConfigClient as cf
	import datetime
	import textwrap

	import clsVideoContentScrapper as cvsc

	# Disbling Warning
	def warn(args, *kwargs):
	pass

	import warnings
	warnings.warn = warn

	######################################
	### Get your global values ####
	######################################
	debug_ind = 'Y'

	# Initiating Logging Instances
	clog = cl.clsL()

	data_path = cf.conf['DATA_PATH']
	data_file_name = cf.conf['FILE_NAME']

	cVCScrapper = cvsc.clsVideoContentScrapper()

	######################################
	#### Global Flag ########
	######################################

	def main():
	try:
	var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
	print(''120)
	print('Start Time: ' + str(var))
	print(''120)

	#query = "What are they saying about Microsoft?"
	print('Please share your topic!')
	inputTopic = input('User: ')
	print('Please ask your questions?')
	inputQry = input('User: ')
	print()

	retList = cVCScrapper.extractContentInText(inputTopic, inputQry)
	cnt = 0

	for discussedTopic in retList:
	finText = str(cnt + 1) + ') ' + discussedTopic
	print()
	print(textwrap.fill(finText, width=150))

	cnt += 1

	r1 = len(retList)

	if r1 > 0:
	print()
	print('Successfully Scrapped!')
	else:
	print()
	print('Failed to Scrappe!')

	print(''120)
	var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
	print('End Time: ' + str(var1))

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	if __name__ == "__main__":
	main()

view raw

testLangChain.py

hosted with ❤ by GitHub

Please find the key snippet –

def main():
    try:
        var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('*'*120)
        print('Start Time: ' + str(var))
        print('*'*120)

        #query = "What are they saying about Microsoft?"
        print('Please share your topic!')
        inputTopic = input('User: ')
        print('Please ask your questions?')
        inputQry = input('User: ')
        print()

        retList = cVCScrapper.extractContentInText(inputTopic, inputQry)
        cnt = 0

        for discussedTopic in retList:
            finText = str(cnt + 1) + ') ' + discussedTopic
            print()
            print(textwrap.fill(finText, width=150))

            cnt += 1

        r1 = len(retList)

        if r1 > 0:
            print()
            print('Successfully Scrapped!')
        else:
            print()
            print('Failed to Scrappe!')

        print('*'*120)
        var1 = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        print('End Time: ' + str(var1))

    except Exception as e:
        x = str(e)
        print('Error: ', x)

if __name__ == "__main__":
    main()

The above main application will capture the topics from the user & then will give the user a chance to ask specific questions on the topics, invoking the main class to extract the transcript from YouTube & then feed it as a source using ChainLang & finally deliver the response. If there is no response, then it will skip the overall options.

USAGE & COST FACTOR:

Please find the OpenAI usage –

Please find the YouTube API usage –

So, finally, we’ve done it.

I know that this post is relatively bigger than my earlier post. But, I think, you can get all the details once you go through it.

You will get the complete codebase in the following GitHub link.

I’ll bring some more exciting topics in the coming days from the Python verse. Please share & subscribe to my post & let me know your feedback.

Till then, Happy Avenging! 🙂

Note: All the data & scenarios posted here are representational data & scenarios & available over the internet & for educational purposes only. Some of the images (except my photo) we’ve used are available over the net. We don’t claim ownership of these images. There is always room for improvement & especially in the prediction quality. Sample video taken from Santrel Media & you would find the link over here.

Real-time augmented reality (AR) using Python-based Computer Vision

Posted on June 25, 2022 by SatyakiDe in analytic function, api, audio, Azure, break, call, cloud, code, combining, Computer-Vision, computing, Crossplatform, Data Science, design, exposure, features, function, gui, human, IoT, machine-learning, matplotlib, mobile, Model, numpy, objects, Open-CV, Pandas, pygame, Python, Real-time, return, snippet, Technology, video

Hi Team,

Today, I’m going to discuss another Computer Vision installment. I’ll discuss how to implement Augmented Reality using Open-CV Computer Vision with full audio. We will be using part of a Bengali OTT Series called “Feludar Goendagiri” entirely for educational purposes & also as a tribute to the great legendary director, late Satyajit Roy. To know more about him, please click the following link.

Why don’t we see the demo first before jumping into the technical details?

Demo

Architecture:

Let us understand the architecture –

The above diagram shows that the application, which uses the Open-CV, analyzes individual frames from the source & blends that with the video trailer. Finally, it creates another video by correctly mixing the source audio.

Python Packages:

Following are the python packages that are necessary to develop this brilliant use case –

pip install opencv-python
pip install pygame

CODE:

Let us now understand the code. For this use case, we will only discuss three python scripts. However, we need more than these three. However, we have already discussed them in some of the early posts. Hence, we will skip them here.

clsAugmentedReality.py (This is the main class of python script that will embed the source video with the WebCAM streams in real-time.)

	##################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 20-Jun-2022 ####
	#### Modified On 25-Jun-2022 ####
	#### ####
	#### Objective: This is the main class of ####
	#### python script that will embed the source ####
	#### video with the WebCAM streams in ####
	#### real-time. ####
	##################################################

	# Importing necessary packages
	import numpy as np
	import cv2

	from clsConfig import clsConfig as cf

	# Initialize our cached reference points
	CACHED_REF_PTS = None

	class clsAugmentedReality:
	def __init__(self):
	self.TOP_LEFT_X = int(cf.conf['TOP_LEFT_X'])
	self.TOP_LEFT_Y = int(cf.conf['TOP_LEFT_Y'])
	self.TOP_RIGHT_X = int(cf.conf['TOP_RIGHT_X'])
	self.TOP_RIGHT_Y = int(cf.conf['TOP_RIGHT_Y'])
	self.BOTTOM_RIGHT_X = int(cf.conf['BOTTOM_RIGHT_X'])
	self.BOTTOM_RIGHT_Y = int(cf.conf['BOTTOM_RIGHT_Y'])
	self.BOTTOM_LEFT_X = int(cf.conf['BOTTOM_LEFT_X'])
	self.BOTTOM_LEFT_Y = int(cf.conf['BOTTOM_LEFT_Y'])

	def getWarpImages(self, frame, source, cornerIDs, arucoDict, arucoParams, zoomFlag, useCache=False):
	try:
	# Assigning values
	TOP_LEFT_X = self.TOP_LEFT_X
	TOP_LEFT_Y = self.TOP_LEFT_Y
	TOP_RIGHT_X = self.TOP_RIGHT_X
	TOP_RIGHT_Y = self.TOP_RIGHT_Y
	BOTTOM_RIGHT_X = self.BOTTOM_RIGHT_X
	BOTTOM_RIGHT_Y = self.BOTTOM_RIGHT_Y
	BOTTOM_LEFT_X = self.BOTTOM_LEFT_X
	BOTTOM_LEFT_Y = self.BOTTOM_LEFT_Y

	# Grab a reference to our cached reference points
	global CACHED_REF_PTS

	if source is None:
	raise

	# Grab the width and height of the frame and source image,
	# respectively
	# Extracting Frame from Camera
	# Exracting Source from Video
	(imgH, imgW) = frame.shape[:2]
	(srcH, srcW) = source.shape[:2]

	# Detect Aruco markers in the input frame
	(corners, ids, rejected) = cv2.aruco.detectMarkers(frame, arucoDict, parameters=arucoParams)

	print('Ids: ', str(ids))
	print('Rejected: ', str(rejected))

	# if we did not find our four ArUco markers, initialize an
	# empty IDs list, otherwise flatten the ID list
	print('Detecting Corners: ', str(len(corners)))
	ids = np.array([]) if len(corners) != 4 else ids.flatten()

	# Initialize our list of reference points
	refPts = []
	refPtTL1 = []

	# Loop over the IDs of the ArUco markers in Top-Left, Top-Right,
	# Bottom-Right, and Bottom-Left order
	for i in cornerIDs:
	# Grab the index of the corner with the current ID
	j = np.squeeze(np.where(ids == i))

	# If we receive an empty list instead of an integer index,
	# then we could not find the marker with the current ID
	if j.size == 0:
	continue

	# Otherwise, append the corner (x, y)-coordinates to our list
	# of reference points
	corner = np.squeeze(corners[j])
	refPts.append(corner)

	# Check to see if we failed to find the four ArUco markers
	if len(refPts) != 4:
	# If we are allowed to use cached reference points, fall
	# back on them
	if useCache and CACHED_REF_PTS is not None:
	refPts = CACHED_REF_PTS

	# Otherwise, we cannot use the cache and/or there are no
	# previous cached reference points, so return early
	else:
	return None

	# If we are allowed to use cached reference points, then update
	# the cache with the current set
	if useCache:
	CACHED_REF_PTS = refPts

	# Unpack our Aruco reference points and use the reference points
	# to define the Destination transform matrix, making sure the
	# points are specified in Top-Left, Top-Right, Bottom-Right, and
	# Bottom-Left order
	(refPtTL, refPtTR, refPtBR, refPtBL) = refPts
	dstMat = [refPtTL[0], refPtTR[1], refPtBR[2], refPtBL[3]]
	dstMat = np.array(dstMat)

	# For zoom option recalculating all the 4 points
	refPtTL1_L_X = refPtTL[0][0]-TOP_LEFT_X
	refPtTL1_L_Y = refPtTL[0][1]-TOP_LEFT_Y

	refPtTL1.append((refPtTL1_L_X,refPtTL1_L_Y))

	refPtTL1_R_X = refPtTL[1][0]+TOP_RIGHT_X
	refPtTL1_R_Y = refPtTL[1][1]+TOP_RIGHT_Y

	refPtTL1.append((refPtTL1_R_X,refPtTL1_R_Y))

	refPtTD1_L_X = refPtTL[2][0]+BOTTOM_RIGHT_X
	refPtTD1_L_Y = refPtTL[2][1]+BOTTOM_RIGHT_Y

	refPtTL1.append((refPtTD1_L_X,refPtTD1_L_Y))

	refPtTD1_R_X = refPtTL[3][0]-BOTTOM_LEFT_X
	refPtTD1_R_Y = refPtTL[3][1]+BOTTOM_LEFT_Y

	refPtTL1.append((refPtTD1_R_X,refPtTD1_R_Y))

	dstMatMod = [refPtTL1[0], refPtTL1[1], refPtTL1[2], refPtTL1[3]]
	dstMatMod = np.array(dstMatMod)

	# Define the transform matrix for the source image in Top-Left,
	# Top-Right, Bottom-Right, and Bottom-Left order
	srcMat = np.array([[0, 0], [srcW, 0], [srcW, srcH], [0, srcH]])

	# Compute the homography matrix and then warp the source image to
	# the destination based on the homography depending upon the
	# zoom flag
	if zoomFlag == 1:
	(H, _) = cv2.findHomography(srcMat, dstMat)
	else:
	(H, _) = cv2.findHomography(srcMat, dstMatMod)

	warped = cv2.warpPerspective(source, H, (imgW, imgH))

	# Construct a mask for the source image now that the perspective
	# warp has taken place (we'll need this mask to copy the source
	# image into the destination)
	mask = np.zeros((imgH, imgW), dtype="uint8")
	if zoomFlag == 1:
	cv2.fillConvexPoly(mask, dstMat.astype("int32"), (255, 255, 255), cv2.LINE_AA)
	else:
	cv2.fillConvexPoly(mask, dstMatMod.astype("int32"), (255, 255, 255), cv2.LINE_AA)

	# This optional step will give the source image a black
	# border surrounding it when applied to the source image, you
	# can apply a dilation operation
	rect = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
	mask = cv2.dilate(mask, rect, iterations=2)

	# Create a three channel version of the mask by stacking it
	# depth-wise, such that we can copy the warped source image
	# into the input image
	maskScaled = mask.copy() / 255.0
	maskScaled = np.dstack([maskScaled] * 3)

	# Copy the warped source image into the input image by
	# (1) Multiplying the warped image and masked together,
	# (2) Then multiplying the original input image with the
	# mask (giving more weight to the input where there
	# are not masked pixels), and
	# (3) Adding the resulting multiplications together
	warpedMultiplied = cv2.multiply(warped.astype("float"), maskScaled)
	imageMultiplied = cv2.multiply(frame.astype(float), 1.0 – maskScaled)
	output = cv2.add(warpedMultiplied, imageMultiplied)
	output = output.astype("uint8")

	# Return the output frame to the calling function
	return output

	except Exception as e:

	# Delibarately raising the issue
	# That way the control goes to main calling methods
	# exception section
	raise

view raw

clsAugmentedReality.py

hosted with ❤ by GitHub

Please find the key snippet from the above script –

(imgH, imgW) = frame.shape[:2]
(srcH, srcW) = source.shape[:2]

# Detect Aruco markers in the input frame
(corners, ids, rejected) = cv2.aruco.detectMarkers(frame, arucoDict, parameters=arucoParams)

Identifying the Aruco markers are key here. The above lines help the program detect all four corners.

However, let us discuss more on the Aruco markers & strategies that I’ve used for several different surfaces.

As you can see, the right-hand side Aruco marker is tiny compared to the left one. Hence, that one will be ideal for a curve surface like Coffee Mug, Bottle rather than a flat surface.

Also, we’ve demonstrated the zoom capability with the smaller Aruco marker that will Augment almost double the original surface area.

Let us understand why we need that; as you know, any spherical surface like a bottle is round-shaped. Hence, detecting relatively more significant Aruco markers in four corners will be difficult for any camera to identify.

Hence, we need a process where close four corners can be extrapolated mathematically to relatively larger projected areas easily detectable by any WebCAM.

Let’s observe the following figure –

As you can see that the original position of the four corners is represented using the following points, i.e., (x1, y1), (x2, y2), (x3, y3) & (x4, y4).

And these positions are very close to each other. Hence, it will be easier for the camera to detect all the points (like a plain surface) without many retries.

And later, you can add specific values of x & y to them to get the derived four corners as shown in the above figures through the following points, i.e. (x1.1, y1.1), (x2.1, y2.1), (x3.1, y3.1) & (x4.1, y4.1).

# Loop over the IDs of the ArUco markers in Top-Left, Top-Right,
# Bottom-Right, and Bottom-Left order
for i in cornerIDs:
  # Grab the index of the corner with the current ID
  j = np.squeeze(np.where(ids == i))

  # If we receive an empty list instead of an integer index,
  # then we could not find the marker with the current ID
  if j.size == 0:
    continue

  # Otherwise, append the corner (x, y)-coordinates to our list
  # of reference points
  corner = np.squeeze(corners[j])
  refPts.append(corner)

# Check to see if we failed to find the four ArUco markers
if len(refPts) != 4:
  # If we are allowed to use cached reference points, fall
  # back on them
  if useCache and CACHED_REF_PTS is not None:
    refPts = CACHED_REF_PTS

  # Otherwise, we cannot use the cache and/or there are no
  # previous cached reference points, so return early
  else:
    return None

# If we are allowed to use cached reference points, then update
# the cache with the current set
if useCache:
  CACHED_REF_PTS = refPts

# Unpack our Aruco reference points and use the reference points
# to define the Destination transform matrix, making sure the
# points are specified in Top-Left, Top-Right, Bottom-Right, and
# Bottom-Left order
(refPtTL, refPtTR, refPtBR, refPtBL) = refPts
dstMat = [refPtTL[0], refPtTR[1], refPtBR[2], refPtBL[3]]
dstMat = np.array(dstMat)

In the above snippet, the application will scan through all the points & try to detect Aruco markers & then create a list of reference points, which will later be used to define the destination transformation matrix.

# For zoom option recalculating all the 4 points
refPtTL1_L_X = refPtTL[0][0]-TOP_LEFT_X
refPtTL1_L_Y = refPtTL[0][1]-TOP_LEFT_Y

refPtTL1.append((refPtTL1_L_X,refPtTL1_L_Y))

refPtTL1_R_X = refPtTL[1][0]+TOP_RIGHT_X
refPtTL1_R_Y = refPtTL[1][1]+TOP_RIGHT_Y

refPtTL1.append((refPtTL1_R_X,refPtTL1_R_Y))

refPtTD1_L_X = refPtTL[2][0]+BOTTOM_RIGHT_X
refPtTD1_L_Y = refPtTL[2][1]+BOTTOM_RIGHT_Y

refPtTL1.append((refPtTD1_L_X,refPtTD1_L_Y))

refPtTD1_R_X = refPtTL[3][0]-BOTTOM_LEFT_X
refPtTD1_R_Y = refPtTL[3][1]+BOTTOM_LEFT_Y

refPtTL1.append((refPtTD1_R_X,refPtTD1_R_Y))

dstMatMod = [refPtTL1[0], refPtTL1[1], refPtTL1[2], refPtTL1[3]]
dstMatMod = np.array(dstMatMod)

The above snippets calculate the revised points for the zoom-out capabilities as discussed in one of the earlier figures.

# Define the transform matrix for the *source* image in Top-Left,
# Top-Right, Bottom-Right, and Bottom-Left order
srcMat = np.array([[0, 0], [srcW, 0], [srcW, srcH], [0, srcH]])

The above snippet will create a transformation matrix for the video trailer.

# Compute the homography matrix and then warp the source image to
# the destination based on the homography depending upon the
# zoom flag
if zoomFlag == 1:
  (H, _) = cv2.findHomography(srcMat, dstMat)
else:
  (H, _) = cv2.findHomography(srcMat, dstMatMod)

warped = cv2.warpPerspective(source, H, (imgW, imgH))

# Construct a mask for the source image now that the perspective
# warp has taken place (we'll need this mask to copy the source
# image into the destination)
mask = np.zeros((imgH, imgW), dtype="uint8")
if zoomFlag == 1:
  cv2.fillConvexPoly(mask, dstMat.astype("int32"), (255, 255, 255), cv2.LINE_AA)
else:
  cv2.fillConvexPoly(mask, dstMatMod.astype("int32"), (255, 255, 255), cv2.LINE_AA)

# This optional step will give the source image a black
# border surrounding it when applied to the source image, you
# can apply a dilation operation
rect = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
mask = cv2.dilate(mask, rect, iterations=2)

# Create a three channel version of the mask by stacking it
# depth-wise, such that we can copy the warped source image
# into the input image
maskScaled = mask.copy() / 255.0
maskScaled = np.dstack([maskScaled] * 3)

# Copy the warped source image into the input image by
# (1) Multiplying the warped image and masked together,
# (2) Then multiplying the original input image with the
#     mask (giving more weight to the input where there
#     are not masked pixels), and
# (3) Adding the resulting multiplications together
warpedMultiplied = cv2.multiply(warped.astype("float"), maskScaled)
imageMultiplied = cv2.multiply(frame.astype(float), 1.0 - maskScaled)
output = cv2.add(warpedMultiplied, imageMultiplied)
output = output.astype("uint8")

Finally, depending upon the zoom flag, the application will create a warped image surrounded by an optionally black border.

clsEmbedVideoWithStream.py (This is the main class of python script that will invoke the clsAugmentedReality class to initiate augment reality after splitting the audio & video & then project them via the Web-CAM with a seamless broadcast.)

	##################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 22-Jun-2022 ####
	#### Modified On 25-Jun-2022 ####
	#### ####
	#### Objective: This is the main class of ####
	#### python script that will invoke the ####
	#### clsAugmentedReality class to initiate ####
	#### augment reality after splitting the ####
	#### audio & video & then project them via ####
	#### the Web-CAM with a seamless broadcast. ####
	##################################################

	# Importing necessary packages
	import clsAugmentedReality as ar
	from clsConfig import clsConfig as cf

	from imutils.video import VideoStream
	from collections import deque

	import imutils
	import time
	import cv2
	import subprocess
	import os
	import pygame
	import time
	import threading
	import sys

	###############################################
	### Global Section ###
	###############################################
	# Instantiating the dependant class

	x1 = ar.clsAugmentedReality()

	###############################################
	### End of Global Section ###
	###############################################

	class BreakLoop(Exception):
	pass

	class clsEmbedVideoWithStream:
	def __init__(self):
	self.sep = str(cf.conf['SEP'])
	self.Curr_Path = str(cf.conf['INIT_PATH'])
	self.FileName = str(cf.conf['FILE_NAME'])
	self.CacheL = int(cf.conf['CACHE_LIM'])
	self.FileName_1 = str(cf.conf['FILE_NAME_1'])
	self.audioLen = int(cf.conf['audioLen'])
	self.audioFreq = float(cf.conf['audioFreq'])
	self.videoFrame = float(cf.conf['videoFrame'])
	self.stopFlag=cf.conf['stopFlag']
	self.zFlag=int(cf.conf['zoomFlag'])
	self.title = str(cf.conf['TITLE'])

	def playAudio(self, audioFile, audioLen, freq, stopFlag=False):
	try:
	pygame.mixer.init()
	pygame.init()
	pygame.mixer.music.load(audioFile)

	pygame.mixer.music.set_volume(10)

	val = int(audioLen)
	i = 0

	while i < val:
	pygame.mixer.music.play(loops=0, start=float(i))
	time.sleep(freq)

	i = i + 1

	if (i >= val):
	raise BreakLoop

	if (stopFlag==True):
	raise BreakLoop

	return 0
	except BreakLoop as s:
	return 0
	except Exception as e:
	x = str(e)
	print(x)

	return 1

	def extractAudio(self, video_file, output_ext="mp3"):
	try:
	"""Converts video to audio directly using `ffmpeg` command
	with the help of subprocess module"""
	filename, ext = os.path.splitext(video_file)
	subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"],
	stdout=subprocess.DEVNULL,
	stderr=subprocess.STDOUT)

	return 0
	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return 1

	def processStream(self, debugInd, var):
	try:
	sep = self.sep
	Curr_Path = self.Curr_Path
	FileName = self.FileName
	CacheL = self.CacheL
	FileName_1 = self.FileName_1
	audioLen = self.audioLen
	audioFreq = self.audioFreq
	videoFrame = self.videoFrame
	stopFlag = self.stopFlag
	zFlag = self.zFlag
	title = self.title

	print('audioFreq:')
	print(str(audioFreq))

	print('videoFrame:')
	print(str(videoFrame))

	# Construct the source for Video & Temporary Audio
	videoFile = Curr_Path + sep + 'Video' + sep + FileName
	audioFile = Curr_Path + sep + 'Video' + sep + FileName_1

	# Load the Aruco dictionary and grab the Aruco parameters
	print("[INFO] initializing marker detector…")
	arucoDict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_ARUCO_ORIGINAL)
	arucoParams = cv2.aruco.DetectorParameters_create()

	# Initialize the video file stream
	print("[INFO] accessing video stream…")
	vf = cv2.VideoCapture(videoFile)

	x = self.extractAudio(videoFile)

	if x == 0:
	print('Successfully Audio extracted from the source file!')
	else:
	print('Failed to extract the source audio!')

	# Initialize a queue to maintain the next frame from the video stream
	Q = deque(maxlen=128)

	# We need to have a frame in our queue to start our augmented reality
	# pipeline, so read the next frame from our video file source and add
	# it to our queue
	(grabbed, source) = vf.read()
	Q.appendleft(source)

	# Initialize the video stream and allow the camera sensor to warm up
	print("[INFO] starting video stream…")
	vs = VideoStream(src=0).start()

	time.sleep(2.0)
	flg = 0

	t = threading.Thread(target=self.playAudio, args=(audioFile, audioLen, audioFreq, stopFlag,))
	t.daemon = True

	try:
	# Loop over the frames from the video stream
	while len(Q) > 0:
	try:
	# Grab the frame from our video stream and resize it
	frame = vs.read()
	frame = imutils.resize(frame, width=1020)

	# Attempt to find the ArUCo markers in the frame, and provided
	# they are found, take the current source image and warp it onto
	# input frame using our augmented reality technique
	warped = x1.getWarpImages(
	frame, source,
	cornerIDs=(923, 1001, 241, 1007),
	arucoDict=arucoDict,
	arucoParams=arucoParams,
	zoomFlag=zFlag,
	useCache=CacheL > 0)

	# If the warped frame is not None, then we know (1) we found the
	# four ArUCo markers and (2) the perspective warp was successfully
	# applied
	if warped is not None:
	# Set the frame to the output augment reality frame and then
	# grab the next video file frame from our queue
	frame = warped
	source = Q.popleft()

	if flg == 0:

	t.start()
	flg = flg + 1

	# For speed/efficiency, we can use a queue to keep the next video
	# frame queue ready for us — the trick is to ensure the queue is
	# always (or nearly full)
	if len(Q) != Q.maxlen:
	# Read the next frame from the video file stream
	(grabbed, nextFrame) = vf.read()

	# If the frame was read (meaning we are not at the end of the
	# video file stream), add the frame to our queue
	if grabbed:
	Q.append(nextFrame)

	# Show the output frame
	cv2.imshow(title, frame)
	time.sleep(videoFrame)

	# If the `q` key was pressed, break from the loop
	if cv2.waitKey(2) & 0xFF == ord('q'):
	stopFlag = True
	break

	except BreakLoop:
	raise BreakLoop
	except Exception as e:
	pass

	if (len(Q) == Q.maxlen):
	time.sleep(2)
	break

	except BreakLoop as s:
	print('Processed completed!')

	# Performing cleanup at the end
	cv2.destroyAllWindows()
	vs.stop()

	except Exception as e:
	x = str(e)
	print(x)

	# Performing cleanup at the end
	cv2.destroyAllWindows()
	vs.stop()

	return 0
	except Exception as e:
	x = str(e)
	print('Error:', x)

	return 1

view raw

clsEmbedVideoWithStream.py

hosted with ❤ by GitHub

Please find the key snippet from the above script –

def playAudio(self, audioFile, audioLen, freq, stopFlag=False):
  try:
    pygame.mixer.init()
    pygame.init()
    pygame.mixer.music.load(audioFile)

    pygame.mixer.music.set_volume(10)

    val = int(audioLen)
    i = 0

    while i < val:
      pygame.mixer.music.play(loops=0, start=float(i))
      time.sleep(freq)

      i = i + 1

      if (i >= val):
        raise BreakLoop

      if (stopFlag==True):
        raise BreakLoop

    return 0
  except BreakLoop as s:
    return 0
  except Exception as e:
    x = str(e)
    print(x)

    return 1

The above function will initiate the pygame library to run the sound of the video file that has been extracted as part of a separate process.

def extractAudio(self, video_file, output_ext="mp3"):
    try:
        """Converts video to audio directly using `ffmpeg` command
        with the help of subprocess module"""
        filename, ext = os.path.splitext(video_file)
        subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.STDOUT)

        return 0
    except Exception as e:
        x = str(e)
        print('Error: ', x)

        return 1

The above function temporarily extracts the audio file from the source trailer video.

# Initialize the video file stream
print("[INFO] accessing video stream...")
vf = cv2.VideoCapture(videoFile)

x = self.extractAudio(videoFile)

if x == 0:
    print('Successfully Audio extracted from the source file!')
else:
    print('Failed to extract the source audio!')

# Initialize a queue to maintain the next frame from the video stream
Q = deque(maxlen=128)

# We need to have a frame in our queue to start our augmented reality
# pipeline, so read the next frame from our video file source and add
# it to our queue
(grabbed, source) = vf.read()
Q.appendleft(source)

# Initialize the video stream and allow the camera sensor to warm up
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()

time.sleep(2.0)
flg = 0

The above snippets read the frames from the video file after invoking the audio extraction. Then, it uses a Queue method to store all the video frames for better performance. And finally, it starts consuming the standard streaming video from the WebCAM to augment the trailer video on top of it.

t = threading.Thread(target=self.playAudio, args=(audioFile, audioLen, audioFreq, stopFlag,))
t.daemon = True

Now, the application has instantiated an orphan thread to spin off the audio play function. The reason is to void the performance & video frame frequency impact on top of it.

while len(Q) > 0:
  try:
    # Grab the frame from our video stream and resize it
    frame = vs.read()
    frame = imutils.resize(frame, width=1020)

    # Attempt to find the ArUCo markers in the frame, and provided
    # they are found, take the current source image and warp it onto
    # input frame using our augmented reality technique
    warped = x1.getWarpImages(
      frame, source,
      cornerIDs=(923, 1001, 241, 1007),
      arucoDict=arucoDict,
      arucoParams=arucoParams,
      zoomFlag=zFlag,
      useCache=CacheL > 0)

    # If the warped frame is not None, then we know (1) we found the
    # four ArUCo markers and (2) the perspective warp was successfully
    # applied
    if warped is not None:
      # Set the frame to the output augment reality frame and then
      # grab the next video file frame from our queue
      frame = warped
      source = Q.popleft()

      if flg == 0:

        t.start()
        flg = flg + 1

    # For speed/efficiency, we can use a queue to keep the next video
    # frame queue ready for us -- the trick is to ensure the queue is
    # always (or nearly full)
    if len(Q) != Q.maxlen:
      # Read the next frame from the video file stream
      (grabbed, nextFrame) = vf.read()

      # If the frame was read (meaning we are not at the end of the
      # video file stream), add the frame to our queue
      if grabbed:
        Q.append(nextFrame)

    # Show the output frame
    cv2.imshow(title, frame)
    time.sleep(videoFrame)

    # If the `q` key was pressed, break from the loop
    if cv2.waitKey(2) & 0xFF == ord('q'):
      stopFlag = True
      break

  except BreakLoop:
    raise BreakLoop
  except Exception as e:
    pass

  if (len(Q) == Q.maxlen):
    time.sleep(2)
    break

The final segment will call the getWarpImages function to get the Augmented image on top of the video. It also checks for the upcoming frames & whether the source video is finished or not. In case of the end, the application will initiate a break method to come out from the infinite WebCAM read. Also, there is a provision for manual exit by pressing the ‘Q’ from the MacBook keyboard.

# Performing cleanup at the end
cv2.destroyAllWindows()
vs.stop()

It is always advisable to close your camera & remove any temporarily available windows that are still left once the application finishes the process.

augmentedMovieTrailer.py (Main calling script)

	#####################################################
	#### Written By: SATYAKI DE ####
	#### Written On: 22-Jun-2022 ####
	#### Modified On 25-Jun-2022 ####
	#### ####
	#### Objective: This is the main calling ####
	#### python script that will invoke the ####
	#### clsEmbedVideoWithStream class to initiate ####
	#### the augmented reality in real-time ####
	#### & display a trailer on top of any surface ####
	#### via Web-CAM. ####
	#####################################################

	# We keep the setup code in a different class as shown below.
	import clsEmbedVideoWithStream as evws

	from clsConfig import clsConfig as cf

	import datetime
	import logging

	###############################################
	### Global Section ###
	###############################################
	# Instantiating all the main class

	x1 = evws.clsEmbedVideoWithStream()

	###############################################
	### End of Global Section ###
	###############################################

	def main():
	try:
	# Other useful variables
	debugInd = 'Y'
	var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
	var1 = datetime.datetime.now()

	print('Start Time: ', str(var))
	# End of useful variables

	# Initiating Log Class
	general_log_path = str(cf.conf['LOG_PATH'])

	# Enabling Logging Info
	logging.basicConfig(filename=general_log_path + 'augmentedMovieTrailer.log', level=logging.INFO)

	print('Started augmenting videos!')

	# Execute all the pass
	r1 = x1.processStream(debugInd, var)

	if (r1 == 0):
	print('Successfully identified human emotions!')
	else:
	print('Failed to identify the human emotions!')

	var2 = datetime.datetime.now()

	c = var2 – var1
	minutes = c.total_seconds() / 60
	print('Total difference in minutes: ', str(minutes))

	print('End Time: ', str(var1))

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	if __name__ == "__main__":
	main()

view raw

augmentedMovieTrailer.py

hosted with ❤ by GitHub

The above script will initially instantiate the main calling class & then invoke the processStream function to create the Augmented Reality.

FOLDER STRUCTURE:

Here is the folder structure that contains all the files & directories in MAC O/S –

You will get the complete codebase in the following Github link.

If you want to know more about this legendary director & his famous work, please visit the following link.

I’ll bring some more exciting topic in the coming days from the Python verse. Please share & subscribe my post & let me know your feedback.

Till then, Happy Avenging! 🙂

Note: All the data & scenario posted here are representational data & scenarios & available over the internet & for educational purpose only. Some of the images (except my photo) that we’ve used are available over the net. We don’t claim the ownership of these images. There is an always room for improvement & especially the prediction quality.

Restoring old video’s with python-based application

Posted on December 29, 2021December 29, 2021 by SatyakiDe in api, audio, cloud, code, computing, Crossplatform, numpy, Pandas, Python, Technology, video

Hi Guys!

Today, I’ll be demonstrating a primary way to improve the quality of old video using the Open-CV package. This post is the first of such a series of Open-CV that I’ll be posting in the coming years.

Let me tell you one thing – there are many brilliant papers on this, especially image enhancement with OpenCV, Pillow & many more valuable libraries. I’ll share some of the fascinating links later at the end of my blog.

What are we planning here?

We’ll de-noise the old video.
Slightly bright the video.

What kind of video should be the ideal candidate for this test?

Any video with more noise with low light will be an ideal candidate for this use case.

Why don’t we see the demo?

Demo

Architecture:

Let us find the basic architecture –

Code:

Let us explore the the key code base as follows –

clsVideo2Frame.py (This will convert the supplied video into multiple frames. It will also extract the audio from the source file, which will later merge with the enhanced frames.)

	##############################################
	#### Updated By: SATYAKI DE ####
	#### Updated On: 12-Nov-2021 ####
	#### ####
	#### Objective: Consuming Streaming data ####
	#### from Ably channels & captured IoT ####
	#### events from the simulator & publish ####
	#### them in Kivy-I/OS App through ####
	#### measured KPIs. ####
	#### ####
	##############################################

	import av
	import os
	import platform as pl
	import subprocess
	import sys

	from clsConfig import clsConfig as cf

	os_det = pl.system()
	if os_det == "Windows":
	sep = '\\'
	else:
	sep = '/'

	class clsVideo2Frame:
	def __init__(self):
	self.fileNm = str(cf.conf['FILE_NAME'])
	self.base_path = str(cf.conf['INIT_PATH'])

	def convert_video_to_audio_ffmpeg(self, video_file, output_ext="mp3"):
	try:
	"""Converts video to audio directly using `ffmpeg` command
	with the help of subprocess module"""
	filename, ext = os.path.splitext(video_file)
	subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"],
	stdout=subprocess.DEVNULL,
	stderr=subprocess.STDOUT)

	return 0
	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return 1

	def genFrame(self, dInd, var):
	try:
	base_path = self.base_path
	fileNm = self.fileNm

	path_to_src_video = base_path + sep + 'Source' + sep + fileNm + '.mp4'
	temp_path = base_path + sep + 'Temp' + sep

	print('Path: ', path_to_src_video)

	x = self.convert_video_to_audio_ffmpeg(path_to_src_video)

	if x == 0:
	print('Successfully Audio extracted from the source file!')
	else:
	print('Failed to extract the source audio!')

	container = av.open(path_to_src_video)

	for frame in container.decode(video=0):
	frame.to_image().save(temp_path + 'frame-%04d.jpg' % frame.index)

	print('Successfully Converted to Frames!')

	return 0

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return 1

view raw

clsVideo2Frame.py

hosted with ❤ by GitHub

Let us understand some of the key snippet below –

    def convert_video_to_audio_ffmpeg(self, video_file, output_ext="mp3"):
        try:
            """Converts video to audio directly using `ffmpeg` command
            with the help of subprocess module"""
            filename, ext = os.path.splitext(video_file)
            subprocess.call(["ffmpeg", "-y", "-i", video_file, f"{filename}.{output_ext}"],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.STDOUT)

            return 0
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            return 1

In the above step, using the FFmpeg package python application is extracting the source audio & storing that into the source directory itself.

for frame in container.decode(video=0):
    frame.to_image().save(temp_path + 'frame-%04d.jpg' % frame.index)

From the above snippet, we can say that the application is splitting the videos into multiple frames & storing them into the temp directory, which will require later enhancement by another class.

2. clsFrameEnhance.py (This will enhance the frames as per your logic & upscale them with the parameters provided by you.)

	##############################################
	#### Updated By: SATYAKI DE ####
	#### Updated On: 12-Nov-2021 ####
	#### ####
	#### Objective: This python script will ####
	#### enhance the old existing frame by ####
	#### applying machine-learning algorithm ####
	#### to improve their quality one at a ####
	#### time. ####
	#### ####
	##############################################

	import av
	import os
	import platform as pl
	import numpy as np
	import cv2
	import glob
	from PIL import Image
	from numpy import asarray
	import numpy as np

	from clsConfig import clsConfig as cf

	import sys

	# Global Variable
	os_det = pl.system()

	if os_det == "Windows":
	sep = '\\'
	else:
	sep = '/'

	class clsFrameEnhance:
	def __init__(self):
	self.fileNm = str(cf.conf['FILE_NAME'])
	self.base_path = str(cf.conf['INIT_PATH'])

	def show(self, enhanced_path, fileNameOnly, buff):
	cv2.imwrite(enhanced_path + fileNameOnly, buff)

	def unsharp_mask(self, image, kernel_size=(3, 3), sigma=1.0, amount=2.0, threshold=2):
	"""Return a sharpened version of the image, using an unsharp mask."""
	blurred = cv2.GaussianBlur(image, kernel_size, sigma)
	sharpened = float(amount + 1) * image – float(amount) * blurred
	sharpened = np.maximum(sharpened, np.zeros(sharpened.shape))
	sharpened = np.minimum(sharpened, 255 * np.ones(sharpened.shape))
	sharpened = sharpened.round().astype(np.uint8)
	if threshold > 0:
	low_contrast_mask = np.absolute(image – blurred) < threshold
	np.copyto(sharpened, image, where=low_contrast_mask)
	return sharpened

	def doEnhance(self, dInd, var):
	try:

	base_path = self.base_path

	temp_path = base_path + sep + 'Temp' + sep
	enhanced_path = base_path + sep + 'Enhanced' + sep

	for filename in sorted(glob.glob(temp_path + '*.jpg')):

	print('Full File Name: ', str(filename))

	img = cv2.imread(filename)

	if img is None:
	print('Failed to load image file:', filename)
	sys.exit(1)

	sharpened_image = self.unsharp_mask(img)

	img = np.asarray(sharpened_image)

	dst = cv2.fastNlMeansDenoising(img,None,7,7,21)

	Inten_matrix = np.ones(dst.shape, dtype='uint8')*20

	bright_img = cv2.add(dst, Inten_matrix)

	head, tail = os.path.split(filename)

	self.show(enhanced_path, tail, bright_img)

	# Remove Files
	os.remove(filename)

	print('Successfully Enhanced the Frames!')

	return 0

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return 1

view raw

clsFrameEnhance.py

hosted with ❤ by GitHub

Let us understand some of the key snippet below –

    def unsharp_mask(self, image, kernel_size=(3, 3), sigma=1.0, amount=2.0, threshold=2):
        """Return a sharpened version of the image, using an unsharp mask."""
        blurred = cv2.GaussianBlur(image, kernel_size, sigma)
        sharpened = float(amount + 1) * image - float(amount) * blurred
        sharpened = np.maximum(sharpened, np.zeros(sharpened.shape))
        sharpened = np.minimum(sharpened, 255 * np.ones(sharpened.shape))
        sharpened = sharpened.round().astype(np.uint8)
        if threshold > 0:
            low_contrast_mask = np.absolute(image - blurred) < threshold
            np.copyto(sharpened, image, where=low_contrast_mask)
        return sharpened

This will provide the sharpen version of the image. If you want to know more about this. Please refer the following link.

img = np.asarray(sharpened_image)

dst = cv2.fastNlMeansDenoising(img,None,7,7,21)

Inten_matrix = np.ones(dst.shape, dtype='uint8')*20
bright_img = cv2.add(dst, Inten_matrix)

As you can see, the image has further enhanced with the use of de-noise & the addition of brightest pixels using Inten_matrix.

3. clsFrame2Video.py (This will combine the frames along with the audio & produce the final video.)

	###############################################
	#### Updated By: SATYAKI DE ####
	#### Updated On: 17-Dec-2021 ####
	#### ####
	#### Objective: This script will convert ####
	#### enhanced frames to restored better ####
	#### quality videos & merge it with source ####
	#### audio. ####
	#### ####
	###############################################

	import os
	import platform as pl
	import cv2
	import numpy as np
	import glob
	import re
	import ffmpeg

	from clsConfig import clsConfig as cf

	import logging

	os_det = pl.system()
	if os_det == "Windows":
	sep = '\\'
	else:
	sep = '/'

	class clsFrame2Video:
	def __init__(self):
	self.fileNm = str(cf.conf['FILE_NAME'])
	self.base_path = str(cf.conf['INIT_PATH'])

	def convert2Vid(self, dInd, var):
	try:
	img_array = []
	fileNm = self.fileNm
	base_path = self.base_path

	enhanced_path = base_path + sep + 'Enhanced' + sep
	target_path = base_path + sep + 'Target' + sep
	path_to_src_audio = base_path + sep + 'Source' + sep + fileNm + '.mp3'

	files = glob.glob(enhanced_path + '*.jpg')

	for filename in sorted(files, key=lambda x:float(re.findall("(-\d+)",x)[0].replace('-',''))):
	print('Processing… ', str(filename))
	img = cv2.imread(filename)
	height, width, layers = img.shape
	size = (width,height)
	img_array.append(img)

	# Deleting Frames
	os.remove(filename)

	print('Successfully Removed Old Enhanced Frames!')


	out = cv2.VideoWriter(target_path + 'Temp.avi',cv2.VideoWriter_fourcc(*'DIVX'), 23, size)

	for i in range(len(img_array)):
	out.write(img_array[i])
	out.release()

	print('Temporary File generated!')

	Temp_Target_File = str(target_path + 'Temp.avi')
	print('Temporary Video File Name: ', Temp_Target_File)
	print('Temporary Audio File Name: ', str(path_to_src_audio))

	infile1 = ffmpeg.input(Temp_Target_File)
	infile2 = ffmpeg.input(path_to_src_audio)

	ffmpeg.concat(infile1, infile2, v=1, a=1).output(target_path + fileNm + '.mp4').run()

	# Deleting Frames
	os.remove(Temp_Target_File)

	print('Successfully Converted to Videos!')

	return 0
	except Exception as e:
	x = str(e)
	print('Error: ', x)

	return 1

view raw

clsFrame2Video.py

hosted with ❤ by GitHub

Let’s explore the key snippets –

files = glob.glob(enhanced_path + '*.jpg')

for filename in sorted(files, key=lambda x:float(re.findall("(-\d+)",x)[0].replace('-',''))):
    print('Processing... ', str(filename))
    img = cv2.imread(filename)
    height, width, layers = img.shape
    size = (width,height)
    img_array.append(img)

    # Deleting Frames
    os.remove(filename)

print('Successfully Removed Old Enhanced Frames!')

In the above snippet, the application first stitched frames together to form a temporary video without the audio.

Temp_Target_File = str(target_path + 'Temp.avi')
print('Temporary Video File Name: ', Temp_Target_File)
print('Temporary Audio File Name: ', str(path_to_src_audio))

infile1 = ffmpeg.input(Temp_Target_File)
infile2 = ffmpeg.input(path_to_src_audio)

ffmpeg.concat(infile1, infile2, v=1, a=1).output(target_path + fileNm + '.mp4').run()


out = cv2.VideoWriter(target_path + 'Temp.avi',cv2.VideoWriter_fourcc(*'DIVX'), 23, size)

Finally, merge the audio with the video to produce the final enriched video.

4. restoreOldVideo.py (This is the main application, which will invoke all the python class.)

	###############################################
	#### Written By: SATYAKI DE ####
	#### Written On: 17-Dec-2021 ####
	#### Modified On 17-Dec-2021 ####
	#### ####
	#### Objective: This python script will ####
	#### convert the old B&W video & restore ####
	#### them to relatively better quality. ####
	###############################################

	# We keep the setup code in a different class as shown below.
	import clsVideo2Frame as vf
	import clsFrameEnhance as fe
	import clsFrame2Video as fv
	from clsConfig import clsConfig as cf

	import datetime
	import logging

	###############################################
	### Global Section ###
	###############################################
	# Instantiating all the three classes

	x1 = vf.clsVideo2Frame()
	x2 = fe.clsFrameEnhance()
	x3 = fv.clsFrame2Video()

	###############################################
	### End of Global Section ###
	###############################################

	def main():
	try:
	# Other useful variables
	debugInd = 'Y'
	var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
	var1 = datetime.datetime.now()

	print('Start Time: ', str(var))
	# End of useful variables

	# Initiating Log Class
	general_log_path = str(cf.conf['LOG_PATH'])

	# Enabling Logging Info
	logging.basicConfig(filename=general_log_path + 'restoreVideo.log', level=logging.INFO)

	print('Started Transformation!')

	# Execute all the pass
	r1 = x1.genFrame(debugInd, var)
	r2 = x2.doEnhance(debugInd, var)
	r3 = x3.convert2Vid(debugInd, var)

	if ((r1 == 0) and (r2 == 0) and (r3 == 0)):
	print('Successfully File Enhanced!')
	else:
	print('Failed to enhance the source file!')

	var2 = datetime.datetime.now()

	c = var2 – var1
	minutes = c.total_seconds() / 60
	print('Total difference in minutes: ', str(minutes))

	print('End Time: ', str(var1))

	except Exception as e:
	x = str(e)
	print('Error: ', x)

	if __name__ == "__main__":
	main()

view raw

restoreOldVideo.py

hosted with ❤ by GitHub

Let us understand the final script –

x1 = vf.clsVideo2Frame()
x2 = fe.clsFrameEnhance()
x3 = fv.clsFrame2Video()

The above lines will instantiate the main python class.

# Execute all the pass
r1 = x1.genFrame(debugInd, var)
r2 = x2.doEnhance(debugInd, var)
r3 = x3.convert2Vid(debugInd, var)

Invoking all the functions with parameters to perform the video upscale.

So, we’ve done it.

You will get the complete codebase in the following Github link.

I’ll bring some more exciting topic in the coming days from the Python verse. Please share & subscribe my post & let me know your feedback.

Till then, Happy Avenging!

Converting text to voice in Python

Posted on October 28, 2019January 2, 2021 by SatyakiDe in api, audio, cloud, function, json, member function, operating system, Python, sql

Hi Guys!

Today, we’ll be discussing one new post of converting text into a voice using some third-party APIs. This is particularly very useful in many such cases, where you can use this method to get more realistic communication.

There are many such providers, where you can get an almost realistic voice for both males & females. However, most of them are subscription-based. So, you have to be very careful about your budget & how to proceed.

For testing purposes, I’ll be using voice.org to simulate this.

Let’s look out the architecture of this process –

As you can see, the user-initiated the application & provide some input in the form of plain text. Once the data is given, the app will send it to the third-party API for the process. Now, the Third-party API will verify the authentication & then it will check all the associate parameters before it starting to generate the audio response. After that, it will send the payload & that will be received by the calling python application. Here, it will be decoded & create the audio file & finally, that will be played at the invoking computer.

This third-party API has lots of limitations. However, they are giving you the platform to test your concept.

As of now, they support the following languages – English, Chinese, Catalan, French, Finnish, Dutch, Danish, German, Italian, Japanese, Korean, Polish, Norwegian, Portuguese, Russian, Spanish & Sweedish.

In our case, we’ll be checking with English.

To work with this, you need to have the following modules installed in python –

playsound
requests
base64

Let’s see the directory structure –

Again, we are not going to discuss any script, which we’ve already discussed here.

Hence, we’re skipping clsL.py here.

1. clsConfig.py (This script contains all the parameters of the server.)

##############################################
#### Written By: SATYAKI DE               ####
#### Written On: 12-Oct-2019              ####
####                                      ####
#### Objective: This script is a config   ####
#### file, contains all the keys for      ####
#### azure cosmos db. Application will    ####
#### process these information & perform  ####
#### various CRUD operation on Cosmos DB. ####
##############################################

import os
import platform as pl

class clsConfig(object):
    Curr_Path = os.path.dirname(os.path.realpath(__file__))

    os_det = pl.system()
    if os_det == "Windows":
        sep = '\\'
    else:
        sep = '/'

    config = {
        'APP_ID': 1,
        'url': "https://voicerss-text-to-speech.p.rapidapi.com/",
        'host': "voicerss-text-to-speech.p.rapidapi.com",
        'api_key': "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        'targetFile': "Bot_decode.mp3",
        'pitch_speed': "-6",
        'bot_language': "en-us",
        'audio_type': "mp3",
        'audio_freq': "22khz_8bit_stereo",
        'query_string_api': "hhhhhhhhhhhhhhhhhhhhhhhhhhhh",
        'b64_encoding': True,
        'APP_DESC_1': 'Text to voice conversion.',
        'DEBUG_IND': 'N',
        'INIT_PATH': Curr_Path,
        'LOG_PATH': Curr_Path + sep + 'log' + sep
    }

For security reasons, sensitive information masked with the dummy value.

‘api_key’: “xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx”,
‘query_string_api’: “hhhhhhhhhhhhhhhhhhhhhhhhhhhh”,

This two information is private to each subscriber. Hence, I’ve removed them & updated with some dummy values.

You have to fill-up with your subscribed information.

2. clsText2Voice.py (This script will convert the text data into an audio file using a GET API request from the third-party API & then play that using the web media player.)

###############################################
#### Written By: SATYAKI DE                ####
#### Written On: 27-Oct-2019               ####
#### Modified On 27-Oct-2019               ####
####                                       ####
#### Objective: Main class converting      ####
#### text to voice using third-party API.  ####
###############################################

from playsound import playsound
import requests
import base64
from clsConfig import clsConfig as cf

class clsText2Voice:
    def __init__(self):
        self.url = cf.config['url']
        self.api_key = cf.config['api_key']
        self.targetFile = cf.config['targetFile']
        self.pitch_speed = cf.config['pitch_speed']
        self.bot_language = cf.config['bot_language']
        self.audio_type = cf.config['audio_type']
        self.audio_freq = cf.config['audio_freq']
        self.b64_encoding = cf.config['b64_encoding']
        self.query_string_api = cf.config['query_string_api']
        self.host = cf.config['host']

    def getAudio(self, srcString):
        try:
            url = self.url
            api_key = self.api_key
            tarFile = self.targetFile
            pitch_speed = self.pitch_speed
            bot_language = self.bot_language
            audio_type = self.audio_type
            audio_freq = self.audio_freq
            b64_encoding = self.b64_encoding
            query_string_api = self.query_string_api
            host = self.host

            querystring = {
                "r": pitch_speed,
                "c": audio_type,
                "f": audio_freq,
                "src": srcString,
                "hl": bot_language,
                "key": query_string_api,
                "b64": b64_encoding
            }

            headers = {
                'x-rapidapi-host': host,
                'x-rapidapi-key': api_key
            }

            response = requests.request("GET", url, headers=headers, params=querystring)

            # Converting to MP3
            targetFile = tarFile
            mp3File_64_decode = base64.decodebytes(bytes(response.text, encoding="utf-8"))
            mp3File_result = open(targetFile, 'wb')

            # create a writable mp3File and write the decoding result
            mp3File_result.write(mp3File_64_decode)
            mp3File_result.close()

            playsound(targetFile)

            return 0
        except Exception as e:
            x = str(e)
            print('Error: ', x)

            return 1

Few crucial lines from the above script –

querystring = {
    "r": pitch_speed,
    "c": audio_type,
    "f": audio_freq,
    "src": srcString,
    "hl": bot_language,
    "key": query_string_api,
    "b64": b64_encoding
}

You can configure the voice of the audio by adjusting all the configurations. And, the text content will receive at srcString. So, whatever user will be typing that will be directly captured here & form the JSON payload accordingly.

response = requests.request("GET", url, headers=headers, params=querystring)

In this case, you will be receiving the audio file in the form of a base64 text file. Hence, you need to convert them back to the sound file by these following lines –

# Converting to MP3
targetFile = tarFile
mp3File_64_decode = base64.decodebytes(bytes(response.text, encoding="utf-8"))
mp3File_result = open(targetFile, 'wb')

# create a writable mp3File and write the decoding result
mp3File_result.write(mp3File_64_decode)
mp3File_result.close()

As you can see that, we’ve extracted the response.text & then we’ve decoded that to byte object to form the mp3 sound file at the receiving end.

Once we have our mp3 file ready, the following line simply plays the audio record.

playsound(targetFile)

Thus you can hear the actual voice.

3. callText2Voice.py (This is the main script that will invoke the text to voice API & then playback the audio once it gets the response from the third-party API.)

###############################################
#### Written By: SATYAKI DE                ####
#### Written On: 27-Oct-2019               ####
#### Modified On 27-Oct-2019               ####
####                                       ####
#### Objective: Main class converting      ####
#### text to voice using third-party API.  ####
###############################################

from clsConfig import clsConfig as cf
import clsL as cl
import logging
import datetime
import clsText2Voice as ct

# Disbling Warning
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

var = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

def main():
    try:
        ret_2 = ''
        debug_ind = 'Y'

        general_log_path = str(cf.config['LOG_PATH'])

        # Enabling Logging Info
        logging.basicConfig(filename=general_log_path + 'consolidatedTwitter.log', level=logging.INFO)

        # Initiating Log Class
        l = cl.clsL()

        # Moving previous day log files to archive directory
        log_dir = cf.config['LOG_PATH']

        tmpR0 = "*" * 157

        logging.info(tmpR0)
        tmpR9 = 'Start Time: ' + str(var)
        logging.info(tmpR9)
        logging.info(tmpR0)

        print("Log Directory::", log_dir)
        tmpR1 = 'Log Directory::' + log_dir
        logging.info(tmpR1)

        # Query using parameters
        rawQry = str(input('Enter your string:'))

        x1 = ct.clsText2Voice()
        ret_2 = x1.getAudio(rawQry)

        if ret_2 == 0:
            print("Successfully converted from text to voice!")
            logging.info("Successfully converted from text to voice!")
            print("*" * 157)
            logging.info(tmpR0)
        else:
            print("Successfuly converted!")
            logging.info("Successfuly converted!")
            print("*" * 157)
            logging.info(tmpR0)

        print("*" * 157)
        logging.info(tmpR0)

        tmpR10 = 'End Time: ' + str(var)
        logging.info(tmpR10)
        logging.info(tmpR0)

    except ValueError:
        print("No relevant data to proceed!")
        logging.info("No relevant data to proceed!")

    except Exception as e:
        print("Top level Error: args:{0}, message{1}".format(e.args, e.message))

if __name__ == "__main__":
    main()

Essential lines from the above script –

# Query using parameters
rawQry = str(input('Enter your string:'))

x1 = ct.clsText2Voice()
ret_2 = x1.getAudio(rawQry)

As you can see, here the user will be passing the text content, which will be given to our class & then it will project the audio sound of that text.

Let’s see how it runs –

Input Text: Welcome to Satyaki De’s blog. This site mainly deals with the Python, SQL from different DBs & many useful areas from the leading cloud providers.

And, here is the run command under Windows OS looks like –

And, please find the sample voice that it generates –

So, We’ve done it! 😀

Let us know your comment on this.

So, we’ll come out with another exciting post in the coming days!

N.B.: This is demonstrated for RnD/study purposes. All the data posted here are representational data & available over the internet.

	The LLM Security Chr… on The LLM Security Chronicles…
	AGENTIC AI IN THE EN… on AGENTIC AI IN THE ENTERPRISE:…
	AGENTIC AI IN THE EN… on AGENTIC AI IN THE ENTERPRISE:…
	AGENTIC AI IN THE EN… on AGENTIC AI IN THE ENTERPRISE:…
	AGENTIC AI IN THE EN… on Agentic AI in the Enterprise:…

Category: audio

AGENTIC AI IN THE ENTERPRISE: STRATEGY, ARCHITECTURE, AND IMPLEMENTATION – PART 5

Like this:

Enable OpenAI chatbot with the selected YouTube video content using LangChain, FAISS & YouTube data-API.

Like this:

Restoring old video’s with python-based application

Like this:

Converting text to voice in Python

Like this:

Share this:

Like this:

Share this:

Like this:

1. Machine Learning

2. Deep Learning

3. Computer Vision

4. Natural Language Processing (NLP)

5. Planning and Decision-Making

6. Uncertainty and Reasoning

7. Simulation and Modeling

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: