In the rapidly evolving landscape of artificial intelligence, Sarvam AI has emerged as a pioneering force in developing language technologies for Indian languages. This article series aims to provide an in-depth look at Sarvam AI’s Indic APIs, exploring their features, performance, and potential impact on the Indian tech ecosystem.
These APIs aim to bridge the language divide in India’s digital landscape by providing powerful, accessible AI tools for Indic languages.
The importance of Indic language processing:
India has 22 official languages and hundreds of dialects, presenting a unique challenge for technology adoption and digital inclusion, even though government work is conducted in the official languages alongside English.
Developers can fine-tune the models for specific domains or use cases, improving accuracy for specialized applications.
Supported languages and use cases:
As of 2024, Sarvam AI’s Indic APIs support the following languages:
- Hindi
- Bengali
- Tamil
- Telugu
- Marathi
- Gujarati
- Kannada
- Malayalam
- Punjabi
- Odia
Before delving into the details, I strongly recommend taking a look at the demo.
Isn’t this exciting? Let us understand the flow of events in the following diagram –

The application interacts with Sarvam AI’s API. After capturing the initial audio input from the computer’s microphone, it calls Sarvam AI’s API to get the answer in the selected Indic language, Bengali.
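Before diving into the actual script, the flow above can be sketched in a few lines of Python. Note that the function names below (speech_to_text, query_llm, text_to_speech) are placeholders of my own for illustration, not actual Sarvam AI endpoints; the stubs stand in for the real API calls.

```python
# Hypothetical sketch of the event flow described above. These stubs stand
# in for real speech-to-text, LLM, and text-to-speech calls; the names are
# placeholders, not actual Sarvam AI endpoints.
def speech_to_text(audio_bytes, language):
    # Stub: a real implementation would send the audio to an STT service.
    return "<transcribed text>"

def query_llm(text, language):
    # Stub: a real implementation would query the LLM with the transcript.
    return f"<answer to: {text}>"

def text_to_speech(text, language):
    # Stub: a real implementation would synthesize audio for the answer.
    return b"<audio for: " + text.encode("utf-8") + b">"

def conversation_turn(audio_bytes, language="bn-IN"):
    """One turn of the voice assistant: audio in, spoken answer out."""
    transcript = speech_to_text(audio_bytes, language)
    answer = query_llm(transcript, language)
    return text_to_speech(answer, language)
```

Each stage is independent, which is why the real script can swap in Google’s recognizer for the STT stage while Sarvam AI handles the rest.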
Package Installation:
pip install SpeechRecognition==3.10.4
pip install pydub==0.25.1
pip install sounddevice==0.5.0
pip install numpy==1.26.4
pip install soundfile==0.12.1
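As a quick way to exercise numpy and the audio stack without a live microphone, here is a small sketch (my own addition, not part of the article’s code) that writes a one-second test tone using the standard-library wave module. The 16 kHz mono format is an assumption; it is a common choice for speech pipelines.

```python
import wave
import numpy as np

SAMPLE_RATE = 16000  # assumption: 16 kHz mono, a common rate for speech

def write_test_tone(path, freq_hz=440, seconds=1.0):
    # Generate a sine wave and scale it to 16-bit PCM range.
    t = np.linspace(0, seconds, int(SAMPLE_RATE * seconds), endpoint=False)
    samples = (0.2 * np.sin(2 * np.pi * freq_hz * t) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(samples.tobytes())
    return len(samples)
```

A clip like this can be fed through the rest of the pipeline to confirm the packages installed correctly before involving real speech.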
Code:
clsSarvamAI.py (This script captures audio input in an Indic language and returns the LLM’s response as audio in the same language. In this post, we’ll cover only a few important methods; the remaining ones will follow in the next part.)
# Imports these methods rely on:
import asyncio
import os
import time

import speech_recognition as sr

class BreakOuterLoop(Exception):
    """Raised to break out of the nested listening loop."""
    pass

# ... the following methods belong to the clsSarvamAI class:
def initializeMicrophone(self):
    try:
        for index, name in enumerate(sr.Microphone.list_microphone_names()):
            print(f"Microphone with name \"{name}\" found (device_index={index})")
        return sr.Microphone()
    except Exception as e:
        print('Error: <<Initiating Microphone>>: ', str(e))
        return ''

def realTimeTranslation(self):
    try:
        WavFile = self.WavFile
        recognizer = sr.Recognizer()
        try:
            microphone = self.initializeMicrophone()
        except Exception as e:
            print(f"Error initializing microphone: {e}")
            return
        with microphone as source:
            print("Adjusting for ambient noise. Please wait...")
            recognizer.adjust_for_ambient_noise(source, duration=5)
            print("Microphone initialized. Start speaking...")
            try:
                while True:
                    try:
                        print("Listening...")
                        audio = recognizer.listen(source, timeout=5, phrase_time_limit=5)
                        print("Audio captured. Recognizing...")
                        self.createWavFile(audio, WavFile)
                        try:
                            text = recognizer.recognize_google(audio, language="bn-BD")  # Bengali language code
                            if not text:
                                print("No speech detected. Please try again.")
                                continue
                            sentences = text.split('।')  # Split on the Bengali full stop
                            print('Sentences: ')
                            print(sentences)
                            print('*' * 120)
                            if str(text).lower() == 'টাটা':
                                raise BreakOuterLoop("Based on User Choice!")
                            asyncio.run(self.processAudio(audio))
                        except sr.UnknownValueError:
                            print("Google Speech Recognition could not understand audio")
                        except sr.RequestError as e:
                            print(f"Could not request results from Google Speech Recognition service; {e}")
                    except sr.WaitTimeoutError:
                        print("No speech detected within the timeout period. Listening again...")
                    except BreakOuterLoop:
                        raise
                    except Exception as e:
                        print(f"An unexpected error occurred: {e}")
                    time.sleep(1)  # Short pause before the next iteration
            except BreakOuterLoop as e:
                print(f"Exited : {e}")
        # Remove the temporary audio file generated at the beginning
        os.remove(WavFile)
        return 0
    except Exception as e:
        print('Error: <<Real-time Translation>>: ', str(e))
        return 1

initializeMicrophone:
Purpose:
This method is responsible for setting up and initializing the microphone for audio input.
What it Does:
- It attempts to list all available microphones connected to the system.
- It prints the microphone’s name and corresponding device index (a unique identifier) for each microphone.
- If successful, it returns a microphone object (sr.Microphone()), which can be used later to capture audio.
- If this process encounters an error (e.g., no microphones found or an internal failure), it catches the exception, prints an error message, and returns an empty string ('').
The “initializeMicrophone” method finds all microphones connected to the computer and prints their names. If it finds one, it prepares it for recording; if something goes wrong, it reports what went wrong and stops the process.
realTimeTranslation:
Purpose:
This method uses the microphone to handle real-time speech translation from a user. It captures spoken audio, converts it into text, and processes it further.
What it Does:
- Initializes a recognizer object (sr.Recognizer()) for speech recognition.
- Calls initializeMicrophone to set up the microphone. If initialization fails, an error message is printed, and the process stops.
- Once the microphone is set up successfully, it adjusts for ambient noise to enhance accuracy.
- Enters a loop to continuously listen for audio input from the user:
- It waits for the user to speak and captures the audio.
- Converts the captured audio to text using Google’s Speech Recognition service, specifying Bengali as the language.
- If text is successfully captured and recognized:
- Splits the text into sentences using the Bengali full-stop character.
- Prints the sentences.
- It checks if the text is a specific word (“টাটা”), and if so, it raises an exception to stop the loop (indicating that the user wants to exit).
- Otherwise, it processes the audio asynchronously with processAudio.
- If no speech is detected or an error occurs, it prints the relevant message and continues listening.
- If the user decides to exit or if an error occurs, it breaks out of the loop, deletes any temporary audio files created, and returns a status code (0 for success, 1 for failure).
The “realTimeTranslation” method continuously listens to the microphone for the user to speak. It captures what is said and tries to understand it using Google’s service, specifically for the Bengali language. It then splits what was said into sentences and prints them out. If the user says “টাটা” (which means “goodbye” in Bengali), it stops listening and exits. If it cannot understand the user or if there is a problem, it will let the user know and try again. It will print an error and stop the process if something goes wrong.
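Two pieces referenced above are worth sketching explicitly. The createWavFile method isn’t shown in this part, so the version below is a minimal hypothetical stand-in (SpeechRecognition’s AudioData exposes get_wav_data(), which returns a complete WAV byte stream); the sentence-splitting and exit-word helpers mirror the string logic inside the loop.

```python
def create_wav_file(audio, wav_path):
    # Hypothetical stand-in for createWavFile (not shown in this part):
    # dump the captured audio to disk. sr.AudioData.get_wav_data() returns
    # a full WAV byte stream (header + PCM frames), so no encoding is needed.
    with open(wav_path, "wb") as handle:
        handle.write(audio.get_wav_data())

BENGALI_FULL_STOP = '।'   # the danda, Bengali's sentence terminator
EXIT_WORD = 'টাটা'         # "goodbye" -- the word that ends the session

def split_sentences(text):
    # Split on the danda and drop empty fragments left by a trailing stop.
    return [part.strip() for part in text.split(BENGALI_FULL_STOP) if part.strip()]

def should_exit(text):
    # Bengali script has no letter case, so a plain comparison is enough.
    return text.strip() == EXIT_WORD
```

Because Bengali script has no upper/lower case, the lower() call in the original exit check is effectively a no-op; a plain equality test captures the same intent.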
That’s all for this part; stay tuned for the next one.
Note: All the data & scenarios posted here are representational, publicly available & intended for educational purposes only. There is always room for improvement in this kind of model & the solution associated with it. I’ve shown the basic approach for educational purposes only.