Experience looking for various transcription services

January 9, 2024
Python

Experience looking for various transcription services #

Our team regularly records audio and video for internal documentation. We found various paid offerings, but as an experiment we wanted to try doing transcription in Python. This blog post summarizes our experience:

We tried the following:

  1. Veed.io
  2. Sphinx
  3. SpeechRecognition
  4. DeepSpeech

Following are the steps to set up SpeechRecognition.

Step 1: Installing prerequisite libraries #

Install the SpeechRecognition==3.8.1 version specifically so that recognize_google runs.

 pip install SpeechRecognition==3.8.1
 pip install pydub
 pip install librosa
 pip install IPython
 pip install SoundFile

Step 2: Convert the .m4a file format to .wav file format #

Type the following in a Python shell

 from pydub import AudioSegment

 sound = AudioSegment.from_file("audio.m4a", format="m4a")
 sound.export("audio.wav", format="wav")

To achieve the same result using a terminal command:

ffmpeg -i audio.m4a audio.wav

Step 3: Convert wave file into PCM format #

The recognizer works best when the audio is resampled to a 16 kHz sample rate and written as 16-bit PCM (subtype="PCM_16"). Librosa is a Python package for audio analysis. Type the following in a Python shell

 import librosa
 import IPython.display as ipd
 import soundfile as sf

 y, sr = librosa.load("audio.wav")
 ipd.Audio(y, rate=sr)

 print(f"y: {y}, sr: {sr}")

 data = librosa.resample(y, orig_sr=sr, target_sr=16000)
 ipd.Audio(data, rate=16000)

 sf.write("audio1.wav", data, 16000, subtype="PCM_16")

The same result can also be achieved using:

ffmpeg -i audio.wav -ar 16000 -acodec pcm_s16le audio1.wav
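Resampling changes the sample rate but not the duration, so the sample count scales by the rate ratio. A small plain-Python sketch of that arithmetic (the 44.1 kHz source rate and 5-second length here are hypothetical):

```python
# Resampling keeps duration constant: n_out = n_in * target_sr / orig_sr
orig_sr = 44100      # assumed source sample rate (typical for recordings)
target_sr = 16000    # the rate we resample to above
n_in = orig_sr * 5   # 5 seconds of audio at the source rate

n_out = round(n_in * target_sr / orig_sr)
print(n_out)              # samples after resampling
print(n_out / target_sr)  # duration in seconds, unchanged
```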

Step 4: Split the large audio file into chunks to improve transcription quality #

Type the following in a Python shell

 import os
 from pydub import AudioSegment

 audio = AudioSegment.from_file("audio1.wav")

 segment_length = 30 * 1000  # 30 seconds in milliseconds

 start_time = 0
 end_time = segment_length

 segments = []
 while start_time < len(audio):
     segment = audio[start_time:end_time]  # pydub clamps end_time to the clip length
     segments.append(segment)
     start_time += segment_length
     end_time += segment_length

 BASE_PATH = f"{os.getcwd()}/audio"
 if not os.path.exists(BASE_PATH):
     os.mkdir(BASE_PATH)

 for i, segment in enumerate(segments):
     segment.export(f"{BASE_PATH}/audio1_{i}.wav", format="wav")
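One subtlety with fixed 30-second windows: a final chunk shorter than the window should not be dropped, or the tail of the recording is never transcribed. The boundary arithmetic can be sketched in plain Python (the 95-second clip length is hypothetical):

```python
# Sketch of fixed-window chunk boundaries, keeping the final partial chunk.
segment_length = 30 * 1000   # 30 seconds in milliseconds
audio_length = 95 * 1000     # hypothetical 95-second recording

boundaries = []
start = 0
while start < audio_length:
    end = min(start + segment_length, audio_length)
    boundaries.append((start, end))
    start = end

print(boundaries)  # three full 30 s windows plus one 5 s tail
```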

Step 5: Transcribe the audio chunks into transcribed_text.txt #

 import os
 import speech_recognition as sr

 files = os.listdir(f"{os.getcwd()}/audio")
 sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
 paragraph = []

 with open('transcribed_text.txt', 'w') as text_file:
     print("Transcribing audio...")
     for num, name in enumerate(sorted_files):
         # transcribe audio file
         AUDIO_FILE = f"audio/{name}"

         # use the audio file as the audio source
         r = sr.Recognizer()
         r.energy_threshold = 4000

         with sr.AudioFile(AUDIO_FILE) as source:
             audio = r.record(source)  # read the entire audio file
             try:
                 text = r.recognize_google(audio, language='en-US')  # Recognize speech using Google Web Speech API
                 print(f"text{num}: {text}")
                 paragraph.append(text)
             except sr.UnknownValueError:
                 print("Speech Recognition could not understand the audio")
                 print(" ")
             except sr.RequestError as e:
                 print(f"Could not request results from Google Web Speech API; {e}")
                 print(" ")
     print(paragraph)
     text_file.write(" ".join(paragraph))
 print("Transcription complete. Text saved to 'transcribed_text.txt'")
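A plain string sort would place `audio1_10.wav` before `audio1_2.wav`, which is why the sort key above parses the numeric suffix out of each filename. A quick standalone check (filenames here are hypothetical):

```python
files = ["audio1_10.wav", "audio1_2.wav", "audio1_0.wav", "audio1_1.wav"]

# Lexicographic order compares character by character, so "10" sorts before "2"
print(sorted(files))

# Parsing the trailing chunk number restores the intended order
sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
print(sorted_files)
```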