Experience looking for various transcription services
January 9, 2024
Our team regularly records audio and video for internal documentation. We found various paid offerings, but as an experiment we wanted to try doing transcription in Python. This blog post summarizes our experience.
We tried the SpeechRecognition library. The following are the steps to set it up.
Step 1: Installing prerequisite libraries #
Install the SpeechRecognition==3.8.1 version specifically, so that recognize_google works.
pip install SpeechRecognition==3.8.1
pip install pydub
pip install librosa
pip install IPython
pip install SoundFile
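The installs above can also be collected into a requirements.txt; only the SpeechRecognition pin comes from this post, the other packages are left unpinned:

```text
SpeechRecognition==3.8.1
pydub
librosa
IPython
SoundFile
```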
Step 2: Convert the .m4a file format to .wav file format #
Type the following in a Python shell:
from pydub import AudioSegment  # pydub uses ffmpeg under the hood for m4a

sound = AudioSegment.from_file("audio.m4a", format="m4a")
sound.export("audio.wav", format="wav")
To achieve the same result using a terminal command:
ffmpeg -i audio.m4a audio.wav
Step 3: Convert wave file into PCM format #
Librosa is a Python package for audio analysis. It requires the audio file to be at 16 kHz and in subtype="PCM_16" format. Type the following in a Python shell:
import librosa
import soundfile as sf
import IPython.display as ipd

y, sr = librosa.load("audio.wav")
ipd.Audio(y, rate=sr)
print(f"y: {y}, sr: {sr}")
data = librosa.resample(y, orig_sr=sr, target_sr=16000)
ipd.Audio(data, rate=16000)
sf.write("audio1.wav", data, 16000, subtype="PCM_16")
The same result can also be achieved using:
ffmpeg -i audio.wav -ar 16000 -acodec pcm_s16le audio1.wav
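As a sanity check of what "16 kHz, 16-bit PCM" means, here is a stdlib-only sketch (no librosa needed; the filename check.wav is hypothetical) that writes one second of tone in that format and reads the parameters back:

```python
import math
import struct
import wave

RATE = 16000  # target sample rate in Hz

# Write one second of a 440 Hz tone as 16 kHz, 16-bit mono PCM.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)     # mono
    w.setsampwidth(2)     # 2 bytes per sample -> 16-bit PCM
    w.setframerate(RATE)  # 16 kHz sample rate
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * t / RATE)))
        for t in range(RATE)
    )
    w.writeframes(frames)

# Read the parameters back to confirm the target format.
with wave.open("check.wav", "rb") as w:
    print(w.getframerate(), w.getsampwidth() * 8, w.getnframes())
    # -> 16000 16 16000
```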
Step 4: Create chunks of the large audio file to improve transcription quality. #
Type the following in a Python shell:
import os
from pydub import AudioSegment

audio = AudioSegment.from_file("audio1.wav")
segment_length = 30 * 1000  # 30 seconds in milliseconds
start_time = 0
segments = []
while start_time < len(audio):
    # Slicing past the end is safe; the final segment may be shorter than 30 s.
    segments.append(audio[start_time:start_time + segment_length])
    start_time += segment_length

BASE_PATH = f"{os.getcwd()}/audio"
if not os.path.exists(BASE_PATH):
    os.mkdir(BASE_PATH)
for i, segment in enumerate(segments):
    segment.export(f"{BASE_PATH}/audio1_{i}.wav", format="wav")
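The chunk count follows ceiling division: every started 30-second window becomes one chunk, including a short final one when the duration is not an exact multiple of 30 seconds. A quick check with a few hypothetical durations:

```python
import math

segment_ms = 30 * 1000  # chunk length used above
for duration_ms in (60_000, 95_000, 29_000):
    # ceil division: every started 30 s window becomes one chunk
    n_chunks = math.ceil(duration_ms / segment_ms)
    print(duration_ms, "->", n_chunks)
# 60000 -> 2
# 95000 -> 4
# 29000 -> 1
```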
Step 5: Transcribe the audio file chunks into transcribed_text.txt #
import os
import speech_recognition as sr

files = os.listdir(f"{os.getcwd()}/audio")
sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
paragraph = []
with open('transcribed_text.txt', 'w') as text_file:
    print("Transcribing audio...")
    for num, name in enumerate(sorted_files):
        # transcribe audio file
        AUDIO_FILE = f"audio/{name}"
        # use the audio file as the audio source
        r = sr.Recognizer()
        r.energy_threshold = 4000
        with sr.AudioFile(AUDIO_FILE) as source:
            audio = r.record(source)  # read the entire audio file
        try:
            # Recognize speech using the Google Web Speech API
            text = r.recognize_google(audio, language='en-US')
            print(f"text{num}: {text}")
            paragraph.append(text)
        except sr.UnknownValueError:
            print("Speech Recognition could not understand the audio")
        except sr.RequestError as e:
            print(f"Could not request results from Google Web Speech API; {e}")
    print(paragraph)
    text_file.write(" ".join(paragraph))
print("Transcription complete. Text saved to 'transcribed_text.txt'")
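The numeric sort key in the loop above matters: a plain lexicographic sort would put audio1_10.wav before audio1_2.wav. The key can be checked in isolation with hypothetical chunk filenames:

```python
# Hypothetical chunk filenames, deliberately out of order
files = ["audio1_10.wav", "audio1_2.wav", "audio1_0.wav", "audio1_1.wav"]

# Same key as the transcription loop: parse the chunk index as an int
sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
print(sorted_files)
# ['audio1_0.wav', 'audio1_1.wav', 'audio1_2.wav', 'audio1_10.wav']
```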