Convert Audio to Text in Python with OpenAI Whisper
Table of Contents
- Introduction
- Speech-to-Text Using Hugging Face Transformers
- Introduction to Hugging Face Transformers
- Introduction to OpenAI Whisper
- Steps to Perform Speech-to-Text Using Hugging Face Transformers
  - Set Up the Environment
  - Install the Transformers Library
  - Import the Required Modules
  - Specify the Task and Model
  - Load the Audio File
  - Perform Speech-to-Text Conversion
- Examples and Results
  - Example 1: Transcribing Dialogue from a Movie
  - Example 2: Transcribing a Noisy Audio Clip
- Conclusion
Speech-to-Text Using Hugging Face Transformers
In this Python tutorial, we will learn how to perform speech-to-text conversion with just three lines of Python code. We will use the Hugging Face Transformers library, specifically its pipeline feature, along with the OpenAI Whisper model. Speech-to-text, or automatic speech recognition, is a complex task, but with the advances in NLP and the availability of powerful models like OpenAI Whisper, it has become much easier to accomplish. In this tutorial, we will guide you through the process of downloading the Whisper model from the Hugging Face Model Hub and using it to transcribe speech to text.
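To preview where we are headed, the complete workflow really does fit into a few lines. The sketch below assumes the Whisper medium checkpoint from the Hugging Face Model Hub ("openai/whisper-medium") and a placeholder audio path; each piece is explained step by step in the rest of the tutorial.

from transformers import pipeline

# Minimal preview of the full workflow ("path/to/audio.mp3" is a placeholder)
recognizer = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
print(recognizer("path/to/audio.mp3")["text"])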
Introduction to Hugging Face Transformers
The Hugging Face Transformers library is a versatile tool for natural language processing (NLP), audio, and computer vision tasks. It provides a simple, unified API called pipeline that lets users perform a wide range of machine learning tasks without writing model-specific code. By leveraging the Transformers library, developers can streamline their workflow and build applications that incorporate cutting-edge machine learning models.
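To illustrate how uniform the pipeline API is, here is a brief sketch using a different task, sentiment analysis, with the library's default model for that task; when we switch to speech recognition later, only the task name and model change.

from transformers import pipeline

# The same one-line pattern works across tasks; here the library loads a default
# sentiment-analysis model, so no model name needs to be specified.
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes this easy."))
# Expected shape of the output: a list with one dictionary containing a label
# and a score, e.g. [{'label': 'POSITIVE', 'score': 0.99...}]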
Introduction to OpenAI Whisper
OpenAI Whisper is a Transformer-based speech recognition model trained on approximately 680,000 hours of audio. This large and diverse training set allows it to transcribe spoken language into written text with high accuracy, and it is known for its strong performance and multilingual capabilities. In this tutorial, we will use the Whisper medium model, which strikes a good balance between model size and transcription quality.
Steps to Perform Speech-to-Text Using Hugging Face Transformers
To perform speech-to-text conversion using Hugging Face Transformers and the OpenAI Whisper model, follow the steps outlined below.
1. Set Up the Environment
Before we begin, ensure that you have a Google Colab notebook with a GPU runtime. While it is possible to run the code on a CPU, using a GPU will significantly speed up inference. If you do not have access to a GPU, you can still proceed with the CPU runtime.
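If you want to confirm that the GPU runtime is active before moving on, a quick check with PyTorch (which Transformers uses under the hood and which comes preinstalled in Colab) looks like this:

import torch

# Prints True when a CUDA-capable GPU is available in the current runtime
# (in Colab: Runtime > Change runtime type > GPU)
print("GPU available:", torch.cuda.is_available())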
2. Install the Transformers Library
To get started, you will need to install the Hugging Face Transformers library. This can be done by running the following command:
!pip install transformers
Once the library is installed, you can import its pipeline API into your Python script, as shown in the next step.
3. Import the Required Modules
In order to utilize the pipeline feature of the Transformers library, you need to import the necessary modules. This can be done using the following import statement:
from transformers import pipeline
4. Specify the Task and Model
To perform automatic speech recognition with the Whisper model, you need to specify the task as "automatic-speech-recognition" and the model as "openai/whisper-medium", the Whisper medium checkpoint hosted on the Hugging Face Model Hub (other sizes such as tiny, small, and large are available under similar identifiers).
recognizer = pipeline("automatic-speech-recognition", model="openai/whisper-medium", device=0)
Note: If you're running the code on a CPU, you can omit the "device=0" parameter.
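If you would rather not edit the code when switching between runtimes, one option is to pick the device automatically; a small sketch of this is shown below.

import torch
from transformers import pipeline

# Use GPU 0 if one is available, otherwise fall back to the CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
recognizer = pipeline("automatic-speech-recognition", model="openai/whisper-medium", device=device)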
5. Load the Audio File
To perform speech-to-text conversion, you will need an input audio file containing the speech you want to transcribe. Any common audio format such as MP3 or WAV works; in this tutorial we will use an MP3 file. You can use any audio file of your choice, but make sure to provide the correct path to it.
audio_path = "path/to/audio.mp3"
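If you are working in Google Colab and the audio file is on your local machine, one convenient option is Colab's upload helper; the name of the uploaded file then becomes the path passed to the pipeline.

# Colab-only sketch: upload a local audio file and use its name as the path
from google.colab import files

uploaded = files.upload()                # opens a file picker in the notebook
audio_path = list(uploaded.keys())[0]    # name of the uploaded file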
6. Perform Speech-to-Text Conversion
Once the audio file is loaded, you can use the recognizer pipeline to transcribe the speech into text. The pipeline returns a dictionary containing the transcribed text under the "text" key; segment-level start and end times can also be requested, as shown in the sketch below.
transcriptions = recognizer(audio_path)
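A small usage sketch: print the transcribed text, and optionally re-run the pipeline with return_timestamps=True to get segment-level start and end times (supported for Whisper models).

# The pipeline returns a dictionary; the transcription is under the "text" key
result = recognizer(audio_path)
print(result["text"])

# Optionally request segment-level timestamps as well
result = recognizer(audio_path, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])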
Examples and Results
To demonstrate the effectiveness of speech-to-text conversion using Hugging Face Transformers and OpenAI Whisper, we will walk through a couple of examples and analyze the results.
Example 1: Transcribing Dialogue from a Movie
In this example, we will transcribe a dialogue from a movie. We will download an audio clip of the dialogue, which will be in MP3 format. By passing the audio file through the recognizer pipeline, we can obtain the text transcription.
Transcription Output:
Text: "Starting tonight, people will die. I'm a man of my word."
Example 2: Transcribing a Noisy Audio Clip
In this example, we will transcribe a noisy audio clip. The audio contains background noise, which can pose a challenge for accurate transcription. Nonetheless, OpenAI Whisper is capable of handling such scenarios and producing reasonably accurate results.
Transcription Output:
Text: "This town deserves a better class of criminal, and I'm going to give it to them."
It is worth noting that in both examples the transcriptions are highly accurate, despite the presence of noise in the second example. OpenAI Whisper demonstrates its effectiveness and reliability in transcribing spoken language into written text.
Conclusion
In this tutorial, we explored the capabilities of the Hugging Face Transformers library and the OpenAI Whisper model for speech-to-text conversion. We learned how to set up the environment, install the Transformers library, and use the pipeline feature to perform automatic speech recognition. By following just a few simple steps, we were able to convert speech into text with high accuracy and minimal code. OpenAI Whisper, with its extensive training data and multilingual capabilities, proves to be a valuable tool for speech recognition. With the power of Hugging Face Transformers, developers can easily incorporate state-of-the-art models into their applications and tackle a wide range of machine learning tasks.