Build a Speech Recognition App with JavaScript
Table of Contents:
- Creating a Web Speech Recognition Application
- Understanding the Web Speech API
- How the Web Speech API Works
- Browser Compatibility
- Setting up the HTML Structure
- Styling the Web Page
- Getting User Input with Speech Recognition
- Handling Speech Recognition Results
- Displaying Speech Recognition Results on the Web Page
- Handling Multiple Sessions of Speech Recognition
- Creating Automatic Replies and Actions
- Fine-tuning Speech Recognition Accuracy
- Conclusion
Creating a Web Speech Recognition Application
In today's digital age, speech recognition technology has become increasingly popular, enabling users to interact with their devices through voice commands. In this guide, we will learn how to create a web speech recognition application using the Web Speech API. This API provides a simple way to integrate speech recognition capabilities into web applications, allowing users to interact with the application using their voice.
Before we dive into the implementation, let's first understand how the Web Speech API works and its browser compatibility.
Understanding the Web Speech API
How the Web Speech API Works
The Web Speech API allows developers to integrate speech recognition capabilities into their web applications. It works by capturing audio input from the user's microphone, converting it into text, and providing the recognized text as a result. The API uses the user agent's speech recognition service to perform the actual speech-to-text conversion.
To use the Web Speech API, we need to create an instance of the SpeechRecognition object. This object represents the speech recognition service provided by the user agent. We can then configure the recognition settings, such as enabling real-time results and setting language options.
Once the speech recognition is set up, we can start the recognition process by calling the start method on the SpeechRecognition object. This will prompt the user to grant permission to access the microphone. Once permission is granted, the API will start capturing the user's speech and provide the recognized text as results.
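As a quick preview, here is a minimal sketch of that flow (the variable names are just for illustration, and the prefixed constructor is used as a fallback); the rest of this guide builds the real implementation step by step:
// Minimal flow: create the recognition object, configure it, and start listening
const SpeechRecognitionCtor = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognizer = new SpeechRecognitionCtor();
recognizer.lang = "en-US"; // example language; adjust for your users
recognizer.interimResults = true; // receive partial results while the user speaks
recognizer.start(); // the browser prompts for microphone permission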
Browser Compatibility
It's important to note that the Web Speech API has limited browser compatibility. Support is strongest in Chromium-based browsers such as Google Chrome on desktop and Android; Safari exposes the API behind the webkit prefix, while Firefox does not support it at the time of writing. Therefore, it's recommended to test and develop your web speech recognition application using Chrome.
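Because support varies, it's a good idea to feature-detect before using the API. Here is a minimal check, assuming we simply want to warn the user when speech recognition isn't available:
// Bail out gracefully if the browser doesn't expose the Web Speech API
if (!("SpeechRecognition" in window) && !("webkitSpeechRecognition" in window)) {
  alert("Sorry, your browser does not support speech recognition.");
}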
Now that we have a basic understanding of the Web Speech API, let's move on to setting up the HTML structure of our application.
Setting up the HTML Structure
To create our web speech recognition application, we'll start by setting up the HTML structure. We'll need an HTML element to display the recognized text and a section to contain it. Here's an example HTML structure:
<section>
  <h1>Speech Recognition</h1>
  <div class="container">
    <div class="text">
      <p id="output"></p>
    </div>
  </div>
</section>
In this structure, we have a <section> element that contains an <h1> heading representing the title of our application. Inside the <section>, we have a <div> element with a class of "container". This div will serve as a container for our text output.
Inside the container div, we have another div with a class of "text". This div is responsible for displaying the recognized text. We have a <p> element with an id of "output" inside the text div. This is where we'll dynamically update the recognized text.
Now that we have our HTML structure in place, let's style our web page to make it visually appealing.
Styling the Web Page
To make our web page visually appealing and user-friendly, we need to apply some CSS styles to our HTML structure. Here's an example stylesheet:
body {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
  font-family: Arial, sans-serif;
}

section {
  min-height: 100vh;
  width: 100%;
  display: flex;
  align-items: center;
  justify-content: center;
  background-color: #f2f2f2;
  padding: 50px 0;
}

h1 {
  color: #fff;
  opacity: 0.03;
  margin-bottom: 10px;
  text-align: center;
  width: 100%;
  font-size: 50px;
}

.container {
  width: 90%;
  max-width: 500px;
  margin: 0 auto;
}

.text {
  width: 100%;
  background-color: #fff;
  padding: 10px;
  border-radius: 8px;
  margin-bottom: 40px;
  text-align: left;
}

.text p {
  color: #000;
  text-align: left;
}
In these styles, we're setting the margin and padding of the body to 0, ensuring the box-sizing property is set to border-box, and using a sans-serif font family for the entire page.
The section element, representing the main content of our page, covers at least the full viewport height (min-height: 100vh), and its content is centered both vertically and horizontally using flexbox. We're also applying a background color and padding to the section element.
The h1 element, representing the title of our application, has a white text color with a very low opacity to create a subtle effect. The width is set to 100% to span the entire parent element, and we've applied margin-bottom and font-size styles to enhance the title's appearance.
The container div, responsible for containing our text output, has a width of 90% and a maximum width of 500 pixels. It's centered horizontally using margin: 0 auto. The text div inside the container has a white background color, along with padding and border-radius styles to give it a clean, rounded look.
Finally, the p element inside the text div has black text and is left-aligned to accommodate the recognized text.
With our web page styled, let's move on to the JavaScript implementation of our web speech recognition application.
Getting User Input with Speech Recognition
To capture the user's speech and convert it into text, we'll utilize the Web Speech API's SpeechRecognition object. Let's start by retrieving the necessary elements from the DOM and creating the SpeechRecognition object:
// Grab the transcript container and the live-output paragraph from the DOM
const textContainer = document.querySelector(".text");
const output = document.querySelector("#output");
// Use the prefixed constructor as a fallback where needed
const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.interimResults = true; // receive partial results while the user is speaking
In this code snippet, we're selecting the elements with the .text class and the #output id. These elements represent the text container and the output paragraph, respectively. We store the container in a variable called textContainer so it won't be shadowed by the local text variable we'll create inside the result listener later.
Next, we create a SpeechRecognition object using the SpeechRecognition constructor. We also handle browser compatibility by using the webkit prefix (window.webkitSpeechRecognition) as a fallback for browsers that do not support the non-prefixed version.
We set the interimResults property to true to receive real-time results as the user speaks.
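One thing this snippet doesn't do yet is actually begin listening. A minimal way to do that, assuming we want to start recognition as soon as the script runs (you could instead wire this call to a button click), is:
// Begin capturing speech; the browser asks for microphone access on first use
recognition.start();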
Handling Speech Recognition Results
Once the SpeechRecognition object is set up, we can attach an event listener to the recognition object to capture the results. Let's listen for the result event and log the results to the console:
recognition.addEventListener("result", (event) => {
  const text = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join("");
  console.log(text);
});
In this code snippet, we listen for the result event and access the results using the event.results property. We convert the results into an array using Array.from and extract the transcript property from each result.
By joining the transcripts together, we get the complete recognized text from the user's speech. We log the text to the console for testing purposes.
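It can also help to listen for the error event so problems such as a denied microphone permission or no detected speech don't fail silently. A small sketch:
// Log recognition errors (e.g. "not-allowed", "no-speech") instead of failing silently
recognition.addEventListener("error", (event) => {
  console.error("Speech recognition error:", event.error);
});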
Displaying Speech Recognition Results on the Web Page
To display the recognized text on the web page, we'll update the innerText property of the output paragraph element:
recognition.addEventListener("result", (event) => {
  const text = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join("");
  output.innerText = text;
});
In this code snippet, we update the innerText property of the output paragraph element with the recognized text. This dynamically updates the text displayed on the web page as the user speaks.
Handling Multiple Sessions of Speech Recognition
By default, speech recognition stops once the user pauses, so each run is treated as a separate session. To handle multiple sessions, we'll check whether the current result is final. When a result is final, indicating the end of a session, we'll move that session's text into its own paragraph element so the output is ready for the next session (a sketch for automatically restarting recognition follows at the end of this section):
recognition.addEventListener("result", (event) => {
  const text = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join("");
  output.innerText = text;
  // When the result is final, archive this session's text in its own paragraph
  if (event.results[0].isFinal) {
    const p = document.createElement("p");
    p.innerText = text;
    textContainer.appendChild(p);
  }
});
In this code snippet, we check if the isFinal property of the first result in the event results is true. If it is, we create a new paragraph element using document.createElement("p"), set its text to the recognized text, and append it to the text container element.
This way, we handle multiple sessions of speech recognition and display each session's recognized text in a separate paragraph element.
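Note that recognition doesn't restart on its own. If you want the app to keep listening across sessions, one option is to restart it whenever it ends; here is a minimal sketch, assuming you want to listen indefinitely:
// Restart recognition whenever a session ends so the app keeps listening
recognition.addEventListener("end", () => {
  recognition.start();
});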
Creating Automatic Replies and Actions
To enhance the user experience and add interactivity to our web speech recognition application, we can create automatic replies and actions based on specific user inputs.
Let's say we want to reply automatically when the user says "hello" or "what's your name". We can do this by checking whether the recognized text includes these phrases and displaying an appropriate reply:
recognition.addEventListener("result", (event) => {
  const text = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join("");
  output.innerText = text;
  // Only react to final results so a reply isn't appended repeatedly
  // while interim results stream in
  if (!event.results[0].isFinal) return;
  if (text.includes("hello")) {
    const p = document.createElement("p");
    p.classList.add("reply");
    p.innerText = "Hi!";
    textContainer.appendChild(p);
  }
  if (text.includes("what's your name") || text.includes("what is your name")) {
    const p = document.createElement("p");
    p.classList.add("reply");
    p.innerText = "My name is WebCifer.";
    textContainer.appendChild(p);
  }
});
In this code snippet, we first make sure the result is final, so each reply is added only once per utterance, and then check whether the recognized text includes the phrases "hello" or "what's your name". If it does, we create a new paragraph element with a class of "reply" so it can be styled differently (add a matching .reply rule to the stylesheet if you want a distinct look). We set the paragraph's text to the appropriate reply and append it to the text container element.
Additionally, we can perform actions based on specific user inputs. Let's say we want to open a specific web page when the user says "open my YouTube channel". We can achieve this by opening the desired URL in a new tab:
if (text.includes("open my YouTube channel")) {
  window.open("https://www.youtube.com/c/webcifer");
}
In this code snippet, we use the window.open method to open the specified URL in a new tab. This check belongs inside the same result event listener, after the final-result check, so the page isn't opened repeatedly while the user is still speaking. We use the URL of our YouTube channel as an example; replace it with your desired URL.
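As the number of phrases grows, a lookup table can keep the listener tidy. Here is one possible refactor as a sketch; the addReply helper and the phrase list are just illustrative, and the transcript is lowercased so matching isn't case-sensitive:
// Hypothetical command map: each phrase points to an action (illustrative only)
const commands = {
  "hello": () => addReply("Hi!"),
  "what's your name": () => addReply("My name is WebCifer."),
  "open my youtube channel": () => window.open("https://www.youtube.com/c/webcifer"),
};

// Helper that appends a reply paragraph to the text container
function addReply(message) {
  const p = document.createElement("p");
  p.classList.add("reply");
  p.innerText = message;
  textContainer.appendChild(p);
}

// Inside the result listener, after the final-result check:
const spoken = text.toLowerCase();
for (const [phrase, action] of Object.entries(commands)) {
  if (spoken.includes(phrase)) action();
}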
With these automatic replies and actions, our web speech recognition application becomes more interactive and responsive to the user's speech inputs.
Fine-tuning Speech Recognition Accuracy
It's important to note that speech recognition accuracy may vary based on the user's pronunciation, audio quality, and other factors, and in some cases it won't be 100%. To improve accuracy, you can experiment with different speech recognition settings, such as setting the recognition language explicitly or adding grammar restrictions to limit the recognized speech patterns.
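For example, here is a small sketch that pins the recognition language before starting recognition ("en-US" is just an example; use the BCP 47 language tag that matches your users). Grammar restrictions via SpeechGrammarList exist in the specification as well, but browser support for them is limited, so treat them as optional:
// Pin the recognition language instead of relying on the browser's default
recognition.lang = "en-US";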
Additionally, you can perform post-processing on the recognized text to remove any noise or improve the formatting. Regular expressions and language processing techniques can be used to achieve this.
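As a simple illustration of such post-processing, here is a sketch that strips a few filler words (the list is just an example), collapses extra whitespace, and capitalizes the first letter:
// Example post-processing: tidy up a raw transcript before displaying it
function cleanTranscript(raw) {
  return raw
    .replace(/\b(um|uh|erm)\b/gi, "") // drop a few common filler words (example list)
    .replace(/\s+/g, " ") // collapse repeated whitespace
    .trim()
    .replace(/^./, (c) => c.toUpperCase()); // capitalize the first character
}

// Usage: output.innerText = cleanTranscript(text);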
Conclusion
Congratulations! You have successfully created a web speech recognition application using the Web Speech API. You've learned how to capture user speech input, convert it into text, and display the results on the web page in real time. You've also implemented automatic replies and actions based on specific user inputs, enhancing the interactivity of the application.
Remember to test your application on supported browsers, such as Google Chrome, as the Web Speech API may have limited compatibility with other browsers.
Feel free to explore further possibilities with the Web Speech API, such as integrating it with other web technologies or adding more interactive features to your application. Keep iterating and experimenting to enhance the user experience and make your application even more powerful.
Thank you for following along with this tutorial. Happy coding!