How to Build a Speech-to-Text App in JavaScript With Web Speech API

The Web Speech API can convert speech to text. Learn how to use it to build a real-time speech-to-text web app in JavaScript, with code.

Written by Nicholas Charles
Published on Jan. 15, 2025
In this short tutorial, we’ll build a simple yet useful real-time speech-to-text web app using the Web Speech API. 

Feature-wise, our app will be straightforward: click a button to start recording, and your speech will be converted to text and displayed on the screen in real time. We’ll also experiment with voice commands. For example, saying “stop recording” will halt the recording.

4 Steps to Build a Speech-to-Text Web App

  1. Initialize the project using a build tool like Vite.js. If you choose to use Vite.js, ensure you have Node.js installed, as npm is required to install and manage project dependencies.
  2. Develop the basic layout of the application using HTML.
  3. Design the visual appearance of the app by applying CSS styles.
  4. Integrate the Web Speech API using JavaScript to enable speech recognition.

Sounds fun? Let’s get into it.


What Is the Web Speech API?

The Web Speech API is a browser API that lets developers add speech recognition (speech to text) and speech synthesis (text to speech) to web applications. It makes hands-free and voice-controlled features possible, which can improve accessibility and user experience.

Some use cases for the Web Speech API include voice commands, voice-driven interfaces, transcription services, and more.
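
Some browsers expose the recognition interface only under a vendor prefix (webkitSpeechRecognition), and others do not expose it at all, so it can help to check for it before building on it. The following is a minimal sketch of such a check. The helper name getSpeechRecognition is an assumption made for illustration and is not part of the API.

```javascript
// Hypothetical helper: return the speech recognition constructor exposed by
// the given global object, or null when none is available.
function getSpeechRecognition(globalObj) {
  return globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition || null;
}

// In a browser, pass `window`; elsewhere (for example Node.js) fall back to
// an empty object so the check simply reports "not supported".
const SpeechRecognitionCtor = getSpeechRecognition(
  typeof window !== 'undefined' ? window : {}
);

if (SpeechRecognitionCtor) {
  console.log('Speech recognition is available');
} else {
  console.log('Speech recognition is not supported in this environment');
}
```

Passing the global object in as a parameter keeps the check easy to try outside a browser.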

Now, let’s dive into building our real-time speech-to-text web app. I’m going to use Vite.js to scaffold the project, but feel free to use any build tool of your choice, or none at all, for this mini demo project. If you’d like to use Vite, which I highly recommend, ensure you have npm available by downloading and installing the Node.js runtime.

First, create a new Vite project:

   npm create vite@latest

Then, choose “Vanilla” on the next screen and “JavaScript” on the following one. Use arrow keys on your keyboard to navigate up and down.

HTML Structure for a Real-Time Speech-to-Text Web App

The HTML structure defines the layout of our app. It includes buttons for starting and stopping the speech recognition and a container to display the transcribed text.

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <script type="module" src="/main.js"></script>
    <title>Real-time Speech to Text App</title>
  </head>
  <body>
    <div class="container">
      <h1>Real-time STT App</h1>

      <div class="btn-wrapper">
        <button id="startBtn" class="btn-start">
          <svg viewBox="0 0 100 100" class="hidden">
            <!-- Outer circle (size and stroke here are illustrative assumptions) -->
            <circle cx="50" cy="50" r="45" fill="none" stroke="#fff" stroke-width="5" />

            <!-- Inner circle indicating recording (duration is an illustrative assumption) -->
            <circle cx="50" cy="50" r="30" fill="#fff">
              <animate attributeName="r" values="30; 25; 30" dur="1s" repeatCount="indefinite" />
            </circle>

            <!-- Record icon in the center -->
            <circle cx="50" cy="50" r="5" fill="#ccc" />
          </svg>
          <span> Start Recording </span>
        </button>
        <button id="stopBtn" class="btn-stop" disabled>Stop Recording</button>
      </div>

      <div id="result" class="result"></div>
    </div>
  </body>
</html>


CSS Styling for a Real-Time Speech-to-Text Web App

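The CSS below styles the page background, the buttons and the box that shows the transcribed text. You can customize it to fit your design preferences. Make sure the stylesheet is actually loaded, for example by importing it at the top of main.js (import './style.css', assuming that file name), as the default Vite vanilla template does.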

:root {
  font-family: Inter, system-ui, Avenir, Helvetica, Arial, sans-serif;
  line-height: 1.5;
  font-weight: 400;

  font-synthesis: none;
  text-rendering: optimizeLegibility;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

* {
  margin: 0;
  padding: 0;
  box-sizing: border-box;
}

body {
  background: radial-gradient(
      circle at 100%,
      rgba(3, 6, 21, 0.9) 15%,
      rgba(189, 205, 226, 0.5) 5%,
      rgba(7, 9, 22, 0.9) 15%
    ),
    url('./public/chevron.png') center/cover;

  height: 100vh;
  padding: 40px 0;
}

.container {
  max-width: 1100px;
  margin: 0 auto;
  display: flex;
  flex-direction: column;
  align-items: center;
  padding: 0 15px;
}

h1 {
  color: #fff;
  font-size: 1.5rem;
  text-transform: uppercase;
}

.btn-wrapper {
  margin-top: 20px;
  display: flex;
  flex-wrap: wrap;
  justify-content: center;
  align-items: center;
  gap: 10px;
}

button {
  display: flex;
  align-items: center;
  column-gap: 5px;
  border: none;
  cursor: pointer;
  padding: 12px 24px;
  border-radius: 3px;
  font-weight: 600;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  transition: opacity 400ms ease-in-out;
}

button:disabled {
  opacity: 0.47;
  cursor: default;
}

button:hover:not(:disabled) {
  opacity: 0.9;
}

button > svg {
  height: 1rem;
}

.btn-start {
  background-color: #ff2c4f;
  color: #fff;
}

.btn-stop {
  background-color: rgb(7, 2, 44);
  color: #fff;
}

.result {
  background-color: #fff;
  width: 100%;
  min-height: 200px;
  padding: 10px;
  border-radius: 3px;
  margin-top: 20px;
  box-shadow: 0 0 10px rgba(0, 0, 0, 0.3);
  text-transform: capitalize;
}

.result:empty {
  display: none;
}

.hidden {
  display: none !important;
}

@media screen and (min-width: 768px) {
  h1 {
    font-size: 3.125rem;
    text-transform: capitalize;
  }

  .container {
    padding: 0 30px;
  }

  .result {
    padding: 15px;
  }
}

JavaScript Implementation for a Real-Time Speech-to-Text Web App

This JavaScript code handles the app’s functionality. It integrates the Web Speech API, manages event listeners and processes speech recognition results to update the UI.

// Grab the elements defined in the HTML
const resultElement = document.getElementById('result');
const startBtn = document.getElementById('startBtn');
const animatedSvg = startBtn.querySelector('svg');
const stopBtn = document.getElementById('stopBtn');

startBtn.addEventListener('click', startRecording);
stopBtn.addEventListener('click', stopRecording);

// Use the standard constructor if present, otherwise the prefixed one
let recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

if (recognition) {
  recognition = new recognition();
  recognition.continuous = true; // keep listening across pauses
  recognition.interimResults = true; // report partial results while speaking
  recognition.lang = 'en-US';

  recognition.onstart = () => {
    startBtn.disabled = true;
    stopBtn.disabled = false;
    animatedSvg.classList.remove('hidden'); // show the recording indicator
    console.log('Recording started');
  };

  recognition.onresult = function (event) {
    let result = '';

    // Collect the results that changed since the last event
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        result += event.results[i][0].transcript + ' ';
      } else {
        result += event.results[i][0].transcript;
      }
    }

    resultElement.innerText = result;

    // Voice command: strip the phrase from the text and stop recording
    if (result.toLowerCase().includes('stop recording')) {
      resultElement.innerText = result.replace(/stop recording/gi, '');
      stopRecording();
    }
  };

  recognition.onerror = function (event) {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    console.error('Speech recognition error:', event.error);
  };

  recognition.onend = function () {
    startBtn.disabled = false;
    stopBtn.disabled = true;
    animatedSvg.classList.add('hidden'); // hide the recording indicator
    console.log('Speech recognition ended');
  };
} else {
  console.error('Speech recognition not supported');
}

function startRecording() {
  if (!recognition) return; // nothing to start in unsupported browsers
  resultElement.innerText = '';
  recognition.start();
}

function stopRecording() {
  if (recognition) {
    recognition.stop();
  }
}

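
The result handling can also be looked at in isolation. The sketch below separates finalized text from interim text and runs against a hand-made stand-in for event.results, so it can be tried without a microphone. The function name buildTranscript, its return shape and the sample data are assumptions made for illustration.

```javascript
// Hypothetical helper: split recognition results into finalized and interim
// text. `results` mimics event.results: a list of results, each holding its
// alternatives at numeric indexes plus an `isFinal` flag.
function buildTranscript(results, startIndex = 0) {
  let finalText = '';
  let interimText = '';

  for (let i = startIndex; i < results.length; i++) {
    const transcript = results[i][0].transcript;
    if (results[i].isFinal) {
      finalText += transcript.trim() + ' ';
    } else {
      interimText += transcript;
    }
  }

  return { finalText: finalText.trim(), interimText: interimText.trim() };
}

// Hand-made stand-in for event.results (sample data, not real recognizer output)
const fakeResults = [
  Object.assign([{ transcript: 'hello world' }], { isFinal: true }),
  Object.assign([{ transcript: ' how are' }], { isFinal: false }),
];

console.log(buildTranscript(fakeResults));
// finalText is 'hello world', interimText is 'how are'
```

Inside an onresult handler, one option is to call buildTranscript(event.results) so the whole session transcript is rebuilt on every event. Starting from event.resultIndex instead covers only the results that changed since the previous event.
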
Understanding How to Build a Real-Time Speech-to-Text Web App

This simple web app uses the Web Speech API to convert spoken words into text in real time. Users can start and stop recording with the provided buttons. Customize the design and functionality further based on your project requirements.
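
One possible extension is to support more than one voice command. The sketch below looks up spoken phrases in a small table and returns the matched command together with the text that remains. The names COMMANDS and parseCommand, and the “clear text” phrase, are hypothetical. Only “stop recording” comes from the app above.

```javascript
// Hypothetical command table: spoken phrase -> command name.
// Phrases contain only letters and spaces, so they are safe to embed in a RegExp.
const COMMANDS = {
  'stop recording': 'stop',
  'clear text': 'clear',
};

// Return the first matching command and the transcript with the phrase removed.
function parseCommand(transcript) {
  for (const [phrase, command] of Object.entries(COMMANDS)) {
    const pattern = new RegExp(`\\b${phrase}\\b`, 'gi');
    const text = transcript.replace(pattern, '');
    if (text !== transcript) {
      // Collapse the gap left by the removed phrase
      return { command, text: text.replace(/\s+/g, ' ').trim() };
    }
  }
  return { command: null, text: transcript.trim() };
}

console.log(parseCommand('Hello there Stop Recording'));
// command is 'stop', text is 'Hello there'
```

In the app, the onresult handler could call parseCommand on the transcript and then run stopRecording() when the command is 'stop'.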

Now, you have a basic understanding of how to create a real-time speech-to-text web app using the Web Speech API. Experiment with additional features and enhancements to make it even more versatile and user-friendly. 
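
Another small enhancement is to show the user a readable message when recognition fails instead of only logging event.error. The sketch below maps a few of the error codes defined by the Web Speech API to messages. The wording of the messages and the helper name describeRecognitionError are assumptions made for illustration.

```javascript
// A few error codes reported through event.error, mapped to
// user-facing messages (the message wording is illustrative).
const ERROR_MESSAGES = new Map([
  ['no-speech', 'No speech was detected. Try speaking again.'],
  ['audio-capture', 'No microphone was found, or it could not be accessed.'],
  ['not-allowed', 'Microphone permission was denied.'],
  ['network', 'A network error interrupted speech recognition.'],
]);

// Hypothetical helper: fall back to a generic message for unlisted codes.
function describeRecognitionError(errorCode) {
  return ERROR_MESSAGES.get(errorCode) || `Speech recognition error: ${errorCode}`;
}

console.log(describeRecognitionError('not-allowed'));
// prints "Microphone permission was denied."
```

In the app, the onerror handler could pass event.error to this helper and write the returned message into the result container.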

Frequently Asked Questions

What is the Web Speech API?

The Web Speech API is a browser API that enables developers to integrate speech recognition and synthesis capabilities into web applications.


