Difference between revisions of "Automated Speech Recognition"

Revision as of 09:51, 4 June 2018

Introduction

Automated speech recognition (AKA Speech-to-Text) is the process of turning sound picked up by the robot's microphone into meaningful text. The text may be interpreted as the human side of a conversation with the robot. The robot replies using the reverse process, text-to-speech, which turns text into an audio signal to send to the loudspeaker.

Speech to text is provided by the Speech Recognition Tritium node. The actual recognition is done by a configurable backend, with Google's Cloud Speech-to-Text API as the default (and currently only) option.

Google Cloud Speech-to-Text API

Google's Cloud Speech-to-Text is among the most capable services of its kind. It uses trained neural networks to convert audio into well-formed text, and it generally transcribes speech more accurately than similar services.

To use the Google API the Tritium node uploads the raw audio data picked up by the microphone to the Google service, then waits for messages back from the service containing any text recognised. This requires a reasonably fast, reliable internet connection.
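The upload-then-wait pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the Tritium node's actual code: `chunk_audio`, `FakeBackend`, `transcribe`, the 16 kHz 16-bit mono audio format, and the 100 ms chunk size are all assumptions made for the example. The fake backend stands in for the remote service so that the sketch runs without a network connection.

```python
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16000   # assumed microphone sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
CHUNK_MS = 100           # assumed duration of audio sent per upload


def chunk_audio(raw: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration chunks for streaming upload."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(raw), chunk_bytes):
        yield raw[start:start + chunk_bytes]


class FakeBackend:
    """Hypothetical stand-in for a recognition backend (no network access)."""

    def recognise(self, chunks: Iterable[bytes]) -> Iterator[str]:
        received = 0
        for chunk in chunks:
            received += len(chunk)
        # A real service would send back recognised text; this placeholder
        # only reports how much audio it was given.
        yield f"<{received} bytes received>"


def transcribe(raw: bytes, backend) -> List[str]:
    """Upload the audio chunk by chunk, then collect the messages sent back."""
    return list(backend.recognise(chunk_audio(raw)))
```

Because the audio is sent in small chunks, a slow or unreliable connection delays every message coming back, which is why the connection quality matters.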

The Tritium node must provide authentication credentials, created using the API management web application. The process to create new credentials is as follows:

Create a Google Cloud API project
Generate a private service account key JSON file
Download the JSON file, saving it on the robot as...

/opt/tritium/nodes/speech_recognition/google_application_credentials.json

The Before You Begin section of the Quickstart: Using Client Libraries tutorial covers this process. (You do not need to follow the rest of the tutorial.)
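Once the key file is in place, a quick sanity check can catch a missing or malformed file before the node is started. The sketch below is an assumption-laden helper, not part of Tritium: `check_credentials` is a hypothetical function name, and the field list reflects the usual layout of a Google service account key file.

```python
import json
from pathlib import Path

# Location given in the steps above.
CREDENTIALS_PATH = Path(
    "/opt/tritium/nodes/speech_recognition/google_application_credentials.json"
)

# Fields normally present in a service account key file (assumed sufficient
# for a basic check; the file contains other fields as well).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_credentials(path: Path = CREDENTIALS_PATH) -> list:
    """Return a list of problems found in the key file (empty if none)."""
    if not path.is_file():
        return [f"missing file: {path}"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data
    ]
    if "type" in data and data["type"] != "service_account":
        problems.append('"type" is not "service_account"')
    return problems


if __name__ == "__main__":
    for problem in check_credentials():
        print(problem)
```

Running the script on the robot prints one line per problem and nothing at all if the file looks like a service account key.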

@@ Line 5: / Line 5: @@
 ==Introduction==
-Automated speech recognition is the process of turning sound picked up by the robot's microphone into meaningful text.  The text may be interpreted as the human side of a conversation with the robot. The robot replies using the reverse process, [[Text-to-Speech| text-to-speech]], which turns text into an audio signal to send to the loudspeaker.
+Automated speech recognition (AKA Speech-to-Text) is the process of turning sound picked up by the robot's microphone into meaningful text.  The text may be interpreted as the human side of a conversation with the robot. The robot replies using the reverse process, [[Text-to-Speech| text-to-speech]], which turns text into an audio signal to send to the loudspeaker.
 Speech to text is provided by the [[Tritium Node - Speech Recognition|Speech Recognition]] Tritium node.  The actual recognition is done by a configurable backend, with Google's Cloud Speech-to-Text API as the default (and currently only) option.
