This article is a primer on speech recognition in smart devices. It will describe the various technologies which make this remarkable tool work seamlessly in consumers’ everyday lives.
Speech recognition technology is being integrated into a growing range of applications and end products. Research into the use of voice search suggests that half of all searches will be performed by voice by 2020 (source: ComScore). Voice is fast, and because it does not require typing it is easier to use on a smartphone.
Another product type in which Automatic Speech Recognition (ASR) is gaining popularity is the smart speaker. The Google Home™ and Amazon’s Echo and Dot smart speakers are penetrating homes faster than most imagined they would, as shown in Figure 1. Apple also plays in this market with its HomePod™ speaker.
Fig. 1: The Google Home, Amazon Echo and other smart speakers
To a consumer, speaking to a smartphone is no different from speaking to a smart speaker. In fact, consumers expect the same level of word-recognition capability that they get from their smartphone, even though a smart speaker is more difficult to implement from an acoustic standpoint.
Operation of a smart speaker
A smart speaker has a unique set of operational conditions and challenges. First, the smart speaker is always listening: it waits for a wake-up signal or trigger word before it opens a channel to the cloud, where audio processing takes place. For Amazon, a single word, ‘Alexa’, triggers the speaker to wake up. The Google Home command ‘OK Google’ and Apple’s ‘Hey Siri’ do not seem as natural: they lead to awkward constructions of simple commands, such as ‘OK Google, turn on the lights’ or ‘OK Google, turn up the volume’.
In the case of the Amazon Echo, detection of the wake-up word is carried out locally. This is crucial: if wake-word detection were not executed entirely within the smart speaker, the device would have to stream voice packets to the cloud continuously for processing.
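To make the always-listening behavior concrete, the sketch below shows one way a device-side loop could keep audio entirely on the device until a local keyword spotter crosses a confidence threshold, and only then stream frames to the cloud. All of the functions (mic_read_frame, kws_score and the cloud_* hooks) are hypothetical and are stubbed so the sketch compiles and runs on its own; it is not the firmware of any particular speaker.

```c
/* A minimal, self-contained sketch of an always-listening wake-word loop.
 * The mic/cloud functions are simulated stubs; real firmware would replace them. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FRAME_SAMPLES  160          /* 10 ms of audio at a 16 kHz sample rate   */
#define WAKE_THRESHOLD 0.85f        /* confidence needed to declare a wake word */

/* --- Hypothetical platform hooks, stubbed so the sketch runs as-is --------- */
static int g_frame_count;

static void mic_read_frame(int16_t *buf)            /* pretend to capture audio */
{
    memset(buf, 0, FRAME_SAMPLES * sizeof buf[0]);
    g_frame_count++;
}

static float kws_score(const int16_t *buf)           /* local keyword spotter   */
{
    (void)buf;
    return (g_frame_count == 5) ? 0.95f : 0.10f;     /* "hear" the wake word once */
}

static void cloud_stream_start(void)   { puts("wake word: opening cloud session"); }
static void cloud_stream_frame(void)   { puts("streaming one audio frame"); }
static bool cloud_end_of_command(void) { return g_frame_count >= 8; }

/* --- Device-side main loop -------------------------------------------------- */
int main(void)
{
    int16_t frame[FRAME_SAMPLES];

    while (g_frame_count < 10) {                     /* forever, in real firmware */
        /* Always listening: audio stays on the device until the wake word. */
        mic_read_frame(frame);

        if (kws_score(frame) < WAKE_THRESHOLD)
            continue;                                /* nothing leaves the device */

        /* Wake word detected locally: only now is audio streamed to the cloud. */
        cloud_stream_start();
        do {
            mic_read_frame(frame);
            cloud_stream_frame();
        } while (!cloud_end_of_command());
    }
    return 0;
}
```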
Once the wake-up word is detected, the user issues a command, such as ‘Turn on the kitchen light’. The connected device sends the spoken command to the Amazon Web Services (AWS) cloud computing service, which deciphers the command and then instructs the kitchen light to turn on.
This architecture works because developers have built the skills for using the Amazon Alexa cloud-based voice service into devices running on AWS, as shown in Figure 2.
Fig. 2: The command architecture for the Amazon Alexa voice service. (Image credit: developer.amazon.com)
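As an illustration of the final step in Figure 2, the fragment below shows how an endpoint device (here, an imagined kitchen-light controller) might map a directive returned from the cloud to a local action. The directive names and handler functions are hypothetical and are simplified for clarity.

```c
/* Sketch: how a connected endpoint might act on a directive returned from the
 * cloud. The directive names and handlers here are purely illustrative. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef void (*action_fn)(void);

static void kitchen_light_on(void)  { puts("kitchen light: ON");  }
static void kitchen_light_off(void) { puts("kitchen light: OFF"); }

static const struct { const char *directive; action_fn run; } dispatch_table[] = {
    { "TurnOn",  kitchen_light_on  },
    { "TurnOff", kitchen_light_off },
};

/* Called when the cloud service sends back a deciphered command. */
static void handle_directive(const char *directive)
{
    for (size_t i = 0; i < sizeof dispatch_table / sizeof dispatch_table[0]; i++) {
        if (strcmp(directive, dispatch_table[i].directive) == 0) {
            dispatch_table[i].run();
            return;
        }
    }
    puts("unknown directive ignored");
}

int main(void)
{
    handle_directive("TurnOn");   /* e.g., after "Alexa, turn on the kitchen light" */
    return 0;
}
```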
There are particular challenges involved in implementing speech recognition in a device which may be as far as 10m away from the user. Distance is not the only challenge for the smart speaker: interference from echoes bouncing off the walls in the room, from music playing from an audio speaker, or from other audible noise sources can make it difficult for a smart speaker to distinguish the user’s voice and to recognize words.
The key to recognizing speech at a distance is to deploy an array of ASR Assist technologies, as shown in Figure 3.
Audio beamforming: it is not unusual for a room to contain multiple sources of noise, speech and other sounds. Beamforming is the technology for locating the source of interest and separating its sound from the others, minimizing the amplitude of undesired signals and noise (a simple delay-and-sum sketch follows Figure 3).
To perform beamforming effectively, an array of microphones implements spatial filtering: the microphones sample the propagating sound waves at different points in space. Spatial filtering requires information about the microphones’ characteristics and the configuration of the microphone array.
Barge-in: while a smart speaker may be deployed to tell the time or weather, users will also frequently ask it to play music. Barge-in allows trigger words to be detected during music playback.
De-reverberation: this feature removes room echo (reverberation) to improve voice clarity.
Automatic gain control: the user might sometimes be located directly in front of the speaker, at other times 5m away. Automatic gain control applies the appropriate gain to the signal for the distance over which the voice carries.
Noise reduction: this feature mitigates the effect of ambient noise such as fans.
Fig. 3: ASR Assist technologies. (Image credit: Microsemi)
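To make the beamforming idea more concrete, here is a minimal delay-and-sum sketch in C for a two-microphone array. It assumes the steering delay (two samples in this toy example) is already known, for instance from the microphone spacing and an estimated direction of arrival; a production far-field front end would be considerably more sophisticated.

```c
/* Minimal delay-and-sum beamformer for a two-microphone array.
 * The channel that hears the talker first is delayed so the wavefront lines up
 * across both channels, then the channels are averaged: speech from the steered
 * direction adds coherently, while off-axis noise does not. */
#include <stdio.h>

#define N_SAMPLES   8
#define STEER_DELAY 2   /* hypothetical steering delay in whole samples */

/* Align the earlier channel with the later one and average the pair. */
static void delay_and_sum(const float *early, const float *late,
                          float *out, int n, int delay)
{
    for (int i = 0; i < n; i++) {
        float aligned = (i - delay >= 0) ? early[i - delay] : 0.0f;
        out[i] = 0.5f * (aligned + late[i]);   /* coherent sum of the steered signal */
    }
}

int main(void)
{
    /* Toy data: the same "speech" waveform arrives at mic1 two samples late. */
    const float mic0[N_SAMPLES] = { 0.0f, 1.0f, 0.5f, -0.5f, -1.0f,  0.0f,  0.0f, 0.0f };
    const float mic1[N_SAMPLES] = { 0.0f, 0.0f, 0.0f,  1.0f,  0.5f, -0.5f, -1.0f, 0.0f };
    float out[N_SAMPLES];

    delay_and_sum(mic0, mic1, out, N_SAMPLES, STEER_DELAY);

    for (int i = 0; i < N_SAMPLES; i++)
        printf("%5.2f ", out[i]);
    printf("\n");
    return 0;
}
```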
DSP or general-purpose MCU?
ASR Assist technologies, then, enable a smart speaker to perform speech recognition at a distance in noisy real-world environments. The hardware technology underpinning ASR Assist is digital signal processing, which has traditionally been implemented in specialized microprocessors that have a dedicated architecture.
While a general-purpose MCU can execute digital signal-processing algorithms in software, it is more efficient to use a Digital Signal Processor (DSP) chip, whose architecture is optimized for operations such as the single-cycle multiply-accumulate at the heart of most audio algorithms.
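The reason a DSP wins is that most audio front-end processing reduces to long multiply-accumulate (MAC) loops. The short FIR filter below, written in plain C, shows the kind of inner loop involved; DSP hardware typically performs each MAC, together with the associated data moves, in a single cycle, while a general-purpose MCU needs several instructions per filter tap.

```c
/* A plain-C FIR filter: the multiply-accumulate (MAC) pattern that dedicated
 * DSP hardware accelerates to roughly one tap per cycle. */
#include <stdio.h>

#define NUM_TAPS 4

static float fir_sample(const float coeff[NUM_TAPS],
                        const float history[NUM_TAPS])
{
    float acc = 0.0f;
    for (int k = 0; k < NUM_TAPS; k++)
        acc += coeff[k] * history[k];      /* one MAC per filter tap */
    return acc;
}

int main(void)
{
    /* Simple 4-tap moving-average filter applied to a short test signal. */
    const float coeff[NUM_TAPS] = { 0.25f, 0.25f, 0.25f, 0.25f };
    const float input[]         = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f };
    float history[NUM_TAPS]     = { 0 };

    for (size_t n = 0; n < sizeof input / sizeof input[0]; n++) {
        /* Shift the delay line, newest sample first. */
        for (int k = NUM_TAPS - 1; k > 0; k--)
            history[k] = history[k - 1];
        history[0] = input[n];

        printf("y[%zu] = %.2f\n", n, fir_sample(coeff, history));
    }
    return 0;
}
```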
For ASR applications, specialized audio DSP products provide the best environment in which to implement ASR Assist functions. A leading example of such a device is the ZL38063 from Microsemi, a Microchip company.
The ZL38063 is part of Microsemi’s Timberwolf family of audio processors. It improves ASR performance at extended distances while providing barge-in capability, and is optimized for detecting voice commands. The Microsemi AcuEdge™ technology in the ZL38063 is designed for use in televisions, set-top boxes and smart speakers, but also works well in other connected-home applications. The device is capable of both voice control and two-way full-duplex audio with voice enhancements such as acoustic echo cancellation and noise reduction to improve both the intelligibility and subjective quality of voice in harsh acoustic environments.
Fig. 4: Simplified block diagram of the ZL38063 audio processor. (Image credit: Microsemi)
A different hardware platform which can implement advanced voice-recognition technologies is the Digital Signal Controller (DSC). This offers a number of advantages to OEMs’ system designers. It can reduce bill-of-materials costs by replacing a design based on a combination of a microcontroller and a DSP with a single DSC. It can also reduce system-level complexity by removing the need for shared memory, MCU-to-DSP communications, complex multi-processor bus architectures and custom glue logic between an MCU and a DSP.
A DSC also offers the advantage of reducing software development costs, as the entire project can be developed with a single compiler, debugger and integrated development environment.
The project’s software may also be written in a high-level programming language such as C or C++, rather than the handcrafted assembler often required by a proprietary DSP. Products in the dsPIC33E/F DSC family from Microchip may be used in ASR applications, since they offer features such as speech encoding and decoding, noise suppression, acoustic/line echo cancellation, equalization and automatic gain control.
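As an illustration of the kind of routine that can remain in portable C on a DSC, the sketch below implements a very simple automatic gain control: an envelope follower tracks the output level and the gain is nudged toward a target. It is a generic teaching example, not code from Microchip’s libraries, and the constants are arbitrary.

```c
/* A very simple automatic gain control (AGC) in portable C: track the output
 * envelope and nudge the gain so that the output approaches a target level. */
#include <math.h>
#include <stdio.h>

#define TARGET_LEVEL 0.25f      /* desired output envelope     */
#define ATTACK       0.10f      /* envelope tracker rise rate  */
#define RELEASE      0.01f      /* envelope tracker fall rate  */

int main(void)
{
    float envelope = 0.0f;
    float gain     = 1.0f;

    for (int n = 0; n < 400; n++) {
        /* A deliberately quiet input tone that the AGC should bring up. */
        float in  = 0.05f * sinf(2.0f * 3.14159265f * (float)n / 20.0f);
        float out = gain * in;

        /* Envelope follower with separate attack and release rates. */
        float mag = fabsf(out);
        envelope += (mag > envelope ? ATTACK : RELEASE) * (mag - envelope);

        /* Nudge the gain so the output envelope converges on the target. */
        gain *= (envelope < TARGET_LEVEL) ? 1.01f : 0.99f;

        if (n % 100 == 0)
            printf("n=%3d  gain=%5.2f  envelope=%.3f\n", n, gain, envelope);
    }
    return 0;
}
```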
Amazon’s approach to implementing ASR capability in the Echo smart speaker has been to integrate an audio DSP into the device. The audio DSP performs most of the operations required for the ASR Assist functions. A clean signal is then sent to an applications processor via an I2C or serial peripheral interface for routing to the cloud computing service.
Local ASR: when there is no access to the cloud
Thus far, the discussion has been around speech recognition in the cloud. But what if the user wishes to control lights or the temperature of the hot tub when she or he has no access to the Alexa cloud-based voice service? There will be use cases in which an internet connection is not available: this means that speech recognition must be performed locally.
Without access to the cloud, there is no access to the vast Artificial Intelligence (AI) capability which underpins the operation of speech recognition. When performing speech recognition locally, a smart speaker is limited to a vocabulary of at most 20 short phrases. However, this is perfectly acceptable for convenience applications in the home, since users can in any case remember only a limited number of command phrases.
Various companies specialize in such technology: they will work with OEMs to develop custom command phrases which may be loaded in the audio DSP or host MCU.
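The sketch below suggests what the hand-off might look like from the host MCU’s point of view: a small, fixed phrase list is checked against the local vocabulary budget and loaded into the recognizer. The dsp_load_phrase() call and the phrase list are hypothetical and are stubbed so the example runs on its own.

```c
/* Sketch: loading a small custom command vocabulary into a local recognizer
 * (audio DSP or host MCU). The driver call below is hypothetical. */
#include <stdio.h>

#define MAX_LOCAL_PHRASES 20   /* typical budget for fully local recognition */

static const char *const vocabulary[] = {
    "turn on the lights",
    "turn off the lights",
    "hot tub warmer",
    "hot tub cooler",
};

/* Hypothetical driver call: in a real system this would program the DSP. */
static int dsp_load_phrase(int id, const char *phrase)
{
    printf("loaded phrase %d: \"%s\"\n", id, phrase);
    return 0;
}

int main(void)
{
    int count = (int)(sizeof vocabulary / sizeof vocabulary[0]);

    if (count > MAX_LOCAL_PHRASES) {
        puts("vocabulary too large for local-only recognition");
        return 1;
    }

    for (int id = 0; id < count; id++)
        dsp_load_phrase(id, vocabulary[id]);

    return 0;
}
```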
Applications beyond the smart speaker
The discussion in this article has centered on the smart speaker. Moving beyond smartphones and smart devices, developers should consider whether speech will be the Human-Machine Interface (HMI) of the future for a much broader range of end products. Users relied on buttons and switches for decades; in just the past decade, Apple’s iPad® and iPhone® mobile digital devices have introduced a completely new way of interfacing with electronics products. Now customers expect a device’s HMI to include a smooth touchscreen experience.
The same phenomenon is happening with voice: smart speakers are proliferating at an unprecedented rate, and customers will eventually come to expect to interact with their machines using speech. Advances in speech recognition have enabled the use of voice commands in modern HMIs, and the technology is now ready for mass adoption; it is the most natural HMI for many products and systems. Developers and product managers need to think about how they can use the technology to help increase demand for their next product design.