Audio injection

Audio injection is the exploitation of digital assistants such as Amazon Echo, Google Home or Apple Siri through unwanted instructions issued by a third party. These services do not authenticate the speaker when reacting to voice commands, so an attacker can issue the activation word followed by commands and trigger actions of their choosing. The consequences include fraud, burglary, data espionage and the takeover of connected systems.

Approach
Most digital assistants require an activation word (e.g. "OK Google") before they begin full recording and audio analysis of spoken commands. The commands that follow, spoken in natural language, are then processed and executed. The activation word differs between manufacturers and is most often derived from the product's name. The product's owner cannot customize the activation word, so all units of a model react to the same word.

The processed speech is not analyzed to recognize the voice or to check the authorization of individual users. As a result, everyone within audible range of a device has the same authorization to trigger and execute actions. Depending on the device's functionality, these range from simple actions such as playing music or controlling lights to security-critical or sensitive actions such as opening gates, placing orders or other chargeable transactions, or reading out private information.
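The missing authorization step can be illustrated with a minimal sketch. It does not reproduce any vendor's implementation: the activation phrase, the `Utterance` type and the `handle` function are hypothetical, and speech recognition is assumed to have already produced a transcript. The point is only that the decision to execute depends on the words, never on who or what produced them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fixed activation phrase, identical on every unit of a model.
WAKE_WORD = "ok assistant"


@dataclass
class Utterance:
    source: str  # who or what produced the audio; never consulted below
    text: str    # transcript assumed to come from speech recognition


def handle(utterance: Utterance) -> Optional[str]:
    """Return the command to execute, or None if the wake word is absent."""
    transcript = utterance.text.lower().strip()
    if not transcript.startswith(WAKE_WORD):
        return None
    # No speaker verification happens here: utterance.source is ignored,
    # so every source within audible range gets the same authority.
    return transcript[len(WAKE_WORD):].strip(" ,")


owner = Utterance("owner", "OK Assistant, open the garage door")
visitor = Utterance("visitor", "OK Assistant, open the garage door")

# Both utterances yield the same command.
print(handle(owner))
print(handle(visitor))
```

A device that verified the speaker would need an additional check on the audio itself before the final `return`; in this sketch nothing of the kind exists, which is the weakness described above.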

Audio injection attacks can be performed remotely. If physical access to the audible range of a digital assistant is restricted, an attacker can extend the distance over which the command is transmitted. This can be achieved either by bypassing existing acoustic barriers (e.g. through a ventilation opening or a momentarily open window) or by unusually loud playback through amplifiers and loudspeakers. In this way, digital assistants can be controlled from other rooms, from other floors or from outside a building.
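Why louder playback extends the range can be shown with a rough calculation. The sketch below assumes an idealized point source in a free field, where the sound pressure level falls by 20·log10(r/r0) decibels between the reference distance r0 and the distance r (about 6 dB per doubling of distance). Reflections and room acoustics are ignored, the `barrier_loss_db` parameter is a hypothetical lump sum for walls or windows, and the levels used are illustrative rather than measured.

```python
import math


def level_at_distance(level_at_ref_db: float,
                      distance_m: float,
                      ref_distance_m: float = 1.0,
                      barrier_loss_db: float = 0.0) -> float:
    """Estimate the sound pressure level (dB) at distance_m.

    Assumes free-field spreading from a point source; barrier_loss_db is a
    hypothetical extra attenuation for any obstacle in the path.
    """
    spreading_loss_db = 20.0 * math.log10(distance_m / ref_distance_m)
    return level_at_ref_db - spreading_loss_db - barrier_loss_db


# Illustrative values: speech at 60 dB and amplified playback at 90 dB,
# both measured at 1 m, heard 8 m away with no barrier in between.
spoken = level_at_distance(60.0, 8.0)
amplified = level_at_distance(90.0, 8.0)
print(f"spoken: {spoken:.1f} dB, amplified: {amplified:.1f} dB")
```

Under these assumptions every additional decibel at the source arrives unchanged at the device, so an amplifier offsets the loss caused by distance or by a barrier.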

Digital assistants do not distinguish between voice commands that are spoken directly and those reproduced by a loudspeaker. Attacks can therefore be carried out through any device that supports audio playback via loudspeaker. Radio and television can be used for mass attacks, and telephones, hands-free systems, other digital assistants or text-to-speech systems can also be misused.

Systems that anyone can reach anonymously and at any time, in the owner's absence, and that can be activated remotely are particularly critical. For example, an answering machine with an automatic monitoring function, which plays incoming messages aloud, could be used for remote and precisely targeted attacks.

Examples
A June 2014 television commercial from Microsoft depicted an actor issuing a voice command to an Xbox. The same command was interpreted by Xbox systems located in the same room as any television showing the commercial.

In December 2016, Adam Jakowenko showed how one virtual assistant could activate another using text-to-speech (TTS) processing.

In January 2017, television news coverage of a child ordering a dollhouse through Amazon's Alexa interface included news anchor Jim Patton quoting the child. The quotation triggered dollhouse orders on the Alexa devices of viewers who had them running during the broadcast.