Neuroscientists Transform Brain Activity to Speech with AI
UCSF neuroscientists use AI to generate speech from brain recordings.
Posted Apr 25, 2019
Artificial intelligence is enabling many scientific breakthroughs, especially in fields of study that generate high volumes of complex data, such as neuroscience. As impossible as it may seem, neuroscientists are making strides in decoding neural activity into speech using artificial neural networks. Yesterday, the neuroscience team of Gopala K. Anumanchipalli, Josh Chartier, and Edward F. Chang of the University of California, San Francisco (UCSF) published in Nature their study using artificial intelligence and a state-of-the-art brain-machine interface to produce synthetic speech from brain recordings.
The concept is relatively straightforward: record the brain activity and audio of participants while they read aloud, build a system that decodes brain signals into vocal tract movements, then synthesize speech from the decoded movements. Executing the concept required careful application of cutting-edge AI techniques and tools.
For this study, five fluent English-speaking participants with epilepsy consented to take part. Why participants with epilepsy? A gating factor in neuroscience is the availability of, and access to, living, functioning human brains for research; there are not many volunteers, for obvious reasons. Consenting epilepsy patients offer neuroscientists a valuable opportunity to conduct in vivo studies while the patients are undergoing invasive brain treatment. An estimated 30 percent of epilepsy patients are medically resistant: their seizures cannot be controlled by non-invasive methods such as antiepileptic medications. For those patients, invasive brain surgery may provide the best odds of controlling the seizures. The participants in this study were already undergoing surgery to temporarily implant electrodes that identify the areas where seizures occur, in preparation for upcoming neurosurgery to treat their epilepsy. As part of that treatment, each participant had a high-density subdural electrode array, which can record from or stimulate neurons with electrical impulses, implanted over the brain’s lateral surface.
The researchers recorded electrocorticography (ECoG) and audio synchronously as participants read aloud materials ranging from selected passages of stories such as The Frog Prince, Alice in Wonderland, and Sleeping Beauty to MOCHA-TIMIT, a specially designed phonetic database used for training automatic speech recognition.
The transcripts of the acoustic recordings were manually corrected at the individual word level for accuracy. From the audio recordings, the researchers reverse engineered the vocal tract movements required to make the sounds, creating a mapping of sound to anatomy.
The speech decoder consists of two separate bi-directional long short-term memory (bLSTM) recurrent neural networks. In the first stage (brain to articulation), a bLSTM was used to decode “articulatory kinematic features from continuous neural activity” recorded from the ventral sensorimotor cortex, superior temporal gyrus, and inferior frontal gyrus areas of the brain. This stage translated patterns of brain activity into movements of a virtual vocal tract.
In the second stage (articulation to acoustics), a second bLSTM was used to decode acoustic features, including voicing and mel-frequency cepstral coefficients (MFCCs), a compact representation of the sound spectrum that is widely used in speech recognition. In this stage, the acoustics are decoded from the previously decoded kinematics. The decoded signals are then converted into a synthetic version of the participant’s voice.
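The two-stage flow described above can be sketched in a few lines of plain Python. This is only an illustration of how data moves through two chained bidirectional LSTMs, not the researchers' implementation: the weights are random and untrained, and every name and dimension (BiLSTMStage, N_ELECTRODES, N_KINEMATIC, N_ACOUSTIC, the hidden size of 4) is a hypothetical choice made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMLayer:
    """A single-direction LSTM with random, untrained weights (toy example)."""
    def __init__(self, n_in, n_hidden, rng):
        self.n = n_hidden
        # One weight row per gate unit; the last column acts as the bias.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden + 1)]
                  for _ in range(4 * n_hidden)]

    def run(self, frames):
        n = self.n
        h, c = [0.0] * n, [0.0] * n
        outputs = []
        for x in frames:
            z = list(x) + h + [1.0]  # input, previous hidden state, bias term
            pre = [sum(w * v for w, v in zip(row, z)) for row in self.W]
            i = [sigmoid(p) for p in pre[0:n]]              # input gate
            f = [sigmoid(p) for p in pre[n:2 * n]]          # forget gate
            o = [sigmoid(p) for p in pre[2 * n:3 * n]]      # output gate
            g = [math.tanh(p) for p in pre[3 * n:4 * n]]    # candidate values
            c = [f[k] * c[k] + i[k] * g[k] for k in range(n)]
            h = [o[k] * math.tanh(c[k]) for k in range(n)]
            outputs.append(h)
        return outputs

class BiLSTMStage:
    """Reads a sequence forward and backward, then maps each frame to n_out features."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.fwd = LSTMLayer(n_in, n_hidden, rng)
        self.bwd = LSTMLayer(n_in, n_hidden, rng)
        self.readout = [[rng.uniform(-0.1, 0.1) for _ in range(2 * n_hidden)]
                        for _ in range(n_out)]

    def run(self, frames):
        forward = self.fwd.run(frames)
        # Process the reversed sequence, then flip back to align with time.
        backward = self.bwd.run(frames[::-1])[::-1]
        result = []
        for hf, hb in zip(forward, backward):
            joined = hf + hb
            result.append([sum(w * v for w, v in zip(row, joined))
                           for row in self.readout])
        return result

rng = random.Random(0)
N_ELECTRODES, N_KINEMATIC, N_ACOUSTIC = 8, 6, 5  # hypothetical sizes

brain_to_articulation = BiLSTMStage(N_ELECTRODES, 4, N_KINEMATIC, rng)
articulation_to_acoustics = BiLSTMStage(N_KINEMATIC, 4, N_ACOUSTIC, rng)

# Ten time frames of made-up "neural activity".
neural_frames = [[rng.gauss(0, 1) for _ in range(N_ELECTRODES)] for _ in range(10)]
kinematics = brain_to_articulation.run(neural_frames)   # stage 1
acoustics = articulation_to_acoustics.run(kinematics)   # stage 2

print(len(kinematics), len(kinematics[0]))  # 10 6
print(len(acoustics), len(acoustics[0]))    # 10 5
```

The point of the sketch is the chaining: the second stage never sees the neural signal directly, only the kinematic features produced by the first stage. A real system would train both networks on recorded data and would normally use a deep learning library rather than hand-written loops.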
The virtual vocal tract created for each participant was controllable by brain activity. The researchers used hidden Markov model-based acoustic models for each participant. A hidden Markov model (HMM) is a statistical model that describes a sequence of observable events that depend on factors that are not directly observable. HMMs have been used for decades in speech recognition to predict words from a recorded speech signal; there, the observable events are the acoustic signal and the hidden events are the words or phonemes being spoken. In part-of-speech tagging, another classic application in language processing, the hidden event is a part-of-speech tag that can be simple (e.g., noun, verb, adjective, preposition, article, conjunction, adverb) or more sophisticated, with information on verb tense, possession, case, gender, and so on.
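To make the idea of hidden and observable events concrete, here is a toy HMM in Python that recovers the most likely sequence of hidden part-of-speech tags from a sequence of observed words, using the standard Viterbi algorithm. It is not the acoustic model used in the study; the states, the vocabulary, and every probability below are invented purely for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states for the observations."""
    # Each column maps state -> (probability of the best path ending here, previous state).
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]

# Hidden states (not directly observable) and made-up probabilities.
states = ["article", "noun", "verb"]
start_p = {"article": 0.6, "noun": 0.3, "verb": 0.1}
trans_p = {
    "article": {"article": 0.05, "noun": 0.9, "verb": 0.05},
    "noun": {"article": 0.1, "noun": 0.2, "verb": 0.7},
    "verb": {"article": 0.5, "noun": 0.4, "verb": 0.1},
}
# Probability of observing each word given the hidden tag.
emit_p = {
    "article": {"the": 0.9, "a": 0.1},
    "noun": {"dog": 0.5, "cat": 0.4, "barks": 0.1},
    "verb": {"barks": 0.6, "runs": 0.3, "dog": 0.1},
}

tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
print(tags)  # ['article', 'noun', 'verb']
```

In a speech recognition setting the same machinery applies, with acoustic feature frames as the observations and phonemes or words as the hidden states.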
According to a UCSF report, transcribers “accurately identified 69 percent of synthesized words from lists of 25 alternatives and transcribed 43 percent of sentences with perfect accuracy.”
The researchers wrote that their results “may be an important next step in realizing speech restoration for patients with paralysis,” and brain-to-computer interfaces (BCIs) are “rapidly becoming a clinically viable means to restore lost function.”
Copyright © 2019 Cami Rosso All rights reserved.
Anumanchipalli, Gopala K., Chartier, Josh, Chang, Edward F. “Speech synthesis from neural decoding of spoken sentences.” Nature. 24 April 2019.
American Association of Neurological Surgeons. “Epilepsy.” Retrieved 4-25-2019 from https://www.aans.org/en/Patients/Neurosurgical-Conditions-and-Treatments/Epilepsy
The Centre for Speech Technology Research at The University of Edinburgh. “MOCHA-TIMIT.” Retrieved 4-25-2019 from http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
The Centre for Speech Technology Research at The University of Edinburgh. “The Festival Speech Synthesis System.” Retrieved 4-25-2019 from http://www.cstr.ed.ac.uk/projects/festival/
MNGU0. “Welcome to mngu0.” Retrieved 4-25-2019 from http://www.mngu0.org/
Jurafsky, Daniel, Martin, James H. “Speech and Language Processing.” Stanford University. September 11, 2018.
Yoon, Byung-Jun. “Hidden Markov Models and their Applications in Biological Sequence Analysis.” Current Genomics. Sept. 10, 2009.