In this ‘behind the paper’ post, Yaqing Su discusses how the lockdown and the pandemic reshaped her thinking while developing a model of how our brain understands speech, and how this model tries to address the difference between human language understanding and that of current artificial intelligence.
I am Yaqing Su, a postdoctoral researcher in the Department of Fundamental Neuroscience at the University of Geneva. Our recent paper in PLOS Biology describes a computational model of human speech comprehension that uses hierarchically organized predictions and updates to map a continuous acoustic signal onto discrete representations of meaning. We emphasize that non-linguistic knowledge, such as the situational context of a conversation, is important for making sense of language input in the real world.
In the fall of 2019, when I finished my PhD in auditory neuroscience and enthusiastically started working with my two advisors on a project to model speech perception, none of us could have predicted the path it would take.
The project was initially proposed to extend an in-house biophysical model of syllable recognition to higher levels of speech processing such as words and phrases, focusing on the role of nested neural oscillations in the prediction-update loop of neural information passing. After extensive reading and trial simulations, I realized that words and syllables are fundamentally different in at least two respects. First, a given language usually has a fixed and rather small set of syllables, but orders of magnitude more words. Second, syllables are meaningless by themselves, while each word has at least one associated meaning. This made me increasingly interested in how the brain transforms concrete representations (such as acoustic signals and words) into abstract ones (such as semantics and concepts), and how different types of representations interact in a bi-directional fashion.
However, implementing all these different speech processing levels with a biophysical model would be way too ambitious. Following suggestions from my advisors, I began to explore the framework of active inference developed by the theoretical neurobiology group at UCL, and happily found that its uncertainty-minimization approach could very likely be adapted to build the hierarchical predictive model I had in mind. When I reached out to the UCL group with software questions, Professor Karl Friston kindly invited me to give a talk and spend a short stay at the institute. Before heading to London, I had already managed to implement a beta version of the model using the experimental design of MacGregor et al., which explores the neural correlates of semantic ambiguity and disambiguation. Everything was going smoothly.
In early March 2020, I arrived at Heathrow airport amid its usual bustle, but COVID-19 was tightening its grip on Europe day by day. I departed Heathrow a week later in an unsettling silence. Shortly after I returned to Geneva, borders were closed, flights were cancelled, and potential collaborations were put on indefinite hold. Working on a purely computational project was a luxury during lockdown because I did not have to worry about lab closure, but also a curse because the indispensable spontaneous exchange of ideas with colleagues became very difficult, even with the help of Zoom. Progress was extremely slow, and I felt powerless, unable to help with the situation.
In retrospect, one lesson I learned from this project is that slow progress is sometimes not all that bad. The vast emptiness of lockdown compelled me to reflect on my motivations and on how I could make a positive impact with my work.
Many contemporary studies on the neural mechanisms of language processing were talking about “prediction” in a very specific way: the brain always tries to predict the next word (or, in some accounts, the next phoneme) when listening or reading. This idea was borrowed by the AI community, and one after another, large language models (LLMs) like GPT and BERT came out showing stunning language abilities. Some neuroscientists embraced this link between human and machine and naturally started using LLMs to help interpret neural responses to language, while others argued that LLMs do not really understand language the way humans do (these two viewpoints are not mutually exclusive). The latter discussion drew major attention as people started to fear that AI, in addition to viruses, would bring humanity to its doom.
I realized that my model was in a position to contribute to this discussion. Although prediction is at the core of the model, it is not explicitly aimed at predicting the next word as in LLMs. Words sit at an intermediate level, jointly “predicted” by semantics and syntax, which are in turn predicted by domain-general knowledge such as the situational context. In effect, the model can predict the incoming word, as well as the incoming syllable and the incoming acoustic signal. However, it can also make bad predictions and still figure out the message, i.e., the semantic roles and the context, just from bottom-up input. Prediction is a signal that drives information transfer between levels rather than the goal of the system. The goal is to make sense of the speech signal and figure out the speaker’s message. The additional semantic and context levels are crucial for this: they make sure that the acoustic signal gets “understood”, and they resolve semantic ambiguity when a word can be interpreted in more than one way. To us, this different interpretation of prediction and the inclusion of implicit “world knowledge” could be what separates human and machine language processing.
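To make the idea of a context level resolving an ambiguous word concrete, here is a deliberately tiny sketch in Python. It is not the model from the paper: it is a three-level toy (context, meaning, word) solved by brute-force Bayesian enumeration, and every name and probability in it (the contexts "cave" and "ballgame", the word "bat", the numbers) is made up for illustration.

```python
# Toy hierarchy: context -> meaning -> word.
# All labels and probabilities below are hypothetical, chosen only
# to illustrate the idea; they are not taken from the paper.
P_CONTEXT = {"cave": 0.5, "ballgame": 0.5}
P_MEANING_GIVEN_CONTEXT = {
    "cave": {"animal": 0.9, "club": 0.1},
    "ballgame": {"animal": 0.1, "club": 0.9},
}
P_WORD_GIVEN_MEANING = {
    "animal": {"bat": 0.5, "wings": 0.5},
    "club": {"bat": 0.5, "swing": 0.5},
}
MEANINGS = ["animal", "club"]


def word_likelihood(word, context):
    """P(word | context), summing over the hidden meaning."""
    return sum(
        P_MEANING_GIVEN_CONTEXT[context][m] * P_WORD_GIVEN_MEANING[m].get(word, 0.0)
        for m in MEANINGS
    )


def context_posterior(words):
    """P(context | all words heard so far), assuming one shared context."""
    scores = {}
    for context, prior in P_CONTEXT.items():
        score = prior
        for word in words:
            score *= word_likelihood(word, context)
        scores[context] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def meaning_posterior(words, index=0):
    """P(meaning of words[index] | all words heard so far)."""
    target = words[index]
    posterior = {m: 0.0 for m in MEANINGS}
    for context, p_context in context_posterior(words).items():
        # Meaning of the target word under this particular context.
        joint = {
            m: P_MEANING_GIVEN_CONTEXT[context][m]
            * P_WORD_GIVEN_MEANING[m].get(target, 0.0)
            for m in MEANINGS
        }
        norm = sum(joint.values())
        if norm == 0.0:
            continue
        for m in MEANINGS:
            posterior[m] += p_context * joint[m] / norm
    return posterior


if __name__ == "__main__":
    print(meaning_posterior(["bat"]))           # "bat" on its own
    print(meaning_posterior(["bat", "swing"]))  # after a disambiguating word
```

With these made-up numbers, "bat" on its own leaves the two senses equally likely, while hearing "swing" afterwards shifts belief about the context toward "ballgame" and, through it, the interpretation of "bat" toward "club". Nothing here predicts the next word explicitly; the later word changes the interpretation of the earlier one only by way of the shared context level.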
When we brought this viewpoint of human-machine divergence to the first meeting of NCCR Evolving Language, a Swiss national collaboration network trying to understand “the past, the present, and the future of language”, it sparked an unexpectedly warm discussion among researchers from backgrounds ranging from computer science to zoology. With NCCR’s technical support, we were able to show that GPT-2, a precursor of ChatGPT that runs on the same mechanism at a smaller scale, does not directly capture the brain’s signature response to semantic ambiguity or its resolution, whereas our simple hierarchical model does. The first chapter of this project was finally complete.
At the moment, I am taking this model further with a neuroimaging experiment. The excessive reading and painful thinking during the boredom of lockdown have now become my fuel. If this project had not coincided with the pandemic, the growing concerns about LLMs, and the birth of NCCR Evolving Language, this model would have taken a completely different shape and I would not have the perspective I have now. Scientific development rides on the wave of social context because scientists’ focus and motivation are always under the influence of the status quo.
About the author
Yaqing Su is a postdoctoral researcher in the group of Anne-Lise Giraud at the University of Geneva, Department of Fundamental Neuroscience. She is interested in the neural mechanisms underlying human speech and hearing processes. She did her PhD on how midbrain neurons encode pitch information, with Bertrand Delgutte at Harvard Medical School and Kamal Sen at Boston University. Before that, she trained in electrical engineering at Tsinghua University, China. ORCID: 0000-0002-8544-6284. @YaqingSu