During speech production, language embeddings (blue) in the IFG peaked earlier than speech embeddings (purple) peaked in the sensorimotor areas, followed by the peak of speech encoding in the STG. In contrast, during speech comprehension, the peak encoding shifted to after word onset, with speech embeddings (purple) in the STG peaking significantly earlier than language encoding (blue) in the IFG.
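Peak-timing comparisons of this kind are typically obtained by evaluating the encoding at a series of temporal lags relative to word onset and locating the lag of maximal correlation. The following is a minimal sketch of that idea on synthetic data; the single-feature setup, lag value, and noise level are illustrative assumptions, not the study's actual pipeline:

```python
# Illustrative lag analysis: a synthetic "neural" signal trails a
# synthetic "embedding" feature by a fixed delay, and we recover that
# delay as the lag of maximal correlation.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
feature = rng.standard_normal(n)        # one embedding dimension over time
true_lag = 30                           # assumed neural delay (samples)
neural = np.roll(feature, true_lag) + 0.5 * rng.standard_normal(n)

lags = list(range(0, 61, 5))
corrs = [np.corrcoef(np.roll(feature, lag), neural)[0, 1] for lag in lags]
peak_lag = lags[int(np.argmax(corrs))]
print(peak_lag)  # recovers the simulated 30-sample delay
```

In the real analysis, the x-axis of such a lag profile is time relative to word onset, so an earlier peak for one embedding type in one region is exactly the kind of ordering described above.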
Taken together, our findings suggest that speech-to-text model embeddings provide a cohesive framework for understanding the neural basis of language processing during natural conversations. Surprisingly, although Whisper was developed purely for speech recognition, without any consideration of how the brain processes language, we found that its internal representations align with neural activity during natural conversations. This alignment was not guaranteed: a negative result would have shown little to no correspondence between the embeddings and neural signals, indicating that the model's representations did not capture the brain's language-processing mechanisms.
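Alignment of this sort is commonly quantified with a linear encoding model: the embeddings are regressed onto each electrode's activity, and the held-out correlation between predicted and recorded signals serves as the alignment score. Below is a minimal, self-contained sketch under that assumption, using ridge regression on synthetic data in place of Whisper embeddings and intracranial recordings:

```python
# Sketch of a linear encoding analysis: predict a (synthetic) neural
# signal from (synthetic) per-word embeddings with ridge regression,
# scoring alignment as the held-out Pearson correlation.
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 400, 32
embeddings = rng.standard_normal((n_words, dim))   # one vector per word
weights = rng.standard_normal(dim)                 # hidden "true" mapping
neural = embeddings @ weights + 0.5 * rng.standard_normal(n_words)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X'X + aI)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def encoding_score(X, y, alpha=10.0):
    """Two-fold cross-validated correlation of predicted vs. actual activity."""
    half = len(y) // 2
    idx = np.arange(len(y))
    folds = [(idx[:half], idx[half:]), (idx[half:], idx[:half])]
    scores = []
    for train, test in folds:
        w = ridge_fit(X[train], y[train], alpha)
        pred = X[test] @ w
        scores.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(scores))

print(f"alignment (held-out r): {encoding_score(embeddings, neural):.2f}")
```

A "negative result" in this framing would mean held-out correlations near zero: the embeddings would carry no linear information about the neural signal.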
A particularly intriguing concept revealed by the alignment between LLMs and the human brain is the notion of a "soft hierarchy" in neural processing. Although language areas of the brain, such as the IFG, tend to prioritize word-level semantic and syntactic information, as indicated by stronger alignment with language embeddings (blue), they also capture lower-level auditory features, evident from the weaker yet significant alignment with speech embeddings (purple). Conversely, lower-order speech areas such as the STG tend to prioritize acoustic and phonemic processing, as indicated by stronger alignment with speech embeddings (purple), yet they also capture word-level information, evident from the weaker yet significant alignment with language embeddings (blue).