Visit: Theory of Mind in Multimodal Dialogue

On Monday, November 6, James Pustejovsky from Brandeis University will visit the VU and CLTL.

We would like to extend the invitation to members of the NI. To give us a sense of how many people will attend events throughout the day (including a talk on ToM; title and abstract below) so that we can reserve appropriate spaces, please fill out this form (places are limited, but it’s still possible to register!).

Additionally, if you would like to meet with James individually and have not already signed up, please email Lucia.

Many thanks, and hope to see you on the 6th!

  • Where: HG-05A33 @ VU
  • When: November 6, 13:30-15:00

Thanks to Lucia Donatelli & Piek Vossen for this

Modelling Theory of Mind in Multimodal Dialogue

James Pustejovsky

Theory of Mind (ToM) refers to the cognitive capacity that humans have to attribute mental states such as beliefs (true or false), desires, and intentions to oneself and others, thereby predicting and explaining behavior. Within the domain of Human-Computer Interaction (HCI), this concept has recently become more relevant for computational agents, especially in the context of multimodal communication. As multimodal interactions involve not only speech but also gestures, haptics, eye movement, and other types of input, each modality introduces subtleties that can be misinterpreted without a deeper understanding of the agent’s mental state. In this talk, I argue that Simulation Theory of Mind (SToM), encoded as an evidence-based dynamic epistemic logic (EB-DEL), can help model these complexities. Specifically, I apply this model to the problem of Common Ground Tracking (CGT) in task-oriented interactions.

Unlike dialogue state tracking (DST), which is the ability to update the representations of the speaker’s needs at each turn in the dialogue by taking into account the past dialogue moves and history, common ground tracking (CGT) identifies the shared belief space held by all of the participants in a task-oriented dialogue. Within the framework of SToM, I present a method for automatically identifying the current set of shared beliefs and questions under discussion (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth.