
 

This booklet contains:

1) The list of the invited speakers;

2) The preliminary list of the contributors;

3) The abstracts of the invited lectures;

4) The abstracts of the proposed contributions.

 

The contributions proposed in this booklet will undergo peer review, and the selected ones will be published in the proceedings of the school.

 

List of Invited Speakers

 

Nick CAMPBELL

National Institute for Information and Communication Technology, and ATR (Advanced Telecommunications Research Institute International) Spoken Language Communication Research Laboratories, Keihanna Science City, Kyoto 619-0288, Japan, nick@nict.go.jp;

title of the talk:

It Takes Two to Discourse: Technology and Techniques for Interactive Speech Processing

 

G. Chollet1, L. Zouari1, P. Perrot1,2, C. Pelachaud3,4, P. Horain5, A. Gentes1

 1CNRS LTCI/TSI Paris, 46 rue Barrault, 75634 Paris Cedex 13 – FRANCE, 2IRCGN, 3Paris8, 4INRIA, 5INT, gerard.chollet@enst.fr; leila.zouari@enst.fr; patrick.perrot@TELECOM-ParisTech.fr;  pelachaud@iut.univ-paris8.fr;  patrick.horain@int-edu.eu;  annie.gentes@TELECOM-ParisTech.fr; 

title of the talk:

Multimodal HMI in Virtual and Augmented Reality Applications

 

Marion Dohen

GIPSA-Lab, France, marion.dohen@gipsa-lab.inpg.fr;

title of the talk:

Multimodal perception of speech segments and speech prosody in relationship with their production

 

Marion Dohen

GIPSA-Lab, France, marion.dohen@gipsa-lab.inpg.fr;

title of the talk:

Coordination of Mouth and Hand Actions in Speech Communication

S. Duncan1, A. Esposito2, K. Rohlfing3, C. Sowa4, C. W.-C. So5, A. Franklin6, S. Almutawa1

1University of Chicago, 2Second University of Naples, 3Bielefeld University, 4Friedrich-Alexander-University of Erlangen-Nuremberg, 5National University of Singapore, 6Rice University, deng@ameritech.net

title of the talk:

Gesturing compared across cultures and contexts: Some implications for language theory

 

Anna Esposito

Dipartimento di Psicologia, Seconda Università di Napoli, and  IIASS Italy, iiass.annaesp@tin.it; anna.esposito@unina2.it;

title of the talk:

Affect in Multimodal Information

 

Marcos Faundez-Zanuy

Escola Universitària Politècnica de Mataró (Adscrita a la UPC) 08303 MATARO (BARCELONA), Spain, faundez@eupmt.es;

title of the talk:

Data Fusion at Different Level

 

Eric Keller

IMM, University of Lausanne, Switzerland, Eric.Keller@unil.ch;

title of the talk:

Fundamental Concepts of Voice Analysis

 

 

Adam Kendon

University of Napoli “L’Orientale”, adamk@dca.net

title of the talk:

The structure of gestural performance in extended discourse: Some illustrative examples

 

Bernd J. Kröger

University Hospital Aachen, Germany, bkroeger@ukaachen.de;

title of the talk:

Perspectives for Articulatory Speech Synthesis within the Framework of Audio-Visual Human-Machine Interaction Systems

 

Catherine Pelachaud,

Universite de Paris 8, France,  c.pelachaud@iut.univ-paris8.fr;

title of the talk:

Behavior parametrization for embodied conversational agents

 

Alex Pentland

Room E15-387, 20 Ames St., Cambridge MA 02139, MIT, USA, pentland@media.mit.edu;

title of the talk:

Honest Signals

 

Isabella Poggi

Dipartimento di Scienze dell’Educazione, Università Roma Tre

poggi@uniroma3.it;

title of the talk:

Gesture and gaze in persuasive discourse: 1) Rhetoric and body communication; 2) Persuasion as a hierarchy of goals; 3) Gesture and gaze in political discourse

 

Michael Pucher

Telecommunications Research Center Vienna (FTW) Tech Gate Vienna, Donau-City-Strasse 1, 3rd floor A-1220 Vienna Austria, pucher@FTW.at;

title of the talk:

Regionalized Text-to-Speech Systems: Persona Design and Application Scenarios

 

Uli Sauerland and Mathias Schenner

Centre for General Linguistics, Berlin, Germany, uli@alum.mit.edu, m.schenner@gmail.com;

title of the talk:

Content in Embedded Sentences: An Overview

 

Kristinn R. Thórisson
Center for Analysis and Design of Intelligent Agents (CADIA), School of Computer Science, Reykjavík University, Iceland,  thorisson@gmail.com

title of the talk:

Realtime Multimodal Turntaking

 

Wolfgang Tschacher

University Hospital of Psychiatry, University of Bern, Laupenstrasse 49, CH-3010 Bern, Switzerland, tschacher@spk.unibe.ch;

title of the talk:

Embodied cognition entails embodied communication

 

Hannes Högni Vilhjálmsson
Center for Analysis and Design of Intelligent Agents (CADIA), School of Computer Science, Reykjavík University, Iceland, hannes@ru.is, http://www.ru.is/faculty/hannes

title of the talk:

Representing Function and Behavior in Multimodal Communication


Matthias Wimmer

Institute for Informatics 9, Technische Universitaet Muenchen, Boltzmannstrasse 3, 85748 Garching bei Muenchen, GERMANY, wimmerm@cs.tum.edu, www9.cs.tum.edu/people/wimmerm

title of the talk:

Facial Expression Recognition Implemented with Model-based Image Interpretation

 

Bencie Woll
Deafness, Cognition and Language Research Centre, University College London,

b.woll@ucl.ac.uk, www.dcal.ucl.ac.uk

title of the talk:

Exploring Sign Language

 

Preliminary List of Contributors

Abel Andrew

University of Stirling, Scotland, aka@cs.stir.ac.uk

title of the talk (coauthor Amir Hussain):

Multi-modal speech processing methods: An overview and possible future directions using a MATLAB based Audio-visual Toolbox

 

Atassi Hicham

Brno University of Technology, Czech Republic, xatass01@stud.feec.vutbr.cz

title of the talk:

Acoustic  Features of Emotional States

 

Cermak Jan

Academy of Sciences of the Czech Republic, Czech Republic, cermak4@kn.vutbr.cz

title of the talk (coauthor Smekal Z):

Underdetermined blind source separation by linear separation system.

 

Chetouani Mohamed

Universite Pierre et Marie Curie, France, chetouani@upmc.fr

title of the talk (coauthor Fabien Ringeval):

A Pseudo-phonetic Approach for Speech Characterisation of Autistic

 

Costen Nick

Manchester Metropolitan University, Manchester, UK, n.costen@mmu.ac.uk

title of the talk:

Temporal integration in computer face and movement recognition

 

Demir Yasemin

Koç Üniversitesi, Turkey, ydemir@ku.edu.tr

title of the talk (coauthors Engin Erzin and Murat Tekalp):

Multimodal Signal Processing for Dance Performance Analysis

 

Georgescu Alexandru

University of Timişoara, Romania, alexandrugeorgescu@gmail.com

title of the talk (coauthor Alina E. Lascu):

Exploiting Protensity of Musical Messages

 

González Domínguez Verónica

Escuela Universitaria Politécnica de Mataró, Spain, gondomve@eupmt.upc.edu

title of the talk:

Eye detection from facial images

 

Grassi Marco

DEIT Università Politecnica delle Marche, Ancona, Italy, margra75@hotmail.com

title of the talk (coauthor Marcos Faundez-Zanuy)

Face Localization in 2D Frontal Face Images Through Luminosity Profiles Analysis

 

Hempel René

Technische Universität Dresden, Germany, ReneHempel@hotmail.de

title of the talk (coauthor Patrick Westfeld):

Statistical Modeling of Interpersonal Distance with Range Imaging Data

 

Heylen Dirk

University of Twente, The Netherlands, heylen@cs.utwente.nl

title of the talk (coauthor Anton Nijholt):

Presence Agents: Physical and Virtual Agents with Interaction Behavior that Supports Being There

 

Karpiński Maciej

Adam Mickiewicz University, Poland, maciej, kaprinski@amu.edu.pl   

title of the talk:

From Sounds and Gestures to Dialogue Acts. Selected Problems of a Multimodal Corpus Labelling

 

Kostoulas Theodoros

University of Patras, Greece,  tkost@wcl.ee.upatras.gr     

title of the talk (coauthors Todor Ganchev, Nikos Fakotakis):

Study on speech parameterization for emotion recognition

 

Koutsombogera Maria  

Institute for Language and Speech Processing, Greece, mkouts@ilsp.gr       

title of the talk (coauthor Papageorgiou Harris)

Multimodality Issues in Conversation Analysis of Greek TV Interviews

 

Lalev Emilian

New Bulgarian University, Bulgaria, elalev@cogs.nbu.bg

title of the talk:

Social Phenomena Related to Cooperation in Prisoner's Dilemma

 

Lascu Alina

Lucian Blaga University of Sibiu, Romania, alina.lascu@gmail.com

title of the talk (coauthor Georgescu Alexandru) :

From Extensity to Protensity in CAS: Adding Sounds to Icons

 

Mahdhaoui Ammar

Institut des Systèmes Intelligents et Robotiques ISIR, France/Tunisia, ammar.mahdhaoui@robot.jussieu.fr

title of the talk (coauthor Mohamed Chetouani):

Home movies segmentation for face-to-face interaction analysis

 

Malatesta Lori

National Technical University of Athens, Greece, lori@image.ntua.gr

title of the talk:

Situated Non-verbal expressivity synthesis

 

Maskeliunas Rytis

Kaunas University of Technology, Lithuania, rytmask@kyu.lt

title of the talk:

Multimodal human - machine interface modeling

 

Mirilovič Michal

Technical University of Košice, Slovakia, michal.mirilovic@tuke.sk

title of the talk (coauthors Jozef Juhár, Anton Čižmár):

Comparison of grapheme and phoneme based acoustic modelling in LVCSR task in Slovak

 

Murphy Peter

University of Limerick, Ireland, peter.murphy@ul.ie

title of the talk:

Glottal Characteristics of Emotion Based on Electroglottogram Measures

 

Nelson Delroy

DCAL, University College London, England, delroy.nelson@ucl.ac.uk

title of the talk:

Using a signing avatar as a sign language research tool

 

Neubarth Friedrich

OFAI, Vienna, Austria, friedrich.neubarth@ofai.at

title of the talk (coauthor Kranzler Christian) :

A Distributional Concept for Modeling Dialectal Variation in TTS

 

Op Den Akker Henderikus

University of Twente, The Netherlands, infrieks@cs.utwente.nl

title of the talk:

The Computability of Human Interaction

 

Pedica Claudio

Università di Camerino, Italy, claudio.pedica@studenti.unicam.it

title of the talk:

A software system demonstration

 

Pirker Hannes

Austrian Research Institute for AI, Austria, hannes.pirker@ofai.at

title of the talk:

Get a Grip on the Phone(me): The Application of Phonetic Segmentation in Multimodal Emotional Data

 

Přibilová Anna

Slovak University of Technology in Bratislava, Slovakia, anna.pribilova@stuba.sk

title of the talk:

Spectrum Modification for Emotional Speech Synthesis

 

Punys Jonas

Kaunas University of Technology, Lithuania, jonas.punys@ktu.lt

title of the talk (coauthors Jurate Puniene, and Vytautas Rudzionis):

Modelling of lips for the main vowels of the Lithuanian language

 

Reidsma Dennis

University of Twente, The Netherlands, dennisr@ewi.utwente.nl

title of the talk:

Annotation Reliability and Machine Learning Performance

 

Reyes-Garcia Carlos

INAOE, Computer Science, Mexico, kargaxxi@yahoo.com

title of the talk:

Qualitative and quantitative crying analysis of new born babies delivered after high risk gestation

 

Ruttkay Zsófia

University of Twente, The Netherlands, zsofi@cs.utwente.nl

title of the talk:

Cultural Dialects of Real and Synthetic Facial Expressions

 

Ter Maat Mark

University of Twente, The Netherlands, maatm@ewi.utwente.nl

title of the talk:

Differentiating Communicative Signals using Context

 

Vicsi Klara

Budapest University of Technology and Economics, Hungary, vicsi@tmit.bme.hu

title of the talk (coauthors Tóth Szabolcs Levente, Sztaho David):

Optimalization of the Automatic Emotion Recognition of Speech in case of Some Languages

 

 

Vincze Laura

University of Pisa, Romania/Italy, vincze@ling.unipi.it  

title of the talk:

The Impact of Gesture and Gaze in the Persuasive Political Discourse

 

Vondra Martin

Academy of Sciences CR, Czech Republic, vondra@ufe.cz

title of the talk (coauthor Robert Vích):

Recognition of Emotions in Czech Speech Using Gaussian Mixture Model

 

Zoric Goranka

Faculty of Electrical Engineering and Computing, Croatia, goranka.zoric@fer.hr

title of the talk:

Facial Gestures Generation by Speech Signal Analysis using HUGE Architecture

 

 

 

 

It Takes Two to Discourse:
Technology and Techniques for Interactive Speech Processing

Nick CAMPBELL

National Institute for Information and Communication Technology, and ATR (Advanced Telecommunications Research Institute International) Spoken Language Communication Research Laboratories, Keihanna Science City, Kyoto 619-0288, Japan

nick@nict.go.jp;

ABSTRACT:

Much of current speech technology is designed for use with broadcast-style modes of speech. However, most human social interaction uses a more interactive, two-way, conversational style of speaking. In broadcast mode, the speaker is not aware of the presence, or attentional states, of the listener and carries on speaking regardless of whether anyone is listening, neither checking nor adjusting the style and content of speech according to the understanding of the listener. In conversational mode, on the other hand, the speaker and listener interact in a delicate interplay of often overlapping discourse segments to jointly contribute to the mutual generation of conversational content and meaning.

 

In this presentation and its accompanying demonstrations, use is made of samples of speech turns taken from a large corpus of very natural conversations to show that as much as 50% of each partner's speech is overlapping at some point. It also shows how the discourse contents can be separated into 'content' and 'collaboration' elements, illustrating how the listener actively contributes to the discourse. The talk and demonstrations introduce novel technology for processing this form of interactive speech, and show how future speech technology might be able to sense the presence and attentional states of a human listener in order to be able to adapt its output speaking style and content, according to the comprehension and interests of the human partner.
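As a rough illustration of how the proportion of overlapping speech mentioned above might be quantified, the following sketch computes, for each partner, the fraction of speaking time that coincides with the other partner's speech. It is only one possible reading of the overlap figure: the time-based measure, the function names, and the example turn boundaries are all assumptions made for illustration, not taken from the corpus or the system described in the talk.

```python
from typing import List, Tuple

# A speech turn as (start_seconds, end_seconds). Hypothetical representation.
Interval = Tuple[float, float]


def total_duration(turns: List[Interval]) -> float:
    """Sum of the durations of all turns."""
    return sum(end - start for start, end in turns)


def overlap_duration(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which a turn in `a` coincides with a turn in `b`.

    Assumes each list is sorted by start time and that turns within one
    list do not overlap each other.
    """
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever turn finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def overlap_fraction(speaker: List[Interval], partner: List[Interval]) -> float:
    """Fraction of the speaker's talking time that overlaps the partner's."""
    duration = total_duration(speaker)
    if duration == 0:
        return 0.0
    return overlap_duration(speaker, partner) / duration


# Invented example turns, in seconds.
speaker_a = [(0.0, 4.0), (6.0, 10.0)]
speaker_b = [(3.0, 7.0), (9.0, 12.0)]
fraction_a = overlap_fraction(speaker_a, speaker_b)
fraction_b = overlap_fraction(speaker_b, speaker_a)
```

A count-based measure (the share of turns that overlap the partner's speech at any point) would be an equally plausible reading and would generally give a higher figure than this time-based one.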

 

The prototype system presented here was first developed to monitor speech and discourse activity in meetings, to provide an estimate of each participant's involvement in the various stages of the meeting, but it has recently been adapted for use in a one-on-one interaction with a single user as part of a computer-based spoken dialogue system. The technology incorporates video analysis to detect the presence of a listener or listeners and uses face-detection software to count the faces of those present who are close enough to be looking at the computer screen. It then measures the amount of movement detected in the areas around and below the faces, using synchrony of movement with events in the speech signal to infer knowledge about the reactions of the listener. In addition to the video information, the system also makes use of very low-level audio information coming from those present, in order to detect nonverbal speech sounds such as laughs, grunts, backchannel utterances, and so forth. From the timely conjunction of the movements and the nonverbal speech sounds, useful inferences can be made about the attentional states of those present so that changes can be made as necessary to the output speech style and content.

 

A further point made in the talk and accompanying demonstrations is that the tempo of a discourse frequently changes throughout the conversation as the partners come mentally closer to one another or drift further apart as they become more or less involved in the conversation, and as they find elements of common interest in the content. Whereas studies of the speech characteristics in this type of interaction have typically been referenced under the umbrella of 'emotion' in previous work, we prefer to tackle the topic within the framework of nonverbal speech, and to incorporate it more intimately into the spoken dialogue component. Annotated speech samples are provided that show human labellers to be remarkably consistent in their perception of such 'rapport' in a conversation, and steps are now being taken to train statistical models to respond to these acoustic characteristics in much the same way as human labellers do.

 

The overriding theme of the presentation is that conversation is not just about what people say, but equally about how they interact: specifically, how they use speech, speaking style, lexical choice, discourse fragments, etc., interactively and collaboratively to produce a shared interpretation of the topic under discussion. The problems are illustrated, and technological components are described that will allow computer speech processing to take a step forward and move from the limitations of simple broadcast mode to being able to participate actively in a conversation in a much more human-friendly manner.

 

Multimodal HMI in Virtual and Augmented Reality Applications

G. Chollet1, L. Zouari1, P. Perrot1,2, C. Pelachaud3,4, P. Horain5, A. Gentes1

 1CNRS LTCI/TSI Paris, 46 rue Barrault, 75634 Paris Cedex 13 - FRANCE

2IRCGN, 3Paris8, 4INRIA, 5INT

gerard.chollet@enst.fr; leila.zouari@enst.fr; patrick.perrot@TELECOM-ParisTech.fr;  pelachaud@iut.univ-paris8.fr;  patrick.horain@int-edu.eu;  annie.gentes@TELECOM-ParisTech.fr;  

 

ABSTRACT:

This article investigates multimodal Human Machine Interaction (HMI) within a framework of virtual and augmented reality applications. When somebody faces a camera, he is represented by an Embodied Conversational Agent (with the same face and voice) in a virtual world. The face and the body of this avatar are animated according to the expressions/keywords produced by the real person. In order to hide his identity, the person can change the voice and the face of the avatar. This application involves many technologies belonging to different fields, mainly Speech Processing, Embodied Conversational Agents, and Gesture Analysis and Synthesis.

Gesturing compared across cultures and contexts: Some implications for language theory

S. Duncan1, A. Esposito2, K. Rohlfing3, C. Sowa4, C. W.-C. So5, A. Franklin6,

 S. Almutawa1

1University of Chicago, 2Second University of Naples, 3Bielefeld University,

4Friedrich-Alexander-University of Erlangen-Nuremberg, 5National University of Singapore,

6Rice University,

deng@ameritech.net

ABSTRACT:

For decades now, research in psychology, anthropology, and linguistics has accumulated evidence showing that coverbal gesturing is part of human face-to-face communication in every language-cultural group studied. This evidence of the universality of gesture is foundational for many current psychological and linguistic anthropological theories of language production and comprehension; so-called ‘embodied cognition’ or ‘embodied communication’ accounts of the evolved human capacity for language. However, many people who contemplate the gestural dimension of human communication, both within academic research circles and outside of them, generally also endorse the tenet of ‘folk wisdom’ that members of some cultures gesture substantially more or less than members of other cultures. There are implications in this, if true, for theories of the role of gesture in language, generally, and for the cross-linguistic validity of such theories.

In this presentation we will consider the notion of a “high gesture culture” (e.g., Pika, Nicoladis & Marentette 2006), drawing on observations from studies of native members of several different language/cultural groups. The focal study is a descriptive comparison of five groups: Egyptian Arabic, Neapolitan Italian, mainland Mandarin Chinese, northern German, and American English speakers. We compare the gesturing of these groups in two different natural discourse contexts: group free conversation and dyadic storytelling. The elicitations were videotaped, enabling cross-language/culture comparisons of gesturing by discourse genre and within-individual comparisons across the two genres.

In keeping with claims that gesturing is a language universal, the main observation we can report is a broad cross-group similarity in tendency to gesture when speaking. The five groups in the focal study differed only somewhat in rate of gesture (from highest rate to lowest): Americans, Neapolitans, Egyptians, Germans, Chinese. A more noticeable cross-group difference was in the proportions of spontaneously generated, ‘representational’ gestures relative to recurring, conventionalized gesture forms. Notable in the Chinese and, to some extent, the Germans, were high counts of ‘self-adapters’ or self touching. A kind of gesture self-suppression seemed likely to be partly responsible for this characteristic. Within each of the five language/cultural groups, individuals varied considerably in gesture style and in tendency to gesture at all.

We consider the implications of these observations for a cross-linguistically valid theory of the role of gesture in language and interaction. The observations make clear the multi-functional nature of coverbal gesturing and the need to differentiate multiple factors that prompt or suppress the occurrence of gestures and influence their forms, all of which may be subject to cultural norms for interaction.

 

 

Multimodal perception of speech segments and speech prosody in
relationship with their production

Marion Dohen

GIPSA-Lab, France,

 marion.dohen@gipsa-lab.inpg.fr;   

ABSTRACT:

The aim of this talk will be to investigate multimodal perception of speech linked to its production. Both segmental and suprasegmental aspects of speech will be considered. We shall first recall how speech sounds and images are fused in speech perception, starting from speech in noise and the McGurk effect. This will show that speech communication is intrinsically multimodal. We shall discuss possible cognitive architectures for audiovisual fusion. These will be related to recent neurocognitive data on perceptuo-motor links in the human brain, from mirror neurons to the cortical dorsal route of speech perception. Then we shall present some recent data we have obtained on audiovisual prosody, dealing with the audiovisual perception and production of contrastive focus.

 

Coordination of mouth and hand actions in speech communication
Marion Dohen

GIPSA-Lab, France,

marion.dohen@gipsa-lab.inpg.fr;

ABSTRACT:

Starting from an evolutionary perspective on vocal vs. gestural origins of language, we shall propose a view in which the coordination between mouth and hand is given a key role in speech communication. We shall present a framework called “Vocalize to Localize” in which deixis is considered as a possible bootstrap for this coordination in both evolution and development. Then we shall present data, from both infants and adults, describing hand-mouth coordination in deixis in a quantitative way. Finally, we shall describe a framework for studying mouth-hand coordination more broadly, proposing experimental paradigms concerned with the production and perception of prosodic focus.

 

Affect in Multimodal Information

Anna Esposito

Dipartimento di Psicologia, Seconda Università di Napoli, and  IIASS Italy

iiass.annaesp@tin.it; anna.esposito@unina2.it;

ABSTRACT:

In face-to-face communication, the emotional state of the speaker is transmitted to the listener through a synesthetic process that involves both the verbal and the nonverbal modalities of communication. From this point of view, the transmission of the information content is redundant, since the same information is transferred through several channels. How much information about the speaker’s emotional state is transmitted by each channel, and which channel plays the major role in transferring such information? This work tries to answer these questions through a perceptual experiment that evaluates the subjective perception of emotional states in the single channels (either visual or auditory) and in the combined channels (visual and auditory). Results seem to show that, taken separately, the semantic content of the message and the visual content of the message carry the same amount of information as the combined channels, suggesting that each channel performs a robust encoding of the emotional features, which proves very helpful in recovering the perception of the emotional state when one of the channels is degraded by noise.

 

Data fusion at different level

Marcos Faundez-Zanuy

Escola Universitària Politècnica de Mataró (Adscrita a la UPC)
08303 MATARO (BARCELONA), Spain

faundez@eupmt.es;

ABSTRACT:

Data fusion is a key means of improving the recognition accuracy of any pattern recognition system. The main possibilities can be summarized in four different levels:

a) sensor level: different signals are acquired by different sensors, or by a single sensor performing consecutive acquisitions;

b) feature extractor level: different feature vectors, obtained with different procedures, are combined to produce measures that together outperform each one alone;

c) score level: when dealing with different classifiers, each one can produce an opinion or score. These scores, after normalization, can be combined in order to achieve a more accurate combined opinion;

d) decision level: after the decision of each classifier, the labels assigned by each classifier can also be combined in a similar way to the score level, for instance by majority voting.
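The score-level and decision-level combinations described above can be illustrated with a minimal sketch. It assumes min-max normalization followed by simple averaging for the score level, and plain majority voting for the decision level; these particular choices, the function names, and the example scores are illustrative assumptions rather than the methods presented in the lecture.

```python
from collections import Counter
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale one classifier's scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: no preference. Returning 0.5 is an arbitrary choice.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def score_level_fusion(per_classifier_scores: List[List[float]]) -> int:
    """Average the normalized scores per class and return the best class index."""
    normalized = [min_max_normalize(s) for s in per_classifier_scores]
    n_classes = len(normalized[0])
    fused = [
        sum(clf[k] for clf in normalized) / len(normalized)
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=fused.__getitem__)


def decision_level_fusion(labels: List[int]) -> int:
    """Majority vote over the labels chosen by the individual classifiers."""
    return Counter(labels).most_common(1)[0][0]


# Invented scores from three classifiers over three classes, on different scales.
scores = [
    [0.2, 0.9, 0.4],
    [10.0, 30.0, 80.0],
    [1.0, 3.0, 2.0],
]
fused_label = score_level_fusion(scores)

# Each classifier's own decision is its highest-scoring class.
individual_labels = [max(range(len(s)), key=s.__getitem__) for s in scores]
voted_label = decision_level_fusion(individual_labels)
```

Normalization matters here because the second classifier's raw scores are on a much larger scale and would otherwise dominate the average. Weighted sums or product rules are common alternatives to the plain mean.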

 

Fundamental Concepts of Voice Analysis

Eric Keller, IMM, University of Lausanne, Switzerland

Eric.Keller@unil.ch;

ABSTRACT:

The structure of gestural performance in extended discourse: Some illustrative examples
Adam Kendon,

University of Napoli “L’Orientale”

adamk@dca.net

 

ABSTRACT:

In 1972 I demonstrated how, in extended discourse, there is a hierarchical organization to the kinesics of the speaker that matches the hierarchical organization of his concurrent speech flow (Kendon 1972). The speech flow was described as a nested hierarchy of tone units and tone-unit groupings, and it was shown that co-occurring phrases of bodily movement (in hands, arms, trunk, head) were organized in an analogous and matching manner. This paper was the first detailed demonstration of the intimate coordination of speech and gesture, the theoretical importance of which has since been emphasized in the work of David McNeill, among others. The demonstration of 1972 has never been repeated. Here I will attempt a new demonstration of this organization with two pieces of extended discourse (taken from academic lectures). Problems of defining units in the kinesic flow and how they are interrelated will be discussed. [See Adam Kendon, "Some relationships between body motion and speech: An analysis of an example." In A. Siegman and B. Pope, eds., Studies in Dyadic Communication. New York: Pergamon Press, 1972, pp. 177-210.]

 

Perspectives for Articulatory Speech Synthesis within the Framework of Audio-Visual Human-Machine Interaction Systems

Bernd J. Kröger

University Hospital Aachen, Germany

bkroeger@ukaachen.de;

ABSTRACT:

Articulatory speech synthesis comprises (i) a module for the generation of respiratory, phonatory, and articulatory control information on the basis of linguistic information (control module), (ii) a module for the generation of a continuous temporal sequence of (three-dimensional) vocal tract geometries (articulatory part of the vocal tract model), and (iii) a module for the generation of the acoustic speech signal (acoustic part of the vocal tract model). Since the visualization of vocal tract components (lips, tongue, etc.) is usually a part of articulatory speech synthesis, such a system could serve as the starting point for an audio-visual or multimodal speech synthesis system that generates a facial (and optionally a whole-body) animation together with the acoustic speech signal. This would require extending the system with facial and body movement control information and integrating the vocal tract model into a whole-body model (avatar), creating a naturally speaking and naturally articulating humanoid. Moreover, since articulatory speech synthesis is capable of producing speech in a more flexible way than corpus-based speech synthesis (i.e. the variation of glottal and vocal tract features can be done in a straightforward way, leading to different voice qualities and different articulatory settings), it is useful for synthesizing different speakers (e.g. male vs. female, child vs. adult), different voice types, different speaking styles, and also different emotional states.

This lecture reviews current computer-implemented articulatory speech synthesis systems. In addition, a major aim of this lecture is to argue for the integration of cognitive strategies into the control module of articulatory speech synthesis systems. Since imaging techniques have made brain functions more and more transparent over the last decades, it is now possible to establish neurocomputational models for action or gesture execution, and thus it is also possible to develop neurocomputational control models for speech production. These models generally include self-learning strategies (normally occurring during speech acquisition), and they include feedforward and feedback control concepts on the sensorimotor levels and on the cognitive levels of speech production. At the least, the importance of feedback control in speech production requires the implementation of a perception module as a part of, or in parallel to, the production module (perception-production link). That in principle enables us to build up a complex audio-visual human-machine interaction system in a straightforward way. Focussing on speech, this complete system would be able to do speech recognition and speech synthesis, would be capable of profiting from the audio-visual input coming from a speaker in order to increase recognition rates and to increase or adapt the quality of the acoustic output, and would possibly be capable of detecting linguistic or phonetic-articulatory malfunctions of a speaker, thus helping speakers to become aware of their mistakes and to improve their speaking capabilities. Thus the approach introduced in this paper opens the door for the development of complex audio-visual human-machine interaction systems for all aspects of dialogue, such as (i) automatic speech recognition and speech synthesis applications, (ii) language teaching applications, and (iii) applications in speech therapy.

Behavior parametrization for embodied conversational agents

Catherine Pelachaud,

Universite de Paris 8, France,

c.pelachaud@iut.univ-paris8.fr;

ABSTRACT:
In this presentation we will describe our work toward the creation of an Embodied Conversational Agent (ECA). ECAs are human-like entities capable of communicating with other ECAs and/or users. They exhibit synchronized verbal and nonverbal behaviors (facial expression, gesture, body movement and gaze). Behaviors are defined not only by the signals that compose them but also by how they are displayed. We will present an expressivity model in which six parameters have been designed to change gesture and face expressivity. Lately we have extended this model to implement ECAs exhibiting distinctive behaviors. In this talk we will describe the architecture we have developed for such agents as well as the representation language we use to describe behaviors.

 

Honest Signals

Alex Pentland

Room E15-387, 20 Ames St., Cambridge MA 02139, MIT, USA

pentland@media.mit.edu;

ABSTRACT:

Many types of human behavior can be reliably predicted from biologically based “honest signaling” behaviors. These ancient primate signaling mechanisms, such as the amount of synchrony, mimicry, activity, and emphasis, form a separate communication channel that provides an effective window into our intentions, goals, and values. By examining this ancient channel of communication, paying no attention to words or even who the people are, we can accurately predict outcomes of dating situations, job interviews, and even salary negotiations.

 

Gesture and gaze in persuasive discourse

Isabella Poggi

Dipartimento di Scienze dell’Educazione, Università Roma Tre

poggi@uniroma3.it;

ABSTRACT:

Rhetoric and body communication

The importance of body behaviour in persuasive discourse has been acknowledged since the ancient Roman treatises on Rhetoric, like Cicero’s “De Oratore” and Quintilian’s “Institutio Oratoria”. In the rhetorical tradition, gestures and head movements have been studied as an indispensable part of “Actio” (discourse delivery), since they were credited with the capacity to convey various communicative functions. By gestures and other body movements we can summon, promise, exhort, incite, approve, express apology or supplication, display emotions like regret, anger, indignation and adoration, and depict or point at objects. For example, Quintilian provides detailed hints, mainly with a normative intent, about which movements may be more or less effective in portraying a particular image of the orator, which ones make him more similar to a comic actor, which can excite the audience, and so on.

Recent studies survey various aspects of the body's relevance in political communication (Atkinson 1984), like the use of pauses and intonation to quell the applause (Bull 1986), or facial expression and other bodily behaviours (Frey 2000; Bucy and Bradley 2004). Three recent studies directly concerned with the impact of gestural communication on political discourse are Calbris (2003), Kendon (2004) and Streeck (2007). Calbris analyses the gestures of Lionel Jospin as a route to understanding the intimate expression of his political thought: the metaphors exploited by his manual behaviour (whether he uses the left or right hand, and with which shape) can express abstract notions like effort, objective, decision, balance, priority, private or public stance; but they also fulfil discourse functions: they can delimit or stress, enumerate or explicate the topics of discourse.

Some of the gestures analysed by Kendon (2004) are also used with a persuasive intent: for example, the "ring" gestures, which bear a meaning of 'making precise' or 'clarifying', are used whenever this clarification is important "in gaining the agreement, the conviction or the understanding of the interlocutor" (p. 241).

Streeck (2007) analyses the gestural behaviour of the Democratic candidates during the 2004 political campaign in the USA. By attributing the defeat of Howard Dean to the frequency of his "finger wag", a "hierarchical act" that might have given an impression of a presumptuous and contemptuous attitude, he shows how much importance may be credited to bodily behaviour in the persuasive import of political discourse. Moreover, he also shows how the tempo of body movements and their relation to speech rhythm provide information about discourse structures, distinguishing background from foreground information.

If gesture is so important in conveying information that is effective in persuasion, facial behaviour could also be relevant in this connection. For example, in Italian political talk shows, while a politician is talking the cameras often record the facial expressions of his or her opponents, which are sometimes very communicative and may have a counter-persuasive role. Yet little literature has been devoted to the persuasive impact of facial expression and gaze in political discourse.

In this work I deal with the role of multimodal communication in the persuasive goals of political discourse.

 

Persuasion as a hierarchy of goals

I will first present a theoretical model of persuasion in terms of a goal and belief view of mind, social interaction and communication. Persuasion is an act aimed at social influence, with social influence defined as the fact that an Agent A causes an increase or decrease in the likelihood that another Agent B will pursue some goal GA. In order to make B more likely to pursue a goal GA, A must raise the value that GA may have for B, and does so by having B believe that pursuing GA is a means for B to achieve some other goal GB that B already has and considers valuable. This definition encompasses different kinds of social influence, ranging from education to threat, promise, the use of strength and so on. Persuasion is but one specific type of influence: one characterized by the fact that it is pursued through communication, and that it leaves B free either to pursue the goal GA proposed by A or not. Thus, to persuade B to adopt GA as a goal of his own, A must convince B, that is, induce B to believe with a high degree of certainty that GA is worth pursuing since it is a sub-goal of some goal GB that B has. In order to do so, A can make use of three different strategies, as already stated by Aristotle: logos (the logical arguments that support the link between GA and GB), ethos (the credibility and reliability of the Persuader A) and pathos (the extent to which A, while mentioning the pursuit of goal GA, can evoke the possibility for B to feel pleasant emotions or to avoid feeling unpleasant emotions).

In order to persuade other people we produce communicative acts that can exploit different modalities (words, intonation, gestures, gaze, facial expression, posture, body movements): we thus make multimodal persuasive discourses, that is, complex communicative plans for achieving communicative goals. Each discourse can be analysed as a hierarchy of goals in which each single communicative act, even a gesture or a gaze, makes its specific contribution to conveying a global persuasive meaning; and each bears on a logos, ethos or pathos strategy.

 

Gesture and gaze in political discourse

After presenting a model for the analysis of persuasive discourse in general, I will focus on the specific relevance of gesture and gaze for the persuasive strength of multimodal discourse.

Some fragments of persuasive political discourse will be presented, drawn from the electoral debates of Romano Prodi and Ségolène Royal before elections in Italy and France, and an analysis of their gesture and gaze will be proposed. An annotation scheme will be illustrated for the transcription and classification of gesture and gaze in political discourse, and a research study will be presented that takes into account the notions of expressivity and persuasiveness of gesture and gaze, and tries to classify and measure the persuasive strength of different speakers. Finally, the implications of this research for the construction of Persuasive Embodied Agents will be discussed.

References

[1] Poggi I.(2005): “The goals of persuasion”. Pragmatics and Cognition 13: 2,  pp.297-336.

[2] Poggi I.(2007): Mind, hands, face and body. A goal and belief view of multimodal communication. Berlin, Weidler 2007.

[3] Poggi I., & Roberto E. (2007): “The eyes and the eyelids. A compositional view about the meanings of Gaze”. In E.Ahlsén, P.J.Henrichsen, R.Hirsch, J.Nivre, A.Abelin, S.Stroemqvist, & S.Nicholson (Eds.), Communicaion – Action – Meaning. A festschrift to Jens Allwood. Department of Linguistics, Goteborg University, Goteborg, pp. 333-350.

[4] Poggi I. & Pelachaud C. (2008): Persuasive gestures and the expressivity of ECAs. In I.Wachsmuth, M.Lenzen, G.Knoblich (eds.): Embodied Communication in Humans and Machines. Oxford, Oxford University Press.

 

Regionalized Text-to-Speech Systems: Persona Design and Application Scenarios

Michael Pucher

Telecommunications Research Center Vienna (FTW) Tech Gate Vienna

Donau-City-Strasse 1, 3rd floor A-1220 Vienna Austria

pucher@FTW.at;

ABSTRACT:

This paper presents results on the selection of application scenarios and persona design for sociolect and dialect speech synthesis.  These results are derived from a listening experiment and a user study.  Most speech synthesis applications focus on major languages that are  spoken by many people. We think that the localization of speech  synthesis applications by using sociolects and dialects can be beneficial for the user since these language variants entail specific  personas and background knowledge.

 

Content in Embedded Sentences: An Overview

Uli Sauerland and Mathias Schenner

Centre for General Linguistics, Berlin, Germany

uli@alum.mit.edu, m.schenner@gmail.com;

ABSTRACT:

Embedding a sentence can separate its meaning into different components: When little Calvin complains "Mum thinks that the monsters under my bed are not real", his mother may not even agree with Calvin that there are any monsters under Calvin's bed. Non-linguistic content presents similar ambiguities: If Calvin tells us "Mum said 'Stop messing around [yawn] and go to bed'", he may use his mother's stern voice and simultaneously sound tired and yawn, indicating his own tiredness. What are the factors that decide whether embedded content is interpreted relative to the speaker or the subject of the embedding verb? In our talk, we give an overview of our present state of knowledge.

 

  

 

Realtime Multimodal Turntaking

Kristinn R. Thórisson
Center for Analysis and Design of Intelligent Agents (CADIA)
School of Computer Science, Reykjavík University, Iceland

 thorisson@gmail.com

ABSTRACT:
Getting computers to respond in a human-like manner in realtime dialogue presents numerous challenges. One of the larger -- but often overlooked -- issues is the highly dynamic nature of such interaction. This includes how conversants decide, in fact negotiate, how to "divide the work" of talking, using various kinds of multimodal behaviors for orchestrating it. Over the last decade research into multimodal communication has brought some progress on this issue, but significant questions remain with regard to the perception-action loop, in particular what kinds of perception, planning and prediction are involved. I will present some results from an effort to build holistic artificial dialogue systems with reaction times and capabilities similar to those of human-human dialogue. For a deeper understanding of dialogue, and indeed general cognitive issues related to communication, I argue that we need to approach the issue as the *design of a complex system* (a system whose behavior cannot be simply inferred from the behavior of its components). Looking at the problem holistically, we need to consider what kinds of architectures are likely to produce the kinds of behaviors exhibited by people in natural communication. I will describe the latest results from CADIA in architectural modeling of natural dynamic turntaking and dialogue, and present data from models using classical as well as hybrid neural information approaches.

Embodied cognition entails embodied communication

Wolfgang Tschacher

University Hospital of Psychiatry, University of Bern

Laupenstrasse 49, CH-3010 Bern, Switzerland

tschacher@spk.unibe.ch;

ABSTRACT:

Synchronization phenomena are abundant in complex systems, be they physical, mental or social. Synchrony can be viewed as a universal concept in nonlinear systems science. Such phenomena of pattern formation have been found on a variety of levels ranging from the cell level (e.g. neuronal synchrony in the human brain) to mass phenomena (e.g. la ola in a soccer stadium). Against this background, the concept of "embodiment" in mental and social systems means that the body is a crucial parameter of all cognition and social action. We strongly doubt that cognition or social interaction are well understood as information processing or information transfer alone. The embodied communication approach focuses on this critique of the information metaphor. For instance, cognitive and nonverbal synchrony in dyadic human systems highlights the influence of embodiment on social exchange. Systematic assessments of synchrony have been performed especially in recent psychotherapy research. In various large samples, therapy courses were investigated longitudinally. Significant increases of synchrony in therapy systems were found with the various methods used, on the basis of movement and of self-monitoring of cognition and emotion. This synchrony effect was shown not to be attributable to response stereotypy of session report assessments or other trivial explanations. Synchrony was predominantly linked with interactional variables and with measures of positive therapy outcome.

 

Representing Function and Behavior in Multimodal Communication

Hannes Högni Vilhjálmsson
Center for Analysis and Design of Intelligent Agents (CADIA)
School of Computer Science, Reykjavík University, Iceland

hannes@ru.is, http://www.ru.is/faculty/hannes

ABSTRACT:

I will talk about how communicative behavior can be represented at two levels of abstraction, namely the higher level of communicative intent or function, which does not make any claims about the surface form of the behavior, and the lower level of physical behavior description, which in essence instantiates intent as a particular multimodal realization. I will explain the usefulness of this distinction by using examples of several implemented systems that each draw different strengths from it: (1) As a way to help human animators create character animations with BEAT; (2) As a way to expand a narrow communication channel into full multimodal communication with Spark; (3) As a way to deal with culture-specific behaviors in complex game and training environments. Finally I will review how this fits into the proposed SAIBA framework for multimodal generation of communicative behavior, which is an international research platform that fosters the exchange of components between different systems currently in development. The SAIBA framework now contains a first draft of the Behavior Markup Language (BML) and is starting work on the Function Markup Language (FML).
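The two-level distinction described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names ("greet", "emphasize"), the context labels, the behavior tables and the `realize` helper are all hypothetical, and the code does not reproduce BML or FML syntax or any of the systems named in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    """Lower level: a physical behavior on one modality (hypothetical fields)."""
    modality: str
    action: str


# Hypothetical realization tables: the same communicative function can be
# instantiated by different surface behaviors depending on the context.
REALIZATIONS = {
    "default": {
        "greet": [Behavior("head", "nod"), Behavior("face", "smile")],
        "emphasize": [Behavior("hand", "beat gesture"), Behavior("brows", "raise")],
    },
    "formal": {
        "greet": [Behavior("torso", "bow")],
    },
}


def realize(function, context="default"):
    """Map a higher-level function to behaviors, falling back to the default table."""
    table = REALIZATIONS.get(context, {})
    if function in table:
        return list(table[function])
    return list(REALIZATIONS["default"].get(function, []))


for context in ("default", "formal"):
    print(context, [b.action for b in realize("greet", context)])
```

The point of the sketch is that callers only name the function; swapping the context table changes the surface form without touching the intent level.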

 

Facial Expression Recognition Implemented with Model-based Image Interpretation
Matthias Wimmer

Institute for Informatics 9, Technische Universitaet Muenchen, Boltzmannstrasse 3,

85748 Garching bei Muenchen, GERMANY,

 wimmerm@cs.tum.edu, www9.cs.tum.edu/people/wimmerm

ABSTRACT:

Computers are excellent devices for quickly solving mathematical problems and for storing enormous amounts of information. Nevertheless, the interaction between humans and computers is still unintuitive, because it is restricted to traditional input and output devices.
This thesis focuses on augmenting traditional systems with aspects of interpersonal communication in order to resolve these shortcomings.
It describes methods that robustly localize facial features, seamlessly track them through image sequences, and interpret facial expressions.

Model-based techniques have great potential to fulfill current and future demands on image interpretation. Unfortunately, remaining challenges in fundamental aspects, such as the initial model parameterization, still present major obstacles to making these systems usable in real-world scenarios.

Our contributions are twofold:
First, we show that fitting algorithms for face models benefit from well-defined color features that are able to distinguish between the different regions of a face, such as the skin, the lips, and the eyebrows. Since these parts only vary slightly in color, the decision criterion must be well chosen. The proposed approach adapts to the person and to the context first, and then quickly and robustly identifies the facial component of the current pixel via general purpose classifiers. This procedure maintains real-time performance and obtains high accuracy, which makes it appropriate for a variety of applications such as face model fitting, gaze estimation, and facial expression recognition.

Second, we focus on fitting models to images by considering the objective function as the most important component involved. These functions are usually determined heuristically in a time-consuming and error-prone procedure that requires much domain-dependent knowledge.
We investigate and explicitly formulate indispensable properties of ideal objective functions. Furthermore, we propose a methodology for learning objective functions from annotated example images while considering these properties.
As a result, the learned functions are approximately ideal as well. The benefits of this approach are that the crucial decision steps during function design are automated and the remaining manual steps require little or no computer vision expertise. This procedure lays the foundation for a general application of model-based image interpretation to real-world scenarios and therefore has potential for commercialization.

Exploring Sign Language
Bencie Woll
Deafness, Cognition and Language Research Centre
University College London,

b.woll@ucl.ac.uk, www.dcal.ucl.ac.uk

ABSTRACT:

DCAL's mission is to use research on sign language as a model for exploring language and cognition. This presentation will describe a number of strands of current DCAL research, all focusing on the relationship of modality to the structure of human language. We are concerned with the extent to which the perceptual and articulatory systems in which human language is instantiated influence the form of language. Four topics will be covered:

  1. Sign language grammar - how does sign language exploit its visual-spatial modality? Topics to be covered include multiple articulators and multiple channels; the use of spatial referencing for grammatical structures; the role of iconicity. This topic will be the basis for the other three sections;
  2. Neuroscience of sign language - current research using functional imaging studies to compare the neural processing of sign language and spoken language will be reviewed, with an emphasis on similarities and differences in localization;
  3. Atypical sign language - this section will be concerned with studies of signers with developmental or acquired atypicalities in sign language. Case studies will be presented of the effects of stroke on sign language, and of individuals with such developmental disorders as Down syndrome and Williams syndrome;
  4. Computer processing of sign language - attempts to model sign language will be reviewed with two emphases: the creation of models of human perception to structure computer recognition; and the creation of signing avatars.


 
ABSTRACTS OF THE SCHOOL CONTRIBUTIONS
 
Multi-modal speech processing methods: An overview and possible future directions using a MATLAB based Audio-visual Toolbox
A. Abel and A. Hussain
Centre for Cognitive & Computational Neuroscience
Department of Computing Science
University of Stirling, Stirling FK9 4LA
Email: aka@cs.stir.ac.uk , ahu@cs.stir.ac.uk
Abstract
With increasing evidence of a link between the various human communication production domains, such as speech, gestures, and facial features, interest in multimodal speech processing has grown significantly in the last decade. Many different specialised processing methods have been developed by active researchers to analyse and utilise the complex relationship between multimodal data streams. This paper will present an overview of the various multi-modal processing methods reported to date. In particular, a new MATLAB based Toolbox developed by Barbosa et al (2007) for processing audio-visual data will be reviewed and its performance potential evaluated. The operation of the tool will be described, covering the prerequisites, installation, functionality, and potential benefits offered. It is shown that the tool does not represent a complete and comprehensive speech processing solution, but rather serves as a standardised, yet versatile base to build upon with further research. To demonstrate this versatility, some preliminary examples that apply these computational procedures to various audiovisual corpora are presented. Finally, some future research directions in the area of multi-modal speech processing are outlined, including work that the authors aim to carry out with the aid of the newly developed audio-visual MATLAB toolbox: toolbox customisation, handling emotional speech, and processing noisy speech in real world environments.

 

Underdetermined blind source separation by linear separation system.

Cermak J., Smekal Z.

Academy of Sciences of the Czech Republic, Czech Republic

cermak4@kn.vutbr.cz

Abstract:

In automatic speech recognition and speech modal analysis, a good-quality input speech signal is often required. The hit rate of recognizers is lowered by degradation of speech quality due to noise. Speech separation aims to enhance the speech signal as part of the preprocessing. However, single-channel speech enhancement often changes important characteristics of the speech signal and is useful only for suppressing stationary or partially stationary noise. Multi-channel techniques, on the other hand, can handle even non-stationary noise by employing spatial filtering. This paper deals with multi-channel signal separation. We present a linear blind source separation method that can be applied even if the number of source signals is higher than the number of sensors. Furthermore, our method achieves high interference suppression while keeping the distortion that the separation system introduces into the separated signal low.

 

A Pseudo-phonetic Approach for Speech Characterization of Autistic Children

Chetouani Mohamed, Ringeval Fabien

Universite Pierre et Marie Curie, France

fabien.ringeval@isir.fr, chetouani@upmc.fr

Abstract:

The ability to perceive and express emotions through the expressivity of the face and the voice develops during the early stages of a child's life and plays an essential role in the development of intersubjectivity. Access to intersubjectivity, communication and language is severely impaired in the autistic syndrome. The purpose of our work is to extract prosodic characteristics of autistic children in order to characterize their speech disorders. We collected an audiovisual corpus of autistic children with the help of hospitals. The recordings were made during sessions dedicated to the evaluation of the Psychoeducational Profile-Revised (PEP-R) and were supervised by a psychologist. The corpus is then processed by automatic speech analysis methods. The extracted features are supra-segmental and attempt to model prosody by characterizing its main components, such as pitch, energy, and duration. While these features are usually computed over voiced segments, we propose to compute them from pseudo-phonetic segments such as vowels and consonants. We first carry out a pseudo-phonetic segmentation based on stationary segments. A vowel detector then makes it possible to identify prominent segments, since vowel properties are known to be highly correlated with prosody. In this paper, we propose an original modelling of these features, resulting in a statistical model of speech disorders for autistic children.
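As a rough illustration of computing prosodic features per pseudo-phonetic segment rather than per voiced region, the sketch below derives duration and mean energy for labelled segments. It assumes the segmentation and the vowel/consonant labels are already available; the segment format, the label names "V" and "C" and both helper functions are hypothetical, and pitch extraction is deliberately left out. It is not the authors' implementation.

```python
def segment_features(samples, sample_rate, segments):
    """Duration (s) and mean energy for each (start, end, label) segment.

    start and end are sample indices, end exclusive (assumed format).
    """
    features = []
    for start, end, label in segments:
        if end <= start:
            raise ValueError("segment must contain at least one sample")
        frame = samples[start:end]
        features.append({
            "label": label,
            "duration_s": (end - start) / sample_rate,
            "mean_energy": sum(s * s for s in frame) / len(frame),
        })
    return features


def summarize(features, label):
    """Average the features of all segments carrying the given label."""
    selected = [f for f in features if f["label"] == label]
    if not selected:
        return None
    n = len(selected)
    return {
        "count": n,
        "mean_duration_s": sum(f["duration_s"] for f in selected) / n,
        "mean_energy": sum(f["mean_energy"] for f in selected) / n,
    }


# Toy signal: a silent consonant, a loud vowel, a quieter consonant.
signal = [0.0] * 4 + [1.0] * 4 + [0.5] * 2
segments = [(0, 4, "C"), (4, 8, "V"), (8, 10, "C")]
features = segment_features(signal, 10, segments)
print(summarize(features, "V"))
print(summarize(features, "C"))
```

Per-label summaries of this kind are one simple way to obtain the statistics from which a speaker-level model could later be built.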

Temporal integration in computer face and movement recognition

Costen Nick

Manchester Metropolitan University, Manchester, UK,

n.costen@mmu.ac.uk

Abstract:

We describe an algorithm for automatically finding correspondences in face video sequences. Given a sequence of images, the face feature points are tracked by a model-constrained optic flow algorithm. By employing a Minimum Description Length (MDL) framework, the drift-off error caused by the optic flow algorithm can be reduced and the correspondences can be matched robustly by optimizing the statistical model. As a result, the face can be tracked precisely, which allows automatic construction of appearance models. A factorization Structure from Motion (SfM) framework can then use a combination of individual and generic shape models to help reconstruct the 3D face structure, allowing the 3D face shapes to be recovered optimally. Experimental results show that this algorithm accurately reconstructs the 3D shape of familiar and non-familiar faces from video sequences under circumstances of imperfect face tracking or noisy observations. Tracking human pose using observations from fewer than three cameras is a challenging task due to ambiguity in the available image evidence. We present a method for tracking that uses a pre-trained spatio-temporal model of activity to guide sampling within an Annealed Particle Filtering framework. The approach is an example of model-based analysis by synthesis and is capable of robust tracking from fewer than three cameras with reduced numbers of samples. The scheme is tested on a common dataset containing ground truth motion capture data and shows stable, low error scores for both monocular and 2-camera sequences.

 

Multimodal Signal Processing for Dance Performance Analysis

Yasemin Demir

Koç Üniversitesi, Rumeli Feneri Yolu, Istanbul, Turkey
E-mail: ydemir@ku.edu.tr

Abstract:

The performing arts, including music, dance and theater, are a rich source of multimodal content. We propose a framework to model the multimodal content of the performing arts. The framework includes analysis, annotation, personalization and synthesis of multimodal performing art content. Our main objective is to build a relational model between music and theatrical events, such as dance figures, in order to create music-driven personalized theatrical animations.

Exploiting Protensity of Musical Messages

Alexandru V. GEORGESCU1 and Alina E. LASCU2

1“Politehnica” Univ. of Timişoara, Faculty of Automation and Computers

2 “Lucian Blaga” University of Sibiu,  Faculty of Political Sciences, Sibiu, ROMANIA,

alexandrugeorgescu@gmail.com;  alina.lascu@gmail.com

Abstract:

The first part explains the title, setting the paper into the roadmap for the PhD thesis "Agent-Oriented e-Semiosis For Protensity-Based Messages. Applications in Musicology" as well as into the transdisciplinary framework of three other medium-range agent-oriented undertakings of the research team, from which both STSMs stem: A) Computer-Aided Semiosis (target: the semantic web; now, confined to trans-cultural multimodal interfaces); B) Emergence in agent-based systems (focusing now on self-organization via stigmergic control); C) Self-awareness of bodiless agents (target: the "Self-*"-memeplex; now, confined to Gödelian self-reference and agent self-cloning). The focus is on the common denominator of these converging areas: a) powerful temporal dimension of the agents involved (primeval "thick time" included); b) non-algorithmic right-brain tactics (mainly for interaction in open, dynamic and uncertain environments); c) e-Learning as test-bench (current subdomain: protension in teaching/learning music).

 

The second part presents the experimental model of a generic protensional agent in two development phases: virtual disc jockey (implementation illustrated in an accepted paper for ICCCC 2008); virtual guitar teacher (current stage based on updating a paper submitted to SIWN 2008). The focus is on open questions related to the topics of the Vietri School. The entities of both development phases share a characteristic that places them in the Protensity-related domain: they have to deal with messages coming from audio signals, which are protensive by nature and, unlike static image-based messages, need to be analysed while unwinding in time in order for their meaning to be understood and interpreted.

 

Eye detection from facial images

González Domínguez Verónica,

Escuela Universitaria Politécnica de Mataró, Spain, gondomve@eupmt.upc.edu

Face Localization in 2D Frontal Face Images Through Luminosity Profiles Analysis

Marco Grassi1, Marcos Faundez2

1DEIT Università Politecnica delle Marche, Ancona, Italy (margra75@hotmail.com)

2Escuela Universitaria Politécnica de Mataró, Barcelona, Spain (faundez@eupmt.es)

 

Abstract:

The first step of any face processing system is detecting the positions in images where faces are located. Face detection from a single image is a challenging task because of variability in scale, location, orientation and pose. Facial expression, occlusion, and lighting conditions also change the overall appearance of faces. In recent years a great research effort has been made in this field and many different methods have been proposed. The choice of the most appropriate one depends on the purpose of the application. In a real-time scenario, the processing burden in particular becomes an inescapable constraint.

In this work we propose a fast face localization method for 2D frontal face images, through eye detection, based on the analysis of the horizontal and vertical profiles of the image's average luminosity and the definition of rules describing the relations between these profiles and the positions of characteristic face elements. Experimental results on the AR face database show high rates of successful detection together with reduced computational times, which make this method particularly suitable for real-time applications.
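The profile idea can be sketched in a few lines. The following is a toy illustration, not the authors' method: the row-wise and column-wise mean profiles follow the description above, but the single rule used here (eyes lie on the darkest row of the upper half, one in each half of the image width) and all function names are assumptions made for the example.

```python
def luminosity_profiles(image):
    """Return (row_profile, col_profile): mean luminosity of every row and column."""
    height, width = len(image), len(image[0])
    row_profile = [sum(row) / width for row in image]
    col_profile = [sum(row[c] for row in image) / height for c in range(width)]
    return row_profile, col_profile


def darkest_index(profile, start, stop):
    """Index of the lowest profile value within [start, stop)."""
    return min(range(start, stop), key=lambda i: profile[i])


def locate_eyes(image):
    """Apply the toy rule (an assumption) to a grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    row_profile, _ = luminosity_profiles(image)
    eye_row = darkest_index(row_profile, 0, height // 2)
    # Column profile restricted to the upper half, where the eyes are assumed to be.
    _, col_profile = luminosity_profiles(image[: height // 2])
    left_col = darkest_index(col_profile, 0, width // 2)
    right_col = darkest_index(col_profile, width // 2, width)
    return eye_row, left_col, right_col


# 6x6 toy "face": two dark eye pixels in row 1, a darker mouth in row 4.
toy_face = [
    [200, 200, 200, 200, 200, 200],
    [200, 40, 200, 200, 40, 200],
    [200, 200, 200, 200, 200, 200],
    [200, 200, 200, 200, 200, 200],
    [200, 200, 90, 90, 200, 200],
    [200, 200, 200, 200, 200, 200],
]
print(locate_eyes(toy_face))
```

Because each profile is a single pass over the pixels, the cost grows linearly with image size, which is consistent with the real-time motivation given in the abstract.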

 

Statistical Modeling of Interpersonal Distance with Range Imaging Data

Renè Hempel (a)* and Patrick Westfeld (b)

(a) Faculty of Education, (b) Institute of Photogrammetry and Remote Sensing,

Technische Universität Dresden, D-01062 Dresden, Germany

renehempel@hotmail.de, patrick.westfeld@tu-dresden.de

 

Key words: Interpersonal Distance, Range Imaging Data, Image Sequence Analysis, Body Models, Time Series Models

Abstract:

Interpersonal distance is one dimension of involvement in social interaction. Many scientific papers in psychological and educational research (e.g. [1]) focus on this variable in addition to body orientation and mutual gaze. Quantitative data on interpersonal distance is usually obtained through questionnaires and/or ratings based on video recordings. In these cases, the evaluation is based on the interpretation of an operator. Besides the significant time required to process long videography sequences interactively, this procedure introduces unfavorable effects such as a lack of objectivity and a spatial and temporal generalization of the recorded behavioral data. Automatic image sequence processing methods can be used to improve both the efficiency and the objectivity of videography data processing. These methods have the potential to increase the spatial and temporal resolution while reducing observer inference.

In our research work, we use a novel 3-D camera (range imaging, RIM; Figure 1a) which simultaneously provides intensity and range images (Figure 1b) at up to 50 Hz. We adapt existing 2-D image analysis methods to range image sequence processing, and we develop new approaches for RIM sequence analysis  which allow 3-D tracking of body points over time (Figure 1c) [2] [3].

 

The results of RIM sequence processing allow the construction of an abstract body model consisting of a set of surface vectors for each object by gathering n specific points on the surface of each object, which are usually vectors in R^3. This model can be formulated as an element of Mat_{3,n}(R), the space of all (n×3) matrices. Since Mat_{3,n}(R) is isomorphic to R^{3×n}, we are free to use all well-known metrics d to capture the interpersonal distances for the given persons i, j ∈ I at time t ∈ T. Using the realizations of the time-varying metrics d_t, statistical methods for time series can be applied to model and predict d_t.
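A minimal sketch of this pipeline, under stated assumptions: each body model is a list of n tracked (x, y, z) points, the metric d is taken to be the Frobenius distance between the two point matrices (one of many admissible choices), and the prediction step is a plain moving average standing in for a proper time series model. The function names and the toy tracks are hypothetical.

```python
import math


def frobenius_distance(body_a, body_b):
    """Frobenius distance between two body models, each a list of n (x, y, z) points."""
    if len(body_a) != len(body_b):
        raise ValueError("both body models need the same number of points")
    return math.sqrt(sum(
        (a - b) ** 2
        for point_a, point_b in zip(body_a, body_b)
        for a, b in zip(point_a, point_b)
    ))


def distance_series(track_i, track_j):
    """Realizations d_t for two persons whose body models are given frame by frame."""
    return [frobenius_distance(a, b) for a, b in zip(track_i, track_j)]


def moving_average_forecast(series, window=3):
    """Naive one-step prediction of d_t: mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)


# Two frames, two tracked points per person; person j steps closer in frame 2.
track_i = [[(0, 0, 0), (0, 0, 1)], [(0, 0, 0), (0, 0, 1)]]
track_j = [[(3, 4, 0), (3, 4, 1)], [(0, 4, 0), (0, 4, 1)]]
d = distance_series(track_i, track_j)
print(d, moving_average_forecast(d, window=2))
```

Note that the Frobenius distance grows with the number of tracked points n, so series obtained with different n are not directly comparable unless they are normalized.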

References

[1] Alisch, L.M. (1998). Children’s friendships - A Random Dynamical System Approach. Paper presented at the AERA annual meeting, San Diego, April 13-17, 1998.

[2] Westfeld, P. (2007). Development of Approaches for 3-D Human Motion Behaviour Analysis Based on Range Imaging Data. Optical 3-D Measurement Techniques, Vol. VIII, II, pp 393-402.

[3] Westfeld, P. & Hempel, R. (2008). Range Image Sequence Analysis by 2.5-D Least Squares Tracking with Variance Component Estimation and Robust Variance Covariance Matrix Estimation. Paper accepted for oral presentation at the XXIth Congress of the ISPRS in Beijing, 3-11 July 2008.

 

Presence Agents: Physical and Virtual Agents with Interaction Behavior that Supports Being There

Heylen Dirk

University of Twente, The Netherlands,

heylen@cs.utwente.nl

 

From Sounds and Gestures to Dialogue Acts. Selected Problems of a Multimodal Corpus Labelling

Maciej Karpiński

 (a) Institute of Linguistics (b) Center for Speech and Language Processing

Adam Mickiewicz University. Poznań, Poland

kaprinski@amu.edu.pl

 

Abstract:

The aim of this presentation is to discuss selected theoretical and practical problems which occur in the labelling and analysis of a multimodal corpus of task-oriented dialogues. While the focus is on the tiers of prosody and gestures, some attention is also paid to the level of intentions. The problems addressed here are related mostly to the segmentation of speech and movement, the categorization of the obtained units and to the analysis of their internal structure and possible relations.

Both analogies and differences can be found between the various modalities of communication. The interactions and relations between these modalities are noticeable and complex. Most of the phenomena under study are of a continuous and dynamic nature. Each of the modalities may be analysed and labelled on a number of conceptual levels, starting from the "physical" description of measurable parameters and ending with an abstract representation on the level of intentions. The level of intentions may be partially expressed in terms of "dialogue acts". It often provides a relatively coherent image of dialogue structure, integrating relevant information from various communication modalities.

 

Study on speech parameterization for emotion recognition

Theodoros Kostoulas, Todor Ganchev, Nikos Fakotakis

Wire Communications Laboratory, Department of Electrical and Computer Engineering,

University of Patras, 26500 Rion-Patras, Greece

{tkost, tganchev, fakotaki}@wcl.ee.upatras.gr

Abstract:

The increasing use of commercial applications based on spoken interaction and the recent emergence of multimodal dialogue systems emphasize the necessity of effective and user-friendly human-machine interaction (HMI). One research direction that will undoubtedly contribute significantly to bridging the gap in HMI is making machines sensitive to the human aspects of behavior. For instance, awareness of the emotional state of the user can provide feedback to the dialogue flow manager and set the basis for successful interaction experiences.

In the present work we study prosodic and spectral features for emotion recognition from speech. Experiments are performed on both natural/spontaneous speech [1] and acted speech [2]. Correlations between the emotion categories and the features are investigated.

References

[1] Kostoulas, T., Ganchev, T., Mporas, I., Fakotakis, N.: A real-world emotional speech corpus for modern Greek. In: Proc. of LREC 2008, Morocco, May 2008.

 

[2] University of Pennsylvania, Linguistic Data Consortium, “Emotional Prosody Speech,” www.ldc.uppen.edu/Catalog/CatalogEntry.jsp?cataloId=LDC2002S28

 

Multimodality Issues in Conversation Analysis of Greek TV Interviews

KOUTSOMBOGERA Maria, PAPAGEORGIOU Harris

Institute for Language and Speech Processing, Greece,

{mkouts, xaris}@ilsp.gr

Abstract:

In this paper we present cross-disciplinary research on the communicative role of multimodal expressions in TV face-to-face interviews occurring in various settings that hold a mixture of characteristics oscillating between institutional discourse, semi-institutional discourse and casual conversation. Specifically, we examine the types of facial displays and gestures and their respective communicative functions in terms of feedback and turn management, in an attempt to develop a deeper analytical understanding of the mechanisms underlying the multimodal aspects of human interaction in the context of media communication.

Taking into account previous work on the analysis of non-verbal interaction, we present our methodology (corpus description, tools, coding scheme, annotation process), we discuss the distribution of the features of interest and we investigate the effect of the situational and conversational setting of each interview on the interactional behavior of the participants.

Our motivation is to interpret the multimodal signs and describe their interrelations, as well as to find evidence about their potential systematic role and examine possible patterns in the appearance and co-occurrence of certain features. In this way, we make a first step towards the description and annotation of a multimodal corpus of Greek TV interviews, available for further development and exploitation.

 

Social Phenomena Related to Cooperation in Prisoner's Dilemma

Emilian Lalev

Central and Eastern European Center for Cognitive Science

New Bulgarian University, 1618 Sofia, Bulgaria

elalev@cogs.nbu.bg

Abstract:

The Prisoner’s Dilemma game is intriguing because it contains an analogue of the problem of cooperation in everyday life. Despite having simple rules and structure, the game, especially in its iterated form, gives grounds for investigating even more complex effects of social interaction, such as trust, reputation, coordination, and the prediction and anticipation of the behavior of others. Modeling the processes of human decision making in the Iterated Prisoner’s Dilemma as the source of these phenomena is one approach to understanding how human societies work.
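For readers unfamiliar with the iterated game, a minimal Python sketch may help. The payoff values (5, 3, 1, 0) and the two strategies shown are common textbook choices assumed here for illustration; they are not taken from the abstract and do not represent the author's cognitive model.

```python
# Assumed textbook payoffs: (payoff to player A, payoff to player B).
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # A cooperates, B defects
    ("D", "C"): (5, 0),  # A defects, B cooperates
    ("D", "D"): (1, 1),  # mutual defection
}

def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the opponent's previous move."""
    return other_history[-1] if other_history else "C"

def always_defect(own_history, other_history):
    """Defect in every round regardless of history."""
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Play the iterated game and return the total scores of both players."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b
```

Because each strategy sees the full history of both players, effects such as reputation or anticipation can be explored by swapping in other strategy functions.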

 

From Extensity to Protensity in CAS: Adding Sounds to Icons

Alina E. Lascu1, Alexandru V. Georgescu2

1Lucian Blaga University of Sibiu, 2Politehnica University of Timişoara, Romania

alina.lascu@gmail.com1, alexandrugeorgescu@gmail.com2,

Abstract:

Being aware of the gap between technological offers and user expectations, the paper aims to illustrate the necessity of anthropocentric (“user-pulled”) designs and to reveal the dangers of current (“technology-pushed”) ICT designs. Since the gap is deepened by insufficiently innovative use of the potential of new agent-oriented technology, an affordable manner to “invent new Computer-Aided x” application domains is proposed. To substantiate the approach, the domain must be immediately useful, challenging, easy to implement and “as humanist as possible”: Computer-Aided Semiosis (CAS). Against this background, the paper also presents a new and challenging concept in applied IT research, borrowed from psychology and meant to assist the process of semiosis: protensity. Whereas prior research focused on images and thus implicitly on extensity (extensity being associated with image-based messages, protensity with sound-based messages), the idea is to extend CAS through innovative attributes, in line with music-oriented user expectations. Thus, the paper introduces a newcomer in agent-oriented technology: Protensional Agents (i.e. interface agents represented on the screen as pseudo-avatars). The paper concludes that CAS is a pathfinder for further research in this field and that the concept of CAS could be used for immediate applied research. Moreover, the Smart DJ is a good example of the applicability of CAS in the field of protensity-based messages.

 

Key-Words: Computer-Aided Semiosis (CAS); Anthropocentric trans-cultural interface (ATCI); Protensity; Agent-oriented software engineering (AOSE); Virtual Disc Jockey (VDJ).

 

Home movies segmentation for face-to-face interaction analysis

Ammar Mahdhaoui, Mohamed Chetouani

Université Pierre et Marie Curie-Paris 6, CNRS FRE 2507

Institut des Systèmes Intelligents et de Robotique

3 rue Galilée, 94200 IVRY SUR SEINE

Ammar.Mahdhaoui@robot.jussieu.fr

Mohamed.Chetouani@upmc.fr

Abstract:

In recent years, a new task has emerged in speech processing: the rich transcription of an audio/video document. An important piece of metadata for rich transcription is the speaker information, which tells us "Who spoke when?" for a given audio document. However, this information can be enhanced by introducing other information related to the interaction between the speakers. In this work, we propose to study the interaction between infants and parents through the analysis of home movies. The main advantage of this data is that all the interactions are spontaneous. Our purpose is to study the strategy used by the parents for starting an interaction. The results presented in this paper will show that motherese (infant-directed speech) plays a major role. From a technical point of view, speaker diarization techniques are first developed and tested on the ESTER data (Evaluation of Rich Transcription of French Broadcast News). Secondly, we adapted and tested the models for home movies. Motherese segmentation is also carried out and fused with the speaker diarization model, resulting in an improvement of the analysis of the interaction.

 

 

Situated Non-verbal expressivity synthesis

Malatesta Lori

National Technical University of Athens, Greece,

lori@image.ntua.gr

Abstract:

Appraisal theories in psychology investigate the emotion elicitation process through the cognitive appraisal of stimuli. They offer predictions regarding resulting facial expressions as well as action tendencies based on the way an event is appraised. We used these predictions to achieve non-verbal expressivity synthesis of virtual characters. With the aid of state-of-the-art game engine technology we have created a virtual environment that can cater for dynamic interactions with human users. The approach is “situated” since it deals with the issue of agent expressivity in specific contexts of non-verbal human-to-agent interaction. The underlying psychological theory is a collapsed version of the OCC model proposed by Ortony in [1,2]. The simplification collapses the original 22 emotion types down to five distinct positive and five distinct negative reactions by taking into consideration the emotional states that make sense for a virtual character. The idea is to start simple, making the agent able to differentiate her expressions between positive and negative, and then progressively develop more elaborate categories. An agent could have an identical positive expression in a situation where she is happy about obtaining a desired object or in a situation where she is happy because she feels proud of having attained some goal. In such a coarse approach the expressivity would not change, only the context. This is a key concept in the collapsed OCC model. Context not only matters, it also helps distinguish the way similar expressions are perceived in different contexts.

 

References

[1] Ortony A. (2001), On making believable emotional agents believable, in Emotions in Humans and Artifacts, Cambridge, MA: MIT Press, 189–211.

[2] Ortony A., Collins A., and Clore G.L. (1988), The Cognitive Structure of Emotions, Cambridge University Press.

 

Multimodal human - machine interface modeling

Maskeliunas Rytis

Kaunas University of Technology, Lithuania, rytmask@kyu.lt

 

Abstract:

This paper deals with the modeling of a multimodal human-machine interface. Multimodal access to the information retrieval system (computer) is possible by combining three different approaches: data input / retrieval by voice (speech recognition / text-to-speech synthesis); traditional data input / retrieval systems (mouse, keyboard, computer display, etc.); and confirmation / rejection by recognizing and displaying human face gestures and emotions.

A prototype multimodal-access web application is presented that combines traditional HTML, Speech Application Language Tags (SALT) and human face gesture video technologies. For Lithuanian speech recognition, the widely available standard Windows English recognizer has been used with English transcriptions of Lithuanian words. The results of experiments on transcriptions in various scenarios are presented.

 

 

The Student’s Barriers to Communicate

Mihailova Katia

University of National and World Economy, Bulgaria,

katiajivkova@yahoo.com

Abstract:

It is well known that students need a set of communication skills, first to graduate from their university programs and then to pursue successful professional careers. It often happens that they cannot raise the level of their communicative intelligence while they are at university. The reasons can be of two kinds: (1) the teaching methodology and the organization of the university and (2) the students’ personal traits in communicating and developing communication skills.

 

Following the second reason, qualitative empirical research has been started. Its aim is to discover and classify the students’ communication traits in order to propose a teaching program and methodology that can help students improve their communication skills. As the research is not yet finished, this contribution presents its intermediate results.

 

The participants in the research are 150 students in their third year at the university. All of them are enrolled in bachelor programs outside the field of social communication, but their future professional profile requires strong communication skills and even experience. The research method is a combination of projective situation and group discussion.

 

Bearing in mind the lecturers and the contents of the school, I believe the discussions will add some new viewpoints on the problem of students’ barriers to communication as well as some recommendations for the conclusion of the research.

 

Comparison of grapheme and phoneme based acoustic modelling in LVCSR task in Slovak

Michal Mirilovič, Jozef Juhár, Anton Čižmár

Technical University of Košice, Slovakia

michal.mirilovic@tuke.sk

Abstract:

Phonemes and allophones serve as the basic speech units for training HMM-based acoustic models in most of today’s speech recognizers. Grapheme-based acoustic sub-word units have been applied to multilingual and crosslingual acoustic modeling in many tasks. Grapheme- and phoneme-based mono-, cross- and bilingual speech recognition of Czech and Slovak has been studied in our previous work. In this contribution we study acoustic modeling and model unit selection by comparing grapheme- and phoneme-based approaches in a large vocabulary continuous speech recognition (LVCSR) task in Slovak. The major goal of this work is to study the selection of an optimal set of units for the acoustic modeling of words in an LVCSR system. A first comparison of phoneme- and grapheme-based acoustic models in an LVCSR system was made in our present work and gave interesting results. Grapheme-based acoustic models reached better results in tests with higher-order n-gram (bigram and trigram) stochastic language models. These results motivated our further research in this area. The optimization procedure for unit selection described in this paper is guided by pronunciation effects arising in the Slovak language (assimilation, palatalization, ...) and by their influence on the word error rate and other errors of the LVCSR system.

Glottal Characteristics of Emotion Based on Electroglottogram Measures

Murphy Peter

University of Limerick, Ireland,

peter.murphy@ul.ie

 

Using a signing avatar as a sign language research tool

Nelson Delroy

DCAL, University College London, England,

delroy.nelson@ucl.ac.uk

Abstract:

This presentation is in two parts: firstly, an introduction to the HamNoSys/ViSiCAST avatar; secondly, a description of a research project which is seeking to develop a research toolkit, using the properties of the avatar to create material in British Sign Language for research purposes.

HamNoSys (Hamburg Notation System) is a computer-based system for notating signs. It can be combined with ViSiCAST (Signing for the Deaf using Virtual Humans), a computer system that generates sign language strings which are then animated by the avatar. The avatar is driven by SiGML (Signing Gesture Mark-up Language), which is an XML (Extensible Markup Language) encoding for HamNoSys. The system has been used to date to generate educational material for deaf children and for other applications, such as sign language on the web (the e-Sign project).

The Research Toolkit project has explored adaptations to the system to enable it to be used by linguists and psycholinguists to create test material for research purposes. A standard research tool to measure language fluency is to ask subjects to make judgements about the acceptability of linguistic constructions. This poses a challenge for sign language researchers since it is difficult to get humans to produce ungrammatical material in a natural way. The Research Toolkit project aims to create four tools for research: 1) fingerspelling of non-words (for example, ‘blick’ or ‘kclba’); 2) signs with incorrect face actions (for example, the manual component of SLIM combined with the face action accompanying FAT); 3) signs with semantic mismatches (for example, SUN SQUARE); 4) sentences with grammatical errors (for example, questions such as LONDON LEAVE WHEN (when are you leaving London) with the signs in incorrect order).

The demonstration will indicate new approaches to assessments for current and future research.

 

A distributional concept for modeling dialectal variation in TTS

Neubarth Friedrich, Kranzler Christian

OFAI, Vienna, Austria,

friedrich.neubarth@ofai.at, kranzler@ftw.at

Abstract:

In this presentation we discuss the challenge of modelling language variants and dialects on the basis of a common language resource. It is obvious that differences between variants affect all levels relevant for processing, from syntactic constructions, lexical differences and different symbolic encodings of sounds up to phonetic nuances. The aim is not only to identify the differences themselves but also to assign them to the correct level of representation. The overall goal is to find optimal procedures for these tasks in order to minimise the effort required for annotation and processing. One central question is whether it makes sense to apply a more accurate phonetic transcription or to rely on slightly incorrect transcriptions, which facilitate comparison with the standard. In the light of unit-selection or HMM-based speech synthesis it is arguable that the second option has clear merits. However, the level of distortion up to which such a direct propagation of coding is applicable has to be defined very carefully.

Two studies are presented. In the first, the German standard of German is contrasted with the Austrian standard. It turns out that most differences can be covered by the implicit context. In the second, several varieties of Viennese dialects/sociolects are modelled by (re-)using sources from the standard language. Here the differences are much more fundamental and special methods have to be implemented to deal with them. An additional complication is that some of the differences are gradual, ranging from the standard towards the dialectal variant.

 

 

The Computability of Human Interaction

Rieks op den Akker

University of Twente; Enschede; the Netherlands

infrieks@cs.utwente.nl

Abstract:

Human behavior cannot be completely understood as the execution of a preconceived program, a set of conditional rules whose application depends on the classification of observable events according to a number of preformatted classification schemes. These are the categories of the designer, who has a complete specification of the motives that drive agents, the goals they have and the means they can use to realize these goals. On the contrary, the goal of a human activity is the realization of the person’s ‘self’. How the actions become realized depends on the actions of other agents. Interaction is emergent: it is not the result of planned actions but something that simply shows up as a result of many synchronous activities as well as of what has been established before. Thus there is a tension between the perspective of the designer of synthetic humanoid interactive characters and the creative emergent behavior in which humans realize themselves through interaction as social beings.

 

In this talk we will see how the conceptions of human conversational acts in modern theories of conversation (Austin, Searle, Clark) are products of a technical view in which actions are seen as realizations of conceptually pre-designed and complete acts, and what problems this view raises in understanding creative and spontaneous human interaction in open dialog.

I hope to initiate a discussion about the consequences of the above mentioned tension for the projects in which we design and build systems for human machine interaction.

 

A software system demonstration

Pedica Claudio

Università di Camerino, Italy, claudio.pedica@studenti.unicam.it

 

Get a Grip on the Phone(me): The Application of Phonetic Segmentation in Multimodal Emotional Data

Hannes Pirker

Austrian Research Institute for Artificial Intelligence (OFAI), Vienna

hannes.pirker@ofai.at

Abstract:

In this contribution the annotation and exploitation of data from the Geneva Multimodal Emotion Portrayals (GEMEP) corpus are presented. This corpus provides video and audio recordings of highly uniform and controlled content. More specifically, it contains samples from 10 actors portraying 18 different emotions with different degrees of intensity, resulting in a set of 3815 recordings. At the same time the lexical content of these recordings is restricted to only two different pseudo-linguistic sentences. This highly uniform segmental content offers a promising basis for further systematic studies, especially on the acoustic properties of emotional speech as well as on the temporal relationship between speech, gestures and facial expressions.

First, results and insights from the application of forced alignment to the phonetic segmentation of this corpus are presented, giving an overview of the peculiarities of applying standard speech segmentation techniques to the kind of non-standard speech found in the emotional portrayals.

As a first application, the information on the exact location of individual speech sounds made accessible by the phonetic segmentation was used for experiments with the automatic recognition of emotions in speech, contrasting phoneme specific MFCC-based Hidden-Markov-Models with standard sentence-level models.

 

Spectrum Modification for Emotional Speech Synthesis

Přibilová Anna

Slovak University of Technology in Bratislava, Slovakia

anna.pribilova@stuba.sk

 

Modelling of lips for the main vowels of the Lithuanian language

Punys Jonas

Kaunas University of Technology, Lithuania,

jonas.punys@ktu.lt

Abstract:

Lip tracing is a difficult problem due to the poor contrast between the lips and the surrounding skin area and the highly non-rigid nature of the lip shape. Image processing techniques have been applied to develop lip contour models which could be used for tracking moving lips. The shape of the areas has been constructed by morphological operations. Structuring elements of different sizes and shapes have been used in the erosion and dilation operations to ensure detection of the lip contour. The geometrical parameters of the estimated tissue areas provide information on the lip contour. The dynamic characteristics of the model have been examined. The model was developed for the Lithuanian vowels.
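To make the erosion and dilation operations concrete, here is a minimal pure-Python sketch on a binary image. The image representation (lists of 0/1 rows), the cross-shaped structuring element and the contour rule (region minus its erosion) are assumptions chosen for illustration, not the structuring elements or the pipeline used by the author.

```python
def _pixel(image, y, x):
    """Return the pixel value, treating everything outside the image as background."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return 0

def erode(image, offsets):
    """Binary erosion: keep a pixel only if all pixels under the element are set."""
    return [[int(all(_pixel(image, y + dy, x + dx) for dy, dx in offsets))
             for x in range(len(image[0]))] for y in range(len(image))]

def dilate(image, offsets):
    """Binary dilation: set a pixel if any pixel under the element is set.
    Written for symmetric structuring elements (no reflection is applied)."""
    return [[int(any(_pixel(image, y + dy, x + dx) for dy, dx in offsets))
             for x in range(len(image[0]))] for y in range(len(image))]

def inner_contour(image, offsets):
    """Pixels of the region that disappear under erosion, i.e. its boundary."""
    eroded = erode(image, offsets)
    return [[image[y][x] - eroded[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]

# Assumed cross-shaped structuring element, given as (dy, dx) offsets.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

# Hypothetical 5x5 binary image with a 3x3 foreground block.
region = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 0, 0, 0]]
contour = inner_contour(region, CROSS)
```

Changing the offsets in the structuring element changes which boundary pixels survive, which is why its size and shape matter for contour detection.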

Real video sequences of moving lips have been recorded and synchronized with spoken vowels. A multimodal database has been constructed. It consists of image sequences recorded at a rate of 25 fr/s and the corresponding synchronised speech signals (vowels). The experiment has been carried out with different recording parameters and the same quality of images and voice records. On the basis of the experimental results, the file format “Windows Media Video” has been chosen for forming the multimodal database (synchronising the recordings of face motion and voice). The quality of images and voice is similar for the file formats “PAL DVD” and “MS AVI-OpenDML”. The results of the COST 2102 action on the multimodal database could be applied to the standardization of experimental data.

 

Annotation Reliability and Machine Learning Performance

Dennis Reidsma

Human Media Interaction, University of Twente, The Netherlands

dennisr@ewi.utwente.nl

Abstract:

Many projects make extensive use of hand-annotated corpora of recorded human-human interactions. These corpora are used to provide training and test data for machine learning classifiers as well as for gathering information about interaction patterns that one can build into interactive virtual humans. Manual annotation is a difficult task. With more subjective phenomena it can happen that different annotators see different things in the same recordings, which makes the resulting data less useful for the purposes mentioned.

 

A common practice is to assess the quality of hand-annotated data using an agreement metric such as alpha or kappa. As soon as the value of the metric exceeds a certain threshold the data is deemed to be of sufficient quality. Reidsma and Carletta (submitted 2008) show that this is not enough: thresholding some agreement metric cannot be used as an indication of data quality, especially when the data is to be used for machine learning. It is important to figure out what ‘shape’ the disagreement takes, as well as to use this information in assessing machine classifier performance results.
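As a hedged illustration of the kind of metric discussed here, the following Python sketch computes Cohen's kappa for two annotators together with a table of label pairs that exposes the ‘shape’ of the disagreement. The choice of Cohen's kappa (rather than alpha or a multi-annotator variant), the function names and the example labels are assumptions made for this sketch, not data or code from the AMI corpus study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

def label_pairs(labels_a, labels_b):
    """Count every (annotator A label, annotator B label) combination."""
    return Counter(zip(labels_a, labels_b))

# Hypothetical addressee labels: G = group, I = individual.
annotator_a = ["G", "G", "G", "G", "I", "I", "I", "I"]
annotator_b = ["G", "G", "G", "I", "I", "I", "I", "G"]
kappa = cohen_kappa(annotator_a, annotator_b)
pairs = label_pairs(annotator_a, annotator_b)
```

A single kappa value hides whether the disagreements are spread evenly or concentrated in particular label pairs; the pair counts make that visible.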

In this presentation we look at the addressee annotations in the AMI corpus[1], containing for every dialog act the intended recipient of the utterance, as judged by the annotator. First we discuss our analysis of the quality of the annotations, also looking at class map versions of the data and doing a contextual agreement analysis. It turns out that there are clear multimodal contextual dependencies with an influence on the agreement scores between annotators. Next, we use the outcome of this analysis of the quality of the annotations to better understand the performance of the machine learning classifiers on this particular data.

Qualitative and quantitative crying analysis of new born babies delivered after high risk gestation

Reyes-Garcia Carlos

INAOE, Computer Science, Mexico,

kargaxxi@yahoo.com

Abstract

High risk newborns present high irritability and poor physiological stability. The qualitative description of crying is complementary to quantitative analysis. The aim of this study is the qualitative and quantitative description, based on spectral analysis, of crying in infants considered to be at high perinatal risk. Cries of 30 infants with birth alterations (asphyxia, lung immaturity and inter-ventricular hemorrhage) were recorded. Three neonatal groups were formed (2, 3 and 4 months old) for audio-recording sessions during clinical exploration. The F0 average and standard deviation were 452.28 ± 73.47 Hz, 466.90 ± 95.85 Hz and 427.5233 ± 49.48 Hz for 2, 3 and 4 months old respectively. As age increases, expiration capacity increases too, which is reflected in the duration of the cry. Vibrato and glottic functions were of considerably high frequency. The non-parametric chi-square test was carried out to identify differences between age in months and frequency; at p = 0.05 there was no statistically significant difference. Infant cry was described both quantitatively and qualitatively, a description not previously reported in the literature. The results indicate that two-month-old infants do not have phonatory control, probably due to neurological immaturity. The qualitative characteristics of the analyzed infant cry are similar to those in other research reports; these qualitative variables diminished as the infants matured. Crying analysis could be considered an early diagnostic tool because it reflects the neurophysiologic state of the baby.

 

Keywords: Infant cry; High risk analysis characteristics; Qualitative analysis; Quantitative analysis.

Cultural Dialects of Real and Synthetic Facial Expressions

Ruttkay Zsófia

University of Twente, The Netherlands,

zsofi@cs.utwente.nl

 

Differentiating Communicative Signals using Context

Ter Maat Mark

University of Twente, The Netherlands,

maatm@ewi.utwente.nl

Abstract:

A lot of research in the field of Embodied Conversational Agents deals with the generation of signals and especially with the mapping from conversational functions to signals. Such a mapping is used by an agent to select how a certain function can be displayed to the user. However, interpreting the conversational functions of detected signals is not that easy, since a lot of signals belong to multiple functions. In this short article this problem is explained and a number of ambiguous signals are presented. With these signals it is made clear that they can be differentiated by using the context of the dialogue. The exact description of this context is not clear yet, but a start is made by analyzing some examples to get a global view of what is important in the context.

 

Optimization of the Automatic Emotion Recognition of Speech in the Case of Some Languages

Klara Vicsi, Szabolcs Levente Tóth, David Sztahó

Laboratory of Speech Acoustics, Budapest University of Technology and Economics,

Department of Telecommunications and Media Informatics, Hungary

{toth.sz, vicsi, sztaho.david}@tmit.bme.hu

 

Abstract. In this lecture, we would like to present an emotion recognition research method which is simple enough to be adapted to more COST action languages. We would like to invite our colleagues in the COST action to contribute databases or to join our multilingual emotion recognition research.

Optimization of automatic speech emotion recognition has been carried out in two separate ways. First, we looked for the most important acoustical features which characterize the different emotions. Not only different F0, energy and spectral parameters were examined, but also their optimal time resolution.
Secondly, we examined which training method gives the best recognition performance in the case of a small database (with 10 to 40 speakers). The cross-evaluation technique, commonly used for evaluating statistical recognizers with a small training database, improved the reliability of the recognition results presented in our former publication.

Three languages from two different language families were used and compared: Hungarian from the Finno-Ugric family, and Slovak and Czech from the Slavic family. In order to eliminate the effect of verbal influence, only a few sentence types were recorded, each with 5 emotions. The emotion recognition training and testing were carried out separately, sentence by sentence.

Keywords: Emotion recognition, Multilingual, Automatic speech recognition, Human speech perception, speech technology, Hidden Markov Models.

 

The Impact of Gesture and Gaze in the Persuasive Political Discourse

Vincze Laura

University of Pisa, Romania/Italy

vincze@ling.unipi.it

Abstract:

The role of gesture in persuasion is examined. What I hope to prove in this work is the tremendous importance gestures have in political discourse and how we can make use of gaze and gesture to express positive or negative evaluations of the interlocutor.

According to Poggi (2005), in persuasive discourse locutor A has the goal of convincing interlocutor B that A’s goal is the best possible option, and not just one option among many others. Therefore, to persuade is to convince others of the importance of our own goals.

In order to do so, politicians use many persuasion strategies. Together with the persuasion techniques of logos, ethos and pathos (Aristotle), they often employ gestures and gaze in a persuasive way. A resolute gaze may be used, for example, when one wants to stress that he is a person people can count on, who keeps his promises. Conversely, the speaker may employ a discontented facial expression and an accusatory gaze while talking about his opponent, implying in this way that the other candidate is not worthy of the audience’s trust.

Both gesture and facial expression may have a persuasive impact not only when accompanying speech, but also while one is listening to the opponent’s speech and cannot intervene.

This is the case with symbolic gestures such as the “tulip hand”. The audience disambiguates the “tulip hand” gesture, which has more than one interpretation, and sees that in that particular case the gesture expresses criticism. An ambiguous gesture can be disambiguated by looking at the context, that is, at non-manual components such as facial expression (Poggi 2007).

While a politician is talking, the cameras often record the facial expressions of his opponents, which are sometimes very communicative and may have a counter-persuasive role. An ironic smile, for example, employed while the opponent is speaking, may have a significant impact on the audience. Smiling ironically makes fun of the interlocutor, rendering him ridiculous and harmless in front of you.

Conclusions

These are only some of the persuasive gazes and gestures used in political discourse which, if ingeniously employed alongside a good political program, guarantee victory in the political campaign. Subsequent research will involve analysing items of gaze extracted from a candidate’s discourse and assessing whether the meaning of these items is actually persuasive for the audience.

 

Main References:

ARISTOTLE. 1973. Retorica. Bari: Laterza.

POGGI I. 2005. The goals of persuasion. Pragmatics and Cognition. John Benjamins Publishing Company, 297-336.

POGGI I. 2007. Mind, Hands, Face and Body. Berlin: Weidler Buchverlag.

 

Recognition of Emotions in Czech Speech Using Gaussian Mixture Model

Martin Vondra, Robert Vích

Academy of Sciences CR, Czech Republic

(vondra, vich)@ufe.cz

Abstract:

The contribution describes experiments on recognizing emotions in Czech speech signals, based on the same principle as speaker recognition. The most robust algorithm for speaker recognition is based on Gaussian Mixture Models (GMM). We examine several types of speech parameters, namely Mel-Frequency Cepstral Coefficients (MFCC), 1st order delta MFCC and 2nd order delta MFCC, and several GMM model orders. We further try adding the fundamental frequency to the speech parameters, because it plays an important role in emotional speech. The aim of this contribution is a recommendation for the GMM model order and the optimal selection of speech parameters for emotion recognition in Czech speech.
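
The principle borrowed from speaker recognition can be sketched as follows: one GMM per emotion scores the feature frames of an utterance, and the emotion whose model gives the highest average log-likelihood is chosen. This is a minimal sketch, not the authors' implementation. The diagonal covariances, the two-dimensional toy features, the emotion labels and all model parameters below are assumptions made for illustration; in practice the models would be trained (for example with the EM algorithm) on MFCC-based features.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, per-dimension means, per-dimension variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gauss_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at feature frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames: List[Sequence[float]], gmm: List[Component]) -> float:
    """Average per-frame log-likelihood of a feature sequence under a GMM."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)


def classify(frames: List[Sequence[float]], models: Dict[str, List[Component]]) -> str:
    """Return the label of the model with the highest average log-likelihood."""
    return max(models, key=lambda label: gmm_log_likelihood(frames, models[label]))


# Hypothetical, hand-set models of order 2 (two components per emotion).
models = {
    "neutral": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "anger": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [2.0, 2.0])],
}
frames = [[4.2, 3.9], [4.8, 5.1]]  # made-up feature frames of one utterance
print(classify(frames, models))
```

The GMM model order mentioned in the abstract corresponds here to the number of components in each list, and adding the fundamental frequency would amount to appending one more dimension to every frame, mean and variance vector.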

 

Facial Gestures Generation by Speech Signal Analysis using HUGE Architecture

Zoric Goranka

Faculty of Electrical Engineering and Computing, Croatia

goranka.zoric@fer.hr

 



[1] http://corpus.amiproject.org