Abstracts

 

 

            Coding Schemes for Verbal and Non-Verbal Communication

Ole Bernsen, Laila Dybkjær

Natural Interactive Systems Laboratory University of Southern Denmark, Odense  Denmark

nob@nis.sdu.dk; laila@nis.sdu.dk

 

ABSTRACT:

The richness of multimodal and natural interactive data means that there is no limit to the coding purposes for which the data may be used and hence no limit to the coding schemes that could be needed. Many coding schemes are still being created for some highly specific in-house purpose and are often not very well-documented. There are, however, a number of more general, well-established and well-documented coding schemes available. It is a great advantage to apply a coding scheme of this kind because it typically comes with a reasonably reliable tag-set and reasonably well-developed semantics for the tags. Furthermore, tools are often available which can help speed up annotation and analysis with these coding schemes, enable analysis of larger data sets and reduce annotation error. All of this is crucial to the development of new and more sophisticated multimodal and natural interactive applications. Still, due to the complexity of cross-modal data annotation, few cross-modal coding schemes exist which possess the desirable properties of being relatively general-purpose, well-tested and solidly documented.

In this paper, we describe a range of established unimodal coding schemes for coding aspects of multimodal and natural interactive behaviours in different modalities, including coding schemes for speech, facial expression, gaze, gesture, emotion, and head and body posture, and discuss issues of tools support. In addition, we illustrate emerging cross-modality coding schemes and discuss issues concerning their further development and tools support.

 

 

 

[HUGE]: Universal Architecture for Statistically Based Human Gesturing

Aleksandra Cerekovic, Igor Pandzic

Faculty of Electrical Engineering and Computing, Zagreb, Croatia

igor.pandzic@fer.hr; aleksandra.cerekovic@fer.hr

 

ABSTRACT:

We introduce a universal architecture for a statistically based HUman GEsturing (HUGE) system, which produces and uses statistical models of facial gestures based on any kind of inducement. By inducement we mean any signal that occurs in parallel with the production of gestures in human behaviour and that may be statistically correlated with the occurrence of gestures, e.g. spoken text, the audio signal of speech, bio signals, etc. The correlation between the inducement signal and the gestures is first used to build the statistical model of gestures from a training corpus consisting of gesture sequences and the corresponding inducement data sequences. In the runtime phase, raw, previously unseen inducement data is used to trigger (induce) the real-time gestures of the agent based on the previously constructed statistical model. We present the general architecture and implementation issues of our system, and further clarify it through two case studies. We believe that this universal architecture is useful for experimenting with various kinds of potential inducement signals and their features, and for exploring the correlation of such signals or features with gesturing behaviour.
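The two phases described above (building a statistical model from aligned inducement/gesture sequences, then inducing gestures from unseen inducement data) can be illustrated with a minimal sketch. It is not the HUGE implementation: the conditional frequency table, the function names, the inducement states and the gesture labels are all assumptions invented for illustration, and the toy corpus is not real data.

```python
import random
from collections import Counter, defaultdict


def train_gesture_model(inducements, gestures):
    """Estimate P(gesture | inducement state) by counting aligned pairs."""
    if len(inducements) != len(gestures):
        raise ValueError("sequences must be aligned element by element")
    counts = defaultdict(Counter)
    for state, gesture in zip(inducements, gestures):
        counts[state][gesture] += 1
    model = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        model[state] = {g: n / total for g, n in counter.items()}
    return model


def induce_gestures(model, inducements, rng, default="none"):
    """Runtime phase: sample one gesture per incoming inducement state."""
    output = []
    for state in inducements:
        dist = model.get(state)
        if dist is None:
            # Inducement state never seen in training: fall back to no gesture.
            output.append(default)
            continue
        labels = list(dist)
        weights = [dist[g] for g in labels]
        output.append(rng.choices(labels, weights=weights, k=1)[0])
    return output


# Toy training corpus (hypothetical labels, for illustration only).
train_inducement = ["stressed", "unstressed", "stressed", "pause", "stressed", "unstressed"]
train_gestures = ["nod", "none", "eyebrow_raise", "blink", "nod", "none"]

model = train_gesture_model(train_inducement, train_gestures)
generated = induce_gestures(model, ["unstressed", "stressed", "laugh"], random.Random(0))
```

A real system would likely condition on richer features and on timing; the table lookup here only shows how a trained model separates the training phase from the runtime phase.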

 

 

Teaching Communication Skills for Engineers: A Personal Experience

Aly N. El-Bahrawy

Faculty of Engineering, Ain Shams University

alyelbahrawy@yahoo.com

ABSTRACT

Engineers require many communication skills for their personal and professional success in life. Engineering curricula devote very limited time to such crucial skills. A recent Higher Education Enhancement Project Fund offered by the Ministry of Higher Education allows innovative staff members to compete for funds to support their creative ideas. Enhancing Data Analysis and Presentation Skills “EDAPSE” is one of the projects executed at the Faculty of Engineering, Ain Shams University. This paper presents the author’s experience as manager and member of the implementation team. Attention is given to the communication side of the training inside and outside the Faculty. In particular, three interesting communication training programs are illustrated. These programs cover non-verbal communication, non-lexical vocal expressions (grunts) and the very special hand gestures used frequently in the streets of Cairo.

 

 

 

What Pauses Can Tell Us about Speech and Gesture Partnership

Anna Esposito, Maria Marinaro

Dipartimento di Psicologia, Seconda Università di Napoli, and IIASS, Italy

Department of Physics, Salerno University, and IIASS, Italy

iass.annaesp@tin.it; anna.esposito@unina2.it

 

ABSTRACT:

Considering the role that speech pauses play in communication, we speculate on the possibility that holds (or gesture pauses) may serve similar purposes, supporting the view that gestures, like language, are an expressive resource that can take on different functions depending on the communicative demand. The data reported in the present paper seem to support this hypothesis, showing that 93% of the children's and 78% of the adults' speech pause variation is predictable from holds, suggesting that, at least to some extent, the function of holds may be thought to be similar to that of speech pauses. While speech pauses are likely to play the role of signaling mental activation processes aimed at replacing the “old spoken content” of an “utterance” with a new one, holds may signal mental activation processes aimed at replacing the “old visible bodily actions” (intimately involved in the semantic and/or pragmatic contents of the old “utterance”) with new bodily actions reflecting the representational and/or propositional contribution that gestures are engaged to convey in the new “utterance”.

 

 

Low-Complexity Algorithms for Biometric Recognition

Marcos Faundez-Zanuy

Escola Universitaria Politecnica de Mataro, Spain

faundez@eupmt.es

 

ABSTRACT:

In this paper we present the main research lines on biometric recognition of people followed at EUP Mataró: speaker, face, on-line signature, fingerprint and hand-geometry recognition. We summarize the low-cost and low-complexity applications (face and on-line signature), which are especially suited for real-time applications and execution on low-cost processors. In addition, we describe some novel approaches that study and try to solve new technological problems, such as avoiding replay attacks and checking whether a given recording has been altered (by means of watermarking), the relevance of bandwidth extension, and speaker recognition for bilingual speakers.

 

 

Analysis of verbal and nonverbal acoustic signals with the Dresden UASR system

Rüdiger Hoffmann

Technische Universität Dresden, Dresden, Germany

Ruediger.Hoffmann@ias.et.tu-dresden.de

ABSTRACT:

In view of the evolving computer technology of the 1960s, the Dresden University founded a research unit for communication and measurement in 1969. Since then, acoustic human-computer interaction has been one of the main aspects of its research and teaching. This unit is now named “Institute of Acoustics and Speech Communication” and includes the chair for Communication Acoustics and the chair for Speech Communication.

 

The main activities at the chair for Speech Communication in the past have been directed to the development of speech recognizers and TTS systems. In both directions, special effort was concentrated on versions suited for use in embedded systems. Prosody models have been developed and applied to improve both speech recognition systems and TTS systems. Of course, the development of speech databases is an essential part of these activities.

 

The algorithms are developed on an experimental platform called UASR (Unified Approach for Speech Synthesis and Recognition). It includes a collection of modules for data analysis and synthesis as well as the different databases. It enables the investigation of modern technologies like FSM algorithms and HMM synthesis. The components of UASR have also been successfully applied to the analysis of acoustic non-speech signals.

 

It is now intended to extend the UASR architecture toward the extraction/addition of non-linguistic information from/to the speech signal. This includes the refinement of the existing prosodic analysis and control, but also the development of new methods suited to evaluating certain communication situations. As an example of this kind of analysis, a multidisciplinary project will be started which aims at the identification and prediction of critical situations during psychotherapeutic treatment.

 

Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents

Jonas BESKOW, Björn GRANSTRÖM and David HOUSE

Centre for Speech Technology, CSC, KTH, Stockholm, Sweden

bjorn@speech.kth.se; davidh@speech.kth.se

ABSTRACT:

The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons closely related to the speech acoustics (e.g. movements of the lips and jaw), while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face. On the other hand, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. The context of much of our research regarding these questions is the aim of creating an animated talking agent capable of displaying realistic communicative behaviour and suitable for use in conversational spoken language systems. This chapter looks into the communicative function of the agent, both the capability to increase intelligibility of the spoken interaction and the possibility to make the flow of the dialogue smoother, through different kinds of communicative gestures, such as visual prosodic gestures (e.g. focal accent and emphatic stress) and gestures for different expressive states, turn-taking and negative or positive system feedback. We will give some examples of recent work, primarily at KTH, involving the collection and analysis of databases for audiovisual prosody.

 

Prosodic and gestural expression of interactional agreement
Eric Keller, Wolfgang Tschacher

IMM, University of Lausanne, Switzerland

University of Berne, Switzerland

Eric.Keller@unil.ch, tschacher@spk.unibe.ch

 

ABSTRACT:

At a sociolinguistic level, conversational interactions are cooperatively constructed activities in which participants negotiate their entrances, turns and alignments with other speakers, oftentimes with an underlying long-term objective of obtaining some agreement. The degree of perceived agreement is known to be of importance in a variety of contexts, such as psychotherapeutic interactions, contractual negotiations or project evaluations. Part of the prosodic and gestural elements in a conversational interaction can be interpreted as signals for the speaker’s degree of agreement; they are thus of probable importance in the (non-)emergence of agreement in a conversational exchange. A review of such prosodic and gestural elements will be presented, as well as some patterns in the evolution of agreement.

 

 

Characteristics of gestural action

Adam Kendon

University of Pennsylvania

 adamk@dca.net


ABSTRACT:

In everyday conversations participants routinely distinguish visible actions deemed a part of a speaker's intended expression from other kinds of actions which are not attended to or 'counted' as being significant for a speaker's utterance. I will offer some examples to show what the criteria may be according to which such distinctions are made and propose an approach in terms of which the 'gestural' character of a phrase of movement can be assessed.

 

 

Modeling multimodal dialogue with and in embodied agents

Stefan Kopp

Artificial Intelligence Group, Bielefeld University

skopp@techfak.uni-bielefeld.de

 

ABSTRACT:

Face-to-face conversation between humans is a complex process of concurrent and interleaved information transfer on multiple levels. This is only possible because humans manage to easily produce and attend to a number of verbal and nonverbal cues at the same time. I will briefly describe some of our research projects centered around the virtual human „Max“, with which we study how natural conversational behavior can be modeled and made available for artificial systems. This research pursues the goal of building comprehensive, embodied agents that can engage with humans in face-to-face conversation. At the same time, the agents are used as testbeds for probing and evaluating our models of aspects of multimodal dialogue, e.g. for speech-gesture production, multimodal feedback and turn-taking, as well as of the architectural underpinnings of these more or less aware, interactive behaviors.

 

 

 

Voice Source Change During Pitch Variation

Peter Murphy

Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland

peter.murphy@ul.ie

 

ABSTRACT:

Prosody refers to certain properties of the speech signal including audible changes in pitch, loudness, and syllable length. The acoustic manifestation of prosody is typically measured in terms of fundamental frequency (f0), amplitude and duration. These three cues have formed the basis for extensive studies of prosody in natural speech. The present work seeks to go beyond this level of representation and to examine additional factors that arise as a result of the underlying production mechanism. For example, intonation is studied with reference to the f0 contour. However, changing f0 requires changes in the laryngeal configuration that result in glottal flow parameter changes. These glottal changes may serve as important psychoacoustic markers in addition to (or in conjunction with) the f0 targets. The present work examines changes in basic glottal parameters with f0 in connected speech using electroglottogram signals and volume velocity signals measured at the lips. This preliminary study suggests that individual differences may exist in terms of glottal changes for a particular f0 variation. Future work will examine glottal variation within emotionally styled speech.

 

 

 

Cross-modal analysis & summarization of audiovisual material

Harris Papageorgiou

Institute for Language and Speech Processing (ILSP), Athens, Greece

xaris@ilsp.gr

 

ABSTRACT:

We present our latest work in cross-modal summarization of audiovisual material. The proposed presentation emphasizes research on cross-modal analysis of audio/video coupled with different ways of synthesizing the most salient elements of the parts that constitute a cross-modal object.

At the core of our work lies an open, adaptable architecture responsible for suitably combining the salient parts taking into account (a) the cross-modal analysis findings, (b) the typology and semantic characteristics of the audiovisual stream and (c) the users’ interests. Three cases will be presented applying our cross-modal fusion and summarization methodology in three discrete domains: broadcast TV news, European Parliament sessions and travel documentaries.

 

Multimedia information fusion and presentation of personalized summaries on a range of different consumer devices is of growing importance now that the available multimedia content is increasing exponentially. There is a basic need to build multimedia content systems that will help users keep up with the explosion of digital content scattered over different platforms (radio, satellite TV, Web, etc.), different modalities (speech, text, images, video) and different languages. Moreover, end users require intelligent information filtering facilities, embedded in user-friendly, personalized interfaces that will allow them to distinguish crucial content from a plethora of irrelevant information. An answer to this need is provided by organizing, selecting and presenting summarized information in a personalized way.

 

In this context, state-of-the-art techniques of distributed information retrieval, related to multimedia selection, data fusion and presentation of results coupled with cross-modal summarization and hierarchical categorization enable users to effectively search and browse the large amount of content gathered. The process of cross-modal summarization consists in constructing summaries by exploiting and analyzing the different modalities (images, speech, audio, text, etc) that co-exist in the original data stream. Although research is still in its infancy, the growing availability of multimedia material along with the technological advances of display capabilities in consumer devices makes cross-modal content abstraction a challenging, yet worthwhile task.

 

 

Speech Prosody Analysis, Modification, and Resynthesis

Hartmut R. Pfitzinger

Institute of Phonetics and Speech Processing, Munich, Germany

hpt@phonetik.uni-muenchen.de

ABSTRACT:
Automatic extraction of prosodic features is more than just detecting fundamental frequency contours. We distinguish at least five dimensions of prosody: intensity, intonation, timing, voice quality, and degree of reduction, each of which can be regarded as a composition of quickly varying components (e.g. intrinsic or segmental variations) and slowly varying components, the so-called supra-segmentals. We develop new analysis techniques for the automatic extraction of the more complicated prosodic dimensions: timing (or local speech rate), voice quality, and degree of reduction. The main development goal is to achieve analysis results at a high degree of abstraction which allow for meaningful interpretation, easy modification, and artifact-free re-synthesis. In particular, the common re-synthesis technique PSOLA is difficult to improve, yet improvement is necessary; to be precise, it needs to be replaced by a more powerful re-synthesis algorithm with fewer artifacts.

 

 

Embodied Conversational Agents in Wizard-of-Oz and Multimodal Interaction Applications

Matej Rojc, Tomaž Rotovnik, Zdravko Kačič

University of Maribor, Faculty of Electrical Engineering and Computer Science

matej.rojc@uni-mb.si

 

ABSTRACT:

Embodied Conversational Agents employed in multimodal interaction applications have the potential to achieve properties similar to those of humans in face-to-face conversation. They enable the inclusion of verbal and nonverbal communication. Thus the degree of personalization of the user interface is much higher than in other human-computer interfaces. This, of course, greatly contributes to the naturalness and user friendliness of the interface, opening a wide area of possible applications. Two implementations of embodied conversational agents in human-computer interaction are presented in this paper: the first in a Wizard-of-Oz application and the second in a dialogue system. In the Wizard-of-Oz application the embodied conversational agent conveys the operator's spoken information to the user with whom the operator communicates. Depending on the scenario of the application, the user may or may not be aware of the operator’s involvement. The operator can communicate with the user based on audio/visual or audio-only communication. The paper describes the application setup, which enables distant communication with the user, where the user is not aware of the operator’s involvement. A real-time viseme recognizer is needed to assure proper response of the avatar. In addition, the implementation of the embodied conversational agent named Lili as a conductor in a music show, which is broadcast by RTV Slovenia, will be described in more detail.

The employment of the embodied conversational agent as a virtual major-domo named Maja, within an intelligent ambience, using a speech recognition system and the TTS system Plattos, will also be described.

 

 

 

 

Presenting in Style by Virtual Humans

Zsófia Ruttkay

HMI, University of Twente

zsofi@cs.utwente.nl

 

ABSTRACT:

The concept of style is used to characterize how a human talks and gestures: his typical intonation, whether he uses many bold hand gestures and vivid facial expressions, or instead habitually talks with almost a poker face. How do his speech and gesturing change if he gets angry, or sad? We are all somewhat different in these respects, but there are factors which influence our style: personality, culture, social status, the setting of the conversation. The diversity in style is not only a source of joy in everyday life, but also a reference framework which tells much about the identity of the person.

The topic of the paper is endowing Virtual Humans with some style, from an arsenal of possibilities. First a conceptual framework for defining style is discussed, identifying the variables of style manifested in speech and nonverbal communication. Then the GESTYLE language is introduced, making it possible to define the style of a character in terms of Style Dictionaries, assigning non-deterministic choices to express certain meanings by nonverbal signals and speech. As a (virtual) human may have several, changing and possibly conflicting factors determining the style he uses at a given moment (think of an extrovert person in a formal conversation with his boss), there are mechanisms to define multiple sources of style and to handle conflicts and dynamic changes.

GESTYLE is a text markup language which makes it possible to generate speech and accompanying facial expressions and hand gestures automatically, by declaring the style of the character and using meaning tags in the text.  GESTYLE can be coupled with different low-level TTS and animation engines. GESTYLE could be a handy tool for psychologists to investigate different styles and their effect.

 

 

 

Single-channel Speech Enhancement Using Modified Wavelet Transform

Zdenek Smekal, Petr Sysel

Department of Telecommunications, Brno University of Technology

smekal@feec.vutbr.cz

ABSTRACT:

The basic problem in the analysis, synthesis and recognition of speech signal is its extraction from an interfering environment. If a speech signal that has been interfered with is being processed, serious errors may occur during the processing. Interference may be acquired already during recording, transmission and also in digital encoding or compression. Included directly in the speech signal is a noise component, which is intrinsic to unvoiced consonants in particular. If interference occurs in the same frequency band as the noise component of speech, it is very difficult to separate the desirable and undesirable components from each other.

The wavelet transform represents a specific spectral alternative to the Fourier transform. Vis-a-vis the Fourier transform it has an advantage in that with the wavelet transform the signal resolution in the time and frequency domains is set up better. It has been shown that the dyadic discrete-time wavelet transform (DTWT) is closely related to half-band mirror digital filters with perfect reconstruction. Which part of the image plane of the wavelet transform is removed depends on the application of a suitable type of thresholding. The type of thresholding can thus be used to choose the types of noise and interference to be removed.

The paper includes a comparison of single-channel methods for removing interference by means of the wavelet transform, which is applied in different domains (time, frequency, and cepstral domains, etc.). Different types of noise and interference will be discussed (periodic, narrowband, impulse, broadband, etc.) as well as the choice of a suitable method that would be most effective for the given type of noise.
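As a rough illustration of wavelet thresholding, the sketch below uses the Haar wavelet, the simplest two-channel filter pair with perfect reconstruction, and applies soft or hard thresholding to the detail coefficients before reconstruction. The choice of the Haar wavelet, the function names, the threshold value and the toy signal are assumptions made for illustration; the modified transform compared in the paper is not reproduced here.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform (length must be even)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level; reconstructs the input of haar_forward."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def soft_threshold(coeffs, threshold):
    """Shrink every coefficient toward zero by the threshold."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def hard_threshold(coeffs, threshold):
    """Zero coefficients at or below the threshold, keep the rest unchanged."""
    return [c if abs(c) > threshold else 0.0 for c in coeffs]


def denoise(signal, threshold, levels=2, rule=soft_threshold):
    """Decompose, threshold the detail coefficients, reconstruct.

    The signal length must be divisible by 2 ** levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(rule(detail, threshold))
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx


# Toy example: a two-step signal with a small alternating disturbance.
clean = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
noisy = [1.1, 0.9, 1.1, 0.9, 5.1, 4.9, 5.1, 4.9]
restored = denoise(noisy, threshold=0.2)
unchanged = denoise(noisy, threshold=0.0)
```

With a zero threshold the transform pair returns the input, which reflects the perfect-reconstruction property; with a threshold above the size of the disturbance's detail coefficients, the disturbance is removed while the step is preserved. Swapping `rule=hard_threshold` changes which coefficients survive, which is how the type of thresholding selects what is removed.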


 

 

Speech spectrum envelope modeling

Robert Vich

Institute of Photonics and Electronics AS CR, Prague

vich@ure.cas.cz

ABSTRACT:

A new method is proposed for cepstral speech synthesis. It is based on peak picking on the spectrum of a speech frame followed by interpolation. The interpolated spectrum envelope is used for real cepstrum computation of the vocal tract model impulse response. From the original speech spectrum and the spectrum envelope, the mean fundamental frequency and the amplitude of the residual signal can be estimated. The cepstral speech production model using Padé approximation can then be constructed.

In the contribution the proposed method will be evaluated using real and synthetic speech signals and compared with LPC spectrum and a spectrum obtained by cepstral liftering.
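The first steps of the method (peak picking on a frame spectrum, interpolation to an envelope, and real cepstrum computation from that envelope) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's algorithm: it assumes a log-magnitude spectrum on bins 0..N/2, piecewise-linear interpolation, and the two end bins treated as peaks so that the envelope spans the whole range. The function names and the toy spectrum are hypothetical, and the Padé approximation step is not covered.

```python
import math


def pick_peaks(values):
    """Indices of local maxima; both end bins are kept (an assumption)."""
    peaks = [0]
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] >= values[i + 1]:
            peaks.append(i)
    peaks.append(len(values) - 1)
    return peaks


def interpolate_envelope(values, peaks):
    """Piecewise-linear interpolation between the picked peaks."""
    envelope = [0.0] * len(values)
    for left, right in zip(peaks, peaks[1:]):
        for i in range(left, right + 1):
            t = (i - left) / (right - left)
            envelope[i] = (1 - t) * values[left] + t * values[right]
    return envelope


def real_cepstrum(log_half_spectrum):
    """Real cepstrum from a log-magnitude spectrum given on bins 0..N/2.

    The half spectrum is mirrored into an even-symmetric spectrum of length N,
    so the inverse DFT reduces to a cosine sum (computed naively here).
    """
    half = len(log_half_spectrum) - 1
    n = 2 * half
    full = log_half_spectrum + log_half_spectrum[-2:0:-1]
    return [
        sum(full[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(n)
    ]


# Toy log-magnitude spectrum with harmonic ripple (invented values).
log_spectrum = [1.0, 4.0, 2.0, 6.0, 3.0, 5.0, 1.0, 3.0, 0.0]
peaks = pick_peaks(log_spectrum)
envelope = interpolate_envelope(log_spectrum, peaks)
cepstrum = real_cepstrum(envelope)
```

The envelope lies on or above the original spectrum at every bin, and the resulting cepstrum is real and symmetric, so only its first half is needed when building a vocal tract model from it.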

 

 

Proteus: a rapid segment selection algorithm for unit-selection TTS in embedded devices
Jerneja ZGANEC GROS

Alpineon, Development and Research, Ljubljana, Slovenia

jerneja.gros@alpineon.com


ABSTRACT:
Memory and processing power requirements are important factors when designing human-computer interfaces for embedded devices. We describe Proteus, an accelerated unit-selection method, which was designed for an embedded implementation of a polyphone concatenative TTS system. The results of objective measurements of computational speed, along with results of subjective listening tests conceived according to ITU-T recommendations, are provided at the end of the paper.

 

Activities of Speech Laboratories of BME TMIT

Klara Vicsi

Budapest University of Technology and Economics, Dept. of Telecommunication and Mediainformatics, Lab. of Speech Acoustics, 111 Budapest, HUNGARY

 vicsi@tmit.bme.hu

ABSTRACT:

The purpose of this lecture is to give an overall review of the research activities going on at the Speech Laboratories (Speech Technology Laboratory, Laboratory of Speech Acoustics, and Telecommunications & Signal Processing Laboratory) of the Department of Telecommunication and Mediainformatics (TMIT) of the Budapest University of Technology and Economics (BME). The main activities of these Laboratories are grouped into 4 speech research fields: speech synthesis and text-to-speech (TTS) systems; automatic speech recognition (ASR); database design and processing for TTS and ASR; and speech technology for sound diagnosis and for handicapped people. We are now going to extend our activity in the direction of nonverbal communication research, for example the examination of emotional speech, the collection of emotional speech databases, the recognition of 6 main emotions in speech, etc.