Abstracts
![]()
Coding Schemes for Verbal and Non-Verbal Communication
Ole Bernsen, Laila Dybkjær
Natural
Interactive
nob@nis.sdu.dk, laila@nis.sdu.dk
ABSTRACT:
The richness of
multimodal and natural interactive data means that there is no limit to the
coding purposes for which the data may be used and hence no limit to the coding
schemes that could be needed. Many coding schemes are still being created for
some highly specific in-house purpose and are often not very well-documented.
There are, however, a number of more general, well-established and
well-documented coding schemes available. It is a great advantage to apply a
coding scheme of this kind because it typically comes with a reasonably
reliable tag-set and reasonably well-developed semantics for the tags.
Furthermore, tools are often available which can help speed up annotation and
analysis with these coding schemes, enable analysis of larger data sets and
reduce annotation error. All of this is crucial to the development of new and
more sophisticated multimodal and natural interactive applications. Still, due
to the complexity of cross-modal data annotation, few cross-modal coding
schemes exist which possess the desirable properties of being relatively
general-purpose, well-tested and solidly documented.
In this paper,
we describe a range of established unimodal coding schemes for coding aspects
of multimodal and natural interactive behaviours in different modalities,
including coding schemes for speech, facial expression, gaze, gesture, emotion,
and head and body posture, and discuss issues of tools support. In addition, we
illustrate emerging cross-modality coding schemes and discuss issues concerning
their further development and tools support.
![]()
[HUGE]: Universal Architecture for
Statistically Based Human Gesturing
Aleksandra Cerekovic, Igor Pandzic
Faculty of
Electrical Engineering and Computing,
igor.pandzic@fer.hr;
aleksandra.cerekovic@fer.hr
ABSTRACT:
We introduce a universal
architecture for a statistically based HUman GEsturing (HUGE) system, for
producing and using statistical models for facial gestures based on any kind of
inducement. As inducement we consider any kind of signal that occurs in
parallel to the production of gestures in human behaviour and that may have a
statistical correlation with the occurrence of gestures, e.g. spoken text, the
audio signal of speech, biosignals, etc. The correlation between the
inducement signal and the gestures is first used to build the statistical model
of gestures from a training corpus consisting of sequences of gestures and
corresponding inducement data sequences. In the runtime phase, raw,
previously unseen inducement data is used to trigger (induce) the real-time
gestures of the agent based on the previously constructed statistical model. We
present the general architecture and implementation issues of our system, and
further clarify it through two case studies. We believe that this universal
architecture is useful for experimenting with various kinds of potential
inducement signals and their features and exploring the correlation of such
signals or features with the gesturing behaviour.
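The two phases described above (building a statistical model from a training corpus, then inducing gestures from unseen data at runtime) can be illustrated with a minimal sketch. It is not the HUGE implementation: the inducement symbols, gesture labels, toy corpus, and the function names `train_gesture_model` and `induce_gesture` are all assumptions made for illustration, and the model is a plain table of conditional frequencies.

```python
import random
from collections import Counter, defaultdict

def train_gesture_model(inducement_seq, gesture_seq):
    """Estimate P(gesture | inducement) from aligned training sequences."""
    counts = defaultdict(Counter)
    for inducement, gesture in zip(inducement_seq, gesture_seq):
        counts[inducement][gesture] += 1
    # Normalise the co-occurrence counts into conditional probabilities.
    model = {}
    for inducement, gesture_counts in counts.items():
        total = sum(gesture_counts.values())
        model[inducement] = {g: c / total for g, c in gesture_counts.items()}
    return model

def induce_gesture(model, inducement, rng, default="none"):
    """Sample a gesture for a runtime inducement symbol; fall back if unseen."""
    distribution = model.get(inducement)
    if distribution is None:
        return default
    gestures = list(distribution)
    weights = [distribution[g] for g in gestures]
    return rng.choices(gestures, weights=weights, k=1)[0]

# Toy training corpus (invented for illustration only).
train_inducement = ["stressed", "unstressed", "stressed", "pause", "stressed", "unstressed"]
train_gestures = ["nod", "none", "eyebrow_raise", "blink", "nod", "none"]
model = train_gesture_model(train_inducement, train_gestures)

# Runtime phase: previously unseen inducement data triggers gestures.
rng = random.Random(0)
runtime = [induce_gesture(model, s, rng) for s in ["stressed", "pause", "unknown"]]
```

A real system would work on continuous features and timing rather than one symbol per gesture, but the split between an offline training step and an online sampling step is the same.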
![]()
Teaching Communication Skills
for Engineers: A Personal Experience
Aly N. El-Bahrawy
Faculty of
Engineering,
ABSTRACT
Engineers
require many communication skills for their personal and professional success
in life. The engineering curriculum devotes very limited time to such crucial
skills. A recent Higher Education Enhancement Project Fund offered by the
Ministry of Higher Education allows many innovative staff members to compete
for funds to support their creative ideas. Enhancing Data Analysis and
Presentation Skills “EDAPSE” is one of the projects executed at the Faculty of Engineering,
![]()
What Pauses Can Tell Us about
Speech and Gesture Partnership
Anna Esposito, Maria Marinaro
Dipartimento
di Psicologia, Seconda Università di Napoli, and IIASS, Italy
Department of Physics,
iass.annaesp@tin.it; anna.esposito@unina2.it
ABSTRACT:
Considering the role that speech pauses play in communication, we
speculate that holds (or gesture pauses) may serve
similar purposes, supporting the view that gestures, like language, are an
expressive resource that can take on different functions depending on the communicative
demand. The data reported in the present paper seem to support this hypothesis,
showing that 93% of the children's and 78% of the adults' speech pause variation is
predictable from holds, suggesting that, at least to some extent, the
function of holds may be similar to that of speech pauses. While speech
pauses are likely to play the role of signaling mental activation processes
aimed at replacing the “old spoken content” of an “utterance” with a new one,
holds may signal mental activation processes aimed at replacing the “old
visible bodily actions” (intimately involved in the semantic and/or pragmatic
contents of the old “utterance”) with new bodily actions reflecting the
representational and/or propositional contribution that gestures are engaged to
convey in the new “utterance”.
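The abstract does not say how "predictable from holds" was quantified. One common reading of such percentages is the coefficient of determination (R²) of a regression of pause measures on hold measures. The sketch below computes R² for a simple least-squares line. The function name `r_squared` and the per-speaker counts are invented for illustration and are not the study's data.

```python
def r_squared(x, y):
    """Share of the variance in y explained by a least-squares line on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-speaker counts of holds and speech pauses.
holds = [2, 4, 5, 7, 9]
pauses = [3, 5, 4, 8, 9]
explained = r_squared(holds, pauses)

# A perfectly linear relation explains all of the variation.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
```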
![]()
Low-Complexity Algorithms for Biometric Recognition
Marcos Faundez-Zanuy
Escola Universitaria Politecnica de Mataro, Spain
ABSTRACT:
In this paper we
will present the main research lines on biometric recognition of
people pursued at EUP Mataró: speaker, face, on-line signature, fingerprint
and hand-geometry recognition. We will summarize the low-cost and low-complexity
applications (face and on-line signature), which are especially suited for
real-time applications and execution on low-cost processors. In addition, we will
also describe some novel approaches which try to study and solve new
technological problems, such as avoiding replay attacks and checking whether a given
recording has been altered (by means of watermarking), the relevance of
bandwidth extension, and speaker recognition for bilingual speakers.
![]()
Analysis of verbal and nonverbal
acoustic signals with the Dresden UASR system
Rüdiger Hoffmann,
Technische Universität Dresden, Dresden, Germany
Ruediger.Hoffmann@ias.et.tu-dresden.de
ABSTRACT:
Considering the
evolving computer technology in the 1960s, the
The main
activities at the chair for Speech Communication in the past have been directed
to the development of speech recognizers and TTS systems. In both directions,
special effort was devoted to versions suited for use in
embedded systems. Prosody models have been developed and applied for the
improvement of speech recognition systems and also of TTS systems. Of course,
the development of the speech databases is an essential part of these
activities.
The development
of the algorithms is performed using an experimental platform which is called
UASR (Unified Approach for Speech Synthesis and Recognition). It includes a
collection of modules for data analysis and synthesis as well as the different
databases. It enables the investigation of modern technologies like FSM
algorithms and HMM synthesis. The components of UASR have been successfully
applied to the analysis of acoustic signals other than speech.
It is intended
now to extend the UASR architecture toward the extraction/addition of
non-linguistic information from/to the speech signal. This includes the
refinement of the existing prosodic analysis and control, but also the
development of new methods which are suited to evaluate certain communication
situations. As an example of this kind of analysis, a multidisciplinary project
will be started which aims at the identification and prediction of critical
situations during psychotherapeutic treatment.
![]()
Analysis and synthesis of
multimodal verbal and non-verbal interaction for animated interface agents
Jonas BESKOW, Björn GRANSTRÖM and David HOUSE
Centre for Speech Technology, CSC, KTH,
bjorn@speech.kth.se; davidh@speech.kth.se
ABSTRACT:
The use of animated talking agents is a novel feature of many multimodal
spoken dialogue systems. The addition and integration of a virtual talking head
has direct implications for the way in which users approach and interact with
such systems. However, understanding the interactions between visual
expressions, dialogue functions and the acoustics of the corresponding speech
presents a substantial challenge. Some of the visual articulation is for
obvious reasons closely related to the speech acoustics (e.g. movements of the
lips and jaw), while there are other articulatory movements affecting speech
acoustics that are not visible on the outside of the face. On the other hand,
many facial gestures used
for communicative purposes do not affect the acoustics directly, but might
nevertheless be connected on a higher communicative level in which the timing
of the gestures could play an important role. The context of much of our
research regarding these questions is to be able to create an animated talking
agent capable of displaying realistic communicative behaviour and suitable for
use in conversational spoken language systems. This chapter looks into the
communicative function of the agent, both the capability to increase
intelligibility of the spoken interaction and the possibility to make the flow
of the dialogue smoother, through different kinds of communicative gestures,
such as visual prosodic gestures (e.g. focal accent and emphatic stress) and
gestures for different expressive states, turn-taking and negative or positive
system feedback. We will give some examples of recent work, primarily at KTH,
involving the collection and analysis of databases for audiovisual prosody.
![]()
Prosodic and gestural expression
of interactional agreement
Eric Keller, Wolfgang Tschacher
IMM,
Eric.Keller@unil.ch, tschacher@spk.unibe.ch
ABSTRACT:
At a
sociolinguistic level, conversational interactions are cooperatively
constructed activities in which participants negotiate their entrances, turns and
alignments with other speakers, oftentimes with an underlying long-term
objective of obtaining some agreement. The degree of perceived agreement is
known to be of importance in a variety of contexts, such as psychotherapeutic
interactions, contractual negotiations or project evaluations. Part of the
prosodic and gestural elements in a conversational interaction can be
interpreted as signals for the speaker’s degree of agreement; they are thus of
probable importance in the (non-)emergence of agreement in a conversational
exchange. A review of such prosodic and gestural elements will be presented, as
well as some patterns in the evolution of agreement.
![]()
Characteristics of gestural
action
Adam Kendon
ABSTRACT:
In everyday
conversations participants routinely distinguish visible actions deemed a part
of a speaker's intended expression from other kinds of actions which are not
attended to or 'counted' as being significant for a speaker's utterance. I will
offer some examples to show what the criteria may be according to which such
distinctions are made, and propose an approach in terms of which the 'gestural'
character of a phrase of movement can be assessed.
![]()
Modeling
multimodal dialogue with and in embodied agents
Stefan Kopp
Artificial
Intelligence Group,
skopp@techfak.uni-bielefeld.de
ABSTRACT:
Face-to-face conversation
between humans is a complex process of concurrent and interleaved information
transfer on multiple levels. This is only possible because humans manage to
easily produce and attend to a number of verbal and nonverbal cues at the same
time. I will briefly describe some of our research projects centered around the
virtual human “Max”, with which we study how natural conversational behavior
can be modeled and made available for artificial systems. This research
activity pursues the goal of building comprehensive, embodied agents that
can engage with humans in face-to-face conversation. At the same time, they are
used as testbeds in probing and evaluating our models of aspects of multimodal
dialogue, e.g. for speech-gesture production, multimodal feedback, turn-taking,
as well as of the architectural underpinnings of these more or less aware,
interactive behaviors.
![]()
Voice Source Change During Pitch
Variation
Peter Murphy
Department of
Electronic and Computer Engineering,
peter.murphy@ul.ie
ABSTRACT:
Prosody refers
to certain properties of the speech signal including audible changes in pitch,
loudness, and syllable length. The acoustic manifestation of prosody is
typically measured in terms of fundamental frequency (f0), amplitude and
duration. These three cues have formed the basis for extensive studies of
prosody in natural speech. The present work seeks to go beyond this level of
representation and to examine additional factors that arise as a result of the
underlying production mechanism. For example, intonation is studied with
reference to the f0 contour. However, changing f0 requires changes in the
laryngeal configuration that result in glottal flow parameter changes. These
glottal changes may serve as important psychoacoustic markers in addition to
(or in conjunction with) the f0 targets. The present work examines changes in
basic glottal parameters with f0 in connected speech using electroglottogram
signals and volume velocity signals at the lips. This preliminary study suggests that
individual differences may exist in terms of glottal changes for a particular
f0 variation. Future work will examine glottal variation within emotionally
styled speech.
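As background to the three classical cues named above (f0, amplitude and duration), the following minimal sketch measures them on a single frame. It uses a textbook normalised-autocorrelation pitch estimate, not the electroglottogram-based analysis of the paper. The function names, the 60 to 400 Hz search range and the synthetic sine frame are assumptions for illustration.

```python
import math

def rms_amplitude(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return the f0 whose lag maximises the normalised autocorrelation."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        head = frame[:-lag]
        tail = frame[lag:]
        energy = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        if energy == 0.0:
            continue
        score = sum(a * b for a, b in zip(head, tail)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Synthetic test frame: a 100 Hz sine sampled at 8 kHz for 50 ms.
sample_rate = 8000
true_f0 = 100.0
frame = [0.5 * math.sin(2 * math.pi * true_f0 * n / sample_rate) for n in range(400)]

f0 = estimate_f0(frame, sample_rate)
amplitude = rms_amplitude(frame)
duration = len(frame) / sample_rate
```

Real speech needs voicing decisions and protection against octave errors, which this sketch leaves out.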
![]()
Cross-modal analysis &
summarization of audiovisual material
Harris Papageorgiou
Institute for
Language and Speech Processing (ILSP),
ABSTRACT:
We present our latest
work in cross-modal summarization of audiovisual material. The proposed
presentation emphasizes cross-modal analysis of audio/video
coupled with different ways of synthesizing the most salient elements of the
parts that constitute a cross-modal object.
At the core of
our work lies an open, adaptable architecture responsible for suitably
combining the salient parts taking into account (a) the cross-modal analysis
findings, (b) the typology and semantic characteristics of the audiovisual
stream and (c) the users’ interests. Three cases will be presented applying our
cross-modal fusion and summarization methodology in three discrete domains:
broadcast TV news, European Parliament sessions and travel documentaries.
Multimedia
information fusion and presentation of personalized summaries on a range of
different consumer devices are of growing importance now that available
multimedia content increases exponentially. There is a basic need to build
multimedia content systems that will help users keep up with the explosion of
digital content scattered over different platforms (radio, satellite TV, Web,
etc.), different modalities (speech, text, images, video) and different
languages. Moreover, end users require intelligent information filtering
facilities, embedded in user-friendly, personalized interfaces that will allow
them to distinguish crucial content from a plethora of irrelevant information.
An answer to this need is provided by organizing, selecting and presenting
summarized information in a personalized way.
In this context,
state-of-the-art techniques of distributed information retrieval, related to
multimedia selection, data fusion and presentation of results coupled with
cross-modal summarization and hierarchical categorization enable users to
effectively search and browse the large amount of content gathered. The process
of cross-modal summarization consists of constructing summaries by exploiting
and analyzing the different modalities (images, speech, audio, text, etc.) that
co-exist
in the original data stream. Although research is still in its infancy, the
growing availability of multimedia material along with the technological
advances of display capabilities in consumer devices makes cross-modal content
abstraction a challenging, yet worthwhile task.
![]()
Speech Prosody Analysis, Modification, and Resynthesis
Hartmut R. Pfitzinger
ABSTRACT:
Automatic extraction of prosodic features is more than just detecting
fundamental frequency contours. We distinguish at least five dimensions of
prosody: intensity, intonation, timing, voice quality, and degree of reduction,
each of which can be regarded as a composition of quickly varying components
(e.g. intrinsic or segmental variations) and slowly varying components, the
so-called supra-segmentals. We develop new analysis techniques for the
automatic extraction of the more complicated prosodic dimensions: timing (or
local speech rate), voice quality, and degree of reduction.
The main development goal is to achieve analysis results at a high level of
abstraction that allow for meaningful interpretation, easy modification, and
artifact-free re-synthesis. In particular, the common re-synthesis technique
PSOLA is difficult to improve, yet improvement is necessary; more precisely, it
should be replaced by a more powerful re-synthesis algorithm with fewer
artifacts.
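The idea that a prosodic contour is a composition of slowly and quickly varying components can be illustrated with a very simple decomposition. The sketch below assumes a centred moving average as the slow component and takes the residual as the fast component. This is an illustrative choice, not the analysis technique of the paper, and the function name `split_contour` and the toy contour are hypothetical.

```python
def split_contour(contour, window=5):
    """Split a contour into a slow trend (moving average) and a fast residual."""
    half = window // 2
    slow = []
    for i in range(len(contour)):
        lo = max(0, i - half)
        hi = min(len(contour), i + half + 1)
        # Average over the part of the window that lies inside the contour.
        slow.append(sum(contour[lo:hi]) / (hi - lo))
    fast = [c - s for c, s in zip(contour, slow)]
    return slow, fast

# Toy contour: a slow rise with a small alternating perturbation on top.
contour = [100.0 + i + (3.0 if i % 2 else -3.0) for i in range(20)]
slow, fast = split_contour(contour, window=5)

# A flat contour has no quickly varying component.
flat_slow, flat_fast = split_contour([120.0] * 10)
```

By construction the two components add back up to the original contour, so either one can be modified separately before recombination.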
![]()
Embodied Conversational Agents in Wizard-of-Oz and Multimodal
Interaction Applications
Matej Rojc, Tomaž Rotovnik, Zdravko Kačič
ABSTRACT:
Embodied Conversational Agents employed in
multimodal interaction applications have the potential to achieve properties
similar to those of humans in face-to-face conversation. They enable the inclusion of
verbal and nonverbal communication. Thus the degree of personalization of the
user interface is much higher than in other human-computer interfaces. This, of
course, greatly contributes to the naturalness and user friendliness of the
interface, opening a wide area of possible applications. Two implementations of
embodied conversational agents in human-computer interaction are presented in
this paper: the first one in a Wizard-of-Oz application and the second in a
dialogue system. In the Wizard-of-Oz application, the embodied conversational agent
is applied in such a way that it conveys spoken information of the operator to the
user with whom the operator communicates. Depending on the scenario of the
application, the user may or may not be aware of the operator’s involvement. The
operator can communicate with the user based on audio/visual or only audio
communication. The paper describes the application setup, which enables distant
communication with the user, where the user is not aware of the operator’s
involvement. A real-time viseme recognizer is needed to ensure proper response of
the avatar. In addition, the implementation of the embodied conversational agent
named Lili as a conductor in a music show, which is broadcast by RTV
Slovenia, will be described in more detail.
The use of the embodied conversational agent
as a virtual major-domo named Maja, within an intelligent ambience, using
a speech recognition system and the TTS system Plattos, will also be described.
![]()
Presenting
in Style by Virtual Humans
Zsófia Ruttkay
HMI,
ABSTRACT:
The concept of style is used to characterize how a human talks and
gestures: his typical intonation, whether he uses many bold hand gestures and vivid
facial expressions, or rather tends to talk with almost a poker
face. How do his speech and gesturing change if he gets angry or sad? We are
all somewhat different in these respects, but there are factors which influence
our style: personality, culture, social status, the setting of the
conversation. The diversity in style is
not only a source of joy in everyday life, but also a reference framework
which tells much about the identity of the person.
The topic of the paper is endowing Virtual Humans with some style, from
an arsenal of possibilities. First a conceptual framework of defining style is
discussed, identifying the variables of style manifested in speech and
nonverbal communication. Then the GESTYLE language is introduced, making it
possible to define the style of a character in terms of Style Dictionaries,
assigning non-deterministic choices to express certain meanings by nonverbal
signals and speech. As a (virtual) human may have several, changing and
possibly conflicting factors determining the style he uses at a moment (think
of an extrovert person in a formal conversation with his boss), there are
mechanisms to define multiple sources of style and to handle conflicts and
dynamic changes.
GESTYLE is a text markup language which makes it possible to generate speech
and accompanying facial expressions and hand gestures automatically, by
declaring the style of the character and using meaning tags in the text. GESTYLE can be coupled with different
low-level TTS and animation engines. GESTYLE could be a handy tool for
psychologists to investigate different styles and their effect.
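The notion of Style Dictionaries that assign non-deterministic choices to meanings, with several style sources that may conflict, can be sketched as follows. This is Python, not GESTYLE markup. The dictionary contents, the signal names, the function name `choose_signal`, and the "first dictionary wins" conflict rule are all assumptions made for illustration, not the language's actual syntax or resolution mechanism.

```python
import random

# Hypothetical style dictionaries: meaning -> [(nonverbal signal, weight), ...]
EXTROVERT = {
    "greeting": [("wave_both_hands", 0.7), ("broad_smile", 0.3)],
    "emphasis": [("beat_gesture", 0.6), ("eyebrow_raise", 0.4)],
}
FORMAL_SETTING = {
    "greeting": [("slight_nod", 0.8), ("polite_smile", 0.2)],
}

def choose_signal(meaning, dictionaries, rng):
    """Pick a signal from the highest-priority dictionary defining the meaning."""
    for style in dictionaries:  # earlier dictionaries take precedence
        if meaning in style:
            signals, weights = zip(*style[meaning])
            return rng.choices(signals, weights=weights, k=1)[0]
    return None  # no style source defines this meaning

# An extrovert in a formal conversation: the setting overrides personality.
rng = random.Random(1)
priority = [FORMAL_SETTING, EXTROVERT]
greeting = choose_signal("greeting", priority, rng)
emphasis = choose_signal("emphasis", priority, rng)
farewell = choose_signal("farewell", priority, rng)
```

Reordering `priority` changes which style source wins, which is one simple way to model a character whose situation changes during a conversation.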
![]()
Single-channel Speech Enhancement Using Modified
Wavelet Transform
Zdenek Smekal, Petr Sysel
Department of
Telecommunications,
ABSTRACT:
The basic
problem in the analysis, synthesis and recognition of speech signal is its
extraction from an interfering environment. If a speech signal that has been interfered
with is being processed, serious errors may occur during the processing.
Interference may be introduced during recording and transmission, and also in
digital encoding or compression. Included directly in the speech signal is a
noise component, which is intrinsic to unvoiced consonants in particular. If
interference occurs in the same frequency band as the noise component of
speech, it is very difficult to separate the desirable and undesirable
components from each other.
The wavelet
transform represents a specific spectral alternative to the Fourier transform.
Compared with the Fourier transform, its advantage is that the signal
resolution in the time and frequency domains is better balanced.
It has been shown that the dyadic discrete-time wavelet transform
(DTWT) is closely related to half-band mirror digital filters with perfect
reconstruction. Which part of the image plane of the wavelet transform is
removed depends on the application of a suitable type of thresholding. The type
of thresholding can thus be used to choose the types of noise and interference
to be removed.
The paper
includes a comparison of single-channel methods for removing interference by
means of the wavelet transform, which is applied in different domains (time,
frequency, and cepstral domains, etc.). Different types of noise and interference
will be discussed (periodic, narrowband, impulse, broadband, etc.) as well as
the choice of a suitable method that would be most effective for the given type
of noise.
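The role of thresholding in the wavelet domain can be illustrated with a minimal sketch. It assumes the orthonormal Haar wavelet, the simplest two-channel perfect-reconstruction filter pair, and soft thresholding of the detail coefficients. The paper's modified transform and its choice of threshold are not reproduced here. The function names, the number of levels and the test signals are hypothetical, and the signal length must be divisible by 2 to the power of `levels`.

```python
import math

def haar_forward(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one Haar level; exact when the coefficients are unmodified."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / s, (a - d) / s])
    return signal

def soft_threshold(coeffs, threshold):
    """Shrink each coefficient towards zero by the threshold."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]

def denoise(signal, threshold, levels=2):
    """Decompose, threshold the detail coefficients, then reconstruct."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append(soft_threshold(detail, threshold))
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx

# With a zero threshold the filter bank reconstructs the input.
clean = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
restored = denoise(clean, threshold=0.0)

# A small alternating disturbance on a constant level is removed.
noisy = [5.0 + 0.1 * (-1) ** i for i in range(8)]
smoothed = denoise(noisy, threshold=0.2)
```

Swapping `soft_threshold` for a hard threshold (keep or zero each coefficient) changes which components survive, which is the sense in which the thresholding type selects the interference to be removed.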
![]()
Speech
spectrum envelope modeling
Robert Vich
ABSTRACT:
A new method is proposed for cepstral speech synthesis. It is based on
peak picking on the spectrum of a speech frame followed by interpolation. The
interpolated spectrum envelope is used for real cepstrum computation of the
vocal tract model impulse response. From the original speech spectrum and the
spectrum envelope the mean fundamental frequency and the amplitude of the
residual signal can be estimated. The cepstral speech production model using
Padé approximation can then be constructed.
In this contribution, the proposed method will be evaluated using real and
synthetic speech signals and compared with the LPC spectrum and a spectrum obtained
by cepstral liftering.
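The first step of the method, peak picking on the spectrum followed by interpolation, can be sketched as below. The abstract does not state which interpolation is used, so linear interpolation between peaks is assumed here. The function names and the toy magnitude spectrum are hypothetical, and the subsequent cepstrum computation is not shown.

```python
def pick_peaks(spectrum):
    """Indices of local maxima; both end bins are kept to span the whole band."""
    peaks = [0]
    for k in range(1, len(spectrum) - 1):
        if spectrum[k] >= spectrum[k - 1] and spectrum[k] > spectrum[k + 1]:
            peaks.append(k)
    peaks.append(len(spectrum) - 1)
    return peaks

def interpolate_envelope(spectrum, peaks):
    """Linearly interpolate the spectrum values between consecutive peaks."""
    envelope = [0.0] * len(spectrum)
    for left, right in zip(peaks, peaks[1:]):
        for k in range(left, right + 1):
            t = (k - left) / (right - left)
            envelope[k] = (1 - t) * spectrum[left] + t * spectrum[right]
    return envelope

# Toy magnitude spectrum with three harmonic-like peaks of decreasing height.
spectrum = [1.0, 4.0, 1.0, 0.5, 3.0, 0.5, 0.2, 2.0, 0.2]
peaks = pick_peaks(spectrum)
envelope = interpolate_envelope(spectrum, peaks)
```

In this toy case the envelope touches the spectrum at the harmonic peaks and bridges the valleys between them, which is the property that makes it a candidate for the vocal tract model.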
![]()
Proteus: a rapid segment
selection algorithm for unit-selection TTS in embedded devices
Jerneja ZGANEC GROS
Alpineon,
Development and Research,
ABSTRACT:
Memory and processing power requirements are important factors when designing
human-computer interfaces for embedded devices. We describe Proteus, an
accelerated unit-selection method, which was designed for an embedded
implementation of a polyphone concatenative TTS system. The results of
objective measurements of computational speed, along with results of subjective
listening tests conceived according to ITU-T recommendations,
are provided at the end of the paper.
![]()
Activities of Speech
Laboratories of BME TMIT
Klara Vicsi
ABSTRACT:
The purpose of this lecture is to give an overview of the research
activities at the Speech Laboratories (Speech Technology Laboratory,
Laboratory of Speech Acoustics, and Telecommunications & Signal Processing
Laboratory) of the Department of Telecommunications and Media Informatics
(TMIT) of the Budapest University of Technology and Economics (BME). The main
activities of these laboratories are grouped into four speech research fields:
speech synthesis and text-to-speech (TTS) systems; automatic speech recognition
(ASR); database design and processing for TTS and ASR; and speech technology
for sound diagnosis and for handicapped people.
We are now going to extend our activity in the direction of nonverbal
communication research, for example the examination of
emotional speech, the collection of emotional speech databases, and the
recognition of six main emotions in speech.