Abstracts of Lectures and Short Contributions:

Age as a Disguise in a Voice Identification Task

Ruth Huntley Bahr

Dept. of Communication Sciences and Disorders, University of South Florida, Tampa, USA

ABSTRACT:

         Listeners have been shown to be quite reliable in identifying speaker age; however, little is known about how speakers may use age as a voice disguise. Previous research has shown that fundamental frequency and speech rate may be powerful predictors of speaker age. Yet, it is not known if talkers manipulate these parameters when attempting to produce an older voice or if they will rely on voice stereotypes to guide their productions. The following investigation is designed to study how manipulation of speaker age might affect speaker identification.

            Four young actors (2 males and 2 females) provided voice samples that represented young, middle-aged, and older voices. Four healthy elders also provided voice samples. In the first experiment, naïve listeners were asked to estimate talker age of both the young adult simulations of vocal age and the “real age” samples. In the second task, listeners were asked to determine if the presented voice pairs were from the same or different speakers.

            Results indicated that talkers were able to produce voices that were perceived as different from their chronological age. However, these vocal disguises were not perceived as being representative of actual older voices. In terms of disguise effectiveness, the listeners did not confuse the age-disguised voice for another individual, but these same auditors had much difficulty matching an age-disguised voice to its owner. These results will be discussed in terms of aging stereotypes and cues to speaker identity. The influence of these findings on the speaker identification process will be described.

 

Virtual talking heads and ambient face-to-face communication

Gérard BAILLY, Frédéric ELISEI and Stephan RAIDT

Institut de la Communication Parlée, 46 av. Félix Viallet, 38031 Grenoble - France

ABSTRACT:

We describe here our first effort to develop a virtual talking head able to engage in situated face-to-face interaction with a human partner. This paper concentrates on the low-level components of this interaction loop and on the cognitive impact that implementing mutual attention and multimodal deixis has on the communication task.

 

Detection of Faces and Recognition of Facial Expressions

Nikolaos  Bourbakis

ITRI, Wright State University, Dayton, Ohio, USA

Abstract

Face detection is the foremost task in building vision-based human-computer interaction systems, in particular in applications such as face recognition, face identification, face tracking, expression recognition and content-based image retrieval. A robust face detection system must be able to detect faces irrespective of illumination, shadows, cluttered backgrounds, facial pose, orientation and facial expression. Many approaches to face detection have been proposed. However, as revealed by the FRVT 2002 tests, face detection in outdoor images with uncontrolled illumination and in images with varied pose (non-frontal profile views) is still a serious problem. In this talk, we describe a Local-Global Graph (LGG) based method for detecting faces and for recognizing facial expressions accurately in real-world image capturing conditions, both indoor and outdoor, with a variety of illuminations (shadows, highlights, non-white lights) and in cluttered backgrounds. The LG Graph embeds both local information (the shape of each facial feature is stored within the local graph at each node) and global information (the topology of the face). The LGG approach for detecting faces with maximum confidence from skin-segmented images is described. The LGG approach presented here emulates human visual perception for face detection: in general, humans first extract the most important facial features, such as the eyes, nose and mouth, and then inter-relate them for face and facial expression representations. Facial expressions are recognized from the detected face images by comparing the LG Expression Graphs with the existing Expression models in the LGG database. The methodology is accurate for the expression models present in the database.

 

Image Chromatic Adaptation for Face Skin Color Detection

Nikolaos Bourbakis and Praveen Kakumanu

ITRI, Wright State University, Dayton, Ohio, USA

Abstract

The goal of image chromatic adaptation is to remove the effect of illumination and to obtain color data that precisely reflects the physical contents of the scene. In this talk we present an approach to image chromatic adaptation using neural networks (NN), with application to detecting and adapting human skin color. The network is trained on randomly chosen color images containing human subjects under various illumination conditions, thereby enabling the model to adapt dynamically to changing illumination. The proposed network directly predicts the illuminant estimate in the image so as to adapt to the human skin color. A comparison of our method with the Gray World, White Patch and Neural Network on White Patch algorithms is presented. We also present our results on detecting skin regions in NN color-corrected test images. The results are promising and suggest a new approach for adapting human skin color using NNs. The skin detection technique presented here is the first part of an integrated methodology and tool for detecting human faces and facial expressions of emotion.
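For readers unfamiliar with the two classical baselines named in this abstract, the following is a minimal sketch of Gray World and White Patch color correction. It is not the authors' neural-network method. The image representation (a flat list of RGB tuples in the range 0 to 255) and the function names are assumptions made only for this illustration.

```python
# Illustrative baselines only; NOT the neural-network approach of the abstract.
# Assumption: an image is a flat list of (r, g, b) tuples with values in [0, 255].

def gray_world(pixels):
    """Gray World: scale each channel so that all channel means become equal."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    # A channel with zero mean is left unscaled to avoid division by zero.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


def white_patch(pixels):
    """White Patch: scale each channel so that its maximum maps to 255."""
    maxima = [max(p[c] for p in pixels) for c in range(3)]
    gains = [255.0 / m if m > 0 else 1.0 for m in maxima]
    return [tuple(p[c] * gains[c] for c in range(3)) for p in pixels]


# Hypothetical two-pixel image with a strong reddish cast.
sample = [(200, 100, 50), (100, 50, 25)]
gw = gray_world(sample)
wp = white_patch(sample)
```

Both baselines estimate the illuminant from simple image statistics (channel means or channel maxima), which is why they tend to struggle when those statistics are dominated by the scene content rather than by the light source.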

 

Nonverbal Communication as a Factor in Linguistic and Cultural Miscommunication

Maja Bratanić, Professor, University of Zagreb, Croatia

ABSTRACT:

The presentation discusses two major assumptions:

- that a great deal of human communication is culturally molded and conditioned

- that people convey meanings not only through language but through various aspects of nonverbal communication as well.

Nonverbal behavior is to a great extent universal but in many ways also marked by culture-specific patterns. Being less obvious than misunderstandings in verbal communication, nonverbally induced miscommunication is far more difficult to detect. Furthermore, the line between verbal and nonverbal components of communication is often hard to delineate precisely.

 

The main categories of nonverbal behavior and their role in communication will be briefly discussed, with a focus on proxemics, the study of the human use of space within the context of culture. The concept of proxemics and its implications will be illustrated with examples from American cultural patterns.

Further examples of culturally-conditioned miscommunication will draw on an aviation-related context.

The presentation will be accompanied by a video A World of Differences: Understanding Cross-Cultural Communication by D. Archer.

Face recognition from 2D still images

Paola Campadelli

Università di Milano, Italy, campadelli@dsi.unimi.it

Abstract:

In the past two decades a great deal of research has been devoted to the development of automatic methods for recognizing people from images; such systems are attractive since this type of identification does not require any interaction with the subject. However, the problem is very difficult, especially when very few assumptions are made about the images to be treated.

 

In this talk the most interesting methods developed for face recognition from still images will be presented and compared. Open problems will be dealt with, and the contribution that 3D information might provide will be discussed.

 

Technology for Non-Verbal Speech Processing

Nick Campbell

ATR Science Labs, Kyoto, Japan, nick@atr.jp

Abstract:

This pair of lectures will focus on the technological needs for the processing of non-verbal speech in a dialogue context.  It is based on an analysis of a very large corpus of spoken interactions captured under extremely natural situations.  The talks will present a model of speech interaction as not only facilitating the exchange of linguistic or propositional information, but also facilitating the display of affect and interpersonal or social relationships.

 

Part I: Speech Synthesis and Discourse Information

 

This talk presents some recent work towards a conversational speech synthesis system for use in interactive dialogues, such as might take place between a person and an information system, a robot, or a speech translation device.  The talk describes several types of response utterances that are currently very difficult to implement using traditional speech synthesis methods, and shows how these non-verbal speech sounds function to provide feedback and status-updates in an interactive discourse.  The lecture will be illustrated with examples of such practic utterances, including laughter and grunts as well as common phrases and idiom, showing how their variety can reveal several types of information about the speaker- (i.e., listener) states.  The proposed model of information exchange through non-verbal speech shows how this feedback from the listener can help the speaker to deliver content more efficiently, and at the same time to be reassured of success in information transmission.

 

Part II:  Towards Recognising Speech Gestures in Discourse

 

This talk describes how the lowest level of information can be processed in a speech signal for annotation of discourse progress and speaker participation status.  In a semi-formal round-table meeting situation there is typically only one main speaker at any given moment, but several participants may be speaking simultaneously, expressing (dis-)agreement, chatting, translating, etc., in addition to the main speaker.  We are currently performing research into technology to process this audio landscape in order to detect the main speaker and to categorise the competing forms of speech.  Several speech gestures such as laughter, agreement, and feedback-responses can be recognised, isolated, and used to determine the progress of the meeting and the degrees and types of participation status among the members present.  This talk will describe the current state of the technology and will present examples of the frequent gestures with descriptions of their typical usage.

 

The Amount of Information on Emotional States Conveyed by the Verbal and Nonverbal Channels: Some Perceptual Data

Anna Esposito

Dipartimento di Psicologia, Seconda Università di Napoli, and  IIASS Italy

ABSTRACT:

In a face-to-face interaction, the addressee exploits both the verbal and nonverbal communication modes to infer the speaker’s emotional state. Is this informational content redundant? Is the amount of information conveyed by each communication mode the same, or does it differ? How much information about the speaker’s emotional state is conveyed by each mode, and is there a preferential communication mode for a given emotional state? This work attempts to answer these questions by evaluating the subjective perception of emotional states in single channels (either visual or auditory) and in the combined channels (visual and auditory). Results show that vocal expressions convey the same amount of information as the combined channels, and that video alone conveys poorer emotional information than audio alone or audio and video together. Interpretations of these results (which do not seem to support the data reported in the literature on the dominance of the visual channel in the perception of emotion) are given in terms of cognitive load, language expertise and dynamicity. A mathematical model inspired by information processing theory is also hypothesized to support the suggested interpretations.

 

Analyzing and modelling verbal and non-verbal communication for talking animated interface agents

David House and Björn Granström

Royal Institute of Technology, Sweden

ABSTRACT:

The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons closely related to the speech acoustics (e.g. movements of the lips and jaw), while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face.  On the other hand, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. The context of much of our research regarding these questions is to be able to create an animated talking agent capable of displaying realistic communicative behaviour and suitable for use in conversational spoken language systems.

The focus of these lectures is to look into the communicative function of the agent, both the capability to increase intelligibility of the spoken interaction and the possibility to make the flow of the dialogue smoother, through different kinds of communicative gestures, such as visual prosodic gestures (e.g. focal accent and emphatic stress) and gestures for different expressive states, turntaking and negative or positive system feedback. We will give some examples of recent work, primarily at KTH, involving the collection and analysis of databases for audiovisual prosody. We will report on methods for the acquisition and modelling of visual and acoustic data, and provide some examples of analysis of e.g. head nods and eyebrow settings related to communicative functions. We will also demonstrate how this analysis can be implemented to generate useful and realistic expressive audiovisual synthesis using a combination of data-driven and rule-based methods.

 

 

Individual Speech Rhythm Variation Within the Plosive Structure of Speech
Eric Keller, IMM, University of Lausanne, Switzerland

Abstract:

The "perceived humanness" of speech revolves centrally around the issues of regularity and variation. Within an utterance's temporal structure, we subjectively experience humans to speak with a certain regularity -- which creates perceived rhythm within speech -- at the same time as we expect them to display variation, mostly for emphasis and to satisfy personal preferences. Synthesized speech that does not exhibit these perceptual qualities is often classified as "robotic" and "unnatural".
 
The search for the objective bases of the perceived regularity in speech is old and has produced less than satisfactory results. In fact in 1977,  Ilse Lehiste, in an extensive review of the issue of isochrony (acoustic  evidence for rhythmicity in speech) came to the conclusion that there  were no direct acoustic correlates of rhythmicity, a view that has formed the consensus for spontaneously produced speech since then, despite a number of further studies performed on the issue.
 
 However, we have data to show that regularity may actually be directly dependent on what might be called the "plosive structure" of the speech chain. If one considers vowel onsets in terms of the suddenness and the relative strength of voice onset, it turns out that human speakers exhibit considerable inter-speaker agreement with respect to the placement of sudden ("strong") vowel onsets, but that this inter-speaker  agreement is gradually reduced as vowel onsets "weaken". The "strength"  or "weakness" of the vowel onset can be determined automatically from  the acoustic signal, and is thus likely to correlate with both motor and  perceptual saliency within the utterance. "Strong" vowel onsets (i.e., those that resemble plosive sounds) appear to set a "frame" for speaking, and between those onsets, we are much freer to choose the  timing values appropriate to the specific semantic and personal context.

 
Current statistical or neural network models for the temporal structuring of speech may thus well be flawed. Currently, we model all aspects of speech timing with rigid, totally predictable statistical structures. Instead, temporal prediction systems should probably provide a set of main "temporal anchor points" within the utterance, and introduce "motivated variation" for the remaining aspects of temporal structure. We will review the different types of psycholinguistic and pragmatic events that can motivate such variation, and we will consider a prediction system that can handle these new requirements.

 


TWO LECTURES ON GESTURE

ADAM KENDON

Abstract:

We begin with the question: what is ‘gesture’? Can we identify, in a theoretically coherent manner, a domain of human action to be called ‘gesture’ which is to be distinguished from other forms of human bodily expression? After answering this question, we shall proceed to look at the problem of how meaning is attributed to gestural actions. Then we shall look at gesture in relation to speech and consider the different ways it may be employed whenever it is used in conjunction with speech within an utterance. Examples will be presented that show that speakers appear to employ speech and gesture as partners in a common enterprise of utterance construction. Gesture, it will be shown, must be seen to be as much the ‘final product’ of a speaker’s utterance as are the speaker’s words. Discussion will then turn to a consideration of how gesture and speech interact semantically as these two modalities create unified expressions. It will be shown that the ways in which gestures contribute to the overall meaning of an utterance are quite diverse. No simple generalizations are possible. Here we shall first consider the contributions gestures may make to the propositional content of utterances. This will be followed by a discussion of the ‘meta-discursive’ or pragmatic functions that gestures can also frequently be shown to have. Throughout these lectures numerous examples will be presented, drawn from video-recordings of conversations in many different circumstances in southern Italy (especially in the provinces of Naples and Salerno) and in central England.

Recommended reading: Adam Kendon, Gesture: Visible Action as Utterance. Cambridge University Press, 2004. Especially Chapters 7-13.

 

A Methodological and Theoretical Framework for the Study of Psychological and Biometrical Characteristics in Verbal and Nonverbal Communication

Dominic W. Massaro, Ph.D

Perceptual Science Laboratory, Department of Psychology, University of California, Santa Cruz, CA 95060, U.S.A.

1-831-459-2330, FAX 1-831-459-3519, massaro@fuzzy.ucsc.edu, http://mambo.ucsc.edu/psl/dwm/

Abstract:

The goal of the lectures is to provide an empirical and theoretical overview of a paradigm for inquiry into the Psychological and Biometrical Characteristics in Verbal and Nonverbal Communication. A persistent theme of our approach is that humans are influenced by many different sources of information, including so-called bottom-up and top-down sources. Understanding spoken language, for example, is constrained by a variety of auditory, visual, and gestural cues, as well as lexical, semantic, syntactic, and pragmatic constraints. We will first present a theoretical framework for language processing, and the methodological implications of this framework. We will then review experimental evidence that supports this framework and is inconsistent with other frameworks. We will indicate some limitations in other current studies, and suggest changes for improvement. Finally, we will extend the analyses to cross-linguistic studies of speech perception, as well as to studies of emotion.

 

Research questions for psycholinguists and speech and reading scientists include the nature of the sources of information; how each source is evaluated and represented; how the multiple sources are treated; whether or not the sources are integrated; the nature of the integration process; how decisions are made; and the time course of processing. Research in a variety of domains and tasks supports the conclusions (for a summary see Massaro, 1998) that a) perceivers have continuous rather than categorical information from each of these sources; b) each source is evaluated with respect to the degree of support for each meaningful alternative; c) each source is treated independently of other sources; d) the sources are integrated to give an overall degree of support for each alternative; e) decisions are made with respect to the relative goodness of match among the viable alternatives; f) evaluation, integration, and decision are necessarily successive but overlapping stages of processing; and g) cross-talk among the sources of information is minimal. The fuzzy logical model of perception (FLMP; Massaro, 1998) will be described and contrasted with other models of speech and emotion processing. These models will be tested against speech perception experiments involving cross-linguistic comparisons and experiments on emotion and gesture perception.
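Conclusions b) through e) can be illustrated with a short sketch. It assumes the multiplicative integration rule and the relative goodness decision rule with which the FLMP is usually presented; the abstract itself does not spell out the formula, so the rule, the function name and all numerical support values below are assumptions for illustration only.

```python
# A minimal sketch of FLMP-style evaluation, integration and decision.
# Assumptions: multiplicative integration, relative goodness rule,
# and made-up support values for two hypothetical alternatives.

def flmp(sources):
    """Return response probabilities for each alternative.

    sources: list of dicts, one per information source, each mapping every
    alternative to a continuous degree of support in [0, 1].
    """
    alternatives = list(sources[0].keys())
    overall = {}
    for alt in alternatives:
        support = 1.0
        for source in sources:
            # Each source is evaluated independently, then integrated.
            support *= source[alt]
        overall[alt] = support
    total = sum(overall.values())
    if total == 0:
        raise ValueError("no alternative receives any support")
    # Decision: relative goodness of match among the alternatives.
    return {alt: s / total for alt, s in overall.items()}


# Hypothetical bimodal speech example: the auditory source favours /ba/,
# the visual source favours /da/.
auditory = {"ba": 0.8, "da": 0.2}
visual = {"ba": 0.3, "da": 0.7}
probabilities = flmp([auditory, visual])
```

Under these assumptions an ambiguous source (support near 0.5 for every alternative) has little influence on the outcome, while a clear source dominates it.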

Abstract of McNeill lectures

David McNeill, University of Chicago, USA, dmcneill@uchicago.edu

 

1. Introduction to gesture study – how and why it is done.

2. Some basic facts of speech-synchronized gestures.

3. The growth point, context, catchment, imagery-language dialectic, dynamic and static dimensions and how they relate.

4. Gesture in social interaction.

5. Gesture and culture, linked in non-obvious ways.

6. Gesture and brain, relevance to the origin of language.

Multimodal expressive ECAs

Catherine Pelachaud

Université de Paris 8, France, c.pelachaud@iut.univ-paris8.fr

Abstract:
Embodied Conversational Agents (ECAs) are human-like entities capable of communicating with other ECAs and/or users. They exhibit synchronized verbal and nonverbal behaviors (facial expression, gesture, body movement and gaze). During these lectures we will present an ECA system architecture. We will introduce the taxonomy of communicative functions developed by Isabella Poggi. Following this taxonomy, each communicative function is represented as a pair in which the first element corresponds to the meaning of the communicative function and the second element is a description of the signal used to transmit this meaning. A representation language, the Affective Presentation Markup Language (APML), has been developed. It is used to drive the animation of the agent and ensures synchrony between the multimodal signals. Behaviors are defined not only by the signals that compose them but also by how they are displayed. We will present an expressivity model in which six parameters have been designed to change gesture and face expressivity.

Intonation, Accent and Personal Traits

Michelina Savino

Dept. of Psychology, University of Bari, ITALY

m.savino@psico.uniba.it

Abstract:

Speaking with an accent reveals the sociolinguistic background of locutors, and it is a widely shared belief that intonation plays a crucial role in characterising regional varieties of spoken languages. This is attested by the large amount of descriptive studies in the literature, and also from the perceptual point of view a number of experiments have been carried out to test the hypothesis that language varieties can be identified by pitch information alone (even though it is currently not clear to what extent segmental information can also play a role in such identification task).

An exemplary case is that of Italian speakers, for whom the spoken language is a reliable way of identifying sociolinguistic traits in verbal interactions, as they always speak with an accent, in both formal and informal situations. This is a consequence of the particular status of Italian with respect to other languages: for historical reasons, the process of standardisation has been successfully achieved for the written form but not for the spoken language, which is presently characterised by quite strong regional accents. In fact, standard Italian has never been taught at any level of the Italian education system, and its use has been restricted to a small number of professional speakers and actors, and therefore to very specific contexts.

In this lecture, the role of intonation in conveying information on speakers’ sociolinguistic background as a personal trait will be discussed, based mainly on examples from Italian varieties.

 

Blind Signal Pre-Processing for MPEG-7 based Multimedia Metadata applications: an Assisted Living use-case

Giovanni Tummarello, Stefano Squartini, Francesco Piazza

Università Politecnica delle Marche, Italy

ABSTRACT:

Metadata extracted from multimedia or live sensing is set to play a major role in any intelligent and multimodal interaction between humans and computers. Furthermore, it is generally required that such metadata be structured and encoded according to well agreed standards. This is fundamental to enable interoperability and to create complex applications as a mesh of heterogeneous services and components. To this end, the MPEG-7 standard for dealing with multimedia metadata and the tools developed within the Semantic Web initiative today provide the basic framework. Their application to real-world problems, however, is made problematic by the fact that the data are often captured under difficult live conditions. It is therefore of primary importance to enhance the quality of the observable signals before the metadata extraction algorithms are employed. In particular, for audio signals, it is important to perform separation and deconvolution of signals captured in real environments and in blind conditions. In this work a full-featured, real-world, multimedia-metadata assisted living scenario is constructed using a combination of Blind Signal Processing and MPEG-7 based metadata techniques. In this example, an array of microphones captures speech and audio signals and, thanks to MPEG-7 technologies, the user can select multimedia content to be played.

 

Visual Phrasems and Pragmaphrasems in English, Polish and Croatian

Neda Pintaric

University of Zagreb, Croatia

ABSTRACT:

The author writes about the etymological and semantic meaning of the eye as the organ with the greatest capacity among all human organs. Eyes, ears and tactile receptors are the main receptors in human communication. The author claims that the nonverbal code is a pracode based on these receptors, and that it therefore developed earlier than the verbal code.

            The nonverbal code is a multicode consisting of kinetic, tacesic, deictic, proxemic and prosodic signs which we use consciously and unconsciously. Our unconscious information cannot be hidden, and the other person in the communication can read it in our eyes.

            In various cultures people operate with visual culturemes, such as the custom of eye contact, which can mean sincerity (e.g. among Croats) or sexual affection (e.g. among Poles).

            The main part of the paper consists of a lexical and phrasematic analysis of pragmatic items called pragmemes and phrasopragmemes. The author compares different linguistic systems using examples of English, Polish and Croatian eye-signs.

 

Videocaptured Verbal and Nonverbal Foreign Language Teacher Feedback

Leticia Vicente-Rasoamalala

Aichi Prefectural University, JAPAN

ABSTRACT:

Videotaping for research purposes in Second Language Acquisition (SLA) classroom contexts is still quite experimental and a minority practice. Apart from the difficulty of obtaining the consent of the participants (i.e. the school authorities, the teachers and the parents of the students) for videorecording classroom interactions, there are no well-defined research lines or instruments for working with the data obtained. Specifically, the present poster will highlight different issues concerning the collection, identification and analysis of videocaptured verbal and nonverbal foreign language teacher feedback in the classroom context.

The focus in this area is on identifying, from the collected database, the teachers’ strategies that appear to be more successful in dealing with L2 learner oral output containing deviant forms. Over the last decade, a number of SLA studies influenced by Long’s Interaction Hypothesis (1996) have studied in detail the negative implicit forms of teacher feedback, under the assumption that they might be more beneficial for acquisition. Such works have adopted diverse qualitative and quantitative paradigms, for instance elements of ethnographic research, Conversational Analysis and the Neo-Vygotskian perspective (Vygotsky, 1968). The ultimate goal of many studies in this area is to find the instructional sequences that might optimize foreign language teaching and learning. Nevertheless, there are some shortcomings relating to the lack of consensus on the existing schemes for analyzing corpora and on the types of verbal and nonverbal annotations.

Originally, most approaches in classroom discourse research built analytic frameworks dealing almost exclusively with audiotaped linguistic data and designed to build didactic models. Significantly, most instruments have neglected the audio-visual data which might be captured in videorecordings. Thanks to new technologies, the notation of teacher gestures and of the manipulation of tools is being incorporated in some classroom studies. Some works have suggested that nonverbal elements might enhance verbal feedback and often regulate classroom exchanges among teachers and learners. Additionally, those outcomes suggest that extensive use of video recordings is necessary for future comprehensive studies of classroom discourse.

References

 

1. Block, D. (2003). The Social Turn in Second Language Acquisition. Edinburgh University Press.

2. Gardner, R. & Wagner, J. (Eds.) (2004). Second Language Conversations. London: Continuum.

3. Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In Bhatia and Ritchie (Eds.), Handbook of Second Language Acquisition. San Diego: Academic Press, Inc.

4. Lyster, R. (2001). Negotiation of form, recasts and explicit correction in relation to error types and learner repair in immersion classrooms. Language Learning, 51, 265-301.

5. Vicente-Rasoamalala, L. (2006). Elementos No Verbales en la Retroalimentación del Docente de L2. The Journal of the Faculty of Foreign Studies, Aichi Prefectural University, 38, 159-188.

 

Effectiveness of Short-Term Prosodic Features for Speaker Verification

Iker Luengo, Eva Navas, Inmaculada Hernáez

University of the Basque Country, Spain, ikerl@bips.bi.ehu.es

Abstract:

In this work a traditional MFCC-based system is combined with a prosody-based one to determine whether simple short-term prosodic information is useful for improving current state-of-the-art automatic speaker verification (ASV). Results do not show significant improvement.

 

 

A Partially Observable Markov Decision Process approach to Affective Dialogue Modeling

Bui Huu Trung

Human Media Interaction, Department of Computer Science, Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, P.O. Box 217, 7500 AE Enschede, the Netherlands

Abstract:

We propose a novel approach to developing a dialogue model that is able to take into account some aspects of the user's emotional state and to act appropriately. The dialogue model uses a Partially Observable Markov Decision Process approach, with observations composed of the observed user's emotional state and action. A simple route navigation example is explained to clarify our approach, and preliminary results and future plans are briefly discussed.
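The core of any POMDP-based dialogue model is the belief update, which revises the probability assigned to each hidden user state after a system action and a new observation. The sketch below shows the standard update rule only; the state names, action name, observation names and all probabilities are hypothetical and are not taken from the authors' route navigation example.

```python
# A minimal sketch of a standard POMDP belief update.
# All states, actions, observations and probabilities are illustrative assumptions.

def belief_update(belief, action, observation, transition, observe):
    """Return the new belief b'(s') proportional to
    observe[action][s'][observation] * sum_s transition[action][s][s'] * belief[s].
    """
    states = list(belief)
    unnormalised = {}
    for next_state in states:
        predicted = sum(
            transition[action][state][next_state] * belief[state]
            for state in states
        )
        unnormalised[next_state] = observe[action][next_state][observation] * predicted
    norm = sum(unnormalised.values())
    if norm == 0:
        raise ValueError("observation has zero probability under this belief")
    return {state: p / norm for state, p in unnormalised.items()}


# Hypothetical hidden user emotional states and a single system action.
transition = {
    "ask_destination": {
        "calm": {"calm": 0.9, "stressed": 0.1},
        "stressed": {"calm": 0.2, "stressed": 0.8},
    }
}
observe = {
    "ask_destination": {
        "calm": {"neutral_voice": 0.8, "tense_voice": 0.2},
        "stressed": {"neutral_voice": 0.3, "tense_voice": 0.7},
    }
}
prior = {"calm": 0.5, "stressed": 0.5}
posterior = belief_update(prior, "ask_destination", "tense_voice", transition, observe)
```

Because the system never observes the user's emotional state directly, a policy would choose its next action from this posterior belief rather than from a single assumed state.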

 

A Systemic Approach to Enhance Writing, Analysis, and Presentation Skills

Aly N. El-Bahrawy, Professor

Faculty of Engineering, Ain Shams University, Cairo, Egypt

Abstract:

The paper discusses a systemic view of technical communication for researchers and professionals, which includes writing, analysis and presentation. Each of the three components has sub-components related to conveying the technical message clearly to the receiver. The first component, ‘Technical writing’, is related to language rules in general, and technical writing guidelines in particular. The second component, ‘Data analysis’, is related to statistical and database principles in addition to graphics basics. The third component, ‘Professional Presentation’, is related to organization of material, audio-visual equipment, and the delivery of the speaker, which includes voice, hand gestures, facial expressions, etc. Another encompassing factor is the use of computers to help the three components send their respective messages clearly. As an example, the Microsoft programs WORD, EXCEL, and PowerPoint are powerful tools to write, analyze and present the technical message. In WORD, features like formatting, language tools, and automatic generation of reference tables are very appealing. In EXCEL, data analysis and graphics are very elaborate. In PowerPoint, organization and animation tools can be used to enhance the presentation significantly. The paper presents examples of the use of such an approach to run successful training courses for engineers and researchers. Finally, the paper stresses the importance of the three components in forming the technical and academic character of professionals.

 

Can words prime gestures?

Paolo Bernardis

Università di Bologna, Scuola superiore di Studi Umanistici. Università di Parma, Dipartimento di Neuroscienze

Abstract:

Evidence for language and action relationships was recently highlighted in both behavioral (Glenberg, 2002) and neurophysiological research (Gentilucci, 2001), thus reinforcing the already well-established link between language and gestures for communicative purposes (Bellugi, 1979; 1987; McNeill, 1992; 2000). However, at present there is no clear evidence of the direct interaction of the two systems.

The aim of this study was precisely to check for evidence of this interaction using the priming effect, i.e. to investigate whether words could prime the recognition of a gesture with the same meaning. The participants were presented with a word and then asked to recognize a gesture having either the same or a different meaning. The words chosen as primes were of different kinds: 1) a noun referring to a simple object with (1a) no specific action required (e.g., ‘clock’) or (1b) a specific action required (e.g., ‘gun’, to shoot); 2) a verb referring to (2a) a direct simple action (e.g., ‘knock’) or (2b) an action to be performed with a tool (e.g., ‘write’). The gestures presented were video-clips showing the upper half-body of an actor, which was blurred, performing the gestures with his arms and hands. Response times and errors were recorded.

Two main results were obtained. The first was a clear priming effect of meaning (i.e., a semantic priming effect of word on gesture). The second result showed a greater priming effect when both words and gestures referred to objects and actions requiring tools.

 

The study of similarities in learning foreign languages

Daboveanu Diana-Cristina, Niculae Carmen – Eufrusina

Romania, Bucharest, Sector 3, 37, Mircea-Voda Boulevard, Block M29, Scara D, 3rd Floor, Flat 113

Abstract:

Today student mobility has already become a reality. It is supported by numerous education programmes at national and international levels and in particular within the framework of EU-funded programmes. There is, however, a clear need for support programmes to assist exchange students in preparing for their studies. The EUROMOBIL project (72139-CP-2-2000-1-FI-L2; the development of a multimedia language learning and information programme on CD-ROM for DE, EN, HU and FI) was started in 1999 with the aim of developing a self-study course which would enable exchange students to prepare, both in terms of the host language and knowledge of the culture, for their visit to universities in DE, UK, HU and FI. The project was later expanded to CZ, FR, PL, PT and RO.

In a needs analysis at the beginning of the project, differences in the requirements for studies abroad were noted. Starting from this needs analysis and using rank distance as a measure of similarity, our goal was to investigate how closely related the problems that appear when studying a foreign language are, and to extend this result to examine the differences and similarities between the cultures and languages that are part of the project.
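To make the similarity measure concrete, the sketch below computes rank distance between two rankings under one common formulation: each item receives an order value (highest for the top-ranked item, 0 if the item is absent from a ranking), and the distance is the sum of absolute differences of these values over all items. The abstract does not specify which variant was used, and the problem categories in the example are hypothetical.

```python
def rank_distance(sigma, tau):
    """Rank distance between two rankings given as lists, best item first.

    Assumes each ranking contains no duplicate items. An item missing
    from a ranking contributes an order value of 0 for that ranking.
    """
    def orders(ranking):
        n = len(ranking)
        # First item gets order n, last item gets order 1.
        return {item: n - pos for pos, item in enumerate(ranking)}

    ord_s, ord_t = orders(sigma), orders(tau)
    items = set(ord_s) | set(ord_t)
    return sum(abs(ord_s.get(x, 0) - ord_t.get(x, 0)) for x in items)


# Hypothetical rankings of language-learning difficulties for two host languages.
difficulties_a = ["vocabulary", "pronunciation", "grammar"]
difficulties_b = ["grammar", "pronunciation", "vocabulary"]
print(rank_distance(difficulties_a, difficulties_b))
```

Identical rankings give a distance of 0, and the value grows as items move further apart or appear in only one of the two rankings.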

 

Tongue motor cortex excitability is modulated by the observation of the type of grasp action

Dalla Volta R¹, Bernardis P¹, Buonocore A¹, Sato M¹, Palumbo D¹, Gentilucci M¹

¹Dept. Neuroscience, University of Parma, Italy, gentiluc@unipr.it

Abstract:

Voice spectrum and lip kinematics during pronunciation of syllables are affected by the simultaneous execution or observation of transitive actions, like the grasp of different objects, according to the type of hand grip involved. The study aimed to verify whether, in humans, the observation of hand grasping actions directed at objects requiring different types of grip affects tongue motor cortex excitability. We recorded motor evoked potentials (MEPs) from the tongue of 16 right-handed healthy subjects after delivering single pulses of transcranial magnetic stimulation (TMS) over the left tongue motor cortex. During stimulation, the participants looked at a PC monitor showing video-clips of either a hand grasping fruits of different sizes, which required different types of grip, or the same fruits alone. The syllable DA appeared on the fruits in both cases. Power grip of large fruits was associated with significantly greater MEPs than precision grip of small fruits. In a control experiment we presented either different tools approaching geometric solids or the same solids alone. No MEP modulation was observed either when the fruits were presented alone or in the control experiment. We conclude that observation of different biological hand actions directed at edible objects specifically modulates tongue motor cortex excitability. These data support the hypothesis that a motor resonance system is activated by hand action observation. This resonant circuit sends double motor commands to both hand and mouth.

 

Machine Translation Evaluation: A Case Study of Croatian-English and Russian-English MT Systems

Ivana Simeon

Department of Linguistics, Faculty of Philosophy, University of Zagreb, Ivana Lučića 3, HR-10000 Zagreb, Croatia, E-mail: isimeon@ffzg.hr

Abstract:

From the earliest days of machine translation (MT), evaluation has been an inherent and significant part of efforts invested into machine translation research. In this paper, an overview of the history of MT evaluation is presented, with emphasis on one of the most comprehensive MT evaluation projects, undertaken in the 1960s, namely the Automatic Language Processing Advisory Committee Report.

Furthermore, strategies and problems pertaining to MT evaluation are discussed, with emphasis on the distinction between subjective criteria, such as comprehensibility, and the objective, quantifiable criteria, such as error quantification and analysis.

Within the practical part of the paper, the results of testing four MT systems – one for the language pair Croatian-English, and three for the language pair Russian-English – are shown. The systems were tested on three textual samples belonging to general, fictional and scientific genres. The analysis of the results included a comprehensibility poll which included five informants (native or proficient target language speakers) for each target language, as well as quantification of errors and error type assessment across genres and across individual MT systems. Finally, cumulative results are given for each MT system and for each language pair.

In conclusion, recent developments in the field of MT evaluation are presented, including automatic evaluation methods such as IBM’s BLEU measure and the NIST metric.

 

Using the Wavelet Transform in Real-time Digital Signal Processing

Jan Vlach, Přinosil Jiří

Department of Telecommunications, Faculty of Electrical Engineering and Communication, Brno University of Technology, Purkynova 118, 612 00 Brno, Czech Republic,

Abstract:

The new method of segmented wavelet transform (SegWT) makes it possible to compute the discrete-time wavelet transform of a signal exactly, segment by segment. This means that the method can be used for wavelet-type processing of a signal in "real time", or when a long signal has to be processed (not necessarily in real time) but there is insufficient memory capacity for it (for example in signal processors). The signal can then be processed part by part with low memory costs. The method is suitable for general use wherever a signal has to be processed via modification of its wavelet coefficients (e.g. signal denoising, compression, speech segmentation, music processing, alternative modulation techniques for xDSL systems). It is also possible to use SegWT in wavelet processing (e.g. compression, selective area processing) of large images. In the paper, the principle of the forward segmented wavelet transform is described.
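The idea of obtaining exactly the same coefficients segment by segment can be illustrated with the simplest possible case. The sketch below is not the SegWT algorithm itself: it uses the Haar wavelet, whose two-tap filters let aligned, non-overlapping segments be transformed independently. For longer filters, neighbouring segments have to share samples, which is the part a method such as SegWT must handle and which is not shown here. All function names are hypothetical.

```python
import math


def haar_step(x):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_dwt(x, levels):
    """Return [d1, d2, ..., dL, aL] for a signal whose length is a multiple of 2**levels."""
    coeffs = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs


def segmented_haar_dwt(x, levels, seg_len):
    """Transform the signal segment by segment and concatenate each band."""
    block = 2 ** levels
    if seg_len % block != 0 or len(x) % block != 0:
        raise ValueError("segment and signal lengths must be multiples of 2**levels")
    bands = [[] for _ in range(levels + 1)]
    for start in range(0, len(x), seg_len):
        # Only one segment needs to be held in memory at a time.
        seg_coeffs = haar_dwt(x[start:start + seg_len], levels)
        for band, part in zip(bands, seg_coeffs):
            band.extend(part)
    return bands


signal = [float((7 * i) % 5) for i in range(16)]
whole = haar_dwt(signal, levels=2)
pieces = segmented_haar_dwt(signal, levels=2, seg_len=8)
same = all(
    abs(a - b) < 1e-12
    for band_w, band_p in zip(whole, pieces)
    for a, b in zip(band_w, band_p)
)
print(same)
```

Because each Haar coefficient at level j depends only on an aligned block of 2**j samples, choosing segment lengths that are multiples of 2**levels keeps every block inside a single segment.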

 

The Integrative and Structuring Function of Speech in Face-to-Face Communication from the Perspective of Human-Centered Linguistics

Krzysztof Korżyk

Jagiellonian University Kraków, Poland

Abstract:

This paper illustrates the need for study of the interdependencies between verbal and nonverbal behavior treated as a unified form of activity, manifesting itself in face-to-face communication. Invoking the principles of human-centered linguistics (see Yngve 1996, 2000), the author treats verbal communication not as something passed on via language, but rather as something to which language merely contributes. One of the consequences of such an approach to this issue is a reassignment of focus. Rather than attention being drawn to linguistic phenomena, the spotlight is on the communicative properties of the interlocutors, creatively utilizing various elements of the interactional “symbolic space.”

            With reference to the above, this text presents a realistic account of the interpretational activity taking place between communicating subjects. The action is perceived as a function of choices correlating verbal, prosodic, and kinesthetic signs and signals. Concurrently, taking the pragmatic and interactional aspects of these multimodal choices under consideration, the author discusses typical situations in which the structuring role of speech is particularly evident. Light will also be shed on the crucial interconnections between the above-mentioned systems of signs and signals, as well as on the advantages stemming from an integrated modeling of communicative phenomena.

1.        Yngve V.H. (1996) From Grammar to Science. New Foundations for General Linguistics, Amsterdam: John Benjamins.

2.       Yngve V.H. and Wąsik Z. (2000) Exploring the Domain of Human-Centered Linguistics from a Hard-Science Perspective (Workshop), Poznań: Motivex.

 

Research on Speech Synthesis and Speech Recognition of the Croatian Language at the Faculty of Humanities and Social Sciences

Lazic Nikolaj

Faculty of Humanities and Social Sciences, University of Zagreb, Ivana Lucica 3, 10000 Zagreb, Croatia, nlazic@ffzg.hr

Abstract:

Speech synthesis and speech recognition are processes that need a multidisciplinary approach. The Faculty of Humanities and Social Sciences in Zagreb has departments that can aid in the processes of synthesis and recognition, namely Information Sciences, Linguistics, Phonetics, and Croatian Language.

One of the problems in the Croatian language is accurate word accent, which distinguishes orthographically identical but differently sounding words. Proper word accentuation is therefore essential for accurate speech synthesis of Croatian. Different word accentuations may be completely wrong or "dialectally coloured". In speech recognition systems, accurate recognition of word accent is needed to preserve meaning for any later machine translation.

Different approaches to speech synthesis may ease or complicate the production of correct-sounding synthesized speech. Speech synthesis based on concatenation needs all variants of word accents to be present in the language repertoire used for synthesis, but it makes synthesized speech more natural. Formant synthesis, on the other hand, can produce whatever accentuation is needed, but it is harder to describe all the acoustically relevant sound segments for speech synthesis.

 

Intercultural Differences in Vocal Communication of Emotions: An Experimental Comparison Between Chinese and Italian Young Adults

Fabrizia Mantovani, Luigi Anolli, Lei Wang, Alessandro De Toni

CESCOM, Centre for Studies in Communication Sciences, University of Milan-Bicocca, P.za Ateneo Nuovo 1, 20123 Milan, Italy, mantovani@unimib.it

ABSTRACT:

The poster presents an experimental study comparing the vocal communication of emotions between Chinese and Italian young adults. The main goal of the study is to investigate whether:

(a)     the vocal expression of eight emotions (joy, sadness, anger, fear, contempt, pride, guilt, shame) is characterized by distinct patterns of paralinguistic features;

(b)     the vocal patterns of emotional expressions - produced in reaction to comparable eliciting situations - differ between members of two different cultures (Chinese and Italian);

(c)     specific cultural configurations exist in vocal expression of emotion.

Forty-eight undergraduates (29 Chinese and 19 Italian) were asked to read aloud short stories inducing different emotions via a scenario approach. The short stories had been prepared and validated in a preliminary phase: in each text a standard sentence was included to allow subsequent acoustic comparisons. Acoustic analyses were carried out with the Computerized Speech Lab (CSL) 4300B. Different acoustic parameters were considered, referring to time (total duration, partial duration, duration of pauses, speech rate and articulation rate), fundamental frequency (mean, standard deviation, range, minimum and maximum of F0) and intensity (mean, standard deviation, range, minimum and maximum).

Results from statistical analyses confirmed the importance of vocal production in generating distinctive emotional patterns, as well as the presence of both similarities and differences between the vocal emotional patterns of Chinese versus Italian participants. The theoretical implications of these findings will be discussed.

 

Non-verbal Interaction and Ambient Entertainment

Anton Nijholt, Dennis Reidsma, and others

University of Twente, Department of Computer Science, PO Box 217, 7500 AE Enschede, The Netherlands, anijholt@cs.utwente.nl

Abstract:

In future Ambient Intelligence (AmI) environments we assume intelligence embedded in the environment, its objects (furniture, mobile robots) and its virtual, sometimes visualized agents (virtual humans). These environments support the human inhabitants or visitors of these environments in their activities and interactions by perceiving them through their sensors (proximity sensors, cameras, microphones, etc.). Support can be reactive, but also and more importantly, pro-active, anticipating the needs of the inhabitants and visitors.

Health, recreation, sports and playing games are among these needs. Sensors in these environments can detect and interpret bodily activity and can give multimedia feedback to invite, stimulate, guide and advise on bodily activity. Rather than aiming at improving user task efficiency, in the environments we investigate the aim is to improve physical and mental health (well-being) through exercise and through play. Exercises can be done in order to improve fitness, to prevent certain injuries (e.g., RSI), or to recover from an accident (e.g., physiotherapy exercises). Other exercises may aim at improving certain capabilities related to a profession (ballet, etc.), some kind of recreation (juggling, etc.), or sports (fencing, etc.). Fun, just fun, achieved from interaction (e.g. dancing or physical gaming) can be another aim of such environments.

In this presentation we look at our research on bodily and gestural interaction with environments equipped with some simple sensors (cameras, microphones, dance pads), some application-dependent intelligence (allowing reactive and pro-active activity), and an embodied virtual agent employed in the display of reactive and pro-active activity. Dance, music, and associated movements in human and virtual agents are the main modalities that are used in our environmental installations.

 

Towards an all-inclusive cross-media relations framework

Katerina Pastra

Language Technology Applications Department, Institute for Language and Speech Processing, kpastra@ilsp.gr

Abstract:

While there is a growing demand for developing Intelligent Multimedia Interfaces and Systems, one still struggles to find a descriptive framework of how different media and modalities interact with one another. The significance of such a framework becomes evident when one attempts to build multimedia systems or intelligent agents in which multimedia content integration decisions have to be made. In this paper, we identify two important parameters in developing such a framework: the use of multiple and clearly stated criteria for defining interaction relations across media, and the integration of findings from the analyses of the interaction of as many different media-pairs as possible. In correlating our own corpus-based work on image-language interaction with existing work on image-language and gesture-language interaction, we identify three such criteria and corresponding interaction relations. We further suggest a way of validating the applicability and expressiveness of these interaction relations, which involves a set of simple metrics for computing them in a multimedia corpus. In this way, we lay the foundations for a descriptive framework that will be closer to an "all-modalities" and "all-perspectives" inclusive one.

“Unseen gestures” and the Mind of the Speaker: An analysis of co-verbal gestures in map-task activities

Nicla Rossini

Dipartimento di Linguistica Teorica e Applicata, Università degli Studi di Pavia, tattvamasi@libero.it

Abstract:

The analysis of co-verbal gestures in map-task activities is particularly interesting for several reasons: on the one hand, the speaker is engaged in a collaborative task with an interlocutor; on the other hand, the task itself is designed in order to place a cognitive demand on both the speaker and the receiver, who are not visible to one another. The cognitive effort in question implies the activation of different capabilities, such as self-orientation in space, planning (which can also be considered a self-orientation task concerning the capability of organising successful communicative strategies for the solution of a given problem), and communication in "unnatural" conditions.

The co-verbal gestures performed during such a task are quantitatively and qualitatively different from those performed in normal conditions, and can provide information about the Mind of the Speaker (Poggi & Magno Caldognetto, 1997). In particular, the recursive pattern of some metaphors (McNeill, 1992 and following) can be interpreted as a reliable index of the communicative strategy adopted by the speaker: recurrent metaphors indicating the adoption of a plan, its abandonment, or its confirmation will be shown and analysed. Moreover, cases of gestures indicating the opposition between Given and New (Halliday, 1985), and other basic psycholinguistic phenomena centred on collaborative speech acts, such as awaiting feedback, frustration, wrong-footing, etc., will be discussed and compared with the co-verbal gesticulation of subjects intent on a face-to-face  interaction.

 

On the analysis of fundamental frequency control characteristics of nonverbal utterances and its application to communicative prosody generation

Ke Li, Yoko Greenberg and Yoshinori Sagisaka

Waseda Univ., GITI, 29-7 building, 1-3-10 Nishi-Waseda, Shinjuku-ku, Tokyo 169-0051, Japan, yoshinori.sagiska@atr.jp

Abstract:

Aiming at communicative speech generation, control characteristics of nonverbal utterances were analyzed. From the analyses using an F0 generation model, utterance-specific control characteristics were observed. Their prosodic characteristics are linked to multi-dimensional vectors expressing the listener’s subjective impression. A new quantitative prosody control scheme is proposed to test the validity of F0 generation, and its effectiveness is confirmed by perceptual evaluation tests.

 

Face recognition using sparse meshes: a promising approach

Samokhval Vladimir

United Institute of Informatics Problems, National Academy of Sciences of Belarus, Surganova 6, 220012 Minsk, Belarus

Abstract:

Sparse meshes are considered as a suitable tool for constructing classifiers in human face recognition problems. Their use for face modeling is based on a number of useful properties of this data representation and on additional opportunities to increase the rate of correct recognition. In particular, representing the area of interest as a 2.5-dimensional mesh potentially makes it possible to obtain missing views of the face image, which substantially increases the recognition rate, as is shown using PCA and discriminant analysis methods. The PCA method requires a frontal image of the face to work successfully, and this can be obtained by rotating the mesh in depth by a certain angle. When discriminant filters are synthesized, their performance depends directly on the size of the training sample. In this case mesh rotations make it possible to obtain additional views of the facial image and to expand the training set. In addition, the degree of mesh sparseness is itself a parameter that influences the classification results, the amount of computation and, ultimately, the speed of the system. In this research we consider some aspects of the performance of PCA and discriminant analysis methods whose input data are 2.5-dimensional face models in the form of sparse meshes, and we also establish the degree of mesh sparseness that gives optimum performance of the recognition system.

Verbal and nonverbal resources in constructing the topical flow in early interaction in picture book environment

Sari Karjalainen

Department of Speech Sciences, Siltavuorenpenger 20 A/F, P. O. Box 9, 00014 University of Helsinki, Finland, sari.karjalainen@helsinki.fi

Abstract:

Gestures, particularly pointing, will be analyzed in the process of topical co-operation between adult and child at the preverbal stage, with special attention to how these resources are used in making shifts within the topic when looking at picture books. The method of the study is qualitative, data-driven conversation analysis (CA). The database is composed of videotaped naturalistic picture book conversations between a child (aged 1 to 2 years) and an adult. Video data from 6 Finnish families are transcribed and analyzed, and the events are presented with different text-based transcriptions, supplemented by a computer-aided visualization of these annotations. The micro-analysis focuses on the sequential organization of the participants’ verbal and nonverbal actions (pointing and other gestures, gaze, vocalizations and the adult's speech) and, especially, on the sequences where the topic is extended from the referent in the picture book to noticeable or non-noticeable referents outside the book. The child’s acts, for example pointing at the window, get different meanings in different sequential contexts. Different kinds of topical sequences will be presented, focusing on how the verbal and nonverbal resources used in referring to picture referents and related referents outside the picture book reveal the use of already existing shared knowledge or the construction of shared knowledge between the participants.

 

Visual Search, Baggage Screening, and the Assessment of Mental Workload through the Analysis of Eye-Movements

Michela Terenzi & Francesco Di Nocera

Cognitive Ergonomics Laboratory, Department of Psychology, University of Rome “La Sapienza”

Abstract:

In light of the events of September 11, 2001, many efforts have been made to support security officers in identifying potential threats. Most of them are technological aids used by airport personnel when performing security operations such as baggage screening. However, automation support is known to alleviate some tasks and, at the same time, to create new forms of workload. For this reason, Human Factors / Ergonomics researchers have also addressed the vast range of technical challenges that, because of these acts of terrorism, now face society (Hancock & Hart, 2002). One of the most important issues in this field is the analysis of the operators’ mental workload, which is a key factor in determining human error. Therefore, it is crucial to find viable strategies for minimizing the cognitive load, for optimizing work schedules, and for managing automation support through workload-matched procedures.

Indeed, one approach to improving performance could be to support the limited human information processing capabilities through the use of adaptive aids, triggered by variations in human physiology and behavior. Previous studies (Di Nocera et al., 2006a; 2006b) showed a relation between the distribution of eye fixations and workload, providing a real-time measure of the operator’s load. In the present study, a typical visual search task was used. Subjects were requested to find a target among a set of distractors. Eye movements were recorded during the task. Results showed sensitivity of the proposed index to variations in mental workload, thus confirming the utility of fixation patterns as triggers for adaptive automation.

 

 

The socio-cultural differences and the personal traits in Ukraine’s scientific life

Oksana Udovyk and Oleg Udovyk

National University “Kyiv Mohyla Academy” and National Institute for Strategic Studies, Kyiv, Ukraine, E-mail: xenna_2003@ukr.net and oleg_udovyk@hotmail.com

Abstract:

Ukraine suffers from an identity crisis that is inhibiting its scientific, as well as its economic and political, development. The 47 million inhabitants of the former Soviet republic are deeply divided between pro-European and pro-Russian factions. The celebrated 'orange revolution' of November 2004 did less to bridge this divide than is commonly thought.

   The nation's research system broadly reflects this wider societal divide. On the one hand, there are many young, well-educated and highly motivated researchers and a network of increasingly independent universities. On the other, there's the National Academy of Sciences of Ukraine, a leviathan of militant senility that retains just enough power to control critical aspects of Ukraine's scientific life.

   The academy employs 47,000 permanent staff in a network of largely unproductive research establishments. Given the advanced age of its senior management, time alone will eventually resolve the issue. But that won't happen soon enough for those young Ukrainians currently in search of a productive scientific career.

   Integrating Ukraine into the Framework research programme of the European Union (EU) would allow this generation far greater interaction with its peers abroad. The European Commission supports the idea, which could also help open the way to future EU membership for Ukraine. But the leadership of the academy, deeply rooted in Soviet traditions, seems to be thwarting such integration through a mixture of contrariness and lack of interest.

   A high-level EU–Ukrainian steering committee on scientific cooperation, for example, was established on paper four years ago but has yet to actually meet. When it does, the academy's leaders are expected to obstruct collaborative steps that might bring an infusion of foreign influences into the country — including respect for the value of independent peer review.

   Ukrainian science has potential in several spheres, including materials sciences, radioastronomy, theoretical physics and agricultural research. The nation badly needs to focus its scarce resources in those areas where its scientists can compete, and dispose of some of its anachronistic scientific heritage. That will require a rigorous external evaluation of the performance of hundreds of the academy's institutes.

   The government needs to identify these reforms as a priority and then act with determination to overcome the academy's likely resistance to them. The oligarchy that has controlled Ukrainian science since Soviet times may then lose out. But the nation's economic potential and its prospects for integration into the EU, as well as science itself, can only benefit. Reform of Ukraine's archaic research system is needed sooner rather than later.

 

Recognizing the effects of voluntary facial activations using heart rate patterns

Toni Vanhala and Veikko Surakka

Research Group for Emotions, Sociality, and Computing Tampere Unit for Human-Computer Interaction, Dept. of Computer Sciences, FI-33014 University of Tampere, Finland, Email: Toni.Vanhala@cs.uta.fi

Abstract:

Continuously measured physiological signals have the potential to act as non-invasive, real-time indicators of human psycho-physiological phenomena. Recently, several non-intrusive, wireless, and discreet measurement devices have been developed. For these reasons, there has been growing interest in using physiological signals for estimating emotions and other psychological processes during human-computer interaction, as well as for person identification [e.g. 1]. Due to the interaction of the human physiological and psychological systems, there are several unique challenges in analyzing these signals. In the current work, we present the first steps towards constructing an online system that automatically identifies heart rate responses and estimates subjective experiences during voluntary facial activations. The preliminary results of our study showed that voluntarily produced facial expressions had an effect on subjective emotional experiences and physiological processes. Further, our results suggest that heart rate responses to facial activations can be detected in order to develop face detection systems for more accurate, online person identification and emotion recognition.

1.        Poulos, M., Rangoussi, M., Chrissikopoulos, V., Evangelou, A. (1999) Parametric person identification from the EEG using computational geometry. In Proceedings of ICECS '99, 1005-1008.

 

On the analysis of disfluencies in large spontaneous speech corpora: the case of autonomous fillers

Ioana Vasilescu

Limsi-Cnrs Spoken Language Processing Group Bat.508 Bp 133 F-91403 Orsay Cedex France, ioana@limsi.fr

Abstract:

The hesitation or “edition” phenomena, such as filled pauses, silent pauses, word lengthening etc. are widely encountered in world’s languages. They are to be distinguished from the lexical level. Consequently, they have been for decades considered as “speech disfluencies”, i.e. articulatory events without a role in building the verbal message. Recently, the research of cognitivists such as Clark and Fox Tree, brought into light a new decoding of the presence of those phenomena in speech [1]. The authors focused more particularly on the autonomous fillers in English (“uh”, “um”), which are defined as long and stable vocalic segments, potentially inserted at any moment within spontaneous speech. According to the authors, those items play a role in communication, i.e. “to announce the initiation of what is expected to be a […] delay in speaking”.

My work focuses on the analysis and modeling of speech disfluencies in the framework of automatic speech processing in a multilingual context. To this end, I analyzed fillers from a multilingual corpus of broadcast news in Arabic, Mandarin Chinese, French, German, Italian, European Portuguese, American English and Latin American Spanish. So far I have addressed the question of the specificity of acoustic filler models: are they generic across languages or language-dependent? I will present some acoustic and perceptual findings supporting the second hypothesis. I will also mention the effects of a number of external factors that influence the acoustic and prosodic patterns of fillers in different types of speech corpora (prepared, conversational, etc.). Among those factors, language, gender, speaking style and language proficiency cause significant variation that must be taken into account in order to model the phenomenon accurately.

1.        Clark H.H., Fox Tree J.E., Using uh and um in spontaneous speaking, Cognition 84, 73-111, 2002.

 

Prosodic Cues for Automatic Phrase Boundary Detection in ASR

Klara VICSI -  Gyorgy SZASZAK

Budapest University for Technology and Economics, Dept. for Telecommunications and Mediainformatics,

Budapest, Hungary.

ABSTRACT:

This article presents a cross-lingual study of Hungarian and Finnish on the segmentation of continuous speech at the word and phrase level based on prosodic features. A word-level segmenter has been developed which can indicate word boundaries with acceptable accuracy for both languages. The ultimate aim is to increase the robustness of Automatic Speech Recognizers (ASR) by detecting word and phrase boundaries, and thus to significantly reduce the search space during the decoding process, which is very time-consuming for agglutinative languages like Hungarian and Finnish. Both are, however, fixed-stress languages, so word beginnings can be marked with reliable accuracy by stress detection. Algorithms based on a data-driven (HMM) approach were developed and evaluated. The best results were obtained using time series of fundamental frequency and energy together. Syllable length was found to be much less effective and was therefore discarded. Using supra-segmental features, word boundaries can be marked with a high correctness ratio, provided we do not aim to find all of them. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this, we adapted our data-driven method to the Finnish language and obtained similar results.

 

Experiments in Assessing the Validity and Reliability of Item N-gram Distributions in Texts as Proxy for Textual Fingerprints

Carl Vogel
Computational Linguistics Group, Trinity College, University of Dublin, Dublin 2, Ireland.

Abstract:

My research in this area is associated with text rather than speech or other visual cues, and is driven by hypotheses within forensic linguistics that there are valid and reliable methods which can be used for authorship attribution tasks.  Behind these hypotheses is the claim that individuals essentially unconsciously fingerprint themselves in the texts that they produce.  There are many proposals in the literature, including analysis of consciously manipulated author style and of other aspects of text that are extremely difficult for an author to manipulate.  Orthography is one such aspect of texts.

While a great deal of deliberation may go into selection of lexical items from open-class categories, somewhat less reflection tends to be involved in closed class categories, and therefore it is interesting to explore distributions of words in closed class categories used by an author across texts and genre, and in comparison with other authors as a complement to any analysis of lexical richness or quirkiness. However, orthographic analysis crosses both categories, and is basically driven by the fact that while one can choose one's words, one does not generally choose how words are spelled. 

 

Along these lines, my research group and I have been experimenting with character n-gram analyses of texts, using similarity of character n-gram distributions to guide the assessment of similarity among texts.  The research involves comparative analysis, rating the character approach, with various values of n, against other methods of tokenizing texts -- e.g. word n-grams, n-grams of part of speech tags, accounting for stop-words, etc.  Some of the experiments have been on closed systems of texts in which authorship is actually known, others involve partly open systems of texts in which the claim of single authorship is disputed (e.g. whether one author is responsible for the entire Shakespeare corpus, including the apocrypha), and still others are more open-ended explorations in sentiment analysis -- using essentially bag-of-character analysis to estimate similarity among non-governmental political party positions on the basis of election manifestos.
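
As an illustration only, comparing two texts by their character n-gram distributions might be sketched as follows. The abstract does not say which similarity measure is used; cosine similarity over raw counts is an assumption made here, as are the function names and the default n = 3.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count overlapping character n-grams (runs of whitespace collapsed)."""
    cleaned = " ".join(text.split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count distributions.

    Assumption: cosine is used purely for illustration; other measures
    could be substituted without changing the rest of the sketch.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_similarity(text_a, text_b, n=3):
    """Similarity of two texts based on their character n-gram counts."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

Case and punctuation are deliberately left untouched, since orthographic habits are the signal of interest; varying n gives the different character-level tokenizations mentioned above.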

The research relates to the scope of the meeting in that the communication of information about authorship is assumed to be unintended by the author (although it is acknowledged that some authors do directly manipulate orthography for effect --- lipograms provide a relevant example), yet it can potentially be used to identify the author.  I intend to present and obtain feedback on the experiments reported, as well as on directions of ongoing research, which include tracking language change in individuals over time (an issue perhaps of interest in research on aging and early detection of neuro-degenerative disorders which impinge on language production).

A couple of relevant papers:

1.        Van Gijsel, Sofie and Carl Vogel (2003) Inducing a Cline from Corpora of Political Manifestos, International Symposium on Information and Communication Technologies, edited by Markus Aleksy, et al., pp. 304-310.

2.        O'Brien, Cormac and Carl Vogel (2003) Spam Filters: Bayes vs. Chi-Squared; Letters vs. Words, International Symposium on Information and Communication Technologies, edited by Markus Aleksy, et al., pp. 298-303.

 

Empty pauses detection in a noisy speech conditions

Vojtich Stejskal (a), Zdenek Smékal (a), Anna Esposito (b)

(a) Dep. of Telecommunications, Brno University of Technology, Purkynova 118, 612 00, Brno CZ,  stejskal@kn.vutbr.cz;

(b) Dep. of Psychology, Second University of Naples and IIASS, Italy

Abstract:

Successful pause detection now plays a most important role in the speech recognition process. A robust detection algorithm is needed, given that most speech recordings are made under very adverse conditions. This paper presents a comparison of several algorithms for empty pause detection in spontaneous speech recordings made in noisy environments. The input signal is transformed into log spectral energy and divided into specific frequency bands. Each band is smoothed and tracked by a dynamically adjusted threshold based on an estimate of the pause (noise) energy. Post-processing then corrects the detected edges. All of the proposed algorithms are capable of processing real-time input.
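
The processing chain described above can be sketched, in a heavily simplified form, as follows. This is not any of the algorithms compared in the paper: it uses a single full-band log energy instead of several frequency bands, and the margin, smoothing and adaptation constants, the minimum pause length, and the assumption that the recording starts with a pause are all hypothetical choices made for illustration.

```python
import math


def frame_log_energy(samples, frame_len):
    """Log energy (dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def detect_pauses(log_energy, margin_db=6.0, smooth=0.5, adapt=0.1, min_frames=3):
    """Return one boolean per frame, True where the frame is labelled a pause.

    Assumption: the first frame is a pause and initialises the noise estimate.
    """
    if not log_energy:
        return []
    noise = log_energy[0]
    smoothed = log_energy[0]
    labels = []
    for energy in log_energy:
        # Exponential smoothing of the frame energy.
        smoothed = smooth * smoothed + (1.0 - smooth) * energy
        # Dynamically adjusted threshold: noise estimate plus a fixed margin.
        is_pause = smoothed < noise + margin_db
        if is_pause:
            # Update the noise estimate only during frames labelled as pause.
            noise = (1.0 - adapt) * noise + adapt * smoothed
        labels.append(is_pause)
    # Stand-in for edge correction: relabel very short pause runs as speech.
    i = 0
    while i < len(labels):
        if labels[i]:
            j = i
            while j < len(labels) and labels[j]:
                j += 1
            if j - i < min_frames:
                for k in range(i, j):
                    labels[k] = False
            i = j
        else:
            i += 1
    return labels
```

Because each frame depends only on past frames until the final relabelling pass, the thresholding loop itself could run on streaming input; the smoothing does delay the detected pause onset by a few frames.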

 

Images-Signes and Cognitive Scenes to Anchor Comprehension in Language Learning: An Experiment on the Relationship between Verbal and Non Verbal Communication through Italian Film.

Rosa Volpe

148 rue Saint Honoré, 75001 Paris, France, rvolpe@univ-orleans.fr;

Abstract:

This study focuses on a foreign language classroom environment that makes daily use of film discourse in the target language in order to provide « in-context » and « situational training » as well as « anchored » target language input. More specifically, this study explores (1) whether first-semester, first-year college students of Italian can and will understand film narrative, and (2) whether and how comprehension of the film narrative affects their written production. The first experimental probe consists of a comprehension task. The results suggest that the Anchored Learning Group performed better than the Basic-Skills Learning Group in the comprehension of the two film segments. The second experimental probe consists of a production task and shows that, compared to the Basic-Skills Learning Group, the written production of the Anchored Learning Group reflects far better the structure of the narrative discourse they had been exposed to. For this production task, students were required to write their essay using at least 10 verbs they were familiar with. The performance of the Anchored Group differed from that of the Basic-Skills Group in both quality and quantity. The Anchored Group wrote more correct verbs, resulting in more correct sentences, and the structure of their discourse was closer to the structure of the film narrative's discourse.

 

Language and Communication. An Eco-Anthropological Point of View

GALATCHI Liviu-Daniel

Ovidius University of Constanta, Str. Dezrobirii, nr. 114, bl. IS7, sc. A, et. 4, apt. 16,

RO 900241, Constanta – 4, Romania, galatchi@univ-ovidius.ro  or  liviugalatchi@yahoo.com

Abstract:

Although wild primates have only call systems, chimps and gorillas can understand and manipulate non-verbal symbols based on language. Primates emit calls only in the presence of particular environmental stimuli. Calls cannot be combined when different stimuli are present simultaneously. At some point in human evolution, our ancestors became capable of displaced speech. Other contrasts between language and call systems include productivity and cultural transmission. Over time, our ancestral call systems developed into true language. Call systems grew too complicated for genetic transmission and began to rely on learning. Language is the main system humans use to communicate, although we also use nonverbal communications, gestures, and body stances and movements.

No language includes all the sounds that the human vocal apparatus can make.

There are culturally distinctive as well as universal relationships between language and mental processes. The lexicons and grammars of particular languages can lead speakers to perceive and think in particular ways. Speakers of different languages categorize their experience differently. However, language does not tightly restrict thought, because cultural changes can produce changes in thought and in language.

People vary their speech on different occasions, shifting styles, dialects, and languages. As linguistic systems, all languages and dialects are equally complex, rule-governed, and effective for communication. However, speech is used, evaluated, and changed in the context of political, economic and social forces. The linguistic traits of a low-status group are negatively evaluated (often by the members of the group) not because of their linguistic features but because they are associated with and symbolize low social status. One dialect, supported by the dominant institutions of the state, exercises symbolic domination over the others.

Cultural similarities and differences often correlate with linguistic ones. Linguistic clues can suggest past contacts between cultures. Related languages descend from an original protolanguage. Relationships between languages don’t necessarily mean that there are biological ties between their speakers, because people can learn new languages.

 

Face Recognition Experiments on AR Database

Marco Grassi1, Marcos Faundez2

1Ingegneria Eletronica dell'Universitá degli Studi di Ancona, margra75@hotmail.com

2Escola Universitaria Politecnica de Mataro, Spain

Abstract:

Biometric recognition and authentication based on face recognition can be used in many real-time applications such as surveillance, security systems, access control, and much more. For these purposes the system has to provide fast computation as well as robustness to variations in illumination and facial expression. The main objective of this paper is to implement a face recognition system using the DCT (Discrete Cosine Transform) method for feature extraction. Applying the DCT to the image makes it possible to concentrate the information, reducing the dimensionality of the problem. For classification, we used nearest neighbour classifiers based on MAD (Mean Absolute Difference) and MSE (Mean Square Error), which provide very fast computation, and an RBF (Radial Basis Function) Neural Network, which trains faster than a classic Neural Network. Simulation results on the AR face Database show that the proposed system performs very well, with very fast computation, a high recognition rate and good robustness.
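
A minimal sketch of DCT feature extraction followed by nearest neighbour classification with MAD is given below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which DCT coefficients are retained, so keeping the k x k low-frequency block is an assumption, the DCT is left unnormalised, and the function names are hypothetical. The MSE and RBF classifiers are omitted.

```python
import math


def dct_1d(values):
    """Unnormalised DCT-II of a sequence."""
    n = len(values)
    return [sum(values[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def dct_2d(image):
    """Separable 2-D DCT-II: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def dct_features(image, k):
    """Keep the k x k low-frequency block as the feature vector (assumption)."""
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(k) for v in range(k)]


def mean_absolute_difference(a, b):
    """MAD between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def nearest_neighbour(query, gallery):
    """Return the label in gallery (label -> features) closest to query."""
    return min(gallery, key=lambda label: mean_absolute_difference(query, gallery[label]))
```

An image here is a list of rows of grey-level values. Keeping only k * k coefficients instead of every pixel is what reduces the dimensionality of the classification problem.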

 

Overcomplete Blind Separation of Speech Sources in the Post Non Linear case through Extended Gaussianization

S.Squartini, S.Cecchi, E.Moretti, F.Piazza

A3Lab-DEIT Università Politecnica delle Marche

Via Brecce Bianche 31, 60131, Ancona, Italy

Abstract - This work deals with the blind separation problem in the presence of more sources than sensors and Post-Nonlinear (PNL) mixing. The method addressed consists of three separate steps: compensation of the nonlinearity, mixing matrix recovery and final estimation of the unknown sources. It was recently proposed and successfully evaluated on synthetic mixtures of real-world data (such as speech signals).  Here, the Extended Gaussianization approach is employed to perform the first step, instead of the common Gaussianization one, in order to reduce the approximation error on the linearized mixture pdfs. In computer simulations, this yielded a significant improvement in separation performance over the previous approach.
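
For orientation, the common Gaussianization step (not the Extended variant, which the abstract does not define) can be sketched as a rank-based transform: each observed sample x is replaced by Phi^{-1}(F(x)), where F is the empirical CDF of one sensor signal and Phi is the standard normal CDF. The sketch assumes, as Gaussianization does, that the linear mixture underneath the nonlinearity is approximately Gaussian and that the nonlinearity is monotonic; the function name is hypothetical.

```python
from statistics import NormalDist


def gaussianize(samples):
    """Map one sensor signal to approximately standard-normal values.

    Each sample is replaced by Phi^{-1}(F(x)), with F the empirical CDF
    estimated from ranks. Ties are broken by position (an arbitrary choice).
    """
    n = len(samples)
    order = sorted(range(n), key=lambda i: samples[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the argument strictly inside (0, 1).
        out[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return out
```

Applied separately to each sensor signal, this approximately undoes a monotonic post-nonlinearity, after which the mixing matrix recovery and source estimation steps operate on an approximately linear mixture.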

 

The relationships between gestures and prosody: A preliminary investigation on Italian

A. Esposito1, D. Esposito1, M. Refice2, M. Savino3, S. Shattuck-Hufnagel4

1Department of Psychology, Second University of  Naples (SUN), Italy

2Department of Elettrotecnica, Politecnico di Bari, Italy

3Department of Psychology, Università di Bari, Italy

4MIT, Research Laboratory of Electronics, Cambridge, MA, USA

Abstract.

This work investigates the relationships between gestures and prosody, exploiting a class of gestural movements named HITS and defined by Yasinnik, Renwick, Shattuck-Hufnagel (2004) [1] as: “An abrupt stop or pause in movement, which breaks the flow of the gesture during which it occurs”. Our analysis shows that in Italian, as in American English, these gestural entities are correlated with high-level prosodic units.

 

[1] Y. Yasinnik, M. Renwick, S. Shattuck-Hufnagel: The Timing of Speech-Accompanying Gestures with Respect to Prosody. Proceedings of the International Conference From Sound to Sense, C97-C102, MIT, Cambridge, June 10-13, 2004.