Audio-Visual Processing in Meetings:
Seven Questions and Current AMI Answers
Marc Al-Hames1 , Thomas Hain2 , Jan Cernocky3 , Sascha Schreiber1 ,
Mannes Poel4 , Ronald Müller1 , Sebastien Marcel5 , David van Leeuwen6 ,
Jean-Marc Odobez5 , Sileye Ba5 , Herve Bourlard5 , Fabien Cardinaux5 ,
Daniel Gatica-Perez5 , Adam Janin8 , Petr Motlicek3,5 , Stephan Reiter1 ,
Steve Renals7 , Jeroen van Rest6 , Rutger Rienks4 , Gerhard Rigoll1 ,
Kevin Smith5 , Andrew Thean6 , and Pavel Zemcik3 ⋆⋆
1 Institute for Human-Machine-Communication, Technische Universität München
2 Department of Computer Science, University of Sheffield
3 Faculty of Information Technology, Brno University of Technology
4 Department of Computer Science, University of Twente
5 IDIAP Research Institute and École Polytechnique Fédérale de Lausanne (EPFL)
6 Netherlands Organisation for Applied Scientific Research (TNO)
7 Centre for Speech Technology Research, University of Edinburgh
8 International Computer Science Institute, Berkeley, CA
Abstract. The project Augmented Multi-party Interaction (AMI) is
concerned with the development of meeting browsers and remote meeting
assistants for instrumented meeting rooms, and with the required component
technologies, organised into the R&D themes of group dynamics; audio, visual,
and multimodal processing; content abstraction; and human-computer interaction.
The audio-visual processing workpackage within AMI addresses automatic
recognition from the audio, video, and combined audio-video streams recorded
during meetings. In this article we describe the progress that has been made
in the first two years of the project. We show how the large problem of
audio-visual processing in meetings can be split into seven questions, such as
“Who is acting during the meeting?”. We then show which algorithms and methods
have been developed and evaluated to answer these questions automatically.
1 Introduction
Large parts of our working days are consumed by meetings and conferences.
Unfortunately, many of them are neither efficient nor especially successful.
In a recent study [12] people were asked to select emotion terms that they
thought would be frequently perceived in a meeting. The top answer, mentioned
by more than two thirds of the participants, was “boring”; furthermore, nearly
one third mentioned “annoyed” as a frequently perceived emotion. This suggests
that many people feel meetings are little more than flogging a dead horse.
⋆⋆ This work was partly supported by the European Union 6th FWP IST Integrated
Project AMI (Augmented Multi-party Interaction, FP6-506811).
Things go from bad to worse if transcriptions are required to recapitulate
decisions or to share information with people who have not attended the meeting.
There are different types of meeting transcript. The first is written by a
person involved in the meeting; it is therefore often not exhaustive, usually
reflects that person's particular perspective, and is sometimes only a
hand-written draft that cannot easily be shared. The second type is
professional minutes, written by a person specifically chosen to minute the
meeting and usually not involved in it; these require a lot of effort, but are
usually detailed and can be shared (if somebody actually takes the time to read
them). The third and most common type of transcript is no transcript at all.
Projects such as the ICSI Meeting Project [14], Computers in the Human Inter-
action Loop (CHIL) [29], and Augmented Multi-party Interaction (AMI) [7] try to
overcome these drawbacks of meetings, lectures, and conferences. They deal with
the automatic transcription, analysis, and summarisation of multi-party interac-
tions and aim both to improve efficiency and to allow later recapitulation of
the meeting content, e.g. with a meeting browser [30]. The AMI project is
especially concerned with the development of meeting browsers and remote
meeting assistants for instrumented meeting rooms, and with the required
component technologies, organised into the R&D themes of group dynamics; audio,
visual, and multimodal processing; content abstraction; and human-computer
interaction. “Smart meeting rooms” are equipped with audio-visual recording
equipment, and a huge range of data is captured during the meetings. A corpus
of 100 hours of meetings is being collected with a variety of microphones,
video cameras, electronic pens, and presentation slide and whiteboard capture
devices. For technical reasons, each meeting in the corpus involves a group of
four persons.
The first step in the analysis of this data is the processing of the raw audio-
visual streams, which involves various challenging tasks. In the AMI project we
address the audio-visual recognition problems by formulating seven questions:
1. What has been said during the meeting?
2. What events and keywords occur in the meeting?
3. Who and where are the persons in the meeting?
4. Who in the meeting is acting or speaking?
5. How do people act in the meeting?
6. What are the participants’ emotions in the meeting?
7. Where or what is the focus of attention in meetings?
The audio-visual processing workpackage within the AMI project aims to
develop algorithms that can automatically answer each of these questions from
the raw audio-visual streams. The answers can then be used either directly during
or after the meeting (e.g. in a meeting browser), or as input for higher-level
analysis (e.g. summarisation). In this article we describe the progress that has
been made in the first two years of the AMI project towards automatic recognition
from audio-visual streams, and thus towards answering these questions. Each of
the following sections discusses algorithms, methods, and evaluation standards for
one of the seven questions and summarises the lessons we have learned.