Diss: Comp Ling/Cognitive Science: Pastra: 'Vision...'

        1.    Katerina Pastra, Vision-Language Integration: a Double-Grounding Case

Date: 20-Jan-2005
From: Katerina Pastra
Subject: Vision-Language Integration: a Double-Grounding Case

Institution: University of Sheffield
Program: Department of Computer Science
Dissertation Status: Completed
Degree Date: 2004

Author: Katerina Pastra

Dissertation Title: Vision-Language Integration: a Double-Grounding Case

Linguistic Field(s): Cognitive Science; Computational Linguistics

Dissertation Director(s):
Phil Green
Louise Guthrie
Yorick Wilks

Dissertation Abstract:

This thesis explores the issue of vision-language integration from the
Artificial Intelligence perspective of building intentional artificial
agents able to combine their visual and linguistic abilities automatically.
While such computational vision-language integration is a sine qua non for
developing a wide range of intelligent multimedia systems, the deeper
issue still remains in the background of research. What does
integration actually mean? Why is it needed in Artificial Intelligence
systems, how is it currently achieved and how far can we go in developing
fully automatic vision-language integration prototypes?

Through a parallel theoretical investigation of visual and linguistic
representational systems, the nature and characteristics of the subjects of
this integration study, vision and language, are determined. Then, the
notion of their computational integration itself is explored. An extensive
review of the integration resources and mechanisms used in a wide range of
vision-language integration prototypes leads to a descriptive definition of
this integration as a process of establishing associations between images
and language. The review indicates that state-of-the-art prototypes fail
to perform real integration because they rely on human intervention at
key integration stages in order to overcome difficulties related to
features that vision and language inherently lack.

An examination of these features, aimed at discovering the real need for
integrating vision and language in multimodal situations, suggests that
intentionality-related issues play a central role in justifying
integration. These features are correlated with Searle's theory of
intentionality and the Symbol Grounding problem. This leads to a view of
the traditionally advocated grounding of language in visual perceptions as
a bi-directional, not one-directional, process. It is argued that
vision-language integration is rather a case of double-grounding, in which
linguistic representations are grounded in visual ones to gain direct
access to the physical world, while visual representations, in turn, are
grounded in linguistic ones to acquire controlled access to mental
aspects of the world.

Finally, the feasibility of developing a prototype able to achieve this
double-grounding with minimal human intervention is explored. The thesis
presents VLEMA, a prototype that is fed automatically reconstructed
building-interior scenes, which it subsequently describes in natural
language. The prototype includes a number of unique features which point to
new directions in building agents endowed with real vision-language
integration abilities.

