CoCQA: Co-Training Over Questions and Answers
with an Application to Predicting Question Subjectivity Orientation
Baoli Li Yandong Liu Eugene Agichtein
Emory University Emory University Emory University
csblli@gmail.com yliu49@emory.edu eugene@mathcs.emory.edu
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 937–946, Honolulu, October 2008. © 2008 Association for Computational Linguistics

Abstract

An increasingly popular method for finding information online is via Community Question Answering (CQA) portals such as Yahoo! Answers, Naver, and Baidu Knows. Searching the CQA archives, and ranking, filtering, and evaluating the submitted answers, requires intelligent processing of the questions and answers posed by the users. One important task is automatically detecting the question's subjectivity orientation: namely, whether a user is searching for subjective or objective information. Unfortunately, real user questions are often vague, ill-posed, or poorly stated. Furthermore, there has been little labeled training data available for real user questions. To address these problems, we present CoCQA, a co-training system that exploits the association between the questions and contributed answers for question analysis tasks. The co-training approach allows CoCQA to use the effectively unlimited amounts of unlabeled data readily available in CQA archives. In this paper we study the effectiveness of CoCQA for the question subjectivity classification task by experimenting over thousands of real users' questions.

1 Introduction

Automatic question answering (QA) has been one of the long-standing goals of natural language processing, information retrieval, and artificial intelligence research. For a natural language question we would like to respond with a specific, accurate, and complete answer that addresses the question. Although much progress has been made, answering complex, opinion, and even many factual questions automatically is still beyond the current state-of-the-art. At the same time, the rising popularity of social media and collaborative content creation services provides a promising alternative to web search or completely automated QA. The explicit support for social interactions between participants, such as posting comments, rating content, and responding to questions and comments, makes this medium particularly amenable to Question Answering. Some very successful examples of Community Question Answering (CQA) sites are Yahoo! Answers¹, Naver², and Baidu Knows³. Yahoo! Answers alone has already amassed hundreds of millions of answers posted by millions of participants on thousands of topics.

¹ http://answers.yahoo.com
² http://www.naver.com
³ http://www.baidu.com

The questions posted to such CQA portals are typically complex, subjective, and rely on human interpretation to understand the corresponding information need. At the same time, the questions are also usually ill-phrased, vague, and often subjective in nature. Hence, analysis of the questions (and of the corresponding user intent) in this setting is a particularly difficult task. At the same time, CQA content incorporates the relationships between questions and the corresponding answers. Because of the various incentives provided by the CQA sites, answers posted by users tend to be, at least to some degree, responsive to the question. This observation suggests investigating whether the relationship between questions and answers can be exploited to improve automated analysis of the CQA content and the user intent behind the questions posted.
To this end, we exploit the ideas of co-training, a general semi-supervised learning approach naturally applicable to cases of complementary views on a domain, for example, web page links and content (Blum and Mitchell, 1998). In our setting, we focus on the complementary views for a question, namely the text of the question and the text of the associated answers.
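
To make the two-view idea concrete, the following is a minimal sketch of a generic co-training loop in the style of Blum and Mitchell (1998), with the question text and the answer text as the two views. It is not the CoCQA algorithm itself, which is introduced in Section 3. The `NaiveBayes` class, the `co_train` function, the `rounds` and `per_round` parameters, the rule for resolving disagreements between views, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens (add-one smoothing)."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.prior = Counter(labels)  # class counts, normalized implicitly later
        self.counts = {y: Counter() for y in self.labels}
        for text, y in zip(texts, labels):
            self.counts[y].update(text.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict_proba(self, text):
        """Return {label: posterior probability} for one text."""
        log_scores = {}
        for y in self.labels:
            total = sum(self.counts[y].values()) + len(self.vocab)
            score = math.log(self.prior[y])
            for w in text.lower().split():
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.counts[y][w] + 1) / total)
            log_scores[y] = score
        top = max(log_scores.values())
        exp = {y: math.exp(s - top) for y, s in log_scores.items()}
        norm = sum(exp.values())
        return {y: v / norm for y, v in exp.items()}


def fit_views(labeled):
    """Train one classifier per view: question text (0) and answer text (1)."""
    ys = [y for _, _, y in labeled]
    q_clf = NaiveBayes().fit([q for q, _, _ in labeled], ys)
    a_clf = NaiveBayes().fit([a for _, a, _ in labeled], ys)
    return q_clf, a_clf


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: [(question, answers, label)]; unlabeled: [(question, answers)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, example in enumerate(pool):
                proba = clf.predict_proba(example[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], i, label))
            # Each view nominates its most confident examples. If both views
            # pick the same example, the question view's label is kept
            # (an arbitrary choice made for this sketch).
            for _, i, label in sorted(scored, reverse=True)[:per_round]:
                picked.setdefault(i, label)
        labeled += [(pool[i][0], pool[i][1], y) for i, y in picked.items()]
        pool = [ex for i, ex in enumerate(pool) if i not in picked]
    q_clf, a_clf = fit_views(labeled)
    return q_clf, a_clf, labeled


# Toy, made-up data: two labeled seeds and two unlabeled question/answer pairs.
seed = [
    ("what is the capital of france",
     "the capital is paris according to wikipedia", "objective"),
    ("which movie do you like best",
     "i think my favorite is probably casablanca", "subjective"),
]
pool = [
    ("what is the boiling point of water",
     "according to wikipedia it is 100 degrees"),
    ("which song do you like most", "i think my favorite is yesterday"),
]
q_clf, a_clf, grown = co_train(seed, pool)
for question, _, label in grown[len(seed):]:
    print(label, "<-", question)
```

Each round retrains both view classifiers on the growing labeled set and moves the most confidently predicted unlabeled examples into it, so an example that is easy for one view can become training data for the other.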
As a concrete case-study of our approach we focus on one particularly important aspect of intent detection: the subjectivity orientation. We attempt to predict whether a question posted in a CQA site is subjective or objective. Objective questions are expected to be answered with reliable or authoritative information, typically published online and possibly referenced as part of the answer, whereas subjective questions seek answers containing private states, e.g., personal opinions, judgments, or experiences. If we could automatically predict the orientation of a question, we would be able to better rank or filter the answers, improve search over the archives, and more accurately identify similar questions. For example, if a question is objective, we could try to find a few highly relevant articles as references, whereas if a question is subjective, useful answers are not expected to be found in authoritative sources and tend to rank low with current question answering and CQA search techniques. Finally, learning how to identify question orientation is a crucial component of inferring user intent, a long-standing problem in web information access settings.

In particular, we focus on the following research questions:
• Can we utilize the inherent structure of the CQA interactions and use the unlimited amounts of unlabeled data to improve classification performance, and/or reduce the amount of manual labeling required?
• Can we automatically predict question subjectivity in Community Question Answering – and which features are useful for this task in the real CQA setting?

Figure 1: Example question (Yahoo! Answers)

The rest of the paper is structured as follows. We first overview the community question answering setting, and state more precisely the question orientation classification problem, which we use as the motivating application for our system. We then introduce our CoCQA system for semi-supervised classification of questions and answers in CQA communities (Section 3). We report the results of our experiments over thousands of real user questions in Section 4, showing the effectiveness of our approach. Finally, we review related work in Section 5, and discuss our conclusions and future work in Section 6.

2 Question Orientation in CQA

We first briefly describe the essential features of question answering communities such as Yahoo! Answers or Naver. Then, we formally state the problem addressed in this paper, and the features used for this setting.