FFF CONFERENCE CTF07

Massimo Poesio, Marco Baroni, Brian Murphy, Eduard Barbu, Luigi Lombardi, Abdulrahman Almuhareb, Gabriella Vigliocco, David Vinson - Speaker-generated and corpus-generated concept features

Computational research on extracting concept descriptions from corpora provides vast quantities of empirical data on concepts that may supplement findings from linguistics, philosophy, psychology, and neuroscience. Much of this work is inspired by theoretical work in linguistics and philosophy and aims at extracting ontological properties of concepts, as opposed to merely distributional properties of words or syntactic relations. Some of this work has also attempted to extract information about semantic attributes such as parts and qualities (Almuhareb and Poesio, 2004, 2005; Cimiano et al., 2005). Almuhareb and Poesio (henceforth A&P) developed supervised and unsupervised methods for feature extraction from the Web based on ideas from Guarino (1992) and Pustejovsky (1995), among others, and showed that focusing on extracting ‘attributes’ and ‘values’ leads to concept descriptions that are more effective from a clustering perspective (e.g., for distinguishing ANIMALS from TOOLS or VEHICLES) than purely distributional descriptions.
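   The clustering perspective mentioned above can be illustrated with a small sketch: if each concept is described by counts over attribute-like features, within-category similarity should exceed cross-category similarity. The concepts and counts below are invented toy data, not A&P’s actual corpus features.

```python
import math

# Toy attribute-value descriptions (hypothetical counts, for illustration
# only -- not the actual corpus-generated features discussed in the paper).
concepts = {
    "horse":  {"legs": 9, "mane": 5, "ride": 7, "blade": 0, "handle": 0},
    "dog":    {"legs": 8, "mane": 0, "ride": 0, "blade": 0, "handle": 0},
    "knife":  {"legs": 0, "mane": 0, "ride": 0, "blade": 9, "handle": 6},
    "hammer": {"legs": 0, "mane": 0, "ride": 0, "blade": 1, "handle": 8},
}

def cosine(u, v):
    # Cosine similarity between two sparse feature-count vectors.
    feats = set(u) | set(v)
    dot = sum(u.get(f, 0) * v.get(f, 0) for f in feats)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# ANIMAL-ANIMAL similarity is high, ANIMAL-TOOL similarity low: this
# separation is what makes the descriptions clusterable by category.
print(cosine(concepts["horse"], concepts["dog"]))    # high
print(cosine(concepts["horse"], concepts["knife"]))  # low
```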
   In order to identify areas of further progress, ‘gold-standard’ concept descriptions against which to compare such corpus-generated features (CGFs) are required, but finding objective, ‘gold-standard’ descriptions of concepts is not easy. There is, however, a possible solution: use the lists of features collected from human subjects by psychologists. These are lists of features ranked, e.g., by the number of subjects who listed that feature for that concept. The three best-known databases of these speaker-generated features (SGFs) were produced by Garrard et al. 2001 (GA), Vigliocco et al. 2004 (VV), and McRae et al. 2005 (MCRA). In this paper we discuss work we have carried out exploring this option, including an extensive qualitative and quantitative analysis of these three databases, and comparisons between the features associated with concepts in these databases and corpus-generated features.
   SGFs have been collected using very different methodologies; as a result, the lists of features produced by different researchers do not entirely overlap, and ranks are not entirely equivalent either, but Spearman rank correlations are reasonable, ranging from .48 (MCRA vs. GA) to .68 (VV vs. MCRA). Computing correlations between CGFs and SGFs requires some form of normalization, but with normalization we obtained correlations not much lower than those between SGF databases. (‘Raw’ correlation between VV and CGFs was .28; this rose to .43 with orthographic and stem normalization, and after excluding low-count features.) It is interesting to note that, for all these differences, the distances between concepts (viewed as vectors in a high-dimensional feature space) using CGFs and SGFs are highly correlated (ρ = .777).
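   The rank correlations above can be computed without external libraries. The following is a minimal sketch of Spearman’s correlation (Pearson correlation over tie-averaged ranks); the two feature-frequency lists are hypothetical production counts, not values from the VV or MCRA databases.

```python
import math

def ranks(xs):
    # Assign 1-based ranks, averaging over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Pearson correlation computed on the rank vectors.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((a - my) ** 2 for a in ry))
    return cov / (sx * sy)

# Hypothetical production frequencies for the same five features of one
# concept in two databases (invented numbers, for illustration only).
db_a = [25, 18, 12, 7, 3]
db_b = [22, 9, 20, 8, 2]
print(round(spearman(db_a, db_b), 2))  # → 0.9
```

Correlating full databases additionally requires aligning the feature inventories first (the orthographic and stem normalization mentioned above), since a feature only contributes to the correlation if it appears in both lists.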
   Preliminary results of a qualitative comparison suggest, first of all, that SGFs provide the view of kinds that might be found in a children’s encyclopedia (e.g., for HORSE, the features <animal>, <4-legs>, <hoof>, <ride>, <mane> are highly ranked in all SGF databases), whereas CGFs include in addition more detailed information, from a variety of perspectives, and about specific instances (<dressage>, <quarter>, <white>). There are, however, also important differences in ranking, which raise the question of what makes a feature important, and whether this notion can be independent of the comparison set: e.g., the top 10 features of APPLE extracted from the corpus include <top>, <surface>, and <circumference>, all properties of physical objects but not particularly distinctive properties of apples.