LINGUIST List 27.2006
Mon May 02 2016
Review: Computational Ling; General Ling: Harispe, Ranwez, Janaqi, Montmain (2015)
Editor for this issue: Sara Couture <sara@linguistlist.org>
Emiel van Miltenburg
Semantic Similarity from Natural Language and Ontology Analysis
Book announced at http://linguistlist.org/issues/26/26-4929.html
AUTHOR: Sebastian Harispe
AUTHOR: Sylvie Ranwez
AUTHOR: Stefan Janaqi
AUTHOR: Jacky Montmain
TITLE: Semantic Similarity from Natural Language and Ontology Analysis
SERIES TITLE: Synthesis Lectures on Human Language Technologies
PUBLISHER: Morgan & Claypool Publishers
REVIEWER: Emiel van Miltenburg, Vrije Universiteit Amsterdam
Reviews Editor: Helen Aristar-Dry
In the words of the authors, “this book proposes an extended introduction to semantic measures targeting both students and domain experts” (p. xiii) in the field of Natural Language Processing. The emphasis of the book is on semantic measures of similarity and relatedness, as derived from natural language (text corpora) and knowledge bases (WordNet, domain ontologies, thesauri and encyclopedias).
Chapter 1 -- introduction to semantic measures.
This chapter opens with an overview of application areas for ‘semantic measures’, after which the authors briefly discuss psychological models of similarity (spatial models, feature models, alignment models, and transformational models). Following this, the authors formally define the notions of ‘semantic measures’ (an umbrella term covering all measures that quantify some semantic relation), ‘relatedness’ and ‘similarity’, leading up to a classification of semantic measures according to the following four aspects:
“1. The type of elements that the measure aims to compare.
2. The semantic proxies used to extract the semantics required by the measure.
3. The semantic evidence and assumptions considered during the comparison.
4. The canonical form adopted to represent an element and how to handle it.” (p. 22)
The main aspect used to structure the book is the second one, with Chapter 2 devoted to measures using unstructured or semi-structured texts, and Chapter 3 devoted to knowledge-based measures. The other aspects are secondary, and referenced in the discussion of the relevant measures.
Chapter 2 -- corpus-based semantic measures.
This chapter starts out with the observation that most corpus-based semantic measures are either implicitly or explicitly based on the ‘distributional hypothesis’, the idea that words occurring in similar contexts convey similar meanings. (The authors restrict themselves to distributional measures that compare words, excluding measures that compare texts or sentences.) It continues with a general description of how count-based distributional models work. Next, the chapter discusses the meaning of words and the difference between syntagmatic and paradigmatic contexts, after which the authors explain the field of distributional semantics. This explanation is followed by an overview of different ‘distributional measures’. The authors briefly mention the set-based approach (Bollegala et al. 2007, among others) and the probabilistic approach (Dagan et al. 1999, among others), but the emphasis is on the geometric or spatial approach. The authors highlight LSA, ESA, HAL, Schütze’s Word Space, Random Indexing and COALS (Deerwester et al. 1990; Gabrilovich and Markovitch 2007; Lund and Burgess 1996; Schütze 1993; Kanerva et al. 2000; Rohde et al. 2006) as the most popular approaches. The chapter closes with a list of the main advantages and limitations of corpus-based measures, and a final summary.
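The count-based pipeline described in this chapter can be sketched in a few lines of Python: build a word-by-word co-occurrence matrix from a corpus, then compare words by the cosine of their context vectors. (The toy corpus and window size below are my own illustration, not taken from the book.)

```python
import math
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word co-occurs with each context word
    inside a symmetric window."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sent[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0) for k, x in u.items())
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
counts = cooccurrence_counts(corpus)
print(cosine(counts["cat"], counts["dog"]))    # high: many shared contexts
print(cosine(counts["cat"], counts["water"]))  # low: one shared context
```

Even on this tiny corpus, ‘cat’ and ‘dog’ come out far more similar than ‘cat’ and ‘water’, because they share the contexts ‘the’, ‘drinks’ and ‘chases’; real count models differ mainly in scale, in how they weight the raw counts, and in how they reduce the dimensionality of the resulting matrix.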
Chapter 3 -- knowledge-based semantic measures.
This is the longest chapter in the book. It starts out by explaining the idea of representing ontologies as graphs, and by introducing the necessary formal notation. The authors then distinguish between cyclic and acyclic graphs, and explore different semantic measures for cyclic graphs. Before discussing semantic measures for acyclic graphs, however, the authors take a detour to discuss graph properties that can be used to compute semantic measures. This discussion is followed by an extensive overview of pair- and groupwise semantic similarity measures that make use of structured taxonomies. The authors dedicate a short section to other knowledge-based measures before providing a list of the main advantages and limitations of knowledge-based measures. Finally, the authors dedicate a section to hybrid approaches mixing knowledge-based and corpus-based measures, followed by a short conclusion.
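To make the graph-based idea concrete, here is a minimal sketch of one classic member of this family: edge-counting similarity over a rooted ‘is-a’ taxonomy, where similarity is the inverse of the shortest path through the lowest common ancestor. (The toy taxonomy is invented for illustration; the book surveys many more refined measures.)

```python
# Toy is-a taxonomy, stored as child -> parent; illustrative only.
parent = {
    "cat": "animal", "dog": "animal", "car": "artifact",
    "animal": "entity", "artifact": "entity",
}

def ancestors(node):
    """Chain of ancestors from node up to the root, node included."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def path_length(a, b):
    """Number of edges on the shortest path between a and b
    through their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    depth_b = {n: i for i, n in enumerate(up_b)}
    for i, n in enumerate(up_a):
        if n in depth_b:          # first shared ancestor is the lowest one
            return i + depth_b[n]
    raise ValueError("no common ancestor")

def path_sim(a, b):
    """Edge-counting similarity: inverse of (1 + path length)."""
    return 1.0 / (1.0 + path_length(a, b))

print(path_sim("cat", "dog"))  # 2 edges via 'animal'  -> 1/3
print(path_sim("cat", "car"))  # 4 edges via 'entity'  -> 1/5
```

The pair- and groupwise measures covered in the chapter refine this basic scheme, for instance by weighting edges by depth or by replacing path counts with the information content of the lowest common ancestor.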
Chapter 4 -- methods and datasets for the evaluation of semantic measures.
This chapter provides an overview of the datasets that have been used to evaluate semantic measures. It starts out with a general introduction to semantic measure evaluation, and goes on to explain different criteria that one may have for an evaluation (including e.g. computational complexity of the evaluation). These criteria may help researchers in selecting the right dataset to evaluate their own semantic measure. After discussing direct versus indirect evaluation strategies, the authors first list a large number of datasets, and then provide additional details for all of them (e.g. how the evaluation data was created). The chapter closes with a discussion, noting that more research needs to be done to find out how to better evaluate semantic measures.
Chapter 5 -- conclusion and research directions.
This chapter first summarizes what has been covered in the preceding chapters, and then presents several suggestions for future research. These are:
“* Better characterize semantic measures and their semantics;
* Provide theoretical and software tools for the study of semantic measures;
* Standardize ontology handling;
* Improve models for compositionality;
* Study current models of semantic measures w.r.t. language specificities;
* Promote interdisciplinary efforts;
* Study algorithmic complexity of measures;
* Support context-specific selection of semantic measures.” (pp. 159-160)
It is beyond the scope of this summary to cover all these suggestions in more detail, but I would like to commend the authors on their extensive list of suggestions for further research.
The book has four appendices, whose titles are mostly self-explanatory:
Appendix A -- examples of syntagmatic contexts (5 pages).
Appendix B -- a brief introduction to Singular Value Decomposition (2 pages).
Appendix C -- a brief overview of other models for representing units of language (7 pages). This appendix covers two different kinds of language models: n-gram models and Neural Network Language-based Models (NNLMs), discussing only the basic intuitions behind these models. The appendix closes with a discussion of compositionality in distributional semantics (the idea of building a sentence representation by mathematically combining word vectors), offering some references to further explore the subject.
Appendix D -- software tools and source code libraries (9 pages).
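The compositionality idea touched on in Appendix C, in its simplest additive form, amounts to summing or averaging word vectors to obtain a sentence vector. A sketch (the three-dimensional vectors are made up for illustration; real models use hundreds of dimensions):

```python
def compose_additive(sentence, vectors):
    """Average the word vectors of a sentence (simplest composition).
    Assumes at least one word of the sentence is in the vocabulary."""
    words = [w for w in sentence.split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for w in words:
        for i, x in enumerate(vectors[w]):
            out[i] += x
    return [x / len(words) for x in out]

# Hypothetical 3-dimensional word vectors.
vecs = {
    "dogs": [1.0, 0.0, 1.0],
    "chase": [0.0, 1.0, 1.0],
    "cats": [1.0, 0.0, 0.8],
}
print(compose_additive("dogs chase cats", vecs))
```

Averaging obviously discards word order (‘dogs chase cats’ and ‘cats chase dogs’ get identical vectors), which is precisely why the literature referenced in the appendix explores more structured composition functions.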
Production-wise, there are two (minor) issues with this book. First, the style of the figures is not uniform, and the images frequently suffer from compression artifacts. Second, because the authors are not native speakers of English, the prose is sometimes a bit awkward (e.g. “the hypothesis that has been aforementioned”, p. 42) and the style is at times a bit dense.
Coverage & audience
The authors note that “[c]ommunities of Psychology, Cognitive Sciences, Linguistics, Natural Language Processing, Semantic Web, and Biomedical informatics are among the most active contributors” to the field (pp. 2-3). But in their discussion of semantic similarity, they restrict themselves mostly to the latter three. This is of course excusable for a book in a series on human language technologies, but readers mostly interested in Psychology, Cognitive Science or Linguistics-related aspects of similarities should look elsewhere. The authors note that their section on the Psychology of similarity is based on (Hahn 2011), but readers interested in a written overview could also consult (Hahn and Heit 2001), which not only covers the same ground but also makes the connection with later work in distributional semantics. For an introduction to distributional semantics and Latent Semantic Analysis, see (Landauer and Dumais 1997). (For a book-length introduction to distributional semantics, see Widdows 2004.) One might accompany this with Gärdenfors’ (2000) seminal work on Conceptual Spaces, or either one of (Margolis and Laurence 1999) or (Murphy 2002) for a general overview of theories of conceptual representation. More experimental (and cross-cultural) work has been done by Malt et al. (1999), Khetarpal et al. (2010) and others (see also their references). A book connecting this body of work with current advances in human language technologies is yet to be written.
The core of the book (Chapters 2 and 3) is about corpus-based and ontology-based semantic measures. The expertise of the authors clearly lies in the field of ontology analysis. This book can be best understood as an attempt to put their knowledge (as exemplified in Chapter 3) in a broader perspective. That also explains why the 29-page chapter on corpus-based semantic measures has a relatively narrow focus “for the sake of clarity and due to space constraints”, while the chapter on knowledge-based semantic measures runs to 72 pages. The extensive coverage of these measures does make Chapter 3 a solid reference for ontology-related matters.
So what about the coverage of corpus-based semantic measures? Chapter 2 provides a decent introduction to what Baroni et al. (2014) call ‘count models’ (i.e. distributional models that build up a matrix of (word-document or word-word) co-occurrence counts, and then transform that matrix to obtain vector representations corresponding to word meanings), but even the newest model the authors discuss in detail is almost ten years old. Meanwhile, ‘predict models’ (based on neural networks) have taken the field by storm since Mikolov et al. (2013) released their word2vec tool. Participants of EMNLP 2015 (Empirical Methods in Natural Language Processing) were even joking that the ‘E’ in EMNLP now stands for ‘embeddings’! Right now, predict models are only covered in Appendix C, which almost seems to have been added as an afterthought. Given the ubiquity of these models, this is a mistake. Other recent developments, such as multimodal distributional semantics (MDS; Bruni et al. 2014; Baroni 2015), aren’t even mentioned. This is a missed opportunity for a textbook that wishes to “stimulate creativity toward the development of new approaches” (p. xiv), because MDS leads us to rethink what we mean by the ‘context’ of a word, and to ask whether distributional models are (or could be made) biologically plausible. This is an exciting avenue of research in Cognitive Science that sadly isn’t touched upon. The second chapter, then, is a no-frills introduction to the very basics of distributional semantics. It is still usable in a classroom context, but might be supplemented with additional literature. With respect to neural network language models, Yoav Goldberg’s excellent ‘Primer on Neural Networks for Natural Language Processing’ (draft available at http://u.cs.biu.ac.il/~yogo/nnlp.pdf) provides a good introduction. If Goldberg’s primer is too much, Chris Olah’s blog (http://colah.github.io/) has some very accessible explanations of language models and deep learning. (Also see the reading list at http://deeplearning.net/reading-list/.)
The remaining chapters give a good overview of the field, and include all the necessary references to embark on a research project on semantic similarity. Despite its shortcomings, this book does make a solid reference work on knowledge-based similarity measures, and it provides a good overview of the evaluation protocols currently available.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 238-247).
Baroni, M. (2015). Grounding Distributional Semantics in the Visual World. Language and Linguistics Compass.
Bollegala, D., Matsuo, Y., & Ishizuka, M. (2007). An Integrated Approach to Measuring Semantic Similarity between Words Using Information Available on the Web. In HLT-NAACL (pp. 340-347).
Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal Distributional Semantics. J. Artif. Intell. Res.(JAIR), 49, 1-47.
Dagan, I., Lee, L., & Pereira, F. C. (1999). Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1-3), 43-69.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JAsIs, 41(6), 391-407.
Gärdenfors, P. (2000). Conceptual spaces : the geometry of thought. Cambridge, Mass.: MIT Press.
Gabrilovich, E., & Markovitch, S. (2007). Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In IJCAI (Vol. 7, pp. 1606-1611).
Hahn, U. (2011). What makes things similar? Invited talk at the 1st International Workshop on Similarity-based Pattern Analysis. URL: http://videolectures.net/simbad2011_hahn_similar/
Hahn, U., & Heit, E. (2001). Semantic similarity, cognitive psychology of. In N. J. Smelser & P. B. Baltes (Eds.), International Encyclopedia of the Social & Behavioral Sciences (pp. 13878-13881). Oxford: Pergamon.
Kanerva, P., Kristofersson, J., & Holst, A. (2000). Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd annual conference of the cognitive science society (Vol. 1036). Mahwah, NJ: Erlbaum.
Khetarpal, N., Majid, A., Malt, B. C., Sloman, S., & Regier, T. (2010). Similarity judgments reflect both language and cross-language tendencies: Evidence from two semantic domains. In S. Ohlsson, & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 358-363). Austin, TX: Cognitive Science Society.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2), 211.
Lund, K., & Burgess, C. (1996). Hyperspace analogue to language (HAL): A general model of semantic representation. In Brain and Cognition (Vol. 30, No. 3, pp. 5-5).
Malt, B. C., Sloman, S. A., Gennari, S., Shi, M., & Wang, Y. (1999). Knowing versus naming: Similarity and the linguistic categorization of artifacts. Journal of Memory and Language, 40(2), 230-262.
Margolis, E., & Laurence, S. (1999). Concepts: core readings. Mit Press.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.
Murphy, G. L. (2002). The big book of concepts. MIT press.
Rohde, D. L., Gonnerman, L. M., & Plaut, D. C. (2006). An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8, 627-633.
Schütze, H. (1993). Word space. In Advances in Neural Information Processing Systems 5.
Widdows, D. (2004). Geometry and meaning (Vol. 773). Stanford: CSLI publications.
ABOUT THE REVIEWER
Emiel van Miltenburg is a PhD candidate working at the Vrije Universiteit Amsterdam, under the supervision of Piek Vossen. His research interests include conceptual representation, semantic similarity, pragmatics and natural language processing.
Page Updated: 02-May-2016