
Review of Representation and parsing of multiword expressions

Reviewer: Viatcheslav Yatsko
Book Title: Representation and parsing of multiword expressions
Book Author: Yannick Parmentier, Jakub Waszczuk
Publisher: Language Science Press
Linguistic Field(s): Computational Linguistics
Issue Number: 31.1728


This book is a collection of papers produced within Workgroup 2 of the European PARSEME COST Action. The aim of the workgroup was to develop techniques and methodologies for the detection, representation, and parsing of multiword expressions (MWEs). From a linguistic point of view, multiword expressions are constructions whose components have lost their original lexical meaning, though they admit some variation; cf. ''to take a haircut'', meaning ''to agree to accept less money for something'', which admits modifiers (to take a serious/70% haircut) and different verb forms (takes/is taking/has taken a haircut).

The book comprises a Preface and 10 chapters grouped into three parts: MWE representation (Chapters 1-5), MWE parsing (Chapters 6-8), and Multilingual NL applications for MWEs (Chapters 9 and 10).

In the Preface, Yannick Parmentier and Jakub Waszczuk, the editors of the book, substantiate the relevance of the investigations presented in the book to the field of automatic natural language processing, pointing to the idiosyncratic properties of MWEs and their frequency of occurrence (MWEs cover up to 30% of all words in human language utterances). They are right, because tokenization based on the traditional bag-of-words approach usually fails to detect phraseological constructions, which can negatively affect the results of text processing. Obviously, if the MWE mentioned above is divided into separate words, its meaning, as well as the meaning of a part of the text or even of the whole text, will be lost. Hence the importance of specific techniques for MWE detection, representation, and processing.

Part I of the book opens with the chapter ''Lexical encoding formats for multiword expressions. The challenge of 'irregular' regularities'', written by Timm Lichte, Simon Petitjean, Agata Savary, and Jakub Waszczuk. The chapter gives a clear and logical interpretation of the notions of regularity and irregularity in terms of set theory. A property ''p'' is considered regular with respect to a set of objects ''E'' if ''p'' is shared by at least two members of ''E''. If it is associated with only one member of ''E'', the property is irregular. A regular property is non-trivially regular if it is shared by a proper subset of ''E'' (at least two members, but not all of them), and trivially regular if it is shared by all objects in ''E''.
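The set-theoretic definitions above can be illustrated with a minimal sketch. The function name, the ''absent'' label for a property that no object has, and the sample expressions are my own assumptions for illustration, not part of the chapter.

```python
def classify_property(holders, universe):
    """Classify a property by how many members of `universe` (the set E) have it.

    `holders` is the collection of objects that exhibit the property p.
    The labels follow the definitions summarized in the review; "absent"
    is a hypothetical extra label for a property that no member of E has.
    """
    universe = set(universe)
    holders = set(holders) & universe  # ignore objects outside E
    count = len(holders)
    if count == 0:
        return "absent"
    if count == 1:
        return "irregular"
    if count == len(universe):
        return "trivially regular"
    return "non-trivially regular"


# Hypothetical set E of expressions, used only to exercise the function.
E = {"kick the bucket", "spill the beans", "take a haircut"}

print(classify_property({"take a haircut"}, E))
print(classify_property({"spill the beans", "take a haircut"}, E))
print(classify_property(E, E))
```

Checking the single-holder case before the all-members case means that a property of a one-element set counts as irregular, which is one possible reading of the definitions.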

This interpretation proposed by the authors reminded me of the methodologies developed within cluster analysis, which also involve correlation between properties and objects. In cluster analysis, a cluster is defined as ''a set of objects that share some property'' [1, p. 495]. A somewhat similar methodology is used in componential analysis, where semantic features are assigned to linguistic units (usually words) to distinguish between words with similar meanings. Table 1 (p. 5) is similar to the tables used in distributional analysis. The authors should have provided references to embed their research in a larger theoretical framework. I was surprised not to find such references.

Most of the chapter focuses on various encoding formats that may be used to represent the structure of MWEs.

The second chapter, ''Verbal multiword expressions: Idiomaticity and flexibility'', is written by Tali Arad Greshler, Nurit Melnik, and Shuly Wintner. The authors focus on the work of Nunberg et al., who differentiated between decomposable and non-decomposable MWEs on the basis of their degree of flexibility. Having analyzed the results of psycholinguistic experiments and investigations of MWEs in languages other than English, the authors conclude that the correlation between the decomposability of MWEs and their transformational flexibility is language-specific, since different languages admit different variants of such correlation. They consider the notion of decomposability fuzzy and difficult to apply to the classification of idioms. The authors suggest an alternative categorization of MWEs based on the notions of FIGURATION and TRANSPARENCY. They conjecture that transformational productivity depends on transparency and figurativeness: the more transparent and figurative an idiom is, the more transformationally productive it is. Fifteen verbal MWEs were selected, and examples of variation in their structure were retrieved from a billion-token Hebrew corpus.

Assessing the approach suggested by Greshler et al., I'd like to note the following. 1) I don't see any essential difference between this conception and Nunberg's conception, which they criticize. Both are pragmatic in character, being speaker-focused, and, as such, both need experimental data to test their validity. In criticizing Nunberg's conception the authors refer to psycholinguistic data, but they never provide any experimental data to substantiate their own conclusions. For me ''shoot the breeze'' is figurative and transparent in the same way as ''saw logs'', because it creates in my mind a ''vivid picture'' of a person who speaks so fast that words come out of his mouth like shots from a gun. Perhaps other speakers' perception will be different. Without psycholinguistic data the authors' concepts of figurativeness and transparency remain fuzzy and unconvincing. 2) As the analysis is limited to 15 phrases, the authors, as they admit, were not able to obtain reliable statistical data to corroborate the dependency between transparency and flexibility. 3) The analysis is limited to verbal idiomatic expressions, which are apparently flexible by nature. It is not clear whether the concepts of transparency and figurativeness can be applied to other types of idioms that have no verbal components. 4) The analysis of the productivity of idioms performed by the authors is interesting and may be used in other investigations.

The third chapter, entitled ''Multiword expressions in an LFG grammar for Norwegian'', is written by Helge Dyvik, Gyri Smørdal Losnegaard, and Victoria Rosén. It focuses on the methods for representing MWEs in NorGram, a computational grammar of Norwegian developed on the basis of Lexical-Functional Grammar (LFG). The authors distinguish between fixed, semi-fixed, and syntactically flexible MWEs. LFG analysis involves two levels of syntactic representation: constituent structure (c-structure) and functional structure (f-structure). First, the c-structure is built with the help of phrase structure rules and the lexicon; the f-structure is then derived from the c-structure. The chapter gives a detailed description of the methodologies for representing the three types of MWEs in NorGram. The authors distinguish between eight main types of complementation patterns of phrasal verbs and discuss specific realizations of these patterns.

On the whole, this chapter provides valuable material about the integration of MWEs into LFG analysis that may be of interest for the investigation of MWEs not only in Norwegian but also in other languages.

The fourth chapter, ''Issues in Parsing MWEs in an LFG/XLE framework'', is authored by Stella Markantonatou, Niki Samaridi, and Panagiotis Minos. It deals with a system for parsing Modern Greek multiword expressions with LFG/XLE grammars. The general idea that underlies the chapter is the differentiation between fixed and flexible parts of MWEs, the former being treated as words with spaces (WWS), i.e. single syntactic and semantic units. The system comprises four modules, viz. a part-of-speech tagger, a lexicographic tool for the formal description of MWEs, a filter, and LFG/XLE grammars.

The general idea that underlies this chapter (to differentiate between flexible and fixed parts of MWEs) is sound, but its realization is far from perfect. Modern Greek is a morphologically rich language with relatively free word order, and it is not clear why the authors decided to employ LFG, which was developed for English, a morphologically poor language with rigid word order. They point to some problems they faced in applying LFG (pp. 111-113) without providing convincing arguments for the choice of LFG.

The scheme that represents the parsing system's architecture (p. 110) is inadequate, as it lacks any preprocessing module that performs lexical and syntactic decomposition, as well as the formatter module, to which the authors refer (p. 119) without giving any description. A general requirement for a paper that hinges upon the functioning of a computer system is that the system be represented in a data flow diagram. This chapter lacks such a diagram. The screenshots that illustrate the functioning of the main modules (figures 3-5) are of low quality, and some of their parts are not discernible. These drawbacks significantly diminish the scientific quality of the chapter.

The fifth chapter, ''Multiword expressions in multilingual applications within the Grammatical Framework'', written by Krasimir Angelov, focuses on the ways MWEs are represented in the Grammatical Framework, a programming language for developing multilingual applications such as machine translation and question answering systems. It describes methods for encoding linguistic units in the Grammatical Framework. The author suggests factorization as a methodology for analyzing MWEs with non-compositional meaning.

It was difficult for me to grasp the aim of the paper, because the author constantly refers to difficulties that the Grammatical Framework faces, coming to the conclusion that ''translation via the vanilla resource grammar is far from perfect'' (p. 144), and ''current case by case solution does not scale well for open domain applications'' (ibid.). The so-called ''factorization'' is illustrated by examples of sentences that don't contain any idiomatic expressions; moreover, the statement that the translation is non-compositional is incorrect, because the German equivalent for ''My name is John'' may be ''Mein Name ist John'', which is quite acceptable in modern German. Generally, many languages that have simple verbal predicates to express this idea also have predicative variants, cf. ''me llamo Alex'' and ''mi nombre es Alex'' in Spanish. The author didn't provide any evidence of the usefulness of the Grammatical Framework for the interpretation of MWEs.

The sixth chapter, ''Statistical MWE-aware parsing'', written by Mathieu Constant, Gülşen Eryiğit, Mike Rosner, and Gerold Schneider, focuses on different approaches that have been developed for statistical MWE-aware parsing. The chapter opens with a brief overview of the main approaches to statistical parsing, statistical and dependency formalisms, and transition-based and graph-based approaches to dependency parsing. It then outlines chunking, subtree, and multilayer representations of MWEs. Identification of MWEs may be performed before or after sentence parsing; accordingly, there are two main approaches, based on preprocessing and post-processing. The preprocessing approach involves two types of methodology: concatenation, when an MWE is identified as a single token during tokenization, and substitution, when the MWE is replaced by its head word. The authors show the advantages and disadvantages of these methodologies. Discussing post-processing approaches, they soundly distinguish between MWE identification and discovery. Identification is the process of recognizing MWEs in context, while discovery aims at creating a lexicon of MWE types from some other lexicon. The chapter demonstrates how the use of T-score and Yule's K filters allows effective recognition of MWEs by estimating the degree of non-modifiability of candidate expressions. Precision, recall and F-score metrics are used to evaluate this type of parser. In the case of dependency parsing, the number of dependencies produced by the parser should equal the total number of dependencies in the gold standard parse tree. Common metrics for evaluating this type of parser include the percentage of tokens with the correct head and the percentage of tokens with both the correct head and the correct dependency label. The authors show how the identification of MWEs affects the quality of parsing.
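The two dependency parsing metrics mentioned above are commonly called the unlabeled and labeled attachment scores (UAS and LAS). A minimal sketch of how they can be computed is given below; the function name, the (head, label) data layout, and the toy parse are assumptions of mine for illustration, not the chapter's evaluation code.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    `gold` and `predicted` are lists with one (head_index, label) pair
    per token, in the same token order. UAS counts tokens whose head is
    correct; LAS counts tokens whose head and label are both correct.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    head_ok = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    both_ok = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * head_ok / total, 100.0 * both_ok / total


# Toy four-token sentence; heads are 1-based token indices, 0 is the root.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

The length check mirrors the requirement that the parser produce as many dependencies as the gold standard tree contains; when MWEs are concatenated into single tokens before parsing, the gold and predicted token sequences have to be aligned first for this comparison to be meaningful.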

This chapter is a substantial review that provides useful information about MWE representation, orchestration, and external resource integration. It can be of interest to many experts in the field of natural language processing.

Chapter Seven, entitled ''Investigating the effect of automatic MWE recognition on CCG parsing'', is written by Miryam de Lhoneux, Omri Abend, and Mark Steedman. It focuses on the impact of MWE recognition on parsing with Combinatory Categorial Grammar (CCG). CCG is a strongly lexicalized formalism that can deal with long-range dependencies and presents syntax and lexicon as interacting modules. The chapter opens with a review of experiments that demonstrate the positive effect of correct MWE recognition on syntactic parsing. To test how MWE recognition affects CCG parsing, the authors suggest first recognizing MWEs in the unlabeled version of CCGbank, and then collapsing MWEs to one lexical item in the annotated version of the treebank and in the unlabeled test data. The experiments conducted by the authors involved matching the results obtained on the annotated treebank and the unlabeled test data against reference data. The results of the experiments show that automatic MWE recognition has a positive impact on parsing accuracy and produces a good training effect. The results also show that collapsing MWE units to one token is most useful for MWEs made up of proper nouns.

This chapter provides valuable and far broader insight than existing work into the impact of automatic MWE recognition on parsing quality. The authors developed an original experimental methodology, using a whole array of tools to distinguish, for the first time, between parsing and training effects. This methodology can be used to assess not only CCG parsing, but also parsing within other formalisms.

The eighth chapter, ''Multilingual parsing and MWE detection'', is written by Vasiliki Foufi, Luka Nerima, and Eric Wehrli. It focuses on collocations consisting of content words (in contrast to stop words). The authors argue that the identification of collocations and parsing are interrelated processes. The common approach of treating MWEs as words-with-spaces doesn't work well for collocations, because they have high morphosyntactic flexibility. A separate section of the chapter is devoted to the Fips parser, a multilingual parser that works on a manually built lexicon designed to detect collocations of various types, including nominal and verbal ones. Thanks to its built-in anaphora resolution module, it copes with the recognition of pronominal substitution and can detect collocations whose elements are separated by many intervening words. The parser processed the English corpus first with the collocation detection module switched off and then with this module switched on. It turned out that collocation knowledge significantly improves part-of-speech recognition.

In this chapter the authors have succeeded in demonstrating the close interrelation between collocation identification and syntactic parsing. Provided that collocation identification is a part of the parsing process, it can improve parsing quality by resolving lexical and syntactic ambiguities.

The ninth chapter, ''Extracting and aligning multiword expressions from parallel corpora'', written by Nasredine Semmar, Christophe Servan, Meriama Laib, Dhouha Bouamor, and Morgane Marchand, addresses the task of extracting and aligning MWEs from parallel corpora. The authors adopt the classification of Sag et al. (2002), according to which MWEs are divided into lexicalized and institutionalized phrases. The former are classified into semi-fixed, fixed, and syntactically flexible expressions. Semi-fixed expressions include non-decomposable idioms, compound nominals and proper names. Syntactically flexible ones comprise verb-particle constructions and decomposable idioms. Institutionalized phrases include anti-collocations. It should be noted at once that placing fixed expressions between semi-fixed and syntactically flexible ones (fig. 1, p. 242) is not quite logical. Arranged according to the degree of flexibility, the order should be ''fixed'' - ''semi-fixed'' - ''syntactically flexible'', or ''syntactically flexible'' - ''semi-fixed'' - ''fixed''. The specific expressions that the authors give to exemplify different types of MWEs are poorly chosen. Stating that fixed expressions do not admit morphological and syntactic variation, they give such examples as ''nest of vipers'' and ''pomme de terre'', which can actually be used in the plural form and so cannot be considered fixed. Exemplifying semi-fixed expressions, the authors again give the ''pomme de terre'' phrase, pointing to the fact that it can take the plural ending (p. 243). Having included anti-collocations among institutionalized phrases (fig. 1, p. 242), they state that institutionalized phrases ''often refer to 'collocations'... '' (p. 244). Why these phrases are termed ''anti-collocations'' remains completely unclear.

The main part of the chapter falls into two distinct sections. The first one deals with MWE extraction and alignment. The other section hinges upon the impact of MWE alignment on the Moses machine translation system. The authors suggest three methods for such an evaluation: the ''corpus'' method, the ''table'' method, and the ''feature'' method. It turned out that the best improvement was achieved by using the ''feature'' method. The main drawback of the main part of the chapter is that the authors often do not give information about the source material they use. They exemplify the statistical approach with English sentences and their French equivalents (Table 1, p. 246) without giving any information about them. Why did the authors select these specific sentences? Where were the sentences taken from? Are the sentences representative for the given task? The authors didn't provide information to answer these questions. The same goes for the 12 phrases in Table 3 (p. 249), the text material in Table 4 (p. 252), figures 3 and 4 (p. 253), and figure 5 (p. 255). The lack of information about source material significantly diminishes the scientific quality of the chapter and undermines the validity of the experimental results.

The last chapter, ''Cross-lingual linking of multi-word entities and language-dependent learning of multiword entity patterns'', written by Guillaume Jacquet, Maud Ehrmann, Jakub Piskorski, Hristo Tanev, and Ralf Steinberger, deals with the recognition of names of organizations (NOOs) in ''Europe Media Monitor'' (EMM), a meta-news platform that gathers about 300,000 news articles per day in about 70 languages. Recognition of NOOs presents difficulties because of the large number of acronyms that have to be associated with long forms. Long (expanded) forms may differ in length (cf. ''Space Station'' and ''International Space Station'') and may take inflections. Thus, one acronym may correspond to several expanded variants. As EMM is very big, using traditional linguistic tools such as POS tagging, which underlies parsing, was problematic, and the authors decided to develop an original methodology that does not rely on them. The authors developed four aggregation methods: monolingual expansion aggregation, multilingual expansion aggregation, aggregation based on similar tokens, and aggregation based on translated tokens. To the last two methods they applied cosine and CombMNZ similarity measures. The efficiency of the developed methods was assessed against a gold standard in terms of precision and recall. Multilingual expansion aggregation showed the best result. A special section of the chapter focuses on the task of learning the structural patterns of MWEs to facilitate the recognition of new, not previously mentioned MWEs. To collect source material the authors used BabelNet, a semantic network that contains about 7.7 million named-entity-related synsets. They developed a metalanguage to encode the structural patterns of NOOs. Each pattern includes a natural language unit (surface form) and a token class element. Based on combinations of these parameters, the authors performed filtering to significantly reduce the number of patterns. To assess the quality of NOO recognition, the authors matched their patterns against two existing named-entity annotated corpora and obtained promising results.
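To make the token-based aggregation idea more concrete, here is a minimal sketch of cosine similarity computed over the tokens of two expanded forms. It only illustrates the similarity measure named above; the whitespace tokenization, the lowercasing, and the function name are my assumptions and do not reproduce the authors' implementation.

```python
import math
from collections import Counter


def cosine_similarity(form_a, form_b):
    """Cosine similarity between two expanded forms, viewed as bags of tokens.

    Tokenization by whitespace and lowercasing are simplifying assumptions.
    """
    a = Counter(form_a.lower().split())
    b = Counter(form_b.lower().split())
    # Dot product over the tokens the two forms have in common.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two variants cited in the review share two of three tokens.
print(round(cosine_similarity("Space Station", "International Space Station"), 4))  # 0.8165
```

Two expanded forms of the same acronym that differ only by an extra token score high on this measure, which is what makes it usable for grouping variants without POS tagging or parsing.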

This chapter is an example of research that relies on term weighting and similarity metrics without using sophisticated and resource-consuming linguistic techniques such as POS tagging and parsing. This approach resembles in a way the one I suggested earlier [2]. The authors have done a great deal of work developing numerous methodologies for the extraction and recognition of names of organizations that may be of interest to researchers investigating problems in named-entity processing.


The book comprises chapters that differ in size and quality, the longest being Chapter 3 (40 pages) and the shortest Chapter 5 (20 pages). The latter reads as a stand-alone paper rather than a chapter, and it falls outside the book's scope.

MWE is an umbrella term used to denote various linguistic units that can be classified according to semantic, syntactic, pragmatic and functional criteria. According to the functional criterion, parenthetical and connective constructions (''on the one hand'', ''because of'') can be distinguished; numerical expressions, light verbs (''give a laugh'', ''have a meal''), and verbs with postpositions (''take off'') are expressions that can be differentiated by the syntactic criterion; according to the semantic criterion, multiword expressions that denote one object fall into one group (''New York'', ''hot dog''); greetings, farewells, and metaphoric constructions (''as thin as a rail'') may be differentiated by the pragmatic criterion. I give this brief classification (based on the idea of the authors of Chapter 7) to show how vast the domain of multiword expressions research is. And I can't say that the book gives a full picture of this domain. It does not contain a single comment on the apparent metaphorical nature of pragmatic phraseological constructions that are used to produce a stylistic effect on the listener. Metaphor processing [3] has been developing intensively over the last decades and can provide valuable information about the structure of these constructions, which might have been of use to the authors of the second chapter, as they use ''figuration'' as a distinctive feature of such constructions. The chapters of the book rely heavily on existing tools for MWE processing instead of creating new ones. The only exception is Chapter 8. Nevertheless, many chapters present original experimental methodologies that may be of interest to experts and researchers in various fields of natural language processing.


This review was written with the support of the Russian Foundation for Basic Research, grant 20-07-00124.


1. Tan, P.N. et al. (2005). Cluster analysis: basic concepts and algorithms. URL:

2. Yatsko, V. A. (2013). The algorithms for proper names recognition. In: Nauchno-tekhnicheskaya informatsiya. Series 2, no. 5, pp. 34-39. (In Russian).

3. Shutova, E. et al. (2012). Statistical metaphor processing. In: Computational Linguistics, vol. 39, no. 2, pp. 301-353. URL:

Viatcheslav Yatsko is an independent researcher, ScD, an expert in computational linguistics.