Editor for this issue: Jeremy Coburn <jecoburnlinguistlist.org>

Book announced at https://linguistlist.org/issues/30/30-1843.html

AUTHOR: Shay Cohen

TITLE: Bayesian Analysis in Natural Language Processing

SUBTITLE: Second Edition

SERIES TITLE: Synthesis Lectures on Human Language Technologies edited by Graeme Hirst

PUBLISHER: Morgan & Claypool Publishers

YEAR: 2019

REVIEWER: Brett Mylo Drury

SUMMARY

Introduction

Probabilistic reasoning and analysis is a popular subfield of machine learning applied to Natural Language Processing (NLP). One field of probability, Bayesian Statistics, can offer unique techniques that can be exploited by NLP practitioners and academic researchers. A timely publication by Shay Cohen provides an in-depth explanation of Bayesian Analysis applied to NLP.

Chapters

This review is organised around the chapters of the book. The first chapter, Preliminaries, explains basic Bayesian statistics and related principles, including random variables and conditional distributions. This chapter is not for the novice and should be treated as a refresher for researchers who are already familiar with Bayesian statistics, because Cohen covers fundamental principles at breathtaking speed. For example, Directed Acyclic Graphs (DAGs) are dispatched in two pages. To his credit, Cohen directs the reader to standard texts in the area, but his admission that Bayesian Networks are not addressed in depth is a pity, because Bayesian Networks and the associated augmented Naive Bayes classifier have a role to play in the Bayesian analysis of natural language.

The Introduction chapter is the second of eight chapters. In it, Cohen seeks to differentiate language exploration, which is the secondary theme of this book, from NLP. Cohen claims that language exploration deals with the understanding of language, whereas NLP ''learns and perform inferences with data''. This definition is not authoritative, and NLP practitioners and researchers who work in fields such as Natural Language Understanding (NLU) will disagree with it. It seems that the statement was made to justify the remaining content of the book. Although the chapter is called Introduction, a more accurate title would have been Latent Dirichlet Allocation (LDA), because LDA is the dominant theme of the chapter. Except for a small number of pages that introduce and justify Bayesian methods in NLP, the chapter is devoted to an explanation and demonstration of LDA. The chapter provides the theoretical justification of LDA and its advantages over the bag-of-words representation. It is debatable whether this justification still holds, because the majority of the NLP community has moved to word vectors. Although the advantages of using LDA have faded with the introduction of word vectors and language models, the examples provided by Cohen give a good illustration of principles of Bayesian statistics that are relied upon later in the book.

Priors are fundamental to Bayesian statistics, and Priors is the title of the third chapter. Priors are: “distributions over a set of hypothesis”. Priors are essentially pre-existing beliefs about a domain or problem. The chapter covers the following: conjugate priors, priors over multinomial and categorical distributions, non-informative priors, as well as conjugacy and exponential models. The explanations for each of the described priors are clear, and each section ends in a summary. This is a comprehensive handling of priors, and Cohen pays more attention to this area than similar books such as the one by Barber (Bayesian reasoning and machine learning).
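
To make the idea of a conjugate prior concrete, the following is a minimal sketch in Python and is not taken from the book. It assumes a three-word vocabulary and a symmetric Dirichlet prior over a categorical distribution; the function names and the toy data are hypothetical. Because the Dirichlet is conjugate to the categorical distribution, the posterior is again a Dirichlet whose pseudo-counts are the prior pseudo-counts plus the observed counts.

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Add observed word counts to the prior pseudo-counts (conjugate update)."""
    counts = Counter(observations)
    return {word: a + counts.get(word, 0) for word, a in alpha.items()}


def posterior_mean(alpha):
    """Mean of a Dirichlet distribution: each pseudo-count divided by their sum."""
    total = sum(alpha.values())
    return {word: a / total for word, a in alpha.items()}


# Hypothetical toy data: a uniform Dirichlet(1, 1, 1) prior and four observed words.
prior = {"the": 1.0, "cat": 1.0, "sat": 1.0}
observed = ["the", "cat", "the", "the"]

posterior = dirichlet_posterior(prior, observed)
mean = posterior_mean(posterior)
```

Here the posterior pseudo-counts are 4, 2 and 1, so the word ''sat'' keeps a non-zero probability under the posterior mean even though it was never observed.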

Chapter Four is concerned with Bayesian Estimation. Bayesian estimation is the bedrock on which Bayesian Inference is based. Bayesian Inference, which many NLP practitioners are directly or indirectly familiar with, infers a ''posterior distribution from data'' using the ''parameters of a model''. The principal aim of Bayesian estimation, according to Cohen, is to summarise the posterior distribution rather than to capture the distribution in full. The ultimate aim of the technique is to provide interpretable inferences about a specific problem or domain. Cohen provides some example problems, such as syntactic tree generation and sentence alignment, which can be addressed using Bayesian estimation.

The main areas that this chapter covers are:

1. Learning with Latent Variables

2. Bayesian Point Estimation

3. Empirical Bayes

4. Asymptotic Behaviour of the Posterior

Learning with Latent Variables is an opinion section in which the author defines the two main methods of inference from data: using all the observed data, and splitting the data into training and test sets. Cohen treats these strategies as analogous to unsupervised and supervised learning respectively. He concentrates on the second scenario, where he claims that Bayesian Point Estimation provides a suitable compromise between the computationally expensive fully Bayesian approach and the need for efficient, lightweight models that make inferences upon unseen data points. This justification acts as a segue into the Bayesian Point Estimation section of the chapter.

Point estimation is a technique from statistics where a single value is computed from a data sample. This value acts as the best estimate for a given parameter. Point estimation is achieved using point estimators. The focus of a number of Bayesian point estimators is the central tendency of the posterior distribution, which can be estimated using quantities such as the posterior mean and median. As stated in the previous section, Bayesian estimators can be used in the inference process.

Cohen states the goal of Bayesian Point Estimation as: ''summarising the posterior over the parameters into a fixed set of parameters'', and he links this goal to a frequentist approach known as maximum likelihood estimation. Cohen states that Bayesian maximum a posteriori (MAP) estimation is a suitable technique. The remainder of the section describes the mathematical principles of MAP as well as its ability to adhere to the Minimum Message Length principle, which is an encapsulation of Occam's Razor. The section also includes sub-sections on smoothing (default probabilities for words that are absent from the sample data) and regularization, as well as the computation of MAP with latent variables. The section finishes with decision-theoretic point estimation, which introduces the notion of Bayes Risk and, with it, the idea of a loss function.
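
The link between MAP estimation and smoothing can be illustrated with a short Python sketch. It is a toy example under stated assumptions and is not Cohen's formulation: the vocabulary, the counts and the function names are hypothetical, and the prior is assumed to be a symmetric Dirichlet with hyperparameter alpha greater than 1. Under that assumption the MAP estimate of a categorical distribution is (count + alpha - 1) / (N + V(alpha - 1)), which for alpha = 2 is the familiar add-one smoothing.

```python
from collections import Counter


def mle(counts, vocab):
    """Maximum likelihood estimate: relative frequencies."""
    total = sum(counts[w] for w in vocab)
    return {w: counts[w] / total for w in vocab}


def map_estimate(counts, vocab, alpha=2.0):
    """Mode of the Dirichlet posterior under a symmetric Dirichlet(alpha) prior."""
    if alpha <= 1.0:
        raise ValueError("this closed form assumes alpha > 1")
    total = sum(counts[w] for w in vocab) + len(vocab) * (alpha - 1.0)
    return {w: (counts[w] + alpha - 1.0) / total for w in vocab}


# Hypothetical toy sample in which the word "mat" is never observed.
vocab = ["the", "cat", "sat", "mat"]
counts = Counter(["the", "cat", "the", "sat"])

theta_mle = mle(counts, vocab)
theta_map = map_estimate(counts, vocab, alpha=2.0)
```

The maximum likelihood estimate assigns probability zero to the unseen word ''mat'', whereas the MAP estimate assigns it 1/8, which is the default probability that smoothing is meant to provide.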

The remaining sections are quite short when compared with point estimation. Empirical Bayes discusses a method of encoding information into hyperparameters. The final section briefly discusses the consequences of the data being sampled from a distribution that does not belong to the model family.

Sampling Methods is the next chapter. It describes some data sampling techniques that are used to estimate the posterior. These techniques need to be used when the posterior probability cannot be represented or efficiently computed. Sampling draws samples from the underlying distribution, and from these samples, a distribution can be inferred.

The chapter concentrates upon the Monte Carlo family of sampling techniques, and in particular Markov Chain Monte Carlo (MCMC). As part of this focus, the chapter starts with an overview of the family of MCMC techniques.

The sampling technique for Bayesian statistics that most people are familiar with is Gibbs sampling, and a large part of the chapter is devoted to this technique. It also discusses a variant of the technique - Collapsed Gibbs Sampling. In both cases, a comprehensive mathematical treatment is given, along with the differences between the two techniques. Cohen highlights the main drawback of Gibbs sampling, which is that it can be computationally expensive. Consequently, he describes a method for parallelising the technique across multiple processors. The chapter also considers techniques other than Gibbs sampling, such as Metropolis-Hastings, Slice Sampling and Simulated Annealing. The chapter concludes with a discussion of the convergence of MCMC algorithms and some theory about Markov Chains, as well as a brief discussion of alternatives to MCMC sampling techniques.
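
For readers who have not met Gibbs sampling before, the following Python sketch shows the core loop on a deliberately simple target. It is an illustration under stated assumptions and not an example from the book: the target is assumed to be a standard bivariate normal with correlation rho, for which each full conditional is a univariate normal with mean rho times the other variable and variance 1 - rho^2. The function names, the burn-in length and the seed are hypothetical choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Alternately resample x given y and y given x from their full conditionals."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # standard deviation of each conditional
    x, y = 0.0, 0.0
    samples = []
    for step in range(n_samples + burn_in):
        x = rng.gauss(rho * y, cond_sd)  # draw x | y
        y = rng.gauss(rho * x, cond_sd)  # draw y | x
        if step >= burn_in:  # discard the early, unconverged part of the chain
            samples.append((x, y))
    return samples


def correlation(samples):
    """Empirical Pearson correlation of the sampled pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = correlation(samples)
```

The empirical correlation of the samples should be close to the target value of 0.8. In NLP models the same loop runs over latent variables such as topic assignments, which is where the computational cost that Cohen highlights comes from.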

The variational inference chapter considers an alternative approach to approximate inference. Variational inference treats the estimation of the posterior as an ''optimisation problem''. Cohen states that variational inference borrows concepts from mathematics that are concerned with the minimisation and maximisation of functionals. The chapter is broken down into the following subsections:

1. Variational Bound on Marginal Log-Likelihood

2. Mean-Field Approximation

3. Mean-Field Variational Inference

4. Dirichlet-Multinomial Variational Inference

5. Connection to the Expectation-Maximization Algorithm

6. Empirical Bayes With Variational Inference

The chapter is concluded with a discussion of the contents of the chapter and the main points that were covered.

The variational bound subsection steps through some calculations for what Cohen describes as a typical scenario. Cohen describes the Mean-Field approximation as a technique that defines an approximate posterior family with a factorized form. He also states that, in common with Gibbs Sampling, the technique requires a partition of the latent variables, which are random variables. Cohen goes on to describe the factorized form over these random variables. The Mean-Field Variational Inference subsection describes an algorithm which is typically used in the context of the mean-field approximation. The subsection provides pseudocode as well as an explanation of each phase of the algorithm. The Dirichlet-Multinomial Variational Inference subsection describes an application of Mean-Field Variational Inference to Dirichlet-Multinomial models. The Connection to the Expectation-Maximization Algorithm subsection describes the connection between the Mean-Field Variational Inference algorithm and the Expectation-Maximization algorithm. Finally, the chapter describes the variational algorithm in an Empirical Bayes setting.
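
The bound that underlies these subsections can be written in the following standard form. The notation here is an assumption for illustration and may differ from the book's: x denotes the observed data, z the discrete latent structures, \theta the model parameters, \alpha the hyperparameters, and q the approximate posterior.

```latex
\log p(x \mid \alpha)
  = \log \int \sum_{z} p(x, z, \theta \mid \alpha) \, d\theta
  \;\ge\; \mathbb{E}_{q}\!\left[ \log p(x, z, \theta \mid \alpha) \right]
        - \mathbb{E}_{q}\!\left[ \log q(z, \theta) \right]
  \;=\; \mathcal{F}(q),
\qquad
q(z, \theta) = q(\theta)\, q(z).

\log p(x \mid \alpha) - \mathcal{F}(q)
  = \mathrm{KL}\!\left( q(z, \theta) \,\middle\|\, p(z, \theta \mid x, \alpha) \right) \;\ge\; 0.
```

The inequality follows from Jensen's inequality. The second line shows why maximising the bound over the factorized family is an optimisation problem that brings q as close as possible, in KL divergence, to the true posterior.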

Nonparametric Priors is the next chapter, in which Cohen motivates the use of nonparametric Bayesian modelling with an example of mapping clusters to cluster-specific distributional properties of a word drawn from the vocabulary under study. He states that there are two issues with this type of arrangement: it may generate too few clusters, which will not capture a sample large enough to represent the majority of the clusters, or it may generate too many clusters, which will also capture the noise in the document collection. Cohen states that a way to address this is to use nonparametric Bayesian modelling, which relies on a nonparametric prior. A nonparametric prior, Cohen reminds us, is ''a set of random variables indexed by an infinite, linearly ordered set''. Cohen then provides the example of the Dirichlet process, which uses a nonparametric prior to define a distribution over a set of distributions. The chapter continues with the Dirichlet process and provides various views of the process, including the stick-breaking view and the Chinese restaurant view. The chapter also provides a discussion of Dirichlet process mixture models (DPMMs), which are a ''generalisation of the finite mixture model''. The discussion is mainly based around inference with DPMMs, including Markov Chain Monte Carlo (MCMC) and Variational Inference. The chapter ends with the Hierarchical Dirichlet Process and the Pitman-Yor Process, as well as a discussion.
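
The Chinese restaurant view of the Dirichlet process can be sketched in a few lines of Python. This is a minimal illustration and not code from the book; the function name, the concentration value and the seed are hypothetical. It assumes the standard seating rule: customer n + 1 joins an existing table with probability proportional to the number of customers already seated there, and opens a new table with probability proportional to the concentration parameter.

```python
import random


def chinese_restaurant_process(n_customers, concentration, seed=0):
    """Seat customers one at a time; return table assignments and table sizes."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] is the number of customers at table k
    assignments = []   # assignments[i] is the table chosen by customer i
    for _ in range(n_customers):
        # Existing tables are weighted by their size, a new table by the concentration.
        weights = table_sizes + [concentration]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table (a new cluster)
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(n_customers=100, concentration=1.0)
```

The number of tables is not fixed in advance and grows slowly with the number of customers, which is the property that lets the number of clusters be inferred from the data rather than chosen beforehand.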

The next chapter in the book is Bayesian Grammar Models, which Cohen claims is one of the most successful applications of Bayesian strategies to NLP. This chapter is mainly focused on probabilistic context-free grammars (PCFGs). Cohen provides some justifications for this focus, including that context-free grammars are relatively simple and that the research literature is relatively complete. The first model addressed is the Hidden Markov Model (HMM), which Cohen claims is a special form of context-free grammar. Cohen provides a short description of HMMs as well as the mathematical formalisation. The chapter then addresses probabilistic context-free grammars. The author describes the link between phrase-structure trees and context-free grammars, and provides the mathematical formalisation of PCFGs as well as a discussion of their inference algorithms. The remainder of the chapter covers Bayesian context-free grammars, adaptor grammars, HDP-PCFGs, Synchronous Grammars and Multilingual learning. Each of these sections, like the chapter in general, builds upon the concepts described earlier in the book.
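
As a reminder of what inference in an HMM involves, the following Python sketch computes the likelihood of a word sequence with the forward algorithm. It is a toy example under stated assumptions and is not drawn from the book: the two states, the two-word vocabulary and all of the probabilities are hypothetical.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Sum over all hidden state sequences using the forward recursion."""
    # alpha[s] is the probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model with a noun state "N" and a verb state "V".
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.9, "bark": 0.1}, "V": {"dogs": 0.2, "bark": 0.8}}

likelihood = forward_likelihood(["dogs", "bark"], states, start_p, trans_p, emit_p)
```

With these numbers the likelihood of ''dogs bark'' is 0.44. The recursion plays the same role for HMMs that the inside algorithm plays for PCFGs, which is one way to read Cohen's claim that HMMs are a special form of context-free grammar.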

The final chapter is Representational Learning and Neural Networks. Cohen starts the chapter by describing the advance of representational learning, as well as the conditions required for representational techniques, and their applications. The chapter describes Neural Networks because they are a form of representational learning. The first part of the chapter describes the history of Neural Networks and why they have become popular now. The second part of the chapter covers word embeddings, which are vector representations of words derived from their co-occurrences with other words. As most practitioners and researchers know, word vectors can be generated by word2vec. As this is a Bayesian book, a technique based upon a Bayesian version of word2vec is described. A large part of the chapter describes modern-day Neural Networks, their training techniques and activation functions, as well as the use of word embeddings with Neural Networks. There is a brief discussion of Neural Networks that can retain information across time-steps, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). The chapter ends with a discussion of tuning Neural Networks as well as Generative Modelling with Neural Networks.

Conclusion

It is difficult to place the audience for this book. It is information-dense but a little short on introduction, and is therefore not suitable for beginners or novices to this field. I found myself jumping to reference books and rereading sections to understand the author's point. There is also a significant amount of material omitted that I had hoped the author would cover, such as Bayesian Networks. Additionally, the Neural Network chapter felt a little dated and forced to meet the Bayesian remit of the book. Large language models such as BERT have replaced word vectors for most practitioners. It was also surprising that there was no mention of Bayesian Neural Networks. On the plus side, this book dramatically improved my theoretical understanding of several areas of Bayesian analysis. If you are working in the area and have a strong grasp of it, then this book may be useful. If you are a novice, then there are books you need to read before this one.

ABOUT THE REVIEWER

Brett is a Senior Data Scientist based in Porto, Portugal who works for Skim Technologies. He has a PhD from the University of Porto. His current research interests are causal and logical inference from information in text. He can be contacted at brettskim.it

Page Updated: 18-Nov-2019