LINGUIST List 29.4869

Thu Dec 06 2018

Support: English; Computational Linguistics, Text/Corpus Linguistics: PhD, University of Birmingham & Alan Turing Institute

Editor for this issue: Yiwen Zhang

Date: 04-Dec-2018
From: Jack Grieve
Subject: English; Computational Linguistics, Text/Corpus Linguistics: PhD, University of Birmingham & Alan Turing Institute, United Kingdom

Web Address:

Level: PhD

Institution/Organization: University of Birmingham & Alan Turing Institute

Duties: Research

Specialty Areas: Computational Linguistics; Text/Corpus Linguistics

Required Language(s): English (eng)


Web Archives and Cities: Mining the Web to Learn our Cities
Alan Turing Institute Doctoral Scholarship
Supervisors: Emmanouil Tranos (University of Birmingham & Alan Turing Institute) & Jack Grieve (University of Birmingham)

This project will use an innovative data source: billions of web pages under the .uk domain archived during the period 1996-2013. It will exploit the unstructured textual data contained in these web pages in order to understand the changes that cities in the UK have undergone. An essential element of this process will be the geolocation of these data. Specifically, the project will answer the following key research questions:
- How are the dynamics of the UK urban system reflected in online internet content?
- Can we detect or even predict the dynamics of the inner structures of cities in the UK by mining online content?
- Can we understand urban functions and create urban typologies by using online content? How does such a ‘digital’ understanding of cities compare with our long-standing understanding based on traditional data sources?

This project will use data from sources including, but not limited to, the Internet Archive, the most complete archive of web pages (Holzmann et al., 2016; Ainsworth et al., 2011). It will employ the JISC UK Web Domain Dataset, a subset of the Internet Archive curated by the British Library. These data contain billions of web addresses of web pages within the .uk domain that were archived by the Internet Archive during the period 1996-2013, together with their archiving timestamps. The British Library has also generated a subset of this dataset, called Geoindex, which contains circa 2.5 billion web addresses of archived web pages that include at least one UK postcode.
These unstructured textual data will be interrogated using corpus analytics in order to derive meanings, themes and classifications. The student will have the opportunity to approach the above questions from specific thematic viewpoints, including, but not limited to, land values, tourism and local governance. Topic modelling and similar methods will first be applied to small samples of the corpora and then scaled up. These methods will be coupled with statistical modelling and spatial analysis in order to understand the spatiality of these processes.

The successful applicant will have:
- A relevant social science background in either geography/planning/urban studies or linguistics. Alternatively, a computer science background and a willingness to engage with the above disciplines.
- A strong computational background, including experience in R or Python.
- Good statistical knowledge.
- Preferably, experience in Natural Language Processing and Machine Learning.

Funding Notes:
To support students, the Turing offers a generous tax-free stipend of £20,500 per annum, a travel allowance and conference fund, and tuition fees for a period of 3.5 years. The scholarship is open to UK/EU students only.
The Turing doctoral studentship scheme combines the strengths and expertise of world-class universities with the Turing’s unique position as the UK’s national institute for data science and artificial intelligence, to offer an exceptional PhD programme.
Turing doctoral students spend approximately half of their time based at the Institute headquarters at the British Library in London. They will apply and register for their doctorate at the University of Birmingham, where they will spend the remainder of their time.

For more information please see

Applications Deadline: 22-Jan-2019

Web Address for Applications:

Contact Information:
        Jack Grieve

Page Updated: 06-Dec-2018