Death registers, historic IQ measures, marriage certificates – mapping our pasts and reconstructing our lifecourses.
Professor Chris Dibben, of our own School of GeoSciences here in Edinburgh, presented on the ambitious project ‘Digitising Scotland’. He explained that the project’s aim is to digitally capture key life events from birth, marriage and death certificates, along with some IQ test records, for the Scottish population. These key life moments will be joined together to construct lifecourses and then linked across generations, e.g. connecting the lifecourses of children to those of their parents, to create intergenerational data sets. The project aims to build on classic studies such as the Dutch famine and Överkalix studies and, especially, Scotland's Lothian Birth Cohorts of 1921 and 1936.
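As a rough illustration of the linkage idea described above, the sketch below groups a handful of invented event records into lifecourses and then connects children to parents. It is not the project's method: the record fields, the names and the exact name matching are all assumptions made for the example, and real linkage of historical records would have to cope with spelling variants, name changes and people who share a name.

```python
from collections import defaultdict

# Invented event records for illustration; real certificates carry far more fields.
events = [
    {"type": "birth", "year": 1860, "person": "Ann Reid",
     "mother": "Mary Reid", "father": "John Reid"},
    {"type": "marriage", "year": 1882, "person": "Ann Reid"},
    {"type": "death", "year": 1931, "person": "Ann Reid"},
    {"type": "birth", "year": 1835, "person": "Mary Reid",
     "mother": None, "father": None},
    {"type": "death", "year": 1899, "person": "Mary Reid"},
]


def build_lifecourses(records):
    """Group events by person (exact name match, an assumption) and sort by year."""
    courses = defaultdict(list)
    for rec in records:
        courses[rec["person"]].append(rec)
    for person_events in courses.values():
        person_events.sort(key=lambda rec: rec["year"])
    return dict(courses)


def link_generations(courses):
    """Return (child, parent) pairs where the named parent also has a lifecourse."""
    links = []
    for person, person_events in courses.items():
        for rec in person_events:
            if rec["type"] != "birth":
                continue
            for role in ("mother", "father"):
                parent = rec.get(role)
                if parent in courses:
                    links.append((person, parent))
    return links


lifecourses = build_lifecourses(events)
generation_links = link_generations(lifecourses)
```

Here `lifecourses["Ann Reid"]` holds her birth, marriage and death in date order, and `generation_links` connects her to her mother because the mother has a lifecourse of her own in the data.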
Prof. Dibben explained that the project is in its early stages and that the talk would necessarily focus on the methodologies being used. He outlined some of the challenges of digitally capturing such large amounts of data and converting them to research-usable formats. Scottish vital events have been systematically recorded since 1855 but only in electronic format since 1974. There are an estimated 24 million event records containing 50 million occupational strings and 8 million causes of death. The records are mainly held as scanned images of handwritten documents, and processing them requires some innovative methods.
Occupational strings are translated into a standard classification, the Historical International Standard Classification of Occupations (HISCO). Similarly, causes of death are recorded using a modified version of the International Classification of Diseases (ICD-10).
Digitising the occupations involves translating the rich written descriptions into a social class. To achieve this, machine learning is being used: the computer learns to recognise the textual descriptions and connect them to the appropriate HISCO code. To learn, the computer needs a gold standard of data with similar content, and the Cambridge Family History study is being used for this purpose. Prof. Dibben explained how algorithms are used to convert the strings of words to the correct code, and how an assessment of different algorithms showed that Naïve Bayes was the most effective, with a 90% level of accuracy.
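To give a feel for how a Naïve Bayes classifier turns a string of words into a code, here is a minimal sketch written from scratch. It is only an illustration under stated assumptions: the training pairs are invented, the labels (`MINER`, `FARM`, `WEAVER`) are placeholders rather than real HISCO codes, and the project's actual features and training data were not described in this level of detail.

```python
import math
from collections import Counter, defaultdict

# Invented "gold standard" pairs; the labels are placeholders, not real HISCO codes.
training = [
    ("coal miner", "MINER"),
    ("miner of coal", "MINER"),
    ("ironstone miner", "MINER"),
    ("farm servant", "FARM"),
    ("farm labourer", "FARM"),
    ("ploughman on farm", "FARM"),
    ("cotton weaver", "WEAVER"),
    ("hand loom weaver", "WEAVER"),
]


def tokenize(text):
    """Lower-case the string and split it on whitespace."""
    return text.lower().split()


def train(pairs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in pairs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
```

With this toy data, `classify("coal miner at pit", model)` comes out as `MINER` even though "at" and "pit" were never seen in training, because the classifier weighs up the words it does recognise.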
The method for classifying causes of death is similar and this time datasets from Kilmarnock, Tasmania and Massachusetts, all from the same time period, are used as the gold standard. However, the task is more difficult as causes of death are written in complex textual descriptions and the same cause of death may be recorded differently. This was highlighted by the varying ways the deaths of people who drowned in the Tay Bridge disaster had been recorded. Changes over time in medical knowledge and terminology have also led to different classifications of diseases being introduced. The task is further complicated as historical descriptions need to be translated to modern clinical codes.
In the final part of his presentation, Prof. Dibben talked about geocoding of the event records, which will be key to enabling spatial analysis of the data. The aim is to match modern addresses, which are very accurate, to historical ones. Historical addresses often contain errors but, usefully, event records accurately record the registration district where the event occurred. Additionally, there is a rich source of historical maps available to match historical locations to modern ones. Combining information from these different sources is helping improve the accuracy of the geocoding of the records, a process that is currently ongoing.
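The idea of combining a reliable registration district with an error-prone address can be sketched roughly as follows. This is an assumption-laden illustration, not the project's geocoder: the gazetteer entries and coordinates are invented, and approximate string matching with Python's standard `difflib` module is simply one plausible way to tolerate spelling errors.

```python
from difflib import SequenceMatcher

# Invented gazetteer of modern addresses: (address, district, latitude, longitude).
modern_addresses = [
    ("12 High Street", "Dundee", 56.4600, -2.9700),
    ("12 High Street", "Perth", 56.3960, -3.4310),
    ("3 Perth Road", "Dundee", 56.4570, -2.9900),
]


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def geocode(historical_address, district, gazetteer, threshold=0.8):
    """Match an address within its registration district.

    Returns (latitude, longitude) for the most similar modern address,
    or None if nothing in the district is similar enough.
    """
    # The district is treated as reliable, so it narrows the search first.
    candidates = [row for row in gazetteer if row[1] == district]
    best_row, best_score = None, 0.0
    for row in candidates:
        score = similarity(historical_address, row[0])
        if score > best_score:
            best_row, best_score = row, score
    if best_row is None or best_score < threshold:
        return None
    return best_row[2], best_row[3]
```

For example, `geocode("12 Hihg Stret", "Dundee", modern_addresses)` resolves to the Dundee entry despite the misspelling, while the same street name in Perth is never considered because the district rules it out. The `threshold` value is arbitrary here and would need tuning against real data.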
Although the project is only part way through, Prof. Dibben expressed his hope that the results of the analysis will soon be available, delivering large, rich data sets able to answer spatially based questions about the past lives of people in Scotland.
(MSc in GIS at the University of Edinburgh)