Scotland's Census 2022 - Census–CCS Linking – Final Methodology
Background
All households in Scotland are required to complete a census return for all usually resident persons, although sometimes people are missed. Estimation is the process by which the true population of Scotland is estimated from the number of records on the census (see Figure 1). In producing population estimates from the census, the primary issue is undercount, which arises where households do not complete a questionnaire.

To avoid underestimating the population as a result, a sample of areas is surveyed again in a Census Coverage Survey (CCS). This is carried out a few weeks after census day and is designed to be independent of the census. For example, the collection mode is different (the CCS collection is enumerator-led), and the address frame is produced manually.
The CCS records are linked to census records in order to count the number of people appearing on both, and the number appearing only on the census, or only the CCS. A logistic regression model then uses these counts to estimate the total population.
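The three counts described above can be illustrated with a minimal sketch. The function name, the record identifiers and the assumption that links are one-to-one are all hypothetical, and the logistic regression model itself is not shown.

```python
def coverage_counts(census_ids, ccs_ids, links):
    """Count people on both sources, on the census only, and on the CCS only.

    links is an iterable of (census_id, ccs_id) pairs, assumed one-to-one.
    """
    links = list(links)
    linked_census = {census_id for census_id, _ in links}
    linked_ccs = {ccs_id for _, ccs_id in links}
    return {
        "both": len(links),
        "census_only": len(set(census_ids) - linked_census),
        "ccs_only": len(set(ccs_ids) - linked_ccs),
    }


# Illustrative identifiers only; these are not real records.
counts = coverage_counts(
    census_ids=["c1", "c2", "c3"],
    ccs_ids=["s1", "s2"],
    links=[("c1", "s1")],
)
```

In this example one person appears on both sources, two appear only on the census and one appears only on the CCS.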
An address linking process was run in addition to person linking, with both sets of links being used to identify linked households, which also feed into the estimation system.
If people and households that are present on both the census and the CCS are not identified as such, they are treated as appearing on only one source, which artificially inflates the population estimates and contributes to overcount. It is therefore vital that the linking process is as accurate as possible, so that the estimate of people and households who did not complete a census return is itself accurate.
In this paper a link is defined as two records that have been connected by the linking process, and a match is defined as two records that represent the same person.
This paper presents the methodology used to link census and CCS records, including both person and household linking. It is an update to the original PMP010: Census to CCS Linking External Methodology Assurance Panel (EMAPs) paper, which was published before data processing started.
2022 Method
2.1 Census to CCS Linking Core Method
Each CCS record is compared to census records on name, date of birth, postcode and sex. Comparing every CCS record to every census record would be infeasible, so we separate the data into blocks and compare each CCS record only to the census records in the same block. A block is a set of records that are compared only to other records in the same block during linking (see Steorts et al. (2014) for a discussion of blocking). For example, records in one postcode might be compared only to other records in that postcode. To avoid missing matches, linking is repeated using a range of blocks such as postcode sector, date of birth, and parts of the name.
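The idea of repeating the comparison over several blocks can be sketched as follows. This is an illustration only, not the production code: the record layout, the function names and the two blocking keys are assumptions made for the example.

```python
from collections import defaultdict


def postcode_sector(rec):
    """Hypothetical blocking key: the postcode without its last two characters."""
    postcode = rec.get("postcode", "")
    return postcode[:-2].strip() if len(postcode) > 2 else ""


def date_of_birth(rec):
    """Hypothetical blocking key: the full date of birth."""
    return rec.get("dob", "")


def candidate_pairs(census, ccs, block_keys):
    """Return every (ccs_id, census_id) pair that shares a block in any pass."""
    pairs = set()
    for key in block_keys:
        # One blocking pass: group census records by the value of this key.
        blocks = defaultdict(list)
        for rec in census:
            value = key(rec)
            if value:  # records with a missing key value join no block
                blocks[value].append(rec["id"])
        # Compare each CCS record only with census records in its own block.
        for rec in ccs:
            value = key(rec)
            if value:
                for census_id in blocks.get(value, []):
                    pairs.add((rec["id"], census_id))
    return pairs


# Illustrative records only.
census = [
    {"id": "c1", "last_name": "Smith", "dob": "1980-02-01", "postcode": "EH1 3YT"},
    {"id": "c2", "last_name": "Brown", "dob": "1975-07-19", "postcode": "G12 8QQ"},
]
ccs = [
    {"id": "s1", "last_name": "Smyth", "dob": "1980-02-01", "postcode": "EH1 3YS"},
    {"id": "s2", "last_name": "Brown", "dob": "1975-07-19", "postcode": "AB10 1XG"},
]
pairs = candidate_pairs(census, ccs, [postcode_sector, date_of_birth])
```

Here the pair s2 and c2 is missed by the postcode sector pass, because the postcodes differ, but is recovered by the date of birth pass. This is the reason for running more than one blocking pass.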
Each comparison is scored according to the similarity of each item of information (first name, last name and date of birth). These scores are used to categorise the link and assign a distance score. This distance score is combined with a score derived from the similarity of the postcodes to give an overall distance score. If a CCS record links to one census record with a better overall distance score than its link to any other census record, that link is made automatically.
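A minimal sketch of the scoring and automatic linking decision is given below. It rests on several assumptions that are not taken from the production system: similarity is measured with Python's difflib.SequenceMatcher, the overall distance is an unweighted sum of the field distances, lower distances are better, and the step that categorises links is omitted.

```python
from difflib import SequenceMatcher


def field_distance(a, b):
    """Distance in [0, 1]: 0 for identical strings, 1 if either is missing."""
    if not a or not b:
        return 1.0
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def overall_distance(ccs_rec, census_rec):
    """Combine the person-level distance with a postcode distance (unweighted)."""
    person = sum(
        field_distance(ccs_rec.get(f, ""), census_rec.get(f, ""))
        for f in ("first_name", "last_name", "dob")
    )
    postcode = field_distance(
        ccs_rec.get("postcode", ""), census_rec.get("postcode", "")
    )
    return person + postcode


def auto_link(ccs_rec, candidates):
    """Return the census id of the single best candidate, or None on a tie."""
    scored = sorted((overall_distance(ccs_rec, c), c["id"]) for c in candidates)
    if not scored:
        return None
    if len(scored) == 1 or scored[0][0] < scored[1][0]:
        return scored[0][1]
    return None  # no unique best candidate, so leave for clerical review


# Illustrative records only.
person = {"first_name": "Anna", "last_name": "Smith",
          "dob": "1980-02-01", "postcode": "EH1 3YT"}
exact = dict(person, id="c1")
other = {"id": "c2", "first_name": "Iain", "last_name": "Brown",
         "dob": "1975-07-19", "postcode": "G12 8QQ"}
best = auto_link(person, [other, exact])
```

The sketch applies no minimum quality threshold, so a CCS record with a single poor candidate would still be linked. CCS records for which auto_link returns None correspond to those passed on for manual comparison.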
Remaining CCS records are all then manually compared to the list of census records they are linked to (ranked by the overall distance score of the link) to find any remaining matches. In this way we can reduce both the false positives (links made that are not matches) and false negatives (matches that have not been linked).
2.2 Changes made to the methodology during processing
2.2.1 Extra Clerical Review Steps
Quality assurance of the process run on live census data identified further matches amongst CCS records that were initially left unlinked after clerical review. Left unaddressed, these missed matches would contribute to overcount, so additional approaches to clerical review were designed and implemented.
The first extra step used associative linking, based on the approach of the Office for National Statistics (ONS). It involved identifying unlinked person records on the census and the CCS within households that contained other, linked person records.
For example, if a census household and a CCS household both contain two people, and the only link identified between the households is between person one in each, this method would highlight person two as a potential match. This emphasises household composition, using shared individuals as a potential predictor for further links. Although this information is used in the main linking process, this new step presented only these associative linking records to a reviewer. This allowed reviewers to focus on these records in the knowledge that each one has another link within its household.
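The associative linking step can be sketched as below. The data layout (a mapping from person id to household id) and the function name are hypothetical, and the sketch only selects records for a reviewer; it does not decide whether they match.

```python
def associative_candidates(census_people, ccs_people, person_links):
    """Find unlinked people in household pairs that already share a link.

    census_people and ccs_people map person id to household id.
    person_links is a list of (census_person_id, ccs_person_id) pairs.
    """
    linked_census = {c for c, _ in person_links}
    linked_ccs = {s for _, s in person_links}
    # Household pairs connected by at least one linked person.
    household_pairs = {
        (census_people[c], ccs_people[s]) for c, s in person_links
    }
    results = []
    for census_hh, ccs_hh in sorted(household_pairs):
        census_unlinked = sorted(
            p for p, hh in census_people.items()
            if hh == census_hh and p not in linked_census
        )
        ccs_unlinked = sorted(
            p for p, hh in ccs_people.items()
            if hh == ccs_hh and p not in linked_ccs
        )
        # Only present household pairs with unlinked people on both sides.
        if census_unlinked and ccs_unlinked:
            results.append((census_hh, ccs_hh, census_unlinked, ccs_unlinked))
    return results


# The two-person household example from the text, with illustrative ids.
review = associative_candidates(
    census_people={"c1": "H1", "c2": "H1"},
    ccs_people={"s1": "K1", "s2": "K1"},
    person_links=[("c1", "s1")],
)
```

In this example person one is already linked (c1 to s1), so the second person in each household (c2 and s2) is presented to the reviewer as a potential match.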
A second method involved identifying all unlinked census records within the same postcode as an unlinked CCS person record, where they did not clearly represent different people. These records were ranked based on similarity of additional variables not considered within the main person linking methodology, such as ethnicity and marital status.
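A rough sketch of this second method follows. The exclusion rule used here (both records state a sex and the values disagree) is an assumption standing in for the actual test of whether records clearly represent different people, and ranking by a simple count of agreeing variables is likewise illustrative.

```python
EXTRA_VARIABLES = ("ethnicity", "marital_status")


def clearly_different(a, b):
    """Hypothetical rule: both records state a sex and the values disagree."""
    return bool(a.get("sex")) and bool(b.get("sex")) and a["sex"] != b["sex"]


def agreement(a, b):
    """Count the additional variables on which the two records agree."""
    return sum(
        1 for v in EXTRA_VARIABLES
        if a.get(v) is not None and a.get(v) == b.get(v)
    )


def postcode_review_list(ccs_rec, unlinked_census):
    """Rank unlinked census records in the same postcode for manual review."""
    candidates = [
        c for c in unlinked_census
        if c.get("postcode") == ccs_rec.get("postcode")
        and not clearly_different(ccs_rec, c)
    ]
    # Highest agreement first; ties keep their original order.
    return sorted(candidates, key=lambda c: agreement(ccs_rec, c), reverse=True)


# Illustrative records only.
ccs_person = {"postcode": "EH1 3YT", "sex": "F",
              "ethnicity": "Scottish", "marital_status": "Married"}
unlinked = [
    {"id": "c1", "postcode": "EH1 3YT", "sex": "F",
     "ethnicity": "Scottish", "marital_status": "Single"},
    {"id": "c2", "postcode": "EH1 3YT", "sex": "F",
     "ethnicity": "Scottish", "marital_status": "Married"},
    {"id": "c3", "postcode": "EH1 3YT", "sex": "M",
     "ethnicity": "Scottish", "marital_status": "Married"},
    {"id": "c4", "postcode": "G12 8QQ", "sex": "F",
     "ethnicity": "Scottish", "marital_status": "Married"},
]
ranked_ids = [c["id"] for c in postcode_review_list(ccs_person, unlinked)]
```

Record c3 is excluded by the hypothetical sex rule and c4 by postcode, leaving c2 ranked ahead of c1 because it agrees on both additional variables.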
A third method built upon the rationale of associative linking, broadening the scope to consider unlinked census and CCS person records that were at the same address, regardless of whether a person in the household had been linked.
Together, these three methods identified enough additional links, which would otherwise have been missed, to reduce the linkage error rate to an acceptable level, resolving the problem.
2.2.2 Unreviewed CCS Records
During reconciliation of clerical review outcomes, a small number of CCS person records were identified for which the main person linking process had produced no potential links. This was largely because the records held insufficient data in the linking variables, for example a missing name, date of birth or sex. These records were reviewed manually using the methods outlined in Section 2.2.1, and a small number of additional links were identified that would otherwise have been missed.
Conclusion
The methodology for Census–CCS Linking was largely successful, with over 99 per cent of accepted links being identified by the original method. Additional approaches to clerical review were implemented in response to a number of CCS records that had been incorrectly left unlinked. These three new methods were successful in identifying additional links, the majority of which had previously been sent to review and were either deliberately rejected or missed.