Scotland's Census 2022 - Administrative Data Spine for Census Estimation
1. Abstract
Scotland’s census 2022 had a lower than expected response rate, and as a result the decision was made to improve the accuracy of estimation by adding administrative data records to the CCS (Census Coverage Survey). To do so, an administrative data spine needed to be produced, with links to the census and CCS. An administrative data spine was constructed by combining datasets from health, education, vital events registrations and electoral registers. As the resulting records would be used in census estimation, the combination was done in a way so as to minimize the amount of overcoverage in the data.
Two methods of linking were used in the process. One method, based on the method used for census–census linking, was used when linking large datasets together, especially when linking administrative datasets together. The other method was based on census–CCS linking. It was more thorough, but could only be used when one of the datasets was comparatively small. That method was used when linking the records that would be added to the CCS to the census, as these would be the links used in estimation.
2. Introduction and Background
National Records of Scotland (NRS) is responsible for Scotland’s Census. It happens every decade, providing information on all people and households in Scotland. All households in Scotland are required to complete a census return for all usually resident persons, although sometimes people are missed.
Scotland’s Census 2022 faced unexpected challenges, leading to a lower than expected return rate. The scale of this meant that NRS had to develop new methods to count those missed by the census field collection.
Administrative data is information that is created when people interact with organisations and public services, such as schools or the NHS. NRS are using different sources of administrative data to enhance our standard estimation procedures.
Estimation is the process of calculating the total size of the Scottish population, including those who were missed by census collection. Adjustment adds new records to the census database, to match the estimated population. This improves the accuracy of population estimates at lower geographical levels.
This paper will focus on the methodology for creating the administrative data spine, which is then used in Estimation and Adjustment.
3. Method
3.1 Summary
NRS received administrative datasets from different public sector organisations in Scotland. These administrative datasets were linked together to create a population spine. These datasets were linked on personal information such as name, date of birth and sex.
This spine was then linked to the census and CCS, allowing us to count individuals who did not respond to the census, CCS, or both.
Information from administrative datasets is used to determine how sure we are that an individual is in Scotland. We do not want to include individuals who may have left Scotland before census day, or include people at the wrong address within Scotland.
The data is held securely, and datasets are minimized so that staff only have access to the data they require. A minimized version of the spine was used for the census processing. This only contained information that is essential for estimation and adjustment (age, sex, address, census ID, CCS ID, strength of links, ethnicity, full-time student flag, and strength of evidence). Name and date of birth are needed for effective linking, but not for estimation and adjustment, so this information was not included. If the record linked to the census or CCS, then the IDs of those linked records were included. More detail is available in the Data Protection Impact Assessment.
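The minimisation described above can be illustrated with a short Python sketch. It is illustrative only: the snake_case field names and the function `minimise_record` are hypothetical, and only the list of retained items comes from the text.

```python
# Fields kept in the minimised spine, following the list in the text.
# The snake_case field names are hypothetical.
ESTIMATION_FIELDS = (
    "age", "sex", "address", "census_id", "ccs_id", "link_strength",
    "ethnicity", "full_time_student", "strength_of_evidence",
)


def minimise_record(record):
    """Keep only the fields needed for estimation and adjustment.

    Anything else on the record (for example name and date of birth)
    is dropped. Fields that are absent are returned as None.
    """
    return {field: record.get(field) for field in ESTIMATION_FIELDS}
```

Using an explicit list of permitted fields, rather than a list of fields to drop, means that any new identifying variable is excluded by default.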
Statistical models were used to calculate the number of individuals missing from Scotland’s census, and their characteristics. Records from the spine were used alongside CCS records in estimation. Records from the spine were also used to guide the adjustment process that generates records to account for non-response.
3.1.1 Administrative Data Sources
NRS has secured access to a number of high-quality administrative datasets that provide coverage across the whole population of Scotland. The datasets used to create the spine were:
- NHS Central Register (NHSCR): everyone who is or has been registered with a GP in Scotland, or whose birth was registered in Scotland
- Health Activity: people who have interacted with selected NHS services in the 3 years prior to census day
- Electoral Register: people registered to vote in Scotland
- Higher Education Statistics Agency (HESA): higher education students studying or domiciled in Scotland
- School Pupil Census: pupils enrolled in state funded schools in Scotland
- Vital Events: information on births, deaths, marriages and civil partnerships
NHSCR, Health Activity and Electoral Register provide the widest coverage of the entire population of Scotland. Additional datasets are used to find individuals who may not appear on these datasets, such as school children (who will not appear on the electoral register as they cannot vote) or university students, who may not register with a GP, or update their address. Using all six datasets gives us the most complete picture of the population of Scotland.
3.2 Quality Assurance and Standardising the Data
When the raw administrative datasets are received by NRS, a number of data consistency and validation checks are performed. These include:
- Check the dimensions of the dataset received
- Check that all requested variables are present and in the expected formats and values
- Compare the data received with published data and investigate any significant changes
- Check the age distribution of the population
These checks will flag any potential issues in the data received by NRS. If these checks suggest the data may need to be amended/adjusted then the potential issues are communicated with the data supplier so the data can be amended if appropriate.
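As a rough illustration, some of these checks could be automated along the following lines. This is a minimal sketch rather than the checks NRS actually ran: the function name, the `age` field, the expected row range and the `max_age` limit are all assumptions, and the comparison with published data is not covered.

```python
def check_dataset(rows, expected_columns, expected_row_range, max_age=115):
    """Return a list of human-readable issues found in a received dataset.

    rows is a list of dicts, one per record. max_age is a hypothetical
    plausibility limit used for the age distribution check.
    """
    issues = []

    # Dimensions: is the number of records in the expected range?
    low, high = expected_row_range
    if not low <= len(rows) <= high:
        issues.append(f"row count {len(rows)} outside expected range {low}-{high}")

    # Variables: is every requested variable present on every record?
    for column in expected_columns:
        if any(column not in row for row in rows):
            issues.append(f"missing variable: {column}")

    # Age distribution: flag implausible values for follow-up with the supplier.
    ages = [row["age"] for row in rows if isinstance(row.get("age"), int)]
    if any(age < 0 or age > max_age for age in ages):
        issues.append("implausible ages present")

    return issues
```

An empty list means no issue was flagged; any entries would be raised with the data supplier.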
Data from different providers may be formatted differently. When NRS receive data they are standardised before they are linked together. This ensures that links are not missed due to differences in formatting between datasets.
Table 1 Standardised Variables
Variable | Description
First | First name in upper case letters
Middle | Middle name(s) in upper case letters
Last | Last name in upper case letters
Gender | Gender or sex. Set to ‘M’ for male, ‘F’ for female, and missing for any other values.
Day | Day of birth, stored as a text string of length 2, with leading zeros
Month | Month of birth, stored as a text string of length 2, with leading zeros
Year | Year of birth, stored as a text string of length 4
Postcode | Postcode (unit), stored with no spaces
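The standardisation in Table 1 could be sketched in Python as follows. This is an illustration under stated assumptions, not NRS's code: the function name, the lower-case input field names, the accepted input codings for gender (such as "MALE") and the upper-casing of postcodes are all assumptions.

```python
def standardise_record(raw):
    """Return the Table 1 variables for one raw record (a dict)."""

    def upper_or_none(value):
        # Names are stored in upper case; empty values become missing (None).
        text = str(value or "").strip().upper()
        return text or None

    def zero_pad(value, width):
        # Day, month and year are stored as fixed-width text strings.
        if value is None or str(value).strip() == "":
            return None
        return str(int(value)).zfill(width)

    # Assumed input codings; any other value is set to missing.
    gender_map = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}
    gender = gender_map.get(str(raw.get("gender") or "").strip().upper())

    # Remove all whitespace from the postcode (upper-casing is an assumption).
    postcode = "".join(str(raw.get("postcode") or "").split()).upper() or None

    return {
        "First": upper_or_none(raw.get("first")),
        "Middle": upper_or_none(raw.get("middle")),
        "Last": upper_or_none(raw.get("last")),
        "Gender": gender,
        "Day": zero_pad(raw.get("day"), 2),
        "Month": zero_pad(raw.get("month"), 2),
        "Year": zero_pad(raw.get("year"), 4),
        "Postcode": postcode,
    }
```

Storing date components as zero-padded strings means that "7" from one supplier and "07" from another compare as equal after standardisation.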
3.3 Linking Methods
There are two methods used to link administrative datasets together. The method used depends on factors such as the size of the dataset, data quality, efficiency and the level of accuracy needed.
Method 1 is a more-thorough method, based on Census-CCS linking. This is used when linking smaller datasets, where it is important not to miss matches. Method 2 is based on Census-Census linking. It is used to link large datasets together, and is optimised for efficiency.
3.4 Production of Trimmed Spine
The linking to produce the trimmed spine is done in three main steps. Overall we want to err on the side of having too few people on the spine. The census estimation process is designed to deal with undercoverage, but has limited ability to remove individuals who appear in the data but are not usual residents. This means that we often need to be conservative about linking, one way or another depending on what datasets are being linked.
The thresholds for what links to accept, review or reject, are chosen by reviewing a sample of links around those thresholds. These may differ for different steps or different combinations of datasets, depending on how strict or thorough we want the linking to be. For example, when linking the School Pupil Census we only want exact links as names are not included on that dataset. However, when linking with the census we will expect scanning errors, so we may need to be more thorough. Dataset size will also impact on how thorough linking can be from an efficiency point of view.
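The accept, review or reject decision can be pictured as two cut-offs applied to a link score. The sketch below assumes that each candidate link has a numeric score where higher means stronger agreement; the scoring scale, the function name and the threshold values in the usage note are hypothetical.

```python
def classify_link(score, accept_threshold, reject_threshold):
    """Classify a candidate link as 'accept', 'review' or 'reject'.

    Scores at or above accept_threshold are accepted, scores below
    reject_threshold are rejected, and anything in between is reviewed.
    """
    if reject_threshold > accept_threshold:
        raise ValueError("reject_threshold must not exceed accept_threshold")
    if score >= accept_threshold:
        return "accept"
    if score < reject_threshold:
        return "reject"
    return "review"
```

For example, `classify_link(0.7, 0.9, 0.6)` falls in the review band. Setting both thresholds to the maximum score removes the review band, which corresponds to accepting exact links only.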
The way the different administrative and census datasets are used is shown in Figure 1, with the detail of the linking process described below.

Step 1
The first step is to determine which records are in scope. We start by linking together the following datasets to create a list of people who appear on at least one of:
• Health Activity
• Birth registrations (child)
• HESA (studying in Scotland)
It is important that we do not miss matches between records in these datasets, as this may lead to a person being counted twice, and cause problems when linking to the census. Linking records that are not a match would not cause any problems; it would simply mean that a record that could have been included in the spine is left out. This would slightly reduce the utility of the spine, but not introduce any error during estimation. The linking is therefore conservative, so that individuals from birth registrations and HESA are only added to the spine when it is clear they are not already on the health activity dataset. No individual should appear on both birth registrations and HESA (birth registration data covered the period from July 2020 until census day), so these datasets do not need to be linked directly. Method 2 linking was used to link health activity to birth registrations, and then health activity to HESA.
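The conservative logic of this step can be sketched as follows: a record from birth registrations or HESA is added only if nothing on health activity could plausibly be the same person. The matching rule `is_possible_match` and the field names are hypothetical stand-ins for method 2 linking, which is not reproduced here.

```python
def is_possible_match(a, b):
    # Hypothetical loose rule: same date of birth plus the same first or last
    # name. Ambiguous pairs count as links so that nobody is added twice.
    same_name = a["first"] == b["first"] or a["last"] == b["last"]
    return a["dob"] == b["dob"] and same_name


def add_unlinked(health_activity, candidates):
    """Return health activity plus the candidates with no possible match on it."""
    combined = list(health_activity)
    for record in candidates:
        if not any(is_possible_match(record, h) for h in health_activity):
            combined.append(record)
    return combined
```

In this sketch the function would be applied once for birth registrations and once for HESA, each time comparing against health activity only, since those two sources are not linked to each other.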
Step 2
The spine is then updated to remove individuals who are not usual residents in Scotland on census day, which creates the trimmed spine. The following admin datasets are used as evidence of an individual not being in Scotland:
• NHSCR – individuals who are recorded as having left Scotland, or having died
• Death registrations
• Electoral Register – individuals registered as overseas voters
• HESA – individuals studying elsewhere in the UK
Again, any missed link would result in a record being on the spine that should not be there (as there is evidence that it represents an individual who is not a usual resident). An incorrectly made link would result in the removal of a record that could have been used. Therefore, as with step 1, linking in error would reduce the utility of the spine, while missing links would introduce error to estimation. As a result, the linking process uses method 2 and is conservative, with ambiguous cases being linked, and the record removed from the spine. The spine is linked to each of these datasets in turn, and trimmed after each dataset has been linked.
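The sequential trimming can be sketched as a loop over the removal datasets, dropping any spine record that links to them, including ambiguous cases. The function and its arguments are hypothetical; the caller supplies `is_possible_match`, which stands in for method 2 linking.

```python
def trim_spine(spine, removal_datasets, is_possible_match):
    """Drop spine records that link, even ambiguously, to any removal dataset.

    removal_datasets is an ordered list of (name, records) pairs. Returns the
    trimmed spine and the number of records removed by each dataset.
    """
    trimmed = list(spine)
    removed_counts = {}
    for name, records in removal_datasets:
        # Keep only records with no possible match on this dataset.
        kept = [r for r in trimmed
                if not any(is_possible_match(r, other) for other in records)]
        removed_counts[name] = len(trimmed) - len(kept)
        trimmed = kept
    return trimmed, removed_counts
```

Trimming after each dataset means later linking runs against a smaller spine, and the per-dataset counts give a simple check on how much each source removes.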
Step 3
Once we have created the trimmed spine, the information needed for the strength of the evidence variable is linked on. This variable provides an indication of how likely it is that the individual was a usual resident in Scotland on census day. The appearance of a person on the following datasets is used as evidence of them being in Scotland:
• Electoral Register – individuals registered to vote, excluding overseas voters
• NHSCR – individuals with an active posting
• School pupil census – all pupils in publicly funded schools
• Vital events – parents listed on birth certificate, marriages, registering a death
This strength of evidence variable is used to determine which records to include in estimation. These datasets are linked to the trimmed spine using method 2, as again, the consequences of failing to link are fairly minor. If we fail to find a match then we would underestimate the strength of evidence for the record, which could result in a record not being used. This means we can err somewhat on the side of not linking ambiguous cases.
School Pupil Census does not have any names on it. For this source, records are only used if they agree exactly on date of birth and postcode.
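An exact link on date of birth and postcode could be sketched as a keyed lookup. The field names are hypothetical, and the requirement that exactly one pupil shares the key (so that, for example, twins at the same postcode are not linked arbitrarily) is an assumption added for this illustration.

```python
from collections import defaultdict


def exact_link(spine, pupils):
    """Link spine records to pupil records on exact (dob, postcode) agreement.

    Returns a dict mapping spine id to pupil id. A link is made only when
    exactly one pupil record has the same date of birth and postcode.
    """
    index = defaultdict(list)
    for pupil in pupils:
        index[(pupil["dob"], pupil["postcode"])].append(pupil["id"])

    links = {}
    for record in spine:
        candidates = index.get((record["dob"], record["postcode"]), [])
        if len(candidates) == 1:
            links[record["id"]] = candidates[0]
    return links
```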
Electoral Register does not have date of birth, so agreement on name and postcode is needed to make a link. Blocking on postcode allows slight differences in name to be detected. Electoral Register can also be useful for finding an alternative location for an individual, as it may be updated more readily than health records, especially if the old address can be used as a correspondence address. Accepting an alternative location in this way would need strong (possibly exact) agreement on name, and the name must not be a common one.
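Blocking on postcode can be illustrated as follows: names are only compared between records that share a postcode, which keeps the number of comparisons small and lets a slightly different spelling still link. The similarity measure (Python's `difflib.SequenceMatcher`), the 0.85 cut-off and the field names are assumptions for this sketch, not the comparison used by NRS.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity between two name strings, from 0 (none) to 1 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def link_by_postcode_block(spine, register, min_similarity=0.85):
    """Link each spine record to the most similar name in its postcode block."""
    # Group register entries by postcode so only same-postcode pairs are compared.
    blocks = defaultdict(list)
    for entry in register:
        blocks[entry["postcode"]].append(entry)

    links = {}
    for record in spine:
        full_name = f'{record["first"]} {record["last"]}'
        best_id, best_score = None, 0.0
        for entry in blocks.get(record["postcode"], []):
            score = name_similarity(full_name, f'{entry["first"]} {entry["last"]}')
            if score > best_score:
                best_id, best_score = entry["id"], score
        # Only keep the best candidate if it is similar enough.
        if best_score >= min_similarity:
            links[record["id"]] = best_id
    return links
```

With these settings, "CATHERINE MACDONALD" and "KATHERINE MACDONALD" at the same postcode would link, while two different first names with a shared surname would not.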
Vital event information is used as an extra activity indicator. It is also used to link parents with each other, or their children, and partners. This means that if one had more-recent activity then that can be taken as proxy evidence.
This step completes the process of producing the trimmed spine.
3.5 Linking to Census and CCS
Once the trimmed spine has been produced, the next two steps are to link the full trimmed spine to the census and CCS.
Step 4
The trimmed spine is then linked with the census. As these are both large datasets, method 2 needs to be used. This step means that any spine record that links to a census record will have this recorded, as this is useful when the spine records are used in adjustment.
Step 5
The trimmed spine is then linked with the CCS. As this is just the first pass for linking to the CCS it also uses method 2.
3.6 Strength of Evidence
When considering administrative data, it is important to be confident that the individuals represented were usual residents of Scotland and recorded at the correct address on census day. The strength of evidence is used to differentiate between administrative records where a person may, for example, have moved house but not yet updated their GP, and records for people who we are confident are usual residents and recorded at the correct address in the administrative data. The strength of evidence variable is calculated using information on which datasets the person appears on, along with when their last NHS interaction was. Using this, records are categorized as core, standard, reserve or discard. Only records from the core and standard groups were used in census processing. The core records were used in estimation, while both core and standard records were used in adjustment. The rules categorizing the records vary by age, as there are different administrative data sources that people of different ages can appear on. In general, to be classified as core, someone needs to appear on three administrative data sources.
The detail of which group each record is assigned to is given in Annex B. The strength of evidence is calculated following Step 5 of the linking.
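As an illustration, the Annex B rules for one age band (18–59) could be coded as below. This is a sketch under assumptions: the dataset labels and function name are hypothetical, "3 of" is read as "at least three", and records that are neither core nor standard are simply labelled "other" because the rules separating reserve from discard are not given here.

```python
def classify_adult_18_59(sources, last_health_activity_year=None):
    """Classify an 18-59 record as 'core', 'standard' or 'other'.

    sources is the set of datasets the person appears on, using the
    hypothetical labels below.
    """
    relevant = {"health_activity", "nhscr", "electoral_register", "hesa"}
    present = set(sources) & relevant
    others = present - {"health_activity"}
    recent_health = ("health_activity" in present
                     and last_health_activity_year == 2022)

    # Core: on all four sources, or health activity in 2022 plus two others.
    if present == relevant or (recent_health and len(others) >= 2):
        return "core"
    # Standard: on (at least) three of the four sources.
    if len(present) >= 3:
        return "standard"
    return "other"
```

For example, a person on health activity, NHSCR and the Electoral Register is core if their last health interaction was in 2022 and standard if it was earlier.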
3.7 Business Rules
In addition to the strength of evidence, business rules are also used to determine whether a spine record is in scope of being used in estimation, adjustment and quality assurance. Records meeting any of the following conditions will not be used:
- The record links to the CCS (as the person is already counted there)
- The record is not at a location included in the CCS sample
- The record is at an address where a CCS response was received
- The record is at an address where 7+ admin data records appear
- The record is at an address where any of the admin records appear anywhere on the CCS responses
This helps to ensure that we do not add records that are inconsistent with CCS responses, or include admin records that may be out of date, as this would create bias in the population estimates.
In addition, any spine record with a conflict in the location recorded between the different sources was excluded. We do not want to add records to the CCS if we are unsure where they reside. The records that are passed to the final linking steps are those that meet these business rules, and are categorized as core or standard.
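The business rules amount to a series of exclusion tests applied to each spine record. A minimal sketch follows, assuming that the address-level facts have already been derived; all field names and the function name are hypothetical.

```python
def passes_business_rules(record, address):
    """Return True if a spine record may go forward to the final linking steps.

    record holds record-level fields; address holds facts about the
    address at which the record is located.
    """
    if record["ccs_id"] is not None:          # already linked to the CCS
        return False
    if not address["in_ccs_sample"]:          # location not in the CCS sample
        return False
    if address["ccs_response_received"]:      # a CCS response exists at the address
        return False
    if address["admin_record_count"] >= 7:    # 7+ admin records at the address
        return False
    if address["any_admin_record_on_ccs"]:    # an admin record here appears on the CCS
        return False
    if record["location_conflict"]:           # sources disagree on location
        return False
    # Only core and standard records are passed on.
    return record["evidence_group"] in {"core", "standard"}
```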
3.8 Thorough Linking to CCS and Census
It is important that we do not add any spine records to the CCS that already responded to the CCS, as that would inflate the estimates. It is also important that the links between the spine records to be added to the CCS, and the census are accurate, as any errors will lead to errors in estimation. Therefore, the spine records to be used in estimation are linked to the CCS and census again, using the more-thorough method 1.
Step 6
In this step, the records that pass the business rules described above are linked to the CCS using method 1. This more-thorough linking method can be used here because the CCS and the dataset of remaining spine records are both relatively small. Because it is important for estimation that we only add records for people who have not responded to the CCS, we are conservative with this linking and we accept weak links, erring on the side of caution.
Step 7
The spine records used in step 6 are filtered down again, to remove any that linked to the CCS at that stage. The remaining administrative data records are then linked to the census. Although the census is a large dataset, the dataset of remaining spine records is comparatively small. Effectively it consists of individuals in CCS sampled areas who did not respond to the CCS. As a result, the more-thorough method 1 can be used, as it was designed to link a small dataset and a large dataset together (the CCS and the census).
Any incorrect links (either linking non-matches, or failing to link matches) at this stage would affect estimation. In order to minimize error we use method 1 for this linking, and also pass links to clerical review.
Table 2 Summary of linking steps, linking methods used, and the effect of linking error.
Step | List A | List B | Method
1A | Health Activity | Child registrations | 2
1B | Health Activity | HESA (Scotland) | 2
2 | Step 1 list | Death registrations, NHSCR (died or left), HESA (RUK), ER (overseas) | 2
3 | Step 2 list | NHSCR (active), HESA (Scotland), ER (not overseas), SPC | 2
4 | Step 2 list | CCS | 2
5 | Step 2 list | Census | 2
6 | Step 2 list in CCS areas not responding to CCS | CCS | 1
7 | Step 6 residuals | Census | 1
Summary of Linking Steps
A summary of the steps and their linking methods is given in Table 2. The effect of linking error would be largest when linking the spine with the CCS and census, as these will affect results in estimation. Therefore, the linking to CCS and census uses both method 2, and also the more-thorough method 1 linking for records that could be used in census estimation. Earlier linking steps involved linking larger datasets, and so needed to use the more-efficient method 2 in order to complete in a reasonable time.
3.9 Clerical Review
Clerical review is carried out following the step 7 linking, where spine residuals were linked to the census, as any linking errors at this stage would have the largest effect on estimation. The majority of links are dealt with automatically, but some are sent to clerical review if a match could not be automatically accepted or rejected. This may include:
- cases involving multiple links (as RMR (Resolve Multiple Responses) had not yet been run on the census data)
- discrepancies in name or date of birth (such as scanning errors, typos, or a person using a nickname or middle name on the census form)
- appearing at a different address in admin records than on the census
Each of these links is passed to a person, who compares the original data for the spine record with an ordered list of census records. These are records that could be matches for the spine record, and are shown in order with the most similar appearing first. Shading of parts of the census record (name, date of birth and address components) draws the reviewer’s attention to similarities and differences from the spine record. The reviewer is also shown information for other spine records at the same address as the spine record under consideration. The final piece of information available is the information for other census records at a particular address. This can all assist the reviewer in coming to a decision about whether any of the presented census records are a match for the spine record.
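The ordered list shown to the reviewer could be produced by scoring each candidate census record against the spine record and sorting. The sketch below is an assumption about how such an ordering might work: the averaged `difflib.SequenceMatcher` ratio, the field names and the function name are all hypothetical.

```python
from difflib import SequenceMatcher


def rank_candidates(spine_record, census_records,
                    fields=("first", "last", "dob", "postcode")):
    """Return census_records sorted with the most similar record first."""

    def similarity(census_record):
        # Average string similarity over the compared fields;
        # missing values are treated as empty strings.
        scores = [
            SequenceMatcher(None,
                            str(spine_record.get(field) or ""),
                            str(census_record.get(field) or "")).ratio()
            for field in fields
        ]
        return sum(scores) / len(scores)

    return sorted(census_records, key=similarity, reverse=True)
```

The per-field scores computed here are also the kind of information that could drive the shading of name, date of birth and address components on the review screen.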
This is similar to the clerical review process used when linking CCS records to the census, where links will also be used in estimation.
3.10 Output Dataset
Once all the linking processes, including clerical review, have been completed, all the links are merged back to the data to ensure that the full spine dataset includes the census IDs of all the linked records.
4. Conclusion
This paper presents the method used for linking the administrative datasets together to produce the administrative data spine. The method balances the need for accuracy with time constraints. This gave us a list of people we are confident resided in Scotland on Census Day. Linking to the census and CCS tells us which of these people appeared on the census, and this information is then used in census estimation and adjustment.
Term | Definition
Link | Two records that have been connected
Match | Two records that relate to the same individual
Non-match | Two records that do not relate to the same individual
RMR | Resolve Multiple Responses. This is a process that resolves duplicate records at the same location in the census dataset.
NHSCR | National Health Service Central Register. This is an administrative dataset of people who are registered with an NHS GP.
Census Day | Sunday 20th March 2022
Age | Standard group (this group has good strength of evidence and so can be considered for use in estimation) | Core group (this is the group with the strongest strength of evidence)
<1 month | Appearance on all of: · Health activity in 2021 or 2022 · NHSCR · Birth registrations | Appearance on all of: · Health activity · NHSCR · Birth registrations · At the same location as 1+ parent in the core or standard group OR Appearance on: · Birth registrations · At the same location as both parents who are in the standard or core groups
0 | Appearance on all of: · Health activity in 2021 or 2022 · NHSCR · Birth registrations | Appearance on all of: · Health activity · NHSCR · Birth registrations · At the same location as 1+ parent in the core or standard group
1 | Appearance on all of: · Health activity in 2021 or 2022 · NHSCR · Birth registrations | Appearance on all of: · Health activity · NHSCR · Birth registrations · At the same location as 1+ parent in the core or standard group
2–4 | Appearance on all of: · Health Activity in 2020 or 2021 · NHSCR | Appearance on all of: · Health Activity in 2022 · NHSCR
5–15 | Appearance on all of: · Health Activity in 2020 or 2021 · NHSCR · School pupil census | Appearance on all of: · Health Activity in 2022 · NHSCR · School pupil census
16–17 | Appearance on 3 of: · Health Activity · NHSCR · School pupil census · Electoral Register | Appearance on all of: · Health Activity · NHSCR · School pupil census · Electoral Register OR Appearance on: · Health activity in 2022 And 2 of: · NHSCR · ER · SPC
18–59 | Appearance on 3 of: · Health Activity · NHSCR · Electoral Register · HESA | Either Appearance on all of: · Health Activity · NHSCR · Electoral Register · HESA OR Appearance on: · Health activity in 2022 And 2 of: · NHSCR · ER · HESA
60+ | Appearance on all of: · Health Activity in 2021 · NHSCR · Electoral Register | Appearance on all of: · Health Activity in 2022 · NHSCR · Electoral Register