Do you use census data? We'd like your feedback.

Scotland's Census 2022 Resolve Multiple Responses: Resolving Duplicates – Final Methodology

1. Background

Resolve Multiple Responses (RMR) is the third data cleansing step in the census data processing journey.

Figure 1: Census Data Processing Journey for 2022

There are situations when people submit more than one response to the Census.  This is sometimes for legitimate reasons; if a respondent is making a paper response but the household is larger than five people, for example, they are asked to submit another questionnaire (a continuation form).  These situations do not result in overcount and pose no problems in terms of accuracy and data quality.  However, there are also situations where someone responds to the Census more than once and it causes issues in data quality — where people, or even whole households, are duplicated in the dataset.  This can occur for a number of reasons, for example:

 

  1. Where someone in the household fills out the questionnaire and sends it off, but someone else in the household has already done this;

 

  1. When someone changes their mind about what they want to include in their response, and submits a new one;

 

  1. A respondent begins filling in the census return online but decides they would rather fill it in on paper. In such cases the information on the online return would be collected as an unsubmitted return;

 

  1. A respondent begins filling in the census return online, but forgets their login details before completing it. They would then need to request a new Internet Access Code (IAC) and begin a new return. Again, the information on the first return would be collected as an unsubmitted return;

 

  1. Where a person gets confused about a paper response, and answers the individual questions for themselves for Persons 1–5 (each paper household questionnaire contains spaces for up to five people)

 

  1. Where a respondent chooses to fill out an individual questionnaire in addition to the household questionnaire.

 

This duplication creates what is called ‘overcount’, an inflation in the count of people or households.  Overcount can lead to an overestimation of the population.  Therefore, the process that looks at duplication of individuals at a location — Resolve Multiple Responses (RMR) — is an important part of data cleansing.  

 

Resolve Multiple Responses (RMR) is the process by which responses are reviewed in order to identify duplicate person records and duplicate household or communal establishment records, and resolves them by combining duplicate responses in a way which retains information where possible (rather than discarding it).

 

This paper covers the methodology used in resolving duplicates in the Census dataset during the Resolve Multiple Response stage. Please note that the process of identifying the records can be found here Scotland's Census 2022 - Resolving Multiple Responses: Identify Duplicates – Final Methodology.

 

The following paper is an update to the original RMR Resolving Duplicates External Methodology Assurance Panel (EMAPs) paper, which was published before data processing started. More information on the detailed background and methodology of the process can be found here:  PMP021: Resolve Multiple Responses Prioritisation and Resolution | Scotland's Census (scotlandscensus.gov.uk)

 

2. 2022 Method

2.1 Overview of RMR for 2022

The first two points concern the identification of the clusters and are mentioned here for clarity. More information can be found in the RMR Identify paper at the link above.

 

  1. Identify Duplicate Person Records

We identify duplicate person records upfront by linking within enumeration postcode. Linked person records are grouped into clusters. These clusters are linked again to administrative data, or clerically reviewed, or both to determine whether they come from the same person or not. This provides a list of person duplicates to resolve, and by identifying duplicates across questionnaires, we can see which households have a person in common.

 

  1. Identify Household and Communal Establishment Records to Combine

Where there are multiple household or communal establishment returns associated with the same address, we identify them and in some cases combine them according to the following rules:

  1. Unoccupied households are grouped with occupied households at that address where possible;
  2. Household responses that indicate they represent only part of a household are grouped with other households at that address where possible;
  3. Households with one or more people in common (i.e. a person is duplicated across households) are grouped;
  4. Communal establishments are grouped

 

  1. Resolve Household and Communal Establishment Records Within an Address

For those HH and CE cases in (2) that are to be resolved, a priority record is identified based on rules set out below. The priority record and its valid question responses are retained; missing or invalid question responses take a value from the next highest priority record where possible. Any person records associated with a non-priority record are moved to the priority record, and the non-priority record discarded.

 

  1. Resolve Additional Household Records

Where we see a duplicate person across multiple household addresses within a postcode, this is taken as evidence of an error in the address frame. These households are resolved as set out in point 3 above.

 

  1. Resolve Duplicate Person Records

Duplicate person records identified in (1) are resolved in a similar way. The priority record and its valid question responses are retained; missing or invalid question responses take a value from the next highest priority record where possible. Non-priority records are discarded.

 

  1. Resolve duplicate non-response returns so they can be used in later processes

 

  1. Before the data is passed on for later processing, affected records (whether they have been resolved or affected by another resolution) are flagged. Person records are renumbered within their household or communal establishment, and relationships between resolved person records are recovered.

 

 

Prioritisation

 

For 2022, where a group of household or communal establishment records are to be resolved, we prioritise as follows:

  • Prioritise by mode of collection and return status (Online Submitted over Paper over Online Un-submitted);
  • Then by number of valid fields filled;
  • Then at random

 

When resolving person records, we prioritise first by questionnaire type (where individual questionnaires take priority over any other type which means that anyone aged 16 and over who filled in an individual form in private to the rest of the household their data would be kept), and then prioritised using the same approach as with household or communal establishment returns if more than one return still existed.

 

The intention here is to retain as much accurate data as possible. Online returns are subject to more validation at collection than paper returns and are not subject to scanning errors, so are more accurate. Online un-submitted returns are considered an exception; they are not subject to the same validation as a submitted return, and the presence of an un-submitted return did not stop non-response follow-up. In general un-submitted returns will duplicate complete submitted returns, and so will only be retained where follow-up did not result in a submitted return.

 

Consideration was given to the idea that as paper questionnaires require additional effort for the respondent to obtain and complete, the associated return could be expected to be more accurate or complete than an online submitted return. However, this does not account for errors introduced in capture or scanning. Favouring online submitted returns allows for these errors to be avoided.

 

By prioritising again by number of valid fields filled we favour more-complete over less-complete records. This means maintaining more variables associated with the priority record than we otherwise would, and so creating fewer inconsistencies by overwriting missing or invalid values. 

 

 

Multi-part questions

 

We resolve person duplicates by using fields from lower-priority records to overwrite missing or invalid fields in the priority record.  In general, this allows us to retain the most data, and any inconsistencies created between questions will be resolved in Edit & Imputation as they would be for non-duplicates. However, where one question response is spread over multiple variables, inconsistency is more of a problem and harder to account for later on.

 

If we were to allow overwriting of a missing or invalid response for part of a multi-part question, we could create an inconsistency between parts of that question, or a response which is then not valid.  For example, the ‘date of birth’ question populates variables for day, month, year of birth as well as age on Census day. By combining two partial dates of birth which may not match exactly we could create a new date of birth combination which is less accurate in data quality terms than either of the contributing partial dates. This is a problem for later processes that use date of birth (for example, any other linking or matching process — the presence of dates of birth that do not match is evidence against a match, whereas a missing or partial date of birth is not).

 

Instead we treat specific multi-part questions as one variable; we only overwrite a response if completely missing or invalid, and then with a complete response from a lower priority record. This avoids creating combined responses that are of less value than the original partial response.

 

The multi-part questions treated in this way are date of birth (made up of day, month, year of birth and age) and year of arrival in the UK (made up of month and year of arrival).

 

 

Voluntary questions

 

Most Census questions and associated variables are mandatory — we require a valid answer, and will go on to impute any missing or invalid values to produce a complete dataset.  However, some particularly sensitive questions and associated variables are voluntary — respondents can choose not to provide an answer, and if they do not, this is considered a valid response and will not be imputed over.

 

When resolving duplicate persons, we treat non-response for these questions as valid and do not overwrite a missing or invalid value here. This is done to ensure that the response associated with an individual return is given the appropriate priority and not overwritten — even where that response is valid non-response.

This means that if any voluntary question is left missing in the individual response, this missing value is treated as valid and retained rather than being overwritten, as the individual person returns take priority over other person returns. As an example, consider a case of multiple response where a respondent’s householder provides a response to the religion question and no response to sexual orientation. If that respondent requests an individual questionnaire so that they can provide a response to the sexual orientation question, then the response to religion on that individual questionnaire will be retained as well, even if it is non-response.

 

This is an unavoidable consequence of prioritising individual response over other response — we cannot distinguish between intentional and unintentional non-response for voluntary questions, and would otherwise risk overwriting legitimate data provided by a respondent, with inaccurate data provided by their householder. However, we do note that incomplete individual questionnaires may result in unintentional non-response being retained.

3. Conclusion

The Resolve Multiple Responses process was carried out successfully for Scotland’s Census 2022 using the following principles to resolve duplicate records:

  • That duplicate person records are identified and combined in a way that retains quality data where possible;
  • That duplicate households are identified on the basis of having a person in common, and are also combined in a way that retains quality data where possible;

 

 

Contents