Scotland's Census 2022 Form Reconciliation - Final Methodology

Background

Form Reconciliation (FR) is part of the third data cleansing step in the census data processing journey.

Form reconciliation is a group of processes that run alongside the main data cleansing processes. We need to move from working with the data in the format in which it was collected (forms/questionnaires/returns) to working with it in the format in which it will be published (households and communal establishments (CEs)). We need to do this at the same time as the data cleansing step Resolve Multiple Responses (RMR), because RMR resolves multiple responses within a household or CE.

 

There are four form reconciliation processes. They occur at the following stages of data processing:

  • FR1 reconciles CE manager and CE resident forms.
  • FR2 reconciles continuation forms with their parent household (continuation forms are paper forms for households of more than 5 people).
  • FR3 makes sure all household individual forms are allocated to a household and assigns orphan forms to an address.
  • FR4 fixes person numbering and relationships within a household or CE.

 

To allow us to associate communal establishment (CE) individuals or individuals on a continuation form or household individual form with their correct CE or household, we create a new variable in the dataset, called abode_id.

 

The abode_id variable is derived as part of form reconciliation and it becomes the standard household/ CE identifier for later processing steps.

 

For individuals, abode_id tells you which household or CE that person is associated with, and for households and CEs it serves to distinguish them. Creating a new ID allows us to retain all of the collected information without disrupting it.

 

The process for each of the form reconciliation steps is described in this paper.

 

2022 Method

2.1 FR1 - CE manager and CE resident forms

This step deals with CE manager and CE resident forms and reconciling them together – this is where the term ‘form reconciliation’ comes from. For CEs, this means having a single identifier that associates the correct individuals with the correct establishment.

 

There is no automated way to associate communal establishment residents (i.e. CI forms) with their establishment manager form (CE form). Their return identifiers are totally distinct (and it’s beneficial to preserve return identifiers since they link directly back to data collection – e.g. scanned images).

 

For CEs, each individual gets their own distinct sub-address (sub_address_id). This makes it tricky to match the CE residents to the CE based on address/ sub-address information, as each CE individual has a unique sub_address_id.

 

We either match on address_id (if this is available), which acts as a truncated sub_address_id, or truncate sub_address_id ourselves, so that everyone in the same establishment has an identifier in common, and the establishment also has that identifier in common.

 

This is where we create the abode_id variable. Each CE resident is given an abode_id, which is the return_id of the associated CE manager form.

 

We match the CE individuals and CE managers on address/sub-address and then apply the new abode_id. For cases where there is more than one CE manager at that address/sub-address, we assign the CE individual to one CE manager at random. The CE managers are resolved at the RMR stage, as RMR combines these CE managers as duplicates. Such cases where there is a duplicate CE manager form were reviewed at the RMR stage to ensure they had been resolved appropriately.

 

If there was no CE manager form at the address by the time data collection was over, the CE individual is treated as an orphan form. The process for orphan forms is described under Outcome 4 in section 2.3.
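The FR1 matching described above can be sketched roughly as follows. This is a minimal illustration rather than the production code: the record layout (plain dictionaries), the function names and the assumption that the establishment key is a fixed-length prefix of sub_address_id (here 8 characters) are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical: assume the first 8 characters of sub_address_id identify the establishment.
PREFIX_LEN = 8


def establishment_key(record):
    """Use address_id when it is available, otherwise truncate sub_address_id."""
    if record.get("address_id"):
        return record["address_id"]
    return record["sub_address_id"][:PREFIX_LEN]


def reconcile_ce_forms(managers, residents, rng=None):
    """Give each CE resident the return_id of a CE manager form at the same address.

    Returns (assigned, orphans). Assigned residents carry an abode_id; residents
    with no manager form at their address are returned as orphan forms.
    """
    rng = rng or random.Random()
    managers_by_key = defaultdict(list)
    for manager in managers:
        managers_by_key[establishment_key(manager)].append(manager)

    assigned, orphans = [], []
    for resident in residents:
        candidates = managers_by_key.get(establishment_key(resident), [])
        if not candidates:
            orphans.append(resident)
            continue
        # More than one manager form: pick one at random, since RMR later
        # combines the duplicate manager forms.
        chosen = rng.choice(candidates)
        assigned.append({**resident, "abode_id": chosen["return_id"]})
    return assigned, orphans
```

The original return_id of each resident is left untouched, which reflects the point above that return identifiers link back to data collection and are worth preserving.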

2.2 FR2 – Continuation forms

This step is similar to FR1 but it deals with continuation forms and reconciling them with their parent household. Continuation forms are forms for households larger than 5 people who wish to fill out a paper questionnaire. The standard paper questionnaire only has space for five respondents, so an additional form is required for larger households.

 

It involves updating the continuation person records so that they are associated with their parent household.

 

Similar to FR1, there is no inbuilt way to automatically associate continuation form residents and their parent household form, as their return identifiers are totally distinct. The only information to match these up is address and sub-address.

 

This sub-address information is ideal for matching continuation forms and parent households, as it is common between the continuation people and parent household.

 

Similar to FR1, this is where we create the abode_id variable. We match the continuation people and parent households on address/sub-address and then apply the new abode_id. Each continuation person is then given an abode_id, which is the return_id of their parent household.

 

For cases where there is more than one parent household, we can manually inspect the forms and choose the most appropriate parent household to associate the continuation form with. If the two parent household forms overlap in terms of people then they will be combined into one household by RMR. If they do not overlap in terms of people then they will not be resolved into one household, and we may have no grounds to pick one over the other. In that case the continuation form is associated with one of them at random.

 

If there was no parent household at the sub-address by the time data collection was over, the continuation form is treated as an orphan form. The process for orphan forms is described under Outcome 4 in section 2.3.
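As a rough sketch of the FR2 logic, the function below sorts continuation people into three groups: those with exactly one parent household at their sub-address, those needing manual inspection because there is more than one candidate, and orphans. The dictionary-based records and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def reconcile_continuation_forms(households, continuation_people):
    """Match continuation people to parent households on sub_address_id."""
    parents = defaultdict(list)
    for household in households:
        parents[household["sub_address_id"]].append(household["return_id"])

    assigned, needs_review, orphans = [], [], []
    for person in continuation_people:
        candidates = parents.get(person["sub_address_id"], [])
        if len(candidates) == 1:
            # The abode_id is the return_id of the parent household.
            assigned.append({**person, "abode_id": candidates[0]})
        elif candidates:
            # Several parent households: leave the choice to manual inspection.
            needs_review.append((person, candidates))
        else:
            orphans.append(person)
    return assigned, needs_review, orphans
```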

 

2.3 FR3 – Individual forms

RMR will resolve many individual forms, by resolving the individual with the minimal information back to the household form. More information on the RMR methodology can be found here.

 

We associate Household Individual forms to a household by matching the person on the Individual form with the associated person on the Household form in RMR. If more than one match is found then the whole group are resolved together into one record.

 

During RMR, all people within a postcode are compared to identify matches. There are a number of ways a Household Individual may not match to a Household Person, with different outcomes. If a Household Individual doesn’t match any household person then either:

  • there is no corresponding household
  • there is a corresponding household but no corresponding household person
  • they failed to match to the corresponding person due to lack of evidence for a match (i.e. very incomplete data in matching variables)
  • they failed to match to the corresponding person due to evidence against a match (i.e. contradictory information in matching variables).

 

There are another couple of considerations here:

  • Household persons may indicate that they are ‘decoupled’, i.e. have requested an individual form that the householder knows about. These household persons will provide minimal information (name and relationship within the household) and then be routed to the end of the questionnaire.
  • Individuals who provide a Paper individual response can indicate which Household person number they correspond with (i.e. “I am person 3 in the household”) so that they can be associated with their household information. The response to this question may be missing or incorrect.

 

FR3 deals with any individual forms that RMR can’t find a match for. There are four possible outcomes when doing this.

 

 

 

Outcome 1 – match individual to household resident:

Use the Household Individual return to overwrite one of the Household Persons, as determined by clerical review (this is where a person looks at the record to make a decision). In general the approach is as follows:

  1. If the individual has provided a person number (PERS_NUM_INDIVIDUAL), could they be matched to a person with that person number (RETURN_PERS_ID) at that address?
  2. Otherwise, could they be matched to a person at that address who has indicated they’ll be providing an individual response (INTENTION_INDIVIDUAL = 1)?
  3. Otherwise, is there anyone at that address who the individual could match with?

If we can match the individual to a household resident at that address (and we can explain why the RMR identify stage missed the match) then we do so and resolve them at the RMR resolve stage.
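The order of checks above could be supported by a small helper that ranks the household persons at an address for a clerical reviewer. This is only a sketch: the variable names PERS_NUM_INDIVIDUAL, RETURN_PERS_ID and INTENTION_INDIVIDUAL come from the text, but the function, the tier numbering and the dictionary records are hypothetical, and the final decision remains a clerical one.

```python
def rank_candidates(individual, household_persons):
    """Order household persons at the same address by the three checks.

    Tier 1: person number matches the one given on the individual form.
    Tier 2: person indicated they would provide an individual response.
    Tier 3: anyone else at the address.
    Returns a list of (tier, person) pairs, best tier first.
    """
    pers_num = individual.get("PERS_NUM_INDIVIDUAL")

    def tier(person):
        if pers_num is not None and person.get("RETURN_PERS_ID") == pers_num:
            return 1
        if person.get("INTENTION_INDIVIDUAL") == 1:
            return 2
        return 3

    # sorted() is stable, so people in the same tier keep their original order.
    return sorted(((tier(p), p) for p in household_persons), key=lambda pair: pair[0])
```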

Examples of why RMR might miss a match are:

  • in a ‘decoupled’ case where the Household Person record is of poor quality (e.g. the householder recorded an inaccurate name which cannot be matched to the Household Individual)
  • the Household Individual provides information which is different from the Household Person return but has accurately indicated which Household Person they correspond with (e.g. a respondent who used their maiden name on the Household Individual response but their married name on the Household form will still have the same date of birth and forename across both forms).

 

For outcome 1, ‘overwriting’ is done according to RMR rules, i.e. the Household Individual response gets priority, and any missing or invalid mandatory questions are filled using responses from the non-priority return. As non-response to a voluntary question is a valid response, these would not be overwritten with the information from the non-priority return.
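The overwriting rule can be illustrated with a short sketch. The question names, the `is_valid` check (here simply "not missing") and the function itself are assumptions for illustration; the actual RMR validity rules are not described in this paper.

```python
def merge_returns(priority, non_priority, mandatory, is_valid=lambda value: value is not None):
    """Combine two returns for the same person, giving one of them priority.

    Mandatory questions that are missing or invalid on the priority return are
    filled from the non-priority return. Voluntary questions are never filled,
    because non-response to a voluntary question is itself a valid response.
    """
    merged = dict(priority)
    for question in mandatory:
        if not is_valid(merged.get(question)):
            fallback = non_priority.get(question)
            if is_valid(fallback):
                merged[question] = fallback
    return merged
```

For example, with `priority = {"name": "A", "mandatory_q": None, "voluntary_q": None}` as the Household Individual response and `non_priority = {"name": "B", "mandatory_q": "x", "voluntary_q": "y"}`, calling `merge_returns(priority, non_priority, mandatory=["name", "mandatory_q"])` keeps the name from the priority return, fills `mandatory_q` from the non-priority return and leaves `voluntary_q` unanswered.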

 

 

Outcome 2 – add household individual to household as distinct person:

Add the Household Individual return to the household as a distinct person. This option is preferred where we think that the Household Individual does not correspond with any of the existing household persons.

 

For example, if the householder requested an individual rather than a continuation form to capture their 6-person household (and so the Household Individual indicates a person number which isn’t already in use), or wanted to add an additional person to their household after submitting the household return and did this through a Household Individual return.

 

If we can say the individual doesn’t match any household resident at that address (e.g. non-missing, correctly captured dates of birth which are different), then retain the individual as an additional household resident.

 

As with outcome 1, this outcome requires clerical review to decide how to proceed. In this case we add an additional person to the household without knowing their household relationships, which will then be imputed during the Edit and Imputation (E&I) process.

 

Outcome 3 – discard the household individual:

If we can do neither outcome 1 nor outcome 2 (e.g. because there is too much missing information) then we must discard the individual return. We cannot correctly match it, so we cannot fulfil the respondent’s intentions, and retaining the individual as a distinct person would guarantee overcount in a way which our methods cannot fix (since overcount correction relies on consistent name and date of birth for matching).

There were no cases in live census 2022 processing where a household individual form was discarded for this reason.

More information on the overcount correction methodology can be found here.

Outcome 4 – treat as orphan form:

Finally, if we have a Household Individual return at a sub-address (household address) where there is no household form, then we treat it as an orphan form.

 

As set out above, there are ways we can end up with unassigned CE, continuation or household individual people and no household or CE to correctly assign them to – these are called orphan forms.

 

In these cases there must be at least an address that the return is associated with, and that address must have been considered either a CE (in the case of CE individuals) or a household (in the case of continuation or household individuals).

 

Accordingly we should create CE and household records at each necessary address (CE) and subaddress (household) and associate these orphan forms with the created CEs and households, using a created abode_id variable.

 

There are two points to note:

 

  1. For created CEs, we can populate any communal establishment characteristics from the CE register (establishment type etc.), since we do not impute the establishment characteristics in the E&I process.

 

  2. For created households, we do not need to populate household characteristics (e.g. accommodation type, tenure) since these will be imputed during the E&I process.

 

The CE Register details all the communal establishments in Scotland and their characteristics, like establishment type.
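The handling of orphan forms might look something like the sketch below. It creates one CE record per address and one household record per sub-address, then links each orphan form to the created record. The `form_type` values, the format of the created abode_id and the shape of the `ce_register` lookup are all hypothetical.

```python
def create_abodes_for_orphans(orphans, ce_register):
    """Create CE or household records for orphan forms and link the forms to them.

    ce_register is assumed to map address_id to a dict of establishment
    characteristics (for example establishment type).
    """
    created = {}
    assigned = []
    for form in orphans:
        if form["form_type"] == "CE_individual":
            key = ("CE", form["address_id"])
        else:  # continuation or household individual forms
            key = ("HH", form["sub_address_id"])

        if key not in created:
            # Hypothetical abode_id format, used only for this illustration.
            abode = {"abode_id": f"{key[0]}-{key[1]}", "abode_type": key[0]}
            if key[0] == "CE":
                # CE characteristics come from the register, as E&I does not impute them.
                abode.update(ce_register.get(form["address_id"], {}))
            # Household characteristics are left unpopulated for E&I to impute.
            created[key] = abode

        assigned.append({**form, "abode_id": created[key]["abode_id"]})
    return list(created.values()), assigned
```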

 

2.4 FR4 – Person Numbering

The earlier cleansing processes may have affected person numbering within a household or CE, so this FR4 step deals with fixing person numbering and the relationship matrix within a household or CE. This is done after RMR resolve, before going into filter rules and E&I processing steps.

 

E&I in particular relies on individuals within a household or CE being numbered sensibly (i.e. if there are N people, they’re numbered 1, 2, 3…N) – any gaps (e.g. those created by Remove False Persons) will cause problems for E&I.

 

Similarly, although person numbers will be unique to each return, they won’t be unique to a household (i.e. an abode) – if we’ve combined two five-person households and the people don’t overlap then we’ll have two person 1, two person 2 etc.

 

The FR4 process involves arranging the associated people within a household or CE in a reasonable order, then renumbering them within the household/CE. We then move the values from the relationship matrix onto the person table in the correct order.

 

Note here that we won’t have complete relationship information going into E&I – for example, FR4 won’t try to backfill relationships between two households we’ve combined in RMR. We will know, though, which relationships should be imputed or left as not required based on the number of people in the household.
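A minimal sketch of the FR4 renumbering is given below. The paper does not specify what counts as a "reasonable order", so ordering by return_id and then by original person number is an assumption, as are the record layout and the representation of the relationship matrix as a dictionary keyed by (return_id, from person, to person).

```python
def renumber_abode(people, relationships):
    """Renumber the people in one abode as 1..N and re-key their relationships.

    people: list of dicts with "return_id" and "person_num", all in one abode.
    relationships: dict mapping (return_id, from_num, to_num) to a relationship code.
    Returns (renumbered_people, new_relationships), where new_relationships is
    keyed by (new_from_num, new_to_num).
    """
    # Assumed ordering: by return, then by original person number.
    ordered = sorted(people, key=lambda p: (p["return_id"], p["person_num"]))
    new_num = {
        (p["return_id"], p["person_num"]): i for i, p in enumerate(ordered, start=1)
    }
    renumbered = [
        {**p, "person_num": new_num[(p["return_id"], p["person_num"])]} for p in ordered
    ]

    new_relationships = {}
    for (return_id, from_num, to_num), code in relationships.items():
        # Skip relationships that refer to people removed by earlier cleansing.
        if (return_id, from_num) in new_num and (return_id, to_num) in new_num:
            key = (new_num[(return_id, from_num)], new_num[(return_id, to_num)])
            new_relationships[key] = code
    return renumbered, new_relationships
```

Relationships between people who came from different returns are simply absent from the output, which mirrors the point that FR4 does not backfill them and leaves them for E&I.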

 

Conclusion

This paper outlines the process of carrying out four form reconciliation steps in the census data processing journey. These steps allowed for household individuals and CE residents to be associated with their parent household or CE, and to correct the person numbering within a household or CE that may have been affected by earlier cleansing processes. This was required for the later data processing steps, in particular Edit and Imputation.
