Scotland's Census 2022 - Remove False Persons – Final Methodology
Background
Remove False Persons (RFP) is the second data cleansing step in the census data processing journey.

The raw census dataset for Scotland’s Census 2022 contained some blank or mostly blank records, which may not represent a genuine person. Most of these records were created at the data capture stage, for example when scanners recorded paper dust as ticks or text, or when respondents crossed through individual forms on the paper questionnaires (if the line runs through a tick or text box, it is picked up as a character and a record is created). In such cases, no person is related to these records. Mostly blank records also arose when respondents tried to help with enumeration (for example by writing “No-one lives here” in the name field). Again, there is no person or household related to these records.
Keeping such records in the census dataset creates overcount and burdens the later statistical processes that are required to fill in missing variables. Scotland’s Census therefore included a data cleansing step called ‘Remove False Persons’ to account for this.
This paper is an update to the original RFP External Methodology Assurance Panel (EMAPs) paper, which was published before data processing started. RFP was run very much as described in the EMAPs paper.
2022 Method
Remove False Persons was mainly designed to deal with mostly blank records on paper. However, the process was run on the whole census dataset to ensure that any spurious online records were found as well. In addition, unsubmitted returns needed to be processed in this way to make sure that these records had enough information for further processing. Unsubmitted returns are online returns which a respondent started but never fully submitted. They are taken at the end of the collection period and go through data processing, as they may contain useful information about households and people.
The sequence of steps in the Remove False Persons process, applied to all records (online submitted, online unsubmitted and paper), was therefore:
- Check for False Names
- ‘2 of 7 Rule’
- Administrative Data Check
This paper describes each step in the order in which it was run.
2.1 Check for False Names
For Scotland’s Census 2022, we ran a check to find records whose name field contained strings that were not actual names. These ‘false names’ tended to be either messages for Census staff (for example ‘I live alone’) or deliberate attempts to disguise a name (for example ‘Anonymous’). The final list of strings searched for as part of this check was:
- N/A
- NO ONE
- NONE
- NOT RELEVANT
- DON'T KNOW NAME
- ANON
- ANONYMOUS
These False Names were then set to be blank for the remainder of the process so that values in the name field that were not in fact names would not count towards the 2 of 7 rule.
In addition to this false names check, records with uncodeable letters in their name were clerically reviewed by a person to make sure that this was an intentional value in the name field. If the reviewer determined that it was a false name, the name was set to missing; if it was a real name, the record was left as it was.
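The false names check can be illustrated with a minimal Python sketch. It is not the production code. The function name `blank_false_name` is hypothetical, and the sketch assumes that the whole name field is compared against the list, ignoring letter case and extra whitespace; the matching rules actually used are not described here.

```python
# Final list of false-name strings, as given above.
FALSE_NAMES = {
    "N/A",
    "NO ONE",
    "NONE",
    "NOT RELEVANT",
    "DON'T KNOW NAME",
    "ANON",
    "ANONYMOUS",
}


def blank_false_name(name):
    """Return None if the name field holds a listed false name, else the name unchanged.

    Assumptions (not stated in the methodology): the whole field must match,
    matching ignores case and surplus whitespace, and curly apostrophes are
    treated as straight ones.
    """
    if name is None:
        return None
    normalised = " ".join(name.upper().split()).replace("’", "'")
    if normalised == "" or normalised in FALSE_NAMES:
        return None  # treated as blank for the rest of the process
    return name
```

Under these assumptions a value such as "Anona Smith" is left untouched, because only a match on the whole field counts as a false name.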
2.2 2 of 7 Rule
Records were then given a score based on how many of seven key variables they had values for. The seven variables can be found in Table 1.
Table 1: The Variable Groupings for the 2022 2 of 7 Rule
| | Variable Group and Description | Validity |
|---|---|---|
| 1 | Household Name: First or Last name in the household member listing of the Census questionnaire | Valid name fields in 2022 include any character in the name fields (i.e. initials are acceptable) |
| 2 | Relationship Matrix Name: First or Last name found on the relationship section of the Census questionnaire | As above |
| 3 | Person Name: First or Last name in the person section of the Census questionnaire | As above |
| 4 | Date of Birth fields | Valid date of birth includes a valid month OR a valid year |
| 5 | Variables which describe relationship | Any indication of a valid relationship (tick) |
| 6 | Sex | Any tick response |
| 7 | Marital Status | Any tick response |
The record received 10 points for date of birth and name, and 1 point for each additional variable. The final score (or rfp_score) determined whether the record failed the 2 of 7 rule (score below 10), passed (score above 10) or was a borderline case (score of exactly 10). Therefore, for a record to pass RFP it had to have at least two out of seven completed variables, and one of these had to be name or date of birth.
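The scoring can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code that was run: the field names are hypothetical, values are assumed to be `None` when missing or invalid, and each of the three name variables and date of birth is assumed to be worth 10 points on its own, which is one reading of the rule that is consistent with the pass, fail and borderline definitions above.

```python
# Hypothetical field names for the seven variable groups in Table 1.
# Assumption: each name variable and date of birth scores 10 points on its own.
TEN_POINT_FIELDS = ["household_name", "relationship_name", "person_name", "date_of_birth"]
ONE_POINT_FIELDS = ["relationship", "sex", "marital_status"]


def is_present(value):
    """A value counts if it is not None and not blank (validity is assumed checked upstream)."""
    return value is not None and str(value).strip() != ""


def rfp_score(record):
    """Score a record (a dict) on the seven key variables."""
    score = 0
    for field in TEN_POINT_FIELDS:
        if is_present(record.get(field)):
            score += 10
    for field in ONE_POINT_FIELDS:
        if is_present(record.get(field)):
            score += 1
    return score


def rfp_outcome(score):
    """Map a score to 'fail' (< 10), 'borderline' (== 10) or 'pass' (> 10)."""
    if score < 10:
        return "fail"
    if score == 10:
        return "borderline"
    return "pass"
```

For example, a record with only sex, marital status and a relationship tick scores 3 and fails, a record with only a person name scores 10 and is borderline, and a record with a person name and sex scores 11 and passes.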
The borderline cases were matched against administrative data to determine if there was a person with that name or date of birth at the address.
Single households were exempt from the relationship matrix and therefore had a 2 of 6 rule applied. Communal establishments were exempt from the relationship matrix and housemember table and therefore had a 2 of 5 rule applied.
Once the 2 of 7 rule was applied and the borderline cases were resolved through administrative data, those records that passed the filter (those having at least a name or date of birth, and any of the other variables; or matched on administrative data) moved on for further processing. The records that did not pass were removed from the primary census dataset.
In some cases, a household may be completely ‘emptied’ of person records — for example, if only sex is contained on each of the person records, they will not pass the 2 of 7 rule. In such situations, the person records were discarded while the household records were carried through further processing. The Edit and Imputation process ensures that missing questions on the household record are imputed.
2.3 Administrative Data to Quality Assure the RFP process
There are some cases where a record may fail the 2 of 7 check, but includes some key information (name or date of birth) that would enable the record to link to an administrative data source. These are the borderline cases described above.
By coupling this name or date of birth with the postcode information from the associated questionnaire, a corresponding record found in the administrative dataset would provide an indication that the Census record represents a genuine person. As such, this check was implemented for 2022 on these borderline records, so if a corresponding administrative data record was found, the census record was retained instead of being removed as it would have been without the administrative data check. Administrative data was used to check if the record was valid. No administrative data was kept in the census dataset.
If neither name nor date of birth is provided then any link to an administrative data record would not be strong enough to allow us to be confident that the return represents the same person as the record in the administrative dataset. Therefore, this check was only performed on borderline records, i.e. those that had not passed the 2 of 7 rule, but had either a name or date of birth recorded.
The administrative linking process is written in SAS, and makes use of linking methods developed for other census processing tasks. The linking method only considered records in the same postcode. When name was available, the pairs were scored, and the scores were used to categorise the links. These categories in turn were used to decide what to do with each link. Almost 2,000 cases were sent to clerical review, to be checked by a person. In around 300 of these cases the reviewer indicated that they likely represented the same person, and so the census record should not be removed from the dataset.
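The general shape of this check can be sketched in Python, although the production process was written in SAS and its scoring method and thresholds are not given here. Everything specific in the sketch is an assumption made for illustration: the field names, the use of `difflib.SequenceMatcher` as the name similarity score, the `accept` and `review` thresholds, and the choice to send date-of-birth-only matches to clerical review.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Illustrative similarity score in [0, 1]; the real scoring method is not specified."""
    return SequenceMatcher(None, a.upper().strip(), b.upper().strip()).ratio()


def categorise_link(census, admin, accept=0.9, review=0.7):
    """Categorise one census/admin pair as 'retain', 'clerical_review' or 'no_link'.

    Both records are dicts with hypothetical keys 'postcode', 'name' and
    'date_of_birth'. The thresholds are placeholders, not the values used.
    """
    # Only records in the same postcode are considered.
    if census["postcode"] != admin["postcode"]:
        return "no_link"
    # When name is available, score the pair and categorise by score.
    if census.get("name") and admin.get("name"):
        score = name_similarity(census["name"], admin["name"])
        if score >= accept:
            return "retain"
        if score >= review:
            return "clerical_review"
        return "no_link"
    # Assumption: a date-of-birth-only match is checked by a person.
    dob = census.get("date_of_birth")
    if dob and dob == admin.get("date_of_birth"):
        return "clerical_review"
    return "no_link"
```

A borderline census record would be compared with each administrative record in its postcode, and retained if any pair is categorised as a link, either directly or after clerical review.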
Conclusion
Remove False Persons was run successfully and removed 134,000 spurious records. The use of administrative data lowered the likelihood of erroneously removing real people from the census dataset. When a failing record was found in the independent administrative dataset, this indicated that the census record represented a genuine person, which allowed us to retain the record in the census and thus improved the quality of the census data.