Data Cleaning and Pre-processing
One of the essential steps after collecting data is to check for potential errors. This is particularly crucial for manually entered data, as people are prone to making mistakes. Although it is impossible to identify every error, the goal is to minimize their negative impact by tracing as many of them as possible. Various methods can be used to identify errors, depending on the type of data collected. For instance, a reasonableness check can spot an entry such as an age of “333”, which can plausibly be corrected to “33”. The same is not easy to do if the age is “232”: in this case we cannot be sure whether it is 23, 32, or 22. In some cases, multiple data fields may need to be examined to identify errors.
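A reasonableness check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name find_unreasonable_ages and the 0 to 120 range are assumptions chosen for the example. The check only flags suspect entries; deciding whether “333” really means “33” still requires human judgment.

```python
def find_unreasonable_ages(ages, min_age=0, max_age=120):
    """Return the positions of ages outside a plausible range.

    The default bounds (0 to 120) are an assumption for this sketch;
    a real study would choose bounds that fit its participant pool.
    """
    return [i for i, age in enumerate(ages) if not (min_age <= age <= max_age)]


# Hypothetical manually entered ages, including the two typos discussed above.
ages = [25, 333, 41, 232, 19]
suspect_positions = find_unreasonable_ages(ages)
print(suspect_positions)  # positions of "333" and "232"
```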
For automatically collected data, error checking usually focuses on time consistency issues or whether the performance falls within a reasonable range. In studies that collect data from multiple channels, it is crucial to ensure that data about the same participant is correctly grouped together.
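The two checks described above can be illustrated as follows. This is a sketch under stated assumptions: the record layout (dictionaries with "id", "participant", "start" and "end" keys), the one-hour upper limit, and the function names are all hypothetical and would differ from study to study.

```python
from collections import defaultdict


def find_time_problems(log, max_seconds=3600):
    """Return IDs of entries whose timing is inconsistent or implausible."""
    problems = []
    for entry in log:
        duration = entry["end"] - entry["start"]
        # A negative duration is a time consistency error; a very long
        # one falls outside the (assumed) reasonable range.
        if duration < 0 or duration > max_seconds:
            problems.append(entry["id"])
    return problems


def group_by_participant(*channels):
    """Collect records from several data channels under each participant."""
    grouped = defaultdict(list)
    for channel in channels:
        for record in channel:
            grouped[record["participant"]].append(record)
    return dict(grouped)


# Hypothetical data from two channels: an automatic task log and a survey.
task_log = [
    {"id": "t1", "participant": "P01", "start": 0, "end": 95},
    {"id": "t2", "participant": "P02", "start": 120, "end": 80},    # ends before it starts
    {"id": "t3", "participant": "P01", "start": 200, "end": 9000},  # implausibly long
]
survey = [
    {"participant": "P01", "satisfaction": 4},
    {"participant": "P02", "satisfaction": 2},
]

print(find_time_problems(task_log))
merged = group_by_participant(task_log, survey)
print(sorted(merged))
```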
When errors are identified, it is important to replace the erroneous values with accurate data. However, this is not always possible, particularly in online studies or anonymous surveys, where the participant cannot be contacted again. In such cases, problematic data items must be removed and treated as missing values in the statistical analysis.
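As an illustration of treating uncorrectable items as missing values, the sketch below replaces invalid entries with None and then computes a mean over the remaining values only. The helper names and the validity rule are assumptions for this example; statistical packages such as SPSS have their own mechanisms for marking missing values.

```python
from statistics import mean


def blank_out_invalid(values, is_valid):
    """Replace values that fail the validity check with None (missing)."""
    return [v if is_valid(v) else None for v in values]


def mean_ignoring_missing(values):
    """Average the values that are present; return None if none remain."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# "232" cannot be corrected with confidence, so it becomes a missing value.
ages = [25, 232, 41, 19]
cleaned = blank_out_invalid(ages, lambda a: 0 <= a <= 120)
print(cleaned)
print(mean_ignoring_missing(cleaned))
```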
Sometimes, data must be cleaned up due to inappropriate formatting. For example, in an online survey, participants may enter their age in various formats, such as numeric values or text descriptions. In such cases, the entries in text formats may need to be transformed into numeric values before statistical analysis can be performed.
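A transformation of this kind might look like the following sketch. The function parse_age and its word lists are hypothetical, and the parser is deliberately limited: it handles digits anywhere in the entry and simple English number words below one hundred, and returns None for anything else so that the entry can be reviewed by hand or treated as missing.

```python
import re

# Number words recognized by this sketch (ages from 1 to 99 only).
UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def parse_age(text):
    """Convert a free-text age entry to an integer, or None if unparseable."""
    text = text.strip().lower()
    # Prefer digits when the entry contains any, e.g. "33 years old".
    match = re.search(r"\d+", text)
    if match:
        return int(match.group())
    # Otherwise add up recognized number words, e.g. "thirty-three".
    total = 0
    found = False
    for word in re.split(r"[\s-]+", text):
        if word in TENS:
            total += TENS[word]
            found = True
        elif word in UNITS:
            total += UNITS[word]
            found = True
    return total if found else None


for entry in ["33", "33 years old", "Thirty-three", "prefer not to say"]:
    print(repr(entry), "->", parse_age(entry))
```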
Before conducting the statistical analysis, it is common for researchers to code the original data collected. For example, while age information is already numerical and does not require coding, other information such as gender or previous software experience must be coded so that it can be interpreted by statistical software. Typically, researchers use the codes “0” and “1” for dichotomous variables, which are categorical variables with only two possible values. For example, we may use “1” to represent “male” and “0” to represent “female”, or code previous software experience using “1” for “Yes” and “0” for “No”.
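The coding scheme in this example can be written down as a small codebook. The sketch below is one possible way to apply it; the names GENDER_CODES, EXPERIENCE_CODES and code_value are assumptions made for illustration. Responses that do not match the codebook are returned as None rather than guessed.

```python
# Codebooks matching the example above: "1"/"0" for dichotomous variables.
GENDER_CODES = {"male": 1, "female": 0}
EXPERIENCE_CODES = {"yes": 1, "no": 0}


def code_value(raw, codebook):
    """Map a raw response to its numeric code, or None if unrecognized."""
    if raw is None:
        return None
    # Ignore case and surrounding spaces so "Male" and " male " both match.
    return codebook.get(raw.strip().lower())


responses = [
    {"age": 33, "gender": "Male", "experience": "Yes"},
    {"age": 27, "gender": "female", "experience": " no "},
]
coded = [
    {
        "age": r["age"],  # already numeric, no coding needed
        "gender": code_value(r["gender"], GENDER_CODES),
        "experience": code_value(r["experience"], EXPERIENCE_CODES),
    }
    for r in responses
]
print(coded)
```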
For further study of data cleaning in SPSS, watch this video:
References
Lazar, J., Feng, J. H., & Hochheiser, H. (2017). Research methods in human-computer interaction. Morgan Kaufmann.