Discussion on Data Cleaning

This is a discussion on data cleaning by members of the Datarmatics Research groups.

[21:17, 9/3/2016] Nduka Wonu: We are reading the document on data cleaning 

[21:34, 9/3/2016] Kenedy Nnaji: What is data cleaning?
• Data cleaning is the process of detecting errors in datasets and reducing or eliminating such errors.
• Data cleaning is the process of dealing with missing observations in datasets.
• Data cleaning is the process of detecting and managing outliers in datasets.
• Data cleaning helps quality researchers showcase their skills and competences.

Kenedy Nnaji: If you don’t clean your data, they will surely embarrass you by revealing your ignorance and incompetence. So, clean up your datasets before using them for analysis and save yourself from this high-profile embarrassment.

Kenedy Nnaji: If you are using datasets with categorical variables, you need to clean them by getting rid of the non-response categories like ‘do not know’, ‘no answer’, ‘not applicable’, ‘not sure’, ‘refused’, etc.

Kenedy Nnaji: Usually, non-response categories are coded with high values like 99, 999, 9999, etc. (or in some cases negative values). Leaving these codes in will bias, for example, the mean age or your regression results, because they act as outliers.
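The recoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label set, the numeric codes and the sample values are hypothetical, and the real ones should come from the survey's codebook. In particular, 99 can be a genuine age and negative values can be legitimate for some variables.

```python
# Hypothetical non-response labels and codes; check the survey codebook.
NON_RESPONSE_LABELS = {"do not know", "no answer", "not applicable", "not sure", "refused"}
NON_RESPONSE_CODES = {99, 999, 9999}


def clean_category(value):
    """Return None for a non-response label, otherwise the trimmed text."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in NON_RESPONSE_LABELS else text


def clean_number(value):
    """Return None for a non-response code or a negative value (assumed invalid here)."""
    if value is None or value in NON_RESPONSE_CODES or value < 0:
        return None
    return value


# Made-up example data
answers = ["Yes", "No", "Refused", "do not know", "Yes"]
ages = [34, 999, 51, -1, 47]

clean_answers = [clean_category(a) for a in answers]
clean_ages = [clean_number(a) for a in ages]

print(clean_answers)  # ['Yes', 'No', None, None, 'Yes']
print(clean_ages)     # [34, None, 51, None, 47]
```

Setting the values to missing (here `None`) rather than deleting the whole record keeps the respondent's other answers available for analysis.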

+234 803 674 5636: Is that all? Give examples of ‘missing observations’.

Kenedy Nnaji: Examples of missing observations, outliers, etc. are given in the article sent earlier.

+234 803 674 5636: Yes. I just read it. That’s a good one.

Kenedy Nnaji: This is a seminar and there is a need for brainstorming; that’s why the article was sent.

+234 803 551 0045: This is interesting. I wish we had received the article earlier. Should we ask questions?

Kenedy Nnaji: Yes

 

+234 803 551 0045: How will you know data cleaning has not affected your objectives?

Kenedy Nnaji: I will give examples, but I cannot attach files from my end.

+234 803 551 0045: Ok, but does data cleaning address systematic errors, random errors, both or neither?

Nduka Wonu: 1. What will you do if a set of instruments sent to you for analysis has a lot of missing entries?
2. Is there any instruction you can give to software like SPSS, EViews, etc. to solve the problem?

Nduka Wonu: 3. What will you do if the missing data are many?

+234 803 551 0045: An additional question: how does it apply in blinded studies?

Nduka Wonu: If you are using datasets with categorical variables, you need to clean them by getting rid of the non-response categories like ‘do not know’, ‘no answer’, ‘not applicable’, ‘not sure’, ‘refused’, etc.
Usually, non-response categories are coded with high values like 99, 999, 9999, etc. (or in some cases negative values). Leaving these codes in will bias, for example, the mean age or your regression results, because they act as outliers.
In the example below, the non-response is coded as 999. If we leave it in, the mean age would be 80 years. After removing the 999 and setting it to missing, the average age goes down to 54 years.

Age     Frequency
88      2
90      3
92      4
93      1
95      1
999     38
Total   1373
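The two means quoted in the example are consistent with the table, as a quick arithmetic check suggests. The sketch below assumes that the total of 1373 includes the 38 cases coded 999, and that 54 is the (rounded) mean of the remaining valid ages; the variable names are only illustrative.

```python
total_n = 1373      # total respondents, from the table
n_coded_999 = 38    # respondents coded 999 (non-response), from the table
mean_valid = 54     # approximate mean age once 999 is set to missing

n_valid = total_n - n_coded_999

# Mean obtained if the 999 codes are wrongly treated as real ages
mean_with_codes = (n_valid * mean_valid + n_coded_999 * 999) / total_n

print(n_valid)                    # 1335
print(round(mean_with_codes, 1))  # 80.2
```

So only 38 coded values out of 1373 (under 3% of the records) are enough to push the mean age from about 54 to about 80.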

Nduka Wonu: Outliers affect results by inflating the estimates

Nduka Wonu: With outliers in your dataset, the assumptions of normality and homoskedasticity may be violated, so many statistical techniques can yield biased estimates.

Kenedy Nnaji: With outliers in your dataset, the assumptions of normal distribution, homoscedasticity and no serial correlation may be violated, so you cannot reliably apply most statistical techniques.

Kenedy Nnaji: This will significantly affect your results, and your research objectives may not be achieved.

Kenedy Nnaji: One way to detect outliers and other errors in a dataset is through descriptive statistics.
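As a rough illustration of this point, the Python sketch below prints basic descriptive statistics for a small made-up age variable in which one non-response code (999) was left in. The data and the `describe` helper are hypothetical and use only the standard library.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Made-up ages; 999 is a non-response code that was not cleaned
ages = [23, 35, 41, 29, 52, 38, 999, 47, 31, 44]

summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A maximum of 999 for an age variable, and a mean far above the median, are the kind of signals that should send you back to the raw data.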

Nduka Wonu: Yes

+234 803 551 0045: If data are skewed, does it mean there is an outlier? Please differentiate the two.

Kenedy Nnaji: When you see that the kurtosis coefficient of your data is too large, you can suspect that there is an outlier in your dataset.
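To see how a single bad value affects these coefficients, here is a minimal Python sketch that computes moment-based skewness and excess kurtosis for a made-up sample with and without one 999 code. The data and the `moments` helper are hypothetical, and statistical packages may apply small-sample corrections that give slightly different numbers.

```python
def moments(values):
    """Return (skewness, excess kurtosis) based on population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # about 0 for a normal distribution
    return skewness, excess_kurtosis


clean = [23, 35, 41, 29, 52, 38, 47, 31, 44, 36]  # made-up ages
dirty = clean[:-1] + [999]                        # same data with one 999 code

s_clean, k_clean = moments(clean)
s_dirty, k_dirty = moments(dirty)

print(f"clean: skewness={s_clean:.2f}, excess kurtosis={k_clean:.2f}")
print(f"dirty: skewness={s_dirty:.2f}, excess kurtosis={k_dirty:.2f}")
```

Both coefficients rise sharply once the 999 is included, which is why a large value is a reason to suspect an outlier. It is only a suspicion, though: genuinely skewed data without any error can also produce large coefficients, so the raw values still need to be inspected.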

+234 803 551 0045: So how do you differentiate?

Kenedy Nnaji: Skewness may indicate leverage, which is also a type of error in a dataset.

Kenedy Nnaji: However, not all outliers or leverage points are bad.

Kenedy Nnaji: There are good and bad outliers

Kenedy Nnaji: Check your dataset to trace the source of the outlier

+234 803 551 0045: Ok, is there anything to do to avoid bias?

+234 803 551 0045: In other words, should this have been addressed in the exclusion criteria?

Kenedy Nnaji: We cannot avoid bias entirely, but removing bad outliers can minimise it.

+234 803 551 0045: In terms of exclusion criteria

Kenedy Nnaji: If the outlier event didn’t occur, then there is good reason to believe that the outlier is bad

Kenedy Nnaji: But if the outlier event actually occurred, simply explain the event in your analysis.

Kenedy Nnaji: The second part of this seminar will focus on dealing with data errors, outliers and missing observations in SPSS and EViews
