Comparative Study of Various Methods of Handling Missing Data

Fredrick Ochieng’ Odhiambo

doi:10.11648/j.mma.20200502.14

Peer-Reviewed

Published in Mathematical Modelling and Applications (Volume 5, Issue 2)

Received: 2 October 2019 Accepted: 13 April 2020 Published: 30 April 2020

Abstract

The scientific literature lacks a straightforward answer as to which of the existing methods of missing-data imputation is most suitable in terms of simplicity, accuracy and ease of use. This paper explores various methods of data imputation and then proposes a robust one. It uses simulated data sets generated from various distributions. A regression function is fitted to each simulated data set, and its residual standard error is obtained. Data are then randomly removed from the set of independent variables to create artificial non-response, and suitable methods are used to impute the missing values. The methods considered are mean, regression, hot-deck and cold-deck, multiple and median imputation, listwise deletion, the EM algorithm and the nearest-neighbour method. The paper investigates the three most common traditional methods of handling missing data to establish the optimal one. Suitability is determined by how little the characteristics of the imputed data sample vary from those of the original data set before imputation. This variation is measured using the regression intercept and the residual standard error. The R statistical package was used for most of the regression analyses. Microsoft Excel was used to determine the correlation of columns in the hot-deck method because it is readily available as a component of the Microsoft package. The results from the data analysis section showed intercept and R-squared values that closely mirror those of the original data sets, suggesting that median imputation is the better imputation method among the conventional methods. This finding is important from a research point of view, given how often data are missing in scientific research. Finding and using the median is simple, so most researchers have a ready tool at hand for handling missing data.
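The evaluation procedure described in the abstract can be sketched in a few lines of code. The sketch below is illustrative only and is not the author's implementation (the paper reports using R and Microsoft Excel). The exponential predictor, the true coefficients, the 20% missing fraction and all function names are assumptions chosen for the example, and only mean imputation, median imputation and listwise deletion are shown. It simulates a data set, fits a regression, removes predictor values at random, imputes them, refits, and reports the intercept and residual standard error for comparison with the original fit.

```python
import numpy as np


def fit_simple_regression(x, y):
    """Least-squares fit of y = a + b*x.

    Returns (intercept, slope, residual standard error).
    """
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    # Residual standard error with n - 2 degrees of freedom.
    rse = np.sqrt(resid @ resid / (len(y) - 2))
    return coef[0], coef[1], rse


def make_missing(x, frac, rng):
    """Copy x and set a random fraction of its entries to NaN."""
    x_miss = x.copy()
    n_miss = int(round(frac * len(x)))
    idx = rng.choice(len(x), size=n_miss, replace=False)
    x_miss[idx] = np.nan
    return x_miss


def impute(x_miss, method):
    """Fill NaNs with the mean or the median of the observed values."""
    if method == "mean":
        fill = np.nanmean(x_miss)
    elif method == "median":
        fill = np.nanmedian(x_miss)
    else:
        raise ValueError(f"unknown method: {method}")
    return np.where(np.isnan(x_miss), fill, x_miss)


def compare(n=200, frac=0.2, seed=0):
    """Fit the regression before and after each missing-data treatment."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=2.0, size=n)       # assumed skewed predictor
    y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)  # assumed true model
    results = {"original": fit_simple_regression(x, y)}

    x_miss = make_missing(x, frac, rng)
    for method in ("mean", "median"):
        results[method] = fit_simple_regression(impute(x_miss, method), y)

    # Listwise deletion: drop every row whose predictor is missing.
    keep = ~np.isnan(x_miss)
    results["listwise"] = fit_simple_regression(x[keep], y[keep])
    return results


if __name__ == "__main__":
    for name, (a, b, rse) in compare().items():
        print(f"{name:9s} intercept={a:.3f} slope={b:.3f} rse={rse:.3f}")
```

Under the criterion stated in the abstract, the preferred method is the one whose intercept and residual standard error lie closest to the "original" row. A single seed and a single distribution say little on their own, so any real comparison would repeat the run over several distributions and missing fractions.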

Published in	Mathematical Modelling and Applications (Volume 5, Issue 2)
DOI	10.11648/j.mma.20200502.14
Page(s)	87-93
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2020. Published by Science Publishing Group

Keywords

Regression, Nearest Neighbor, Hot Decking, Median Substitution, Missing Data

Cite This Article

APA Style

Fredrick Ochieng’ Odhiambo. (2020). Comparative Study of Various Methods of Handling Missing Data. Mathematical Modelling and Applications, 5(2), 87-93. https://doi.org/10.11648/j.mma.20200502.14

ACS Style

Fredrick Ochieng’ Odhiambo. Comparative Study of Various Methods of Handling Missing Data. Math. Model. Appl. 2020, 5(2), 87-93. doi: 10.11648/j.mma.20200502.14

AMA Style

Fredrick Ochieng’ Odhiambo. Comparative Study of Various Methods of Handling Missing Data. Math Model Appl. 2020;5(2):87-93. doi: 10.11648/j.mma.20200502.14

@article{10.11648/j.mma.20200502.14,
  author = {Fredrick Ochieng’ Odhiambo},
  title = {Comparative Study of Various Methods of Handling Missing Data},
  journal = {Mathematical Modelling and Applications},
  volume = {5},
  number = {2},
  pages = {87-93},
  doi = {10.11648/j.mma.20200502.14},
  url = {https://doi.org/10.11648/j.mma.20200502.14},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.mma.20200502.14},
 year = {2020}
}

TY  - JOUR
T1  - Comparative Study of Various Methods of Handling Missing Data
AU  - Fredrick Ochieng’ Odhiambo
Y1  - 2020/04/30
PY  - 2020
N1  - https://doi.org/10.11648/j.mma.20200502.14
DO  - 10.11648/j.mma.20200502.14
T2  - Mathematical Modelling and Applications
JF  - Mathematical Modelling and Applications
JO  - Mathematical Modelling and Applications
SP  - 87
EP  - 93
PB  - Science Publishing Group
SN  - 2575-1794
UR  - https://doi.org/10.11648/j.mma.20200502.14
VL  - 5
IS  - 2
ER  -

Author Information

Fredrick Ochieng’ Odhiambo

Department of Mathematics and Actuarial Sciences, South Eastern Kenya University (Seku), Kitui, Kenya
