Comparative Study of Various Methods of Handling Missing Data
Mathematical Modelling and Applications
Volume 5, Issue 2, June 2020, Pages: 87-93
Received: Oct. 2, 2019; Accepted: Apr. 13, 2020; Published: Apr. 30, 2020
Author
Fredrick Ochieng’ Odhiambo, Department of Mathematics and Actuarial Sciences, South Eastern Kenya University (SEKU), Kitui, Kenya
Abstract
The scientific literature lacks a straightforward answer as to which of the existing methods for imputing missing data is most suitable in terms of simplicity, accuracy and ease of use. This paper explores various methods of data imputation and then proposes a robust one. It uses simulated data sets generated from various distributions. A regression function is fitted to each simulated data set and its residual standard error is obtained. Values are then randomly deleted from the set of independent variables to create artificial non-response, and suitable methods are used to impute the missing data. The methods considered are mean, regression, hot and cold decking, multiple and median imputation, listwise deletion, the EM algorithm and the nearest neighbour method. The paper investigates the three most common traditional methods of handling missing data to establish the optimal one. Suitability is determined by how little the characteristics of the imputed sample vary from those of the original data set before imputation; this variation is measured using the regression intercept and the residual standard error. The R statistical package was used for most of the regression analyses. Microsoft Excel was used to determine the correlation of columns in the hot decking method because it is readily available as a component of the Microsoft Office package. The results of the data analysis show intercept and R-squared values that closely mirror those of the original data sets under median imputation, suggesting that it is the better imputation method among the conventional methods. This finding is important from a research point of view, given the many cases of missing data in scientific research. Finding and using the median is simple, so most researchers have a ready tool at hand for handling missing data.
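The evaluation procedure described in the abstract can be illustrated with a minimal sketch. It is not the author's R code: the exponential predictor, the sample size of 200, the 20% missing rate, the true coefficients, the seed, and all function names are assumptions chosen only for illustration, and only mean and median imputation are shown. The sketch fits a simple regression to the complete data, deletes predictor values at random, imputes them, refits, and prints the intercept and residual standard error for comparison.

```python
import random
import statistics


def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b*x.

    Returns (intercept, slope, residual_standard_error).
    """
    n = len(x)
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rse = (rss / (n - 2)) ** 0.5  # n - 2 residual degrees of freedom
    return intercept, slope, rse


def delete_at_random(x, fraction, rng):
    """Replace a random fraction of the values with None (artificial non-response)."""
    n_missing = round(len(x) * fraction)
    missing_idx = set(rng.sample(range(len(x)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(x)]


def impute(x, method):
    """Fill every None with the mean or the median of the observed values."""
    observed = [v for v in x if v is not None]
    if method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in x]


# Hypothetical simulation settings (assumptions, not taken from the paper).
rng = random.Random(42)
x_full = [rng.expovariate(0.5) for _ in range(200)]       # skewed predictor
y = [3.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x_full]   # linear response
x_missing = delete_at_random(x_full, 0.2, rng)

results = {"original": fit_simple_regression(x_full, y)}
for method in ("mean", "median"):
    results[method] = fit_simple_regression(impute(x_missing, method), y)

for name, (a, b, rse) in results.items():
    print(f"{name:>8}: intercept={a:.3f} slope={b:.3f} RSE={rse:.3f}")
```

Under the criterion stated in the abstract, the preferred method is the one whose intercept and residual standard error stay closest to the "original" row. A single run with one seed is not evidence either way; a fair comparison would repeat the simulation across distributions and missing rates.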
Keywords
Regression, Nearest Neighbor, Hot Decking, Median Substitution, Missing Data
To cite this article
Fredrick Ochieng’ Odhiambo, Comparative Study of Various Methods of Handling Missing Data, Mathematical Modelling and Applications. Vol. 5, No. 2, 2020, pp. 87-93. doi: 10.11648/j.mma.20200502.14
Copyright
Copyright © 2020 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.