Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets

Md. Bipul Hossen; Md. Rabiul Auwul

doi:10.11648/j.bsi.20200501.14

Peer-Reviewed

Published in Biomedical Statistics and Informatics (Volume 5, Issue 1)

Received: 29 December 2019 Accepted: 10 January 2020 Published: 2 March 2020

Abstract

Clustering plays a fundamental role in exploring data, making predictions, and handling anomalies in the data. Observations that share similar characteristics are grouped into clusters using iterative algorithms. As the volume of real-world data grows day by day, the challenge of perceiving and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the complexity of the large number of genes in biological networks. To address this challenge, we use clustering algorithms. In this study, we compare four widely used clustering algorithms: K-Means, Partitioning Around Medoids (PAM), Agglomerative Hierarchical, and DIANA. The algorithms are evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and on a simulated dataset. The comparison is based on seven popular cluster validity indices: Average Silhouette Index, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma. We find that, among these four algorithms, PAM performs best on the Affymetrix datasets and DIANA performs best on the cDNA datasets. This study provides a practical evaluation framework for assessing clustering results on gene expression cancer datasets.
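
To make the idea of an internal cluster validity index concrete, the following is a minimal Python sketch of two of the indices named in the abstract, the Average Silhouette Index and the Dunn Index. It is not the authors' code: the function names, the use of Euclidean distance, and the six toy points are assumptions chosen for illustration, and the partitions are supplied by hand rather than produced by K-Means, PAM, Agglomerative Hierarchical, or DIANA.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette(points, labels):
    """Mean silhouette width over all points for a given partition."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    min_separation = math.inf
    max_diameter = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        if labels[i] == labels[j]:
            max_diameter = max(max_diameter, d)
        else:
            min_separation = min(min_separation, d)
    if max_diameter == 0:
        raise ValueError("the Dunn index is undefined when every cluster has zero diameter")
    return min_separation / max_diameter


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
compact_labels = [0, 0, 0, 1, 1, 1]  # partition that matches the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]    # partition that mixes the two groups

for name, labels in [("compact", compact_labels), ("mixed", mixed_labels)]:
    print(name, average_silhouette(toy_points, labels), dunn_index(toy_points, labels))
```

For both indices a larger value indicates more compact, better-separated clusters, so the partition that matches the two groups should score higher than the mixed one. Applying the same functions to the output of each clustering algorithm on the same data gives the kind of side-by-side comparison the study describes.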

Published in	Biomedical Statistics and Informatics (Volume 5, Issue 1)
DOI	10.11648/j.bsi.20200501.14
Page(s)	20-25
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Microarray, Clustering Algorithm, Gap Statistic, Validity Indices

Cite This Article

APA Style

Md. Bipul Hossen, Md. Rabiul Auwul. (2020). Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets. Biomedical Statistics and Informatics, 5(1), 20-25. https://doi.org/10.11648/j.bsi.20200501.14

ACS Style

Md. Bipul Hossen; Md. Rabiul Auwul. Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets. Biomed. Stat. Inform. 2020, 5(1), 20-25. doi: 10.11648/j.bsi.20200501.14

AMA Style

Md. Bipul Hossen, Md. Rabiul Auwul. Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets. Biomed Stat Inform. 2020;5(1):20-25. doi: 10.11648/j.bsi.20200501.14

@article{10.11648/j.bsi.20200501.14,
  author = {Md. Bipul Hossen and Md. Rabiul Auwul},
  title = {Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets},
  journal = {Biomedical Statistics and Informatics},
  volume = {5},
  number = {1},
  pages = {20--25},
  doi = {10.11648/j.bsi.20200501.14},
  url = {https://doi.org/10.11648/j.bsi.20200501.14},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.bsi.20200501.14},
  abstract = {Clustering plays a fundamental role in exploring data, making predictions, and handling anomalies in the data. Observations that share similar characteristics are grouped into clusters using iterative algorithms. As the volume of real-world data grows day by day, the challenge of perceiving and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the complexity of the large number of genes in biological networks. To address this challenge, we use clustering algorithms. In this study, we compare four widely used clustering algorithms: K-Means, Partitioning Around Medoids (PAM), Agglomerative Hierarchical, and DIANA. The algorithms are evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and on a simulated dataset. The comparison is based on seven popular cluster validity indices: Average Silhouette Index, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma. We find that, among these four algorithms, PAM performs best on the Affymetrix datasets and DIANA performs best on the cDNA datasets. This study provides a practical evaluation framework for assessing clustering results on gene expression cancer datasets.},
  year = {2020}
}

TY - JOUR
T1 - Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets
AU - Md. Bipul Hossen
AU - Md. Rabiul Auwul
Y1 - 2020/03/02
PY - 2020
N1 - https://doi.org/10.11648/j.bsi.20200501.14
DO - 10.11648/j.bsi.20200501.14
T2 - Biomedical Statistics and Informatics
JF - Biomedical Statistics and Informatics
JO - Biomedical Statistics and Informatics
SP - 20
EP - 25
PB - Science Publishing Group
SN - 2578-8728
UR - https://doi.org/10.11648/j.bsi.20200501.14
AB - Clustering plays a fundamental role in exploring data, making predictions, and handling anomalies in the data. Observations that share similar characteristics are grouped into clusters using iterative algorithms. As the volume of real-world data grows day by day, the challenge of perceiving and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the complexity of the large number of genes in biological networks. To address this challenge, we use clustering algorithms. In this study, we compare four widely used clustering algorithms: K-Means, Partitioning Around Medoids (PAM), Agglomerative Hierarchical, and DIANA. The algorithms are evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and on a simulated dataset. The comparison is based on seven popular cluster validity indices: Average Silhouette Index, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma. We find that, among these four algorithms, PAM performs best on the Affymetrix datasets and DIANA performs best on the cDNA datasets. This study provides a practical evaluation framework for assessing clustering results on gene expression cancer datasets.
VL - 5
IS - 1
ER -

Author Information

Md. Bipul Hossen

Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

Md. Rabiul Auwul

Department of Statistics, Guangzhou University, Guangzhou, China
