Comparison of principal component analysis (PCA) and discriminant analysis of principal components (DAPC) methods for analysis of population structure in Akhal-Teke, Arabian and Caspian horse breeds using genomic data

Document Type : Genetics & breeding

Authors

1 Department of Animal Science, Faculty of Agriculture, University of Tabriz, Tabriz, Iran

2 Department of Animal Sciences, Faculty of Agriculture and Natural Resources, Arak University, Arak, Iran.

3 Department of Artificial intelligence, Faculty of Computer Engineering, University of Tabriz, Tabriz, Iran.

Abstract

Introduction The development of high-throughput and cost-effective genotyping methods in recent years has made it possible to evaluate the genetic structure of populations, and the relationships among them, using genomic data. Genome-wide inference of population structure from genetic markers can provide valuable information on evolutionary relationships and on the clustering of subpopulations, which is useful when designing animal breeding programs. In large-scale studies, one question of interest is whether genetic differences exist among groups sampled from different geographic locations. The objective of this study was to compare principal component analysis (PCA) and discriminant analysis of principal components (DAPC) for determining population structure and for assigning individuals to their true population of origin in three horse breeds from the Middle East (Akhal-Teke, Arabian and Caspian) using genomic data.
Materials and Methods Genomic data from 61 animals, consisting of Akhal-Teke (19), Arabian (24) and Caspian (18) horses, were used to investigate the population structure of these Asian horse breeds. The data were obtained from the Equine Genetic Diversity Consortium (EGDC) project. Hair or tissue samples were collected from the animals. DNA was extracted using an optimized Puregene (Qiagen) protocol, and approximately 1 μg of DNA was used to genotype each sample. Genotyping was performed with Illumina SNP 50K BeadChip arrays, which genotype 52,603 SNP loci, according to the standard Illumina guidelines. Several quality control steps were applied to the raw genotype data using PLINK v1.07. Samples with more than 5% missing genotypes were excluded from the analysis. Minor allele frequency (MAF) and call rate were then calculated for each SNP, and SNPs with a call rate < 95% or a MAF < 2% were discarded. The remaining SNPs were tested for deviation from Hardy-Weinberg equilibrium (p < 10⁻⁶) to identify genotyping errors. The Bonferroni correction (β = α/n) was used to account for multiple testing. Principal component analysis (PCA) is a statistical technique that summarizes many variables in a few new variables that capture as much of the variation in the data as possible. For this purpose, the variance-covariance matrix of the original variables was first calculated and the principal components were extracted from it. Each principal component has an associated eigenvalue that measures the amount of variance it explains. DAPC, in contrast, is a model-free multivariate method designed to identify and describe clusters of genetically related individuals. When prior group information is lacking, DAPC uses sequential K-means clustering and model selection to infer genetic clusters. Both the PCA and DAPC analyses were carried out with code written in R v3.4.1.
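The quality control filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the study: the genotype layout (minor-allele counts 0/1/2, with None for a missing call) and all function names are assumptions made for illustration, and only the thresholds (5% sample missingness, 95% call rate, 2% MAF) come from the text.

```python
# Hypothetical layout: genotypes[i][j] is the minor-allele count (0, 1 or 2)
# of animal i at SNP j, or None when the call is missing.

def sample_missing_rate(row):
    """Fraction of missing genotype calls for one animal."""
    return sum(g is None for g in row) / len(row)


def snp_call_rate(column):
    """Fraction of non-missing genotype calls for one SNP."""
    return sum(g is not None for g in column) / len(column)


def minor_allele_frequency(column):
    """MAF computed from allele counts among the non-missing genotypes."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def quality_control(genotypes, max_sample_missing=0.05,
                    min_call_rate=0.95, min_maf=0.02):
    """Return (kept animal indices, kept SNP indices)."""
    # Step 1: exclude animals with more than 5% missing genotypes.
    kept_samples = [i for i, row in enumerate(genotypes)
                    if sample_missing_rate(row) <= max_sample_missing]
    # Step 2: among the remaining animals, discard SNPs with a low call rate
    # or a low minor allele frequency.
    kept_snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in kept_samples]
        if (snp_call_rate(column) >= min_call_rate
                and minor_allele_frequency(column) >= min_maf):
            kept_snps.append(j)
    return kept_samples, kept_snps


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level, beta = alpha / n."""
    return alpha / n_tests
```

As a rough check on the Hardy-Weinberg threshold, bonferroni_threshold(0.05, 50000) is about 10⁻⁶. The value α = 0.05 is an assumption here, since the text does not state the α that was used.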
Results and Discussion PCA summarizes the overall variation among individuals, which includes both between-group and within-group variability, and it gave a clear picture of the differences between the groups. The first two components explained 10.8% of the variance in both the PCA and DAPC methods. Both methods assigned individuals to their true population of origin with high accuracy, and both clustered the three populations separately. The Bayesian information criterion (BIC) was used to determine the optimal number of clusters for DAPC; K = 3 gave the lowest BIC and completely separated the three populations. DAPC separated the populations better than PCA because it maximizes between-group variance while minimizing within-group variance. It also performed better than PCA in determining the optimal number of clusters (K) and gave a clearer picture of the relationships among individuals. These results suggest that DAPC can be used as an alternative to PCA in quality control for genome-wide association studies (GWAS), because it summarizes the genetic differentiation between groups while disregarding within-group variation, and therefore describes population structure more clearly.
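The contrast between the two methods can be made concrete by partitioning the variance along a single axis into its between-group and within-group parts. The Python sketch below is only an illustration of that criterion, not the adegenet code used in the study. The function names and the toy scores are hypothetical, and the breed labels are attached to invented numbers rather than to real data.

```python
def variance_partition(scores, groups):
    """Split the total sum of squares along one axis into
    between-group and within-group sums of squares."""
    grand_mean = sum(scores) / len(scores)
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    between = 0.0
    within = 0.0
    for values in by_group.values():
        group_mean = sum(values) / len(values)
        between += len(values) * (group_mean - grand_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in values)
    return between, within


def separation_ratio(scores, groups):
    """Between-group over within-group sum of squares; a discriminant
    axis is chosen to make this ratio as large as possible."""
    between, within = variance_partition(scores, groups)
    return between / within if within > 0 else float("inf")


# Toy scores for nine animals, three per breed (invented values).
breeds = ["Akhal-Teke"] * 3 + ["Arabian"] * 3 + ["Caspian"] * 3
# An axis with large spread inside each breed, as a principal component may have.
pc_like_axis = [0.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0, 10.0]
# An axis with the same breed means but little spread inside each breed.
discriminant_like_axis = [1.9, 2.0, 2.1, 4.9, 5.0, 5.1, 7.9, 8.0, 8.1]

ratio_pc = separation_ratio(pc_like_axis, breeds)
ratio_da = separation_ratio(discriminant_like_axis, breeds)
```

With these toy values the breed means are identical on both axes, so the between-group part is the same. The second axis has far less within-group spread and therefore a much larger ratio, which is the sense in which a discriminant axis separates groups more sharply than a principal component.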
Conclusion Although previous studies placed these three Middle Eastern breeds in a single cluster in neighbor-joining trees, the results of this study show that the three breeds group separately and that the DAPC method better illustrates the inter-population relationships among horse breeds.
 

Keywords


