Optimization of the Parameters of Machine Learning Methods for Genomic Evaluation of Binary Discrete Traits, Considering Population Structure and Different Phenotype Distributions in the Reference Population

Article type: Research Paper - Genetics and Breeding of Livestock and Poultry

Author

Islamic Azad University, Astara Branch

Abstract

Initial setting and tuning of the input parameters of machine learning methods is an essential step toward achieving maximum genomic prediction accuracy. In this study, genomic populations were simulated on 29 chromosomes for different levels of heritability (0.05 and 0.2), linkage disequilibrium (low and high) and number of quantitative trait loci (200 and 600). To create different proportions of the binary threshold phenotype, individuals in the reference population were assigned code 1 (undesirable phenotype) depending on whether their residual fell below ē − 1SDe (first approach) or among 50 percent of the individuals of the population (second approach), and all other animals were assigned code 0 (desirable phenotype). To tune the input parameters of the models, different levels of the number of sampled SNPs (mtry = 100, 1000 and 2000), the number of bootstrap samples (ntree = 500, 1000 and 2000) and the minimum terminal node size (node size = 1 and 5) were considered for random forest, and different levels of the number of trees (ntree = 100, 1000 and 2000), tree depth (tc = 1, 5 and 10) and learning rate (lc = 0.1 and 0.05) were considered for Boosting. The lowest out-of-bag error was observed for mtry = 2000, ntree = 1000 and node size = 1, and the lowest cross-validation error for Boosting was observed for ntree, tc and lr of 1000, 10 and 0.05, respectively. The genomic prediction accuracy of random forest and Boosting increased as the proportion of the undesirable phenotype decreased (first approach). Overall, Boosting outperformed random forest in all scenarios, which can be attributed to its accounting for interactions among markers, its self-healing property and its strong ability to reduce model error.

Keywords


Article Title [English]

The Effect of Parameters Tuning of Machine Learning Methods on Genomic Evaluation of Discrete Traits Considering Population Structure and Different Distributions of Phenotype in Training Set

Author [English]

  • Yousef Naderi
Abstract [English]

Introduction: The development of genotyping technologies has facilitated genetic progress in breeding programs through the implementation of genomic selection (GS). GS has improved the accuracy of genomic evaluations and has spread quickly in livestock breeding. For several decades, most work on phenotypic variation in dairy cattle populations focused on continuous traits, especially milk yield. From an animal breeding perspective, emphasis on this category of traits reduces the genomic merit of novel functional traits, because the two are negatively correlated. Considerable advances and increasing economic benefits in modern animal breeding programs require a better understanding and the direct inclusion of novel functional traits. Many prominent traits in livestock, including disease resistance and calving difficulty, present a binary distribution of phenotypes (and are often termed threshold traits); these traits are important in animal breeding because of animal welfare and consumer demand for healthy, high-quality products. The threshold nature of most functional traits, their control by multiple genes, and their departure from Mendelian inheritance and from the normal distribution are challenges for accurate prediction of genomic estimated breeding values (GEBV) with statistical methods. Machine learning, as a non-parametric approach, has commonly been extended to address the challenges of genomic selection for threshold traits. Random Forest (RF) and Boosting are powerful machine learning methods used to recognize gene-gene, protein-protein and gene-environment interactions, to detect disease-associated genes, to model the relationship among combinations of markers, to select genes associated with the target trait, to identify regulatory factors in protein and DNA sequences, to classify samples in microarray gene expression data, and to improve the accuracy of genomic prediction.
The objective of the current study was to investigate the effect of the threshold phenotype rate in the training set and of different genomic architectures on the performance of the RF and Boosting methods. In this regard, pre-determining and tuning the input parameters of each method is a basic step toward achieving maximum genomic accuracy.
Materials and Methods: A population of 2090 animals genotyped for 10,000 markers was simulated using QMSim software. In the first phase, a historical population of 1045 females and 1045 males was generated over a time span of 1,000 generations. In the second phase, a bottleneck was used to produce a realistic level of LD: the population size decreased over 100 generations to 209 individuals. In the third phase, the population size increased over 100 generations (2030 females and 60 males). All 2090 individuals of the last historical generation served as founders, and the recent population was expanded under a random mating design by simulating an additional 10 generations. During these generations, the replacement ratio was set at 0.2 and 0.50 for females and males, respectively, and candidate individuals were selected based on EBV and age. Each mating produced only one offspring, with equal probability of being male or female. Individuals of generations 6 to 9 were used as the training set, while the whole of generation 10 was used as the validation set. Genomic populations were simulated to reflect variation in heritability (0.05 and 0.20), linkage disequilibrium (low and high) and number of QTL (200 and 600) on 29 chromosomes; four scenarios were therefore simulated: I (10K SNP, h2 = 0.20, LD = low and 600 QTL), II (10K SNP, h2 = 0.20, LD = low and 200 QTL), III (10K SNP, h2 = 0.05, LD = low and 200 QTL) and IV (10K SNP, h2 = 0.05, LD = high and 200 QTL). To create different rates of the discrete phenotype, animals in the training set were coded as 1 (inappropriate phenotype) when their phenotype residual was less than the average of residuals (ē) or less than ē − 1SDe for the first and second approaches, respectively, and all other individuals were coded as 0 (appropriate phenotype).
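The threshold coding described above can be illustrated with a minimal sketch. It assumes normally distributed residuals and uses the sample standard deviation; the function name `code_binary_phenotype`, the seed and the sample size are hypothetical and are not taken from the study.

```python
import random
import statistics


def code_binary_phenotype(residuals, sd_offset):
    """Code 1 (inappropriate) if residual < mean - sd_offset * SD, else 0."""
    mean_e = statistics.fmean(residuals)
    sd_e = statistics.stdev(residuals)  # sample SD; an assumption of this sketch
    threshold = mean_e - sd_offset * sd_e
    return [1 if e < threshold else 0 for e in residuals]


# Simulated residuals (illustrative only, not the study's data).
rng = random.Random(42)
residuals = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Threshold at the mean residual, and at one SD below the mean.
codes_mean = code_binary_phenotype(residuals, sd_offset=0.0)
codes_minus_1sd = code_binary_phenotype(residuals, sd_offset=1.0)

rate_mean = sum(codes_mean) / len(codes_mean)
rate_minus_1sd = sum(codes_minus_1sd) / len(codes_minus_1sd)
print(f"rate of code 1, threshold at mean:     {rate_mean:.3f}")
print(f"rate of code 1, threshold at mean-1SD: {rate_minus_1sd:.3f}")
```

Under a normal distribution, a threshold at the mean codes roughly half of the animals as 1, while a threshold one standard deviation below the mean codes roughly 16 percent, which is how the two approaches yield different phenotype rates in the training set.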
To tune the input parameters of the models, different levels of mtry (100, 1000 and 2000), ntree (500, 1000 and 2000) and nodesize (1 and 5) were considered for RF, and different levels of ntree (500, 1000 and 2000), tc (1, 5 and 10) and lc (0.1 and 0.05) were considered for Boosting.
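As a rough illustration of the size of this search space, the sketch below enumerates every combination of the listed levels. It assumes a full factorial grid, which the abstract does not state explicitly; the helper `expand_grid` is a hypothetical name, and no model is fitted here.

```python
from itertools import product

# Parameter levels as listed in the English abstract.
rf_grid = {
    "mtry": [100, 1000, 2000],
    "ntree": [500, 1000, 2000],
    "nodesize": [1, 5],
}
boosting_grid = {
    "ntree": [500, 1000, 2000],
    "tc": [1, 5, 10],
    "lc": [0.1, 0.05],
}


def expand_grid(grid):
    """Return one dict per combination of parameter levels (full factorial)."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


rf_settings = expand_grid(rf_grid)
boosting_settings = expand_grid(boosting_grid)
print(len(rf_settings), len(boosting_settings))
```

Each resulting dictionary is one candidate setting. In a tuning run, each setting would be scored by its out-of-bag error (RF) or cross-validation error (Boosting), and the setting with the lowest error retained.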
Results and Discussion: The lowest out-of-bag (OOB) error for RF was obtained with mtry = 2000, ntree = 1000 and nodesize = 1, while the lowest cross-validation (CV) error for Boosting was observed with ntree = 1000, tc = 10 and lc = 0.05. Across all scenarios, RF showed a wider range of genomic accuracy (0.287 to 0.57) than Boosting (0.4 to 0.58). The accuracy of genomic prediction decreased in both RF and Boosting as the rate of the inappropriate phenotype increased: more individuals lie in the vicinity of the population average under the first approach (ē) than under the second approach (ē − 1SDe), which leads to more classification (coding) errors and therefore to lower genomic prediction accuracy. RF and Boosting performed well when high-heritability traits were controlled by a large number of QTL. Increasing the number of QTL generally led to a major improvement in RF accuracy, while only a negligible positive effect was found for Boosting.
Conclusion: The composition of the training set and the genomic architecture of the population were two basic factors affecting the accuracy of genomic prediction with machine learning methods. Boosting accounts for interactions among predictor variables (SNPs) and combines self-healing with a strong ability to reduce training error, which resulted in more accurate estimates than RF under all scenarios.

Keywords [English]

  • Cross validation
  • Heritability
  • Linkage disequilibrium
  • Machine learning
  • Threshold traits