Abstract
Introduction: Advances in genotyping technology have accelerated genetic progress in breeding programs by making genomic selection (GS) feasible. GS has improved the accuracy of genomic evaluations and has spread quickly through livestock breeding. For several decades, work on phenotypic variation in dairy cattle populations focused on continuous traits, especially milk yield. From an animal breeding perspective, concentrating on this category of traits reduces the genomic merit of novel functional traits, because the two groups of traits are negatively correlated. Further progress and greater economic returns in modern animal breeding programs require a better understanding of novel functional traits and their direct inclusion in these programs. Many prominent livestock traits, including disease resistance and calving difficulty, have a binary distribution of phenotypes and are often termed threshold traits. These traits matter in animal breeding because of animal welfare concerns and consumer demand for healthy, high-quality products. The threshold nature of most functional traits, their control by multiple genes, and their departure from Mendelian inheritance and from the normal distribution make it difficult for statistical methods to predict genomic estimated breeding values (GEBV) accurately. Machine learning, as a non-parametric approach, is commonly used to address these challenges in genomic selection for threshold traits. Random Forest (RF) and Boosting are powerful machine learning methods that have been used to recognize gene-gene, protein-protein and gene-environment interactions, to detect disease-associated genes, to model relationships among combinations of markers, to select genes associated with a target trait, to identify regulatory factors in protein and DNA sequences, to classify samples in microarray gene expression data, and to improve the accuracy of genomic prediction.
The objective of the current study was to investigate how the rate of the threshold phenotype in the training set and different genomic architectures affect the performance of the RF and Boosting methods. In this regard, predetermining and tuning the input parameters of each method is a basic step toward achieving maximum genomic accuracy.
Materials and Methods: A population of 2090 animals genotyped for 10,000 markers was simulated using QMSim software. In the first phase, a historical population of 1045 females and 1045 males was simulated over 1,000 generations. In the second phase, a bottleneck was applied to produce a realistic level of linkage disequilibrium (LD): the population size was reduced over 100 generations to 209 individuals. In the third phase, the population size was increased over 100 generations to 2090 individuals (2030 females and 60 males). All 2090 individuals of the last historical generation served as founders, and the recent population was expanded from them under a random mating design by simulating an additional 10 generations. During these generations, the replacement ratio was set at 0.20 for females and 0.50 for males, and selection of candidate individuals was based on estimated breeding value (EBV) and age. Each mating produced a single offspring with an equal probability of being male or female. Individuals of generations 6 to 9 were used as the training set, while the whole of generation 10 was used as the validation set. Genomic populations with 29 chromosomes were simulated to reflect variation in heritability (0.05 and 0.20), linkage disequilibrium (low and high) and number of QTL (200 and 600). Four scenarios were simulated: I (10K SNP, h2 = 0.20, LD = low, 600 QTL), II (10K SNP, h2 = 0.20, LD = low, 200 QTL), III (10K SNP, h2 = 0.05, LD = low, 200 QTL) and IV (10K SNP, h2 = 0.05, LD = high, 200 QTL). To create different rates of the discrete phenotype, animals in the training set were coded as 1 (inappropriate phenotype) when their phenotype residual was below the average of the residuals (first approach) or below that average minus 1 (second approach); all other individuals were coded as 0 (appropriate phenotype).
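The two coding approaches for the training set can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `code_threshold_phenotype`, the `offset` parameter and the residual values are hypothetical, and treating the second cutoff as the mean minus a configurable offset of 1 is an assumption about how that cutoff is scaled.

```python
from statistics import mean


def code_threshold_phenotype(residuals, offset=0.0):
    """Code each animal as 1 (inappropriate) if its residual is below
    mean(residuals) - offset, and as 0 (appropriate) otherwise.

    `offset` is an illustrative parameter: 0 mimics the first approach,
    a positive value mimics the stricter second approach.
    """
    cutoff = mean(residuals) - offset
    return [1 if r < cutoff else 0 for r in residuals]


# Hypothetical residuals, for illustration only (not data from the study).
residuals = [-1.8, -0.6, -0.2, 0.1, 0.4, 0.9, 1.2]

first = code_threshold_phenotype(residuals, offset=0.0)   # cutoff at the mean
second = code_threshold_phenotype(residuals, offset=1.0)  # cutoff at mean - 1 (assumed scale)

# Share of animals coded 1 under each approach.
rate_first = sum(first) / len(first)
rate_second = sum(second) / len(second)
```

A cutoff at the mean labels a larger share of animals as 1 than a cutoff further below the mean, which is how the two approaches yield different threshold phenotype rates in the training set.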
To tune the input parameters of the models, different levels of mtry (100, 1000 and 2000), ntree (500, 1000 and 2000) and nodesize (1 and 5) were considered for RF, and different levels of ntree (500, 1000 and 2000), tc (1, 5 and 10) and lc (0.1 and 0.05) were considered for Boosting.
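As a rough illustration of the tuning grids listed above, the following Python sketch enumerates every parameter combination for each method. The parameter names and levels come from the text; the helpers `expand_grid` and `pick_best`, and the idea of passing an error function, are hypothetical scaffolding and not part of the original analysis.

```python
from itertools import product

# Parameter levels as listed in the text.
rf_grid = {"mtry": [100, 1000, 2000], "ntree": [500, 1000, 2000], "nodesize": [1, 5]}
boost_grid = {"ntree": [500, 1000, 2000], "tc": [1, 5, 10], "lc": [0.1, 0.05]}


def expand_grid(grid):
    """Return one dict per combination of the parameter levels in `grid`."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def pick_best(settings, error_fn):
    """Return the setting with the lowest error.

    `error_fn` is a placeholder for whatever error estimate is used,
    for example an OOB error for RF or a CV error for Boosting.
    """
    return min(settings, key=error_fn)


rf_settings = expand_grid(rf_grid)        # 3 x 3 x 2 combinations
boost_settings = expand_grid(boost_grid)  # 3 x 3 x 2 combinations
```

Each grid contains 18 combinations, so a full search fits 18 models per method for every scenario and coding approach.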
Results and Discussion: The lowest out-of-bag (OOB) error for RF was obtained with mtry = 2000, ntree = 1000 and nodesize = 1, while the lowest cross-validation (CV) error for Boosting was observed with mtry = 2000, tc = 10 and lc = 0.05. Across all scenarios, RF showed a wider range of genomic accuracy (0.287 to 0.57) than Boosting (0.4 to 0.58). The accuracy of genomic prediction decreased in both RF and Boosting as the rate of the inappropriate phenotype increased. Under the first approach (cutoff at the average), more individuals lie close to the average of the normal population than under the second approach (cutoff at the average minus 1), which leads to more classification (coding) errors and therefore to lower genomic prediction accuracy. RF and Boosting both performed well when high-heritability traits were controlled by a large number of QTL. Increasing the number of QTL generally led to a major improvement in RF accuracy, while only a negligible positive effect was found for Boosting.
Conclusion: The composition of the training set and the genomic architecture of the population were the two basic factors affecting the accuracy of genomic prediction with machine learning methods. Because Boosting accounts for interactions among predictor variables (SNPs), is self-correcting, and has a strong capacity to reduce training error, it gave more accurate estimates than RF under all scenarios.