Introduction Genomic selection makes it possible to obtain highly accurate estimates of breeding values for newborn individuals in the absence of phenotypic records. In genomic selection, selection decisions are based on genomic breeding values predicted from high-density SNP panels. Dramatic advances in sequencing technologies are providing high-dimensional molecular marker information at low cost. Next-generation sequencing protocols such as genotyping by sequencing (GBS) have been suggested as an efficient and cost-effective genotyping method for genomic selection in cattle. GBS is capable of providing acceptable marker density for genomic selection or genome-wide association studies at roughly one third of the cost of currently available genotyping technologies. However, polymorphic loci scored by GBS can contain a large proportion of missing data across samples because random fragments of the genome are sequenced at low depth, leaving some loci with zero coverage in some individuals. Most analyses require a complete dataset; therefore, marker imputation is a necessary step before GBS data can be used for most purposes, such as genomic selection. Because the order of markers is unknown in GBS data, an imputation method that does not require prior information about marker order is needed. Nonparametric models from the machine-learning repertoire have been proposed as an alternative for such situations; these models do not assume a particular parametric form. Several machine-learning approaches are currently used for genotype imputation, and it is important to assess the performance of diverse methodologies and identify those that provide the greatest predictive accuracy in a given population. Singular value decomposition (SVD) imputation is capable of imputing missing markers in GBS data.
The aim of this study was to assess the performance of the intelligent SVD algorithm for imputation of missing genotypes.
Materials and Methods A genome consisting of one chromosome with a length of one Morgan was simulated using the hypred package. In different scenarios, 500, 1000, 1500, 2000, 2500 or 3000 SNPs with an equal initial allele frequency of 0.5 were placed on this chromosome for 1000 individuals. Genotypes at each locus with alleles A1 and A2 were coded as 2 for A1A1, 1 for A1A2 or A2A1, and 0 for A2A2. Then, to mimic genotyping-by-sequencing (GBS) data, the genotype information of 5%, 10%, 25%, 50%, 75% and 90% of SNPs was masked and then imputed with the SVD algorithm. Imputation accuracy (r) was measured as the proportion of genotypes imputed correctly (number of genotypes correctly imputed / total number of masked genotypes). The effects of the number of genotyped individuals (1000 and 2000), the number of genotyped SNPs (500, 1000, 1500, 2000, 2500 and 3000) and the level of minor allele frequency (MAF) (0.01, 0.05, 0.1, 0.2, 0.3 and 0.4) on imputation accuracy were also studied.
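The mask, impute and score procedure described above can be sketched in Python with NumPy. This is a minimal illustration under several assumptions, not the study's implementation: the abstract does not say which SVD variant was used, so an iterative truncated-SVD fill is shown; genotypes are generated from a small set of hypothetical founder haplotypes instead of the hypred simulation; and individual genotype calls are masked at random. All function names and the `rank` and `n_founders` parameters are hypothetical.

```python
import numpy as np


def simulate_founder_genotypes(n_ind, n_snp, n_founders=8, seed=0):
    """Toy 0/1/2 genotypes: each individual carries two haplotypes drawn
    from a small founder pool (an assumption; not the hypred simulation)."""
    rng = np.random.default_rng(seed)
    founders = rng.integers(0, 2, size=(n_founders, n_snp))
    idx = rng.integers(0, n_founders, size=(n_ind, 2))
    return (founders[idx[:, 0]] + founders[idx[:, 1]]).astype(float)


def mask_genotypes(G, fraction, seed=0):
    """Set a random fraction of genotype calls to NaN; return the masked
    matrix and the boolean mask of hidden entries."""
    rng = np.random.default_rng(seed)
    mask = rng.random(G.shape) < fraction
    G_masked = G.copy()
    G_masked[mask] = np.nan
    return G_masked, mask


def svd_impute(G, rank=8, n_iter=50, tol=1e-4):
    """Iterative truncated-SVD imputation (one common variant, assumed here).
    Missing entries start at the SNP mean and are repeatedly replaced by
    their rank-`rank` reconstruction, then rounded to 0, 1 or 2."""
    missing = np.isnan(G)
    if not missing.any():
        return G.copy()
    col_means = np.nan_to_num(np.nanmean(G, axis=0), nan=1.0)
    X = np.where(missing, col_means, G)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Root-mean-square change of the imputed entries in this iteration.
        delta = np.sqrt(np.mean((X[missing] - low_rank[missing]) ** 2))
        X[missing] = low_rank[missing]
        if delta < tol:
            break
    X[missing] = np.clip(np.rint(X[missing]), 0, 2)
    return X


def imputation_accuracy(G_true, G_imputed, mask):
    """Correctly imputed genotypes / total number of masked genotypes."""
    return float(np.mean(G_true[mask] == G_imputed[mask]))


if __name__ == "__main__":
    G = simulate_founder_genotypes(n_ind=200, n_snp=300)
    for fraction in (0.05, 0.10, 0.25, 0.50, 0.75, 0.90):
        G_masked, mask = mask_genotypes(G, fraction, seed=1)
        r = imputation_accuracy(G, svd_impute(G_masked), mask)
        print(f"masked={fraction:.0%}  accuracy={r:.3f}")
```

The toy data are low rank by construction, so the accuracies this sketch prints are not comparable to the figures reported in the study; it only shows how the masking fractions and the accuracy ratio fit together.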
Results and discussion The imputation accuracy of SVD was considerable. With up to 50% of markers masked, SVD imputed missing genotypes with an accuracy of 80%. In the scenarios with 75% and 90% missing genotypes, imputation accuracy decreased to 70% and 48%, respectively. As the population size increased from 1000 to 2000 individuals, the imputation performance of SVD improved, especially in the scenarios with 75% and 90% masked genotypes. Imputation accuracy (r) also rose with the number of markers: increasing the number of SNPs from 500 to 3000 raised accuracy by almost 10%. An inverse relationship was observed between MAF and r: as MAF increased from 0.01 to 0.40, imputation accuracy decreased by 8%. In other words, markers with lower MAF were imputed with higher accuracy.
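One possible contributor to the inverse relationship between MAF and r, offered here as an assumption and not as a finding of the study, is that at low MAF most individuals carry the major homozygous genotype, so even assigning the most common genotype to every missing call gives high concordance. The short Python sketch below computes that baseline under Hardy-Weinberg proportions (also an assumption) for the MAF levels listed in the methods; the function name is hypothetical.

```python
def most_common_genotype_frequency(maf):
    """Frequency of the most common genotype at a biallelic locus under
    Hardy-Weinberg proportions (assumed), i.e. the concordance obtained by
    always guessing that genotype."""
    q = maf          # minor allele frequency
    p = 1.0 - maf    # major allele frequency
    return max(p * p, 2 * p * q, q * q)


for maf in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4):
    baseline = most_common_genotype_frequency(maf)
    print(f"MAF={maf:.2f}  baseline concordance={baseline:.3f}")
```

This baseline falls as MAF rises, which is the same direction as the trend reported above; it does not by itself explain the size of the observed 8% difference.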
Conclusion SVD performed well for genotype imputation on GBS platforms: missing data can be imputed with reasonable accuracy even when the level of missing data is high (up to 50%), and even greater accuracy may be obtained if the number of individuals in the population is high and the MAF of the genotyped SNPs is low. Therefore, SVD can be recommended for genotype imputation in genome-assisted evaluation.