--- title: "Description of the workflow for constructing a genetic map using *hsrecombi*" output: rmarkdown::html_vignette author: D. Wittenburg vignette: > %\VignetteIndexEntry{Description of the workflow for constructing a genetic map using *hsrecombi*} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, tidy.opts = list(width.cutoff = 100), tidy = TRUE ) ``` ## Introduction *hsrecombi* can approximate the genetic-map positions of genetic markers from SNP genotypes in half-sib families. It is referred to [Hampel et al. (2018, Fron Gen)](https://doi.org/10.3389/fgene.2018.00186) and [Qanbari & Wittenburg (2020, Genet Sel Evol)](https://doi.org/10.1186/s12711-020-00593-z) for more information on the methodology. The workflow relies on paternal half-sib families but it can be adapted to maternal half-sib families with minor modifications. In **Part I**, genotypic data from progeny and their common parents are simulated with the R package [*AlphaSimR*](https://CRAN.R-project.org/package=AlphaSimR) and reshaped into the PLINK format. If genotypic data are already available in the required PLINK format, the workflow starts directly at **Part II**. Required input files for Part II: * map files for each chromosome `map_chr.map` * data files for each chromosome `hsphase_input_chr.raw` **Note:** A make pipeline for the main analysis (Part II) is available at [GitHub](https://github.com/wittenburg/hsrecombi). ```{r setup} library(hsrecombi) library(AlphaSimR) library(doParallel) library(ggplot2) library(rlist) Sys.time() ``` ### Global variables ```{r global} # number of chromosomes nchr <- 2 # Number of simulated SNPs (e.g., p = 1500 resembles an average bovine chromosome) p <- 150 # Number of simulated progeny in each half-sib family n <- 1000 # Directory for (simulated) data path.dat <- 'data' dir.create(path.dat, showWarnings = FALSE) # Directory for output path.res <- 'results' dir.create(path.res, showWarnings = FALSE) # Number of computing clusters allocated nclust <- 2 ``` For instance, executing the following workflow with $p=1500$ and $n\geq 1000$ requires about 120 Mb storage space and 10 min computing time (2.6 GHz processor). With $p=150$, it takes about 7 Mb and 20 sec. ## Part I: Data simulation and preparation ### 1: Simulation of genetic data with R package *AlphaSimR* ```{r alphasim} founderPop <- runMacs2(nInd = 1000, nChr = nchr, segSites = p) SP <- SimParam$new(founderPop) SP$setSexes("yes_sys") # Enable tracing location of recombination events SP$setTrackRec(TRUE) pop <- newPop(founderPop) N <- 10; ntotal <- N * n my_pop <- selectCross(pop = pop, nFemale = 500, nMale = N, use = "rand", nCrosses = ntotal) probRec <- list() for(chr in 1:nchr){ co.pat <- matrix(0, ncol = p, nrow = ntotal) for(i in 1:ntotal){ if(nrow(SP$recHist[[1000 + i]][[chr]][[2]]) > 1){ # 1. line contains 1 1 by default loci <- SP$recHist[[1000 + i]][[chr]][[2]][-1, 2] co.pat[i, loci] <- 1 } } probRec[[chr]] <- colMeans(co.pat) } save(list = c('SP', 'founderPop', 'pop', 'my_pop', 'ntotal', 'probRec'), file = 'data/pop.Rdata') ``` Two chromosomes, each with $p$ SNPs and 1 Morgan length, are simulated. The founder population consists of 1\,000 animals with equal gender distribution. The follow-up generation consists of $N$ paternal-half sib families with equal number of progeny $n$. The study design with the equal number of progeny per family was preferred here for simplicity but this is not required in the subsequent steps. ### 2: Selection of sire haplotypes and progeny genotypes ```{r genetic-data} PAT <- my_pop@father rown <- paste(rep(unique(PAT), each = 2), c(1, 2), sep = '_') H.pat <- pullSegSiteHaplo(pop)[rown, ] X <- pullSegSiteGeno(my_pop) # Physical position of markers in Mbp map.snp <- lapply(founderPop@genMap, function(z) z * 100) ``` Note that `founderPop@genMap` includes genetic positions in Morgan units but physical positions are required in the next steps. ### 3: Reshaping data into PLINK format ```{r plink-format} map <- data.frame(Chr = rep(1:length(map.snp), unlist(lapply(map.snp, length))), Name = paste0('SNP', 1:length(unlist(map.snp))), locus_Mb = unlist(map.snp), locus_bp = unlist(map.snp) * 1e+6) colnames(X) <- map$Name FID <- 'FAM001' IID <- my_pop@id MAT <- my_pop@mother SEX <- 2 PHENOTYPE <- -9 for(chr in 1:nchr){ write.table(map[map$Chr == chr, ], file.path(path.dat, paste0('map_chr', chr, '.map')), col.names = F, row.names = F, quote = F) write.table(cbind(FID, IID, PAT, MAT, SEX, PHENOTYPE, X[, map$Chr == chr]), file.path(path.dat, paste0('hsphase_input_chr', chr, '.raw')), col.names = T, row.names = F, quote = F) } ``` ## Part II: Main analysis ```{r parallel-computing} cl <- makeCluster(nclust) registerDoParallel(cl) ``` ### Steps 1-5: Calculation of pairwise recombination rates ```{r recombination-rate} out <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% { # 1: Physical map map <- read.table(file.path(path.dat, paste0('map_chr', chr, '.map')), col.names = c('Chr', 'SNP', 'locus_Mb', 'locus_bp')) # 2: Genotype matrix genomatrix <- data.table::fread(file.path(path.dat, paste0('hsphase_input_chr', chr, '.raw'))) X <- as.matrix(genomatrix[, -c(1:6)]) X[is.na(X)] <- 9 # required for hsphase # 3: Assign daughters to sire IDs daughterSire <- genomatrix$PAT # 4: Estimate sire haplotypes and format data hap <- makehappm(unique(daughterSire), daughterSire, X) save('hap', file = file.path(path.res, paste0('hsphase_output_chr', chr, '.Rdata'))) # Check order and dimension io <- sapply(1:nrow(map), function(z){grepl(x = colnames(X)[z], pattern = map$SNP[z])}) if(sum(io) != nrow(map)) stop("ERROR in dimension") # 5: Estimate recombination rates res <- hsrecombi(hap, X) final <- editraw(res, map) save(list = c('final', 'map'), file = file.path(path.res, paste0("Results_chr", chr, ".Rdata"))) ifelse(nrow(final) > 0, 'OK', 'no result') } print(which(unlist(out) == 'OK')) ``` In step 4, function `makehap` would be sufficient but `makehappm` allows comparison with a deterministic approach shown later. Note that sire haplotypes can also be obtained by some other (external) software. Then sire haplotypes and sire-to-daughter information need to be reshaped with `makehaplist` in step 4. ### 6: Check for candidates of misplacement ```{r misplaced} # 6a: Filter SNPs with unusually large recombination rate to neighbouring (30) SNPs excl <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% { load(file.path(path.res, paste0("Results_chr", chr, ".Rdata"))) checkCandidates(final, map) } # 6b: Heatmap plot of recombination rates for visual verification, e.g.: chr <- 2 load(file.path(path.res, paste0("Results_chr", chr, ".Rdata"))) cand <- excl[[chr]][1] win <- match(cand, map$SNP) + (-100:100) win <- win[(win >= 1) & (win <= nrow(map))] target <- final[(final$SNP1 %in% map$SNP[win]) & (final$SNP2 %in% map$SNP[win]), ] target$SNP1 <- match(target$SNP1, map$SNP) target$SNP2 <- match(target$SNP2, map$SNP) ggplot(data = target, aes(SNP2, SNP1, fill = theta)) + geom_tile() + xlab("Locus 2") + ylab("Locus 1") + coord_equal() + scale_y_continuous(trans = "reverse") + theme(panel.background = element_blank(), panel.grid.major = element_line(colour = "grey", linewidth = 0.1), panel.grid.minor = element_line(colour = "grey")) + theme(text = element_text(size = 18)) + scale_fill_gradientn(colours = c('yellow', 'red'), limits = c(0, 1+1e-10), na.value = 'white') # -> nothing conspicious excl[[1]] <- excl[[2]] <- NA ``` Orange colour in the SNP neighbourhood implies an increased recombination rate. In case of a misplaced marker or a cluster of misplaced markers, the heatmap is characterised by an orange band. ### 7: Approximation of genetic positions ```{r genetic-position} pos <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% { load(file.path(path.res, paste0("Results_chr", chr, ".Rdata"))) geneticPosition(final, map, exclude = excl[[chr]]) } ``` Genetic position is reported as NA if the corresponding SNP has not been considered in the analysis (e.g., when no sire was heterozygous at the corresponding SNP). ```{r stop-parallel} stopCluster(cl) ``` ### 8: Plot of physical versus genetic-map positions ```{r plot-1} data <- data.frame(Chr = rep(1:length(pos), times = unlist(list.select(pos, length(pos.Mb)))), Mbp = unlist(list.select(pos, pos.Mb)), cM = unlist(list.select(pos, pos.cM))) ggplot(data, aes(x = Mbp, y = cM)) + geom_point(na.rm = T) + facet_grid(Chr ~ .) ``` ### 9: Plot of genetic mapping function The genetic mapping function that fits best to the estimated data can be retrieved from a system of mapping functions [(Rao et al. 1977, Human Hered)](https://doi.org/10.1159/000152856); putatively misplaced markers need to be excluded. Given a parameter $p$, recombination rate $\theta \in (0,0.5)$ is mapped to Morgan units by $$ f(\theta; p) = \frac{1}{6}(p(2p - 1)(1 - 4p) log(1 - 2\theta) + 16p(p - 1)(2p - 1) \text{atan}(2\theta) + 2p(1 - p)(8p + 2)\text{atanh}(2\theta) + 6(1 - p)(1 - 2p)(1 - 4p)\theta) $$ ```{r plot-2} for (chr in 1:nchr) { load(file.path(path.res, paste0("Results_chr", chr, ".Rdata"))) final$theta[(final$SNP1 %in% excl[[chr]]) | (final$SNP2 %in% excl[[chr]])] <- NA final$dist_M <- (pos[[chr]]$pos.cM[final$SNP2] - pos[[chr]]$pos.cM[final$SNP1]) / 100 out <- bestmapfun(theta = final$theta, dist_M = final$dist_M) plot(final$dist_M, final$theta, xlab = 'genetic distance (Morgan)', ylab = 'recombination rate', col = 8) x <- seq(0, 0.5, 0.01); y <- rao(out$mixing, x); points(y, x, type = 'l', lwd = 2) legend('bottomright', legend = paste0('map function (p=', round(out$mixing, 3), ')'), lwd = 2, lty = 1, bty = 'n') } ``` The mixing parameter $p$ is estimated via quadratic optimisation in `bestmapfun`. A value of $p=0$ would match to Morgan's mapping function, $p=0.25$ to Carter, $p=0.5$ to Kosambi and $p=1$ resembles Haldane's mapping function. ## Part III: Visual comparison with simulated data and deterministic approach ```{r selection} chr <- 2 ``` ### 1: Check with simulated data ```{r check-simulated} load(file.path(path.dat, 'pop.Rdata')) sim.cM <- cumsum(probRec[[chr]]) * 100 ``` The position of a recombination event is traced in `SP$recHist` which has a nested list structure (individual, chromosome, female/male haplotype). The recombination rate between adjacent SNPs is determined as the proportion of recombinant progeny and the positions are derived as cumulative sum over recombination rates. ### 2: Check with deterministic approach ```{r check-deterministic} load(file.path(path.res, paste0('hsphase_output_chr', chr, '.Rdata'))) hsphase.cM <- hap$gen ``` The results are compared to the deterministic approach of [Ferdosi et al. (2014)](https://doi.org/10.1186/1297-9686-46-11) provided in the R package [*hsphase*]( https://CRAN.R-project.org/package=hsphase). Functions of that package are called in `makehap` and `makehappm`. ### 3: Comparative plot of physical versus genetic-map positions ```{r plot-3} par(mar=c(5.1, 4.1, 4.1, 9.1), xpd = TRUE) plot(pos[[chr]]$pos.Mb, pos[[chr]]$pos.cM, xlab = 'physical position (Mbp)', ylab = 'genetic position (cM)', ylim = range(c(sim.cM, hsphase.cM, pos[[chr]]$pos.cM), na.rm = T), pch = 20) points(pos[[chr]]$pos.Mb, sim.cM, pch = 20, col = 4) points(pos[[chr]]$pos.Mb, hsphase.cM, pch = 20, col = 8) legend('topleft', inset=c(1.01,0), legend = c('simulated', 'likelihood-based', 'deterministic'), pch = 20, col = c(4, 1, 8), bty = 'n') ``` The likelihood-based approach requires a sufficiently large sample size if SNP density is high. A verification of the bias of genetic-map positions can be found in the Supplemental Material of [Qanbari & Wittenburg (2020, Genet Sel Evol)](https://doi.org/10.1186/s12711-020-00593-z). ```{r cleanup} unlink(path.dat, recursive = TRUE) unlink(path.res, recursive = TRUE) Sys.time() ```