library(mpbr)
#>
#> Attaching package: 'mpbr'
#> The following object is masked from 'package:stats':
#>
#> filterSNPdata object
The functions in this package require a SNPdata object.
Objects of this class are generated using the get_snpdata()
function with the following arguments:
Type ?get_snpdata for more details.
snpdata <- get_snpdata(
vcf_file = vcf,
meta_file = meta_file,
output_dir = output_dir,
gof = gof,
gff = gff
)SNPdata object
Use the print() function to visualize the structure of
the SNPdata object created from the
get_snpdata().
print(snpdata)the compute_maf() requires a SNPdata
object, the name of the genotype table from which the MAF should be
calculated (either “GT” for the raw genotypes,
“Phased” for phased genotypes,
“Phased_Imputed” for phased and imputed
genotypes).include_het argument determines whether to account for
heterozygous sites during MAF calculation. The function returns a
SNPdata object with the following 2 extra columns in the
details table:
type ?compute_MAF for more details.
snpdata <- compute_maf(
snpdata,
include_het = FALSE,
mat_name = "GT"
)
head(snpdata$details)Use the filter_snps_samples() function to remove the
SNPs and samples that do not satisfy the user criteria. In the example
below, we intend to discard SNPs and samples where the number of missing
genotypes represent 20% or more the total number of observations. SNPs
with MAF < 1% or a quality score < 10 will also be removed.
snpdata <- filter(snpdata,
min_qual = 10,
max_missing_sites = 0.2,
max_missing_samples = 0.2,
maf_cutoff = 0.01)All loci and samples that do not satisfy the defined conditions will
be dropped from the input SNPdata object. Details about the
filtered object can be viewed using:
print(snpdata)The distribution of the MAF in the filtered SNPdata
object looks like below:
Mixed genotypes are not accounted for in many population genetics
methods as they introduce a extra layer of complexity. Genotype phasing
is the most common way of dealing with mixed genotype. In this package,
we use the phase_mixed_genotypes() to phase the mixed
genotypes based on their allelic depth (AD) and the total allele count
on the specific locus through the following procedure:
The process above is repeated nsim times and a new
genotype file is generated and stored in the output directory. After all
the iterations, the correlation between the MAF before phasing and the
MAF from each phased genotype data is calculated. The phased genotype
with the highest correlation will be retained and stored in the output
SNPdata object. We recommend nsim = 50 at
least to maximize the change of getting the highest MAF correlation.
snpdata <- phase_mixed_genotypes(snpdata, nsim = 100)
head(snpdata$Phased[, 1:5])Some population genetics methods require genotype data with no
missing allele. In {mpbr}, missing genotypes can be imputed using the
impute_missing_genotypes() use the
impute_missing_genotypes function to impute the missing
genotypes. The genotype argument determines which genotype
table should be used when imputing the missing data
(“GT” to impute from raw data,
“Phased” to impute from the phased data).
For every missing allele, we compute the reference and alternate allele frequencies at the given locus and identify the minor allele frequency (MAF). The missing allele will be replaced based on a Bernoulli simulation with a probability that is set at the resulting MAF value.
This process is repeated nsim times and a new genotype
file is generated and stored in the output directory. After all the
iterations, the correlation between the MAF before imputation and the
MAF from each imputed genotype data is calculated. The imputed genotype
with the highest correlation will be retained and stored in the output
SNPdata object. We recommend nsim = 50 at
least to maximize the change of getting the highest MAF correlation.
snpdata <- impute_missing_genotypes(snpdata, genotype = "GT", nsim = 100)
head(snpdata$Imputed[, 1:5])The package also contains functions that can be used to subset the
SNPdata object. To generate a SNPdata object
with data from chromosome 7 only (“Pf3D7_07_v3”) from an existing object
of the same class, use the select_chrom() function. Note
that the meta table in the output object will not have
the percentage_missing_sites, Fws, COI
columns. This is due to the fact that the existing values in these
columns are calculated from the entire dataset and are not specific the
chosen chromosome.
## subset only chromosome 7 data
chrom7_snpdata <- select_chrom(snpdata, chrom = "Pf3D7_07_v3")
print(chrom7_snpdata)If you want to remove some loci from the current SNPdata
object, use the drop_snps() function and provide it with a
data frame with 2 columns named as: Chrom and Pos”,
representing the genomic coordinates of the loci to be dropped. Another
alternative will be to provide the chrom (chromosome name),
start (the start position of the region to be discarded)
and end (the ens position of the region to be
discarded).
## drop a set of SNPs based on their genomic coordinates
snp_to_be_dropped <- chrom7_snpdata$details %>% dplyr::select(Chrom, Pos)
idx <- sample(seq_len(nrow(snp_to_be_dropped)), 100,
replace = FALSE)
snp_to_be_dropped <- snp_to_be_dropped[idx, ]
# there is a problem in the way the index of the vcf file is set when
# dropping snps.
# solution: consider replacing index with an appropriated value
reduced_snpdata <- drop_snps(chrom7_snpdata, snp_to_be_dropped)
print(reduced_snpdata)
# drop a region from a chromosome
reduced_snpdata <- drop_snps(chrom7_snpdata,
chrom = "Pf3D7_07_v3",
start = 110031, end = 154906)
print(reduced_snpdata)When you want to remove some samples from the SNPdata
object, use the drop_samples() function and provide it with
a vector of sample IDs to drop.
idx <- sample(seq_len(nrow(reduced_snpdata$meta)), 10,
replace = FALSE)
samples_to_be_dropped <- reduced_snpdata$meta$sample[idx]
reduced_snpdata <- drop_samples(reduced_snpdata, samples_to_be_dropped)
print(reduced_snpdata)