library(mpbr)
#>
#> Attaching package: 'mpbr'
#> The following object is masked from 'package:stats':
#>
#> filter
SNPdata
object
The functions in this package require a SNPdata
object.
Objects of this class are generated using the get_snpdata()
function with the following arguments:
Type ?get_snpdata
for more details.
snpdata <- get_snpdata(
vcf_file = vcf,
meta_file = meta_file,
output_dir = output_dir,
gof = gof,
gff = gff
)
SNPdata
object
Use the print()
function to visualize the structure of
the SNPdata
object created from the
get_snpdata()
.
print(snpdata)
the compute_maf()
requires a SNPdata
object, the name of the genotype table from which the MAF should be
calculated (either “GT” for the raw genotypes,
“Phased” for phased genotypes,
“Phased_Imputed” for phased and imputed
genotypes).include_het
argument determines whether to account for
heterozygous sites during MAF calculation. The function returns a
SNPdata
object with the following 2 extra columns in the
details table:
type ?compute_MAF
for more details.
snpdata <- compute_maf(
snpdata,
include_het = FALSE,
mat_name = "GT"
)
head(snpdata$details)
Use the filter_snps_samples()
function to remove the
SNPs and samples that do not satisfy the user criteria. In the example
below, we intend to discard SNPs and samples where the number of missing
genotypes represent 20% or more the total number of observations. SNPs
with MAF < 1% or a quality score < 10 will also be removed.
snpdata <- filter(snpdata,
min_qual = 10,
max_missing_sites = 0.2,
max_missing_samples = 0.2,
maf_cutoff = 0.01)
All loci and samples that do not satisfy the defined conditions will
be dropped from the input SNPdata
object. Details about the
filtered object can be viewed using:
print(snpdata)
The distribution of the MAF in the filtered SNPdata
object looks like below:
Mixed genotypes are not accounted for in many population genetics
methods as they introduce a extra layer of complexity. Genotype phasing
is the most common way of dealing with mixed genotype. In this package,
we use the phase_mixed_genotypes()
to phase the mixed
genotypes based on their allelic depth (AD) and the total allele count
on the specific locus through the following procedure:
The process above is repeated nsim
times and a new
genotype file is generated and stored in the output directory. After all
the iterations, the correlation between the MAF before phasing and the
MAF from each phased genotype data is calculated. The phased genotype
with the highest correlation will be retained and stored in the output
SNPdata
object. We recommend nsim = 50
at
least to maximize the change of getting the highest MAF correlation.
snpdata <- phase_mixed_genotypes(snpdata, nsim = 100)
head(snpdata$Phased[, 1:5])
Some population genetics methods require genotype data with no
missing allele. In {mpbr}, missing genotypes can be imputed using the
impute_missing_genotypes()
use the
impute_missing_genotypes
function to impute the missing
genotypes. The genotype
argument determines which genotype
table should be used when imputing the missing data
(“GT” to impute from raw data,
“Phased” to impute from the phased data).
For every missing allele, we compute the reference and alternate allele frequencies at the given locus and identify the minor allele frequency (MAF). The missing allele will be replaced based on a Bernoulli simulation with a probability that is set at the resulting MAF value.
This process is repeated nsim
times and a new genotype
file is generated and stored in the output directory. After all the
iterations, the correlation between the MAF before imputation and the
MAF from each imputed genotype data is calculated. The imputed genotype
with the highest correlation will be retained and stored in the output
SNPdata
object. We recommend nsim = 50
at
least to maximize the change of getting the highest MAF correlation.
snpdata <- impute_missing_genotypes(snpdata, genotype = "GT", nsim = 100)
head(snpdata$Imputed[, 1:5])
The package also contains functions that can be used to subset the
SNPdata
object. To generate a SNPdata
object
with data from chromosome 7 only (“Pf3D7_07_v3”) from an existing object
of the same class, use the select_chrom()
function. Note
that the meta table in the output object will not have
the percentage_missing_sites, Fws, COI
columns. This is due to the fact that the existing values in these
columns are calculated from the entire dataset and are not specific the
chosen chromosome.
## subset only chromosome 7 data
chrom7_snpdata <- select_chrom(snpdata, chrom = "Pf3D7_07_v3")
print(chrom7_snpdata)
If you want to remove some loci from the current SNPdata
object, use the drop_snps()
function and provide it with a
data frame with 2 columns named as: Chrom and Pos”,
representing the genomic coordinates of the loci to be dropped. Another
alternative will be to provide the chrom
(chromosome name),
start
(the start position of the region to be discarded)
and end
(the ens position of the region to be
discarded).
## drop a set of SNPs based on their genomic coordinates
snp_to_be_dropped <- chrom7_snpdata$details %>% dplyr::select(Chrom, Pos)
idx <- sample(seq_len(nrow(snp_to_be_dropped)), 100,
replace = FALSE)
snp_to_be_dropped <- snp_to_be_dropped[idx, ]
# there is a problem in the way the index of the vcf file is set when
# dropping snps.
# solution: consider replacing index with an appropriated value
reduced_snpdata <- drop_snps(chrom7_snpdata, snp_to_be_dropped)
print(reduced_snpdata)
# drop a region from a chromosome
reduced_snpdata <- drop_snps(chrom7_snpdata,
chrom = "Pf3D7_07_v3",
start = 110031, end = 154906)
print(reduced_snpdata)
When you want to remove some samples from the SNPdata
object, use the drop_samples()
function and provide it with
a vector of sample IDs to drop.
idx <- sample(seq_len(nrow(reduced_snpdata$meta)), 10,
replace = FALSE)
samples_to_be_dropped <- reduced_snpdata$meta$sample[idx]
reduced_snpdata <- drop_samples(reduced_snpdata, samples_to_be_dropped)
print(reduced_snpdata)