Human genome variation analysis is a common biomedical and biological technique used for identifying disease, developing treatments, and cataloguing the wide array of human genetic variation that shapes the impact of disease and medical treatments.
Reading VCF files into R
The vcfR package was designed to work with data from VCF files, one chromosome at a time. VCF (Variant Call Format) is a standard file format for storing genomic variation data and is used by organizations for large-scale variant mapping of the human genome. One example is the International Genome Sample Resource (IGSR).
Each VCF data record contains the following standard columns (an illustrative record follows this list):
- CHROM: The name of the chromosome.
- POS: The starting position of the variant.
- ID: Variant identifier.
- REF: Reference allele. An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.
- ALT: Alternate allele.
- QUAL: Phred-scaled quality score for the variant call.
- FILTER: PASS, or the name of the filter the variant failed, indicating whether it passed quality filters.
- INFO: Additional information about the variant.
- FORMAT: The format of the per-sample genotype columns that follow.
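As an illustration, a single hypothetical VCF data line lays these columns out as tab-separated fields, followed by one column per sample (the values below are made up for illustration only):
#CHROM  POS     ID           REF  ALT  QUAL  FILTER  INFO           FORMAT  SAMPLE1
chr4    123456  rs000000000  A    G    50    PASS    DP=100;AF=0.5  GT:DP   0/1:35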
The following libraries are needed to load and process the VCF, FASTQ, and BAM files used below. vcfR and microseq are available from CRAN; ShortRead, GenomicAlignments, Rsamtools, and pasillaBamSubset are Bioconductor packages (their installation with BiocManager is shown in the BAM section).
install.packages("vcfR")
install.packages("microseq")
library(vcfR)
library(microseq)
library(ShortRead)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)
To load a VCF file (because of their size, VCF files are typically gzip-compressed):
vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)
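Once loaded, the vcfR object can be inspected with the package's accessor functions. A minimal sketch, assuming vcf_file was read as above:
#Look at the fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER)
head(getFIX(vcf_file))
#Look at the genotype region (FORMAT plus the per-sample columns)
head(vcf_file@gt)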
Techniques for processing VCF files include:
#Combine the VCF with a reference sequence (dna, read from a FASTA file) and an annotation (gff) into a chromR object
chrom <- create.chromR(name='Supercontig',vcf=vcf_file,seq=dna,ann=gff)
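The chromR object can then be processed and visualized with vcfR's built-in functions. A minimal sketch, assuming chrom was created as above:
#Compute per-window summaries (read depth, quality, variant density)
chrom <- proc.chromR(chrom, verbose=TRUE)
#Plot summary statistics and a chromosome-wide quality overview
plot(chrom)
chromoqc(chrom)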
#Extract the genotypes into a tidy data frame
vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE))
#Extract the INFO column into a tidy data frame
extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")
#Extract the VCF header (meta) information
queryMETA(vcf_file)
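For downstream analysis it can be convenient to convert the whole object at once. A minimal sketch using vcfR's vcfR2tidy(), which returns the meta, fixed, and genotype regions as tidy data frames (assuming vcf_file from above):
#Convert the whole VCF into a list of tidy data frames
tidy_vcf <- vcfR2tidy(vcf_file)
names(tidy_vcf)     # "fix" "gt" "meta"
head(tidy_vcf$fix)  # fixed columns: CHROM, POS, REF, ALT, ...
head(tidy_vcf$gt)   # per-sample genotype data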
A convenient way to time how long it takes to load or write genome data is system.time. For example:
system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))
FASTQ file format
FASTQ files contain the raw sequencing reads, together with per-base quality scores, for an entire genome and can be very large.
install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)
These files hold the reads that are later aligned against reference genome data, and in R they can be read in and converted to CSV files.
fq.file <- file.path("D:/temp","fastq_file.fq.gz")
#readFastq from microseq returns the reads as a data frame with Header, Sequence and Quality columns
fdta1 <- microseq::readFastq(fq.file)
head(fdta1,100)
summary(fdta1)
str(fdta1)
#Take a small sample of reads and write it to CSV
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a,'out_2.csv',row.names=FALSE))
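As a quick check of the reads, the sketch below summarizes read lengths and GC content. It assumes readFastq has returned a data frame with a Sequence column, as microseq's version does:
#Read length distribution
read_len <- nchar(fdta1$Sequence)
summary(read_len)
hist(read_len, main="Read length distribution", xlab="Read length (bp)")
#Per-read GC content as a fraction of read length
gc_frac <- nchar(gsub("[^GCgc]", "", fdta1$Sequence)) / read_len
summary(gc_frac)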
BAM Files
BAM files contain the raw aligned sequencing reads and are typically very large. Along with a wide array of external tools that can read BAM files, R has many functions that can process BAM data. BAM files also come with an index file (.bai) that makes it easier to find information within the larger BAM files.
To load the BAM file libraries, you can install them directly in R or download the Bioconductor packages from https://www.bioconductor.org/.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))
seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf) -- Realloc cound not re-allocate memory problem
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))
reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read
gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))
hist(reads[[1]]$pos)
readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))
(ga <- readGAlignments(bf)) # may fail with a memory-allocation error if the BAM file is too large
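When the whole file does not fit in memory, one option is to stream it in chunks by setting yieldSize on the BamFile and calling readGAlignments repeatedly. A minimal sketch; the chunk size and the running read count are illustrative choices, not from the original code:
library(GenomicAlignments)
bf_chunked <- BamFile("D:/temp/raw1.bam", yieldSize=100000)
open(bf_chunked)
total_reads <- 0
repeat {
  chunk <- readGAlignments(bf_chunked)   # reads at most yieldSize records per call
  if (length(chunk) == 0) break
  total_reads <- total_reads + length(chunk)
}
close(bf_chunked)
total_reads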
The following example reads an entire BAM file into a single data frame:
#install Rsamtools if necessary
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname"  "flag"   "rname"  "strand" "pos"    "qwidth" "mapq"   "cigar"
# [9] "mrnm"   "mpos"   "isize"  "seq"    "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
#      0       4      16
#1472261  775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function (x){
  ## do.call(c, ...) coerces factor to integer, which is undesired
  x1 <- x[[1L]]
  if (is.factor(x1)){
    structure(unlist(x), class = "factor", levels = levels(x1))
  } else {
    do.call(c, x)
  }
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist
bam_list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as data frame
bam_df <- do.call("DataFrame", bam_list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410      13
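Once the reads are in a data frame, standard R subsetting applies. A minimal sketch; the chromosome name and output file name are illustrative, and it assumes bam_df from above:
#keep a few fields for reads mapped to chr4 and convert to a base data frame
chr4_reads <- as.data.frame(bam_df[!is.na(bam_df$rname) & bam_df$rname == "chr4",
                                   c("rname","pos","strand","qwidth","mapq")])
head(chr4_reads)
#write the subset to CSV, timing the operation as before
system.time(write.csv(chr4_reads, "chr4_reads.csv", row.names=FALSE))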