Programming R for Human Genome Variation Data

Human Genome variation analysis is a popular biomedical and biological typical used for finding disease, developing treatments and discovering a wide array of human genetic variation that can study the impact of disease and medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files. The vcfR package was designed to work on an individual chromosome. A VCF file structure is a standard file format for storing variations for genomic data and is used by organizations to map human genome variations. It used used for large scale variant mapping. One example is the International Genome Sample Resource (IGSR).

It contains headers

  1. CHROM
  2. POS
  3. ID
  4. REF
  5. ALT
  6. QUAL
  7. FILTER
  8. INFO
  9. FORMAT
  1. The name of the chromosome.
  2. The starting position of the variant indicated.
  3. Identifier
  4. Reference allele. An allele is one of two or more alternative forms of a gene that occur by mutation and found in the same area of a chromosome.
  5. Alternate allele
  6. Quality score out of 100.
  7. Pass/Fail. Did it pace quality filters.
  8. Information about the following columns.
  9. Format of the columns.

The following libraries needed to load and process VCF files.

install.packages("vcfR")
library(vcfR)
library("ShortRead")
install.packages("microseq")
library(microseq)
library(vcfR)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)

To load VCF files. Because of their size, VCF file are typically zipped

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)

Techniques for processing VCF files include:

chrom <- create.chromR(name='Supercontig',vcf=vcf_file,seq=dna,ann=gff)

#Extract GenoTypes

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
 

system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE,col.names=FALSE))

extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

 
#Extract the VCF Header Information

extract_info_tidy(vcf_snp, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes load genome data is to use system.time. For example:

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

The FASTQ files contain entire genome sequencing and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

These are the files that align sequencing data with referencing genome data and be converted to CSV files.

fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)
head(fdta1,100)
summary(fdta1)
str(fdta1)
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a,'out_2.csv',row.names=FALSE))

BAM Files

BAM files contain the RAW genomic data an are typically very large. Along with a wide array of tools that can read BAM files, R has many functions that can process BAM data. BAM files also come with an index file that makes it easier to find information with the larger BAM files.

To load the BAM file libraries, you can install them directory into R or download the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  -- Realloc cound not re-allocate memory problem
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))

(ga <- readGAlignments(bf)) # allocation of memory issue.  If the BAM file is too large.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
# 0 4 16
#1472261 775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function (x){
## do.call(c, …) coerces factor to integer, which is undesired
x1 <- x[[1L]]
if (is.factor(x1)){
structure(unlist(x), class = "factor", levels = levels(x1))
} else {
do.call(c, x)
}
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist
list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as data frame
bam_df <- do.call("DataFrame", list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410 13
view raw read_bam.R hosted with ❤ by GitHub

Leave a Reply