Building Visualizations with Human Genome Data

Introduction

The human genome is a treasure trove of data that can be used to study diseases, develop medicines and more. However, analyzing the data can be difficult without the right tools and knowledge. In this tutorial we’ll explore how to build visualizations using Python and Jupyter Notebook in order to make sense of genetic information.

Background

The word “gene” was first used in 1909 by Danish botanist Wilhelm Johannsen to describe the fundamental units of heredity. DNA, or deoxyribonucleic acid, is the molecule that carries genetic information and is made up of nucleotides. These nucleotides contain four different bases: adenine (A), cytosine (C), guanine (G) and thymine (T). The sequence of these bases makes up the human genome.

In 1953, James Watson and Francis Crick worked out the double helix structure of DNA, drawing on X-ray diffraction images of DNA fibers produced by Rosalind Franklin and Maurice Wilkins.

The Human Genome

The human genome is the complete sequence of DNA that contains all the information for building and maintaining an organism. It is roughly 3 billion units, or base pairs, of DNA long.

A genome can be sequenced by determining the exact order of nucleotides (A, C, T, G) on a strand of mitochondrial DNA (mtDNA) or nuclear DNA (nDNA). The Human Genome Project was an international effort to map out this information, our unique genetic makeup, in full detail.

Analysis

Once you’ve imported your data, you can start filtering and visualizing it. Filtering means that you pick out the parts of your dataset that are most interesting to look at. For example, if we were analyzing people’s genes to see what makes a person tall or short, we could filter out all the people who aren’t tall (or aren’t short).

Visualizing is how we convert our filtered data into something we can see on screen. This will usually involve making some kind of graph from your dataset—a bar chart showing height measurements for example.
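As a minimal sketch of these two steps (in R, the language used for the code later in this document, with made-up height data), filtering and charting might look like this:

heights <- data.frame(
  person = paste0("P", 1:8),
  height = c(158, 192, 171, 165, 188, 201, 162, 179)   # hypothetical heights in cm
)

# filter: keep only the "tall" people (here, taller than 185 cm)
tall <- subset(heights, height > 185)

# visualize: a bar chart of the filtered height measurements
barplot(tall$height, names.arg = tall$person,
        ylab = "Height (cm)", main = "Heights of tall individuals")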

Interpreting is where things get interesting! When interpreting your visualization, think about what it means in terms of real life situations (e.g., why do taller people have shorter lifespans?). If there are any outliers in the data set—people who don’t fit with what’s expected—why might this be?

Discussion

The end result is a data visualization that can help you understand the human genome. The visualization shows a clear relationship between the number of genes in a particular region and the number of people with a certain genotype in that region.

If you’re looking for some motivation, just know that the hard work will be well worth it in the end. Not only will you gain valuable skills in data visualization, but you will also be able to:

  • Learn new tools and technologies – Whether it’s mapping software or a programming language like Python, there are many different methods of analyzing genomic data. By gaining experience with these tools and techniques, you’ll be able to explore diverse ways of visualizing your data.
  • Work with a team – The genome is one of humanity’s most valuable resources; it contains so much information about our pasts, presents and futures (and those of our children). Working on genome projects gives us the opportunity to share our findings with others through publications or presentations at conferences such as BioVis 2018!
  • Explore your own dataset – Because every individual carries their own set of variants on each chromosome, there are many ways to represent this kind of variation using colors rather than numbers alone. For example, someone who inherited a red-hair variant from one parent and a blond-hair variant from the other could be represented with two colors where one would usually suffice.

Conclusion

The human genome is a fascinating topic. In this article, we’ve explored some of the most common ways that humans use DNA data and how it impacts our lives. But there are plenty more applications for genomics research than we’ve discussed here! We hope you have enjoyed learning about these topics as much as we have enjoyed researching them for this post.

Programming R for Human Genome Variation Data

Human genome variation analysis is a widely used biomedical and biological technique for finding disease, developing treatments and discovering the wide array of human genetic variation, which can be used to study the impact of disease and medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files, operating on one chromosome at a time. VCF is a standard file format for storing genomic variation data and is used by organizations that map human genome variation for large-scale variant mapping. One example is the International Genome Sample Resource (IGSR).

A VCF file contains the following header columns:

  1. CHROM – The name of the chromosome.
  2. POS – The starting position of the variant.
  3. ID – An identifier for the variant.
  4. REF – The reference allele. An allele is one of two or more alternative forms of a gene that occur by mutation and are found at the same position on a chromosome.
  5. ALT – The alternate allele.
  6. QUAL – A Phred-scaled quality score for the variant call.
  7. FILTER – Pass/fail: whether the variant passed the quality filters.
  8. INFO – Additional information about the variant.
  9. FORMAT – The format of the per-sample genotype columns that follow.

The following libraries are needed to load and process VCF files.

# CRAN packages
install.packages("vcfR")
install.packages("microseq")
library(vcfR)
library(microseq)

# Bioconductor packages (installed with BiocManager, as shown in the BAM section below)
library(ShortRead)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)

To load a VCF file, use read.vcfR. Because of their size, VCF files are typically compressed (gzipped):

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)
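Once a VCF file is loaded, the columns described above can be inspected directly. A minimal sketch using standard vcfR functions (assuming vcf_file was loaded as shown):

vcf_file                  # print a summary of the vcfR object
queryMETA(vcf_file)       # the meta (header) section of the file
head(getFIX(vcf_file))    # the fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER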

Techniques for processing VCF files include combining the variants with a reference sequence and an annotation file into a chromR object:

# dna and gff are assumed to have been loaded beforehand, e.g. a reference
# sequence read with ape::read.dna() and a GFF annotation read with read.table()
chrom <- create.chromR(name='Supercontig', vcf=vcf_file, seq=dna, ann=gff)
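A common follow-up, sketched here from the vcfR documentation workflow, is to process the chromR object and draw its quality-control plots:

chrom <- proc.chromR(chrom, verbose=TRUE)   # compute windowed summaries
plot(chrom)                                 # basic summary plots
chromoqc(chrom, dp.alpha=20)                # composite quality-control plot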

# Extract genotypes into a tidy data frame (one row per variant per sample)

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
 

system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE))
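Once the genotypes are in tidy form they can be summarized directly. A minimal sketch, assuming the dplyr package is installed and that the VCF has a GT field (so the table above contains a gt_GT column):

library(dplyr)
vcf_file_1 %>%
  count(gt_GT) %>%     # tally each genotype call (e.g. 0/0, 0/1, 1/1)
  arrange(desc(n))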

# Extract the INFO column fields into a tidy data frame
extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

# The same extraction for another vcfR object (vcf_snp is assumed to be a SNP-only VCF loaded separately)
extract_info_tidy(vcf_snp, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes to load or write genome data is to use system.time. For example:

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

FASTQ files contain the raw reads from whole-genome sequencing and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

FASTQ files hold the raw, unaligned reads (alignment against a reference genome comes later, in BAM files); small subsets can be read into R and converted to CSV files:

fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)   # microseq::readFastq returns a tibble (Header, Sequence, Quality)
head(fdta1,100)
summary(fdta1)
str(fdta1)
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a,'out_2.csv',row.names=FALSE))

BAM Files

BAM files contain aligned sequencing reads and are typically very large. Along with a wide array of standalone tools that can read BAM files, R has many functions that can process BAM data. BAM files usually come with an index file (.bai) that makes it easier to find information within large BAM files.

To load the BAM file libraries, you can install them directly from R or download the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))                  # open a BAM file
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))   # yieldSize limits the records read per chunk

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  # commented out: can fail with a "Realloc could not re-allocate memory" error on large files
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))

(ga <- readGAlignments(bf)) # may run out of memory if the BAM file is too large

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

# Install if necessary (biocLite() is deprecated; use BiocManager instead)
if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install("Rsamtools")

# Load library
library(Rsamtools)

# Read in an entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")

# Names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"

# Distribution of BAM flags
table(bam[[1]]$flag)
#       0       4      16
# 1472261  775200 1652949

# Function for collapsing the list of lists into a single list,
# as per the Rsamtools vignette
.unlist <- function(x) {
	## do.call(c, ...) coerces factor to integer, which is undesired
	x1 <- x[[1L]]
	if (is.factor(x1)) {
		structure(unlist(x), class = "factor", levels = levels(x1))
	} else {
		do.call(c, x)
	}
}

# Store names of BAM fields
bam_field <- names(bam[[1]])

# Go through each BAM field and unlist
bam_list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))

# Store as a data frame
bam_df <- do.call("DataFrame", bam_list)
names(bam_df) <- bam_field
dim(bam_df)
# [1] 3900410 13

How to Transition from a Database Administrator Job to a Data Science or Data Engineering Job

I’ve been a Database Administrator for over 20 years. Throughout the 1990s and 2000s, database administration was a fairly lucrative, in-demand job for many people working in Information Technology. Even today, the role of Database Administrators (DBAs) is critical for daily operational goals and maintaining customer applications. Recently, however, there has been a major shift in what employers are looking for in candidates for IT positions. Fewer companies are hosting their own databases, and the need for big data systems in the cloud has created more opportunities for people with skills in cloud architecture, data pipeline architecture and data science tools.

That being said, I feel this shift has put a lot of DBAs in a precarious position. Being a dedicated DBA is challenging, very time consuming and requires a broad set of skills. It is a full-time job in and of itself, and database administration does not easily translate to data science or data engineering, so if you want to move into a data engineer or data scientist role, you will probably have to take that initiative on your own and do off-hours work to acquire those skills. Data science is the ability to create meaningful business actions from sometimes messy, uncoordinated data. Data engineering is the ability to take very large volumes of data and make it readily available to business stakeholders, regardless of the type of data, where it is stored, or how it is stored. Most DBAs spend their time making sure that the bare-metal (local or NAS) storage or provisioned storage of data is consistent, available and secured with an “engine” that can easily query or perform transactions on this data. All the mechanisms needed to do this quickly, reliably and efficiently with no data loss are the challenge most DBAs face on a daily basis.

This is a very high-level comparison between the fields, but there are some important nuances to take into consideration if you want to change roles. For one, being a DBA doesn’t necessarily mean that you understand how to work with data. Data is messy, and one of the strengths of a data scientist is their ability to take data and clean it, transform it, remove duplicates, remove anomalies, and so on. You then need the ability to sample and partition data, create models and score those models. Many data scientists possess knowledge of mathematics and statistics that allows them to perform deep learning or complex machine learning and data analysis tasks.

One common bridge from database administration to data science and data engineering is SQL. SQL is a very powerful language for querying data in relational databases and is considered one of the most popular languages for data science. There are many functions available in SQL for performing data science work directly in the database, and SQL is by far the most popular way to extract data from a database and deliver it to the business.

Most DBAs have had some exposure to SQL, and many have also had training in procedural SQL dialects such as T-SQL, PL/SQL and PL/pgSQL, among others. Therefore, transitioning to the languages typically used in data science, such as Python and R, is less of a journey than starting with little programming experience at all. Both languages have libraries that work with SQL and database commands.
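For example, here is a minimal sketch of running SQL from R using the DBI and RSQLite packages (the in-memory database and the patients table are made up purely for illustration):

library(DBI)
# create a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "patients", data.frame(id = 1:3, age = c(34, 51, 42)))
# familiar SQL, executed from R
dbGetQuery(con, "SELECT COUNT(*) AS n, AVG(age) AS mean_age FROM patients")
dbDisconnect(con)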

Along with learning Python and R, it helps to learn the popular data science and mathematics libraries such as scikit-learn and NumPy. R is a great language for practicing data science techniques as well. Look for the many online resources for learning data science, and see my articles on data science conferences and data science resources. Take online classes on LinkedIn Learning, Udemy, DataCamp and Coursera, which all have starter tracks for data science. A lot of success in moving into a new role comes from self-learning, particularly if you are in a position that doesn’t offer data science work for building skills.

For data engineering, I strongly recommend starting a cloud account with Google Cloud, AWS or Azure. They offer “pay-as-you-go”, subscription-based options priced by the amount of compute time you use. With the many open datasets freely available to the public, you can easily build test data pipelines in the cloud in your free time. You can also build pipelines in the cloud to help with your current DBA role; most companies are transitioning to the cloud and offer their employees cloud access.

Post-graduate education is another path DBAs can take. There are many post-graduate and certificate programs in data science and big data engineering, with more coming online, and these programs are flexible enough that you can study outside your normal work hours.

Tools and Methods for Analyzing Variants and Genotypes in Human Genome Data (12/1/2021)

Recently I’ve been tasked with analyzing biological and genomic data, and I’ve learned a lot about tools and libraries for R and Python. As part of this work, I’m helping scientists analyze genome variations and perform genotype analysis. The human genome has 23 chromosomes, about 3 billion base pairs that contain around 30,000 genes. Every base pair can be encoded with 2 bits, which equates to around 750 megabytes of data for a whole genome. The data I have been analyzing amounts to several terabytes of genome sequences for around eight humans. Because the data is so massive, there are several high-throughput tools available for genotyping and variation discovery, which I will cover in a series of articles over the next few months.
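The arithmetic behind that size estimate, as a quick sketch in R:

bases <- 3e9          # approximate number of base pairs in a haploid genome
bits  <- bases * 2    # 2 bits are enough to encode A, C, G or T
bits / 8 / 1e6        # about 750 megabytes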

One notable characteristic is that repetitive DNA sequences make up approximately 50% of the human genome. The genome is about 3,100 Mbp (mega-base pairs) per haploid genome. A base pair is two chemical bases bonded to one another, forming a “rung” of the DNA ladder; a DNA strand looks somewhat like a ladder twisted around itself.

Variations

Variations include differences in the number of copies of a particular gene that individuals carry, as well as deletions, translocations and inversions. One common type of variation is the single-nucleotide polymorphism (SNP), a substitution of a single base at a given position. Variations in DNA are a normal part of human genetics and can sometimes reflect the body adapting to changes in its environment.

A SNP can be either type of nucleotide substitution (a small classification helper is sketched after the list below):

  1. Transition
    1. Interchange of purines (adenine/guanine)
    2. Interchange of pyrimidines (cytosine/thymine)
  2. Transversion
    1. Interchange of a purine and a pyrimidine
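As an illustration, here is a tiny helper (a hypothetical function, not part of any package) that classifies a single substitution:

# classify a single-nucleotide substitution as a transition or a transversion
classify_snp <- function(ref, alt) {
  purines <- c("A", "G")
  pyrimidines <- c("C", "T")
  if ((ref %in% purines && alt %in% purines) ||
      (ref %in% pyrimidines && alt %in% pyrimidines)) {
    "transition"
  } else {
    "transversion"
  }
}
classify_snp("A", "G")   # "transition"
classify_snp("A", "C")   # "transversion"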

Since variation discovery is very important in the biological sciences, many tools have been developed to assist in creating medicines and treatments for all types of mutations within cells.

The International HapMap Project was developed to describe the patterns of variation in the human genome and to find variations that affect health, responses to drugs and an individual’s environment. Variations include small-scale and large-scale changes.

Copy number variation (CNV) occurs when the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known, but it has long been recognized that some cancers are associated with elevated copy numbers of particular genes. CNVs are categorized as long repeats or short repeats.

Insertions and deletions (InDels) are another class of variation: insertion-deletion mutations refer to the insertion and/or deletion of nucleotides in genomic DNA and include events less than 1 kb in length.

Other Definitions

Length of a base pair (bp): one bp corresponds to approximately 3.4 Å (340 pm) of length along the strand, and to roughly 618 or 643 daltons for DNA and RNA respectively.

A kilobase (kb) is a unit of measurement in molecular biology equal to 1,000 base pairs of DNA or RNA.
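Put together, these figures give a sense of physical scale; a quick sketch in R:

bp_length_m <- 0.34e-9    # one base pair spans roughly 0.34 nm (3.4 Å) along the strand
genome_bp   <- 3.1e9      # about 3,100 Mbp per haploid genome
genome_bp * bp_length_m   # roughly 1 metre of DNA per haploid copy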

Data Analysis

Most of the analysis is performed in R. Here are some of the libraries used for analyzing genomic data:

# CRAN packages
install.packages("vcfR")
install.packages("fastqcr")
install.packages("microseq")
install.packages("factoextra")
library(vcfR)
library(fastqcr)
library(microseq)
library(factoextra)
library(ggrepel)

# Bioconductor packages (installed with BiocManager, as shown below)
library(Rsamtools)
library(pasillaBamSubset)
library(ShortRead)
library(GenomicAlignments)

The above libraries are standard R libraries for analyzing Genomic data. Later in this document, I will discuss the multiple tools that produce the files necessary for these libraries.

Most of the genomics libraries are distributed through Bioconductor and can be installed with BiocManager (https://www.bioconductor.org/):

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")

Visualization

[Figure: visualization of genomic data for sample Patient1A]
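A plot like this can be reproduced with ggplot2’s geom_histogram (linked in the references at the end of this document). A minimal sketch, assuming a hypothetical data frame of variant positions for a sample called Patient1A:

library(ggplot2)
# hypothetical variant positions for a single sample
variants <- data.frame(sample = "Patient1A",
                       pos = sample(1:2e6, 500))
ggplot(variants, aes(x = pos)) +
  geom_histogram(bins = 50) +
  labs(title = "Variant positions for Patient1A",
       x = "Position (bp)", y = "Count")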

Using R with Human Genome Variation Data

Reading VCF files into R.

The vcfR package and the structure of VCF files (the CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO and FORMAT columns) are described in the earlier section on reading VCF files into R.

FASTQ file format

FASTQ files contain the raw sequencing reads for an entire genome and can be very large.

BAM or CRAM file formats

These files store sequencing reads aligned against a reference genome.

Genomics Tools

Genome data is very large: each chromosome contains millions of base pairs. Although this data can be loaded into R, the complexity of looking at individual genes and chromosomes can be daunting. One tool that makes reading genomic data more visual is the Integrative Genomics Viewer (IGV). IGV is a visualization tool that can zoom in to the gene and chromosome level, down to individual bases.

IGV efficiently uses BAM index files to locate genomic data within large BAM files.

Other tools for genomics, structural biology and molecular biology are the DNASTAR Lasergene Structural Biology Suite and Spartan.

Another tool is the Variant Effect Predictor (VEP), which determines the effect of variants on genes.

Other tools include SnpEff and SnpSift. SnpEff provides genetic variant annotation and functional effect prediction: it annotates genetic variants and predicts their effects on genes and proteins.

SnpSift annotates genomic variants using databases and filters annotated variants. Once you have annotated your files using SnpEff, you can use SnpSift to filter large genomic datasets and find the most significant variants for your experiment. All SnpEff and SnpSift genomic databases are kindly hosted by Microsoft Genomics and Azure.

The Microsoft Genomics service provides a cloud-hosted solution that makes it easy to run variant calling on your genomic samples. The service takes in genomic samples as two paired-end read FASTQ (.fq.gz) files and produces .bam, .bai and .vcf files, along with the associated log files.

The process uses a BWA/GATK data pipeline in which Microsoft has improved the efficiency of both BWA and GATK, producing results faster and with less overhead; this pipeline performs the secondary analysis (alignment and variant calling) of the samples.

GATK, the Genome Analysis Toolkit, is a framework for variant discovery with high-throughput sequencing data; its data pipelines can be scaled in the Azure or Google cloud.

Another tool, CNVnator, performs CNV discovery and genotyping from the depth of coverage of mapped reads.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

https://www.microsoft.com/en-us/genomics/