Building Visualizations with Human Genome Data

Introduction

The human genome is a treasure trove of data that can be used to study diseases, develop medicines and more. However, analyzing the data can be difficult without the right tools and knowledge. In this tutorial we’ll explore how to build visualizations using Python and Jupyter Notebook in order to make sense of genetic information.

Background

The word “gene” was first used in 1905 by Danish botanist Wilhelm Johannsen to describe heritable units of inheritance. DNA, or deoxyribonucleic acid, is the molecule that carries genetic information and is made up of nucleotides. These nucleotides are four different molecules: adenine (A), cytosine (C), guanine (G) and thymine (T). The sequence of these base pairs make up the human genome.

In 1953, James Watson and Francis Crick discovered the double helix structure of DNA when they studied X-ray diffraction images taken from crystalized samples of tobacco mosaic virus which had been wrapped around rods of nickel chromium steel (NiCr).

Introduction

The human genome is a sequence of chemicals called DNA that contains all the information for building and maintaining an organism. The human genome has 3 billion units, or base pairs, of DNA and weighs approximately 150 pounds in total!

A genome can be sequenced by determining the exact order of nucleotides (A, C, T, G) on a strand of mtDNA or nDNA. The Human Genome Project was an international effort to explore our unique genetic makeup used to map out this information in full detail (see below).

Analysis

Once you’ve imported your data, you can start filtering and visualizing it. Filtering means that you pick out the parts of your dataset that are most interesting to look at. For example, if we were analyzing people’s genes to see what makes a person tall or short, we could filter out all the people who aren’t tall (or aren’t short).

Visualizing is how we convert our filtered data into something we can see on screen. This will usually involve making some kind of graph from your dataset—a bar chart showing height measurements for example.

Interpreting is where things get interesting! When interpreting your visualization, think about what it means in terms of real life situations (e.g., why do taller people have shorter lifespans?). If there are any outliers in the data set—people who don’t fit with what’s expected—why might this be?

Discussion

The end result is a data visualization that can help you understand the human genome. The visualization shows a clear relationship between the number of genes in a particular region and the number of people with a certain genotype in that region.

If you need some motivation, just know that the hard work will be well worth it in the end.

If you’re looking for some motivation, just know that the hard work will be well worth it in the end. Not only will you gain valuable skills in data visualization, but you will also be able to:

  • Learn new tools and technologies – Whether it’s mapping software or a programming language like Python, there are many different methods of analyzing genomic data. By gaining experience with these tools and techniques, you’ll be able to explore diverse ways of visualizing your data.
  • Work with a team – The genome is one of humanity’s most valuable resources; it contains so much information about our pasts, presents and futures (and those of our children). Working on genome projects gives us the opportunity to share our findings with others through publications or presentations at conferences such as BioVis 2018!
  • Explore your own dataset – Because every individual has their own set of mutations on each chromosome (i.e., 10 percent heterozygous), there are many ways that we can represent this type of variation using colors rather than numbers alone! For example: If someone has red hair then they might have inherited both copies from Mommy Dearest while Dad was blond haired too; therefore we could represent them by having two colors where one would usually suffice!

Conclusion

The human genome is a fascinating topic. In this article, we’ve explored some of the most common ways that humans use DNA data and how it impacts our lives. But there are plenty more applications for genomics research than we’ve discussed here! We hope you have enjoyed learning about these topics as much as we have enjoyed researching them for this post.

Installing the U.S. Census API in R

The U.S. Census Bureau American Community Survey application program interface (API) is a valuable programming interface for quickly and efficiently adding census data to your R code for analysis. It’s an extremely efficient way of doing data analysis in R without having to pull, format and organize census tables into a CSV or spreadsheet.

In order to use the census API, you will need to apply for a census API key, which takes only a few minutes. You then will need to install the key using the function census_api_key.

census_api_key('<key>', install=TRUE)

Once the key is install, it doesn’t require to be installed again, unless it expires or you apply for new key. Going to the U.S. Census website and searching for the API is how to apply.

The get_acs R function gives you the ability to pull in multiple years and categories of survey information for demographic information including race, household income, marital status, employment, etc. It’s a power feature for analyzing survey and census data in tidy format.

ACS_2010 <- get_acs("state",  year=2010, variables="S1702_C02_001", output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2011 <- get_acs("state", variables="S1702_C02_001", year=2011, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2012 <- get_acs("state", variables="S1702_C02_001", year=2012, output="tidy", geometry=TRUE) %>%
  select(-moe)
  
ACS_2013 <- get_acs("state", variables="S1702_C02_001", year=2013, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2014 <- get_acs("state", variables="S1702_C02_001", year=2014, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2015 <- get_acs("state", variables="S1702_C02_001", year=2015, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2016 <- get_acs("state", variables="S1702_C02_001", year=2016, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2017 <- get_acs("state", variables="S1702_C02_001", year=2017, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_S1702_C01_001_Family <- get_acs("county", variables="S1702_C01_001", output="tidy", geometry=TRUE) %>%
  select(-moe)
  
ACS_S1702_C01_013_Household_Work <-  get_acs("county", variables="S1702_C01_013", output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_household_work <-  get_acs("county", variables="S1702_C01_013", output="tidy", geometry=TRUE) %>%
  select(-moe)

ACE variable data can be enumerated in a list with with descriptive names for easier classification. All variables are indicated by a document id (for example S1702_c01_018, S1702_C01_019, etc.).


 ACS_Education <- get_acs("county", variables= c(nohighschool = "S1702_C01_018", 
                                                  highschool = "S1702_C01_019",
                                                 somecollege = "S1702_C01_020", 
                                                 collegeplus = "S1702_C01_021"),
                         output="tidy", geometry=TRUE) %>%
  select(-moe)

You can also choose to include margin of error (moe) or not. Visualizations can be in typical R GGplot library mode or included into maps. All ACS information includes IDs that categorizes survey data.

ACS_geo_2010 <- ACS_2010 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))


ACS_geo_2011 <- ACS_2011 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2012 <- ACS_2012 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2013 <- ACS_2013 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2014 <- ACS_2014 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2015 <- ACS_2015 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2016 <- ACS_2016 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))

ACS_geo_2017 <- ACS_2017 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))

Geographic Mapping of ACS Data

png(file="images/ACS_geo_2010.png")
tm_shape(ACS_geo_2010) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2010 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2011.png")
tm_shape(ACS_geo_2011) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2011 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2012.png")
tm_shape(ACS_geo_2012) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2012 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2013.png")
tm_shape(ACS_geo_2013) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2013 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2014.png")
tm_shape(ACS_geo_2014) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2014 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2015.png")
tm_shape(ACS_geo_2015) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2015 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2016.png")
tm_shape(ACS_geo_2016) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2016 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2017.png")
tm_shape(ACS_geo_2017) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2017 Post-Recession", asp=1)
dev.off()

Programming R for Human Genome Variation Data

Human Genome variation analysis is a popular biomedical and biological typical used for finding disease, developing treatments and discovering a wide array of human genetic variation that can study the impact of disease and medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files. The vcfR package was designed to work on an individual chromosome. A VCF file structure is a standard file format for storing variations for genomic data and is used by organizations to map human genome variations. It used used for large scale variant mapping. One example is the International Genome Sample Resource (IGSR).

It contains headers

  1. CHROM
  2. POS
  3. ID
  4. REF
  5. ALT
  6. QUAL
  7. FILTER
  8. INFO
  9. FORMAT
  1. The name of the chromosome.
  2. The starting position of the variant indicated.
  3. Identifier
  4. Reference allele. An allele is one of two or more alternative forms of a gene that occur by mutation and found in the same area of a chromosome.
  5. Alternate allele
  6. Quality score out of 100.
  7. Pass/Fail. Did it pace quality filters.
  8. Information about the following columns.
  9. Format of the columns.

The following libraries needed to load and process VCF files.

install.packages("vcfR")
library(vcfR)
library("ShortRead")
install.packages("microseq")
library(microseq)
library(vcfR)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)

To load VCF files. Because of their size, VCF file are typically zipped

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)

Techniques for processing VCF files include:

chrom <- create.chromR(name='Supercontig',vcf=vcf_file,seq=dna,ann=gff)

#Extract GenoTypes

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
 

system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE,col.names=FALSE))

extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

 
#Extract the VCF Header Information

extract_info_tidy(vcf_snp, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes load genome data is to use system.time. For example:

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

The FASTQ files contain entire genome sequencing and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

These are the files that align sequencing data with referencing genome data and be converted to CSV files.

fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)
head(fdta1,100)
summary(fdta1)
str(fdta1)
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a,'out_2.csv',row.names=FALSE))

BAM Files

BAM files contain the RAW genomic data an are typically very large. Along with a wide array of tools that can read BAM files, R has many functions that can process BAM data. BAM files also come with an index file that makes it easier to find information with the larger BAM files.

To load the BAM file libraries, you can install them directory into R or download the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  -- Realloc cound not re-allocate memory problem
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))

(ga <- readGAlignments(bf)) # allocation of memory issue.  If the BAM file is too large.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

#install if necessary
source("http://bioconductor.org/biocLite.R&quot;)
biocLite("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
# 0 4 16
#1472261 775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function (x){
## do.call(c, …) coerces factor to integer, which is undesired
x1 <- x[[1L]]
if (is.factor(x1)){
structure(unlist(x), class = "factor", levels = levels(x1))
} else {
do.call(c, x)
}
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist
list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as data frame
bam_df <- do.call("DataFrame", list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410 13
view raw read_bam.R hosted with ❤ by GitHub

How to Transition from a Database Administrator Job to a Data Science or Data Engineering Job

I’ve been a Database Administrator for over 20 years. Throughout the 1990’s and 2000’s, database administration had become a somewhat lucrative, in-demand job for many people working in Information Technology. Even today, the role of Database Administrators (DBAs) is critical for daily operational goals and maintaining customer applications. Recently, there has been a major shift in what employers are looking for in job candidates for IT positions. Less companies are hosting their own databases; and the need for big data systems in the cloud have created more opportunities for people with skills in cloud architecture, data pipelines architecture and data science tools.

That being said, I feel like this shift has put a lot of DBAs in a precarious position. Being a dedicated DBA is challenging and very time consuming and requires a very broad set of skills. Being a DBA is a full time job in of itself, and database administration does not easily translate to data science or data engineering, so if you want to work towards a job role as a data engineer or data scientist, you probably have to take that initiative on your own and do off-hour work to acquire those skills. Data science is the ability to create meaningful business actions from sometimes messy, uncoordinated data. Data engineering is the ability to take very large volumes of data and make it readily available to business stakeholders regardless of the type of data, where it is stored, or how it is stored. Most DBAs spend their time making sure that the bare metal (local or NAS) storage or provisioned storage of data is consistent, available, and secured with an “engine” that can easily query or perform transactions on this data. All the mechanisms needed to do this quickly, reliably and efficiently with no data loss is the challenge most DBAs face on a daily basis.

This is the very high-level comparison between the fields. But there are some very powerful nuisances that need to be taken into consideration if you want to change roles. For one, being a DBA doesn’t necessarily mean that you understand how to work with data. Data is messy, and one of the strengths of a data scientist is his or her ability to take data and clean it, transform it, removing duplicates, removing anomalies, etc. You then need to have the ability to sample and partition data, create models and score your model. Many data scientists possess knowledge in mathematics and statistics that allow them to perform deep learning or complex machine learning and data analysis tasks.

One common bridge to go from database administration to data science and data engineering is SQL. SQL is a very powerful language for querying data in a relational databases. SQL is also considered one the most popular languages for data science. There are many functions available in SQL to perform data science functionality in databases. SQL is a powerful language this is by far the most popular way to extract data from a database and deliver it to the business.

Most DBAs have had some exposure to SQL, with another group who have had training in programming procedural structured languages like T-SQL, PL/SQL, PL/pgSQL amount others. Therefore, transitioning to languages such as Python and R typically used in data science is less of a journey than starting with little programming experience at all. Both languages have libraries that utilize SQL and database commands.

Along with learning Python and R, learning many of the popular data science and mathematics libraries such as SciKitLearn and NumPy is also helpful. R is a great language to practice data science techniques as well. Look for the many online resources for learning data science. Visit my articles on the data science conferences and data science resources. Take online classes on LinkedIn Learning, Udemy, Datacamp and Coursera which all have starting tracks for data science. A lot of success in moving into a new role involves self-learning. Particularly if you are in a job position that doesn’t have data science work to build skills.

For data engineering, it’s strongly recommended that you start a cloud account in Google Cloud, AWS and Azure. They offer “pay-as-you-go” options and are subscription based services based on the amount of compute time you accumulate. And with the many of the open data sets available free to the public, you can easily build test data pipelines in the cloud on your free time. You can also build pipelines in the cloud to help with you current DBA role. Most companies are transitioning to the cloud and offer their employees cloud access.

Post-graduate education is another path DBAs can take. There are many post-graduate and certificate programs in data science and big data engineering, with many more coming online. And these programs are flexible enough where you can learn outside your normal work hours.

Tools and Methods to Analyze Variant and Genotyping of Human Genome Data 12/1/2021

Recently I’ve been tasked with analyzing Biological and Genomic data. I’ve learned a lot about tools and libraries for R and Python. As part of this analysis, I’m helping scientists analyze genome variations and perform analysis of genotypes. The human genome has 23 chromosomes. That’s about 3 billion base pairs that contain around 30,000 genes. Every base has pair that can be coded with 2 bits. This equates to around 750 megabytes of data. The data that I have been analyzing is several terabytes of genome sequences for around eight humans. Because the data is so massive, there are several high-throughput tools available to perform genotyping and variation discovery that I will cover in a series of articles in the next few months.

One characteristic is that repetitive DNA sequences comprise approximately 50% of the human genome. Genome size 3,100 Mbps (mega-basepairs) per haploid genome. A base pair is two chemical bases bonded to one another forming the “rung” in the DNA. DNA strands look someone like ladder twisted around.

Variations

Variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions. One such variation is Single-nucleotide polymorphism (SNPs). It’s a type of copy number variation (CNV), Variations in DNA are actually a normal part of human genetics and can sometimes be a sign of the body adapting to various changes within the sequences or even adaptation for protecting and adapting.

SNP can be any nucleic acid substitution:

  1. Transition
    1. Interchange of the purine (Adenine/Guanine)
    2. Pyrimidine (Cytosine/Thymine) nucleic acids
  2. Transversion
    1. Interchange of purine and pyrimidine nucleic acid

Since variation discovery is very important in the biological sciences, many tools have been developed to assist in creating medicines and treatments for all type of mutations within cells.

The International HapMap Project was develop a describe of variation patterns in the human genome that finds variations that impact health, responses to drugs and an individual’s environment. Variations include small-scale and large-scale variations.

Copy number variation (CNV). With the number of copies of a particular gene varies from one individual to the next. Following the completion of he Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes. They are categorized as long repeats or short repeats.

Insertions and Deletions (InDel) are a type of CNV: Insertion-deletion mutations refer to insertion and/or deletion of nucleotides into genomic DNA and include events less that 1Kb in length.

Other Definitions

Length of the base pairs (bp). One bp corresponds to approximately 3.4 A (340 pm) of length along the strand, and to roughly 618 or 643 daltons for DNA and RNA respectively.

Kilobase (kb) is a unit of measurement in molecular biology equal to 1000 base pairs of DNA or RNA.

Data Analysis

Most of analysis is performed in R. Here are some of the analysis done using Genome libraries:

install.packages("vcfR")
library(vcfR)
library(Rsamtools)
library(pasillaBamSubset)
# prepare for transaction data
install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)
library(vcfR)
library(GenomicAlignments)

library(pasillaBamSubset)
library(Rsamtools)
library(ggrepel)
install.packages("factoextra")
library(factoextra)

The above libraries are standard R libraries for analyzing Genomic data. Later in this document, I will discuss the multiple tools that produce the files necessary for these libraries.

The most advanced libraries can be downloaded from the BiocManager website.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")

Visualization

Function

Patient1A

Using R with Human Genome Variation Data

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files. The vcfR package was designed to work on an individual chromosome. A VCF file structure is a standard file format for storing variations for genomic data and is used by organizations to map human genome variations. It used used for large scale variant mapping. One example is the International Genome Sample Resource (IGSR).

It contains headers

  1. CHROM
  2. POS
  3. ID
  4. REF
  5. ALT
  6. QUAL
  7. FILTER
  8. INFO
  9. FORMAT
  1. The name of the chromosome.
  2. The starting position of the variant indicated.
  3. Identifier
  4. Reference allele. An allele is one of two or more alternative forms of a gene that occur by mutation and found in the same area of a chromosome.
  5. Alternate allele
  6. Quality score out of 100.
  7. Pass/Fail. Did it pace quality filters.
  8. Information about the following columns.
  9. Format of the columns.

FASTQ file format

The FASTQ files contain entire genome sequencing and can be very large and represents the raw sequencing data.

BAM or CRAM file formats

These are the files that align sequencing data with referencing genome data.

Genomics Tools

Genome data is very large, and contains millions of base pairs for chromosomes. Although this data can be loaded into R, the complexity of looking at individual genes and chromosomes can be very daunting. One tool that makes reading genomic data more visual is Integrative Genomics Viewer or (IGV). IVG is a visualization tool that zooms in to the gene and chromosome level at the base length.

IGV efficiently pulls in BAM file indexes to locate genomic data.

Other tools for Genomics, structural biology and molecular biology is DNASTAR Lasergene Structural Biology Suite and Spartan.

Another tool for visualization is the Variant Effect Predictor (VEP), which determines the effect of variants on genes.

Other tools include SnpEff and SnpSuft. SnpEff provides genetic variant annotation and function effect prediction. It also annotates and predicts the effect of genetic variants on genetic variants on genes and protein.

SnpSift annotates genomic variants using database, filters, and annotated variants. Once you annotated your files using SnpEff, you can use SnpFift to help you filter large genomic datasets in order to find the most significant variants for your experiment. Microsoft Genomics: All SnpEff & SnpSift genomic database are kindly hosted by Microsoft Genomics and Azure

Microsoft Genomics service provides a cloud hosted solution that makes it easy to variant call your genomic samples. The service takes in genomic samples as two paired end read fastq (.fq.gz) files and produces .bam, .bai, or.vcf files, along with the associated log files.

The process uses a BWA / GATK data pipeline where Microsoft has improve the efficiency of both BWA and GATK producing results faster and with less overhead. There is also a secondary analysis .

GATK is the Genome Analysis Toolkit also used for variant discovery using a data pipeline which can be scaled in the Azure or Google cloud. GATK is a framework for Variant Discovery with high-throughput sequencing data.

Another tool is the CNVnator is a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

https://www.microsoft.com/en-us/genomics/

Creating Plan Stability in Microsoft SQL-Server

To ensure optimal performance of the database, an administrator must follow the most important database areas to maintenance including: 

  1. Rebuilding indexes 
  2. Minimizing procedure compilations and recompilations during heavy workloads. 
  3. Check page allocation, contention and aging. 

  Actions such as running the re-indexing script daily and checking for fragmentation are important for optimization health. 

Plan and SQL handles 

Optimization of code to reduce the amount of logical and physical lookups occurring in memory and disk should be the first method of troubleshooting performance issues.  When code has been released into a production environment, it is not always possible to change code to make it faster.  At times, what is needed is to monitor for plan changes and plan execution.  As part of this effort, it’s important to ensure that plans executed on the database maintain a consistent optimal execution plan or that no significant plan execution change occur.  In the Appendix, there are queries that will all an administrator observe plan changes and execution statistics.  Here is an example of how to investigate plan stability in a SQL-Server database.  The table below is a typical query used multiple times during the day with its respective plan handle which is a hash of the query plan.  You can map a particular SQL text to a plan handle using the dynamic management function (DMF) sys.dm_exec_sql_text(sql_handle | plan_handle). 

SQL Statement Plan Handle  
SELECT * FROM Sales.SalesOrderHeader AS h, Sales.Customer AS c, Sales.SalesTerritory AS t WHERE h.CustomerID = c.CustomerID AND c.TerritoryID = t.TerritoryID AND CountryRegionCode = @Country_region 0x0600060066E0F61940015D4D070000008800000000000000   

Table G5-1 

Having multiple plan handles is not necessarily an issue and doesn’t mean the plan is running errantly.  Investigation of the query plan can help determine if the plan has a high cost to the optimizer. 

Steps necessary to determine if query execution is optimized includes 

  1. Looking into the plan cache to determine cache plan reuse is occurring. 
  2. Monitor for multiple compilations and recompilation to make sure it is minimized and preplanned. 
  3. Report any issues with long running code to Itron Support. 

Checking for locking issues such has long PAGELATCHIO latch waits and deadlocks are critical.  Administrators must check for deadlock events either by running traces or dynamic management views. 

In the appendix below, there are examples of looking for specific plan handle performance and the use of plan guides.  Detail of this methods are beyond the scope of this report and should be researched and thoroughly tested before use. 

Plan Maintenance Analysis Queries









SET NOCOUNT ON 
GO 
 
/*  Query checks for memory usage in database o/s  */ 
 
SELECT 
(physical_memory_in_use_kb/1024.0/1024.0) AS Memory_usedby_Sqlserver_GB, 
(locked_page_allocations_kb/1024.0) AS Locked_pages_used_Sqlserver_MB, 
(total_virtual_address_space_kb/1024.0/1024.0) AS Total_VAS_in_GB, 
process_physical_memory_low, 
process_virtual_memory_low 
FROM sys.dm_os_process_memory; 


SELECT object_name, cntr_value 
  FROM sys.dm_os_performance_counters 
  WHERE counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)'); 
 
SELECT physical_memory_kb FROM sys.dm_os_sys_info; 
SELECT physical_memory_in_bytes FROM sys.dm_os_sys_info; 





SELECT top(5) sum(total_physical_reads) tot_prds, sum(max_physical_reads) tot_max_prds, sum(total_logical_reads) tot_lrds, sum(max_logical_reads) tot_max_lrds, 
 sum(total_logical_writes) tot_lwrts, sum(max_logical_writes) tot_max_lwrts 
 FROM   sys.dm_exec_query_stats stat 
GO  
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text] 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 


/* Can use query to check for specific plan handles based on performance */ 
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
select * from sys.dm_exec_cached_plan_dependent_objects(0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
go 
 
sp_create_plan_guide_from handle 
go 
 
 
SELECT TOP(10) total_physical_reads, last_physical_reads, min_physical_reads, max_physical_reads, 
min_physical_reads, max_physical_reads, total_logical_writes, last_logical_writes, 
min_logical_writes, max_logical_writes, total_logical_reads, last_logical_reads, 
min_logical_reads, max_logical_reads,  
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
ORDER BY total_physical_reads, Total_logical_reads DESC 
GO 
 
SELECT TOP(10) total_physical_reads, max_physical_reads, total_logical_writes, max_logical_writes, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE p.objtype <> 'Proc' 
ORDER BY total_physical_reads, Total_logical_reads DESC 
GO 
 
 
SELECT t1.session_id, t1.request_id, t1.task_alloc, 
t1.task_dealloc, t2.statement_start_offset,  
t2.statement_end_offset, REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM (Select session_id, request_id, 
SUM(internal_objects_alloc_page_count) AS task_alloc, 
SUM (internal_objects_dealloc_page_count) AS task_dealloc  
FROM sys.dm_db_task_space_usage  
GROUP BY session_id, request_id) AS t1,  
sys.dm_exec_requests AS t2 
INNER JOIN sys.dm_exec_query_stats stat ON t2.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE t1.session_id = t2.session_id 
AND (t1.request_id = t2.request_id) 
 
-- I/O query stats based on the sql text 
 
SELECT top (10) (stats.total_logical_reads/stats.execution_count), 
    (stats.total_logical_writes/stats.execution_count), 
    (stats.total_physical_reads/stats.execution_count), 
    stats.execution_count, stats.sql_handle, stats.plan_handle, 
    REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_query_stats stats 
CROSS APPLY sys.dm_exec_sql_text (stats.sql_handle) 
ORDER BY (total_logical_reads + total_logical_writes) DESC 
 
 
 
-- Missed Index information 
 
select TOP(50) * from sys.dm_db_missing_index_group_stats 
GO 
 
select object_name(object_id) object, equality_columns, included_columns, statement from sys.dm_db_missing_index_details 
go 


/* Queries below use an option technique known as plan guides */ 
  
sp_create_plan_guide_from_handle @name = N'plan_guide_name' 
 
-- Create a plan guide for the query by specifying the query plan in the plan cache. 
DECLARE @plan_handle varbinary(64); 
DECLARE @offset int; 
SELECT @plan_handle = plan_handle, @offset = qs.statement_start_offset 
FROM sys.dm_exec_query_stats AS qs 
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st 
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp 
WHERE text LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
 
EXECUTE sp_create_plan_guide_from_handle  
    @name =  N'Guide1', 
    @plan_handle = @plan_handle, 
    @statement_start_offset = @offset; 
GO 
-- Verify that the plan guide is created. 
SELECT * FROM sys.plan_guides 
WHERE scope_batch LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
GO 

Create Plan Guides in SQL-Server

sp_create_plan_guide  
@name = N'Guide1', 
@stmt = N'SELECT * FROM Sales.SalesOrderHeader AS h, 
        Sales.Customer AS c, 
        Sales.SalesTerritory AS t 
        WHERE h.CustomerID = c.CustomerID  
            AND c.TerritoryID = t.TerritoryID 
            AND CountryRegionCode = @Country_region', 
@type = N'OBJECT', 
@module_or_batch = N'Sales.GetSalesOrderByCountry', 
@params = NULL, 
@hints = N'OPTION (OPTIMIZE FOR (@Country_region = N''US''))' 
 

Creating Query Plan Stability in Oracle Relational Databases with Cost-based Optimizers

What Query Plan Stability?

Query Plan Stability ensures predictable and appreciable database performance no matter how much data is being stored in the buffer cache or on disk.   It ensures that all queries are performing according to business requirements and processes are running in a timely manner during the all business hours.

The information presented in this document applies to object relational databases with a cost-base optimizer. The principal audience for this documentation are database administrators, database programmers and customer support analyst.

Whenever confronted with a poorly performing application.  The following typically occurs during troubleshooting

  1. Ask stakeholder when and at what frequency statistical collection is performed.  If stakeholder has not computed statistics recently or has never computed statistics, they must compute statistics before proceeding.
  2. Ask customers if there has been a recent change in data. If this has occurred, get as much detail on the changes
    1. When and where the changes made
    2. The quantity of changes.

During execution of processes  plan instability can occur due to the following factors:

  1. Lack of adherence to cardinality and primary and foreign key relationships.
  2. Lack of proper statistics.
  3. Misuse of predicates in query and plan binding.
  4. Flush of “good” plans from the shared pool (plan cache).

One of the best tools for ensuring plan stability in Oracle is the cost-based optimizer explain plan, which helps determine how costly it is to execute a plan based on its cost in CPU.

Query identifies in Oracle (SQL_ID) are indexes with alphanumeric characters (e.g. 3mx8whn1c5jbb). 

When reviewing query execution performance, it’s always best to create a query plan:

SQL>EXPLAIN PLAN FOR 
select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

SQL>  select * from TABLE(DBMS_XPLAN.DISPLAY);

Explain plans show how the query is performing in the cost-based optimizer once parsed. It provides valuable information of where the highest cost are in your query. Query tuning is essential for maintaining performance in a cost-based optimizer.

If plan instability occurs in an Oracle database, there are several options in addition to the list in the preceding sections to correct the issue.   

  1.  Query performs very poorly or performs full table scans on very high cardinal and big table objects.
  2. Make sure that statistical collection script has been executed.  Make certain that statistics are up to date.   See appendix for script to check statistics.
  3. If Statistics are up to date.  Attempt to generate a plan analysis (SQL profile) of the current plan.  Note:  Oracle Diagnostic Pack and Automated Database Diagnostic Manager (ADDM) needs to be installed.  The profile can be generated via the OEM GUI or using scripts.

Command Line SQL Tuning and SQL Profiling.

The following steps allow an administrator to create SQL profile tuning sets as well as export out those profiles from a source database and import it into a new database.  This allows administrator to create and test better plans in test systems and then load them safely into production systems and enable/disable them when needed.

Run query to find currently in memory that has high execution time.

SET HEADING ON
SET PAGESIZE 20000
SET LINESIZE 2000
COL text FOR A50
COL ctext FOR A50


select vs.last_load_time, ao.OWNER, parsing_schema_name, first_load_time , ao.OBJECT_NAME, vs.program_line#, executions exe, vs.sqltype, vs.sql_id, vs.rows_processed rows_processed, concurrency_wait_time, elapsed_time/1000000 elapsed_secs, elapsed_time/1000000/(case when executions = 0 then 1 else executions end) elap_per_exec_secs, vs.sql_fulltext text, vs.sql_text ctext
from v$sql vs, all_objects ao
where vs.PROGRAM_ID = ao.OBJECT_ID and parsing_schema_name not in ('SYS','SYSTEM') and parsing_schema_name in ('<SCHEMA_OWNER>')
and owner not in ('SYS','SYSTEM','DBSNMP','SYSMAN','MDSYS')
order by vs.last_load_time desc , vs.parsing_schema_name, vs.first_load_time desc , program_id, vs.program_line#;

Execute v$sqlarea query to find plan and hash values that identify that query.


select address, hash_value, plan_hash_value from v$sqlarea where sql_id in ('2zd224kurqsr9');

ADDRESS          HASH_VALUE PLAN_HASH_VALUE
---------------- ---------- ---------------
00000003ACC32290 3419553494      1111780545

Prior to doing a purge of a PLAN_HASH from the shared_pool, it’s important to collect information about the bad plan.   You can do this by executing an DBMS_XPLAN outline on the Plans.  To preserve the plan prior to flushing if from the shared_pool, read section on Loading Plan from Cache to Baseline and return here to continue.

SQL> EXPLAIN PLAN FOR select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

Explained.

SQL> select * from TABLE(DBMS_XPLAN.DISPLAY);

If plan exists in memory, but has a bad hash plan number, create tuning task and SQL set for the query.

Execute tuning task command to see if ADDM will generate a better plan (Do not purge from memory if using this step).

var stmt_task VARCHAR2(64);

SET SERVEROUTPUT ON LINESIZE 200 PAGESIZE 20000 LONG 9999

EXEC :stmt_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => '2zd224kurqsr9', task_name => 'sga_tuning_task_1');
	
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;

 If no results are returned, then ADDM will not be able to tune.

SQL> SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;
ERROR:
ORA-13631: The most recent execution of task sga_tuning_task_1 contains no results.
ORA-06512: at "SYS.PRVT_ADVISOR", line 5739
ORA-06512: at "SYS.DBMS_SQLTUNE", line 1045
ORA-06512: at line 1

If no results are returned skip following query and go to step 6.  Always make sure not to purge SGA if you use these steps.

DECLARE
    
    sqlprofile_name VARCHAR2(30);

BEGIN
	 
    sqlprofile_name := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE ( 
    task_name    => 'sga_tuning_task_1'
,   name         => 'sql_profile_1'
,   force_match  => true 
);
END;
/

Create a SQL Tuning Set from AWR.

Find the begin and end span for the SQL_ID and plan hash that you need from AWR (This is if the plan has been purged from the pool and exists in AWR repository.

SET LINESIZE 2000 PAGESIZE 20000
SET LONG 99999
     select
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24') begin_interval_time,
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24') end_interval_time,
        SQL_ID,
        round((round((avg(ELAPSED_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as ELAPSED_TIME_SECS,
        round((round((avg(CPU_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as CPU_TIME_SECS,
        round((avg(BUFFER_GETS_DELTA)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as GETS_PER_EXEC
     from
        dba_hist_snapshot natural join dba_hist_sqlstat natural join dba_hist_sqltext
     where
      (elapsed_time_delta > 0 or elapsed_time_delta is not null)
      and SQL_ID = 'bbuc9hr5x4gqq'
     group by
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24'),
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24'),
        SQL_ID
     order by
        snap_id asc
/

Create a SQL set.

Begin 
  DBMS_SQLTUNE.CREATE_SQLSET (sqlset_name => '2zd224kurqsr9_TUNING_SET_1');
END;
/

 Load SQL Tuning Set.

DECLARE
   Cur sys_refcursor;
 BEGIN
    OPEN cur FOR
     SELECT VALUE(P) FROM TABLE(dbms_sqltune.select_workload_repository(begin_snap=>13645, end_snap=>13649, Basic_filter=>'sql_id=''2zd224kurqsr9''', attribute_list => 'ALL' )
) p;
DBMS_SQLTUNE.LOAD_SQLSET(sqlset_name=> '2zd224kurqsr9_TUNING_SET_1', populate_cursor=> cur);
CLOSE cur;
End;
/

 Verify tuning set is correct.

SELECT * FROM TABLE(DBMS_SQLTUNE.SELECT_SQLSET(sqlset_name => '2zd224kurqsr9_TUNING_SET_1'));

 create the sql set staging table (Cannot be created in the SYS or SYSTEM schema).

execute DBMS_SQLTUNE.CREATE_STGTAB_SQLSET(table_name =>'TUNING_SET_1_STGTAB');

Pack the tuning set into the staging table.

execute DBMS_SQLTUNE.PACK_STGTAB_SQLSET(sqlset_name => 'bbuc9hr5x4gqq_TUNING_SET_1',staging_table_name => 'TUNING_SET_1_STGTAB', staging_schema_owner=>'<SCHEMA_OWNER_SOURCE>');

Check the count of the table to make sure it’s been populated.

SQL> select count(*) from <SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB;

  COUNT(*)
----------
         3

Export out the staging table.

[oracle]$ expdp system/******* DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES='<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB'

Export: Release 11.2.0.4.0 - Production on Fri Mar 9 04:10:34 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Starting "SYSTEM"."SYS_EXPORT_TABLE_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES=<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 9.562 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
. . exported "<SCHEMA_OWNER_SOURCE>"."TUNING_SET_1_STGTAB"             54.01 KB       3 rows
Master table "SYSTEM"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_TABLE_01 is:
  /data/backups/TUNIING_SET_1.DMP
Job "SYSTEM"."SYS_EXPORT_TABLE_01" successfully completed at Fri Mar 9 04:11:10 2018 elapsed 0 00:00:32


******** impdp staging table into target database  ******

Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Master table "SYSTEM"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
Starting "SYSTEM"."SYS_IMPORT_FULL_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP REMAP_SCHEMA=<schema_owner_source>:<schema_owner> DIRECTORY=BACKUPS
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "<SCHEMA_OWNER>"."TUNING_SET_1_STGTAB"                54.01 KB       3 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
Job "SYSTEM"."SYS_IMPORT_FULL_01" successfully completed at Fri Mar 9 04:29:52 2018 elapsed 0 00:00:41

Purge the plan from the shared pool on source database.

exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');

*** OR ***

set pagesize 20000
select 'exec sys.dbms_shared_pool.purge(''' || address || ',' ||  hash_value   || ''',''C'');' from v$sqlarea where sql_text like '%TASKTEMPLATEPARAMETERVALUE X  where (1=1)%'
/

'EXECSYS.DBMS_SHARED_POOL.PURGE('''||ADDRESS||','||HASH_VALUE||''',''C'');'
--------------------------------------------------------------------------------
exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');
exec sys.dbms_shared_pool.purge('0000000127D41C68,3957004154','C');
exec sys.dbms_shared_pool.purge('00000001A851BB70,525044174','C');
exec sys.dbms_shared_pool.purge('000000016C757A98,4128744307','C');
exec sys.dbms_shared_pool.purge('000000025115E2A8,4132567071','C');

Installing new plan on target database.

Unpack the SQL set in the target database.


NAME
--------------------------------------------------------------------------------
CREATED             STATEMENT_COUNT
------------------- ---------------
bbuc9hr5x4gqq_TUNING_SET_1
2018-03-09 04:33:59               

Load plan into SQL Plan Management (SPM).

set serveroutput on
declare
my_integer pls_integer;
begin
my_integer := dbms_spm.load_plans_from_sqlset (
sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
sqlset_owner => 'SYS',
fixed => 'YES',
enabled => 'YES'
);
DBMS_OUTPUT.PUT_line(my_integer);
end;
/

Gather stats on objects used by tuning set.

begin
dbms_stats.gather_table_stats( 
ownname=> '<schema_owner>', 
tabname=> 'TASKTEMPLATEPARAMETERVALUE', 
estimate_percent=> null, 
cascade=> TRUE, 
degree=> null, 
no_invalidate=> DBMS_STATS.AUTO_INVALIDATE, 
granularity=> 'AUTO', 
method_opt=> 'FOR ALL COLUMNS SIZE AUTO');
end;

Load known STS for baseline.

DECLARE
my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.LOAD_PLANS_FROM_SQLSET(
     sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
     basic_filter => 'plan_hash_value = '1111780545');
END;
/

List out baseline.

SELECT * FROM dba_sql_plan_baselines;

Please make sure that the plan we would like optimizer to use is fixed enabled.

DECLARE
My_plan pls_integer;
BEGIN
	my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
 sql_handle =><sql_handle>, plan_name => <plan_name>, attribute_name => <attribute_name>, ‘ENABLED’, attribute_value =>’YES’);
END;
/

For the plan that was manually loaded, check to make sure it has been accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;

Load old plan(s) from cache into baseline to preserve it.

Query the pool.

Select sql_id, exact_matching_signature, force_matching_signature from v$sql where sql_text like ‘%<sql_text%’;

Create baseline from plan.

SET SERVEROUTPUT ON

declare
my_int pls_integer;
begin
my_int := DBMS_SPM.load_plans_from_cursor_cache(sql_id => '2zd224kurqsr9’, plan_hash_value => 1111780545, fixed => 'NO', enabled => 'NO');
DBMS_OUTPUT.PUT_line(my_int);
end;
/

List baseline to make sure it has been loaded.

SELECT * FROM dba_sql_plan_baselines where signature=<exact matching signature query>;

Make sure to load current plan from cursor cache into baseline and disable with the following command.  Also make certain that the baseline is not accepted.

DECLARE
My_plan pls_integer;
BEGIN
	my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
 sql_handle =><sql_handle>, plan_name => <plan_name>, attribute_name => <attribute_name>, ‘ENABLED’, attribute_value =>’NO’);
END;

Purge the SQL from cursor Cache, so that the optimizer will pick the new plan from baseline.

select 'DBMS_SHARED_POOL.PURGE ('''||ADDRESS||', '||HASH_VALUE||''', ''C'');' from V$SQLAREA where SQL_ID = '2zd224kurqsr9';

exec DBMS_SHARED_POOL.PURGE ('000000051A0961B0, 1111780545', 'C');

Confirmation and rollback procedures

You can roll back the changes by enabling the previous plans within SQL Profile Manager and running the application to reload planes into the pool.

DECLARE
My_plan pls_integer;
BEGIN
	my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
 sql_handle =><sql_handle>, plan_name => <plan_name>, attribute_name => ‘ENABLED’, 
attribute_value =>’YES’);
END;
/

Check to make sure old plan is enabled and accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;
Derek Moore
Derek Moore

Tableau Visualizations of the “Great Recession” (2007-2009)

Executive Summary

The Great Recession was one of the most turbulent economic periods of the past eighty years.  It was a global economic recession that impacted hundreds of banks, some of them responsible for financing the Gross Domestic Product (GDP) of entire countries. During the recession, many banks closed due to the speculation that flooded into the real estate market with new banking products that gave loans to millions of subprime consumers.  This allowed people to take out loans with low interest rates but very complicated terms that made it much easier to end up in foreclosure proceedings. During this period, home mortgages defaults skyrocketed in many states further depressing the market. This in turned caused many people to lose their life savings as well as their home and jobs. It is now popularly considered the longest period of economic downturn since the Great Depression of the 1930’s.

Since the great recession, many financial regulations and policies were put in place to prevent it from happening in the future.  The most popular of these were the Dodd-Frank Wall Street Reform and Consumer Protection act, also known as part of the Emergency Economic Stabilization Act, passed by Congress that amended many existing regulations to make it much easier to protect consumers by creating new government agents tasked with overseeing aspects of the financial banking system.  When Donald Trump was elected President in 2016, he signed a new law rolling back significant portions of the law.

One of the upsides of the recession is how much articles, books, and analysis done before, during and after the financial crisis.  This is of significant importance, because understanding how the financial markets got to that point in 2008, will help consumers better understand and better prepare for what not to do.  Another added benefit of understanding this period of time, is how many banks gambled with risky financial products, only to lose all the money it had.  Many consumers are now educated enough to know take on these risky loans, and banks scrutinized mortgage applications more closely.  During the recession, Inflation in the market ballooned housing prices until there was an inevitable crash.

The following report details state-by-state the impact of this period on the housing market, consumers, and overall housing values to give the audience a better understanding of quantitative impact monthly and yearly.  I will also show visualizations of housing prices detailing the loss of housing value, as a result causing many borrowers to be “under water” with their home mortgages.  The report will also focus on four states (North Carolina, California, Massachusetts and Florida) and impact on job losses and employment each month during the period of this recession.

Charts and Graphs

One the most crippling results of the Great Recession was on consumers.  Many consumers lost their homes as the downturn rippled throughout the country.  Figure 1 shows an aggregate picture of housing prices between 1996 and 2017 for the four states analyzed.  Notice how there was a steep growth in housing values and then a sudden drop during the period during and shortly after the recession.  Housing values would eventually rise again, but trillions of dollars of equity would be lost, with many communities not recovering.

Figure (1):  Average condo home prices in Florida and California were impacted more heavily by the economic downturn and drop in housing prices compared to North Carolina and Massachusetts.  Source Zillow.com

The impact of housing prices for California and Florida do not really come as a surprise.  Both of these states tend to have housing values greater than states where the cost of living is less.  Also, these states tend to have residents with higher incomes than the general population (retirees and white-collar workers).

When compared to the entire country, several states saw changes in the percentage of income compared to the change in home ownership.  This is an indication of how well some states weathered the economic downturn among their populations.  States such as California, Nevada, and many areas in the southwest saw a slow precipitous drop in home ownership along with a drop in income (with the exception of Utah).  Many states in the south did see drops in home ownership, but on average, the incomes of their populations remained the same.  North Carolina, for instance, saw no net loss of income, but an increase in home ownership.  States such as Nevada saw both large losses in income for their population as well as loss in home ownership in Figure 2.  Nevada was particularly hit hard by the economic recession.

Figure (2):  A heat map of the country comparing changes in income and home ownership.  Source GeoFred. https://geofred.stlouisfed.org/

The nest visualization shows the average unemployment rate.  This increased significantly during and after the great recession.  In figure 3, it shows how this unemployment became more long than the period of the recession itself, showing the economic impact on families beyond the initial wave of economic decline.

Figure (3):  Line graph of unemployment rates. Source U.S. Bureau of Labor Statistics.  https://www.bls.gov/cps/tables.htm

Conclusion

What these graphs show us is how, during the economic downturn, there was not only an impact on the housing market, but also income growth and unemployment.  Showing the scope and depth of the problem.  One of the unfortunate results of the recession was that many banks considered “too big to fail” received government funding to bail them out, and many homeowners and consumers were left to fill the blunt of the economic recession on their own.

Comparing Machine Learning Models in Determining Credit Worthiness for Bank Loans

The R language has a number of machine learning libraries to help determine for both supervised and unsupervised machine learning. This includes such ML techniques such as linear and logistic regression, decision trees, random forest, generalized boosted regression modeling among others. I strongly recommend learning how these models work and how they can be used to predictive analytics.

Part of the Machine Learning process includes the following:

  1. Sample: Create a sample set of data either through random sampling or top tier sampling.  Create a test, training and validation set of data.
  2. Explore: Use exploratory methods on the data.  This includes descriptive statistics, scatter plots, histograms, etc.
  3. Modify:  Clean, prepare, impute or filter data.  Perform cluster analysis, association and segmentation.
  4. Model:  Model the data using Logistic or Linear regression, Neural Networking, and Decision Trees.
  5. Assess:  Access the model by comparing it to other model types and again real data. Determine how close your model is to reality.  Test the data using hypothesis testing.

When creating machine learning models for any application, it is wise to following a process flow such as the following:

In the following example, we use machine learning to determine the credit worthiness of prospective borrowers for a bank loan.

The loan data consist of the following inputs

  1. Loan amount
  2. Interest rate
  3. Grade of credit
  4. Employment length of borrower
  5. Home ownership status
  6. Annual Income
  7. Age of borrower

The response variable or predictor to predict the default rate

  1. Loan status (0 or 1).

After loading the data into R, we partition the data for training or testing sets.

loan <- read.csv("loan.csv", stringsAsFactors = TRUE)

str(loan)

## Split the data into 70% training and 30% test datasets

library(rsample)
set.seed(634)

loan_split <- initial_split(loan, prop = 0.7)

loan_training <- training(loan_split)
loan_test <- testing(loan_split)

Create a over-sample training data based on ROSE library. This checks for over-sampling of the data.

str(loan_training)

table(loan_training$loan_status)

library(ROSE)

loan_training_both <- ovun.sample(loan_status ~ ., data = loan_training, method = "both", p = 0.5)$data

table(loan_training_both$loan_status)

Build a logistic regression model and a classification tree to predict loan default.

loan_logistic <- glm(loan_status ~ . , data = loan_training_both, family = "binomial")

library(rpart)

loan_ctree <- rpart(loan_status ~ . , data = loan_training_both, method = "class")

library(rpart.plot)

rpart.plot(loan_ctree, cex=1)

Build the ensemble models (random forest, gradient boosting) to predict loan default.

library(randomForest)

loan_rf <- randomForest(as.factor(loan_status) ~ ., data = loan_training_both, ntree = 200, importance=TRUE)

plot(loan_rf)

varImpPlot(loan_rf)

library(gbm)

Summarize gradient boosting model

loan_gbm <- gbm(loan_status ~ ., data = loan_training_both, n.trees = 200, distribution = "bernoulli")
summary(loan_gbm)

Use the ROC (receiver operating curve) and compute the AUC (area under the curve) to check the specificity and sensitivity of the models.

# Step 1. Predicting on test data

predicted_logistic <- loan_logistic %>% 
  predict(newdata = loan_test, type = "response")

predicted_ctree <- loan_ctree %>% 
  predict(newdata = loan_test, type = "prob")

predicted_rf <- loan_rf %>% 
  predict(newdata = loan_test, type = "prob")

predicted_gbm <- loan_gbm %>% 
  predict(newdata = loan_test, type = "response")

# Step 3. Create ROC and Compute AUC

library(cutpointr)

roc_logistic <- roc(loan_test, x= .fitted_logistic, class = loan_status, pos_class = 1 , neg_class = 0)
roc_ctree<- roc(loan_test, x= .fitted_ctree, class = loan_status, pos_class = 1 , neg_class = 0)
roc_rf<- roc(loan_test, x= .fitted_rf, class = loan_status, pos_class = 1 , neg_class = 0)
roc_gbm<- roc(loan_test, x= .fitted_gbm, class = loan_status, pos_class = 1 , neg_class = 0)

plot(roc_logistic) + 
  geom_line(data = roc_logistic, color = "red") + 
  geom_line(data = roc_ctree, color = "blue") + 
  geom_line(data = roc_rf, color = "green") + 
  geom_line(data = roc_gbm, color = "black")

auc(roc_logistic)
auc(roc_ctree)
auc(roc_rf)
auc(roc_gbm)

These help you compare and score which model works best for the type of data presented in the test set. When looking at the ROC chart, you can see that the gradient boost model has the best performance of all the model as it is closer to 1.00 than the other models. Classifiers that are closer to 1.00 for the top left where Sensitivity is 1.00 and Specificity is closer to 0.00 have the best performance.