Programming R for Human Genome Variation Data

Human genome variation analysis is a popular biomedical and biological discipline used for finding disease, developing treatments, and discovering the wide array of human genetic variation that influences disease and response to medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files, typically one chromosome at a time. VCF is a standard file format for storing genomic variation data and is used by organizations to map human genome variations at large scale. One example is the International Genome Sample Resource (IGSR).

A VCF file contains the following header columns:

  1. CHROM: the name of the chromosome.
  2. POS: the starting position of the variant.
  3. ID: an identifier for the variant.
  4. REF: the reference allele. An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.
  5. ALT: the alternate allele.
  6. QUAL: a quality score for the variant call.
  7. FILTER: pass/fail, indicating whether the variant passed the quality filters.
  8. INFO: additional information about the variant, stored as key=value pairs.
  9. FORMAT: the format of the per-sample genotype columns that follow.

The following libraries are needed to load and process VCF files.

install.packages("vcfR")
library(vcfR)               # read and manipulate VCF files
library("ShortRead")        # FASTQ handling (Bioconductor)
install.packages("microseq")
library(microseq)           # lightweight FASTQ/FASTA readers
library(GenomicAlignments)  # work with aligned reads
library(Rsamtools)          # low-level BAM access
library(pasillaBamSubset)   # example BAM data

To load a VCF file (because of their size, VCF files are typically gzip-compressed):

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)
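
A quick way to confirm what was loaded is to inspect the vcfR object. This is a minimal sketch that assumes the file1.vcf.gz object created above:

vcf_file                      # prints a summary: number of variants, samples and missing data
queryMETA(vcf_file)           # the meta region (##INFO and ##FORMAT definitions)
head(getFIX(vcf_file))        # the fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER
head(vcf_file@gt[, 1:2])      # the FORMAT column plus the first sample (if samples are present)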

Techniques for processing VCF files include:

chrom <- create.chromR(name='Supercontig',vcf=vcf_file,seq=dna,ann=gff)
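
The create.chromR() call above assumes that a reference sequence (dna) and an annotation (gff) have already been read into R. Here is a hedged sketch of that preparation and of basic chromR processing, using placeholder file names ref.fasta and ref.gff:

dna <- ape::read.dna("ref.fasta", format = "fasta")     # reference sequence (requires the ape package)
gff <- read.table("ref.gff", sep = "\t", quote = "")    # annotation track
chrom <- create.chromR(name='Supercontig', vcf=vcf_file, seq=dna, ann=gff)
chrom <- proc.chromR(chrom, verbose = TRUE)             # compute depth and quality summaries
plot(chrom)                                             # overview plots
chromoqc(chrom)                                         # per-position quality-control plot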

#Extract GenoTypes

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
 

system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE))

extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

 
#Extract the INFO fields (here from a second vcfR object, vcf_snp, loaded separately with read.vcfR)

extract_info_tidy(vcf_snp, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes to load or write genome data is to use system.time(). For example (fdta_combine is a previously created data frame of reads):

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

FASTQ files contain the sequencing reads for an entire genome and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

FASTQ files hold the raw reads that are later aligned to a reference genome; once read into R they can be converted to CSV files.

fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)
head(fdta1,100)
summary(fdta1)
str(fdta1)
fdta_sample_a <- fdta1[1:10,]
summary(fdta_sample_a)
system.time(write.csv(fdta_sample_a,'out_2.csv',row.names=FALSE))
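
If readFastq() here resolves to the microseq version (the last package loaded), fdta1 is a data frame with Header, Sequence and Quality columns, so base R can summarize the reads. A small sketch under that assumption:

read_len <- nchar(fdta1$Sequence)   # length of each read
summary(read_len)
hist(read_len, main = "Read length distribution", xlab = "bases")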

BAM Files

BAM files contain the raw aligned genomic data and are typically very large. Along with a wide array of standalone tools that can read BAM files, R has many functions that can process BAM data. BAM files also come with an index file (.bai) that makes it easier to find information within larger BAM files.

To load the BAM file libraries, you can install them directly into R or download the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))                  # reference the BAM file without reading it all
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))   # re-open with a yield size to read in chunks

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  # commented out: "Realloc could not re-allocate memory" error on large files
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))

(ga <- readGAlignments(bf)) # can hit memory-allocation issues if the BAM file is too large
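
One way around the memory problem is to read alignments for only one region at a time instead of the whole file. A hedged sketch reusing the gr range defined above and a fresh BamFile handle:

bam_full <- BamFile("D:/temp/raw1.bam")                                   # no yieldSize; the region restricts the read
ga_region <- readGAlignments(bam_full, param = ScanBamParam(which = gr))  # alignments for chr4:500,000-700,000 only
length(ga_region)
# Chunked alternative: open(bf), call readGAlignments(bf) repeatedly until it returns zero records, then close(bf)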

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

#install if necessary (Bioconductor now uses BiocManager rather than the old biocLite script)
if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
# 0 4 16
#1472261 775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function (x){
  ## do.call(c, ...) coerces factor to integer, which is undesired
  x1 <- x[[1L]]
  if (is.factor(x1)){
    structure(unlist(x), class = "factor", levels = levels(x1))
  } else {
    do.call(c, x)
  }
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist
bam_list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as data frame (avoid reusing the name of the base function list())
bam_df <- do.call("DataFrame", bam_list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410 13

How to Transition from a Database Administrator Job to a Data Science or Data Engineering Job

I’ve been a Database Administrator for over 20 years. Throughout the 1990s and 2000s, database administration became a somewhat lucrative, in-demand job for many people working in Information Technology. Even today, the role of Database Administrators (DBAs) is critical for daily operational goals and maintaining customer applications. Recently, there has been a major shift in what employers are looking for in job candidates for IT positions. Fewer companies are hosting their own databases, and the need for big data systems in the cloud has created more opportunities for people with skills in cloud architecture, data pipeline architecture and data science tools.

That being said, I feel this shift has put a lot of DBAs in a precarious position. Being a dedicated DBA is challenging, very time consuming, and requires a broad set of skills. It is a full-time job in and of itself, and database administration does not easily translate to data science or data engineering, so if you want to work towards a role as a data engineer or data scientist, you will probably have to take that initiative on your own and do off-hours work to acquire those skills. Data science is the ability to create meaningful business actions from sometimes messy, uncoordinated data. Data engineering is the ability to take very large volumes of data and make it readily available to business stakeholders regardless of the type of data, where it is stored, or how it is stored. Most DBAs spend their time making sure that the bare-metal (local or NAS) storage or provisioned storage of data is consistent, available, and secured, with an “engine” that can easily query or perform transactions on this data. Providing all the mechanisms needed to do this quickly, reliably and efficiently with no data loss is the challenge most DBAs face on a daily basis.

This is the very high-level comparison between the fields. But there are some very important nuances that need to be taken into consideration if you want to change roles. For one, being a DBA doesn’t necessarily mean that you understand how to work with data. Data is messy, and one of the strengths of a data scientist is his or her ability to take data and clean it, transform it, remove duplicates, remove anomalies, and so on. You then need the ability to sample and partition data, create models, and score your models. Many data scientists possess knowledge in mathematics and statistics that allows them to perform deep learning or complex machine learning and data analysis tasks.

One common bridge from database administration to data science and data engineering is SQL. SQL is a very powerful language for querying data in relational databases and is considered one of the most popular languages for data science. There are many functions available in SQL to perform data science work inside the database, and SQL is by far the most popular way to extract data from a database and deliver it to the business.

Most DBAs have had some exposure to SQL, and many have also had training in procedural SQL languages like T-SQL, PL/SQL and PL/pgSQL, among others. Therefore, transitioning to languages such as Python and R, typically used in data science, is less of a journey than starting with little programming experience at all. Both languages have libraries that work with SQL and database commands.

Along with learning Python and R, learning many of the popular data science and mathematics libraries such as scikit-learn and NumPy is also helpful. R is a great language for practicing data science techniques as well. Look for the many online resources for learning data science, and visit my articles on data science conferences and data science resources. Take online classes on LinkedIn Learning, Udemy, DataCamp and Coursera, which all have starting tracks for data science. A lot of success in moving into a new role comes from self-learning, particularly if your current position doesn’t include data science work that would build those skills.

For data engineering, I strongly recommend starting a cloud account with Google Cloud, AWS or Azure. They offer “pay-as-you-go”, subscription-based options billed on the amount of compute time you accumulate. And with the many open data sets available free to the public, you can easily build test data pipelines in the cloud in your free time. You can also build pipelines in the cloud to help with your current DBA role; most companies are transitioning to the cloud and offer their employees cloud access.

Post-graduate education is another path DBAs can take. There are many post-graduate and certificate programs in data science and big data engineering, with many more coming online, and these programs are flexible enough that you can study outside your normal work hours.

Tools and Methods to Analyze Variant and Genotyping of Human Genome Data 12/1/2021

Recently I’ve been tasked with analyzing biological and genomic data, and I’ve learned a lot about tools and libraries for R and Python. As part of this analysis, I’m helping scientists analyze genome variations and perform analysis of genotypes. The human genome has 23 chromosomes, about 3 billion base pairs that contain around 30,000 genes. Every base pair can be coded with 2 bits, which equates to around 750 megabytes of data. The data that I have been analyzing is several terabytes of genome sequences for around eight humans. Because the data is so massive, there are several high-throughput tools available to perform genotyping and variation discovery, which I will cover in a series of articles over the next few months.

One characteristic is that repetitive DNA sequences comprise approximately 50% of the human genome. Genome size is about 3,100 Mbp (mega-base pairs) per haploid genome. A base pair is two chemical bases bonded to one another, forming a “rung” of the DNA ladder; DNA strands look somewhat like a ladder twisted around itself.
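
The 750 megabyte figure follows from simple arithmetic, since each of the four bases can be encoded in 2 bits. A quick check in R:

bases <- 3.1e9        # approximate haploid genome size in base pairs
bits  <- bases * 2    # 2 bits encode A, C, G or T
bits / 8 / 1024^2     # about 739 MB, i.e. roughly 750 megabytes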

Variations

Variations include differences in the number of copies individuals have of a particular gene, deletions, translocations and inversions. One such variation is the single-nucleotide polymorphism (SNP); another class is copy number variation (CNV). Variations in DNA are actually a normal part of human genetics and can sometimes be a sign of the body adapting to changes within the sequences, or even of adaptation for protection.

SNP can be any nucleic acid substitution (a small R helper after this list illustrates the distinction):

  1. Transition
    1. Interchange of two purines (adenine/guanine), or
    2. Interchange of two pyrimidines (cytosine/thymine)
  2. Transversion
    1. Interchange of a purine and a pyrimidine nucleic acid
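
The following hypothetical R helper classifies a single-base substitution as a transition or transversion:

purines     <- c("A", "G")
pyrimidines <- c("C", "T")
snp_class <- function(ref, alt) {
  if (ref %in% purines && alt %in% purines) return("transition")
  if (ref %in% pyrimidines && alt %in% pyrimidines) return("transition")
  "transversion"
}
snp_class("A", "G")   # "transition"   (purine to purine)
snp_class("A", "C")   # "transversion" (purine to pyrimidine)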

Since variation discovery is very important in the biological sciences, many tools have been developed to assist in creating medicines and treatments for all types of mutations within cells.

The International HapMap Project was developed to describe variation patterns in the human genome and to find variations that impact health, responses to drugs, and responses to an individual’s environment. Variations include small-scale and large-scale variations.

Copy number variation (CNV): the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known. It has long been recognized that some cancers are associated with elevated copy numbers of particular genes. CNVs are categorized as long repeats or short repeats.

Insertions and deletions (indels) are a type of CNV: insertion-deletion mutations refer to the insertion and/or deletion of nucleotides in genomic DNA and include events less than 1 kb in length.

Other Definitions

Length of a base pair (bp): one bp corresponds to approximately 3.4 Å (340 pm) of length along the strand, and to roughly 618 or 643 daltons for DNA and RNA respectively.

Kilobase (kb) is a unit of measurement in molecular biology equal to 1000 base pairs of DNA or RNA.
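
These definitions make simple unit conversions straightforward; a quick sketch in R using the figures above:

bp_length_nm <- 0.34     # one base pair spans ~3.4 Å = 0.34 nm along the strand
kb <- 1000               # one kilobase = 1,000 base pairs
kb * bp_length_nm        # a 1 kb stretch of DNA is about 340 nm long
3100e6 / kb              # the ~3,100 Mbp haploid genome is about 3.1 million kb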

Data Analysis

Most of the analysis is performed in R. Here are some of the libraries used for this genomic analysis:

install.packages("vcfR")
library(vcfR)                  # VCF file handling
library(Rsamtools)             # BAM file access
library(pasillaBamSubset)      # example BAM data
install.packages("fastqcr")
library("fastqcr")             # FASTQ quality-control reports
library("ShortRead")           # FASTQ handling (Bioconductor)
install.packages("microseq")
library(microseq)              # lightweight FASTQ/FASTA readers
library(GenomicAlignments)     # aligned-read processing
library(ggrepel)               # plot labeling
install.packages("factoextra")
library(factoextra)            # clustering and PCA visualization

The above libraries are standard R libraries for analyzing Genomic data. Later in this document, I will discuss the multiple tools that produce the files necessary for these libraries.

The most advanced libraries can be installed from Bioconductor using BiocManager.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")

Visualization

[Figure not reproduced: visualization of function results for sample Patient1A]

Using R with Human Genome Variation Data

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files, typically one chromosome at a time. VCF is a standard file format for storing genomic variation data and is used by organizations to map human genome variations at large scale. One example is the International Genome Sample Resource (IGSR).

A VCF file contains the following header columns:

  1. CHROM: the name of the chromosome.
  2. POS: the starting position of the variant.
  3. ID: an identifier for the variant.
  4. REF: the reference allele. An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.
  5. ALT: the alternate allele.
  6. QUAL: a quality score for the variant call.
  7. FILTER: pass/fail, indicating whether the variant passed the quality filters.
  8. INFO: additional information about the variant, stored as key=value pairs.
  9. FORMAT: the format of the per-sample genotype columns that follow.

FASTQ file format

FASTQ files contain the sequencing reads for an entire genome, represent the raw sequencing data, and can be very large.

BAM or CRAM file formats

These are the files that store sequencing reads aligned to a reference genome.

Genomics Tools

Genome data is very large and contains millions of base pairs per chromosome. Although this data can be loaded into R, the complexity of looking at individual genes and chromosomes can be daunting. One tool that makes reading genomic data more visual is the Integrative Genomics Viewer (IGV). IGV is a visualization tool that zooms in to the gene and chromosome level, down to individual bases.

IGV efficiently pulls in BAM file indexes to locate genomic data.
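
IGV and the R packages above both expect a .bai index next to the BAM file; if one is missing it can be created with Rsamtools. A sketch reusing the placeholder path from the earlier examples:

sorted <- sortBam("D:/temp/raw1.bam", destination = "D:/temp/raw1.sorted")  # indexing requires coordinate-sorted input
indexBam(sorted)    # writes raw1.sorted.bam.bai next to the sorted BAM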

Other tools for genomics, structural biology and molecular biology include the DNASTAR Lasergene Structural Biology Suite and Spartan.

Another tool for visualization is the Variant Effect Predictor (VEP), which determines the effect of variants on genes.

Other tools include SnpEff and SnpSift. SnpEff provides genetic variant annotation and functional effect prediction: it annotates and predicts the effects of genetic variants on genes and proteins.

SnpSift annotates genomic variants using databases and filters annotated variants. Once you have annotated your files using SnpEff, you can use SnpSift to filter large genomic datasets in order to find the most significant variants for your experiment. All SnpEff and SnpSift genomic databases are kindly hosted by Microsoft Genomics and Azure.
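
After SnpEff has annotated a VCF, the annotations are stored in the INFO column's ANN field and can be pulled back into R with vcfR. A hedged sketch, assuming a SnpEff-annotated file named annotated.vcf.gz:

ann_vcf <- read.vcfR("annotated.vcf.gz", verbose = FALSE)   # hypothetical annotated file
ann     <- extract.info(ann_vcf, element = "ANN")           # one annotation string per variant
head(strsplit(ann[1], split = "\\|")[[1]])                  # effect, impact, gene name, ...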

The Microsoft Genomics service provides a cloud-hosted solution that makes it easy to variant-call your genomic samples. The service takes in genomic samples as two paired-end-read FASTQ (.fq.gz) files and produces .bam, .bai and .vcf files, along with the associated log files.

The process uses a BWA/GATK data pipeline in which Microsoft has improved the efficiency of both BWA and GATK, producing results faster and with less overhead. There is also a secondary analysis step.

GATK, the Genome Analysis Toolkit, is also used for variant discovery, using a data pipeline that can be scaled in the Azure or Google cloud. GATK is a framework for variant discovery with high-throughput sequencing data.

Another tool, CNVnator, is used for CNV discovery and genotyping from the depth of coverage of mapped reads.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

https://www.microsoft.com/en-us/genomics/

Creating Plan Stability in Microsoft SQL-Server

To ensure optimal performance of the database, an administrator must keep up with the most important areas of database maintenance, including: 

  1. Rebuilding indexes 
  2. Minimizing procedure compilations and recompilations during heavy workloads. 
  3. Checking page allocation, contention and aging. 

  Actions such as running the re-indexing script daily and checking for fragmentation are important for optimization health. 

Plan and SQL handles 

Optimization of code to reduce the amount of logical and physical lookups occurring in memory and on disk should be the first method of troubleshooting performance issues.  Once code has been released into a production environment, it is not always possible to change it to make it faster.  At times, what is needed is to monitor for plan changes and plan execution.  As part of this effort, it’s important to ensure that queries executed on the database maintain a consistent, optimal execution plan, or that no significant plan changes occur.  In the appendix, there are queries that will allow an administrator to observe plan changes and execution statistics.  Here is an example of how to investigate plan stability in a SQL Server database.  The table below shows a typical query used multiple times during the day with its respective plan handle, which is a hash of the compiled query plan.  You can map a particular SQL text to a plan handle using the dynamic management function (DMF) sys.dm_exec_sql_text(sql_handle | plan_handle). 

SQL Statement: 
SELECT * FROM Sales.SalesOrderHeader AS h, Sales.Customer AS c, Sales.SalesTerritory AS t WHERE h.CustomerID = c.CustomerID AND c.TerritoryID = t.TerritoryID AND CountryRegionCode = @Country_region 

Plan Handle: 
0x0600060066E0F61940015D4D070000008800000000000000 

Table G5-1 

Having multiple plan handles is not necessarily an issue and doesn’t mean the plan is running errantly.  Investigation of the query plan can help determine if the plan has a high cost to the optimizer. 

Steps necessary to determine whether query execution is optimized include: 

  1. Looking into the plan cache to determine whether cached plan reuse is occurring. 
  2. Monitoring compilations and recompilations to make sure they are minimized and planned for. 
  3. Reporting any issues with long-running code to Itron Support. 

Checking for locking issues such as long PAGELATCH_IO latch waits and deadlocks is critical.  Administrators must check for deadlock events either by running traces or by querying dynamic management views. 

In the appendix below, there are examples of looking at the performance of specific plan handles and of the use of plan guides.  Details of these methods are beyond the scope of this report and should be researched and thoroughly tested before use. 

Plan Maintenance Analysis Queries









SET NOCOUNT ON 
GO 
 
/*  Query checks for memory usage in database o/s  */ 
 
SELECT 
(physical_memory_in_use_kb/1024.0/1024.0) AS Memory_usedby_Sqlserver_GB, 
(locked_page_allocations_kb/1024.0) AS Locked_pages_used_Sqlserver_MB, 
(total_virtual_address_space_kb/1024.0/1024.0) AS Total_VAS_in_GB, 
process_physical_memory_low, 
process_virtual_memory_low 
FROM sys.dm_os_process_memory; 


SELECT object_name, cntr_value 
  FROM sys.dm_os_performance_counters 
  WHERE counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)'); 
 
SELECT physical_memory_kb FROM sys.dm_os_sys_info; 
SELECT physical_memory_in_bytes FROM sys.dm_os_sys_info; 





SELECT top(5) sum(total_physical_reads) tot_prds, sum(max_physical_reads) tot_max_prds, sum(total_logical_reads) tot_lrds, sum(max_logical_reads) tot_max_lrds, 
 sum(total_logical_writes) tot_lwrts, sum(max_logical_writes) tot_max_lwrts 
 FROM   sys.dm_exec_query_stats stat 
GO  
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text] 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 


/* Can use query to check for specific plan handles based on performance */ 
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
select * from sys.dm_exec_cached_plan_dependent_objects(0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
go 
 
sp_create_plan_guide_from_handle 
go 
 
 
SELECT TOP(10) total_physical_reads, last_physical_reads, min_physical_reads, max_physical_reads, 
total_logical_writes, last_logical_writes, min_logical_writes, max_logical_writes, 
total_logical_reads, last_logical_reads, min_logical_reads, max_logical_reads,  
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
ORDER BY total_physical_reads, Total_logical_reads DESC 
GO 
 
SELECT TOP(10) total_physical_reads, max_physical_reads, total_logical_writes, max_logical_writes, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE p.objtype <> 'Proc' 
ORDER BY total_physical_reads, Total_logical_reads DESC 
GO 
 
 
SELECT t1.session_id, t1.request_id, t1.task_alloc, 
t1.task_dealloc, t2.statement_start_offset,  
t2.statement_end_offset, REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM (Select session_id, request_id, 
SUM(internal_objects_alloc_page_count) AS task_alloc, 
SUM (internal_objects_dealloc_page_count) AS task_dealloc  
FROM sys.dm_db_task_space_usage  
GROUP BY session_id, request_id) AS t1,  
sys.dm_exec_requests AS t2 
INNER JOIN sys.dm_exec_query_stats stat ON t2.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE t1.session_id = t2.session_id 
AND (t1.request_id = t2.request_id) 
 
-- I/O query stats based on the sql text 
 
SELECT top (10) (stats.total_logical_reads/stats.execution_count), 
    (stats.total_logical_writes/stats.execution_count), 
    (stats.total_physical_reads/stats.execution_count), 
    stats.execution_count, stats.sql_handle, stats.plan_handle, 
    REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_query_stats stats 
CROSS APPLY sys.dm_exec_sql_text (stats.sql_handle) 
ORDER BY (total_logical_reads + total_logical_writes) DESC 
 
 
 
-- Missed Index information 
 
select TOP(50) * from sys.dm_db_missing_index_group_stats 
GO 
 
select object_name(object_id) object, equality_columns, included_columns, statement from sys.dm_db_missing_index_details 
go 


/* Queries below use an optional technique known as plan guides */ 
  
sp_create_plan_guide_from_handle @name = N'plan_guide_name' 
 
-- Create a plan guide for the query by specifying the query plan in the plan cache. 
DECLARE @plan_handle varbinary(64); 
DECLARE @offset int; 
SELECT @plan_handle = plan_handle, @offset = qs.statement_start_offset 
FROM sys.dm_exec_query_stats AS qs 
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st 
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp 
WHERE text LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
 
EXECUTE sp_create_plan_guide_from_handle  
    @name =  N'Guide1', 
    @plan_handle = @plan_handle, 
    @statement_start_offset = @offset; 
GO 
-- Verify that the plan guide is created. 
SELECT * FROM sys.plan_guides 
WHERE scope_batch LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
GO 

Create Plan Guides in SQL-Server

sp_create_plan_guide  
@name = N'Guide1', 
@stmt = N'SELECT * FROM Sales.SalesOrderHeader AS h, 
        Sales.Customer AS c, 
        Sales.SalesTerritory AS t 
        WHERE h.CustomerID = c.CustomerID  
            AND c.TerritoryID = t.TerritoryID 
            AND CountryRegionCode = @Country_region', 
@type = N'OBJECT', 
@module_or_batch = N'Sales.GetSalesOrderByCountry', 
@params = NULL, 
@hints = N'OPTION (OPTIMIZE FOR (@Country_region = N''US''))' 
 

Creating Query Plan Stability in Oracle Relational Databases with Cost-based Optimizers

What is Query Plan Stability?

Query Plan Stability ensures predictable, acceptable database performance no matter how much data is being stored in the buffer cache or on disk.  It ensures that all queries perform according to business requirements and that processes run in a timely manner during all business hours.

The information presented in this document applies to object-relational databases with a cost-based optimizer. The principal audience for this documentation is database administrators, database programmers and customer support analysts.

Whenever you are confronted with a poorly performing application, the following typically occurs during troubleshooting:

  1. Ask the stakeholder when and how frequently statistics collection is performed.  If the stakeholder has not computed statistics recently, or has never computed them, statistics must be computed before proceeding.
  2. Ask customers if there has been a recent change in the data. If so, get as much detail about the changes as possible:
    1. When and where the changes were made
    2. The quantity of changes

During execution of processes, plan instability can occur due to the following factors:

  1. Lack of adherence to cardinality and primary and foreign key relationships.
  2. Lack of proper statistics.
  3. Misuse of predicates in query and plan binding.
  4. Flush of “good” plans from the shared pool (plan cache).

One of the best tools for ensuring plan stability in Oracle is the cost-based optimizer’s explain plan, which helps determine how costly a plan is to execute based on its estimated CPU cost.

Query identifiers in Oracle (SQL_ID) are alphanumeric strings (e.g. 3mx8whn1c5jbb). 

When reviewing query execution performance, it’s always best to create a query plan:

SQL>EXPLAIN PLAN FOR 
select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

SQL>  select * from TABLE(DBMS_XPLAN.DISPLAY);

Explain plans show how the query will be executed by the cost-based optimizer once parsed. They provide valuable information about where the highest costs are in your query. Query tuning is essential for maintaining performance in a cost-based optimizer.

If plan instability occurs in an Oracle database, there are several options, in addition to the steps in the preceding sections, to correct the issue:

  1. Determine whether the query performs very poorly or does full table scans on very high-cardinality, large table objects.
  2. Make sure that the statistics collection script has been executed and that statistics are up to date.  See the appendix for a script to check statistics.
  3. If statistics are up to date, attempt to generate a plan analysis (SQL profile) of the current plan.  Note: the Oracle Diagnostics Pack and Automatic Database Diagnostic Monitor (ADDM) need to be installed.  The profile can be generated via the OEM GUI or using scripts.

Command Line SQL Tuning and SQL Profiling.

The following steps allow an administrator to create SQL profile tuning sets, export those profiles from a source database, and import them into a new database.  This allows administrators to create and test better plans in test systems, then load them safely into production systems and enable or disable them when needed.

Run a query to find SQL currently in memory that has high execution times.

SET HEADING ON
SET PAGESIZE 20000
SET LINESIZE 2000
COL text FOR A50
COL ctext FOR A50


select vs.last_load_time, ao.OWNER, parsing_schema_name, first_load_time , ao.OBJECT_NAME, vs.program_line#, executions exe, vs.sqltype, vs.sql_id, vs.rows_processed rows_processed, concurrency_wait_time, elapsed_time/1000000 elapsed_secs, elapsed_time/1000000/(case when executions = 0 then 1 else executions end) elap_per_exec_secs, vs.sql_fulltext text, vs.sql_text ctext
from v$sql vs, all_objects ao
where vs.PROGRAM_ID = ao.OBJECT_ID and parsing_schema_name not in ('SYS','SYSTEM') and parsing_schema_name in ('<SCHEMA_OWNER>')
and owner not in ('SYS','SYSTEM','DBSNMP','SYSMAN','MDSYS')
order by vs.last_load_time desc , vs.parsing_schema_name, vs.first_load_time desc , program_id, vs.program_line#;

Execute a v$sqlarea query to find the address and hash values that identify that query.


select address, hash_value, plan_hash_value from v$sqlarea where sql_id in ('2zd224kurqsr9');

ADDRESS          HASH_VALUE PLAN_HASH_VALUE
---------------- ---------- ---------------
00000003ACC32290 3419553494      1111780545

Prior to purging a PLAN_HASH from the shared pool, it’s important to collect information about the bad plan.  You can do this by executing a DBMS_XPLAN outline on the plans.  To preserve the plan prior to flushing it from the shared pool, read the section on loading a plan from the cache into a baseline, then return here to continue. 

SQL> EXPLAIN PLAN FOR select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

Explained.

SQL> select * from TABLE(DBMS_XPLAN.DISPLAY);

If the plan exists in memory but has a bad plan hash value, create a tuning task and SQL set for the query.

Execute the tuning task to see if ADDM will generate a better plan (do not purge the plan from memory if using this step).

var stmt_task VARCHAR2(64);

SET SERVEROUTPUT ON LINESIZE 200 PAGESIZE 20000 LONG 9999

EXEC :stmt_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => '2zd224kurqsr9', task_name => 'sga_tuning_task_1');
	
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;

 If no results are returned, then ADDM will not be able to tune.

SQL> SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;
ERROR:
ORA-13631: The most recent execution of task sga_tuning_task_1 contains no results.
ORA-06512: at "SYS.PRVT_ADVISOR", line 5739
ORA-06512: at "SYS.DBMS_SQLTUNE", line 1045
ORA-06512: at line 1

If no results are returned, skip the following query and go to step 6.  Always make sure not to purge the SGA if you use these steps.

DECLARE
    
    sqlprofile_name VARCHAR2(30);

BEGIN
	 
    sqlprofile_name := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE ( 
    task_name    => 'sga_tuning_task_1'
,   name         => 'sql_profile_1'
,   force_match  => true 
);
END;
/

Create a SQL Tuning Set from AWR.

Find the begin and end snapshot span for the SQL_ID and plan hash that you need from AWR (this applies if the plan has been purged from the pool but still exists in the AWR repository).

SET LINESIZE 2000 PAGESIZE 20000
SET LONG 99999
     select
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24') begin_interval_time,
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24') end_interval_time,
        SQL_ID,
        round((round((avg(ELAPSED_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as ELAPSED_TIME_SECS,
        round((round((avg(CPU_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as CPU_TIME_SECS,
        round((avg(BUFFER_GETS_DELTA)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as GETS_PER_EXEC
     from
        dba_hist_snapshot natural join dba_hist_sqlstat natural join dba_hist_sqltext
     where
      (elapsed_time_delta > 0 or elapsed_time_delta is not null)
      and SQL_ID = 'bbuc9hr5x4gqq'
     group by
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24'),
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24'),
        SQL_ID
     order by
        snap_id asc
/

Create a SQL set.

Begin 
  DBMS_SQLTUNE.CREATE_SQLSET (sqlset_name => '2zd224kurqsr9_TUNING_SET_1');
END;
/

 Load SQL Tuning Set.

DECLARE
   Cur sys_refcursor;
 BEGIN
    OPEN cur FOR
     SELECT VALUE(P) FROM TABLE(dbms_sqltune.select_workload_repository(begin_snap=>13645, end_snap=>13649, Basic_filter=>'sql_id=''2zd224kurqsr9''', attribute_list => 'ALL' )
) p;
DBMS_SQLTUNE.LOAD_SQLSET(sqlset_name=> '2zd224kurqsr9_TUNING_SET_1', populate_cursor=> cur);
CLOSE cur;
End;
/

 Verify tuning set is correct.

SELECT * FROM TABLE(DBMS_SQLTUNE.SELECT_SQLSET(sqlset_name => '2zd224kurqsr9_TUNING_SET_1'));

Create the SQL set staging table (it cannot be created in the SYS or SYSTEM schema).

execute DBMS_SQLTUNE.CREATE_STGTAB_SQLSET(table_name =>'TUNING_SET_1_STGTAB');

Pack the tuning set into the staging table.

execute DBMS_SQLTUNE.PACK_STGTAB_SQLSET(sqlset_name => 'bbuc9hr5x4gqq_TUNING_SET_1',staging_table_name => 'TUNING_SET_1_STGTAB', staging_schema_owner=>'<SCHEMA_OWNER_SOURCE>');

Check the count of the table to make sure it’s been populated.

SQL> select count(*) from <SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB;

  COUNT(*)
----------
         3

Export out the staging table.

[oracle]$ expdp system/******* DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES='<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB'

Export: Release 11.2.0.4.0 - Production on Fri Mar 9 04:10:34 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Starting "SYSTEM"."SYS_EXPORT_TABLE_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES=<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 9.562 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
. . exported "<SCHEMA_OWNER_SOURCE>"."TUNING_SET_1_STGTAB"             54.01 KB       3 rows
Master table "SYSTEM"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_TABLE_01 is:
  /data/backups/TUNIING_SET_1.DMP
Job "SYSTEM"."SYS_EXPORT_TABLE_01" successfully completed at Fri Mar 9 04:11:10 2018 elapsed 0 00:00:32


******** impdp staging table into target database  ******

Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Master table "SYSTEM"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
Starting "SYSTEM"."SYS_IMPORT_FULL_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP REMAP_SCHEMA=<schema_owner_source>:<schema_owner> DIRECTORY=BACKUPS
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "<SCHEMA_OWNER>"."TUNING_SET_1_STGTAB"                54.01 KB       3 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
Job "SYSTEM"."SYS_IMPORT_FULL_01" successfully completed at Fri Mar 9 04:29:52 2018 elapsed 0 00:00:41

Purge the plan from the shared pool on source database.

exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');

*** OR ***

set pagesize 20000
select 'exec sys.dbms_shared_pool.purge(''' || address || ',' ||  hash_value   || ''',''C'');' from v$sqlarea where sql_text like '%TASKTEMPLATEPARAMETERVALUE X  where (1=1)%'
/

'EXECSYS.DBMS_SHARED_POOL.PURGE('''||ADDRESS||','||HASH_VALUE||''',''C'');'
--------------------------------------------------------------------------------
exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');
exec sys.dbms_shared_pool.purge('0000000127D41C68,3957004154','C');
exec sys.dbms_shared_pool.purge('00000001A851BB70,525044174','C');
exec sys.dbms_shared_pool.purge('000000016C757A98,4128744307','C');
exec sys.dbms_shared_pool.purge('000000025115E2A8,4132567071','C');

Installing new plan on target database.

Unpack the SQL set in the target database (using DBMS_SQLTUNE.UNPACK_STGTAB_SQLSET) and verify that it is present:


NAME
--------------------------------------------------------------------------------
CREATED             STATEMENT_COUNT
------------------- ---------------
bbuc9hr5x4gqq_TUNING_SET_1
2018-03-09 04:33:59               

Load plan into SQL Plan Management (SPM).

set serveroutput on
declare
my_integer pls_integer;
begin
my_integer := dbms_spm.load_plans_from_sqlset (
sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
sqlset_owner => 'SYS',
fixed => 'YES',
enabled => 'YES'
);
DBMS_OUTPUT.PUT_line(my_integer);
end;
/

Gather stats on objects used by tuning set.

begin
dbms_stats.gather_table_stats( 
ownname=> '<schema_owner>', 
tabname=> 'TASKTEMPLATEPARAMETERVALUE', 
estimate_percent=> null, 
cascade=> TRUE, 
degree=> null, 
no_invalidate=> DBMS_STATS.AUTO_INVALIDATE, 
granularity=> 'AUTO', 
method_opt=> 'FOR ALL COLUMNS SIZE AUTO');
end;

Load known STS for baseline.

DECLARE
my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.LOAD_PLANS_FROM_SQLSET(
     sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
     basic_filter => 'plan_hash_value = 1111780545');
END;
/

List out baseline.

SELECT * FROM dba_sql_plan_baselines;

Make sure that the plan we would like the optimizer to use is FIXED and ENABLED.

DECLARE
  my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
    sql_handle      => '<sql_handle>',
    plan_name       => '<plan_name>',
    attribute_name  => 'ENABLED',
    attribute_value => 'YES');
END;
/

For the plan that was manually loaded, check to make sure it has been accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;

Load the old plan(s) from the cache into a baseline to preserve them.

Query the pool.

Select sql_id, exact_matching_signature, force_matching_signature from v$sql where sql_text like '%<sql_text>%';

Create baseline from plan.

SET SERVEROUTPUT ON

declare
my_int pls_integer;
begin
my_int := DBMS_SPM.load_plans_from_cursor_cache(sql_id => '2zd224kurqsr9', plan_hash_value => 1111780545, fixed => 'NO', enabled => 'NO');
DBMS_OUTPUT.PUT_line(my_int);
end;
/

List baseline to make sure it has been loaded.

SELECT * FROM dba_sql_plan_baselines where signature=<exact matching signature query>;

Make sure to load the current plan from the cursor cache into the baseline and disable it with the following command.  Also make certain that the baseline is not accepted.

DECLARE
  my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
    sql_handle      => '<sql_handle>',
    plan_name       => '<plan_name>',
    attribute_name  => 'ENABLED',
    attribute_value => 'NO');
END;
/

Purge the SQL from the cursor cache so that the optimizer will pick the new plan from the baseline.

select 'DBMS_SHARED_POOL.PURGE ('''||ADDRESS||', '||HASH_VALUE||''', ''C'');' from V$SQLAREA where SQL_ID = '2zd224kurqsr9';

exec DBMS_SHARED_POOL.PURGE ('000000051A0961B0, 1111780545', 'C');

Confirmation and rollback procedures

You can roll back the changes by enabling the previous plans within SQL Plan Management and running the application to reload the plans into the pool.

DECLARE
  my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
    sql_handle      => '<sql_handle>',
    plan_name       => '<plan_name>',
    attribute_name  => 'ENABLED',
    attribute_value => 'YES');
END;
/

Check to make sure the old plan is enabled and accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;
Derek Moore

Effectively using Associative Tables in Relational Database Design

 

When creating entities in an entity-relationship (ER) diagram, there are times when multiple entity attributes (also known as fields in tables) have associations that could create redundancy in the entity instances of a relational database model.  An associative entity is an entity type that associates the instances of one or more entity types and contains attributes describing the relationships between those entities.  Associative entities help prevent joining two or more entities directly with multiple relationships that would create duplication of instances.  In databases this can sometimes result in Cartesian joins and can produce duplicate table rows, sometimes many times over, in a result set.

For entities whose instances are not very unique, such as the days of the week, the months in a year, or the positions in a company, it’s not a good idea to create a direct association from attributes on that entity to another entity that has very unique instances.  For example, meetings in a company, the full names, ages and addresses of people in New York City, or the people working in a company are examples of uniqueness.  Although these are very basic examples, you will soon discover that you can create a level of duplication that exponentially grows the number of instances in your database model.

Below is a use case for an associative entity.  The EMPLOYEE and SKILLS tables are the entities that contain the employee information and the types of skills and titles within the company, respectively.  In this example a company ranks its employees’ skills by titles such as Principal, Senior, III, II and I.  However, they also want to know which skills are managed by which employees.  Since we want to prevent as much duplication as possible, we create an associative table called ORGANIZATION that has a primary identifier attribute (ORG_ID) and joins together the EMPLOYEE and SKILLS tables. The MANAGED_SKILL_ID attribute creates a composite identifier that allows an employee to have multiple instances of the skills he or she manages without duplicating data in the EMPLOYEE or SKILLS entities.

[ER diagram not reproduced: EMPLOYEE and SKILLS entities joined through the associative ORGANIZATION entity]

 

Leveraging Memory in PL/SQL for Big Data Applications

When writing PL/SQL code for Big Data applications, it’s always a good idea to leverage the database as much as possible.  This is particularly true for ETL or warehouse processing.

In my experience, I’ve tried to quantify which objects in Oracle programming work best to temporarily store data.  As part of your Big Data application, anytime you need to leverage relational databases, here is a table that I typically use as a guideline when loading data into or extracting data from a database.

Typically, I judge what type of cursor object to use based on factors such as performance, the number of rows that need to be read, the operation I am performing, and how much memory I would like to leverage.

Number of Rows   Explicit Cursor   Temporary Table   PL/SQL Memory Collections   Record Collections   Typical Memory Used
<= 1,000   ✓   ✓   10 MB
1,000 to 500,000   ✓   ✓   ✓   100-500 MB
500,000 to 2,000,000+   ✓   ✓   500 MB+

Although explicit cursors can be used for much larger data sizes, if performance is your goal a few thousand rows is optimal for maintaining good performance, especially if you are opening and closing many cursors in a small window of time.