Using Machine Learning and Data Science for Performance Tuning in Oracle

Like many DBAs, I have experienced the ups and downs of database performance tuning. Many battles have been fought to tune and maintain an Oracle database yielding mostly victories, but also a few hard-fought lessons learned.  Performance tuning is a constant in my job and I’ve learned a lot; but for every early morning phone call I’ve received, every severity one conference call that I’ve been on, and every emergency patch that I had to deployed; there is one mantra that has been etched into my mind: Database performance tuning is as much an art as it is a skill.  There is no easy one-size-fits-all solution to it all. It requires an understanding of various architectures outside the database itself as well as deep knowledge of Oracle internals.

Data Science and Machine Learning

The popularity of data science has opened new possibilities with database performance tuning. Along with building innovative products it has created excitement in areas such as the Internet of Things and cloud computing where very large volumes of data are mined for value.  For those who are new to the concept of Machine Learning, essentially it is defined as follows:

Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events. An example would be a spam filter learning new exploits and then blocking those exploits.

Data scientists have a lot in common with database professionals, such as building extraction-transformation-loading routines, constructing business intelligence applications, and data wrangling. The difference is data science also provides more qualitative functionality and programming for data and business analytics. In many respects, the use of analytical and descriptive statistics has always been a tool for DBAs to manage performance by looking at averages and variances in performance over time. But what essentially has become an easy mathematical tool has expanded into more advanced analytics.

Many DBAs are familiar with the Oracle Diagnostic Pack (By the way, I’m covering Oracle versions 10g, 11g and 12c).  This feature contains the Automated Workflow Repository support which essentially stores system performance data for everything from active session history, to system, segment and o/s statistics. It also stores information on shared pool statistics and query plan execution statistics. It’s a massive repository used to build Automated Workflow reports and can be used to provide historical trending data. Typically, the default to store this data is 15 days with system performance data being sampled every hour. But the sample rate can be 15 minutes with historical data stored for entire months. Anyone who has done performance tuning with Oracle SQL has used v$, x$ or DBA_HIST views to troubleshooting or address performance issues in Oracle. The nice thing about the automated workflow repository is that you can write your own queries against it. By extension, you can also write PL/SQL procedures

to mine the data and build predictive and prescriptive statistical models. Below is an example of a query that builds a pivot table of AWR tables based on process and memory stats.

SELECT jn.snap_id, jn.stat_name, avg(value), avg(readtim), avg(writetim), avg(phyrds), avg(phywrts), avg(wait_count), avg(time), avg(pc.num_processes) from (

select dbid, snap_id, stat_name, value from dba_hist_snapshot natural join dba_hist_osstat

)

PIVOT (

              avg(value)

              for stat_name in ('BUSY_TIME','AVG_BUSY_TIME','IDLE_TIME','AVG_IDLE_TIME','NUM_CPU_CORES','NUM_CPUS','NUM_VCPUS','NUM_LCPUS','VM_OUT_BYTES','AVG_USER_TIME','AVG_SYS_TIME','OS_CPU_WAIT_TIME','IOWAIT_TIME','AVG_IOWAIT_TIME','PHYSICAL_MEMORY_BYTES')

) jn

JOIN

DBA_HIST_PROCESS_MEM_SUMMARY pc

ON (jn.dbid = pc.dbid and jn.snap_id = pc.snap_id)

 WHERE

pc.category in ('PL/SQL','SQL')

group by jn.snap_id, jn.stat_name

I first became interested in using statistical analysis in performance tuning after reading a paper entitled An Industrial Engineer’s Approach to Managing Oracle Databases by Robyn Sand. It described the use of statistical methods and engineering process control in DB performance tuning. With the advent of Big Data analytics, data science and machine learning, there are rich opportunities to gain meaningful insight into performance management in Oracle; but there also must be an abundance of caution.

Beyond descriptive statistics, the use of data science and machine learning would, in theory, allow me to “predict” potential detrimental performance in Oracle. Using historical data collected in the AWR, there are literally thousands of possible insights that can be gleaned about Oracle RDBMS performance.

Here were some of the successes, pitfalls and lessons I learned…

 Histograms, Standard Deviations, Distributions and other statistical methods

Oracle has a multitude of built-in analytical and grouping functions to support dimensional data structures and perform data mining techniques. These built-in aggregation features are great for visualizing large volumes of performance data on charts, dashboards and graphs. I estimate that for 98% of all performance tuning needs, these functions will work just fine.

The image above is a histogram of the number of concurrent processes by utilization. The mean appears to be around 300 concurrent processes which generate around a 9% average utilization in an oracle database. The second axis is the mean utilization of the frequency of 300 concurrent processes.

There are three main applications for which data science and machine learning can be applied to Oracle database management.

  1.  Capacity Planning and IT Asset Planning.
  2.  Performance Management
  3.  Business Process Analysis

I’d like to put some added emphasis on Business Process Analysis (3). It doesn’t do any good to present data analysis that hasn’t been qualitatively reviewed by stakeholders who know the business value of their IT assets; meaning, before making a decision on whether to purchase new hardware, or invest real dollars based on any type of quantitative analysis, the results must be presented in a way for the business stakeholders to make a sound final decision. As a database or IT professional, it is our job to present quantitative analysis for all stakeholders to make proper business decisions. We must also have anecdotal, near-real time and historical evidence to provide an unbiased objective analysis. Business decisions should never be made on data analysis alone.

 Unsupervised Learning and Clustering Analysis

There are two subdivisions of machine learning: supervised and unsupervised learning. Supervised learning requires guidance from a programmer by creating training, testing or validation sets of data to build analytical models and to assess and score those models to the proper predictive results from the input data. Unsupervised learning is built on inferences of the data itself; how specific data points and how variables relate to each other. Cluster analysis is an example of unsupervised learning technique. Clustering is popular in the analysis of demographic information (age, sex, height, race, location) and the segmenting customers who according to how likely they purchase a new type of product. If you are looking for a way cluster database properties or system configurations by specific inputs, clustering analysis can be a great tool. It can be used to build specific capacity planning algorithms base on user inputs.

Clustering model of the number of processes per utilization on AWR data. The size of the circle represents the amount of sample data. The circles are clustered by various database configurations. The best database configuration cluster would be CLUSTER 6, which can support higher number of concurrent processes with lower amount of centralized utilization.

 Regression Analysis

Linear regression analysis is a “predictive” technique that is good for taking data points and finding the least amount of error between the data points and a fitted line that best represents that data. I find that the best regression models for Oracle performance data are linear regressions using the least squares method for error reduction. However, there are other regression methods such as logistic regression, which uses a logarithmic functions for error reduction.

A linear regression analysis to predict how many waits will occur based on disk reads. This type of analysis is based on least square fitting techniques to get a high R-square value of accuracy. Very good for quick data analysis of a ton of scattered data points.

 Engineering Process Control

Statistical process control is very much a hit and miss proposition, and I’ll explain why. It’s very tempting to take Oracle data and run it through a control chart, calculate upper and lower control limits and say, “if it goes above or below this line, the process is out of control”. But the first question a DBA should ask him/herself is “What exactly is the process that I want to control?” Statistical Process Control is about using statistics to determine upper and lower control limits of process oriented data. Engineering Process Control is about maintaining those processes so  tolerances are not exceeded. In an Oracle database, this is very hard to do, even with the best running system, because unless you have the same amount of data and the same number of users; and have control for all other variables, you’ll have a very low probability of getting meaningful performance management. Control charts are useful for manufacturing jet engines where all the same parts are on an assembly line must be within millimeters of each other in dimensional measures. To put it simply, Oracle like many other databases, has plenty of noise that’s not worth panicking over.

An R-mean and X-mean control of SQL elapsed time in seconds. With databases, so many queries are executing, many DBAs ponder the question, “How can I determine when queries will execute a bad execution plan before the end user is impacted?” I found control charts to be problematic, because of the level of granularity in its control and high sensitivity of these types of charts.

I do, however, see the benefits of using statistical distributions for which control charts are based on. It is acceptable, with the correct amount of baseline data to create a binomial distribution that is devoid of outliers and determine when a particular query or set of queries has exceed a specific threshold. In this case, due to a binomial (standard) distribution of three standard deviations from the mean or 3. I find this helps if you have enough baseline data of good performance to form the proper distribution to compare to.

Neural Networks

Neural networks have been around for decades, but it hasn’t been until recently, that its application has been broadening in the world of data analytics and data science. Neural networks are a branch of deep learning that uses activation formula at multiple layers called Hidden Layers as weights. The more layers you have the more rigorous the computations can be. If you apply the proper inputs, connect layers and activation formulas to get the desired output, you can do things such as image processing, speech recognition, text search, object recognition, etc. It’s loosely based on how the brain learns using neurons to communicate with other neurons. Neural networks are tied closely into to artificial intelligence, which has been around for decades. What is different today is the processing power required to execute neural networks has improved dramatically, as with the availability to data, algorithms and software that do neural networks.

When it comes to data performance, I believe the verdict is still out there. Neural networks are already used in network intrusion and detection services and to monitor for DOS attacks.

I believe to predict anomalies in database performance that are bad, you really need to understand the processes and define what is bad performance. Simply unleashing neural networks on performance metrics without understanding the relationship between those inputs will undesired results.

 Conclusion

In conclusion, this has been a very high level discussion, so feel free to reach out to me and connect on LinkedIn to discuss my research. I’ve been studying this for around two years so far and I’m looking forward to writing more articles in 2019.

I believe the best way to go about using data science and machine learning in database performance tuning is the following:

  1.   Have at least five or more consecutive days of baseline performance data from which to train, test and validate your models. Whatever represents a week-in-the-life of a business.
  2. Talk with the business and understand their pain points so that you can collect the right metrics and right statistics.
  3.  Use a data analysis process that includes describing the data, building histograms, training the models, testing the models, and score different models
  4.  As an added step, use hypothesis testing, error checking or other statistical testing methods.
  5. Understand the business processes that you are monitoring so that you can select the correct metrics, variables and inputs from the database performance data and statistics.

As of this writing, I have been researching the use of Convolutional Neural Networks (CNN) popular with anomaly detection. Oracle database statistics have plenty of noise due to concurrent processes, a somewhat complex database engine and data constantly moving in and out of the SGA, PGA and buffer cache. I hope to have an update on my progress with deep learning and neural networks soon.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s