Deep Learning, Oracle Database Performance and the Future of Autonomous Databases

“The goal is to have databases in the Cloud run autonomously.  The Cloud should be about scale, elasticity, statelessness, ease of operation and interoperability.  Cloud infrastructures are about moving processes into microservices and the agile deployment of business services.  Deep Learning has the potential to give databases an innovative and powerful level of autonomy in a multitenant environment, allowing DBAs the freedom to offer expertise in system architecture and design…”.

Introduction

This article details initial research performed using deep learning algorithms to detect anomalies in Oracle performance.  It does not serve as a “deep” dive into deep learning and machine learning algorithms.  There are many very good resources available from experts on the subject matter, and I strongly recommend that those interested in learning more check out the list of references at the end of this article.  Mathematical terminology is used throughout (it’s almost impossible to avoid), but I have tried to keep the descriptions brief; it’s best that readers interested in these topics seek out the rich resources available online to get a broader view of the individual subjects.

 

In this final article on Oracle performance tuning and machine learning, I will discuss the application of deep learning models to predicting performance and detecting anomalies in Oracle.  Deep Learning is a branch of Machine Learning that uses artificial intelligence (AI) techniques to learn from data iteratively while applying optimization and minimization functions.  Applications for these techniques include natural language processing, image recognition, self-driving cars, and anomaly and fraud detection.  With the number of applications for deep learning models growing substantially in the last few years, it was only a matter of time before they found their way into relational databases.  Relational databases have become the workhorses of the IT industry and still generate massive amounts of revenue.  Many data-driven applications still use some type of relational database, even with the growth of Hadoop and NoSQL databases.  It has been a business goal of Oracle Corporation, one of the largest relational database software companies in the world, to create database services that are easier to manage, secure and operate.

As I mentioned in my previous article, Oracle Enterprise Edition has a workload data repository that it already uses to produce detailed performance and workload analysis.  Microsoft SQL Server also has a warehouse that can store performance data, but I decided to devote my research to Oracle.

For this analysis, the focus was specifically on the Oracle Program Global Area (PGA).

Oracle Program Global Area

 

The Program Global Area (PGA) is a private memory region that contains data and control information for server processes.  Each user session gets a private memory region within the PGA.  Oracle reads and writes information to the PGA based on requests from server processes.  The PGA performance metrics accessed for this article are based on Oracle Automatic Shared Memory Management (ASMM).

As a DBA, when troubleshooting PGA performance, I typically look at the PGA advisor, a set of views that collect monitoring and performance data for the PGA.  It recommends how large the PGA should be in order to fulfill process requests for private memory, based on the Cache Hit Percentage value.
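For readers who want to pull the advisor data directly, below is a minimal sketch (my own illustration, not part of the original analysis) that queries the V$PGA_TARGET_ADVICE view through the cx_Oracle driver; the connection details are placeholders for your own environment.

import cx_Oracle  # the newer python-oracledb driver is an equivalent choice

# Placeholder credentials and DSN -- substitute values for your environment.
connection = cx_Oracle.connect(user="perf_user", password="secret", dsn="dbhost/ORCLPDB1")

cursor = connection.cursor()
cursor.execute("""
    SELECT pga_target_for_estimate / 1024 / 1024 AS pga_target_mb,
           estd_pga_cache_hit_percentage,
           estd_overalloc_count
      FROM v$pga_target_advice
     ORDER BY pga_target_for_estimate""")

for target_mb, hit_pct, overalloc in cursor:
    # A target where the hit percentage plateaus and the over-allocation
    # count reaches zero is a reasonable candidate for PGA_AGGREGATE_TARGET.
    print(f"{target_mb:10.0f} MB   cache hit = {hit_pct:5.1f}%   overallocations = {overalloc}")

cursor.close()
connection.close()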

 

Methodology

 

The database was staged in a Microsoft Azure virtual machine processing large-scale data from a data generator.  Other data was compiled from public portals such as the EIA (U.S. Energy Information Administration) and PJM Interconnection, an eastern regional transmission organization.

Tools used to perform the analysis include SAS Enterprise Miner, Azure Machine Learning Studio, and the scikit-learn and TensorFlow machine learning libraries.  I focused my research on a few popular techniques that I continue to study.  These include:

  • Recurrent Neural Networks
  • Autoencoders
  • K-Nearest Neighbors
  • Naïve Bayes
  • Principal Component Analysis
  • Decision Trees
  • Support Vector Machines
  • Convolutional Neural Network
  • Random Forest

For this research into databases, I focused primarily on SVM, PCA and CNN. The first step was to look at the variable worth (the variables that had the greatest weight on the model) for data points per sample.

 

[Figure: variable worth of the PGA performance metrics]

 

 

The figure above shows the analysis of Oracle performance data on process memory within dedicated server process memory in the Program Global Area (PGA) of the database.

Once the data was collected, cleaned, imputed and partitioned, Azure ML studio was used to build two types of classifiers for anomaly detection.

 

Support Vector Machine (SVM):  Implements a one-class classifier where the training data consists of examples of only one class (normal data).  The model attempts to separate the collection of training data from the origin with maximum margin.

 

Principal Component Analysis (PCA): Creates a subspace spanned by the orthonormal eigenvectors associated with the top eigenvalues of the data covariance matrix; observations that are poorly approximated by this subspace are treated as anomalous.
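The two classifiers above were built in Azure ML Studio; the sketch below is a rough scikit-learn equivalent (my own illustration, not the original pipeline), using synthetic data as a stand-in for the cleaned PGA metrics.  The PCA variant scores each observation by its reconstruction error in the reduced subspace.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))                      # stand-in for normal PGA samples
X_test = np.vstack([rng.normal(size=(95, 8)),
                    rng.normal(loc=4.0, size=(5, 8))])   # a few injected anomalies

scaler = StandardScaler().fit(X_train)
X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)

# One-class SVM: learn a maximum-margin boundary around the normal data.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_tr)
svm_flags = ocsvm.predict(X_te) == -1                    # True = anomaly

# PCA: keep the top components, then flag points with a large
# reconstruction error (poorly explained by the normal subspace).
pca = PCA(n_components=0.95).fit(X_tr)
reconstruction = pca.inverse_transform(pca.transform(X_te))
errors = np.square(X_te - reconstruction).sum(axis=1)
pca_flags = errors > np.percentile(errors, 95)

print("SVM anomalies:", svm_flags.sum(), "PCA anomalies:", pca_flags.sum())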

 

For prediction, I compared Artificial Neural Networks and Regression models.  For Deep Learning, I researched the use of CNN specifically for anomaly detection.

 

Deep Learning and Oracle Database Performance Tuning

My article Using Machine Learning and Data Science for Performance Tuning in Oracle discusses the use of Oracle’s Automatic Workload Repository (AWR), a data warehouse which stores snapshots of views for SQL, O/S and system state, and active session history, among many other areas of system performance.  Standard data science methods require a strong understanding of business processes through qualitative and quantitative methods, cleaning data to find outliers and missing values, and applying data partitioning strategies to get better validation and scoring of models.  As a final step, the results are reviewed to determine the model’s accuracy on held-out test data.

 

Deep Learning has changed these methodologies a bit by applying artificial intelligence to the building of models.  These models learn through iterative training as data moves through hidden layers, with activation functions applied from input to output.  The hidden layers in this article are convolutional and are specific to spatial approximations such as convolution, pooling and fully connected layers (FCL).  This has opened many opportunities to automate steps that are typically manual in standard data science workflows.  Data that would once have required interpretation by a human operator can now be interpreted by deep neural networks at much higher rates than a human operator could possibly manage.

 

Deep Learning is a subset of Machine Learning which is loosely based on how neurons learn in the brain.  Neural networks have been around for decades but have only recently gained popularity in information technology for their ability to identify and classify images.  Image data has exploded with the increase in social media platforms, digital images and image data storage.  Imaging data, along with text data, has a multitude of applications in the real world, so there is no shortage of work being done in this area.  The latest surge in popularity of neural networks can be attributed to AlexNet, a deep neural network that won the ImageNet classification challenge by achieving substantially lower error rates on the ImageNet dataset.

 

With anomaly detection, the idea is to train a deep learning model to detect anomalies without overfitting the data.  As the model iterates through the layers of a deep neural network, cost functions help to determine how closely it is classifying real-world data.  The model should have no prior knowledge of the processes and should be trained iteratively on the data, with the cost functions computed from the input arrays and the activation functions of previous layers [7].

 

Anomaly detection is the process of detecting outliers in data streams such as financial transactions and network traffic.  For the purposes of this article, it is applied to deviations in system performance.

 

Predictive Analysis versus Anomaly Detection

Using predictive analytics to model targets through supervised learning techniques is most useful in planning for capacity and performing aggregated analysis of resource consumption and database performance.  For the model, we analyzed regression and neural network models to determine how well each one scored based on inputs from PGA metrics.

Predictive analysis requires cleansing of data, supervised and unsupervised classification, imputation and variable worth selection to create a model.  Most applications can be scored well with linear or logistic regression.  In the analysis of PGA performance, I found that a logistic regression model scored better than an artificial neural network in predictive ability.
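As a rough illustration of that comparison (not the original Azure ML experiment), the sketch below scores a logistic regression model against a small feed-forward neural network in scikit-learn, using a synthetic stand-in for labeled PGA metrics.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for labeled PGA metrics (1 = degraded performance).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42).fit(X_tr, y_tr)

for name, model in [("logistic regression", logit), ("neural network", ann)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")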

 


 

In my previous article, I mentioned the role that machine learning and data science can play in Oracle performance data.

  1. Capacity Planning and IT Asset Planning.
  2. Performance Management
  3. Business Process Analysis

The fourth application for data science and machine learning in Oracle is anomaly detection, which means applying artificial intelligence to the training of algorithms more commonly used in image recognition, language processing and credit fraud detection.  It is also a potentially less efficient way of detecting performance problems in Oracle.  Chasing accuracy in the algorithm presents a risk in itself, since such models can lead to the overfitting and high dimensionality that you want to avoid in deep neural networks.  Getting accuracy that is comparable to what a human operator can do works better, because you don’t want the process to overthink things.  The result of an overfitted model is a lot of false positives.  You want the most accurate signs of an anomaly, not a model that is oversensitive.  Deep learning techniques also consume intense resources to generate output from a neural network, and most business-scale applications require GPUs to build models efficiently.

Convolutional Neural Networks

 

Convolutional Neural Networks (CNN) are designed for high-dimensional data such as images and signals.  They are used for computer vision as well as network intrusion detection and anomaly detection.  Oracle performance data is stored as normal text (ASCII) and contains metrics with very different ranges, such as seconds versus bytes of memory.  Using a mathematical normalization formula, the text data can be converted into vector arrays that can be mapped, pooled and compressed.  Convolutional Neural Networks are good at distinguishing features in an image matrix, and computationally it is efficient to represent images as multi-dimensional arrays.

 

The first step is to normalize the PGA data, which contains multiple scales and features.  Below is a sample of the data.

[Sample of the raw PGA performance data]

 

Normalizing the data can be done with the following formula[8]:

[Normalization formula]

 

The second step is to convert this data into an image-like format.  This requires building a multidimensional array of all the features.  The array can be filtered by removing small variances and nonlinear features to generate an overall neutral vector.  The goal is to normalize the data and arrange it into a multidimensional array.
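As a hedged illustration of these two steps (the exact formula is the one in the figure above; min-max scaling, as used in the approach of [8], is a common choice), the sketch below normalizes a stand-in metric matrix column by column and reshapes each sample into a small 2-D array that convolution and pooling layers can operate on.

import numpy as np

rng = np.random.default_rng(1)
pga_metrics = rng.random((1024, 16))   # stand-in for raw samples x features

# Min-max normalization, column by column: x' = (x - min) / (max - min).
col_min = pga_metrics.min(axis=0)
col_max = pga_metrics.max(axis=0)
normalized = (pga_metrics - col_min) / (col_max - col_min + 1e-12)

# Reshape each sample's 16 features into a 4 x 4 single-channel "image"
# so that convolution and pooling layers can operate on it.
images = normalized.reshape(-1, 4, 4, 1)
print(images.shape)   # (1024, 4, 4, 1)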

 

CNNs are often used to classify the MNIST data, a dataset of handwritten digits.  It contains 60,000 training images and 10,000 testing images.  Researchers have used CNNs to achieve an error rate on the MNIST data of less than 1%.

 

Convolutional Neural Networks have five basic components: an input layer, convolution layers, pooling layers, fully connected layers and an output layer.  Below is a visual of how a CNN works to recognize an image of a bird versus an image of a cat.

 

[Figure: CNN layers classifying an image of a bird versus an image of a cat]

The activation function used is the popular rectified linear unit (ReLU), which is typically used for CNNs.  Other popular activation functions include the logistic sigmoid and the hyperbolic tangent.  ReLU is defined as y = x for positive values and y = 0 for negative values.  It works well as an activation function for CNNs due to its simplicity and because it reduces the time it takes to iterate through the neural network.
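To make the five components and the ReLU activation concrete, here is a minimal Keras sketch (my own example, not the model used in this research) that trains a small CNN on the MNIST digits mentioned above.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load the MNIST handwritten digits and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # input layer
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution layer with ReLU
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # output layer (10 digit classes)
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))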

 

 

 

Comparing Support Vector Machines (SVM) and Principal Component Analysis (PCA)

Support Vector Machines (SVM) are good for finding large-margin classifications and identifying vectors of data that are related.  The nice thing about SVM is that it has features for dealing with outliers built into it.  Support Vector Machines are a feature-rich supervised machine learning technique used for classifying observations by their coordinates.  I compared SVM with principal component analysis (PCA), which approximates the data in a lower-dimensional space.  PCA creates subspaces spanned by the orthonormal eigenvectors associated with the top eigenvalues of the data covariance matrix.  PCA-based methods help remove redundancy and reduce the dimensionality that is persistent in performance data.  Once the data was split into training and testing sets, we used SVM and PCA to optimize multiple dimensions in the data.

 

 

Evaluation of Machine Learning Models for Oracle Workloads

For this test, we compared regression models and artificial neural networks (ANN).  Deep learning of patterns concerned with anomalies within a database requires AI-style learning techniques.  Finding the correct classifier for performance metrics to improve the accuracy of an Oracle anomaly detection system can involve ANN, naive Bayes, k-nearest neighbors and other general-purpose algorithms.

 

There are several evaluation metrics that can be used when assessing anomaly detection models:

 

  • ROC curve
  • Area under the ROC curve (AUC)
  • Precision-Recall Curve
  • Mean average precision (mAP)
  • Accuracy of classification

 

Below is an ROC chart used to score the PCA and SVM models.  ROC charts plot the false positive rate against the true positive rate.  When comparing the PCA and the SVM model, PCA had a higher true positive rate.

[Figure: ROC curves for the PCA and SVM models]
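The chart itself was produced in Azure ML Studio; a comparable plot can be generated with scikit-learn as sketched below, where the labels and per-model anomaly scores are synthetic stand-ins.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 500)                      # stand-in labels: 1 = anomaly, 0 = normal
pca_scores = 0.6 * y_true + 0.8 * rng.random(500)     # stand-in anomaly scores per model
svm_scores = 0.4 * y_true + rng.random(500)

for name, scores in [("PCA", pca_scores), ("One-class SVM", svm_scores)]:
    fpr, tpr, _ = roc_curve(y_true, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_true, scores):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()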

Summary:  The Future of Autonomous Databases

Oracle has released its first deep learning database, marketed as “the world’s first self-driving database”.  Oracle has announced 18c as a new autonomous database that requires no human labor for daily operational tasks, provides more security, and automates most database processes.  The database will self-tune, self-upgrade and self-patch – all while maintaining 99.995% availability with machine learning.  For many companies, especially those working on cloud and PaaS infrastructures, this will mean lower costs.  With Exadata, this would include compression techniques that add further benefits to very large and enterprise-level workloads.

 

Will there be more databases that are completely run by artificial intelligence and deep learning algorithms?  As a DBA, maintaining a database can be arduous, but many of my DBA colleagues enjoy the respect and prestige of database management and database tuning.  With the role of the DBA evolving rapidly, autonomous databases may give DBAs the freedom to focus on database design and development for corporate teams.

 

It remains to be seen whether databases as a service (DBaaS) will reach full autonomy, though it is bound to happen before automobiles become level 5 autonomous.  Selecting the service on such a platform could require only minimal configuration – and you’re done; everything else is taken care of.  There would be no operator, either in the hosted environment or on premises, nor would anyone ever touch the database for any reason except application and software development.

 

In summary, this is a very high-level article on techniques for using deep learning and machine learning on Oracle performance data.  I hope that this cursory introduction will inspire DBAs and operators to do their own research and apply it to their toolbox.

 

References

 

1. http://deeplearning.net/reading-list/
2. https://www.analyticsvidhya.com/
3. http://www.kdnuggets.com/
4. http://www.ieee.org/
5. https://www.computer.org/
6. https://www.udacity.com/course/deep-learning-nanodegree--nd101
7. https://www.fast.ai
8. Kehe Wu; Zuge Chen; Wei Li.  “A Novel Intrusion Detection Model for Massive Network Using Convolutional Neural Networks.”  IEEE Access.  Received July 29, 2018.
9. Naseer, Sheraz; Saleem, Yasir; Khalid, Shezad; Bashir, Muhammad Khawar; Jihun Han; Iqbal, Muhammad Munwar; Kijun Han.  “Enhanced Network Anomaly Detection Based on Deep Neural Networks.”  IEEE Access.  Received June 3, 2018; accepted July 16, 2018.
10. Dr. Adrian Rosebrock.  https://www.pyimagesearch.com
11. U.S. Energy Information Administration.  https://www.eia.gov/
12. PJM Interconnection.  https://www.pjm.com/markets-and-operations.aspx
13. Oracle Corporation.  https://www.oracle.com/index.html

My Favorite Publicly Available Datasets

I’ve been working with data for decades, searching for insights, converting it, managing it, and now performing data analytics. We have access to unbelievable treasure troves of public data to analyze.  Many of the blogs I write are based on these datasets, as I don’t have access to large computing systems.  Here is a list of my favorite publicly available datasets.  Enjoy!

  1. PJM Interconnection Data Dictionary for electrical grids, distribution and transmission.  https://www.pjm.com/markets-and-operations/data-dictionary.aspx
  2. University of California Irvine (UCI) has a huge machine learning repository to practice techniques.  This repository can be accessed at archive.ics.uci.edu/ml/index.php
  3. Amazon Web Services datasets are available to the public.  https://aws.amazon.com/datasets/.
  4. Kaggle is a data science competition website that rewards prizes to teams for the best ML models. Datasets are located at https://www.kaggle.com/datasets
  5. University of Michigan Sentiment Data.
  6. FRED (Federal Reserve Economic Data) time series repositories are located at https://fred.stlouisfed.org/categories.
  7. Canadian Institute of Cyber Security. https://www.unb.ca/cic/datasets/nsl.html.
  8. Datasets for “The Elements of Statistical Learning”.  https://web.stanford.edu/~hastie/ElemStatLearn/.
  9. Government Open Data Portal.  https://data.gov

Using Machine Learning and Data Science for Performance Tuning in Oracle

Like many DBAs, I have experienced the ups and downs of database performance tuning. Many battles have been fought to tune and maintain an Oracle database, yielding mostly victories but also a few hard-fought lessons learned.  Performance tuning is a constant in my job and I’ve learned a lot; but for every early morning phone call I’ve received, every severity one conference call that I’ve been on, and every emergency patch that I’ve had to deploy, there is one mantra that has been etched into my mind: database performance tuning is as much an art as it is a skill.  There is no easy one-size-fits-all solution. It requires an understanding of various architectures outside the database itself as well as deep knowledge of Oracle internals.

Data Science and Machine Learning

The popularity of data science has opened new possibilities with database performance tuning. Along with building innovative products it has created excitement in areas such as the Internet of Things and cloud computing where very large volumes of data are mined for value.  For those who are new to the concept of Machine Learning, essentially it is defined as follows:

Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events. An example would be a spam filter learning new exploits and then blocking those exploits.

Data scientists have a lot in common with database professionals, such as building extraction-transformation-loading routines, constructing business intelligence applications, and data wrangling. The difference is data science also provides more qualitative functionality and programming for data and business analytics. In many respects, the use of analytical and descriptive statistics has always been a tool for DBAs to manage performance by looking at averages and variances in performance over time. But what essentially has become an easy mathematical tool has expanded into more advanced analytics.

Many DBAs are familiar with the Oracle Diagnostic Pack (by the way, I’m covering Oracle versions 10g, 11g and 12c).  This feature provides the Automatic Workload Repository (AWR), which essentially stores system performance data for everything from active session history to system, segment and O/S statistics. It also stores information on shared pool statistics and query plan execution statistics. It’s a massive repository used to build AWR reports and can be used to provide historical trending data. Typically, the default is to store this data for 15 days, with system performance data sampled every hour. But the sample rate can be as fine as 15 minutes, with historical data stored for entire months. Anyone who has done performance tuning with Oracle SQL has used v$, x$ or DBA_HIST views to troubleshoot or address performance issues in Oracle. The nice thing about the AWR is that you can write your own queries against it. By extension, you can also write PL/SQL procedures

to mine the data and build predictive and prescriptive statistical models. Below is an example of a query that builds a pivot table of AWR tables based on process and memory stats.

-- Average O/S statistics from the AWR, pivoted by snapshot and joined to
-- PL/SQL and SQL process memory statistics from the same snapshots.
SELECT jn.snap_id,
       AVG(jn.busy_time)             AS busy_time,
       AVG(jn.avg_busy_time)         AS avg_busy_time,
       AVG(jn.idle_time)             AS idle_time,
       AVG(jn.avg_idle_time)         AS avg_idle_time,
       AVG(jn.num_cpu_cores)         AS num_cpu_cores,
       AVG(jn.num_cpus)              AS num_cpus,
       AVG(jn.num_vcpus)             AS num_vcpus,
       AVG(jn.num_lcpus)             AS num_lcpus,
       AVG(jn.vm_out_bytes)          AS vm_out_bytes,
       AVG(jn.avg_user_time)         AS avg_user_time,
       AVG(jn.avg_sys_time)          AS avg_sys_time,
       AVG(jn.os_cpu_wait_time)      AS os_cpu_wait_time,
       AVG(jn.iowait_time)           AS iowait_time,
       AVG(jn.avg_iowait_time)       AS avg_iowait_time,
       AVG(jn.physical_memory_bytes) AS physical_memory_bytes,
       AVG(pc.num_processes)         AS num_processes
FROM (
       SELECT dbid, snap_id, stat_name, value
       FROM   dba_hist_snapshot NATURAL JOIN dba_hist_osstat
     )
PIVOT (
       AVG(value)
       FOR stat_name IN ('BUSY_TIME'             AS busy_time,
                         'AVG_BUSY_TIME'         AS avg_busy_time,
                         'IDLE_TIME'             AS idle_time,
                         'AVG_IDLE_TIME'         AS avg_idle_time,
                         'NUM_CPU_CORES'         AS num_cpu_cores,
                         'NUM_CPUS'              AS num_cpus,
                         'NUM_VCPUS'             AS num_vcpus,
                         'NUM_LCPUS'             AS num_lcpus,
                         'VM_OUT_BYTES'          AS vm_out_bytes,
                         'AVG_USER_TIME'         AS avg_user_time,
                         'AVG_SYS_TIME'          AS avg_sys_time,
                         'OS_CPU_WAIT_TIME'      AS os_cpu_wait_time,
                         'IOWAIT_TIME'           AS iowait_time,
                         'AVG_IOWAIT_TIME'       AS avg_iowait_time,
                         'PHYSICAL_MEMORY_BYTES' AS physical_memory_bytes)
     ) jn
JOIN dba_hist_process_mem_summary pc
  ON jn.dbid = pc.dbid
 AND jn.snap_id = pc.snap_id
WHERE pc.category IN ('PL/SQL', 'SQL')
GROUP BY jn.snap_id;

I first became interested in using statistical analysis in performance tuning after reading a paper entitled An Industrial Engineer’s Approach to Managing Oracle Databases by Robyn Sand. It described the use of statistical methods and engineering process control in DB performance tuning. With the advent of Big Data analytics, data science and machine learning, there are rich opportunities to gain meaningful insight into performance management in Oracle; but there also must be an abundance of caution.

Beyond descriptive statistics, the use of data science and machine learning would, in theory, allow me to “predict” potential detrimental performance in Oracle. Using historical data collected in the AWR, there are literally thousands of possible insights that can be gleaned about Oracle RDBMS performance.

Here were some of the successes, pitfalls and lessons I learned…

 Histograms, Standard Deviations, Distributions and other statistical methods

Oracle has a multitude of built-in analytical and grouping functions to support dimensional data structures and perform data mining techniques. These built-in aggregation features are great for visualizing large volumes of performance data on charts, dashboards and graphs. I estimate that for 98% of all performance tuning needs, these functions will work just fine.

The image above is a histogram of the number of concurrent processes by utilization. The mean appears to be around 300 concurrent processes, which generate around a 9% average utilization in an Oracle database. The second axis is the mean utilization at the frequency of 300 concurrent processes.

There are three main applications for which data science and machine learning can be applied to Oracle database management.

  1.  Capacity Planning and IT Asset Planning.
  2.  Performance Management
  3.  Business Process Analysis

I’d like to put some added emphasis on Business Process Analysis (3). It doesn’t do any good to present data analysis that hasn’t been qualitatively reviewed by stakeholders who know the business value of their IT assets; meaning, before making a decision on whether to purchase new hardware, or invest real dollars based on any type of quantitative analysis, the results must be presented in a way for the business stakeholders to make a sound final decision. As a database or IT professional, it is our job to present quantitative analysis for all stakeholders to make proper business decisions. We must also have anecdotal, near-real time and historical evidence to provide an unbiased objective analysis. Business decisions should never be made on data analysis alone.

 Unsupervised Learning and Clustering Analysis

There are two subdivisions of machine learning: supervised and unsupervised learning. Supervised learning requires guidance from a programmer, who creates training, testing and validation sets of data to build analytical models and then assesses and scores those models against the expected predictive results from the input data. Unsupervised learning is built on inferences from the data itself: how specific data points and variables relate to each other. Cluster analysis is an example of an unsupervised learning technique. Clustering is popular in the analysis of demographic information (age, sex, height, race, location) and in segmenting customers according to how likely they are to purchase a new type of product. If you are looking for a way to cluster database properties or system configurations by specific inputs, clustering analysis can be a great tool. It can be used to build specific capacity planning algorithms based on user inputs.

[Figure: clustering of database configurations by concurrent processes and utilization]

Clustering model of the number of processes per utilization on AWR data. The size of each circle represents the amount of sample data, and the circles are clustered by various database configurations. The best database configuration cluster would be CLUSTER 6, which can support a higher number of concurrent processes with a lower amount of centralized utilization.
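A rough scikit-learn sketch of this kind of analysis is shown below (my own illustration with synthetic stand-in features, not the model behind the figure): k-means groups configurations by concurrent processes, utilization and memory, and the cluster profiles can then be compared.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Stand-in for AWR-derived features per sampled configuration:
# [concurrent processes, CPU utilization %, PGA allocated (GB)]
X = np.column_stack([
    rng.integers(50, 600, 300).astype(float),
    rng.uniform(2, 40, 300),
    rng.uniform(1, 32, 300),
])

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_scaled)

# Compare each cluster's average profile to spot configurations that
# sustain many concurrent processes at comparatively low utilization.
for k in range(6):
    members = X[kmeans.labels_ == k]
    print(f"cluster {k}: n={len(members):3d}  "
          f"avg processes={members[:, 0].mean():6.0f}  "
          f"avg utilization={members[:, 1].mean():5.1f}%")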

 Regression Analysis

Linear regression analysis is a “predictive” technique that is good for taking data points and finding the least amount of error between the data points and a fitted line that best represents the data.  I find that the best regression models for Oracle performance data are linear regressions using the least squares method for error reduction.  However, there are other regression methods, such as logistic regression, which uses a logistic (sigmoid) function and is better suited to categorical outcomes.

[Figure: linear regression of waits versus disk reads]

A linear regression analysis to predict how many waits will occur based on disk reads. This type of analysis is based on least squares fitting to get a high R-squared value of accuracy. It is very good for quick analysis of a large number of scattered data points.
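A minimal scikit-learn version of this fit is sketched below, using synthetic numbers in place of the AWR-derived wait and disk read metrics.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
disk_reads = rng.uniform(1e3, 1e6, 400)                # stand-in AWR metric
waits = 0.002 * disk_reads + rng.normal(0, 300, 400)   # noisy linear relationship

X = disk_reads.reshape(-1, 1)
model = LinearRegression().fit(X, waits)
r2 = r2_score(waits, model.predict(X))

print(f"slope={model.coef_[0]:.5f}  intercept={model.intercept_:.1f}  R^2={r2:.3f}")
print("predicted waits at 500,000 disk reads:", model.predict([[500_000]])[0])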

 Engineering Process Control

Statistical process control is very much a hit-and-miss proposition, and I’ll explain why. It’s very tempting to take Oracle data, run it through a control chart, calculate upper and lower control limits and say, “if it goes above or below this line, the process is out of control”. But the first question a DBA should ask is “What exactly is the process that I want to control?” Statistical process control is about using statistics to determine upper and lower control limits for process-oriented data. Engineering process control is about maintaining those processes so that tolerances are not exceeded. In an Oracle database, this is very hard to do, even with the best-running system, because unless you have the same amount of data, the same number of users, and control over all other variables, you’ll have a very low probability of getting meaningful performance management. Control charts are useful for manufacturing jet engines, where identical parts on an assembly line must be within millimeters of each other in dimensional measures. To put it simply, Oracle, like many other databases, has plenty of noise that’s not worth panicking over.

[Figure: X-bar and R control charts of SQL elapsed time]

An X-bar and R control chart of SQL elapsed time in seconds. With so many queries executing in a database, many DBAs ponder the question, “How can I determine when queries will execute a bad execution plan before the end user is impacted?” I found control charts to be problematic because of the level of granularity in their control limits and the high sensitivity of these types of charts.

I do, however, see the benefits of using the statistical distributions on which control charts are based. With the correct amount of baseline data, it is acceptable to build a distribution that is devoid of outliers and determine when a particular query or set of queries has exceeded a specific threshold – in this case, a normal (standard) distribution with a threshold of three standard deviations from the mean (3σ). I find this works if you have enough baseline data of good performance to form the proper distribution to compare against.
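A minimal sketch of that three-sigma rule is shown below, using synthetic elapsed times in place of a real baseline of query executions.

import numpy as np

rng = np.random.default_rng(5)
baseline_elapsed = rng.normal(2.0, 0.4, 5000)    # stand-in: a week of "good" runs (seconds)

mu, sigma = baseline_elapsed.mean(), baseline_elapsed.std()
threshold = mu + 3 * sigma                       # the three-standard-deviation rule

new_runs = np.array([1.9, 2.3, 2.8, 4.1, 2.1])   # latest executions to check
flagged = new_runs > threshold
print(f"threshold = {threshold:.2f}s, flagged runs: {new_runs[flagged]}")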


Neural Networks

Neural networks have been around for decades, but it wasn’t until recently that their application broadened in the world of data analytics and data science. Neural networks are the foundation of deep learning, applying activation functions at multiple layers, called hidden layers, whose connections carry learned weights. The more layers you have, the more rigorous the computations can be. If you apply the proper inputs, connected layers and activation functions to get the desired output, you can do things such as image processing, speech recognition, text search, object recognition, etc. The approach is loosely based on how the brain learns, using neurons to communicate with other neurons. Neural networks are tied closely to artificial intelligence, which has been around for decades. What is different today is that the processing power required to execute neural networks has improved dramatically, as has the availability of data, algorithms and software for building them.

When it comes to database performance, I believe the verdict is still out. Neural networks are already used in network intrusion detection services and to monitor for DoS attacks.

I believe that to predict anomalies in database performance, you really need to understand the processes and define what bad performance is. Simply unleashing neural networks on performance metrics without understanding the relationships between those inputs will produce undesired results.

 Conclusion

In conclusion, this has been a very high level discussion, so feel free to reach out to me and connect on LinkedIn to discuss my research. I’ve been studying this for around two years so far and I’m looking forward to writing more articles in 2019.

I believe the best way to go about using data science and machine learning in database performance tuning is the following:

  1.   Have at least five or more consecutive days of baseline performance data from which to train, test and validate your models. Whatever represents a week-in-the-life of a business.
  2. Talk with the business and understand their pain points so that you can collect the right metrics and right statistics.
  3.  Use a data analysis process that includes describing the data, building histograms, training the models, testing the models, and scoring different models
  4.  As an added step, use hypothesis testing, error checking or other statistical testing methods.
  5. Understand the business processes that you are monitoring so that you can select the correct metrics, variables and inputs from the database performance data and statistics.

As of this writing, I have been researching the use of Convolutional Neural Networks (CNN), which are popular for anomaly detection. Oracle database statistics have plenty of noise due to concurrent processes, a somewhat complex database engine and data constantly moving in and out of the SGA, PGA and buffer cache. I hope to have an update on my progress with deep learning and neural networks soon.

 

Campaign Management using Advanced Analytics

Campaign management is a strategy that uses marketing campaigns to create sales and leads.  The Internet has been a treasure trove of consumer behavior for decades, but it is only recently that web analytics has become a tool for creating powerful business insight.  Two popular strategies deal specifically with pattern recognition: customer segmentation (clustering) and market basket analysis (association).

This project helps determine how consumer Internet behavior analysis can be used in marketing strategies based on event time, web page views, real-time social media feeds and other information constantly being tracked through agent software, social websites, and web traffic logs. To determine the usefulness of such a strategy in business decisions about media campaign ads, my project team at the University of North Carolina at Greensboro (Chi-Squared) collected a Twitter feed using a Twitter developer account and simulated click stream information based on real-world content management metadata to create an association market basket, cluster models, and an eventual regression model.  The goal was to demonstrate the use of internet data analytics (web analytics), using popular analytical methods, to predict how revenue streams can be determined.

The project was developed to determine the qualitative value of user online behavior and patterns to help business leaders make decisions about campaign ads and campaign management.

Click stream data from websites and feeds, and the accessibility of more powerful analytical tools, have driven analytical methods such as forecasting and search engine optimization in retail markets.  Popular and powerful MapReduce-style databases such as Hadoop and MongoDB are opening up a world of possibilities in the area of web analytics. Web analytics is a subset of business analytics and a form of data analysis acquired through web server logs, programs, service agents and interfaces, mainly collected in real time on potentially millions or billions of events.  The massive array of internet traffic that captures this data, such as the number of unique visitors on a website and social media feeds from Twitter, has promoted further support of web semantics by the World Wide Web Consortium and has created new services such as Google Analytics and Amazon Web Services. “Big Data” web analytics is an area that will continue to create a wealth of opportunities for corporate decision-makers.

Content campaigning is a powerful tool in the hands of marketing professionals. Web and mobile content media and catalog metadata are crucial to provider revenue streams.  Simply put, legal digital movie and music downloads represent the main revenue stream for retail portals.  Predictive analytics allows internet companies to gain insight into what customers are likely to purchase and also to determine what content is likely to become more popular on social media on a particular day or period of time.  Market basket and association analysis help to create campaigns for new content that unique visitors and customers are likely to be interested in and (ideally) purchase.  It is a very relevant topic in the growing business practice of understanding media content sales on the internet and how human and machine event tracking can play a role in generating revenue streams.  Information from the web will drive future campaigns and revenue streams.  This project serves as a way to demonstrate basic analytical methods for generating successful ad campaigns.

In the project it was observed that associations are best for determining market advertising on websites.  Click stream behavior is a very good way of developing implied rules for what type of content visitors would like to see on a particular website. It was observed that analysis of click stream data (navigational data) underlies the recommender systems of many online businesses. These recommender systems provide benefits for making business decisions which can be used to generate revenue for businesses. By analyzing Twitter data it was observed that targeting specific segments of followers on Twitter can increase the campaign success of music albums, songs or artists. This information provides valuable insights for generic ad campaigns.  It was determined that the number of followers an account had is a better predictor of whether and how often an artist is mentioned than the number of friends and listings, which was much more sporadic.  It was also determined that creating segments of followership improved the regression model by reducing the potentially massive number of outliers and focusing on the majority of accounts rather than on just a few accounts with very large followings.

As Big Data grows into the main data source today, one data type commonly attributed to big data is the clickstream from ad banners and other media files on a web page. Philip Russom states in Big Data Analysis: “One of the things that makes big data really big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources, including logs, clickstreams, and social media. User organizations have been collecting Web data for years. However, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain applications, text data from call center applications, semi structured data from various business-to-business processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data instead of merely hoarding it. The few organizations that have been analyzing this data now do so at a more complex and sophisticated level. Big data isn’t new, but the effective analytical leveraging of big data is.”

In her research on click stream analysis, Sule Gündüz explains how a web page prediction model can be based on a click-stream tree representation of user behavior. She demonstrates that predicting the next request of a user as she visits Web pages has gained importance as Web-based activity increases. Markov models and their variations, or models based on sequence mining, have been found well suited for this problem. However, higher order Markov models are extremely complicated due to their large number of states, whereas lower order Markov models do not capture the entire behavior of a user in a session. The models that are based on sequential pattern mining only consider the frequent sequences in the data set, making it difficult to predict the next request following a page that is not in the sequential pattern. Furthermore, it is hard to find models for mining two different kinds of information of a user session. She proposes a new model that considers both the order information of pages in a session and the time spent on them. She also clusters user sessions based on their pair-wise similarity and represents the resulting clusters by a click-stream tree. The new user session is then assigned to a cluster based on a similarity measure. The click-stream tree of that cluster is used to generate the recommendation set. The model can be used as part of a cache prefetching system as well as a recommendation model.

Satya Menon & Dilip Soman also mentions about the prediction model from the clickstream data in their article, “Managing the Power of Curiosity for Effective Web Advertising Strategies” that investigates the effect of curiosity on the effectiveness of Internet advertising. In particular, they identify processes that underlie curiosity resolution and study its impact on consumer motivation and learning. The dataset from our simulated Internet experiment includes process tracking variables (i.e., click stream data from ad-embedded links), traditional attitude and behavioral intention measures, and open-ended protocols. They find that a curiosity-generating advertising strategy increases interest and learning relative to a strategy that provides detailed product information. Furthermore, though curiosity does not dramatically increase the observed quantity of search in our study, it seems to improve the quality of search substantially (i.e., time spent and attention devoted to specific information), resulting in better and more focused memory and comprehension of new product information. To enhance the effectiveness of Internet advertising of new products, we recommend a curiosity advertising strategy based on four elements: (1) curiosity generation by highlighting a gap in extant knowledge, (2) the presence of a hint to guide elaboration for curiosity resolution, (3) sufficient time to try and resolve curiosity as well as the assurance of curiosity-resolving information, and (4) the use of measures of consumer elaboration and learning to gauge advertising effectiveness. As they mention about the curiosity, actually all the curiosity statistics comes from the click stream data from the ad-embedded links. Essentially, their assumption of the prediction model comes from the click stream data.

The user’s behavior is very important to track when building a prediction model for a winning ad campaign. Randolph Bucklin and Catarina Sismeiro use the clickstream data recorded in Web server log files to develop and estimate a model of the browsing behavior of visitors to a Web site. Two basic aspects of browsing behavior are examined: the visitor’s decision to continue browsing (by submitting an additional page request) or to exit the site, and the length of time spent viewing each page. They propose a type II model that captures both aspects of browsing behavior and handles the limitations of server log-file data. In their article, they note that part of the log-file data comes from the click stream, so they also use the click stream as the data for their prediction model.

Overall, many other researchers use click stream data to build their prediction models; in the same way, we use click stream data and the Twitter feed to predict ad campaign performance in the real world.

Data-driven analysis has become important for decision makers.  It helps improve productivity and operations.  Most data-driven companies, especially those that exist on the internet, are in a position to optimize their product revenue through their main channel or portal on the internet.  With standardization of internet protocols and web services, it’s now much more relevant to use search engines, page views and click streams to gauge possible new revenue streams.  In the competitive world of data analytics, the internet is presently ground zero.  Standard approaches such as customer-driven innovation (CDI) have created techniques such as search engine optimization and social media analytics.  Anywhere there is a prevalence of information generated by humans or machines, innovations are spawned to capitalize on it.  What’s even more encouraging is that data is becoming much more accessible and open.  With Google Analytics, anyone can have access to click stream data from their own website for free.  Twitter and other social media data is available for any developer to capture and analyze.  Cloud services such as Microsoft Azure, Google Cloud Services and Amazon Web Services allow developers to build data analytics free of charge depending on the level of service, the type of applications accessed and the amount of data to analyze.  Websites such as http://data.gov provide tens of thousands of datasets free of charge.  Public policy considerations have provided new opportunities for science, government and citizens.  It has literally become a data-driven world in which anyone has access to the world’s data.  It is no longer a question of whether web and data analytics are necessary to be competitive.  According to William McComb, CEO of Fifth & Pacific, an upscale brand company, regarding branding and ads: “…the girl…that we want to target is even more digitally obsessed and lives her life on mobile devices.  We believe that we’re at a point, the economy is at a point, and the consumer has evolved to a point where she doesn’t need to have a physical store.”  It is this economy that is driven by data analysis.

The data source for these models consists of live Twitter feed data collected from a Twitter social media account as a JSON file.  This data was parsed for popular content (specifically for artist names) and recorded based on the number of occurrences, the time, and the actual tweet text that contains the target value.

To marketers, Twitter is significant for determining the popularity of a brand based on tweets (also known as brand tweets) and the amount of follower engagement, measured by the number of followers per person.  For many companies, this can translate into clicks on their Twitter bit.ly link or more followers added, and it is part of the engagement breakdown.  The engagement breakdown measures the number of replies, tweets, re-tweets, mentions and favorites.  The click stream data in this project is engineered using seeded catalog metadata specific to music media and simulated web logs of the kind companies use to collect user page view behavior on websites.  There are four main ways of capturing click stream data: Apache web logs, web beacons, JavaScript tags and packet sniffing.
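As a small illustration of the first of those four capture methods (my own sketch; the log line and field names are made up), the snippet below parses Apache combined-format log lines into clickstream events and counts page views, which can then feed the market basket and clustering models.

import re
from collections import Counter

# Apache "combined" log format: client, timestamp, request, status, bytes, referrer, agent.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

sample = ('10.0.0.7 - - [12/Mar/2019:10:15:32 -0500] '
          '"GET /catalog/album/123 HTTP/1.1" 200 5120 '
          '"https://twitter.com/" "Mozilla/5.0"')

page_views = Counter()
for line in [sample]:                          # in practice, iterate over the log file
    match = LOG_PATTERN.match(line)
    if match:
        page_views[match.group("page")] += 1   # one clickstream event per request

print(page_views.most_common(5))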

Nowadays, various companies have realized the potential of online ad campaigns in effectively increasing their customer base and bringing in more profits. Typically most of the companies have limited budgets set aside for their ad campaigns and naturally the budgets can only pay for a limited number of ads in a given time period. Thus the major challenge is to make this limited number of ads very effective by targeting them as much as possible to the right segment of people. Click stream data is a valuable resource in this direction. It has the potential to give companies valuable insights on knowing the right set of people to target their ads towards and thus get maximum profits out of their ad campaigns. In the same way, content providers can also use click stream data analysis to appropriately target the ads of their customers. Twitter feed data analysis also works in a similar vein. While click stream data analysis analyses data collected from behavioral habits of web users, twitter feed data analysis more specifically concentrates on analyzing data from users twitter feeds. The analysis we have done through click stream data and twitter feed analysis in this project gave us valuable insights that can be used by companies to more effectively target their ads.

Team members and authors:  Derek Moore,  Ashwini Wani, Junyi Hu, Ramya Nimmagadda.
Class:  ISM 678 Business Analytics
Instructor:  Dr. Hamid Nemati, Department of Information Systems and Operation Management, Bryan School of Business and Economics
University of North Carolina at Greensboro

Example of an Entry-Level Data Scientist Job Post

The following is from a posting from a Data Science Intern for Lenovo.  A good example of what to expect with a typical Data Science position.

The Data Scientist Intern will be expected to perform data mining, statistical learning, predictive modeling, mathematical and simulation modeling, forecasting, and data visualization in support of key strategic projects. This position is dedicated to performing best-in-class data management and business analytics. Key responsibilities include:

  • Participate in projects, tasks, and activities related to data integration, data cleaning, descriptive analyses, exploratory analyses, predictive modeling, data mining, text analytics, rapid prototyping, and data visualization
  • Collaborate with colleagues throughout the business to collect, store, access, and analyze data from a variety of sources.
  • Assist in developing static and interactive data visualizations
  • Develop predictive models and simulations using a variety of software and tools
  • Leverage data / big data to discover patterns and solve strategic analytic business problems using both structured and unstructured data sets across many environments
  • Develop analytic capabilities that drive better outcomes for both customers and the company, informing business decisions across a broad range of functions.

Position Requirements

The Data Scientist Intern should have the experience and skills needed to successfully execute the key position objectives. Requirements include:

  • Motivated self-starter with a desire to develop solutions for the data analytics space using cutting edge computing technology
  • Experience with analytic projects and programs
  • Strong organizational and communication skills, the ability to work in a collaborative environment, and a desire to improve skills are essential
  • Ability to extract, merge and analyze data from a wide variety of sources (e.g., relational databases, text and unstructured files, sensor data, image and video files).
  • Ability to quickly and easily learn new open source software
  • Experience with static and/or interactive data visualization methods such as Qlik or Tableau
  • Experience with SQL and NoSQL databases, including any of these: MySQL, PostgreSQL, SQLite, MongoDB, and Neo4j.
  • Experience with techniques and technologies for accessing and analyzing “big data” using Hadoop, Kafka, Spark, Cassandra, Splunk, etc.
  • Programming experience in Java, R, Python, Hive, Pig, etc
  • Experience with machine learning or AI tools such as Mahout, Weka, etc.

Data Science Project tools for 2022

Software that will be used

Azure IoT Hub

Python

R Studio

Azure ML Studio

SAS Enterprise Miner

Python SciKit-Learn for Machine Learning (ML)

Python SciPy Numpy for data analytics (DA)

Python Parallel Processing and Distributed Computing

Hardware that will be used

Raspberry Pi 3

Arduino and Sensors

Mathematical Modeling techniques

PCA

SVM

Cluster Analysis

Deep Learning/NN

Data Pipelines

Kubernetes

Scala/Spark

Azure Events Hub

Kafka

Comparing Statistical Control and Machine Learning Models in Evaluating Performance in IoT Systems

Excerpt of a submission to the Southern Data Science Conference that will be held in Atlanta, GA.  This proposal is the first in a series of proposals in IoT and analytics research that will be posted on Data Flux.  For more info go to https://www.southerndatascience.com/submission-guidline

SUMMARY

Internet-enabled devices and the Internet of Things (IoT) will continue to become a major component of networked computing systems. Such systems leverage “big data” processes that collect, clean, analyze and model large data streams.  This project demonstrates techniques and strategies in maintaining baselines for system performance metrics for IoT.  Statistics and probability are fundamental to statistical process control (SPC) and quality improvement in engineering systems.  Machine learning (ML) can be used to find anomalies and patterns in the performance of IoT systems by using large datasets to learn and predict events.  The purpose of this project is to compare and contrast these two strategies qualitatively and quantitatively while providing guidance for IoT system optimization and monitoring.

Statistical techniques such as normalization, hypothesis testing and error minimization; and ML strategies such as regression modeling, neural networking and classification are used. Business applications for this project include system sizing, system health checks, and baseline performance monitoring.   IoT systems must meet business, as well as, technical requirements to perform in the real world.  This project performs analysis on a series of metrics across multiple layers in an IoT architecture.  The Open Systems Interconnection model (OSI model) of IoT will serve as a dichotomy.  Quantitative and qualitative analysis of results will allow businesses to determine scale, performance, accessibility and availability of these networked systems.

PROBLEM AND MOTIVATION

Rapid advancements in IoT and “big data” analytics have created opportunities in performance measurement of multi-tiered architectures.  These types of architectures utilize a variety of platforms including physical, virtual, and cloud for a complete end-to-end business solution.  As the market to industrialize IoT platforms continues to expand, information technology (IT) systems will play a crucial role in collecting, aggregating, and analyzing data from these new endpoints.  IT and business will need to become more aligned in corporate practices and strategies with IoT.  IT managers, in turn, will need to rely on analytics-based system performance models that demonstrate system capabilities in order to satisfy service-level and reliability requirements.

Information systems log and monitor all aspects of utilization, throughput, resource management and user access.  Evaluating and modeling performance will require benchmarks for IoT components such as internet-connected physical endpoints, cloud based services, aggregation systems, networks and collection systems.  This latest effort is to compare SPC and  ML models that extend beyond basic performance metrics for utilization, throughput and resource management to areas such as anomaly detection, process control, and forecasting.  

APPROACH AND UNIQUENESS

This project collected IT system performance data including network monitoring tools, database monitoring tools, web logs, file system logs, and data from sensors and Internet-enabled devices from large multi-tiered systems to demo IoT systems.  The project approach includes:

  1. Build ML and SPC models using IoT system performance data in Azure ML Studio, SAS Enterprise Miner and Python Scikit-Learn.
  2. Train and score the ML and SPC models (a minimal sketch of this comparison follows the list).
  3. Build IoT prototype system using Raspberry Pi, Microsoft Azure IoT Suite,  and Python Distributed Parallel Processing Programming.
  4. Train and score models for prototype performance data.
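As a hedged illustration of steps 1 and 2 (not the project’s actual pipeline), the sketch below places an SPC-style three-sigma control limit next to a machine learning detector (scikit-learn’s IsolationForest) on the same synthetic stream of IoT throughput measurements.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
throughput = rng.normal(500, 25, 2000)           # stand-in IoT metric (messages/sec)
throughput[::250] += rng.normal(150, 20, 8)      # inject a few anomalous spikes

# SPC: flag points outside mean +/- 3 sigma computed from a baseline window.
baseline = throughput[:500]
ucl = baseline.mean() + 3 * baseline.std()
lcl = baseline.mean() - 3 * baseline.std()
spc_flags = (throughput > ucl) | (throughput < lcl)

# ML: IsolationForest scores each point by how easily it can be isolated.
iso = IsolationForest(contamination=0.01, random_state=0)
ml_flags = iso.fit_predict(throughput.reshape(-1, 1)) == -1

print(f"SPC flagged {spc_flags.sum()} points; IsolationForest flagged {ml_flags.sum()}")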

 

Big Data as the Next Major Utility: Musings on the Future of Autonomous Vehicles and CASE.

“Big Data” is everywhere.  It powers business solutions as well as drives economic opportunity.  Is it possible that “Big Data” will become the next major utility?  By utility, I don’t mean its usefulness to businesses.  Can data be a utility like electricity, gas or water, which are distributed reliably through major cities to meet customer demand?  With Smart City initiatives, that certainly appears to be becoming more and more of a reality, but smart city programs do not necessarily build the B2C model that major utilities do.  Autonomous vehicles (AV) and Machine Learning (ML) may fill the gap that makes “Big Data” a utility.  One possible business model includes customers who pay for how much data they use and the times they use it.  Since AV technology will have data from internal and external sensors to evaluate road conditions and anomalies, the utility business model may come into play as a way to pay for such computation and classification.  Machine learning algorithms will help create reinforcement of anomaly and object detection scenarios for AVs.

Currently, cars on the market have Advanced Driver Assistance Systems (ADAS), which include driver-assist technology such as accident avoidance sensors, drowsiness warnings, pedestrian detection, and lane departure warnings. Today’s driver-less cars are actually vehicles that are retrofitted with components that allow drivers to remove their hands from the steering wheel.  To have fully autonomous vehicles, there must be a supply of historical and near real-time data to train the ML models that will guide future AVs.  Like the generation of electrical power from a turbine, there has to be a supply and distribution approach to ML systems that continuously provides reinforcement learning to AVs.  The generation of AV data must be ongoing every hour of the day, for years, in order to continuously train the ML models and build reliability into future AV algorithms and models.

The future on Autonomous Vehicles

CASE stands for Connected, Autonomous, Shared, Electrification (Vehicles).  In many regards, it’s the evolution of modern transportation: a vehicle that doesn’t need a human operator but transports people or goods to different destinations effectively, safely and efficiently with little or no impact on the environment.  Not only will this vehicle be able to transport, it will also serve as a data collector and generator that could be used to determine road conditions, connect with businesses and establish business-to-customer or customer-to-business relationships.

The development of AV must be based on electrification (electric vehicles).  Direct digital control and feedback systems for electrical consumption are ideal for the clean and efficient generation of power.  The autonomous capabilities of vehicles would control not only direction and speed but also a granularity of electrical consumption that would be imperceptible to a live human operator.  Metrics could then be displayed to the passenger, owner or manufacturer of the AV as feedback on its efficiency.

The main focus of the future generation of fully autonomous vehicles will be the ability to keep passengers safe and to successfully navigate any condition or obstacle as the AV transports them to their destination…from leaving their home, to getting into the vehicle, to walking into the destination.  Services will be available to businesses that give AVs exact directions to the business and approved parking spaces the vehicle can navigate to.  Most interfacing will be conducted through the passengers’ smartphones.

Here is an example.  David picks up his smartphone and clicks on an app to request reservations at a restaurant for his wedding anniversary.  The service request is paired with an AV smartphone application that also sends the request to the cloud and the restaurant reservation API.  The ML system in the cloud then programs the AV to navigate to the restaurant and park in a designated parking space (no valet needed).  When dinner is complete, David clicks on the app to have the AV pick him and his wife up and return home.

Future autonomous vehicles will not have manual overrides or speed up to make it to that movie on time.

In order for autonomous vehicles to build trust within the driving community, they must maintain consistent patterns and make decisions that ensure the safety and comfort of all their passengers.  What you don’t want is the AV immediately speeding up to make a light or making sharp, quick turns to avoid oncoming traffic.  This means the automobile needs AI and machine learning capabilities that obey all traffic laws and make correct predictions on any anomaly or object.  Future AV and CASE vehicles will not have steering wheels or brake pedals, because those represent a manual override, which in turn erodes trust with the occupants.

The future generations of AV should not have steering wheels.  Most modern cars rely on a steering system that includes a “rack and pinion” assembly by which a live operator (driver) can turn the car right or left when needed.  Removing the steering mechanism will allow for passenger-only occupancy and create a system that is principally controlled by computerized systems instead of mechanisms that require human intervention.  In the event that the vehicle requires override control by an operator, that operator will be in a vehicle control and command operations center (VOC).  The center will be manned by trained commercial drivers.  Such command operation centers could be run by a third party, by the manufacturer of the vehicle, or by a municipality.

Future autonomous vehicles will be fully connected mobile platforms.

Think of a smartphone and everything that it does.  Now, imagine an autonomous vehicle as essentially a large smartphone that can transport passengers who remain connected to what’s happening outside the car.  These riders will expect to map the course to their destination through connected devices, data, cloud computing and sensors, with that information shared with businesses and users before, during and after they reach their destination.  The applications for such connectivity are tremendous.

The impact of Big Data on autonomous vehicles.

As 5G wireless networks come online, smart cities and autonomous vehicles will fully utilize data flowing to the cloud and back.  5G will facilitate unprecedented communication speed from the vehicle to the outside world, allowing the sensing and tracking of nearly 5,000 GB of data per vehicle per day and making vehicles more efficient and safe.  New computer processor architectures will test, train and build Machine Learning and Deep Learning models faster than in the past and help train AVs to become better adapted to conditions in cities and on highways.

Maintaining a competitive advantage has become an important business strategy.

One of the things I love about data science and data analytics is that most of the innovation in this area has been shared in open data and open source communities.  Internet sites like Kaggle, Amazon and Google have offered public data to anyone wanting to perform Machine Learning, Predictive Analysis and Deep Learning (see my review of DataSciCon.Tech).  Open source software and platforms have grown quickly as well.

This is not the case for vendors invested in the future of AV.  The data collected from sensors and IoT devices in the vehicle, as well as in big data cloud systems, is a well-guarded secret.  Development SDKs for AV technology are accessible only to clients of these AV manufacturers and their partners.  What this will mean for the future of AV innovation is still up for debate; however, companies certainly have the right to safeguard their proprietary research in this area.  It’s not completely known what impact this strategy will have on the long-term adoption of AV.

 

DataSciCon.Tech 2017 Review

Saturday, December 2nd, 2017

DataSciCon.Tech is a data science conference held in Atlanta, Georgia, from Wednesday, November 29th to Friday, December 1st, and included both workshops and conference lectures.  It took place at the Global Learning Center on the campus of Georgia Tech.  This was the first year of the conference, and I attended to get a sense of the data science scene in Atlanta.  Overall, the experience was very enlightening and introduced me to the dynamic and intensive work being conducted in the area of data science.


Keynote speaker Rob High, CTO of IBM Watson, discussing IBM Watson and Artificial Intelligence (DataSciCon.Tech 2017).

DataSciCon.Tech Workshops

Four workshop tracks were held Wednesday: Introduction to Machine Learning with Python and TensorFlow, Tableau Hands-on Workshop, Data Science for Discovery, Innovation and Value Creation, and Data Science with R Workshop.  I elected to attend the Machine Learning with Python and TensorFlow track.  TensorFlow is an open source software library for numerical computation using data flow graphs for Machine Learning.

To prepare for the conference, I installed the TensorFlow module downloaded from https://www.tensorflow.org/install.  In addition to TensorFlow, I downloaded Anaconda (https://www.anaconda.com/), a great Python development environment for those practicing data science programming, which includes many of the Python data science packages such as NumPy and scikit-learn.
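After installing both, a quick sanity check (my own minimal sketch, not part of the workshop materials) is simply to import the packages and print their versions:

```python
# Confirm the environment is usable: imports succeed and versions print.
import numpy as np
import sklearn
import tensorflow as tf

print("NumPy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
print("TensorFlow:", tf.__version__)
```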

Among the predictive and classification modeling techniques discussed in the workshop:

  • Neural Networks
  • Naive Bayes
  • Linear Regression
  • k-nearest neighbor (kNN) analysis

These modeling techniques are popular for classifying data and for predictive analysis.  Few training sessions on Python, scikit-learn or NumPy go into these algorithms in detail, due to the varying levels of math education among audience members.  For the course, we used Jupyter Notebook, a web-based Python development environment that allows you to share and present your code and results using web services.  Jupyter Notebook can also be hosted in Microsoft Azure, as well as in other cloud platforms such as Anaconda Cloud and AWS.  To host a Python Jupyter Notebook in Azure, sign in at https://notebooks.azure.com.
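To illustrate one of the techniques from the list above, here is a minimal k-nearest neighbor sketch in scikit-learn of the kind that would run in a Jupyter Notebook cell; it uses the small Iris dataset that ships with scikit-learn rather than any workshop dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small, built-in dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a k-nearest neighbor classifier and score it on the held-out data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```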

TensorFlow

TensorFlow has a series of functions that use neural networks and machine learning to test, train and score models.  A key advantage of TensorFlow is its ability to train models faster than many other libraries, which matters because training models on split data is a compute-intensive operation.  It is particularly powerful on the Graphics Processing Unit (GPU) architectures popular for Machine Learning and Deep Learning.
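As a rough illustration of that train/score workflow (not the workshop's exact code), here is a minimal sketch using the Keras API bundled with recent TensorFlow releases, trained and evaluated on synthetic data:

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data standing in for real training features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A small feed-forward neural network built with the Keras API in TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on 80% of the data, then score on the remaining 20%
model.fit(X[:800], y[:800], epochs=5, batch_size=32, verbose=0)
loss, acc = model.evaluate(X[800:], y[800:], verbose=0)
print(f"Held-out accuracy: {acc:.3f}")
```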

Download TensorFlow from http://tensorflow.org.  The website also includes a neural network TensorFlow sandbox at http://playground.tensorflow.org.


Source: http://playground.tensorflow.org (DataSciCon.Tech)

DataSciCon.Tech Sessions

I’m going to break down the sessions I attended into the main topics that were covered, so this is a very high-level, hundred-foot point of view of the conference.  My plan is to create a few more blogs on these topics that will go into my work as an aspiring data scientist/data architect.  All the information in this blog is based on information presented at the DataSciCon.Tech 2017 conference.

Machine Learning and Artificial Intelligence

The conference emphasized Artificial Intelligence and Machine Learning pretty heavily.  Artificial Intelligence was discussed more in terms of theory and direct applications than design and development.  There were a few demonstrations of the popular IBM Watson Artificial Intelligence system, but I want to focus this blog primarily on Machine Learning, as it’s something that interests me and other data architects.  Artificial Intelligence and Machine Learning are both based on computerized learning algorithms.  Machine Learning uses past data to learn, predict events or identify anomalies.

Another key fact presented at the conference is the number of open source projects and companies that have produced software modules, libraries and packages devoted to the use and implementation of Machine Learning in business applications.  I strongly recommend anyone interested in learning more to research the software solutions discussed in this blog and how they can be implemented.

For those who are new to the concept of Machine Learning (like me), essentially it is defined as follows:

Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events.  An example would be a spam filter learning new exploits and then blocking those exploits.
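As a toy illustration of that spam filter example (my own sketch, with made-up messages), a naive Bayes text classifier in scikit-learn learns from labeled past messages and then classifies new ones without any hand-coded rules:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: past messages labeled as spam (1) or not spam (0)
messages = [
    "win a free prize now", "limited offer click here",
    "team meeting moved to noon", "please review the attached report",
]
labels = [1, 1, 0, 0]

# The model learns from past data; no rules are hand-coded by a programmer
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

# A new, unseen message is classified based on what the model has learned
print(spam_filter.predict(["click here to win a prize"]))  # expected: [1]
```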
