Using Machine Learning and Data Science for Performance Tuning in Oracle

Like many DBAs, I have experienced the ups and downs of database performance tuning. Many battles have been fought to tune and maintain an Oracle database, yielding mostly victories but also a few hard-fought lessons learned. Performance tuning is a constant in my job and I’ve learned a lot. But for every early morning phone call I’ve received, every severity one conference call that I’ve been on, and every emergency patch that I’ve had to deploy, there is one mantra that has been etched into my mind: database performance tuning is as much an art as it is a skill. There is no easy one-size-fits-all solution. It requires an understanding of the architectures around the database as well as deep knowledge of Oracle internals.

Data Science and Machine Learning

The popularity of data science has opened new possibilities with database performance tuning. Along with building innovative products it has created excitement in areas such as the Internet of Things and cloud computing where very large volumes of data are mined for value.  For those who are new to the concept of Machine Learning, essentially it is defined as follows:

Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events. An example would be a spam filter learning new exploits and then blocking those exploits.

Data scientists have a lot in common with database professionals: both build extraction-transformation-loading routines, construct business intelligence applications, and wrangle data. The difference is that data science also brings more statistical modeling and programming to data and business analytics. In many respects, descriptive statistics have always been a DBA tool for managing performance, by looking at averages and variances in performance over time. What started as simple arithmetic has now expanded into more advanced analytics.

Many DBAs are familiar with the Oracle Diagnostic Pack (by the way, I’m covering Oracle versions 10g, 11g and 12c). This feature includes the Automatic Workload Repository (AWR), which stores system performance data for everything from active session history to system, segment and OS statistics. It also stores shared pool statistics and query plan execution statistics. It’s a massive repository used to build AWR reports, and it can provide historical trending data. By default, snapshots are taken every hour and retained for about a week (seven days in 10g, eight days in 11g and later), but the snapshot interval can be shortened to 15 minutes and the retention extended to months. Anyone who has done performance tuning with Oracle SQL has used v$, x$ or DBA_HIST views to troubleshoot performance issues. The nice thing about the AWR is that you can write your own queries against it. By extension, you can also write PL/SQL procedures to mine the data and build predictive and prescriptive statistical models. Below is an example of a query that pivots AWR operating system statistics by snapshot and joins them to process memory statistics.

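-- Pivot OS statistics into one row per AWR snapshot, then attach the average
-- number of PL/SQL and SQL processes. PIVOT requires Oracle 11g or later.
-- File I/O columns (readtim, writetim, phyrds, phywrts, wait_count, time)
-- live in DBA_HIST_FILESTATXS and would need a separate join.
SELECT jn.*, pc.avg_num_processes
FROM (
       SELECT s.dbid, s.instance_number, s.snap_id, o.stat_name, o.value
       FROM   dba_hist_snapshot s
       JOIN   dba_hist_osstat o
         ON   o.dbid = s.dbid
        AND   o.instance_number = s.instance_number
        AND   o.snap_id = s.snap_id
     )
PIVOT (
       AVG(value)
       FOR stat_name IN (
         'BUSY_TIME' AS busy_time, 'AVG_BUSY_TIME' AS avg_busy_time,
         'IDLE_TIME' AS idle_time, 'AVG_IDLE_TIME' AS avg_idle_time,
         'NUM_CPU_CORES' AS num_cpu_cores, 'NUM_CPUS' AS num_cpus,
         'NUM_VCPUS' AS num_vcpus, 'NUM_LCPUS' AS num_lcpus,
         'VM_OUT_BYTES' AS vm_out_bytes, 'AVG_USER_TIME' AS avg_user_time,
         'AVG_SYS_TIME' AS avg_sys_time, 'OS_CPU_WAIT_TIME' AS os_cpu_wait_time,
         'IOWAIT_TIME' AS iowait_time, 'AVG_IOWAIT_TIME' AS avg_iowait_time,
         'PHYSICAL_MEMORY_BYTES' AS physical_memory_bytes
       )
     ) jn
JOIN (
       SELECT dbid, instance_number, snap_id,
              AVG(num_processes) AS avg_num_processes
       FROM   dba_hist_process_mem_summary
       WHERE  category IN ('PL/SQL', 'SQL')
       GROUP  BY dbid, instance_number, snap_id
     ) pc
  ON  jn.dbid = pc.dbid
 AND  jn.instance_number = pc.instance_number
 AND  jn.snap_id = pc.snap_id
ORDER BY jn.snap_id;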

I first became interested in using statistical analysis in performance tuning after reading a paper entitled An Industrial Engineer’s Approach to Managing Oracle Databases by Robyn Sand. It described the use of statistical methods and engineering process control in DB performance tuning. With the advent of Big Data analytics, data science and machine learning, there are rich opportunities to gain meaningful insight into performance management in Oracle; but there also must be an abundance of caution.

Beyond descriptive statistics, the use of data science and machine learning would, in theory, allow me to “predict” potential detrimental performance in Oracle. Using historical data collected in the AWR, there are literally thousands of possible insights that can be gleaned about Oracle RDBMS performance.

Here are some of the successes, pitfalls and lessons I learned…

 Histograms, Standard Deviations, Distributions and other statistical methods

Oracle has a multitude of built-in analytical and grouping functions to support dimensional data structures and perform data mining techniques. These built-in aggregation features are great for visualizing large volumes of performance data on charts, dashboards and graphs. I estimate that for 98% of all performance tuning needs, these functions will work just fine.

The image above is a histogram of the number of concurrent processes by utilization. The mean appears to be around 300 concurrent processes, which generate around a 9% average utilization in an Oracle database. The second axis is the mean utilization of the frequency of 300 concurrent processes.
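As a rough illustration of the descriptive summary behind a histogram like this, here is a minimal Python sketch. The sample values are made up, and `bin_counts` is a hypothetical helper of my own, not an Oracle or AWR function. In practice the values would come from a query against the AWR.

```python
import statistics
from collections import Counter

# Made-up samples: concurrent process counts, one per AWR snapshot.
process_counts = [280, 295, 300, 305, 310, 290, 300, 320, 298, 302]

def bin_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

mean = statistics.mean(process_counts)
stdev = statistics.stdev(process_counts)  # sample standard deviation
histogram = bin_counts(process_counts, 10)

print(f"mean={mean:.1f} stdev={stdev:.1f}")
for lower_edge, count in histogram.items():
    # One '#' per sample in the bin gives a quick text histogram.
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```

The same grouping can be done inside the database with Oracle's own analytic functions; the point here is only to show what the chart is summarizing.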

There are three main applications for which data science and machine learning can be applied to Oracle database management.

  1.  Capacity Planning and IT Asset Planning.
  2.  Performance Management
  3.  Business Process Analysis

I’d like to put some added emphasis on Business Process Analysis (3). It doesn’t do any good to present data analysis that hasn’t been qualitatively reviewed by stakeholders who know the business value of their IT assets. Before anyone decides whether to purchase new hardware or invest real dollars based on any type of quantitative analysis, the results must be presented in a way that lets the business stakeholders make a sound final decision. As database or IT professionals, it is our job to present quantitative analysis so that all stakeholders can make proper business decisions. We must also have anecdotal, near-real-time and historical evidence to provide an unbiased, objective analysis. Business decisions should never be made on data analysis alone.

 Unsupervised Learning and Clustering Analysis

There are two main subdivisions of machine learning: supervised and unsupervised learning. Supervised learning requires guidance from a programmer, who creates labeled training, testing and validation sets of data to build analytical models and to assess and score how well those models predict the proper results from the input data. Unsupervised learning is built on inferences from the data itself: how specific data points and variables relate to each other. Cluster analysis is an example of an unsupervised learning technique. Clustering is popular in the analysis of demographic information (age, sex, height, race, location) and in segmenting customers according to how likely they are to purchase a new type of product. If you are looking for a way to cluster database properties or system configurations by specific inputs, clustering analysis can be a great tool. It can be used to build specific capacity planning algorithms based on user inputs.


Clustering model of the number of processes per utilization on AWR data. The size of the circle represents the amount of sample data. The circles are clustered by various database configurations. The best database configuration cluster would be CLUSTER 6, which can support a higher number of concurrent processes with a lower amount of centralized utilization.
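To make the idea concrete, here is a minimal k-means sketch in plain Python. It is not the tool I used to build the chart. The (processes, utilization) points and the starting centroids are made up, and a real analysis would normalize the features first so that one scale does not dominate the distance.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the members; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up (concurrent processes, utilization %) samples.
samples = [(100, 5), (110, 6), (120, 7),
           (300, 9), (310, 10), (320, 11),
           (500, 40), (510, 42), (520, 45)]
initial = [samples[0], samples[3], samples[6]]  # hypothetical starting centroids

centroids, clusters = kmeans(samples, initial)
for centroid, members in zip(centroids, clusters):
    print(centroid, len(members))
```

With real AWR data, the number of clusters and the starting centroids are judgment calls, and it is worth trying several values before reading anything into the result.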

 Regression Analysis

Linear regression analysis is a “predictive” technique that takes data points and finds the fitted line with the least error between the line and those points. I find that the best regression models for Oracle performance data are linear regressions using the least squares method for error reduction. However, there are other regression methods, such as logistic regression, which uses the logistic (sigmoid) function to model the probability of a categorical outcome.


A linear regression analysis to predict how many waits will occur based on disk reads. This type of analysis uses least squares fitting, and the R-squared value indicates how well the line fits the data. It is very good for quick analysis of a ton of scattered data points.
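For readers who want to see the arithmetic, here is a minimal ordinary least squares sketch in Python. The (disk reads, waits) pairs are made up for illustration and are not the data behind the chart.

```python
def least_squares_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up samples: physical reads per snapshot and the wait count observed with them.
disk_reads = [100, 200, 300, 400, 500]
waits = [12, 19, 33, 41, 50]

slope, intercept, r_squared = least_squares_fit(disk_reads, waits)
print(f"waits ~ {slope:.3f} * reads + {intercept:.2f} (R^2 = {r_squared:.3f})")
```

A high R-squared on a handful of points says little on its own; it is the fit over a representative baseline that matters.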

 Engineering Process Control

Statistical process control is very much a hit-and-miss proposition, and I’ll explain why. It’s very tempting to take Oracle data and run it through a control chart, calculate upper and lower control limits and say, “if it goes above or below this line, the process is out of control”. But the first question DBAs should ask themselves is, “What exactly is the process that I want to control?” Statistical Process Control is about using statistics to determine upper and lower control limits of process-oriented data. Engineering Process Control is about maintaining those processes so that tolerances are not exceeded. In an Oracle database, this is very hard to do, even with the best running system. Unless you have the same amount of data and the same number of users, and can control for all other variables, you’ll have a very low probability of getting meaningful performance management. Control charts are useful for manufacturing jet engines, where identical parts on an assembly line must be within millimeters of each other in dimensional measures. To put it simply, Oracle, like many other databases, has plenty of noise that’s not worth panicking over.


An X-bar and R control chart of SQL elapsed time in seconds. With so many queries executing in a database, many DBAs ponder the question, “How can I determine when queries will pick up a bad execution plan before the end user is impacted?” I found control charts to be problematic because of their fine granularity and high sensitivity.
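For reference, the limits on a chart like this are normally computed from subgroup means and ranges. The sketch below assumes subgroups of five samples and uses the standard Shewhart constants for that subgroup size; the elapsed times are made up.

```python
# Standard Shewhart control chart constants for subgroups of size 5.
A2, D3, D4 = 0.577, 0.0, 2.114

def xbar_r_limits(subgroups):
    """Return (lower limit, center line, upper limit) for the X-bar and R charts."""
    means = [sum(g) / len(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = sum(means) / len(means)
    mean_range = sum(ranges) / len(ranges)
    return {
        "xbar": (grand_mean - A2 * mean_range, grand_mean, grand_mean + A2 * mean_range),
        "r": (D3 * mean_range, mean_range, D4 * mean_range),
    }

# Made-up SQL elapsed times in seconds, grouped five executions at a time.
elapsed = [
    [1.0, 1.2, 0.9, 1.1, 1.0],
    [1.1, 1.0, 1.3, 0.9, 1.2],
    [0.8, 1.0, 1.1, 1.0, 1.1],
]

limits = xbar_r_limits(elapsed)
print("X-bar chart:", limits["xbar"])
print("R chart:    ", limits["r"])
```

Notice how tight the limits are even on this tiny sample. That sensitivity is exactly what makes these charts noisy on a busy database.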

I do, however, see the benefits of using the statistical distributions on which control charts are based. With the correct amount of baseline data, it is acceptable to create a normal distribution that is devoid of outliers and determine when a particular query or set of queries has exceeded a specific threshold. In this case, the threshold is three standard deviations from the mean of a normal (standard) distribution, or 3σ. I find this helps if you have enough baseline data of good performance to form the proper distribution to compare to.
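A minimal sketch of that three-standard-deviation check, assuming the baseline has already been cleaned of outliers. The baseline elapsed times are made up, and both function names are hypothetical.

```python
import statistics

def three_sigma_threshold(baseline):
    """Upper threshold: baseline mean plus three sample standard deviations."""
    return statistics.mean(baseline) + 3 * statistics.stdev(baseline)

def exceeds_baseline(baseline, observed):
    """True when an observed value is above the three-sigma threshold."""
    return observed > three_sigma_threshold(baseline)

# Made-up elapsed times (seconds) for one query during a known-good period.
baseline_elapsed = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9]

threshold = three_sigma_threshold(baseline_elapsed)
print(f"threshold = {threshold:.3f}s")
print(exceeds_baseline(baseline_elapsed, 3.0))  # well above the baseline
print(exceeds_baseline(baseline_elapsed, 2.2))  # within normal variation
```

Eight samples is far too few for a real baseline; the sketch only shows the mechanics.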


Neural Networks

Neural networks have been around for decades, but it is only recently that their application has broadened in the world of data analytics and data science. Neural networks are the foundation of deep learning: they apply activation functions to weighted inputs across multiple layers, and the layers between input and output are called hidden layers. The more layers you have, the more rigorous the computations can be. If you apply the proper inputs, connect layers and choose activation functions to get the desired output, you can do things such as image processing, speech recognition, text search, object recognition, etc. It’s loosely based on how the brain learns, using neurons to communicate with other neurons. Neural networks are closely tied to artificial intelligence, which has also been around for decades. What is different today is that the processing power available to execute neural networks has improved dramatically, as has the availability of data, algorithms and software for building them.
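To show what layers, weights and activation functions mean in practice, here is a tiny forward pass in plain Python. It is only a sketch: the weights are made up and untrained, the two inputs are hypothetical normalized metrics, and no training step is shown.

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum of inputs + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in order."""
    for weights, biases in layers:
        inputs = dense(inputs, weights, biases, sigmoid)
    return inputs

# Made-up network: 2 inputs -> hidden layer of 2 neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                   # output layer
]
output = forward([0.3, 0.09], layers)
print(output)
```

Training is the process of adjusting those weights and biases until the output matches known examples, which is where the processing power goes.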

When it comes to database performance, I believe the verdict is still out. Neural networks are already used in network intrusion detection services and to monitor for DoS attacks.

I believe that to predict bad anomalies in database performance, you really need to understand the processes and define what bad performance is. Simply unleashing neural networks on performance metrics without understanding the relationship between those inputs will produce undesired results.

 Conclusion

In conclusion, this has been a very high level discussion, so feel free to reach out to me and connect on LinkedIn to discuss my research. I’ve been studying this for around two years so far and I’m looking forward to writing more articles in 2019.

I believe the best way to go about using data science and machine learning in database performance tuning is the following:

  1.   Have at least five or more consecutive days of baseline performance data from which to train, test and validate your models. Whatever represents a week-in-the-life of a business.
  2. Talk with the business and understand their pain points so that you can collect the right metrics and right statistics.
  3.  Use a data analysis process that includes describing the data, building histograms, training the models, testing the models, and scoring different models.
  4.  As an added step, use hypothesis testing, error checking or other statistical testing methods.
  5. Understand the business processes that you are monitoring so that you can select the correct metrics, variables and inputs from the database performance data and statistics.
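As a sketch of the first step, the baseline can be split in time order so that the models are always tested on data newer than the data they were trained on. The 60/20/20 proportions and the function name are my own assumptions, not a rule.

```python
def chronological_split(samples, train_pct=60, test_pct=20):
    """Split time-ordered samples into train, test and validation sets without shuffling."""
    n = len(samples)
    train_end = n * train_pct // 100
    test_end = train_end + n * test_pct // 100
    return samples[:train_end], samples[train_end:test_end], samples[test_end:]

# Hypothetical baseline: five days of hourly AWR snapshots = 120 samples,
# represented here by their position in time.
hourly_snapshots = list(range(120))

train_set, test_set, validation_set = chronological_split(hourly_snapshots)
print(len(train_set), len(test_set), len(validation_set))
```

Shuffling before the split would leak future behavior into the training set, which makes a model look better than it will be in production.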

As of this writing, I have been researching the use of Convolutional Neural Networks (CNN), which are popular for anomaly detection. Oracle database statistics have plenty of noise due to concurrent processes, a somewhat complex database engine and data constantly moving in and out of the SGA, PGA and buffer cache. I hope to have an update on my progress with deep learning and neural networks soon.

 

Campaign Management using Advanced Analytics

Campaign management is a strategy to use marketing campaigns to create sales and leads. The Internet has been a treasure trove of consumer behavior for decades, but it is only recently that web analytics has become a tool to create powerful business insight. Two popular strategies deal specifically with pattern recognition: customer segmentation (clustering) and market basket analysis (association).

This project will help determine how consumer Internet behavior analysis can be used in marketing strategies based on event time, web page views, real-time social media feeds and other information constantly being tracked through agent software, social websites, and web traffic logs. To determine the usefulness of such a strategy in business decisions about media campaign ads, my project team at the University of North Carolina at Greensboro (Chi-Squared) collected Twitter feed data using a Twitter developer account and simulated click stream information based on real-world content management metadata to create an association market basket, cluster models, and an eventual regression model. The goal was to demonstrate the use of internet data analytics (web analytics) with popular analytical methods to predict how revenue streams can be determined.

The project was developed to determine the qualitative value of user online behavior and patterns to help business leaders make decisions about campaign ads and campaign management.

Click stream data from websites and feeds, along with the accessibility of more powerful analytical tools, has driven analytical methods such as forecasting and search engine optimization in retail markets. Popular and powerful MapReduce-capable platforms such as Hadoop and MongoDB are opening up a world of possibilities in the area of web analytics. Web analytics is a subset of business analytics and a form of data analysis based on data acquired through web server logs, programs, service agents and interfaces, mainly collected in real time on potentially millions or billions of events. The massive array of internet traffic data, such as the number of unique visitors on a website and social media feeds from Twitter, has promoted further support of web semantics by the World Wide Web Consortium and has created new services such as Google Analytics and Amazon Web Services. “Big Data” web analytics is an area that will continue to create a wealth of opportunities for corporate decision-makers.

Content campaigning is a powerful tool in the hands of marketing professionals. Web and mobile content media and catalog metadata are crucial to a provider’s revenue stream. Simply put, legal digital movie and music downloads represent the main revenue stream for retail portals. Predictive analytics allows internet companies to get insight into what customers are likely to purchase and also determine what content is likely to become more popular on social media on a particular day or period of time. Market basket and association analysis help to create campaigns for new content that unique visitors and customers are likely to be interested in and (ideally) purchase. It is a very relevant topic in the growing business practice of understanding media content sales on the internet and how human and machine event tracking can play a role in generating revenue. Information from the web will drive future campaigns and revenue streams. This project serves as a way to demonstrate basic analytical methods for generating successful ad campaigns.
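To give a feel for what market basket analysis measures, here is a minimal Python sketch of the three usual rule metrics: support, confidence and lift. The baskets and item names are made up and are not the project's data.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    with_antecedent = sum(1 for b in baskets if antecedent <= b)
    with_consequent = sum(1 for b in baskets if consequent <= b)
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = with_both / n                      # how often both appear together
    confidence = with_both / with_antecedent     # P(consequent | antecedent)
    lift = confidence / (with_consequent / n)    # > 1 means a positive association
    return support, confidence, lift

# Made-up purchase baskets from a music portal.
baskets = [
    {"rock_album", "concert_dvd"},
    {"rock_album", "concert_dvd", "poster"},
    {"rock_album"},
    {"pop_single"},
]

support, confidence, lift = rule_metrics(baskets, {"rock_album"}, {"concert_dvd"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A campaign built on this rule would show the concert DVD to visitors who view or buy the rock album, since the lift above 1 suggests the two sell together more often than chance.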

In the project it was observed that associations are best for determining market advertising on websites. Click stream behavior is a very good way of developing implied rules about what type of content visitors would like to see on a particular website. It was observed that analysis of click stream data (navigational data) feeds the recommender systems of many online businesses. These recommender systems support business decisions that can be used to generate revenue. By analyzing Twitter data it was observed that targeting specific segments of followers on Twitter can increase the campaign success of music albums, songs or artists. This information provides valuable insights for generic ad campaigns. What was determined was that the number of followers an account had is a better predictor of whether and how often an artist is mentioned than the number of friends and listings, which were much more sporadic. Also, it was determined that creating segments of followership improved the regression model by reducing the potentially massive number of outliers and focusing on the majority of accounts rather than on just a few accounts with very large followings.

As Big Data grows into a main data source today, one type of data that contributes to it is the clickstream from ad banners or other media files on a webpage. Philip Russom states in Big Data Analysis: “One of the things that makes big data really big is that it’s coming from a greater variety of sources than ever before. Many of the newer ones are Web sources, including logs, clickstreams, and social media. User organizations have been collecting Web data for years. However, for most organizations, it’s been a kind of hoarding. We’ve seen similar untapped big data collected and hoarded, such as RFID data from supply chain applications, text data from call center applications, semi structured data from various business-to-business processes, and geospatial data in logistics. What’s changed is that far more users are now analyzing big data instead of merely hoarding it. The few organizations that have been analyzing this data now do so at a more complex and sophisticated level. Big data isn’t new, but the effective analytical leveraging of big data is.” [5]

In her research on click stream analysis, Sule Gündüz explains how a web page prediction model can be based on a click-stream tree representation of user behavior. She demonstrates that predicting the next request of a user as she visits Web pages has gained importance as Web-based activity increases. Markov models and their variations, or models based on sequence mining, have been found well suited for this problem. However, higher order Markov models are extremely complicated due to their large number of states, whereas lower order Markov models do not capture the entire behavior of a user in a session. The models that are based on sequential pattern mining only consider the frequent sequences in the data set, making it difficult to predict the next request following a page that is not in the sequential pattern. Furthermore, it is hard to find models for mining two different kinds of information of a user session. She proposes a new model that considers both the order information of pages in a session and the time spent on them. She also clusters user sessions based on their pair-wise similarity and represents the resulting clusters by a click-stream tree. The new user session is then assigned to a cluster based on a similarity measure. The click-stream tree of that cluster is used to generate the recommendation set. The model can be used as part of a cache prefetching system as well as a recommendation model.
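As a point of comparison, the low-order Markov baseline mentioned above can be sketched in a few lines of Python. This is a plain first-order Markov model, not Gündüz's click-stream tree, and the sessions and page names are made up.

```python
from collections import Counter, defaultdict

def build_transitions(sessions):
    """Count page-to-page transitions across sessions (first-order Markov model)."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            transitions[current_page][next_page] += 1
    return transitions

def predict_next(transitions, page):
    """Return the most frequently observed next page, or None if the page has no known successor."""
    if page not in transitions:
        return None
    return transitions[page].most_common(1)[0][0]

# Made-up click stream sessions, each a list of pages in visit order.
sessions = [
    ["home", "artist", "album", "checkout"],
    ["home", "artist", "album"],
    ["home", "search", "artist", "video"],
]

transitions = build_transitions(sessions)
print(predict_next(transitions, "home"))
print(predict_next(transitions, "artist"))
print(predict_next(transitions, "checkout"))
```

The limitation described in the paragraph is visible here: the model only looks at the current page, so it ignores both the path that led there and the time spent on each page.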

Satya Menon and Dilip Soman also discuss a prediction model built from clickstream data in their article, “Managing the Power of Curiosity for Effective Web Advertising Strategies”, which investigates the effect of curiosity on the effectiveness of Internet advertising. In particular, they identify processes that underlie curiosity resolution and study its impact on consumer motivation and learning. The dataset from their simulated Internet experiment includes process tracking variables (i.e., click stream data from ad-embedded links), traditional attitude and behavioral intention measures, and open-ended protocols. They find that a curiosity-generating advertising strategy increases interest and learning relative to a strategy that provides detailed product information. Furthermore, though curiosity does not dramatically increase the observed quantity of search in their study, it seems to improve the quality of search substantially (i.e., time spent and attention devoted to specific information), resulting in better and more focused memory and comprehension of new product information. To enhance the effectiveness of Internet advertising of new products, they recommend a curiosity advertising strategy based on four elements: (1) curiosity generation by highlighting a gap in extant knowledge, (2) the presence of a hint to guide elaboration for curiosity resolution, (3) sufficient time to try and resolve curiosity as well as the assurance of curiosity-resolving information, and (4) the use of measures of consumer elaboration and learning to gauge advertising effectiveness. The curiosity statistics they report all come from the click stream data from the ad-embedded links. Essentially, their prediction model is built on click stream data.

To win an ad campaign, tracking user behavior is very important for building the prediction model. Randolph Bucklin and Catarina Sismeiro use the clickstream data recorded in Web server log files to develop and estimate a model of the browsing behavior of visitors to a Web site. Two basic aspects of browsing behavior are examined: the visitor’s decision to continue browsing (by submitting an additional page request) or to exit the site, and the length of time spent viewing each page. They propose a type II tobit model that captures both aspects of browsing behavior and handles the limitations of server log-file data. In their article, they note that part of the log-file data comes from the click stream, so they too use the click stream as the data for their prediction model.

Overall, many other researchers use click stream data to build their prediction models. In the same way, we use click stream data and Twitter feed data to predict ad campaign results in the real world.

Data-driven analysis has become important for decision makers. It helps improve productivity and operations. Most data-driven companies, especially those that exist on the internet, are in a position to optimize their product revenue through their main channel or portal on the internet. With standardization of internet protocols and web services, it’s now much more relevant to use search engines, page views and click streams to gauge possible new revenue streams. In the competitive world of data analytics, the internet is presently ground zero. Standard techniques such as customer-driven innovation (CDI) have created practices such as search engine optimization and social media analytics. Anywhere there is a prevalence of information generated by humans or machines, innovations are spawned to capitalize on it. What’s even more encouraging is that data is becoming much more accessible and open. With Google Analytics anyone can have access to click stream data from their own website for free. Twitter and other social media data is available for any developer to capture and analyze. Cloud services such as Microsoft Azure, Google Cloud Services and Amazon Web Services allow developers to build data analytics free of charge depending on the level of service, type of applications accessed and amount of data necessary to analyze. Websites such as http://data.gov provide tens of thousands of datasets free of charge. Public policy considerations have provided new opportunities for science, government and its citizens. It has literally become a data-driven world in which anyone has access to the world’s data. It is no longer a question of whether web and data analytics are necessary to be competitive. William McComb, CEO of Fifth & Pacific, an upscale brand company, said this regarding branding and ads: “…the girl…that we want to target is even more digitally obsessed and lives her life on mobile devices. We believe that we’re at a point, the economy is at a point, and the consumer has evolved to a point where she doesn’t need to have a physical store.” It’s this economy that is driven by data analysis.

The data source for these models consists of live Twitter feed data collected from a Twitter social media account as a JSON file. This data was parsed for popular content (specifically for artist name) and recorded based on the number of occurrences, the time and the actual tweet text that contains the target value.
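The parsing step can be sketched roughly as follows. This is not the project's actual script. It assumes one JSON object per line with a `text` field, and the artist names and tweets are placeholders.

```python
import json
from collections import Counter

def count_artist_mentions(json_lines, artists):
    """Count how many tweets mention each artist name (case-insensitive substring match)."""
    counts = Counter()
    for line in json_lines:
        tweet = json.loads(line)
        text = tweet.get("text", "").lower()
        for artist in artists:
            if artist.lower() in text:
                counts[artist] += 1
    return counts

# Made-up feed: one JSON tweet object per line.
sample_feed = [
    '{"created_at": "Mon Apr 01 12:00:00 +0000 2013", "text": "Loving the new Artist A album"}',
    '{"created_at": "Mon Apr 01 12:05:00 +0000 2013", "text": "artist a and Artist B on tour"}',
    '{"created_at": "Mon Apr 01 12:10:00 +0000 2013", "text": "Nothing to see here"}',
]

mentions = count_artist_mentions(sample_feed, ["Artist A", "Artist B"])
print(mentions)
```

A plain substring match will over-count short or common names, so a real feed would need stricter matching before the counts go into a model.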

To marketers, Twitter is significant for determining the popularity of a brand based on tweets, also known as brand tweets, and the amount of follower engagement measured by the number of followers per person. To many companies, this can translate to clicks on their Twitter bit.ly link or more followers added, and it is part of the engagement breakdown. Engagement breakdown measures the number of replies, tweets, re-tweets, mentions and favorites. The click stream data in this project is engineered using seeded catalog metadata specific to music media and simulated web logs of the kind companies use to collect user page view behavior on websites. There are four main ways of capturing click stream data: Apache web logs, web beacons, JavaScript tags and packet sniffing.
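Of those four methods, web logs are the easiest to show. Below is a minimal sketch that parses one line in the Apache Common Log Format with a regular expression. The sample line is made up, and `parse_log_line` is a hypothetical helper.

```python
import re

# Matches the Common Log Format fields that also begin a combined-format line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the request fields as a dict of strings, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.groupdict()

# Made-up page view from a simulated web log.
sample_line = '192.0.2.10 - - [13/Nov/2014:16:38:27 +0000] "GET /albums/123 HTTP/1.1" 200 5120'

page_view = parse_log_line(sample_line)
print(page_view["host"], page_view["path"], page_view["status"])
```

Ordering the parsed page views by host and time gives the per-visitor page sequences that click stream models start from.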

Nowadays, various companies have realized the potential of online ad campaigns in effectively increasing their customer base and bringing in more profits. Typically, most companies have limited budgets set aside for their ad campaigns, and naturally the budgets can only pay for a limited number of ads in a given time period. Thus the major challenge is to make this limited number of ads very effective by targeting them as much as possible to the right segment of people. Click stream data is a valuable resource in this direction. It has the potential to give companies valuable insights into the right set of people to target their ads towards and thus get maximum profits out of their ad campaigns. In the same way, content providers can also use click stream data analysis to appropriately target the ads of their customers. Twitter feed data analysis works in a similar vein. While click stream data analysis examines data collected from the behavioral habits of web users, Twitter feed data analysis concentrates more specifically on data from users’ Twitter feeds. The analysis we have done through click stream data and Twitter feed analysis in this project gave us valuable insights that can be used by companies to more effectively target their ads.

Team members and authors:  Derek Moore,  Ashwini Wani, Junyi Hu, Ramya Nimmagadda.
Class:  ISM 678 Business Analytics
Instructor:  Dr. Hamid Nemati, Department of Information Systems and Operation Management, Bryan School of Business and Economics
University of North Carolina at Greensboro

Example of an Entry-Level Data Scientist Job Post

The following is from a posting for a Data Science Intern at Lenovo. It is a good example of what to expect from a typical data science position.

The Data Scientist Intern will be expected to perform data mining, statistical learning, predictive modeling, mathematical and simulation modeling, forecasting, and data visualization in support of key strategic projects. This position is dedicated to performing best in class data management and business analytics. Key responsibilities include:

  • Participate in projects, tasks, and activities related to data integration, data cleaning, descriptive analyses, exploratory analyses, predictive modeling, data mining, text analytics, rapid prototyping, and data visualization
  • Collaborate with colleagues throughout the business to collect, store, access, and analyze data from a variety of sources.
  • Assist in developing static and interactive data visualizations
  • Develop predictive models and simulations using a variety of software and tools
  • Leverage data / big data to discover patterns and solve strategic analytic business problems using both structured and unstructured data sets across many environments
  • Develop analytic capabilities that drive better outcomes for both customers and the company, informing business decisions across a broad range of functions.

Position Requirements

The Data Scientist Intern should have the experience and skills needed to successfully execute the key position objectives. Requirements include:

  • Motivated self-starter with a desire to develop solutions for the data analytics space using cutting edge computing technology
  • Experience with analytic projects and programs
  • Strong organizational and communication skills, the ability to work in a collaborative environment, and a desire to improve skills are essential
  • Ability to extract, merge and analyze data from a wide variety of sources (e.g., relational databases, text and unstructured files, sensor data, image and video files).
  • Ability to quickly and easily learn new open source software
  • Experience with static and/or interactive data visualization methods such as Qlik or Tableau
  • Experience with SQL and NoSQL databases, including any of these: MySQL, PostgreSQL, SQLite, MongoDB, and Neo4j.
  • Experience with techniques and technologies for accessing and analyzing “big data” using Hadoop, Kafka, Spark, Cassandra, Splunk, etc.
  • Programming experience in Java, R, Python, Hive, Pig, etc
  • Experience with machine learning or AI tools such as Mahout, Weka, etc.

Big Data as the Next Major Utility: Musings on the Future of Autonomous Vehicles and CASE.

“Big Data” is everywhere. It powers business solutions as well as drives economic opportunity. Is it possible that “Big Data” will become the next major utility? By utility, I don’t mean its usefulness to businesses. Can data be a utility like electricity, gas or water, which is distributed reliably through major cities to meet customer demand? With the Smart City initiatives, that certainly appears to be becoming more and more a reality, but smart city programs do not necessarily build the B2C model that major utilities do. Autonomous vehicles (AV) and Machine Learning (ML) may fill the gap that makes “Big Data” a utility. One possible business model includes customers who pay for how much data they use and the times they use it. Since AV technology will have data from internal and external sensors to evaluate road conditions and anomalies, the utility business model may come into play as a way to pay for such computation and classification. Machine learning algorithms will help create reinforcement of anomaly and object detection scenarios for AV.

Currently, cars on the market have Advanced Driver Assistance Systems (ADAS) development and includes driver assist technology such as accident avoidance sensors, drowsiness warnings, pedestrian detection, and lane departure warnings. Today’s driver-less cars are actually vehicles that are retrofitted with components that allow drivers to remove their hands from the steering wheel.  To have fully autonomous vehicles, there must be a supply of historical and near real-time data to train ML models that will guide future AV.  Like the generation of electrical power from a turbine, there has to be a supply and distribution approach to ML systems that is continuously providing reinforcement learning to AV.  The generation of AV data must be ongoing every hour of the day for years in order to continuously train the ML models to build reliability in future AV algorithms and models.

The future of Autonomous Vehicles

CASE stands for Connected, Autonomous, Shared, Electrification (Vehicles). In many regards, it’s the evolution of modern transportation: a vehicle that doesn’t need a human operator, but transports people or goods to different destinations effectively, safely and efficiently with little or no impact on the environment. Not only will this vehicle be able to transport, it will also serve as a data collector and generator that could be used to determine road conditions, connect with businesses and establish business-to-customer or customer-to-business relationships.

The development of AV must be based on electrification (electric vehicles). Direct digital control and feedback systems for electrical consumption are ideal for clean and efficient generation of power. The autonomous capabilities of vehicles would control not only direction and speed but also electrical consumption, at a granularity that would be imperceptible to a live human operator. Metrics could then be displayed to the passenger, owner or manufacturer of the AV as feedback on its efficiency.

The main focus of the future generation of fully autonomous vehicles will be the ability to keep passengers safe and successfully navigate any condition or obstacle as the AV transports them to their destination, from leaving their home and getting into the vehicle to walking into the destination. Services will be available to businesses that will allow AVs to follow exact directions to where the business is and navigate to approved parking spaces. Most interfacing will be conducted through the passengers’ smart phones.

Here is an example. David picks up his smart phone and clicks on an app to request reservations at a restaurant for his wedding anniversary. The service request is paired with an AV smart phone application that also sends the request to the cloud and the restaurant reservation API. The ML system in the cloud then programs the AV to navigate to the restaurant as well as park in a designated parking space (no valet needed). When the dinner is complete, David clicks on the app to have the AV pick up him and his wife and return them home.

Future autonomous vehicles will not have manual overrides or speed up to make it to that movie on time.

In order for autonomous vehicles to build trust within the driving community, they must maintain consistent patterns and make decisions that ensure the safety and comfort of all their passengers. What you don’t want is for the AV to suddenly speed up to make a light or make sharp or quick turns to avoid oncoming traffic. This means the automobile needs to have AI and machine learning capabilities that obey all traffic laws and make correct predictions on any anomaly or object. Future AV and CASE will not have steering wheels or brake pedals because those represent a manual override, which in turn erodes trust with the occupants.

The future generations of AV should not have steering wheels. Most modern cars rely on a steering system that includes a “rack and pinion” assembly by which a live operator (driver) can turn the car right or left when needed. Removing the steering mechanism will allow for passenger-only occupancy and create a system that is principally controlled by computerized systems instead of mechanisms that require human intervention. In the event that the vehicle requires override control by an operator, that operator will be in a vehicle control and command operations center (VOC). The center will be staffed by trained commercial drivers. Such command operation centers would be provided by a third party, by the manufacturer of the vehicle, or by a municipality.

Future autonomous vehicles will be fully connected mobile platforms.

Think of a smart phone and everything that it does. Now, imagine an autonomous vehicle as essentially a large smart phone that can transport passengers who are connected to what’s happening outside the car. These riders will expect to map the course to their destination through connected devices, data, cloud computing and sensors, and that information will then be shared with businesses and users before, during and after the trip. The applications for such connectivity are tremendous.

The impact of Big Data on autonomous vehicles.

As 5G wireless networks come online, smart cities and autonomous vehicles will be able to fully utilize data moving to the cloud and back. 5G will facilitate unprecedented communication speed from the vehicle to the outside world, allowing sensing and tracking of nearly 5,000 GB of data per vehicle per day and making vehicles more efficient and safe. New computer processor architectures will test, train and build Machine Learning and Deep Learning models faster than in the past and help train AVs to become better equipped for conditions in cities and on highways.

Maintaining a competitive advantage has become an important business strategy.

One of the things I love about data science and data analytics is that most of the innovation done in this area has been shared in open data and open source communities. Internet sites like Kaggle, Amazon and Google have offered public data to anyone wanting to perform Machine Learning, Predictive Analysis and Deep Learning (see my review of DataSciCon.Tech). Open Source software and platforms have grown quickly as well.

This is not the case for vendors invested in the future of AV. The data collected from sensors and IoT devices in the vehicle, as well as in big data cloud systems, is a well-guarded secret. Development SDKs for AV technology are accessible only to clients of these AV manufacturers and their partners. What this will mean for the future of AV innovation is still up for debate; however, companies certainly have the right to safeguard their proprietary research in this area. It’s not completely known what impact this strategy will have on long-term adoption of AV.
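To put the 5,000 GB per vehicle per day figure in perspective, here is a rough back-of-the-envelope calculation in Python of the sustained data rate it would imply if all of that data left the vehicle evenly over 24 hours. The decimal definition of a gigabyte and the even spread over the day are my assumptions, not figures from any vendor.

```python
# Sustained data rate implied by a daily data volume per vehicle.
GB_PER_VEHICLE_PER_DAY = 5_000      # figure quoted in the text
BYTES_PER_GB = 10**9                # assumption: decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_mbps(gb_per_day: float) -> float:
    """Convert a daily volume in GB to an average rate in megabits per second."""
    bits_per_day = gb_per_day * BYTES_PER_GB * 8
    return bits_per_day / SECONDS_PER_DAY / 10**6

rate = sustained_mbps(GB_PER_VEHICLE_PER_DAY)
print(f"{rate:.0f} Mbit/s sustained per vehicle")  # prints "463 Mbit/s sustained per vehicle"
```

In practice much of the sensor data would likely be processed or filtered in the vehicle, so this is an upper bound on what the network would need to carry.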

 

DataSciCon.Tech 2017 Review

Saturday, December 2nd, 2017

DataSciCon.Tech is a data science conference that was held in Atlanta, Georgia, from Wednesday, November 29th to Friday, December 1st and included both workshops and conference lectures. It took place at the Global Learning Center on the campus of Georgia Tech. This was the first year of this conference, and I attended to get a sense of the data science scene in Atlanta. Overall, the experience was very enlightening and introduced me to the dynamic and intensive work being conducted in the area of data science.

Keynote speaker Rob High, CTO of IBM Watson, discussing IBM Watson and Artificial Intelligence (DataSciCon.Tech 2017).

DataSciCon.Tech Workshops

Four workshop tracks were held Wednesday: Introduction to Machine Learning with Python and TensorFlow, Tableau Hands-on Workshop, Data Science for Discovery, Innovation and Value Creation, and Data Science with R Workshop. I elected to attend the Machine Learning with Python and TensorFlow track. TensorFlow is an open source software library for numerical computations using data flow graphs for Machine Learning.

To prepare for the conference, I installed the TensorFlow module downloaded from https://www.tensorflow.org/install. In addition to TensorFlow, I downloaded Anaconda (https://www.anaconda.com/), a great Python development environment for those practicing data science programming that includes many of the Python data science packages such as NumPy and SciKit-Learn.

Among the predictive and classification modeling techniques discussed in the workshop:

  • Neural Networks
  • Naive Bayes
  • Linear Regression
  • k-nearest neighbor (kNN) analysis

These modeling techniques are popular for classifying data and predictive analysis. Few training sessions on Python, SciKit-Learn or NumPy go into these algorithms in detail due to the varying math backgrounds of the audience members. For the course, we used Jupyter Notebook, a web-based Python development environment which allows you to share and present your code and results using web services. Jupyter Notebook can also be hosted in Microsoft Azure, as well as in other cloud platforms such as Anaconda Cloud and AWS. To host Python Jupyter Notebook in Azure, sign in at https://notebooks.azure.com.
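Of the techniques listed above, k-nearest neighbor is the easiest to sketch in a few lines. The snippet below is my own minimal illustration in plain Python, not code from the workshop: the two-feature training points and the "normal"/"slow" labels are made up, and a real project would more likely use a library such as SciKit-Learn.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((feature_1, feature_2), label) pairs.
    """
    # Sort training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two clusters with made-up labels.
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((5.0, 5.2), "slow"),   ((5.1, 4.9), "slow"),   ((4.8, 5.0), "slow"),
]

prediction = knn_predict(train, (4.9, 5.1))
print(prediction)  # prints "slow"
```

The idea is simply that a new point takes the label most common among its closest neighbors; the choice of k and of the distance measure are the main tuning decisions.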

TensorFlow

TensorFlow has a series of functions that use neural networks and machine learning to test, train and score models. The advantage of TensorFlow is its ability to train models faster than other modules, which is a very big advantage since training models is a process-intensive operation. It is particularly powerful on the Graphics Processing Unit (GPU) architecture popular for Machine Learning and Deep Learning.

Download TensorFlow from http://tensorflow.org. The website also includes a Neural Network TensorFlow sandbox at http://playground.tensorflow.org.

source:  http://playground.tensorflow.org.  tensorflow.org (DataSciCon.Tech)

DataSciCon.Tech Sessions

I’m going to break down the sessions I attended into the main topics that were covered, so this is a very high-level overview of the conference. My plan is to create a few more blogs on the topic that will go into my work as an aspiring data scientist/data architect. All the information in this blog is based on information presented at the DataSciCon.Tech 2017 conference.

Machine Learning and Artificial Intelligence

The conference emphasized Artificial Intelligence and Machine Learning pretty heavily.  Artificial Intelligence was discussed more in theory and direct applications than design and development.  There were a few demonstrations of the popular IBM Watson Artificial Intelligence system; but I want to focus this blog primarily on Machine Learning, as it’s something that interests me and other data architects.  Artificial Intelligence and Machine Learning are both based on computerized learning algorithms.  Machine Learning uses past data to learn, predict events or identify anomalies.

Another key fact presented at the conference is the number of open source projects and companies that have produced software modules, libraries and packages devoted to the use and implementation of Machine Learning in business applications.  I strongly recommend anyone interested in learning more to research the software solutions discussed in this blog and how they can be implemented.

For those who are new to the concept of Machine Learning (like me), essentially it is defined as follows:

Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events.  An example would be a spam filter learning new exploits and then blocking those exploits.


What Companies Need to Know About Big Data and Social Computing in Information Technology Management

Internet statistics estimate that 500 million tweets are produced per day. That translates to millions of conversations about a vast array of topics. “Big data” is a term that has become more prominent as social media sites such as Twitter, Facebook, Instagram, etc. continue to generate large data streams. Consumers produce click stream data and complete transactions when they visit corporate websites to make purchases, schedule appointments for services or type reviews on Yelp, Amazon and Uber about an experience that they’ve had. With a well-planned IS strategy, companies can analyze this data to gain insight into their customers and make the critical strategic decisions necessary to compete. Here are a few things companies should know about “Big Data” and social media computing as a business strategy.

Understand that social media and social networking are more a concept than a platform.

One of the biggest problems with companies adopting social media as part of their IT business strategy is that the concept of social media for many IT managers does not extend beyond Twitter and Facebook. There are many platforms on which social media is beneficial to business. Slack and GitHub build on crowd-sourcing to support project management, software development and agile methodologies, even though those platforms are not primarily thought of as social media.

As more engineering firms adopt open source solutions, agile methods and DevOps, companies are deciding to use code development repositories such as GitHub. Microsoft has already adopted GitHub as part of its Visual Studio Team Foundation options for source control. The power of GitHub is very evident as global communities of developers use it to make some of the most innovative software products in languages such as Python, Java, C#, Ruby, etc. It has also become a viable social media platform for software engineers who frequently collaborate on sprints. Companies are also turning to solutions such as Slack to build entire global teams of developers who collaborate on projects and sprints.

Social media as an IT business strategy is about understanding its contextual design and how the user interacts with it. Part of understanding the contextual design of social media includes identifying the actors (primary and secondary) on which the platform is based and how those users interact with it to build relationships and communities.

Context also extends to how a user interfaces with social media. Take, for example, the device many currently have in their pockets. Apply classifications of contextual scope to this device and determine all the ways users interact through a platform (tablet, smartphone, computer, etc.).

A method known as the 4-I’s framework¹ is a good model to understand the user interaction in the context of social media.  The method is typically utilized in classifying interactions with information systems as described above.  The 4-I’s include:

  • Inscriptive (inputs)
  • Informative (outputs)
  • Interactive (processing)
  • Isolated (stored data)

This framework is useful for looking at the ways a user can interact with a platform as well as the information exchanged within that platform. Another popular method is the Model-View-Controller (MVC) model, which is used in software analysis and engineering as an architectural pattern for implementing user interfaces on computers by separating those systems into layers.

Do not dismiss “Big Data” as a gimmick.

The term “Big Data” itself may seem oversold through marketing, but the production of large data sets is very real, very fast and very large, with new data sets being produced every day through public and private portals.

Big data is described as data that has variety (video, text, images, unstructured and structured), volume (over a terabyte, scale of brand), velocity (constant production of data streams), and veracity (the data needs to be cleaned and managed).

Information has become more fluid and available to more people faster and more easily. Although no company should drive business decisions by what happens on Twitter or Facebook (or on the Dow), the power of “Big Data” as a tool can help in trend analysis, customer segmentation and insight into short- to long-term business decisions.

With “Big Data” companies will be able to:

  • Respond more quickly to market by making faster decisions.
  • Make patterns more evident to make changes to processes and products.
  • Better realize innovations in products and services and bring those to market faster.
  • Build and manage new and current data streams.
  • Create a data analytics ecosystem.  Make analyzing and aggregating data a business process that all employees can utilize.

For a “Big Data” strategy to be successful, companies must:

  • Create data lakes and systems where raw data can live prior to being transformed for business intelligence and reporting.
  • Remove data silos where data exists but is only accessible to a few internal stakeholders.
  • Create a data analytics ecosystem.
  • Create hybrid cloud solutions and begin moving applications to the cloud.

Know what association and segmentation analysis are and how to use them to learn about your customers.

With more data streams coming online every day, new analytical methods can be used to gain insight into what consumers need in products and services. Two popular analytical methods are association analysis and segmentation analysis. In my next blog, I will discuss how these methods give insights into customers to better predict how they shop and what campaign ads are more likely to be successful with consumers.

With the popularity of MapReduce and Hadoop, the business world is seeing an increase in “Big Data” analytics based on click stream and social media data. Large data sets which would have taken days to analyze can now be analyzed in minutes.
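As a quick preview of association analysis, the sketch below computes the two basic measures it relies on: support (how often items appear together) and confidence (how often the second item appears given the first). The shopping baskets are made-up data for illustration only, and the helper names are my own.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(round(support({"bread", "milk"}), 2))       # 3 of 5 baskets -> 0.6
print(round(confidence({"bread"}, {"milk"}), 2))  # 3 of the 4 bread baskets -> 0.75
```

A rule such as "customers who buy bread also buy milk" is usually only considered interesting when both numbers clear some minimum threshold chosen by the analyst.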

Conclusion

As data has become more prominent within organizations, and the means of collecting it have become easier and more ubiquitous, new skills will be necessary in certain roles to take full advantage of this data to drive value. The corporate culture will need to adhere more to a data culture, where value is placed on collecting, cleansing, aggregating and analyzing data sources and data repositories. Business leaders must establish new models that take advantage of social media and big data assets.

Works Cited

  1. Pitt, Leyland; Berthon, Pierre; Robson, Karen.  Deciding When to Use Tablets for Business Applications.  MIS Quarterly Executive Volume 10 Number 3 September 2011.

IT Strategies and Data Analytics

In an extension to my first blog, I research quantitative analysis of enterprise IT functions to demonstrate how to create IT business value. With so much data being collected from IT systems, IT managers can use this type of pervasive data to their advantage. Functions such as maintaining system health, securing systems, and properly sizing new systems all have an impact on IT budgets.

Data analytics promotes value in IT.  Strategies using data analytics aim to create incremental value that can build on itself.  One of the keys of strategic IT value is to adopt a holistic approach to technology value, ignoring gimmicks, gadgets and marketing and instead looking at innovation as a combination of people, information and technology.  This balanced business strategy involves taking ownership of IT assets. In order for businesses to understand the value of those assets, it is crucial for IT managers to communicate that value.  Data analysis is a part of that communication.  Although data analytics can provide great insight into business technology, it will not always be successful in that goal.  The mission of data analytics as an IT strategy is to experiment often and to not be fearful of failure.

IT strategy involves aligning overall business goals and technology investment.  The first priority is for IT resources, people and functions to be planned around the overall business organization goals.  In order for such alignment to take place, IT managers need to communicate their strategy in business terms.

In many companies, funding for strategic initiatives is allocated in stages so their potential value can be reassessed between those stages.  When executives introduce a new business plan to increase market share by 15 percent with a new technology, IT managers must also meet those goals by assessing the quality of the IT infrastructure.

Executives also must have confidence that the IT assets that they purchase are sound.  There must be mutual trust, visible business support, and IT staff who are part of the business problem-solving team.   All of these factors are needed to properly determine the business value of IT.

One of the principles of business technology innovation is to aim for joint ownership of technology initiatives. The quality of the IT-business relationship is central to delivering quality IT solutions that scale and meet production requirements. Imagine a scenario where IT wasn’t aware that a utility would bring 1,000,000 new meters online within two years, each reading electrical data every hour, and instead sized only for the initial 5,000 meter deployment. This type of scenario would directly result in a utility customer having to upgrade all of their hardware only a year after the full deployment.
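The gap in that scenario is easy to quantify. The short calculation below uses only the numbers from the example (hourly reads, 5,000 meters initially, 1,000,000 at full deployment); the function name is my own and the per-read storage size is deliberately left out because the example does not give one.

```python
READS_PER_METER_PER_DAY = 24  # one read every hour, as in the scenario above

def daily_reads(meter_count: int) -> int:
    """Number of meter reads the system must ingest per day."""
    return meter_count * READS_PER_METER_PER_DAY

initial = daily_reads(5_000)          # 120,000 reads per day
full = daily_reads(1_000_000)         # 24,000,000 reads per day
growth_factor = full // initial       # 200 times the initial load

print(f"initial: {initial:,} reads/day")
print(f"full deployment: {full:,} reads/day")
print(f"growth factor: {growth_factor}x")
```

A system sized for the initial deployment would be facing two hundred times the daily ingest within two years, which is why the business plan has to be shared with IT before the hardware is purchased.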

Innovations have created new ways of automating analysis to give more visibility into IT infrastructure.  This data can be analyzed using trending and predictive analytics to determine how much growth is needed based on specific targets and parameters.

Ideally, business and IT strategies should complement and support each other.  In order to improve the IT “Value Proposition”, IT projects must stop being considered the responsibility of only IT.  The definition of value must be clearly designed and presented by IT, but there must be a greater understanding that business executives have to take leadership in making technology investments shape and align the business strategy.  IT strategy must always be closely linked with sound business strategy.

Not only should IT and business be aligned, they must also complement each other strongly in order to build the type of relationship essential to achieve business goals.  It is a mistake to consider technology projects solely the responsibility of IT or to make IT solely accountable.  Business and IT must be accountable to each other when implementing and executing IT projects.

When creating an IT Strategy that can align to business objectives, five themes should be addressed.  These include:

  • business improvement
  • business enabling
  • business opportunities
  • opportunity leverage
  • infrastructure.

Research has shown that companies that have a framework for making targeted investments in IT infrastructure will further their overall strategic development and direction. When companies fail to make IT infrastructure investment strategic, they struggle to justify or fund it. In order for IT expenditures to be justified, many companies have concentrated on determining the business value of specific IT project deliverables, because it allows projects that focus on specific business goals to be properly scoped to include IT expenditures.

How a company measures business performance can be an accumulation of metrics both on the business side and the IT side. Undelivered IT investment remains a big problem for organizations. Many CEOs and CIOs believe that their Return on Investment (ROI) expectations for IT investments have not been properly met. Although IT measures can be qualitative, meaning that expertise and knowledge from IT managers and staff contribute to understanding current and future IT growth and capacity, there are also ways to measure value quantitatively to help in decision making.

Non-technical communication is critical to executives. IT staff typically work across many organizational units and must be effective at translating technical requirements into business requirements and vice versa. Communication has become mission critical in the IT business value proposition. When deciding how to apply data analytics across the organization, IT should work with business leaders by looking at the IT function areas that produce the most data for their organization. These areas include:

  • business analysis
  • system analysis
  • data management
  • project management
  • architecture
  • application development
  • quality assurance and testing
  • infrastructure
  • application and system support
  • data center operations

IT strategies require full business integration.  When IT managers are proposing new strategies, an executive summary should be the most important part of the proposal, prototype, roadmap, technical architecture document, etc.

Along with IT system metrics, IT managers must also keep in mind business operational metrics, which are based more on labor and time. IT managers need to factor both IT and operational metrics into reports to business stakeholders. There are several ways of reporting IT strategies to the business. Key Performance Indicators (KPIs) are fundamental to business decisions and are used to track business performance, such as how often a transaction results in customer satisfaction. KPI examples include:

  • Efficiency rates
  • Customer satisfaction scores
  • Capacity rates
  • Incident reporting rate
  • Total penalties paid per incident

Balanced Scorecards are strategic initiatives that align business strategy to corporate vision and goals.  It’s typically not the responsibility of IT managers to build scorecards, but rather understand the corporate balanced scorecards when building IT strategies.

Dashboards are visual representations of success, risk, status and failure of business operations.  In a very high paced organization, they allow information to be quickly disseminated and assessed by stakeholders for business decision making.  Dashboards tend to have more quantitative analysis than other types of reporting styles.

IT Governance

In the area of governance, the International Organization for Standardization (ISO) standard ISO/IEC 27002 addresses monitoring and information security incidents. Many of the methods used in the collection of data about system health can complement the adherence to information system security. Monitors log user access and security events such as unauthorized access to information systems. Keeping security audit logs synchronized with specific system activity logs can indicate coordinated attacks on the system or denial of service (DoS) attacks, which are common against web applications and application service providers. Using data analytics can help determine whether deviations in system performance are related to security events such as unauthorized access, to security threats such as malware, or to a functional issue within the system itself. The boundaries between security and system health are consistently crossed in networking, services and databases, where the integrity and size of user traffic can be impacted. Any unauthorized access can impact the availability and integrity of an information system.
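One simple way to line up performance deviations with security events is sketched below. It flags hours where response time is more than two standard deviations from the mean and then checks whether the security audit log recorded an event in the same hour. The response times, the event hours and the two-standard-deviation threshold are all hypothetical values chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical data: average response time (ms) per hour, keyed by hour of day.
response_ms = {0: 100, 1: 105, 2: 98, 3: 102, 4: 101, 5: 240, 6: 99, 7: 103}
# Hypothetical hours in which the security audit log recorded unauthorized access.
security_event_hours = {5}

def deviating_hours(samples, threshold=2.0):
    """Return the hours whose value is more than `threshold` standard deviations from the mean."""
    values = list(samples.values())
    mu, sigma = mean(values), stdev(values)
    return {hour for hour, value in samples.items() if abs(value - mu) > threshold * sigma}

deviations = deviating_hours(response_ms)
# Deviations that coincide with a logged security event.
security_related = deviations & security_event_hours
# Deviations with no matching security event, which point to a functional review.
needs_functional_review = deviations - security_event_hours

print("coincides with security event:", sorted(security_related))
print("no security event logged:", sorted(needs_functional_review))
```

A match in time is only a starting point for an investigation, not proof of cause, and it depends on the two logs being synchronized to the same clock.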

DevOps and Agile Software Development

DevOps is a corporate culture that emphasizes collaboration between developers (typically software developers) and operational business units. DevOps provides tools and automation that can create a better customer experience by addressing issues and product changes faster. Information systems can assist this functional area by providing analytical techniques about the readiness of release product code in the software development life cycle.

The principles of DevOps are to develop and test against production-like systems, deploy with reliable processes, monitor and validate operational quality, and improve the customer feedback loop to turn issues around faster. Part of the power of data analysis is the ability to assist in agile, continuous delivery of software. Automated testing and feedback with data analytical methods can provide the most qualitative information for business. Providing data analysis on performance, error logging and customer feedback as dashboards and visualizations can help make the software development life cycle visible to all business stakeholders. As a rule of thumb, business leaders are not interested in code or complex spreadsheets. They are much more interested in quality scores, key performance indicators (KPIs) and business metrics.

IT Budgets

IT budgets are addressed in two categories: operational costs and strategic investments. Operational costs are “keep the lights on” costs that involve running IT like a utility. Operational costs include maintenance, computing, storage, network and support, to name a few examples. Strategic investment is a balance of initiative spending and coordination with organizational strategic objectives. Strategic investment becomes more efficient from the corporate to department level.

IT budgets are also about reducing costs.  Many organizations have legacy systems that are not used efficiently and have requirements that create problems for strategic investments in new innovations.  Having an application portfolio is a good way of understanding the risks versus benefits of maintaining legacy systems.  Creating a data integration strategy as part of a data analysis ecosystem allows businesses to fully utilize all of their assets.  Most of these systems contain metadata that has long since been de-supported.  Part of the power of data analysis services such as online analytical processing (OLAP), business intelligence (BI) and master data management (MDM) is the ability to integrate with legacy systems.

Budgets are a key component of corporate performance management. The most important thing to understand about IT budgets is that they assist in the establishment of strategic goals. Systems provide data about the various levels of utilization of resources. An example question that a business client would pose to an IT manager would be:

What are the annual storage requirements of our Enterprise Billing System?

This question could be answered by tracking the amount of storage consumed throughout the year, based on the size in megabytes of the data sets stored and the interval of time over which those data sets are stored. From there an IT manager can translate that requirement into yearly terms, which in turn gives the budgeting team a metric of how much storage they need to purchase or maintain each year.

For large corporate firms in utilities, energy and manufacturing, where there could literally be hundreds of servers, there needs to be a more centralized structure for IT operations budgets. The mandate given to IT managers in centralized IT budget structures is to standardize and streamline multiple processes on hardware and software services. The introduction of both private and public cloud architectures, and virtual architectures, has made this possible. Another question likely to be posed to IT managers:
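As a rough illustration of that translation, the snippet below adds up monthly storage growth in megabytes and converts it to a yearly figure with some headroom. The monthly numbers and the 20 percent headroom are made-up values, not measurements from a real billing system.

```python
# Hypothetical storage added by the billing system each month, in megabytes.
monthly_mb_added = [
    42_000, 40_500, 43_200, 41_800, 44_100, 45_000,
    43_700, 44_900, 46_200, 45_500, 47_000, 48_100,
]

HEADROOM = 0.20  # assumption: budget 20% above measured growth

annual_mb = sum(monthly_mb_added)
annual_gb = annual_mb / 1_000               # decimal gigabytes
budgeted_gb = annual_gb * (1 + HEADROOM)

print(f"measured growth: {annual_gb:.0f} GB per year")
print(f"budget request:  {budgeted_gb:.0f} GB per year")
```

The same tally can be kept per system, which makes it easier to show the budgeting team which applications are driving storage purchases.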

Can our physical servers be migrated to a cloud or virtual infrastructure with higher performance and availability?

Having the right kind of analysis on current systems helps to ensure that dollars are spent appropriately when systems are consolidated or provisioned, and that they perform ideally according to business requirements.  IT managers are receiving pressure from executives to do more with less.  Data analysis has been a catalyst for innovation in cross delivery business development through the integration of systems and data.  Operational questions regarding IT include:

How much operational labor is expended providing IT services to an organization?

How much of the IT budget is expended implementing changes to infrastructure?

Other budget concerns include transitioning from a physical architecture to a cloud service based model. Typically, with public cloud architecture, the resources are provisioned and managed by a hosting team. Most cloud services will propose “elastic” solutions such as Amazon’s EC2 solution or Microsoft Azure, which allow companies to use only what they need. Therefore, the methodologies of sizing may not be as appropriate in such architecture. However, in very data intensive industries where there are large scale architectures and multiple interactions of business and server processes, placing everything in a cloud domain is not only impractical, but very expensive and potentially illegal. For example, in the utilities industry, state regulations may prohibit customer data from being off site. An energy company’s proprietary information stored in an international data center that does not recognize the source country’s regulatory body could represent a public trust violation.

If migrating from a multi-tier architecture to completely cloud-based services, it’s important to understand the types of cost involved. Cloud-based services typically have a subscription model, where all the management, configuration and provisioning (unless self-provisioned) is handled by the hosting company. There is a contract that specifies a level of service and support, and the cost reflects how many resources the company is utilizing and the level of service it needs to serve its customers. Payment terms can be yearly or quarterly, and there is usually a renewal date when payment is due [20].

The IT Value Proposition

The IT Values Proposition

IT value measures the worth and effectiveness of business technology solutions. It is mostly a subjective assessment of how a business measures its assets as they pertain to business goals. Value in information technology is typically defined in terms of Return on Investment (ROI), Key Performance Indicators (KPIs) and other economic measures. IT is most valuable when tied to business goals and objectives. Adding value to IT also includes ensuring that IT assets are part of a data analytics ecosystem. A data analytics ecosystem is where IT assets generate insight into how businesses produce, collect, store and learn from data and data analytics. Data analytics is an important part of the IT value proposition because of the tremendous treasure trove of knowledge and insight that can be gained from it. A data analytics ecosystem helps to create processes to turn data into actionable business decisions.

Other best practices in IT value include:

  • Evaluate the corporate business model in order to promote innovation.
  • Have strategic themes around data collection, dissemination and analysis.
  • Get the right people involved. This can include data scientists, engineers, business analysts, and many others.