Saturday, December 2nd, 2017
DataSciCon.Tech is a data science conference that was held in Atlanta, Georgia, from Wednesday, November 29th to Friday, December 1st, and included both workshops and conference lectures. It took place at the Global Learning Center on the campus of Georgia Tech. This was the first year of the conference, and I attended to get a sense of the data science scene in Atlanta. Overall, the experience was very enlightening and introduced me to the dynamic and intensive work being conducted in the area of data science.
Keynote speaker Rob High, CTO of IBM Watson, discussing IBM Watson and Artificial Intelligence (DataSciCon.Tech 2017).
Four workshop tracks were held on Wednesday: Introduction to Machine Learning with Python and TensorFlow, Tableau Hands-on Workshop, Data Science for Discovery, Innovation and Value Creation, and Data Science with R Workshop. I elected to attend the Machine Learning with Python and TensorFlow track. TensorFlow is an open source software library for numerical computation using data flow graphs, and it is widely used for Machine Learning.
To prepare for the conference, I installed TensorFlow following the instructions at https://www.tensorflow.org/install. In addition to TensorFlow, I downloaded Anaconda (https://www.anaconda.com/), a great Python distribution for those practicing data science programming that includes many of the Python data science packages, such as NumPy and scikit-learn.
Among the predictive and classification modeling techniques discussed in the workshop:
- Neural Networks
- Naive Bayes
- Linear Regression
- k-nearest neighbor (kNN) analysis
These modeling techniques are popular for classifying data and for predictive analysis. Few training sessions on Python, scikit-learn or NumPy go into these algorithms in detail because of the varying mathematical backgrounds of the audience members. For the course, we used Jupyter Notebook, a web-based Python development environment that allows you to share and present your code and results in a browser. Jupyter Notebook can also be hosted in Microsoft Azure, as well as on other cloud platforms such as Anaconda Cloud and AWS. To host a Python Jupyter Notebook in Azure, sign in at https://notebooks.azure.com.
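To make one of the techniques listed above a little more concrete, here is a minimal sketch of k-nearest neighbor classification. It is my own illustration, not code from the workshop: the function name `knn_predict` and the toy data are made up, and it uses only the Python standard library instead of scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: two clusters labelled "a" and "b"
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
```

In practice you would likely reach for a library implementation, but the idea is the same: measure distances, pick the closest neighbors, and let them vote.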
TensorFlow provides a series of functions that use neural networks and machine learning to train, test and score models. An advantage of TensorFlow presented at the workshop is its ability to train models quickly, which matters because training models on large data sets is a processing-intensive operation. It is particularly powerful on the Graphics Processing Unit (GPU) architecture popular for Machine Learning and Deep Learning.
Download TensorFlow from http://tensorflow.org. The website also includes a neural network playground at http://playground.tensorflow.org.
Source: http://playground.tensorflow.org (DataSciCon.Tech)
I’m going to break down the sessions I attended into the main topics that were covered, so this is a very high-level, 30,000-foot view of the conference. My plan is to write a few more blog posts on the topic that will go into my work as an aspiring data scientist/data architect. All the information in this blog is based on information presented at the DataSciCon.Tech 2017 conference.
Machine Learning and Artificial Intelligence
The conference emphasized Artificial Intelligence and Machine Learning pretty heavily. Artificial Intelligence was discussed more in terms of theory and direct applications than design and development. There were a few demonstrations of the popular IBM Watson Artificial Intelligence system, but I want to focus this blog primarily on Machine Learning, as it’s something that interests me and other data architects. Artificial Intelligence and Machine Learning are both based on computerized learning algorithms. Machine Learning uses past data to learn, predict events or identify anomalies.
Another key fact presented at the conference is the number of open source projects and companies that have produced software modules, libraries and packages devoted to the use and implementation of Machine Learning in business applications. I strongly recommend anyone interested in learning more to research the software solutions discussed in this blog and how they can be implemented.
For those who are new to the concept of Machine Learning (like me), essentially it is defined as follows:
Machine Learning is a subset of Artificial Intelligence that focuses on creating models that learn and predict events based on past data without a human computer programmer having to change code to adapt to new events. An example would be a spam filter learning new exploits and then blocking those exploits.
Machine Learning should be used when you cannot effectively hand-code the solution or when a hand-coded solution cannot scale.
Building and using Machine Learning models is where the data science part comes into play. You cannot simply build a model in Python, R, or some other tool and then start feeding it data to help make decisions and create insight; you have to do what data scientists do in order to make certain that the model is accurate and that its error is minimized. Although there were some variations between sessions, in general every data science process presented included these steps:
- Doing exploratory data analysis.
- Filtering, imputing, and cleaning data.
- Creating testing, training and/or validation data sets.
- Training multiple models and comparing and scoring those models with real data.
- Applying the model.
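The steps above can be sketched end to end in a few lines. This is only an illustration under my own assumptions: the data is made up, the helper names (`impute_missing`, `split_rows`, `fit_mean_model`, `fit_linear_model`, `rmse`) are hypothetical, and the split simply holds out every fourth row rather than sampling randomly as a real project usually would.

```python
import statistics

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def split_rows(rows, test_every=4):
    """Hold out every `test_every`-th row for testing; keep the rest for training."""
    train = [row for i, row in enumerate(rows) if i % test_every != 0]
    test = [row for i, row in enumerate(rows) if i % test_every == 0]
    return train, test

def fit_mean_model(train):
    """Baseline model: always predict the mean target of the training set."""
    mean_y = statistics.fmean(y for _, y in train)
    return lambda x: mean_y

def fit_linear_model(train):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def rmse(model, rows):
    """Root mean squared error of a model on held-out rows."""
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5

# Made-up data: y is roughly 2x + 1, with two missing targets
xs = list(range(20))
raw_ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]
raw_ys[3] = None
raw_ys[11] = None

ys = impute_missing(raw_ys)                       # cleaning and imputing
train, test = split_rows(list(zip(xs, ys)))       # training and testing sets
models = {"mean": fit_mean_model(train), "linear": fit_linear_model(train)}
scores = {name: rmse(model, test) for name, model in models.items()}  # scoring
best_name = min(scores, key=scores.get)           # the model you would apply
```

Comparing against a trivial baseline such as the mean model is a cheap way to check that a more complex model is actually learning something.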
The type of Machine Learning techniques that were presented at the conference included:
- Neural Networks
- Clustering (unsupervised learning)
- Bayesian Networks
- Deep Learning
Neural Networks (NN), which use hyperbolic functions (such as the hyperbolic tangent) to create prediction formulas, are transformations based on weight and bias estimates that are tuned to come up with the best predictor. A Neural Network sometimes performs better than logistic or linear regression because it can fit nonlinear data that does not adhere to the assumptions of standard regression methods. The disadvantages of NNs are that they can overfit the data and can require a great deal of time to train and build. Neural Networks are made up of layers, which are divided into input layers, hidden layers and output layers. Each layer has a number of nodes that make up the predictor. Neural Networks are good for:
- Managing missing values
- Handling extreme or unusual values
- Non-numeric inputs
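To illustrate the layer structure described above, here is a minimal sketch of a forward pass through one hidden layer and one output layer. The weights and biases are made-up numbers (a real network would learn them during training), and the helper names `dense_layer` and `forward` are my own.

```python
import math

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per node."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs):
    # Hidden layer: 2 inputs -> 3 nodes, hyperbolic tangent activation
    hidden = dense_layer(
        inputs,
        weights=[[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]],  # made-up weights
        biases=[0.0, 0.1, -0.1],
        activation=math.tanh,
    )
    # Output layer: 3 hidden nodes -> 1 node, squashed to a probability
    output = dense_layer(
        hidden,
        weights=[[1.0, -1.0, 0.5]],
        biases=[0.2],
        activation=sigmoid,
    )
    return hidden, output[0]

hidden_values, probability = forward([1.0, 2.0])
```

Training a network means adjusting those weights and biases until the output matches the known answers as closely as possible.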
My favorite learning experience from the conference was discovering the many software tools, modules, packages and libraries dedicated to Machine Learning algorithms and techniques. I will not attempt to explain all the tools presented at the conference, since many are very new to me, but I strongly recommend researching them prior to building an ML architecture.
If you’re interested in building ML solutions with public ML datasets, these sites are available:
- University of California Irvine (UCI) has a huge machine learning repository for practicing techniques. This repository can be accessed at archive.ics.uci.edu/ml/index.php.
- Amazon Web Services datasets are available to the public at https://aws.amazon.com/datasets/.
- Kaggle is a data science competition website that awards prizes to teams for the best ML models. Datasets are located at https://www.kaggle.com/datasets.
- University of Michigan Sentiment Data.
- The time series data repositories are located at https://fred.stlouisfed.org/categories.
Several cloud services also host ML services such as Amazon Machine Learning (https://aws.amazon.com/aml), Google ML (https://cloud.google.com/ml) and Microsoft Azure ML (https://studio.azureml.net/).
Anomaly Detection in IoT and Customer Transactions using Supervised Machine Learning
Analytics and the “Internet of Things” (IoT) have been and will continue to be the main focus of my research and my work experience. Anomaly Detection is an application of Machine Learning that has many uses in the IoT and sensor domain.
Another area of focus is customer transactions, particularly in credit card transactions. Fraud detection allows credit card companies to detect anomalous events in transactions in order to alert the company and its customer of suspicious activity.
In IoT, applications such as monitoring computer performance in data centers and network intrusion are popular for anomaly detection. Applying machine learning to an IoT architecture requires the following layers:
- Sensor data source layer, where data is collected at the endpoint.
- Edge gateway layer, where the data is temporarily stored in a private, secured network through a series of authentication and certification protocols.
- Data Processing layer, where the Machine Learning processes train and fit data and where the ML model learns.
- Application and Analytics layer, where business processes utilize these models.
In the session, the architecture was built using:
- MQTT, which is a scalable machine-to-machine IoT message connectivity protocol.
- Apache NiFi, which is an open source data flow software package with features such as data flow building, graphing, data buffering, prioritized queuing, push and pull models, and visual command and control.
- There is also a version of Apache NiFi called MiNiFi, which is built in C++ and can run on constrained computer systems such as a Raspberry Pi.
- HDFS and HBase, which are storage platforms in the Hadoop ecosystem.
- Kudu, which bridges the analytical gap between HDFS and HBase.
- Spark and Impala, which are suited for complex ETL processing, machine learning and stream processing.
Strategies of Anomaly Detection error minimization include:
- Moving Z-Score
- Gaussian Mixture Models
- Exponential smoothing
- Auto-encoders and deep auto-encoders
- Concept Drift
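As an example of the first strategy in the list, here is a minimal sketch of a moving z-score detector. It is my own illustration rather than the speakers' implementation: the function name, the window size, the threshold of 3 standard deviations and the sensor readings are all assumptions.

```python
import statistics

def moving_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            continue  # flat window: the z-score is undefined, so skip it
        z = (values[i] - mean) / stdev
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies

# Made-up sensor readings with one obvious spike at index 6
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.1, 35.0, 20.0, 19.9, 20.1]
flagged = moving_zscore_anomalies(readings, window=5, threshold=3.0)
```

One thing to notice is that the spike inflates the standard deviation of the windows that follow it, so a simple detector like this can be temporarily blind right after an anomaly.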
Rahul Gupta, a data scientist from Capital One, where the customer transaction ML application was built, explained that most of the initial trial development was done with a limited amount of data using Tableau. When the solution needed to scale for live real-time data, Capital One wanted to test and implement an open source solution using Sparkling Water. Sparkling Water is a tool that allows developers to rapidly test and deploy machine learning by combining the scalable ML algorithms of H2O (www.h2o.ai) with Apache Spark. It is part of the H2O product suite, which is used to build ML solutions for large implementations using in-memory processing.
Rahul Gupta of Capital One demonstrates the open source, cloud-based solution to detect anomalies with customer transactions (DataSciCon.Tech 2017).
For customer transaction anomaly detection, the H2O Gradient Boosting Machine (GBM) model accepts external explanatory variables such as:
- Number of accounts having payment due
- Change orders
- Payment due dates
GBM also allows data to be filtered or excluded (e.g., excluding incident data from the training set).
To learn more about the H2O GBM model go to http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html.
Graphics Processing Unit (GPU)
In order to push for higher performance in ML model training, many developers are expanding beyond the limitations of CPUs. While memory and storage capacities have increased nearly exponentially, processor core and caching performance has not kept up with the demands of big data analytics and machine learning platforms. GPUs work with CPUs to accelerate machine learning and deep learning; the speakers cited performance increases of 3 to 10 times compared to CPU-only processing. Nvidia is the leading vendor in GPU technology.
Deep Learning is a type of Neural Network ML method in which error is minimized with optimization techniques such as gradient descent, which iteratively adjusts the model’s weights to reduce an error measure such as RMSE. With Deep Learning you have more layers and nodes in your model. Types of Neural Networks used in Deep Learning include Recurrent Neural Networks (RNN); Artificial Neural Network (ANN) is the general term for this family of models. To learn more about Recurrent Neural Networks go to http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
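Gradient descent is easier to understand on a tiny problem than on a deep network. The sketch below fits a straight line by repeatedly nudging a weight and a bias in the direction that lowers the mean squared error. The data, learning rate and step count are values I picked for illustration; real deep learning frameworks compute these gradients automatically.

```python
def gradient_descent_fit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point with the current parameters
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step downhill, against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 2x + 1
w, b = gradient_descent_fit(xs, ys)
```

If the learning rate is too large the updates overshoot and the fit diverges, which is one reason tuning it matters so much in practice.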
Long Short-Term Memory (LSTM) networks are a special kind of RNN used for building hidden layers. An LSTM unit is composed of four main components: a cell, an input gate, an output gate and a forget gate. LSTMs are well suited to classifying and predicting time series when there are time lags of unknown size and duration between events. A popular website to learn about deep learning for LSTM is http://deeplearning.net/tutorial/lstm.html.
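To show how the cell and the three gates fit together, here is a minimal sketch of a single LSTM unit working on one number at a time. The parameters are hand-picked, hypothetical values (a trained network would learn them), and real implementations operate on vectors and matrices rather than scalars.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM unit; returns the new hidden and cell state."""
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)           # how much of the old cell state to keep
    input_ = gate("input", sigmoid)            # how much new information to admit
    candidate = gate("candidate", math.tanh)   # proposed new cell content
    output = gate("output", sigmoid)           # how much of the cell to expose

    c_new = forget * c_prev + input_ * candidate
    h_new = output * math.tanh(c_new)
    return h_new, c_new

# Hypothetical parameters per gate: (input weight, recurrent weight, bias)
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

h, c = 0.0, 0.0
states = []
for x in [1.0, 0.5, -0.5, 0.0]:   # a short made-up input sequence
    h, c = lstm_step(x, h, c, params)
    states.append((h, c))
```

The cell state `c` is what carries information across time steps, and the forget gate decides how much of it survives each step.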
Functional Programming for Machine Learning
The functional programming paradigm treats computation as the evaluation of mathematical functions, which are stateless, and helps prevent side effects from mutable memory containers and iterative logic.
Properties of Functional Programming include:
- First-class and higher-order functions
- Pure functions
- Lazy evaluation which prevents objects being loaded into memory before they are used.
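These three properties are not specific to any one language. Here is a small sketch of each in Python (the session itself focused on Clojure, so this is my own translation of the ideas, with made-up function names).

```python
from functools import reduce
from itertools import count, islice

def square(x):
    """Pure function: the result depends only on the argument, with no side effects."""
    return x * x

def compose(f, g):
    """Higher-order function: takes two functions and returns a new one."""
    return lambda x: f(g(x))

increment_then_square = compose(square, lambda x: x + 1)

# Lazy evaluation: count(1) is an infinite sequence, but values are only
# produced when islice asks for them, so nothing large is held in memory.
lazy_squares = map(square, count(1))
first_five = list(islice(lazy_squares, 5))

# First-class functions: a function is passed to reduce like any other value.
total = reduce(lambda acc, x: acc + x, first_five, 0)
```

Because `square` is pure, calling it a million times in parallel is safe, which is part of why this style appeals for large data workloads.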
A popular functional programming language, Clojure (https://clojure.org/), uses expressions evaluated at run-time. Clojure is a dynamic, general-purpose programming language designed for robust infrastructures and multithreaded programming. Since its lazy sequences typically use memory only when needed, it can be efficient for performing machine learning and deep learning.
Functional Programming examples of ML Programming include:
- Cortex (Library)
- Cortex is an open source machine learning toolkit that can execute algorithms on CPU and GPU. It is designed to implement as much of the neural network as possible in pure Clojure.
- Highly transparent and highly customizable.
Time Series Forecasting: Statistical and Machine Learning Models
Finally, I attended sessions about time series forecasting and time dependent data analysis.
Vector Autoregressive (VAR) models and Recurrent Neural Networks are used primarily for time series forecasting. A related statistical model is the Autoregressive Moving Average model with Explanatory Variables (ARMAX). These models are capable of addressing the dynamic properties of data. However, there are limitations: a single-equation model such as ARMAX is not able to capture relationships that are bidirectional, whereas a VAR model treats every series as depending on the past values of all the series.
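In a VAR model of order 1, the forecast for each series is a weighted sum of the previous values of every series in the system, plus an intercept. Here is a minimal sketch of a one-step forecast for two series. The coefficient matrix and intercepts are made-up numbers; in practice they are estimated from historical data.

```python
def var1_forecast(previous, coefficients, intercepts):
    """One-step VAR(1) forecast: y_t = intercepts + coefficients * y_(t-1)."""
    return [
        c + sum(a * y for a, y in zip(row, previous))
        for row, c in zip(coefficients, intercepts)
    ]

# Made-up parameters for two series. The off-diagonal entries (0.2 and 0.1)
# are what let each series influence the other.
coefficients = [
    [0.5, 0.2],
    [0.1, 0.6],
]
intercepts = [1.0, 0.5]

forecast = var1_forecast([10.0, 20.0], coefficients, intercepts)
```

Feeding the forecast back in as the new `previous` value gives a multi-step forecast.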
Timestamps record the exact moment (typically down to the millisecond or microsecond) at which an event occurs. Timestamps are most effective when captured at the edge, on the endpoints where the event directly occurs. The further away (downstream) a timestamp is recorded from the data sensor or data source, the more likely there will be overlap. Overlaps and gaps are an issue with time dependent data. When gaps or overlaps occur, resampling or recollecting the data may be necessary for correction and accuracy. Techniques for resampling include taking median values over a window of time.
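Here is a minimal sketch of the median resampling technique mentioned above. The samples are made up, timestamps are assumed to be whole seconds, and `resample_median` is a hypothetical helper name.

```python
import statistics
from collections import defaultdict

def resample_median(samples, window_seconds):
    """Bucket (timestamp, value) samples into fixed windows and take the median of each."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Start of the window this sample falls into
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {
        start: statistics.median(values)
        for start, values in sorted(buckets.items())
    }

# Made-up readings: two overlapping samples at t=2 and an outlier at t=7
samples = [(0, 10.0), (2, 12.0), (2, 11.0), (7, 50.0),
           (11, 13.0), (14, 15.0), (19, 14.0)]
resampled = resample_median(samples, window_seconds=10)
```

The median is a good choice here because a single bad reading, like the 50.0 above, barely moves it, whereas it would pull a mean noticeably upward.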
Traditional time series techniques assume stationary data (no trends or seasonality) and constant variance over time. Autocorrelation measures how strongly a series is linearly related to lagged copies of itself, which gives us an idea of the time dependence in the data. It’s important to make sure that the model we deploy is appropriate for the series. In the case of Vector Autoregressive (VAR) models, each of the series needs to be stationary, because VAR assumes stationarity.
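A quick way to see why stationarity matters is to compute the autocorrelation of a trending series before and after differencing it. The sketch below uses made-up data and the plain sample autocorrelation formula; first differencing is one common way to remove a trend, not necessarily the approach used in the session.

```python
import statistics

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    mean = statistics.fmean(series)
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum(
        (series[i] - mean) * (series[i - lag] - mean)
        for i in range(lag, len(series))
    )
    return numerator / denominator

def difference(series):
    """First difference: each value minus the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

# Made-up series: a linear trend plus a small alternating wobble
trending = [t + (0.5 if t % 2 == 0 else -0.5) for t in range(20)]

acf_raw = autocorrelation(trending, lag=1)               # dominated by the trend
acf_diff = autocorrelation(difference(trending), lag=1)  # trend removed
```

With the trend in place the lag-1 autocorrelation is strongly positive, and once the trend is differenced away only the alternating wobble is left, which shows up as a negative value.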
When building ML models for time series data, create training and testing sets that reflect your forecasting horizon (number of days or months into the future, long range vs. short range forecast). When determining the size of your training and testing sets, size the data to reflect the forecast horizon. For example, if your forecast horizon is four months, hold out sequential data covering a period of four months. Do not split sequences.
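Here is a minimal sketch of that kind of split, where the test set is simply the last stretch of the series and its length equals the forecast horizon. The monthly figures are made up and `horizon_split` is a hypothetical helper name.

```python
def horizon_split(series, horizon):
    """Hold out the final `horizon` observations as the test set, preserving order."""
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]

# Made-up monthly values for one year
monthly_sales = [100, 104, 110, 108, 115, 120, 126, 123, 130, 137, 141, 150]

# Four-month forecast horizon: train on the first eight months, test on the last four
train, test = horizon_split(monthly_sales, horizon=4)
```

Unlike the random splits used for other kinds of data, nothing is shuffled here, so the model is never trained on observations that come after the ones it is tested on.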
The time series data repositories are located at https://fred.stlouisfed.org/categories.
First of all, I’d like to get as much feedback on this blog as possible. I’d love to hear from others about their experience with Machine Learning and what sort of solutions they may have come up with. I’m excited to learn more and become an ML contributor to the technology community and to my employer. If there is anything I got wrong in this blog, please comment and let me know. This entire blog was written in one evening based on notes I had taken while at the conference. If you’re interested in getting your feet wet in this area prior to building an ML prediction platform, I recommend downloading scikit-learn, TensorFlow and Anaconda and purchasing the book Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron. Begin practicing the examples in the book on your computer after everything is installed. Now that the conference has concluded, I plan to expand my knowledge in this domain and learn which Machine Learning applications I can build. I plan to do more work in my spare time and will post my work on Data Flux.
With everything which seems to be developing within this particular subject matter, your points of view happen to be very stimulating. Even so, I am sorry, because I cannot give credence to your whole strategy, albeit radical nonetheless. It appears to me that your remarks are actually not entirely validated and in actuality you are yourself not really wholly convinced of the argument. In any case I did enjoy looking at it.
Thank you for your feedback. Being very new to data science, I completely agree that my point of view is evolving, and I’m hoping to understand more about the subject matter and that my opinions and objective scope will evolve. Thanks again for commenting!