Using R to Create Maps from GIS Shapefiles

If you are new to the R programming language, like I am, you may not realize that ESRI GIS shapefiles, which are used for map layering with latitude and longitude coordinates, can be plotted in R. You will need to load the following packages:

library(rgdal)    # Bindings for the Geospatial Data Abstraction Library
library(rgeos)    # Interface to the GEOS geometry engine
library(maptools) # Tools for reading and handling spatial objects
library(ggplot2)  # Popular plotting package

The code needs to point to the directory containing the shapefile and its dependencies. The following data comes from a set of GIS documents (a shapefile and its dependencies) for geospatial layers of Antarctica.

# Confirm the shapefile exists before reading it
file.exists('../GIS/gis_osm_natural_a_free_1.shp')
# Read the layer (dsn = folder, layer = file name without the .shp extension)
map <- readOGR(dsn="../GIS", layer="gis_osm_natural_a_free_1", verbose=FALSE)
# Reproject to longitude/latitude on the WGS84 datum
map_wgs84 <- spTransform(map, CRS("+proj=longlat +datum=WGS84"))
#str(map_wgs84)
#summary(map_wgs84)
# Export the attribute table for inspection
write.csv(map_wgs84, "../GIS/gis_osm_natural_a_free_2.csv", row.names=TRUE)
summary(map_wgs84)
# Quick base-R plot of the reprojected layer
plot(map_wgs84, axes=TRUE)
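
Since ggplot2 is loaded above but not used, here is a minimal sketch of how the same reprojected layer could be drawn with ggplot2 instead of base plot(). It assumes the older sp-based fortify() workflow; long, lat and group are the columns fortify() produces.

# Sketch: draw the reprojected layer with ggplot2 (assumes the fortify() workflow)
map_df <- fortify(map_wgs84)   # convert the Spatial object to a data frame of polygon vertices
ggplot(map_df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey80", colour = "grey30", size = 0.2) +
  coord_quickmap() +           # keep an approximately correct aspect ratio for lon/lat
  labs(title = "OSM natural features (WGS84)", x = "Longitude", y = "Latitude")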

Machine Learning with Azure ML Studio

Directions on How to Build the Predictive Model In Microsoft Azure ML

  • Sign in to Microsoft Azure using your login credentials in the Azure portal.
  • Create a workspace in which to store your work.
    • In the upper-left corner of Azure portal, select + Create a resource.
    • Use the search bar to type Machine Learning.
    • Select Machine Learning.
    • In the Machine Learning pane, select Create to begin.
    • Provide the following information to configure your new workspace:
      • Subscription – Select the Azure subscription that you would like to use.
      • Resource group – Create a name for the resource group that will hold the resources for your Azure solution.
      • Workspace name – Create a unique name that identifies your workspace.
      • Region – Select the region closest to your users to reduce latency.
      • Storage account – created by default
      • Key Vault – created by default
      • Application insights – created by default
    • When you have completed configuring the workspace, select Review + Create.
    • Review the settings and make any additional changes or corrections. Lastly, select Create. When the workspace deployment has completed, you will see the message “Your deployment is complete.” Please see the visual below as a reference.
  • To launch your workspace, click Go to resource.
  • Next, click the blue Launch Studio button under Manage your Machine Learning Lifecycle. Now you are ready to begin!
  • Click on Experiments in the left panel
  • Click on NEW in the lower left corner 
  • Select Blank Experiment. The new experiment is created with a default name. You can change the name at the top of the page. 
  • Upload the data above into ML Studio.
    • Drag the datasets onto the experiment canvas. (We uploaded preprocessed data.)
    • If you would like to see what the data looks like, click the output port at the bottom of the dataset and select Visualize. Given this data, we are going to try to predict whether the IoT sensors have communication errors.
  • Next, prepare the data
    • Remove unnecessary columns/data.
      • Type “Select Columns” in the search box and select the Select Columns in Dataset module, then drag and drop it onto the canvas. This allows you to exclude any columns that you do not want in the model.
      • Connect Select Columns in Dataset to the Data on the canvas.
    • Choose and Apply a Learning Algorithm
      • Click on Data Transformation in the left column.
        • Next, click on the drop down Manipulation.
        • Drag Edit Metadata onto the canvas (use this to change the metadata associated with columns in the dataset; it tells the downstream modules in Azure Machine Learning how to use the selected columns).
      • Split the data 
        • Then, click on the drop down Sample and Split.
        • Choose Split Data, add it to the canvas, and connect it to Edit Metadata.
        • Click on Split Data, find the Fraction of rows in the output dataset, and set it to 0.80. You are splitting the data so that 80% is used to train the model and 20% is used to test it.
  • Then train the model (a minimal R sketch of this pipeline appears after these steps).
    • Choose the drop down under Machine Learning
    • Choose the drop down under Initialize Model
    • Choose the drop down under Anomaly Detection 
    • Click on PCA-Based Anomaly Detection, add it to the canvas, and connect it to Split Data.
    • Choose the drop down under Machine Learning
    • Choose the drop down under Initialize Model
    • Choose the drop down under Anomaly Detection 
    • Click on One-Class Support Vector Machine, add it to the canvas, and connect it to Split Data.
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Train
    • Click on Tune Model Hyperparameters and add this to the canvas and connect with the Split Data.
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Train
    • Click on Train Anomaly Detection Model
  • Then score the model 
    • Choose the drop down under Machine Learning
    • Then, choose the drop down – Score
    • Click on Score Model
  • Normalize the data
    • Choose the drop down under Data Transformation
    • Then, choose the drop down under Scale and Reduce
    • Click on Normalize Data
  • Evaluate the model – this will compare the one-class SVM and PCA-based anomaly detectors.
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Evaluate
    • Click on Evaluate Model
  • Click Run at the bottom of the screen to run the experiment. Below is how the model should look. Please click on the link to use our experiment (Experiment Name: IOT Anomaly Detection) for further reference. This link requires an Azure ML account. To access the gallery, click the following public link: https://gallery.cortanaintelligence.com/Experiment/IOT-Anomaly-Detection
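
For reference, here is a minimal R sketch of the same idea the Studio experiment implements: an 80/20 split, a one-class SVM, and a PCA reconstruction-error detector. It is a sketch under assumptions rather than the experiment itself; the e1071 package, the data frame name iot, and the use of only numeric feature columns are assumptions on our part.

library(e1071)   # provides the one-class SVM

set.seed(42)
idx   <- sample(nrow(iot), size = 0.8 * nrow(iot))   # 80% train / 20% test
train <- iot[idx, ]
test  <- iot[-idx, ]

# One-class SVM: learns the boundary of "normal" traffic from the training rows
oc_svm   <- svm(train, type = "one-classification", kernel = "radial", nu = 0.05)
svm_flag <- !predict(oc_svm, test)                   # TRUE = flagged as an anomaly

# PCA-based detector: rows with a large reconstruction error are likely anomalies
pca       <- prcomp(train, center = TRUE, scale. = TRUE)
k         <- 3                                       # number of principal components kept
proj      <- predict(pca, test)[, 1:k] %*% t(pca$rotation[, 1:k])
recon_err <- rowSums((scale(as.matrix(test), pca$center, pca$scale) - proj)^2)
pca_flag  <- recon_err > quantile(recon_err, 0.95)   # top 5% scored as anomalies

In the Studio experiment, Tune Model Hyperparameters and Evaluate Model handle the parameter search and the comparison; here the nu value, the RBF kernel and the 95th-percentile cutoff are simply illustrative choices.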

Derek Moore, Erica Davis, and Hank Galbraith, authors.

Anomaly and Intrusion Detection in IoT Networks with Enterprise Scale Endpoint Communication – Pt 2

Derek Moore, Erica Davis, and Hank Galbraith, authors.

Part two of a series of LinkedIn articles based on Cognitive Computing and Artificial Intelligence Applications

Background

Several high-profile ransomware attacks have called attention to IoT network security. Assessments of security vulnerabilities and penetration testing have become increasingly important parts of sound design. Most of this assessment and testing takes place at the software and hardware level, but a broader approach is vital to the security of IoT networks. Protocol and traffic analysis is particularly important for structured, dedicated IoT networks, since communication and endpoints are tracked and managed. Understanding all the risks posed to these types of networks allows for a more complete risk management plan and strategy. Besides network challenges, there are challenges to scalability, operability, channels, and the information being transmitted and collected by such networks. In IoT networks, the search for vulnerabilities spans the network architecture, endpoint devices and services, where services include the hardware, software and processes that make up the overall IoT architecture. Building a threat assessment or map as part of an overall security plan, and updating it on a scheduled basis, allows security professionals and stakeholders to manage all possible threats to the architecture. Whenever possible, simulating likely attack vectors, understanding the behavior of such attacks and then modeling them helps build out an overall security management plan.

Open ports, SQL injection flaws, unencrypted services, insecure network interfaces, buffer overflow risks, missing firewall protocols, weak authorization settings, and insecure web interfaces are among the types of vulnerabilities found in IoT networks and devices.

Where is an impending attack located? Is it occurring at the device, the server or the service? Is it occurring where the data is stored, or while the data is in transit? What types of attacks can be identified? They include distributed denial of service, man-in-the-middle, ransomware, botnets, spoofing, account penetrations, and more.

Business Use Case

For this business use case research study, a fictional company was created. The company is a national farmland and agricultural cooperative that supplies food to local and state markets. Part of the company’s IT infrastructure is an IoT network that uses endpoint devices for monitoring and controlling temperature, humidity and moisture across the company’s large agricultural farmlands. This network has over 2,000 IoT devices in operation on 800 acres. Any intrusion into the network by a rogue service or bad actor could have consequences for delivering fresh, quality produce on time. The network design in the simulation below is a concept of this agricultural network. Our team created a simulation network using Cisco Packet Tracer, a tool that allows users to create and simulate packet traffic throughout a computerized network at multiple OSI layers.

Simulated data was generated with the Packet Tracer simulator. The simulation network below uses multiple routers, switches, servers and IoT devices, and carries packet types such as TCP, UDP, RIPv4 and ICMP.

Network Simulation

Below is a simulation of packet routing throughout the IoT network.

Cisco Packet Tracer Simulation for IoT network.  Packet logging to test anomaly detection deep learning models.

Problem Statement

Our fictional company will be the basis of our team’s mock network for monitoring for intrusions and anomalies. Being a simulated IoT network, it contains only a few dozen IoT-enabled sensors and devices such as sprinklers, temperature and water level sensors, and drains. Since our model is designed for large-scale IoT deployment, it will be trained on publicly available data, while the simulated data will serve as a way to score the accuracy of the model. The simulation can generate the types of threats that would create anomalies. It is important to distinguish between an attack and a known issue or event (see part one of this research for IoT communication issues). The company is aware of those miscommunications and has open work orders for them. The goal is for our model to detect an actual attack on the IP network by a bad actor. Although a miscommunication is technically an anomaly, it is known to the IT staff and should not raise an alarm. Miscommunicating devices are fairly easy for people to spot, but for a machine learning or deep learning model they can be trickier. Raising a security alarm for routine miscommunication issues originating from the endpoints would produce a prevalence of false positives (FP) in a machine learning confusion matrix.
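
To make the false-positive point concrete, here is a small illustrative R example; the label vectors are made up. Known miscommunication events that get scored as attacks land in the false-positive cell of the confusion matrix.

# Illustrative only: made-up labels showing where known miscommunications end up
actual    <- factor(c("normal", "normal", "normal", "normal", "attack", "attack"),
                    levels = c("normal", "attack"))
predicted <- factor(c("normal", "attack", "attack", "normal", "attack", "attack"),
                    levels = c("normal", "attack"))
table(Predicted = predicted, Actual = actual)
# The two rows that are really normal traffic (e.g., known device miscommunication)
# but are predicted as "attack" are the false positives.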


A running simulation

Project Significance and Implementation

In today’s age of modern technology and the internet, it is becoming increasingly difficult to protect enterprise networks against malicious attacks. Not only are malicious actors becoming more advanced in their attack methodologies, but the number of IoT devices that live and operate in a business environment is ever increasing. It needs to be a top priority for any business to create an IT strategy that protects the company’s technical architecture and core intellectual property. When assessing all potential security weaknesses, you must decompose the network model and define trust zones within the IoT architecture.

This application was designed to use Microsoft Azure Machine Learning to analyze and detect anomalies in large data sets collected from all devices on the business’ network. In an actual implementation, a constant data flow would run through our predictive model to classify traffic as Normal, Incorrect Setup, Distributed Denial of Service (DDOS attack), Data Type Probing, Scan Attack, or Man in the Middle. Using a supervised learning method to iteratively train our model, the application would grow increasingly more cognitive and more accurate at identifying these network traffic patterns. If this system were fully implemented, there would also need to be actions for each of these classification patterns. For instance, if the model detected a DDOS attack coming from a certain device, the application would automatically send shutdown commands to the device, isolating it from the network and containing the attack. When these actions occur, logs would be taken and notifications automatically sent to the appropriate IT administrators and management teams, so that quick and effective action could be taken.

Applications such as the one we have designed are already being used throughout the world by companies in all sectors. CrowdStrike, for instance, is a cyber technology company that produces information security applications with machine learning capabilities. Such companies have grown ever more popular over the past few years as the number of cyber attacks has increased. We have seen firsthand how advanced these attacks can be with data breaches at the US federal government, Equifax, Facebook, and more. The need for advanced information security applications is increasing daily, not just for large companies but for small- to mid-sized companies as well. While outsourcing information security is an easy choice for some companies, others may not have the budget to afford such technology. That is where our application demonstrates the low barrier to entry that can be attained using machine learning platforms such as Microsoft Azure ML or IBM Watson. Products such as these provide relatively easy interfaces for IT security administrators to take action into their own hands and design their own anomaly detection applications. In conclusion, our IoT Network Anomaly Detection Application is an example of how a company could design and implement its own advanced cyber security defense applications, better enabling it to protect its network devices and intellectual property against ever-growing malicious attacks.
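
As an illustration of the response logic described above, the short sketch below maps a predicted traffic class to an action. The class labels follow the categories named in this section; the actions themselves are hypothetical placeholders, not a real integration.

# Illustrative only: hypothetical responses keyed to the predicted traffic class
respond <- function(predicted_class) {
  switch(predicted_class,
         "Normal"            = "no action",
         "Incorrect Setup"   = "open a maintenance work order",
         "DDOS"              = "send a shutdown command to isolate the device; notify IT",
         "Data Type Probing" = "flag the device for review",
         "Scan Attack"       = "block the source address; notify IT",
         "Man in the Middle" = "isolate the device; notify IT and management",
         "unrecognized class")
}
respond("DDOS")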

Methodology

For this project, our team acquired public data from Google, Kaggle and Amazon. For the IoT model, preprocessed data was selected for the anomaly detection model. Preprocessed data from the Google open data repository was collected to test and train the models. R Studio served as the initial data analysis and data analytics environment, used to compute Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) and to evaluate the sensitivity and specificity of the models for scoring the predictability of the response variable. In R, predictability was compared across logistic regression, random forest, and gradient boosting models (a minimal R sketch of this comparison follows the list below). In the preprocessed data, a predictor (normality) variable was used for training and testing purposes. After the initial data discovery stage, the data was processed by a machine learning model in Azure ML using support vector machine and principal component analysis pipelines for anomaly detection. The response variable has the following values:

  • Normal – 0
  • Wrong Setup – 1
  • DDOS – 2
  • Data Type Probing – 3
  • Scan Attack – 4
  • Man in the Middle – 5
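
The following is a minimal sketch of the initial R comparison described above: build a binary target from the normality label, split the data 80/20, fit logistic regression, random forest and gradient boosting models, and compare their ROC/AUC on the held-out rows. The package choices (pROC, randomForest, gbm) and the data frame name iot are assumptions; substitute whatever preprocessed dataset and label column you use.

library(pROC)
library(randomForest)
library(gbm)

# Binary target: 1 = any non-normal class
iot$anomaly <- as.integer(iot$normality != 0)
dat <- iot[, setdiff(names(iot), "normality")]        # keep the features plus the new target

set.seed(42)
idx   <- sample(nrow(dat), 0.8 * nrow(dat))           # 80% train / 20% test
train <- dat[idx, ]
test  <- dat[-idx, ]

glm_fit <- glm(anomaly ~ ., data = train, family = binomial)
rf_fit  <- randomForest(factor(anomaly) ~ . - anomaly, data = train)
gbm_fit <- gbm(anomaly ~ ., data = train, distribution = "bernoulli", n.trees = 200)

# ROC curves and AUC on the held-out 20%, one per model
roc_glm <- roc(test$anomaly, predict(glm_fit, test, type = "response"))
roc_rf  <- roc(test$anomaly, predict(rf_fit,  test, type = "prob")[, 2])
roc_gbm <- roc(test$anomaly, predict(gbm_fit, test, n.trees = 200, type = "response"))
c(glm = auc(roc_glm), rf = auc(roc_rf), gbm = auc(roc_gbm))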

The preprocessed dataset for intrusion detection in network-based IoT devices includes ultrasonic sensors using Arduino microcontrollers and NodeMCU, a low-cost open-source IoT platform that runs on the ESP8266 Wi-Fi module and is used to send the data.

The following table represents data from the Ethernet frame, which is part of the TCP/IP packet transmitted from a source device to a destination device for network communication. The dataset has been preprocessed for a network-based intrusion detection system.

Source: Google.com

In the next article, we’ll be exploring the R code and Azure ML trained anomaly detection models in greater depth.

Anomaly and Intrusion Detection in IoT Networks with Enterprise Scale Endpoint Communication

This is part one of a series of articles to be published on LinkedIn based on a classroom project for ISM 647: Cognitive Computing and Artificial Intelligence Applications taught by Dr. Hamid R. Nemati at the University of North Carolina at Greensboro Bryan School of Business and Economics.

The Internet of Things (IoT) continues to be one of the most innovative and exciting areas of technology in the last decade. The IoT is a collection of devices that collect data from the environment around them through mechanical, electrical, thermodynamic or hydrological processes. These environments could be the human body, geological areas, the atmosphere, etc. The networking of IoT devices has been prevalent in many industries for years, including the gas, oil and utilities industries. As companies demand higher sample read rates from sensors, meters and other IoT devices, and as bad actors from foreign and domestic sources become more prevalent and brazen, these networks have become vulnerable to security threats due to their increasing ubiquity and evolving role in industry. In addition, these networks are prone to read-rate fluctuations that can produce false positives for anomaly and intrusion detection systems when devices are deployed at enterprise scale and send TCP/IP transmissions of data upstream to central office locations. This paper focuses on developing an application for anomaly detection using cognitive computing and artificial intelligence as a way to achieve better anomaly and intrusion detection in enterprise-scale IoT applications.

This project uses the capabilities of automated machine learning to develop a cognitive application that addresses possible security threats in high-volume IoT networks such as utility, smart-city and manufacturing networks. These are networks with high communication read success rates and hundreds of thousands to millions of IoT sensors; however, they may still have issues such as:

  1. Noncommunication or missing/gap communication.
  2. Maintenance Work Orders
  3. Alarm Events (Tamper/Power outages)

In large-scale IoT networks, such interruptions are normal to business operations. Noncommunication typically occurs because devices fail or get swapped out under a legitimate work order. Weather events and people can also cause issues with the endpoint device itself: power outages can cause connected routers to fail, and devices can be tampered with, for example by someone attempting a hardwire bypass or removing a meter.

The scope of this project is to build machine learning models that address IP-specific attacks on the IoT network, such as DDoS attacks originating both within and outside the networking infrastructure. These models should be intelligent enough to distinguish network attacks (true positives) from communication issues (true negatives). Network communications typical of such an IoT network include:

  1. Short range: Wi-Fi, Zigbee, Bluetooth, Z-Wave, NFC.
  2. Long range: 2G, 3G, 4G, LTE, 5G.
  3. Protocols: IPv4/IPv6, SLIP, uIP, RLP, TCP/UDP.

Eventually, as such machine learning and deep learning models expand, these types of communications will also be monitored.

Scope of Project

This project will focus on complex IoT systems typical in multi-tier architectures within corporations. As part of the research into the analytical properties of IT systems, this project will focus primarily on the characteristics of operations that begin with the collection of data through transactions or data sensing, and end with storage in data warehouses, repositories, billing, auditing and other systems of record. Examples include:

  1. Building a simulator application in Cisco Packet Tracer for a mock IoT network.
  2. Creating a Machine Learning anomaly detection model in Azure.
  3. Generating and collecting simulated and actual TCP/IP network traffic data from open data repositories in order to train and score the team machine learning model.

Other characteristics of the IT systems that will be researched as part of this project include systems that perform the following:

  1. Collect, store, aggregate and transport large data sets
  2. Require application integration, such as web services, remote API calls, etc.
  3. Are beyond a single stack solution.

Next: Business Use Cases and IoT security

Derek Moore, Erica Davis, and Hank Galbraith, authors.

Information Technology Strategies in Software Development Projects

The goal of software projects is to complete them on time and on budget. There can be an added challenge when part of the project is also to create specifications for the hardware and software acquisitions needed to deploy a solution. In the myriad of possible configurations and architectures that advances in information technology have created, there can sometimes be no clear path as to what should be budgeted, or whether the budget is even adequate.

Licensing can be a large obstacle to a successful software project. How third-party software can be used and deployed (for example, the number of connections or users allowed per license cost) is just one example of the difficulty in obtaining proper budget projections. Add to this the complexity of the type of hardware needed and the specifications of that hardware. Not only do companies have to spec hardware for their own internal development practices (if they do not already have the resources), clients will also specify in a statement of work (SOW) that the vendor create the correct specifications to deploy the software on site, in the cloud, or both.

Information technology needs to become a project planning process rather than a line item on a project Gantt chart. Creating a road map in which IT becomes a process with a business component at every step will create significant predictability in software projects with less risk. Strategies such as IT governance, unified modeling, hardware sizing, use-case analysis and economic cost analysis are essentially putting pen to paper to plan all aspects of IT prior to planning any budget or acquiring any product.

If you’d like more information or have questions on how to create IT business strategies, please feel free to contact me.

 

Designing and Building a Self-Driving Car – Part 4

Topics: The many lessons of Pulse Width Modulation, or “Where have I seen this before?” Thoughts on motor speed and where to look for inspiration for a project.

This blog will be about the many lessons one learns throughout his or her life – and boy, have I learned many! In college I took electrical and electronics engineering courses, but one of the biggest lessons learned was not about resistors, capacitors, Kirchhoff’s Voltage Law, or transistors; it was: “You will forget what you learn quickly if you don’t apply it in practice.” This is the lesson that I intend to instill in my child throughout this project.

My “Parent’s Guide” tip is to create a project that will continuously reinforce the concepts that you will be using for the project. The best way, in my opinion, to do this is to have a project that will last many months (or, in my case, years). Take a detour now and then to do a science fair or a demonstration in front of the grandparents, science clubs and an assortment of friends; in between Boy Scouts or Girl Scouts, baseball, volleyball, tennis, swimming, or the myriad of activities children do throughout their lives, you need one project that is consistent, continuous and big, which they can take into college or whatever the next big step in their lives will be.

I learned about pulse-width modulation throughout my education – first in an electronics lab course in 1993, and again in 1997 and 1999. When I began working on the autonomous car project with my child, we started on the PWM circuit for the electric motor that was to go into the car, and I had forgotten how it worked and what it would take to build the circuit.

The moral is that working on a project such as this should help to continuously reinforce STEM concepts throughout a child’s life, and should not just be about getting it done!

We decided to create multiple prototypes (4), varying in difficulty, size, configuration and purpose.

Prototype 1:  A simplified Arduino and Raspberry Pi prototype built for the sole purpose of controlling a DC motor and a servo with OpenCV and a Raspberry Pi camera.  It’s a non-functional RC model.


Prototype 2:  A workable RC car that will do everything Prototype 1 does, but with a brushless motor and an ESC, and it will move.


Prototype 3:  A model designed in Blender and then 3D printed.  This prototype will better emulate an “actual car.”  It will have all the essentials such as a transmission, steering, suspension, differential and axles, but none of the other things a car has that have nothing to do with power, acceleration or steering.  It will be a working model with an actual brushless motor, computers and microcontrollers.  It will not have cameras, but will be operated by a dataset from actual road video.


Figure 1:  Blender open source design software

Prototype 4:   An actual gasoline car rebuilt into an electrified self-driving car.  My hope is to rebuild an old Geo Metro.  But the chances are a little small at this point, because Geo Metros all over the country are being snapped up to convert to electric.  The advantages of rebuilding an electric car using a Geo Metro are:

  1. They’re cheap.  Many have bad engines which make them even cheaper.  You could find one from $300 to $1500.
  2. They’re lightweight.  Geo Metros are three-cylinder cars and don’t have all the stuff that weighs modern cars down.
  3. The shape is conducive to the space you would need for an electric car.  Since the car that I’m building will also contain computers and other mechanisms to make it self-driving, the hatchback is particularly attractive.

 

Chevrolet Metro hatchback (rear view)

 

Information Technology Management Strategies in Industry Series

Use cases for information technology in the energy and manufacturing industries continue to expand beyond the typical financial, asset management and plant management applications. Now IT is used more often to drive goals such as resourcefulness and efficiency in current business processes and products.

Starting next month and running until May, I will be releasing a series of blogs addressing IT strategies in areas of business and industry in manufacturing and energy.  The series will include:

  • Information Technology Management Strategies for Energy Management
  • Information Technology Management Strategies for Customer Service 
  • Information Technology Management Strategies for the Internet of Things and Smart Cities Initiatives 
  • Information Technology Management Strategies for Data Analysis in Manufacturing

The main objective of this series is to apply IT management knowledge to experiences in energy, manufacturing and customer service.  With the advent of innovative solutions in areas such as data science and machine learning, there are more opportunities than ever to make these industries more resourceful, efficient and effective in their business processes and beyond.

I will post links to the blogs from this page for easy reference in the future.

Effectively using Associative Tables in Relational Database Design

 

When creating entities in an entity-relationship (ER) diagram, there are times when multiple entity attributes (also known as fields in tables) have associations that could create redundancy among the entity instances of a relational database model.  An associative entity is an entity type that associates the instances of one or more entity types and contains attributes describing the relationships between those entities.  Associative entities help prevent joining two or more entities directly with multiple relationships, which would create duplication of instances.  In databases this can sometimes result in Cartesian joins and can produce duplicate table rows, sometimes many times over in a result set.

In entities whose instances are not very distinctive, such as the days of the week, the months of the year, or the positions in a company, it is not a good idea to create a direct association between attributes of that entity and another entity with highly unique instances – for example, meetings in a company, or the full names, ages and addresses of people in New York City or of people working in a company.  Although these are very basic examples, you will soon discover that such direct associations can create a level of duplication that exponentially grows the number of instances in your database model.

Below is a use case for an associative entity.  The EMPLOYEE and SKILLS tables are the entities that contain the employee information and the types of skills and titles within the company, respectively.  In this example a company ranks its employees’ skills by titles such as Principal, Senior, III, II, and I.  However, it also wants to know which skills are managed by which employees.  Since we want to prevent as much duplication as possible, we create an associative table called ORGANIZATION that has a primary identifier attribute (ORG_ID) and joins together the EMPLOYEE and SKILLS tables.  The MANAGED_SKILL_ID attribute creates a composite identifier that allows an employee to have multiple instances of the skills he or she manages without duplicating data in the EMPLOYEE or SKILLS entities.
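
Here is a small illustrative sketch of the same idea using in-memory R data frames.  The table and column names mirror the example above (EMPLOYEE, SKILLS, ORGANIZATION, ORG_ID, MANAGED_SKILL_ID); the rows themselves are made up.

# Base entities: one row per employee and one row per skill title
EMPLOYEE <- data.frame(EMP_ID = c(1, 2),
                       EMP_NAME = c("Jones", "Smith"))
SKILLS   <- data.frame(SKILL_ID = c(10, 20, 30),
                       SKILL_TITLE = c("Principal", "Senior", "III"))
# Associative table: one row per employee/skill pairing, plus the managed-skill
# reference, so neither base table needs to repeat its rows
ORGANIZATION <- data.frame(ORG_ID = 1:3,
                           EMP_ID = c(1, 1, 2),
                           SKILL_ID = c(10, 20, 20),
                           MANAGED_SKILL_ID = c(20, NA, NA))
# Resolve the many-to-many relationship through the associative table
merge(merge(ORGANIZATION, EMPLOYEE, by = "EMP_ID"), SKILLS, by = "SKILL_ID")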


 

Top Data Analytics and Data Science Resources

Here are my favorite Data Science/Data Analytics Resources

 Curriculum

1) MIT Open Courseware 

A great MOOC (Massive Open Online Course) platform for learning the mathematical and statistical fundamentals of data science, such as linear algebra, statistics and probability, from a great university. This will give you some of the fundamentals of data science. https://ocw.mit.edu

2) https://kaggle.com

A website that sets up data analytics and data science competitions. It also provides a lot of free data that you can use to play with and build your skills on.

3) https://Datasciencemasters.org

An open-source curriculum for learning data science, including foundations in theory and technologies. You can download code and use it to build projects to improve your skills.

4) https://superdatascience.com/pages/machine-learning

I really like this website because it teaches you all the fundamentals of machine learning from A to Z.

5) https://Coursera.org

A MOOC platform that can teach you anything you need to know about data science, machine learning and data analytics.

6) https://Udemy.com

Another good MOOC

7) https://EdX.org

Another good MOOC

8) Stanford Online (https://online.stanford.edu/)

A lot like MIT Open Courseware. Free and from a renowned university.

Programming

1) https://Anaconda.com

This is my favorite data science/data analytics platform. Python is very hot right now in regards to data science, and Anaconda is the best platform to learn and program in Python. I strongly recommend learning Python and/or R, along with SQL, the other popular language.

2) Scikit-learn (https://scikit-learn.org)

This is my favorite library for data science and machine learning. It has a lot of great features for classification, anomaly detection and more.

3) https://Github.com

GitHub is a community code-hosting repository with projects in Python, C++, C#, Java, JavaScript and many other languages. You should create your own GitHub account and start being active on it. There are a lot of tutorials.

4) https://www.w3schools.com/

You can learn almost any coding language here, including C++ and Python.

There are also a lot of good books on Amazon for learning Python.

Datasets

My Favorite Publicly Available Datasets (also see https://rtpopendata.com/2019/02/03/my-favorite-publicly-available-datasets/)

I’ve been working with data for decades, searching for insights, converting it, managing it, and now performing data analytics. We have access to unbelievable treasure troves of public data to analyze. Many of the blogs I write are based on these datasets, as I don’t have access to large computing systems. Here is a list of my favorite publicly available datasets. Enjoy!

PJM Interconnection Data Dictionary for electrical grids, distribution and transmission. https://www.pjm.com/markets-and-operations/data-dictionary.aspx
University of California, Irvine (UCI) has a huge machine learning repository for practicing techniques. This repository can be accessed at archive.ics.uci.edu/ml/index.php
Amazon Web Services datasets are available to the public. https://aws.amazon.com/datasets/.
Kaggle is a data science competition website that rewards prizes to teams for the best ML models. Datasets are located at https://www.kaggle.com/datasets
University of Michigan Sentiment Data.
The FRED (Federal Reserve Economic Data) time series repositories are located at https://fred.stlouisfed.org/categories.
Canadian Institute for Cybersecurity. https://www.unb.ca/cic/datasets/nsl.html.
Datasets for “The Elements of Statistical Learning”. https://web.stanford.edu/~hastie/ElemStatLearn/.
Government Open Data Portal. https://data.gov