Installing the U.S. Census API in R

The U.S. Census Bureau American Community Survey (ACS) application programming interface (API) is a valuable interface for quickly and efficiently adding census data to your R code for analysis. It is an efficient way of doing data analysis in R without having to pull, format and organize census tables into a CSV file or spreadsheet.

In order to use the Census API, you will need to apply for a Census API key, which takes only a few minutes. You then install the key with the census_api_key function from the tidycensus package.

census_api_key('<key>', install=TRUE)

Once the key is installed, it does not need to be installed again unless it expires or you apply for a new key. You can apply for a key on the U.S. Census Bureau website by searching for the API.
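
As a quick check, the installed key can be confirmed from the environment. A minimal sketch, assuming the tidycensus package is installed (census_api_key with install=TRUE writes the key to ~/.Renviron as CENSUS_API_KEY):

library(tidycensus)

# census_api_key("<key>", install = TRUE)  # run once per machine
readRenviron("~/.Renviron")                # reload so the key is visible in this session
Sys.getenv("CENSUS_API_KEY")               # should print the installed key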

The get_acs R function gives you the ability to pull in multiple years and categories of survey information, including race, household income, marital status, employment and more. It is a powerful feature for analyzing survey and census data in tidy format.

ACS_2010 <- get_acs("state",  year=2010, variables="S1702_C02_001", output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2011 <- get_acs("state", variables="S1702_C02_001", year=2011, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2012 <- get_acs("state", variables="S1702_C02_001", year=2012, output="tidy", geometry=TRUE) %>%
  select(-moe)
  
ACS_2013 <- get_acs("state", variables="S1702_C02_001", year=2013, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2014 <- get_acs("state", variables="S1702_C02_001", year=2014, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2015 <- get_acs("state", variables="S1702_C02_001", year=2015, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2016 <- get_acs("state", variables="S1702_C02_001", year=2016, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2017 <- get_acs("state", variables="S1702_C02_001", year=2017, output="tidy", geometry=TRUE) %>%
  select(-moe)
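
The eight per-year pulls above can also be collapsed into one step. The following is only a sketch, not part of the original workflow, assuming the tidycensus, purrr and dplyr packages are loaded; geometry is turned off here because stacking sf objects across years is easier without it.

years <- 2010:2017
ACS_all_years <- purrr::map_dfr(years, function(yr) {
  get_acs("state",
          variables = "S1702_C02_001",
          year = yr,
          output = "tidy",
          geometry = FALSE) %>%   # geometry dropped when stacking years
    mutate(year = yr) %>%
    select(-moe)
})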

ACS_S1702_C01_001_Family <- get_acs("county", variables="S1702_C01_001", output="tidy", geometry=TRUE) %>%
  select(-moe)
  
ACS_S1702_C01_013_Household_Work <-  get_acs("county", variables="S1702_C01_013", output="tidy", geometry=TRUE) %>%
  select(-moe)


ACS variable data can be enumerated in a list with descriptive names for easier classification. All variables are identified by a variable ID (for example S1702_C01_018, S1702_C01_019, etc.).


 ACS_Education <- get_acs("county", variables= c(nohighschool = "S1702_C01_018", 
                                                  highschool = "S1702_C01_019",
                                                 somecollege = "S1702_C01_020", 
                                                 collegeplus = "S1702_C01_021"),
                         output="tidy", geometry=TRUE) %>%
  select(-moe)

You can also choose whether to include the margin of error (moe). Visualizations can be created with the usual ggplot2 library or rendered as maps. All ACS information includes IDs that categorize the survey data.
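
For example, here is a quick ggplot2 map of the 2010 state estimates. This is a minimal sketch, assuming the ggplot2 and sf packages are installed and the ACS_2010 object from above (pulled with geometry = TRUE):

library(ggplot2)
library(sf)

ggplot(ACS_2010) +
  geom_sf(aes(fill = estimate)) +   # fill each state polygon by its estimate
  labs(title = "ACS S1702_C02_001 estimate by state, 2010", fill = "Estimate")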

ACS_geo_2010 <- ACS_2010 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))


ACS_geo_2011 <- ACS_2011 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2012 <- ACS_2012 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2013 <- ACS_2013 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2014 <- ACS_2014 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2015 <- ACS_2015 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2016 <- ACS_2016 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))

ACS_geo_2017 <- ACS_2017 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))

Geographic Mapping of ACS Data

png(file="images/ACS_geo_2010.png")
tm_shape(ACS_geo_2010) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2010 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2011.png")
tm_shape(ACS_geo_2011) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2011 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2012.png")
tm_shape(ACS_geo_2012) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2012 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2013.png")
tm_shape(ACS_geo_2013) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2013 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2014.png")
tm_shape(ACS_geo_2014) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2014 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2015.png")
tm_shape(ACS_geo_2015) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2015 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2016.png")
tm_shape(ACS_geo_2016) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2016 Post-Recession", asp=1)
dev.off()

png(file="images/ACS_geo_2017.png")
tm_shape(ACS_geo_2017) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="% Poverty in U.S. Year: 2017 Post-Recession", asp=1)
dev.off()
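
The eight map blocks above can also be generated in a loop. A sketch, assuming the ACS_geo_<year> objects created earlier and the tmap package; note that tmap objects must be wrapped in print() inside a loop so they render to the png device:

for (yr in 2010:2017) {
  acs_geo <- get(paste0("ACS_geo_", yr))   # look up ACS_geo_2010, ACS_geo_2011, ...
  png(file = paste0("images/ACS_geo_", yr, ".png"))
  print(
    tm_shape(acs_geo) +
      tm_polygons("estimate") +
      tm_layout(title.position = c("left", "top"),
                title = paste("% Poverty in U.S. Year:", yr, "Post-Recession"),
                asp = 1)
  )
  dev.off()
}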

Programming R for Human Genome Variation Data

Human genome variation analysis is a popular biomedical and biological technique used for finding disease, developing treatments and discovering the wide array of human genetic variation, which can reveal the impact of disease and medical treatments.

Reading VCF files into R.

The vcfR package was designed to work with data from VCF files, and it is intended to operate on an individual chromosome. The VCF file structure is a standard file format for storing genomic variation data and is used by organizations for large-scale variant mapping of the human genome. One example is the International Genome Sample Resource (IGSR).

A VCF file contains the following header columns:

  1. CHROM: the name of the chromosome.
  2. POS: the starting position of the variant.
  3. ID: an identifier for the variant.
  4. REF: the reference allele. An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.
  5. ALT: the alternate allele.
  6. QUAL: a quality score for the variant call.
  7. FILTER: pass/fail, indicating whether the variant passed the quality filters.
  8. INFO: additional information about the variant.
  9. FORMAT: the format of the genotype columns that follow.

The following libraries are needed to load and process VCF files.

install.packages("vcfR")       # CRAN
install.packages("microseq")   # CRAN
# ShortRead, GenomicAlignments, Rsamtools and pasillaBamSubset are Bioconductor
# packages; install them with BiocManager::install() (see the BAM section below).
library(vcfR)
library(ShortRead)
library(microseq)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)

To load VCF files: because of their size, VCF files are typically compressed (gzipped).

vcf_file <- read.vcfR('file1.vcf.gz',verbose=FALSE)
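
Once loaded, the header columns listed above can be inspected directly. A small sketch using standard vcfR helpers on the vcf_file object just created:

queryMETA(vcf_file)      # summarize the meta (header) section of the VCF
head(getFIX(vcf_file))   # CHROM, POS, ID, REF, ALT, QUAL, FILTER (set getINFO = TRUE to include INFO)
head(vcf_file@gt)        # FORMAT column plus the per-sample genotype columns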

Techniques for processing VCF files include:

# dna is a sequence (e.g., read from FASTA) and gff an annotation table loaded beforehand
chrom <- create.chromR(name='Supercontig', vcf=vcf_file, seq=dna, ann=gff)

# Extract genotypes into a tidy data frame

vcf_file_1 <- extract_gt_tidy(vcf_file, format_fields=NULL, format_types=TRUE, dot_is_NA=TRUE, alleles=TRUE, allele.sep="/", gt_column_prepend="gt_", verbose=TRUE)
str(vcf_file_1)
 

# write.csv always writes column names; time the export with system.time
system.time(write.csv(vcf_file_1,'out_combine.indel.vcf.gz.csv',row.names=FALSE))

extract_info_tidy(vcf_file, info_fields=NULL, info_types=TRUE, info_sep=";")

 
# Extract the INFO fields from another vcfR object (vcf_snp, read in the same way)

extract_info_tidy(vcf_snp, info_fields=NULL, info_types=TRUE, info_sep=";")

A great way to time how long it takes to load or write genome data is system.time. For example:

system.time(write.csv(fdta_combine,'fdta_out3.csv',row.names=FALSE))

FASTQ file format

FASTQ files contain the raw reads from whole-genome sequencing and can be very large.

install.packages("fastqcr")
library("fastqcr")
library("ShortRead")
install.packages("microseq")
library(microseq)

The ShortRead package reads FASTQ files into R; smaller samples of reads can be exported to CSV files for inspection.

fq.file <- file.path("D:/temp","fastq_file.fq.gz")
fdta1 <- readFastq(fq.file)      # a ShortReadQ object
head(fdta1,100)
summary(fdta1)
str(fdta1)
fdta_sample_a <- fdta1[1:10]     # first ten reads
summary(fdta_sample_a)
# convert the reads to a plain data frame before writing to CSV
fdta_sample_df <- data.frame(id  = as.character(ShortRead::id(fdta_sample_a)),
                             seq = as.character(sread(fdta_sample_a)))
system.time(write.csv(fdta_sample_df,'out_2.csv',row.names=FALSE))

BAM Files

BAM files contain aligned raw sequencing reads and are typically very large. Along with a wide array of external tools that can read BAM files, R has many functions that can process BAM data. BAM files are usually accompanied by an index file (.bai) that makes it easier to locate information within large BAM files.
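
If a BAM file does not already have an index, one can be created in R. A minimal sketch with Rsamtools (the file paths are placeholders):

library(Rsamtools)

# sort by coordinate, then index; region queries (scanBam with a 'which' range)
# use the .bai index to seek directly into the file
sorted <- sortBam("D:/temp/raw1.bam", destination = "D:/temp/raw1.sorted")
indexBam(sorted)   # writes the .bai index next to the sorted BAM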

To load the BAM file libraries, you can install them directly in R by downloading the Bioconductor packages from https://www.bioconductor.org/.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")
(bf <- BamFile("D:/temp/raw1.bam"))
(bf <- BamFile("D:/temp/raw1.bam",yieldSize=1000))

seqinfo(bf)
(sl <- seqlengths(bf))
#quickBamFlagSummary(bf)  -- "Realloc could not re-allocate memory" problem on very large files
(gr <- GRanges("chr4",IRanges(1, sl["chr4"])))
countBam(bf, param=ScanBamParam(which = gr))

reads <- scanBam(BamFile("D:/temp/raw2.bam", yieldSize=5))
class(reads)
names(reads[[1]])
reads[[1]]$pos # the aligned start position
reads[[1]]$rname # the chromosome
reads[[1]]$strand # the strand
reads[[1]]$qwidth # the width of the read
reads[[1]]$seq # the sequence of the read

gr <- GRanges("chr4",IRanges(500000, 700000))
reads <- scanBam(bf, param=ScanBamParam(what=c("pos","strand"), which=gr))

hist(reads[[1]]$pos)

readsByStrand <- split(reads[[1]]$pos, reads[[1]]$strand)
myHist <- function(x) table(cut(x, 50:70 * 10000 ))
tab <- sapply(readsByStrand, myHist)
barplot(t(tab))

(ga <- readGAlignments(bf)) # may hit memory-allocation issues if the BAM file is too large

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

#install if necessary (biocLite was the Bioconductor installer at the time; BiocManager::install() is the current equivalent)
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
#load library
library(Rsamtools)
#read in entire BAM file
bam <- scanBam("wgEncodeRikenCageHchCellPapAlnRep1.bam")
#names of the BAM fields
names(bam[[1]])
# [1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar"
# [9] "mrnm" "mpos" "isize" "seq" "qual"
#distribution of BAM flags
table(bam[[1]]$flag)
# 0 4 16
#1472261 775200 1652949
#function for collapsing the list of lists into a single list
#as per the Rsamtools vignette
.unlist <- function (x){
  ## do.call(c, ...) coerces factor to integer, which is undesired
  x1 <- x[[1L]]
  if (is.factor(x1)){
    structure(unlist(x), class = "factor", levels = levels(x1))
  } else {
    do.call(c, x)
  }
}
#store names of BAM fields
bam_field <- names(bam[[1]])
#go through each BAM field and unlist
bam_list <- lapply(bam_field, function(y) .unlist(lapply(bam, "[[", y)))
#store as a DataFrame (renamed from `list` to avoid masking base::list)
bam_df <- do.call("DataFrame", bam_list)
names(bam_df) <- bam_field
dim(bam_df)
#[1] 3900410 13

Tools and Methods to Analyze Variants and Genotyping of Human Genome Data 12/1/2021

Recently I’ve been tasked with analyzing biological and genomic data, and I’ve learned a lot about the tools and libraries available for R and Python. As part of this analysis, I’m helping scientists analyze genome variations and perform analysis of genotypes. The human genome has 23 chromosomes, about 3 billion base pairs, containing around 30,000 genes. Each base pair can be coded with 2 bits, which equates to around 750 megabytes of data. The data I have been analyzing is several terabytes of genome sequences for around eight humans. Because the data is so massive, there are several high-throughput tools available for genotyping and variation discovery that I will cover in a series of articles over the next few months.
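
A quick back-of-the-envelope check of that storage estimate (simple arithmetic, not from the original analysis):

bits  <- 3e9 * 2     # ~3 billion base pairs, 2 bits per base
bytes <- bits / 8    # 750,000,000 bytes
bytes / 1024^2       # ~715 MiB, i.e. roughly 750 megabytes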

One characteristic is that repetitive DNA sequences make up approximately 50% of the human genome. The genome size is about 3,100 Mbp (mega-base pairs) per haploid genome. A base pair is two chemical bases bonded to one another, forming a "rung" of the DNA ladder; DNA strands look somewhat like a ladder twisted around itself.

Variations

Variations include differences in the number of copies of a particular gene that individuals carry, as well as deletions, translocations and inversions. One such variation is the single-nucleotide polymorphism (SNP); another class is copy number variation (CNV). Variations in DNA are a normal part of human genetics and can sometimes be a sign of the body adapting to changes within the sequences or protecting itself.

A SNP can be either type of nucleic acid substitution:

  1. Transition
    1. Interchange of the purines (Adenine/Guanine)
    2. Interchange of the pyrimidines (Cytosine/Thymine)
  2. Transversion
    1. Interchange of a purine and a pyrimidine nucleic acid

Since variation discovery is very important in the biological sciences, many tools have been developed to assist in creating medicines and treatments for all types of mutations within cells.

The International HapMap Project was developed to describe variation patterns in the human genome and to find variations that affect health, responses to drugs and an individual’s environment. Variations include small-scale and large-scale variations.

Copy number variation (CNV): the number of copies of a particular gene varies from one individual to the next. Following the completion of the Human Genome Project, it became apparent that the genome experiences gains and losses of genetic material. The extent to which copy number variation contributes to human disease is not yet known, although it has long been recognized that some cancers are associated with elevated copy numbers of particular genes. CNVs are categorized as long repeats or short repeats.

Insertions and deletions (InDels) are a type of CNV: insertion-deletion mutations refer to the insertion and/or deletion of nucleotides into genomic DNA and include events less than 1 kb in length.

Other Definitions

Length is measured in base pairs (bp). One bp corresponds to approximately 3.4 Å (340 pm) of length along the strand, and to roughly 618 or 643 daltons for DNA and RNA respectively.

Kilobase (kb) is a unit of measurement in molecular biology equal to 1000 base pairs of DNA or RNA.

Data Analysis

Most of the analysis is performed in R. Here are some of the genomics libraries used for the analysis:

install.packages("vcfR")
install.packages("fastqcr")
install.packages("microseq")
install.packages("factoextra")
library(vcfR)
library(fastqcr)
library(ShortRead)
library(microseq)
library(GenomicAlignments)
library(Rsamtools)
library(pasillaBamSubset)
library(ggrepel)
library(factoextra)

The above libraries are standard R libraries for analyzing genomic data. Later in this document, I will discuss the tools that produce the files these libraries consume.

The core genomics libraries are distributed through Bioconductor and can be installed with BiocManager.

if (!require("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install()

if (!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")

BiocManager::install("Rsamtools")
BiocManager::install("pasillaBamSubset")

Visualization

[Figure: Patient1A function visualization]

Using R with Human Genome Variation Data

Reading VCF files into R with the vcfR package, including the nine standard header columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT), is covered in the section above.

FASTQ file format

FASTQ files contain the raw reads from whole-genome sequencing and can be very large; they represent the raw sequencing data.

BAM or CRAM file formats

These files store sequencing reads aligned against a reference genome.

Genomics Tools

Genome data is very large and contains millions of base pairs per chromosome. Although this data can be loaded into R, the complexity of looking at individual genes and chromosomes can be daunting. One tool that makes reading genomic data more visual is the Integrative Genomics Viewer (IGV). IGV is a visualization tool that zooms in to the gene and chromosome level, down to individual bases.

IGV efficiently pulls in BAM file indexes to locate genomic data.

Other tools for genomics, structural biology and molecular biology include the DNASTAR Lasergene Structural Biology Suite and Spartan.

Another tool is the Variant Effect Predictor (VEP), which determines the effect of variants on genes.

Other tools include SnpEff and SnpSift. SnpEff provides genetic variant annotation and functional effect prediction; it annotates and predicts the effect of genetic variants on genes and proteins.

SnpSift annotates genomic variants using databases and filters annotated variants. Once you have annotated your files using SnpEff, you can use SnpSift to filter large genomic datasets in order to find the most significant variants for your experiment. The SnpEff and SnpSift genomic databases are hosted by Microsoft Genomics and Azure.

The Microsoft Genomics service provides a cloud-hosted solution that makes it easy to call variants on your genomic samples. The service takes genomic samples as two paired-end-read FASTQ (.fq.gz) files and produces .bam, .bai and .vcf files, along with the associated log files.

The process uses a BWA/GATK data pipeline in which Microsoft has improved the efficiency of both BWA and GATK, producing results faster and with less overhead. The service performs secondary analysis, taking raw reads through alignment and variant calling.

GATK, the Genome Analysis Toolkit, is also used for variant discovery using a data pipeline that can be scaled in the Azure or Google cloud. GATK is a framework for variant discovery with high-throughput sequencing data.

Another tool, CNVnator, performs CNV discovery and genotyping from the depth of coverage of mapped reads.

References

https://gatk.broadinstitute.org/hc/en-us

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC403693/

https://pcingola.github.io/SnpEff/

https://pcingola.github.io/SnpEff/features/

https://github.com/abyzovlab/CNVnator

https://ggplot2.tidyverse.org/reference/geom_histogram.html

https://www.microsoft.com/en-us/genomics/

Comparing Machine Learning Models in Determining Credit Worthiness for Bank Loans

The R language has a number of machine learning libraries for both supervised and unsupervised machine learning. These include techniques such as linear and logistic regression, decision trees, random forests and generalized boosted regression models, among others. I strongly recommend learning how these models work and how they can be used for predictive analytics.

Part of the Machine Learning process includes the following:

  1. Sample: Create a sample set of data either through random sampling or top tier sampling.  Create a test, training and validation set of data.
  2. Explore: Use exploratory methods on the data.  This includes descriptive statistics, scatter plots, histograms, etc.
  3. Modify:  Clean, prepare, impute or filter data.  Perform cluster analysis, association and segmentation.
  4. Model:  Model the data using Logistic or Linear regression, Neural Networking, and Decision Trees.
  5. Assess:  Evaluate the model by comparing it to other model types and against real data. Determine how close your model is to reality.  Test the data using hypothesis testing.

When creating machine learning models for any application, it is wise to follow a process flow such as the one outlined above.

In the following example, we use machine learning to determine the credit worthiness of prospective borrowers for a bank loan.

The loan data consists of the following inputs:

  1. Loan amount
  2. Interest rate
  3. Grade of credit
  4. Employment length of borrower
  5. Home ownership status
  6. Annual Income
  7. Age of borrower

The response variable, used to predict the default rate, is:

  1. Loan status (0 or 1).

After loading the data into R, we partition the data for training or testing sets.

loan <- read.csv("loan.csv", stringsAsFactors = TRUE)

str(loan)

## Split the data into 70% training and 30% test datasets

library(rsample)
set.seed(634)

loan_split <- initial_split(loan, prop = 0.7)

loan_training <- training(loan_split)
loan_test <- testing(loan_split)

Create an over-sampled training dataset using the ROSE library. With method = "both", ovun.sample balances the classes by both over- and under-sampling.

str(loan_training)

table(loan_training$loan_status)

library(ROSE)

loan_training_both <- ovun.sample(loan_status ~ ., data = loan_training, method = "both", p = 0.5)$data

table(loan_training_both$loan_status)

Build a logistic regression model and a classification tree to predict loan default.

loan_logistic <- glm(loan_status ~ . , data = loan_training_both, family = "binomial")

library(rpart)

loan_ctree <- rpart(loan_status ~ . , data = loan_training_both, method = "class")

library(rpart.plot)

rpart.plot(loan_ctree, cex=1)

Build the ensemble models (random forest, gradient boosting) to predict loan default.

library(randomForest)

loan_rf <- randomForest(as.factor(loan_status) ~ ., data = loan_training_both, ntree = 200, importance=TRUE)

plot(loan_rf)

varImpPlot(loan_rf)

library(gbm)

Summarize gradient boosting model

loan_gbm <- gbm(loan_status ~ ., data = loan_training_both, n.trees = 200, distribution = "bernoulli")
summary(loan_gbm)

Use the ROC (receiver operating characteristic) curve and compute the AUC (area under the curve) to check the specificity and sensitivity of the models.

# Step 1. Predicting on test data

predicted_logistic <- loan_logistic %>% 
  predict(newdata = loan_test, type = "response")

predicted_ctree <- loan_ctree %>% 
  predict(newdata = loan_test, type = "prob")

predicted_rf <- loan_rf %>% 
  predict(newdata = loan_test, type = "prob")

predicted_gbm <- loan_gbm %>% 
  predict(newdata = loan_test, n.trees = 200, type = "response")

# Step 2. Create ROC curves and compute AUC

library(cutpointr)

# attach the predicted scores to the test set so cutpointr::roc can reference them
loan_test <- loan_test %>%
  mutate(.fitted_logistic = predicted_logistic,
         .fitted_ctree    = predicted_ctree[, 2],
         .fitted_rf       = predicted_rf[, 2],
         .fitted_gbm      = predicted_gbm)

roc_logistic <- roc(loan_test, x = .fitted_logistic, class = loan_status, pos_class = 1, neg_class = 0)
roc_ctree    <- roc(loan_test, x = .fitted_ctree, class = loan_status, pos_class = 1, neg_class = 0)
roc_rf       <- roc(loan_test, x = .fitted_rf, class = loan_status, pos_class = 1, neg_class = 0)
roc_gbm      <- roc(loan_test, x = .fitted_gbm, class = loan_status, pos_class = 1, neg_class = 0)

plot(roc_logistic) + 
  geom_line(data = roc_logistic, color = "red") + 
  geom_line(data = roc_ctree, color = "blue") + 
  geom_line(data = roc_rf, color = "green") + 
  geom_line(data = roc_gbm, color = "black")

auc(roc_logistic)
auc(roc_ctree)
auc(roc_rf)
auc(roc_gbm)

These help you compare and score which model works best for the type of data in the test set. Looking at the ROC chart, the gradient boosting model has the best performance of all the models, as its curve is closest to 1.00. Classifiers whose curves push toward the top-left corner, where sensitivity approaches 1.00 and the false positive rate (1 - specificity) approaches 0.00, have the best performance.

Creating Twitter Sentiment Association Analysis using the Association Rules and Recommender System Methods

Contextual text mining methods extract information from documents, live data streams and social media. In this project, thousands of tweets by users were extracted to generate sentiment analysis scores.

Sentiment analysis is a common text classification tool that analyzes streams of text data in order to ascertain the sentiment (subjective context) of the text, which is typically classified as positive, negative or neutral.

In the R sentiment analysis engine our team built, the sentiment score has a range of -5 to 5. Numbers within this range determine the change in sentiment.

Sentiment score scale:

  Negative: -5
  Neutral: 0
  Positive: 5

Sentiment scores are determined by a text file of key words and scores called the AFINN lexicon. It is a popular, simple lexicon used in sentiment analysis.

New versions of the file are released in source repositories; it contains over 3,300 words, each with a score based on its level of positivity or negativity.
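
For reference, the AFINN lexicon can be pulled directly into R. A small sketch, assuming the tidytext and textdata packages are installed (textdata prompts to download the lexicon on first use):

library(tidytext)

afinn <- get_sentiments("afinn")   # two columns: word and value (scores from -5 to 5)
head(afinn)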

Twitter is an excellent example of sentiment analysis.

An example of exploration of the sentiment scores based on the retweets filtered on the keywords:

  1. Trump
  2. Biden
  3. Republican
  4. Democrat
  5. Election

The data was created using a sentiment engine built in R.  It is mostly based on the political climate in the United States leading up to and after the 2020 United States election.

Each bubble’s size represents the follower count of the user who retweeted, giving a sense of the influence (impact) of those users. The Y-axis is the sentiment score and the X-axis is the retweet count.

“Impact” is a measure of how often a twitter user is retweeted by users with high follower counts.

Using the Apriori Algorithm, you can build a sentiment association analysis in R. See my article on Apriori Association Analysis in R.

Applying the Apriori algorithm with the single transaction format, we assigned the sentiment score as the transaction ID and retweeted_screen_name as the item ID.

This measures the association between highly retweeted accounts based on sentiment scores (negative, neutral, positive). Support is the minimum support for an itemset; minimum support was set to 0.02. A sketch of this analysis with the arules package is shown below.
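
The following is a minimal sketch of that association analysis, assuming a data frame named tweets with columns sentiment_score and retweeted_screen_name (hypothetical names) and the arules package installed:

library(arules)

# write the (transaction id, item) pairs out in "single" format
write.csv(tweets[, c("sentiment_score", "retweeted_screen_name")],
          "tweet_trans.csv", row.names = FALSE)

trans <- read.transactions("tweet_trans.csv", format = "single", sep = ",",
                           cols = c("sentiment_score", "retweeted_screen_name"),
                           header = TRUE)

# mine rules at the minimum support used above (0.02)
rules <- apriori(trans, parameter = list(supp = 0.02, conf = 0.8))

# keep rules with lift above 1 and drop redundant rules
rules_lift   <- subset(rules, lift > 1)
rules_pruned <- rules_lift[!is.redundant(rules_lift)]
inspect(head(sort(rules_pruned, by = "lift"), 10))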

The majority of the highly retweeted accounts had high-confidence associations based on sentiment values. We then focused on the highest-confidence associations that provided lift above 1. After removing redundancy, we were able to see the accounts whose sentiment values are strongly associated with one another.

According to the scatter plot above, most of the rules overlap but have very good lift due to strong associations; this is also a consequence of the limited number of transactions and the redundancy in the rules.

The analysis showed a large amount of redundancy, mostly due to the nearly nominal level of the sentiment values. Requiring high lift, using a larger minimum support and removing redundancy surface the most valuable rules.

Using R to Create Decision Tree Classification

R is a great language for creating decision tree classifications for a wide array of applications. Decision trees are a tree-like model in machine learning commonly used in decision analysis. The technique helps create strategies for reaching a particular goal based on multi-dimensional datasets.

Decision trees are commonly used for applications such as determining which consumers are at higher risk of defaulting on a loan, which factors affect whether a company can retain customers, and which students are most at risk of dropping out and require remediation, based on school attendance, grades, family structure, etc.

Below are the typical libraries for building machine learning analyses, including decision trees and linear and logistic regression.

install.packages("rsample")
install.packages("caTools")
install.packages("ROSE")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("yardstick")
install.packages("DescTools")
install.packages("margins")
install.packages("cutpointr")
library(tidyverse)
library(dplyr)
library(broom)
library(yardstick)
library(DescTools)
library(margins)
library(cutpointr)
library(caTools)
library(rsample)
library(ROSE)
library(rpart)
library(rpart.plot)
library(caret)

The following code block creates regression and decision tree analysis of custom churn predictions.

# Import the customer_churn.csv and explore it.
# Drop all observations with NAs (missing values)


customers <- read.csv('customer_churn.csv')
summary(customers)
str(customers)
customers$customerID <- NULL
customers$tenure <- NULL
sapply(customers, function(x) sum(is.na(x)))
customers <- customers[complete.cases(customers),]


#===================================================================



#  Build a logistic regression model to predict customer churn by using predictor variables (You determine which ones will be included).
# Calculate the Pseudo R2 for the logistic regression.


# Build a logistic regression model to predict customer churn by using predictor variables (You determine which ones will be included).

customers <- customers %>%
 mutate(Churn=if_else(Churn=="No", 0, 1))


str(customers)
         
regression1 <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + TechSupport + MultipleLines + InternetService, data=customers, family="binomial")


# Calculate the Pseudo R2 for the logistic regression.


regression1 %>%
  PseudoR2()

#  Split data into 70% train and 30% test datasets.
#  Train the same logistic regression on only "train" data.

#  Split data into 70% train and 30% test datasets.


set.seed(645)

customer_split <- initial_split(customers, prop=0.7)
train_customers <- training(customer_split)
test_customers <- testing(customer_split)

# Train the same logistic regression on only "train" data.

regression_train <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + TechSupport + MultipleLines + InternetService, data=train_customers, family="binomial")
#regression_test <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + tenure + TechSupport, data=test_customers)



#  For "test" data, make prediction using the logistic regression trained in Question 2.
#  With the cutoff of 0.5, predict a binary classification ("Yes" or "No") based on the cutoff value, 
#  Create a confusion matrix using the prediction result.

#. For "test" data, make prediction using the logistic regression trained in Question 2.

str(regression_train)
prediction <- regression_train %>%
  predict(newdata = test_customers, type = "response")
 
 


# With the cutoff of 0.5, predict a binary classification ("Yes" or "No") based on the cutoff value, 


#train_customers <- train_customers %>%
#  mutate(Churn=if_else(Churn=="0", "No", "Yes"))
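
A minimal sketch of that cutoff step, assuming the prediction vector and test_customers created above and the caret package for the confusion matrix (illustrative, not the original solution):

test_customers <- test_customers %>%
  mutate(predicted_prob  = prediction,
         predicted_churn = if_else(predicted_prob > 0.5, 1, 0))   # 0.5 cutoff

# confusion matrix for the logistic model on the test set
confusionMatrix(as.factor(test_customers$predicted_churn),
                as.factor(test_customers$Churn),
                positive = "1")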



The last code creates the decision tree.


train_cust_dtree <- rpart(Churn ~ ., data=train_customers, method = "class")
rpart.plot(train_cust_dtree, cex=0.8)

To check the sensitivity and specificity of the classification tree, we create a confusion matrix and ROC charts. ROC charts are receiver operating characteristic curves that show the diagnostic ability of a binary classifier as its discrimination threshold is varied.

set.seed(1304)

train_cust_dtree_over <- ovun.sample(Churn ~., data=train_customers, method="over", p = 0.5)$data
train_cust_dtree_under <- ovun.sample(Churn ~., data=train_customers,  method="under", p=0.5)$data
train_cust_dtree_both <- ovun.sample(Churn ~., data=train_customers, method="both", p=0.5)$data

table(train_customers$Churn)
table(train_cust_dtree_over$Churn)
table(train_cust_dtree_under$Churn)
table(train_cust_dtree_both$Churn)

train_cust_dtree_over_A <- rpart(Churn ~ ., data = train_cust_dtree_over, method="class")
rpart.plot(train_cust_dtree_over_A, cex=0.8)

customers_dtree_prob <- train_cust_dtree_over_A %>%
  predict(newdata = test_customers, type = "prob")

#  Create a confusion matrix using the prediction result.


head(customers_dtree_prob)

table(customers_dtree_prob)

customers_dtree_class <- train_cust_dtree_over_A %>%
  predict(newdata = test_customers, type = "class")

table(customers_dtree_class)

test_customers <- test_customers %>%
    mutate(.fitted = customers_dtree_prob[, 2]) %>%
    mutate(predicted_class = customers_dtree_class)


confusionMatrix(as.factor(test_customers$predicted_class), as.factor(test_customers$Churn), positive = "1")
#===================================================================


#  Based on prediction results in Question 3, draw a ROC curve and calculate AUC.


roc <- roc(test_customers, x = .fitted, class = Churn, pos_class = 1, neg_class = 0)

plot(roc)
auc(roc)

plot(roc) +
    geom_line(data = roc, color = "red") +
    geom_abline(slope = 1) +
    labs(title = "ROC Curve for Classification Tree")

Machine Learning with Azure ML Studio

Directions on How to Build the Predictive Model In Microsoft Azure ML

  • Sign in to Microsoft Azure using your login credentials in the  Azure portal 
  • Create a workspace for you to store your work
    • In the upper-left corner of Azure portal, select + Create a resource.
    • Use the search bar to type Machine Learning.
    • Select Machine Learning.
    • In the Machine Learning pane, select Create to begin.
    • You will provide the following information below to configure your new workspace:
      • Subscription – Select the Azure subscription that you would like to use.
      • Resource group – Create a name for your resource group which will hold resources for your Azure solution.
      • Workspace name – Create a unique name that identifies your workspace.
      • Region – select the region closest to the users to reduce latency
      • Storage account – created by default
      • Key Vault – created by default
      • Application insights – created by default
    • When you have completed configuring the workspace, select Review + Create.
    • Review the settings and make any additional changes or corrections. Lastly, select Create. When deployment of workspaces has completed you will see the message “Your deployment is Complete”. Please see the visual below as a reference. 
  • To Launch your workspace, click Go to resource
  • Next, Click the blue Launch Studio button which is under Manage your Machine Learning Lifecycle. Now you are ready to begin!!!!
  • Click on Experiments in the left panel
  • Click on NEW in the lower left corner 
  • Select Blank Experiment. The new experiment is created with a default name. You can change the name at the top of the page. 
  • Upload the data above into Ml studio
    • Drag the datasets onto the experiment canvas. (We uploaded preprocessed data.)
    • If you would like to see what the data looks like, click on the output port at the bottom of the dataset and select Visualize. Given this data, we are going to try to predict whether the IoT sensors have communication errors. 
  • Next, prepare the data
    • Remove unnecessary columns /data
      • Type “Select Columns” in the Search box and select the Select Columns in Dataset module, then drag and drop it onto the canvas. This allows you to exclude any columns that you do not want in the model. 
      • Connect Select Columns in Dataset to the Data on the canvas.
    • Choose and Apply a Learning Algorithm
      • Click on Data Transformation in the left column 
        • Next, click on the drop down Manipulation 
        • Drag Edit Metadata onto the canvas (use this to change the metadata associated with columns inside the dataset; this tells the downstream components in Azure Machine Learning how to use the selected columns).
      • Split the data 
        • Then, click on the drop down Sample and Split.
        • Choose Split Data and add it to the canvas and connect it to Edit the Metadata.
        • Click on Split Data and find the Fraction of rows in the output dataset and set it to .80. You are splitting the data to train the model using 80% of the data and test the model using 20% of the data.
  • Then you train the data 
    • Choose the drop down under Machine Learning
    • Choose the drop down under Initialize Model
    • Choose the drop down under Anomaly Detection 
    • Click on PCA- Based Anomaly Detection and add this to the canvas and connect with the Split data.  
    • Choose the drop down under Machine Learning
    • Choose the drop down under Initialize Model
    • Choose the drop down under Anomaly Detection 
    • Click on One-Class Support Vector machine and add this to the canvas and connect with the Split data.  
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Train
    • Click on Tune Model Hyperparameters and add this to the canvas and connect with the Split Data.
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Train
    • Click on Train Anomaly Detection Model
  • Then score the model 
    • Choose the drop down under Machine Learning
    • Then, choose the drop down – Score
    • Click on Score Model
  • Normalize the data
    • Choose the drop down under Data Transformation
    • Then, choose the drop down under Scale and Reduce
    • Click on Normalize Data
  • Evaluate the model – this will compare the one-class SVM and PCA – based anomaly detectors.
    • Choose the drop down under Machine Learning
    • Then, choose the drop down under Evaluate
    • Click on Evaluate Model
  • Click Run at the bottom of the screen to run the experiment. Below is how the model should look. Please click on the link to use our experiment (Experiment Name: IOT Anomaly Detection) for further reference. This link requires that you have an Azure ML account. To access the gallery, click the following public link:  https://gallery.cortanaintelligence.com/Experiment/IOT-Anomaly-Detection

Derek Moore, Erica Davis, and Hank Galbraith, authors.

Anomaly and Intrusion Detection in IoT Networks with Enterprise Scale Endpoint Communication – Pt 2

Derek Moore, Erica Davis, and Hank Galbraith, authors.

Part two of a series of LinkedIn articles based on Cognitive Computing and Artificial Intelligence Applications

Background

Several high-profile ransomware attacks have called attention to IoT network security. Assessments of security vulnerabilities and penetration testing have become increasingly important to sound design. Most of this assessment and testing takes place at the software and hardware level, but a broader approach is vital to the security of IoT networks. Protocol and traffic analysis is especially important for dedicated IoT networks, since communication and endpoints are tracked and managed. Understanding all the risks posed to these types of networks allows for a more complete risk management plan and strategy. Besides network challenges, there are challenges to scalability, operability and channels, as well as to the information being transmitted and collected in such networks. In IoT networks, the search for vulnerabilities spans the network architecture, endpoint devices and services, where services include the hardware, software and processes that make up the overall IoT architecture. Building a threat assessment or map as part of an overall security plan, and updating it on a scheduled basis, allows security professionals and stakeholders to manage all possible threats to the architecture. Whenever possible, creating simulations of possible attack vectors, understanding the behavior of such attacks and then creating models will help build an overall security management plan.

Open ports, SQL injection flaws, unencrypted services, insecure network interfaces, buffer overflow risks, lack of firewall protocols, weak authorization settings and web interface insecurity are among the types of vulnerabilities found in IoT networks and devices.

Where is the location of an impending attack? Is it occurring at the device, the server or a service? Is it occurring where the data is stored or while the data is in transit? What types of attacks can be identified? They include distributed denial of service, man-in-the-middle, ransomware, botnets, spoofing, account penetrations and others.

Business Use Case

For this business use case research study, a fictional company was created. The company is a national farmland and agricultural cooperative that supplies food to local and state markets. Part of the company’s IT infrastructure is an IoT network that uses endpoint devices for monitoring and controlling temperature, humidity and moisture across the company’s large agricultural farmlands. This network has over 2,000 IoT devices in operation on 800 acres. Any intrusion into the network by a rogue service or bad actor could have consequences for delivering fresh produce on time and with quality. The network design in the simulation below is a concept of this agricultural network. Our team created a simulation network using Cisco Packet Tracer, a tool that allows users to create and simulate packet traffic throughout a computerized network at multiple OSI layers.

Simulated data was generated using the Packet Tracer simulator. The simulated network below uses multiple routers, switches, servers and IoT devices, generating packet types such as TCP, UDP, RIPv4 and ICMP.

Network Simulation

Below is a simulation of packet routing throughout the IoT network.

Cisco Packet Tracer Simulation for IoT network.  Packet logging to test anomaly detection deep learning models.

Problem Statement

Our fictional company will be the basis of our team’s mock network for monitoring for intrusions and anomalies. Being a simulated IoT network, it contains only a few dozen IoT-enabled sensors and devices such as sprinklers, temperature and water level sensors, and drains. Since our model is designed for large-scale IoT deployment, it will be trained on publicly available data, while the simulated data will serve as a way to score the accuracy of the model. The simulation has the ability to generate the types of threats that would create anomalies. It is important to distinguish between an attack and a known issue or event (see part one of this research for IoT communication issues). The company is aware of those miscommunications and has open work orders for them. The goal is for our model to detect an actual attack on the IP network by a bad actor. Although a miscommunication is technically an anomaly, it is known to the IT staff and should not raise an alarm. Miscommunicating devices are fairly easy to detect, but for a machine learning or deep learning model they can be trickier. Raising a security alarm for daily miscommunication issues that originate from the endpoints would produce a prevalence of false positives (FP) in a machine learning confusion matrix.


A running simulation

Project Significance and Implementation

In today’s age of modern technology and the internet, it is becoming increasingly difficult to protect enterprise networks against malicious attacks. Not only are malicious actors becoming more advanced in the methodologies of their attacks, but the number of IoT devices that live and operate in a business environment is ever increasing. It needs to be a top priority for any business to create an IT strategy that protects the company’s technical architecture and core intellectual property. When assessing all potential security weaknesses, you must decompose the network model and define trust zones within the IoT architecture.

This application was designed to use Microsoft Azure Machine Learning to analyze and detect anomalies in large data sets collected from all devices on the business’s network. In an actual implementation of this application, there would be a constant data flow running through our predictive model to classify traffic as Normal, Incorrect Setup, Distributed Denial of Service (DDoS attack), Data Type Probing, Scan Attack, or Man in the Middle. Using a supervised learning method to iteratively train our model, the application would grow increasingly more cognitive and accurate at identifying these network traffic patterns correctly. If this system were fully implemented, there would also need to be actions for each of these classification patterns. For instance, if the model detected a DDoS attack coming from a certain device, the application would automatically send shutdown commands to the device, isolating it from the network and containing the attack. When these actions occur, logs would be taken and notifications automatically sent to the appropriate IT administrators and management teams, so that quick and effective action could be taken. Applications such as the one we have designed are already being used throughout the world by companies in all sectors. CrowdStrike, for instance, is a cyber technology company that produces information security applications with machine learning capabilities. Cyber technology companies such as CrowdStrike have grown ever more popular over the past few years as the number of cyber attacks has increased. We have seen firsthand how advanced these attacks can be with data breaches at the US federal government, Equifax, Facebook, and more. The need for advanced information security applications is increasing daily, not just for large companies but for small- to mid-sized companies as well. While outsourcing information security is an easy choice for some companies, others may not have the budget to afford such technology. That is where our application shows the low barrier to entry that can be attained with machine learning platforms such as Microsoft Azure ML or IBM Watson. Products such as these provide relatively easy interfaces for IT security administrators to take action into their own hands and design their own anomaly detection applications. In conclusion, our IoT Network Anomaly Detection Application is an example of how a company could design and implement its own advanced cyber security defense applications. This would better enable any company to protect its network devices and intellectual property against ever-growing malicious attacks.

Methodology

For this project, our team acquired public data from Google, Kaggle and Amazon. For the IoT model, preprocessed data was selected for the anomaly detection model. Preprocessed data from the Google open data repository was collected to test and train the models. R served as the initial data analysis and analytics environment, used to compute Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) and to evaluate the sensitivity and specificity of the models for scoring the predictability of the response variable. In R, predictability was compared between logistic regression, random forest and gradient boosting models. In the preprocessed data, a predictor (normality) variable was used for training and testing purposes. After the initial data discovery stage, the data was processed by a machine learning model in Azure ML using support vector machine and principal component analysis pipelines for anomaly detection. The response variable has the following values:

  • Normal – 0
  • Wrong Setup – 1
  • DDOS – 2
  • Scan Attack – 4
  • Man in the Middle – 5

The preprocessed dataset for intrusion detection in network-based IoT devices includes ultrasonic sensors using Arduino microcontrollers and NodeMCU, a low-cost open-source IoT platform that runs on the ESP8266 Wi-Fi module, used to send the data.

The following table represents data from the Ethernet frame, which is part of the TCP/IP packet transmitted from a source device to a destination device for network communication. The dataset is preprocessed for a network intrusion detection based system.


Source: Google.com

In the next article, we’ll be exploring the R code and Azure ML trained anomaly detection models in greater depth.

Anomaly and Intrusion Detection in IoT Networks with Enterprise Scale Endpoint Communication

This is part one of a series of articles to be published on LinkedIn based on a classroom project for ISM 647: Cognitive Computing and Artificial Intelligence Applications taught by Dr. Hamid R. Nemati at the University of North Carolina at Greensboro Bryan School of Business and Economics.

The Internet of Things (IoT) continues to be one of the most innovative and exciting areas of technology in the last decade. The IoT is a collection of devices that reside in the world and collect data from the environment around them, or through mechanical, electrical, thermodynamic or hydrological processes. These environments could be the human body, geological areas, the atmosphere, etc. The networking of IoT devices has been prevalent in many industries for years, including the gas, oil and utilities industries. As companies demand higher sample read rates from sensors, meters and other IoT devices, and as bad actors from foreign and domestic sources become more prevalent and brazen, these networks have become vulnerable to security threats due to their increasing ubiquity and evolving role in industry. In addition, these networks are prone to read-rate fluctuations that can produce false positives for anomaly and intrusion detection systems when you have an enterprise-scale deployment of devices sending TCP/IP transmissions of data upstream to central office locations. This paper focuses on developing an application for anomaly detection using cognitive computing and artificial intelligence as a way to get better anomaly and intrusion detection in enterprise-scale IoT applications.

This project uses automated machine learning capabilities to develop a cognitive application that addresses possible security threats in high-volume IoT networks such as utility, smart city and manufacturing networks. These are networks that have high communication read success rates with hundreds of thousands to millions of IoT sensors; however, they still may have issues such as:

  1. Noncommunication or missing/gap communication.
  2. Maintenance Work Orders
  3. Alarm Events (Tamper/Power outages)

In large-scale IoT networks, such interruptions are normal to business operations. Noncommunication is typically experienced because devices fail or get swapped out due to a legitimate work order. Weather events and people can also cause issues with the endpoint device itself: power outages can cause connected routers to fail, and devices can be tampered with, such as people attempting a hardwire bypass or removing a meter.

The scope of this project is to build machine learning models that address IP-specific attacks on the IoT network, such as DDoS, from within and external to the networking infrastructure. These models should be intelligent enough to distinguish network attacks (true positives) from communication issues (true negatives). Typical network communication for such an IoT network includes:

  1. Short range: Wi-Fi, Zigbee, Bluetooth, Z-Wave, NFC.
  2. Long range: 2G, 3G, 4G, LTE, 5G.
  3. Protocols: IPv4/IPv6, SLIP, uIP, RLP, TCP/UDP.

Eventually, as such machine learning and deep learning models expand, these types of communications will also be monitored.

Scope of Project

This project will focus on complex IoT systems typical in multi-tier architectures within corporations. As part of the research into the analytical properties of IT systems, this project will focus primarily on the characteristics of operations that begin with the collection of data through transactions or data sensing, and end with storage in data warehouses, repositories, billing, auditing and other systems of record. Examples include:

  1. Building a simulator application in Cisco Packet Tracer for a mock IoT network.
  2. Creating a Machine Learning anomaly detection model in Azure.
  3. Generating and collecting simulated and actual TCP/IP network traffic data from open data repositories in order to train and score the team machine learning model.

Other characteristics of the IT systems that will be researched as part of this project include systems that perform the following:

  1. Collect, store, aggregate and transport large data sets
  2. Require application integration, such as web services, remote API calls, etc.
  3. Are beyond a single stack solution.

Next: Business Use Cases and IoT security

Derek Moore, Erica Davis, and Hank Galbraith, authors.