Creating Plan Stability in Microsoft SQL Server

To ensure optimal performance of the database, an administrator must keep up with the most important database maintenance areas, including: 

  1. Rebuilding indexes 
  2. Minimizing procedure compilations and recompilations during heavy workloads 
  3. Checking page allocation, contention, and aging 

  Actions such as running the re-indexing script daily and checking for fragmentation are important for optimization health. 
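A minimal sketch of a fragmentation check (using the documented sys.dm_db_index_physical_stats DMF; the 30 percent threshold is only illustrative) that can inform whether to rebuild or reorganize:

/* Indexes in the current database with notable fragmentation */
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
INNER JOIN sys.indexes AS i
        ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30   -- illustrative threshold
ORDER BY ips.avg_fragmentation_in_percent DESC;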

Plan and SQL handles 

Optimization of code to reduce the number of logical and physical lookups occurring in memory and on disk should be the first method of troubleshooting performance issues.  Once code has been released into a production environment, it is not always possible to change it to make it faster.  At times, what is needed is to monitor for plan changes and plan execution.  As part of this effort, it's important to ensure that queries executed on the database maintain a consistent, optimal execution plan and that no significant plan changes occur.  The Appendix contains queries that will allow an administrator to observe plan changes and execution statistics.  Here is an example of how to investigate plan stability in a SQL Server database.  The table below shows a typical query used multiple times during the day with its respective plan handle, which is a hash of the query plan.  You can map a particular SQL text to a plan handle using the dynamic management function (DMF) sys.dm_exec_sql_text(sql_handle | plan_handle). 

SQL Statement: SELECT * FROM Sales.SalesOrderHeader AS h, Sales.Customer AS c, Sales.SalesTerritory AS t WHERE h.CustomerID = c.CustomerID AND c.TerritoryID = t.TerritoryID AND CountryRegionCode = @Country_region 
Plan Handle: 0x0600060066E0F61940015D4D070000008800000000000000 

Table G5-1 
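As a quick check, a plan handle can be mapped back to its SQL text and cached XML plan.  A minimal sketch using the documented DMVs/DMFs (the handle below is the placeholder value from Table G5-1):

SELECT qs.plan_handle,
       st.[text] AS sql_text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qs.plan_handle = 0x0600060066E0F61940015D4D070000008800000000000000;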

Having multiple plan handles is not necessarily an issue and doesn't mean the plan is misbehaving.  Investigating the query plan can help determine whether the plan has a high cost to the optimizer. 

The steps necessary to determine whether query execution is optimized include: 

  1. Looking into the plan cache to determine whether plan reuse is occurring. 
  2. Monitoring compilations and recompilations to make sure they are minimized and planned for. 
  3. Reporting any issues with long-running code to Itron Support. 

Checking for locking issues such as long PAGEIOLATCH/PAGELATCH waits and deadlocks is critical.  Administrators must check for deadlock events either by running traces or by querying dynamic management views. 
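One lightweight way to surface current blocking through the dynamic management views is sketched below (deadlock events themselves are usually pulled from a trace or the system_health Extended Events session):

/* Sessions currently blocked, the blocking session, and the wait involved */
SELECT wt.session_id,
       wt.blocking_session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.blocking_session_id IS NOT NULL;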

The appendix below contains examples of examining the performance of specific plan handles and of using plan guides.  Details of these methods are beyond the scope of this report and should be researched and thoroughly tested before use. 

Plan Maintenance Analysis Queries









SET NOCOUNT ON 
GO 
 
/*  Query checks for memory usage in database o/s  */ 
 
SELECT 
(physical_memory_in_use_kb/1024.0/1024.0) AS Memory_usedby_Sqlserver_GB, 
(locked_page_allocations_kb/1024.0) AS Locked_pages_used_Sqlserver_MB, 
(total_virtual_address_space_kb/1024.0/1024.0) AS Total_VAS_in_GB, 
process_physical_memory_low, 
process_virtual_memory_low 
FROM sys.dm_os_process_memory; 


SELECT object_name, cntr_value 
  FROM sys.dm_os_performance_counters 
  WHERE counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)'); 
 
SELECT physical_memory_kb FROM sys.dm_os_sys_info;        -- SQL Server 2012 and later 
SELECT physical_memory_in_bytes FROM sys.dm_os_sys_info;  -- earlier versions 





/* Aggregate read and write totals across all cached query stats */ 
SELECT sum(total_physical_reads) tot_prds, sum(max_physical_reads) tot_max_prds, sum(total_logical_reads) tot_lrds, sum(max_logical_reads) tot_max_lrds, 
 sum(total_logical_writes) tot_lwrts, sum(max_logical_writes) tot_max_lwrts 
 FROM   sys.dm_exec_query_stats stat 
GO  
 
/* Cached compiled procedure plans with aggregated CPU and I/O, ordered by reuse */ 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text] 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
/* Same as above, but including the plan handle for follow-up investigation */ 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (p.plan_handle) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 


/* Can use query to check for specific plan handles based on performance */ 
 
SELECT usecounts, size_in_bytes, cacheobjtype, 
SUM(total_worker_time / 1000) AS total_cpu_time_in_ms, 
SUM(total_physical_reads) AS total_physical_reads, 
SUM(total_logical_reads) AS total_logical_reads, 
SUM(total_logical_writes) AS total_logical_write, 
p.plan_handle, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
WHERE p.objtype = 'Proc' AND cacheobjtype = 'Compiled Plan' 
GROUP BY usecounts, size_in_bytes, cacheobjtype, [text], p.plan_handle 
ORDER BY usecounts DESC, total_logical_reads DESC, total_logical_write DESC 
GO 
 
select * from sys.dm_exec_cached_plan_dependent_objects(0x050005007E58905A307BC1EB1800000001000000000000000000000000000000000000000000000000000000) 
go 
 
-- sp_create_plan_guide_from_handle (requires @name and @plan_handle; see the plan guide examples below) 
go 
 
 
SELECT TOP(10) total_physical_reads, last_physical_reads, min_physical_reads, max_physical_reads, 
total_logical_writes, last_logical_writes, min_logical_writes, max_logical_writes, 
total_logical_reads, last_logical_reads, min_logical_reads, max_logical_reads,  
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
ORDER BY total_physical_reads DESC, total_logical_reads DESC 
GO 
 
SELECT TOP(10) total_physical_reads, max_physical_reads, total_logical_writes, max_logical_writes, 
REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM     sys.dm_exec_cached_plans AS p 
INNER JOIN sys.dm_exec_query_stats stat ON p.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE p.objtype <> 'Proc' 
ORDER BY total_physical_reads DESC, total_logical_reads DESC 
GO 
 
 
-- tempdb internal-object allocation per active request, with the statement text 
SELECT t1.session_id, t1.request_id, t1.task_alloc, 
t1.task_dealloc, t2.statement_start_offset,  
t2.statement_end_offset, REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM (Select session_id, request_id, 
SUM(internal_objects_alloc_page_count) AS task_alloc, 
SUM (internal_objects_dealloc_page_count) AS task_dealloc  
FROM sys.dm_db_task_space_usage  
GROUP BY session_id, request_id) AS t1,  
sys.dm_exec_requests AS t2 
INNER JOIN sys.dm_exec_query_stats stat ON t2.plan_handle = stat.plan_handle 
CROSS APPLY sys.dm_exec_sql_text (stat.sql_handle) 
WHERE t1.session_id = t2.session_id 
AND (t1.request_id = t2.request_id) 
 
-- I/O query stats based on the sql text 
 
SELECT top (10) (stats.total_logical_reads/stats.execution_count) AS avg_logical_reads, 
    (stats.total_logical_writes/stats.execution_count) AS avg_logical_writes, 
    (stats.total_physical_reads/stats.execution_count) AS avg_physical_reads, 
    stats.execution_count, stats.sql_handle, stats.plan_handle, 
    REPLACE (REPLACE([text], char(13), ' '), CHAR(10), ' ') AS sql_text 
FROM sys.dm_exec_query_stats stats 
CROSS APPLY sys.dm_exec_sql_text (stats.sql_handle) 
ORDER BY (total_logical_reads + total_logical_writes) DESC 
 
 
 
-- Missing index information 
 
select TOP(50) * from sys.dm_db_missing_index_group_stats 
GO 
 
select object_name(object_id) object, equality_columns, included_columns, statement from sys.dm_db_missing_index_details 
go 


/* Queries below use an optional technique known as plan guides */ 
  
-- sp_create_plan_guide_from_handle requires both @name and @plan_handle; a complete example follows 
 
-- Create a plan guide for the query by specifying the query plan in the plan cache. 
DECLARE @plan_handle varbinary(64); 
DECLARE @offset int; 
SELECT @plan_handle = plan_handle, @offset = qs.statement_start_offset 
FROM sys.dm_exec_query_stats AS qs 
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st 
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp 
WHERE text LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
 
EXECUTE sp_create_plan_guide_from_handle  
    @name =  N'Guide1', 
    @plan_handle = @plan_handle, 
    @statement_start_offset = @offset; 
GO 
-- Verify that the plan guide is created. 
SELECT * FROM sys.plan_guides 
WHERE scope_batch LIKE N'SELECT WorkOrderID, p.Name, OrderQty, DueDate%'; 
GO 

Create Plan Guides in SQL Server

sp_create_plan_guide  
@name = N'Guide1', 
@stmt = N'SELECT * FROM Sales.SalesOrderHeader AS h, 
        Sales.Customer AS c, 
        Sales.SalesTerritory AS t 
        WHERE h.CustomerID = c.CustomerID  
            AND c.TerritoryID = t.TerritoryID 
            AND CountryRegionCode = @Country_region', 
@type = N'OBJECT', 
@module_or_batch = N'Sales.GetSalesOrderByCountry', 
@params = NULL, 
@hints = N'OPTION (OPTIMIZE FOR (@Country_region = N''US''))' 
 

Creating Query Plan Stability in Oracle Relational Databases with Cost-based Optimizers

What is Query Plan Stability?

Query plan stability ensures predictable and acceptable database performance no matter how much data is being stored in the buffer cache or on disk.   It ensures that all queries perform according to business requirements and that processes run in a timely manner during all business hours.

The information presented in this document applies to object-relational databases with a cost-based optimizer. The principal audience for this documentation is database administrators, database programmers, and customer support analysts.

When confronted with a poorly performing application, the following typically occurs during troubleshooting:

  1. Ask the stakeholder when and how frequently statistics collection is performed.  If the stakeholder has not computed statistics recently, or has never computed statistics, they must compute statistics before proceeding.
  2. Ask customers if there has been a recent change in data. If so, get as much detail on the changes as possible:
    1. When and where the changes were made.
    2. The quantity of changes.

During the execution of processes, plan instability can occur due to the following factors:

  1. Lack of adherence to cardinality and primary and foreign key relationships.
  2. Lack of proper statistics.
  3. Misuse of predicates in query and plan binding.
  4. Flush of “good” plans from the shared pool (plan cache).

One of the best tools for ensuring plan stability in Oracle is the cost-based optimizer explain plan, which helps determine how costly it is to execute a plan based on its cost in CPU.

Query identifiers in Oracle (SQL_ID) are alphanumeric strings (e.g., 3mx8whn1c5jbb). 

When reviewing query execution performance, it’s always best to create a query plan:

SQL>EXPLAIN PLAN FOR 
select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

SQL>  select * from TABLE(DBMS_XPLAN.DISPLAY);

Explain plans show how the query will be executed by the cost-based optimizer once it is parsed. They provide valuable information about where the highest costs are in your query. Query tuning is essential for maintaining performance with a cost-based optimizer.

If plan instability occurs in an Oracle database, there are several options in addition to the list in the preceding sections to correct the issue.   

  1. Determine whether the query performs very poorly or does full table scans on very large, high-cardinality table objects.
  2. Make sure that the statistics collection script has been executed and that statistics are up to date.   See the appendix for a script to check statistics.
  3. If statistics are up to date, attempt to generate a plan analysis (SQL profile) of the current plan.  Note:  the Oracle Diagnostic Pack and Automatic Database Diagnostic Monitor (ADDM) need to be installed.  The profile can be generated via the OEM GUI or using scripts.

Command Line SQL Tuning and SQL Profiling.

The following steps allow an administrator to create SQL profile tuning sets, export those profiles from a source database, and import them into a new database.  This allows administrators to create and test better plans on test systems, then load them safely into production systems and enable/disable them when needed.

Run a query to find SQL currently in memory that has a high execution time.

SET HEADING ON
SET PAGESIZE 20000
SET LINESIZE 2000
COL text FOR A50
COL ctext FOR A50


select vs.last_load_time, ao.OWNER, parsing_schema_name, first_load_time, ao.OBJECT_NAME,
       vs.program_line#, executions exe, vs.sqltype, vs.sql_id, vs.rows_processed rows_processed,
       concurrency_wait_time, elapsed_time/1000000 elapsed_secs,
       elapsed_time/1000000/(case when executions = 0 then 1 else executions end) elap_per_exec_secs,
       vs.sql_fulltext text, vs.sql_text ctext
from v$sql vs, all_objects ao
where vs.PROGRAM_ID = ao.OBJECT_ID
  and parsing_schema_name not in ('SYS','SYSTEM')
  and parsing_schema_name in ('<SCHEMA_OWNER>')
  and owner not in ('SYS','SYSTEM','DBSNMP','SYSMAN','MDSYS')
order by vs.last_load_time desc, vs.parsing_schema_name, vs.first_load_time desc, program_id, vs.program_line#;

Execute a v$sqlarea query to find the address, hash value, and plan hash value that identify that query.


select address, hash_value, plan_hash_value from v$sqlarea where sql_id in ('2zd224kurqsr9');

ADDRESS          HASH_VALUE PLAN_HASH_VALUE
---------------- ---------- ---------------
00000003ACC32290 3419553494      1111780545

Prior to purging a plan hash from the shared pool, it's important to collect information about the bad plan.   You can do this by executing a DBMS_XPLAN display of the plan.  To preserve the plan prior to flushing it from the shared pool, read the section on loading a plan from the cursor cache into a baseline, then return here to continue.

SQL> EXPLAIN PLAN FOR select XML_LOADID, XML_RECORDTEXT FROM OLAPSYS.XML_LOAD_RECORDS WHERE XML_RECORDID=2;

Explained.

SQL> select * from TABLE(DBMS_XPLAN.DISPLAY);

If the plan exists in memory but has a bad plan hash value, create a tuning task and SQL tuning set for the query.

Execute the tuning task to see if the tuning advisor will generate a better plan (do not purge the plan from memory if using this step).

var stmt_task VARCHAR2(64);

SET SERVEROUTPUT ON LINESIZE 200 PAGESIZE 20000 LONG 9999

EXEC :stmt_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(sql_id => '2zd224kurqsr9', task_name => 'sga_tuning_task_1');

-- The tuning task must be executed before a report can be generated
EXEC DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'sga_tuning_task_1');

SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;

If no results are returned, then the tuning advisor was not able to tune the statement.

SQL> SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('sga_tuning_task_1', 'TEXT', 'TYPICAL', 'FINDINGS') FROM DUAL;
ERROR:
ORA-13631: The most recent execution of task sga_tuning_task_1 contains no results.
ORA-06512: at "SYS.PRVT_ADVISOR", line 5739
ORA-06512: at "SYS.DBMS_SQLTUNE", line 1045
ORA-06512: at line 1

If no results are returned, skip the following query and go to step 6.  Always make sure not to purge the SGA if you use these steps.

DECLARE
    
    sqlprofile_name VARCHAR2(30);

BEGIN
	 
    sqlprofile_name := DBMS_SQLTUNE.ACCEPT_SQL_PROFILE ( 
    task_name    => 'sga_tuning_task_1'
,   name         => 'sql_profile_1'
,   force_match  => true 
);
END;
/

Create a SQL Tuning Set from AWR.

Find the begin and end snapshot span for the SQL_ID and plan hash that you need from AWR (this applies if the plan has been purged from the pool but still exists in the AWR repository).

SET LINESIZE 2000 PAGESIZE 20000
SET LONG 99999
     select
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24') begin_interval_time,
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24') end_interval_time,
        SQL_ID,
        round((round((avg(ELAPSED_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as ELAPSED_TIME_SECS,
        round((round((avg(CPU_TIME_DELTA)/1000000),2)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as CPU_TIME_SECS,
        round((avg(BUFFER_GETS_DELTA)/(case when avg(EXECUTIONS_DELTA)=0 then 1 else avg(EXECUTIONS_DELTA) end)),2) as GETS_PER_EXEC
     from
        dba_hist_snapshot natural join dba_hist_sqlstat natural join dba_hist_sqltext
     where
      (elapsed_time_delta > 0 or elapsed_time_delta is not null)
      and SQL_ID = 'bbuc9hr5x4gqq'
     group by
        SNAP_ID,
        TO_CHAR(begin_interval_time,'DD-MON-YYYY HH24'),
        TO_CHAR(end_interval_time,'DD-MON-YYYY HH24'),
        SQL_ID
     order by
        snap_id asc
/

Create a SQL tuning set.

Begin 
  DBMS_SQLTUNE.CREATE_SQLSET (sqlset_name => '2zd224kurqsr9_TUNING_SET_1');
END;
/

 Load SQL Tuning Set.

DECLARE
   Cur sys_refcursor;
 BEGIN
    OPEN cur FOR
     SELECT VALUE(P) FROM TABLE(dbms_sqltune.select_workload_repository(begin_snap=>13645, end_snap=>13649, Basic_filter=>'sql_id=''2zd224kurqsr9''', attribute_list => 'ALL' )
) p;
DBMS_SQLTUNE.LOAD_SQLSET(sqlset_name=> '2zd224kurqsr9_TUNING_SET_1', populate_cursor=> cur);
CLOSE cur;
End;
/

 Verify tuning set is correct.

SELECT * FROM TABLE(DBMS_SQLTUNE.SELECT_SQLSET(sqlset_name => '2zd224kurqsr9_TUNING_SET_1'));

Create the SQL set staging table (it cannot be created in the SYS or SYSTEM schema).

execute DBMS_SQLTUNE.CREATE_STGTAB_SQLSET(table_name =>'TUNING_SET_1_STGTAB');

Pack the tuning set into the staging table.

execute DBMS_SQLTUNE.PACK_STGTAB_SQLSET(sqlset_name => 'bbuc9hr5x4gqq_TUNING_SET_1',staging_table_name => 'TUNING_SET_1_STGTAB', staging_schema_owner=>'<SCHEMA_OWNER_SOURCE>');

Check the count of the table to make sure it’s been populated.

SQL> select count(*) from <SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB;

  COUNT(*)
----------
         3

Export out the staging table.

[oracle]$ expdp system/******* DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES='<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB'

Export: Release 11.2.0.4.0 - Production on Fri Mar 9 04:10:34 2018

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Starting "SYSTEM"."SYS_EXPORT_TABLE_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP DIRECTORY=BACKUPS TABLES=<SCHEMA_OWNER_SOURCE>.TUNING_SET_1_STGTAB
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 9.562 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
. . exported "<SCHEMA_OWNER_SOURCE>"."TUNING_SET_1_STGTAB"             54.01 KB       3 rows
Master table "SYSTEM"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_TABLE_01 is:
  /data/backups/TUNIING_SET_1.DMP
Job "SYSTEM"."SYS_EXPORT_TABLE_01" successfully completed at Fri Mar 9 04:11:10 2018 elapsed 0 00:00:32


Import (impdp) the staging table into the target database:

Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Master table "SYSTEM"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
Starting "SYSTEM"."SYS_IMPORT_FULL_01":  system/******** DUMPFILE=TUNIING_SET_1.DMP REMAP_SCHEMA=<schema_owner_source>:<schema_owner> DIRECTORY=BACKUPS
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "<SCHEMA_OWNER>"."TUNING_SET_1_STGTAB"                54.01 KB       3 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/MARKER
Job "SYSTEM"."SYS_IMPORT_FULL_01" successfully completed at Fri Mar 9 04:29:52 2018 elapsed 0 00:00:41

Purge the plan from the shared pool on the source database.

exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');

*** OR ***

set pagesize 20000
select 'exec sys.dbms_shared_pool.purge(''' || address || ',' ||  hash_value   || ''',''C'');' from v$sqlarea where sql_text like '%TASKTEMPLATEPARAMETERVALUE X  where (1=1)%'
/

'EXECSYS.DBMS_SHARED_POOL.PURGE('''||ADDRESS||','||HASH_VALUE||''',''C'');'
--------------------------------------------------------------------------------
exec sys.dbms_shared_pool.purge('00000003ACC32290,3419553494','C');
exec sys.dbms_shared_pool.purge('0000000127D41C68,3957004154','C');
exec sys.dbms_shared_pool.purge('00000001A851BB70,525044174','C');
exec sys.dbms_shared_pool.purge('000000016C757A98,4128744307','C');
exec sys.dbms_shared_pool.purge('000000025115E2A8,4132567071','C');

Install the new plan on the target database.

Unpack the SQL set in the target database.
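The unpack call itself is not shown here; a minimal sketch using DBMS_SQLTUNE.UNPACK_STGTAB_SQLSET (with a placeholder schema owner), followed by a check against DBA_SQLSET that produces the listing below:

-- Unpack all tuning sets found in the imported staging table
BEGIN
  DBMS_SQLTUNE.UNPACK_STGTAB_SQLSET(
    sqlset_name          => '%',
    replace              => TRUE,
    staging_table_name   => 'TUNING_SET_1_STGTAB',
    staging_schema_owner => '<SCHEMA_OWNER>');
END;
/

-- List the tuning sets now present in the target database
SELECT name, created, statement_count FROM dba_sqlset;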


NAME
--------------------------------------------------------------------------------
CREATED             STATEMENT_COUNT
------------------- ---------------
bbuc9hr5x4gqq_TUNING_SET_1
2018-03-09 04:33:59               

Load plan into SQL Plan Management (SPM).

set serveroutput on
declare
my_integer pls_integer;
begin
my_integer := dbms_spm.load_plans_from_sqlset (
sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
sqlset_owner => 'SYS',
fixed => 'YES',
enabled => 'YES'
);
DBMS_OUTPUT.PUT_line(my_integer);
end;
/

Gather stats on objects used by tuning set.

begin
dbms_stats.gather_table_stats( 
ownname=> '<schema_owner>', 
tabname=> 'TASKTEMPLATEPARAMETERVALUE', 
estimate_percent=> null, 
cascade=> TRUE, 
degree=> null, 
no_invalidate=> DBMS_STATS.AUTO_INVALIDATE, 
granularity=> 'AUTO', 
method_opt=> 'FOR ALL COLUMNS SIZE AUTO');
end;
/

Load the known SQL tuning set (STS) into the baseline.

DECLARE
my_plans pls_integer;
BEGIN
  my_plans := DBMS_SPM.LOAD_PLANS_FROM_SQLSET(
     sqlset_name => '2zd224kurqsr9_TUNING_SET_1',
     basic_filter => 'plan_hash_value = 1111780545');
END;
/

List out baseline.

SELECT * FROM dba_sql_plan_baselines;

Make sure that the plan we would like the optimizer to use is fixed and enabled.

DECLARE
    my_plans pls_integer;
BEGIN
    my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
        sql_handle      => '<sql_handle>',
        plan_name       => '<plan_name>',
        attribute_name  => 'ENABLED',
        attribute_value => 'YES');
END;
/

For the plan that was manually loaded, check to make sure it has been accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;

Load old plan(s) from cache into baseline to preserve it.

Query the pool.

Select sql_id, exact_matching_signature, force_matching_signature from v$sql where sql_text like '%<sql_text>%';

Create baseline from plan.

SET SERVEROUTPUT ON

declare
my_int pls_integer;
begin
my_int := DBMS_SPM.load_plans_from_cursor_cache(sql_id => '2zd224kurqsr9', plan_hash_value => 1111780545, fixed => 'NO', enabled => 'NO');
DBMS_OUTPUT.PUT_line(my_int);
end;
/

List baseline to make sure it has been loaded.

SELECT * FROM dba_sql_plan_baselines where signature=<exact matching signature query>;

Load the current plan from the cursor cache into the baseline and disable it with the following command.  Also make certain that the baseline is not accepted.

DECLARE
    my_plans pls_integer;
BEGIN
    my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
        sql_handle      => '<sql_handle>',
        plan_name       => '<plan_name>',
        attribute_name  => 'ENABLED',
        attribute_value => 'NO');
END;
/

Purge the SQL from the cursor cache so that the optimizer will pick the new plan from the baseline.

select 'DBMS_SHARED_POOL.PURGE ('''||ADDRESS||', '||HASH_VALUE||''', ''C'');' from V$SQLAREA where SQL_ID = '2zd224kurqsr9';

exec DBMS_SHARED_POOL.PURGE ('000000051A0961B0, 1111780545', 'C');

Confirmation and rollback procedures

You can roll back the changes by enabling the previous plans within SQL Plan Management and running the application to reload the plans into the pool.

DECLARE
    my_plans pls_integer;
BEGIN
    my_plans := DBMS_SPM.ALTER_SQL_PLAN_BASELINE (
        sql_handle      => '<sql_handle>',
        plan_name       => '<plan_name>',
        attribute_name  => 'ENABLED',
        attribute_value => 'YES');
END;
/

Check to make sure old plan is enabled and accepted.

select sql_handle, plan_name, enabled, accepted,fixed,origin from dba_sql_plan_baselines where signature=<signature>;

Tableau Visualizations of the “Great Recession” (2007-2009)

Executive Summary

The Great Recession was one of the most turbulent economic periods of the past eighty years.  It was a global economic recession that impacted hundreds of banks, some of them responsible for financing the Gross Domestic Product (GDP) of entire countries. During the recession, many banks closed due to the speculation that flooded into the real estate market through new banking products that gave loans to millions of subprime consumers.  These products allowed people to take out loans with low interest rates but very complicated terms that made it much easier to end up in foreclosure proceedings. During this period, home mortgage defaults skyrocketed in many states, further depressing the market. This in turn caused many people to lose their life savings as well as their homes and jobs. It is now popularly considered the longest period of economic downturn since the Great Depression of the 1930s.

Since the Great Recession, many financial regulations and policies have been put in place to prevent it from happening again.  The most prominent of these was the Dodd-Frank Wall Street Reform and Consumer Protection Act, passed by Congress, which amended many existing regulations and created new government agencies tasked with overseeing aspects of the financial banking system in order to protect consumers.  When Donald Trump was elected President in 2016, he signed a new law rolling back significant portions of Dodd-Frank.

One of the upsides of the recession is how many articles, books, and analyses were produced before, during, and after the financial crisis.  This is of significant importance, because understanding how the financial markets got to that point in 2008 helps consumers better understand and prepare for what not to do.  Another benefit of studying this period is seeing how many banks gambled with risky financial products, only to lose all the money they had.  Many consumers are now educated enough not to take on these risky loans, and banks scrutinize mortgage applications more closely.  In the years leading up to the recession, speculation ballooned housing prices until a crash became inevitable.

The following report details, state by state, the impact of this period on the housing market, consumers, and overall housing values to give the audience a better understanding of the quantitative impact, monthly and yearly.  I will also show visualizations of housing prices detailing the loss of housing value, which left many borrowers "under water" on their home mortgages.  The report will also focus on four states (North Carolina, California, Massachusetts, and Florida) and the impact on job losses and employment each month during the recession.

Charts and Graphs

One of the most crippling results of the Great Recession was its effect on consumers.  Many consumers lost their homes as the downturn rippled throughout the country.  Figure 1 shows an aggregate picture of housing prices between 1996 and 2017 for the four states analyzed.  Notice the steep growth in housing values and then the sudden drop during and shortly after the recession.  Housing values would eventually rise again, but trillions of dollars of equity were lost, and many communities never recovered.

Figure (1):  Average condo home prices in Florida and California were impacted more heavily by the economic downturn and drop in housing prices compared to North Carolina and Massachusetts.  Source Zillow.com

The impact on housing prices in California and Florida does not really come as a surprise.  Both of these states tend to have housing values greater than states where the cost of living is lower.  These states also tend to have residents with higher incomes than the general population (retirees and white-collar workers).

When compared to the entire country, several states saw changes in the percentage of income alongside changes in home ownership.  This is an indication of how well some states weathered the economic downturn among their populations.  States such as California, Nevada, and many areas in the Southwest saw a precipitous drop in home ownership along with a drop in income (with the exception of Utah).  Many states in the South did see drops in home ownership, but on average the incomes of their populations remained the same.  North Carolina, for instance, saw no net loss of income, but an increase in home ownership.  States such as Nevada saw both large losses in income and losses in home ownership, as shown in Figure 2.  Nevada was hit particularly hard by the recession.

Figure (2):  A heat map of the country comparing changes in income and home ownership.  Source GeoFred. https://geofred.stlouisfed.org/

The next visualization shows the average unemployment rate, which increased significantly during and after the Great Recession.  Figure 3 shows how elevated unemployment lasted longer than the recession itself, revealing the economic impact on families beyond the initial wave of economic decline.

Figure (3):  Line graph of unemployment rates. Source U.S. Bureau of Labor Statistics.  https://www.bls.gov/cps/tables.htm

Conclusion

What these graphs show us is that during the economic downturn there was not only an impact on the housing market, but also on income growth and unemployment, showing the scope and depth of the problem.  One of the unfortunate results of the recession was that many banks considered "too big to fail" received government funding to bail them out, while many homeowners and consumers were left to feel the brunt of the economic recession on their own.

Comparing Machine Learning Models in Determining Credit Worthiness for Bank Loans

The R language has a number of machine learning libraries for both supervised and unsupervised learning. These include techniques such as linear and logistic regression, decision trees, random forests, and generalized boosted regression modeling, among others. I strongly recommend learning how these models work and how they can be used for predictive analytics.

Part of the Machine Learning process includes the following:

  1. Sample: Create a sample set of data either through random sampling or top tier sampling.  Create a test, training and validation set of data.
  2. Explore: Use exploratory methods on the data.  This includes descriptive statistics, scatter plots, histograms, etc.
  3. Modify:  Clean, prepare, impute or filter data.  Perform cluster analysis, association and segmentation.
  4. Model:  Model the data using Logistic or Linear regression, Neural Networking, and Decision Trees.
  5. Assess:  Assess the model by comparing it to other model types and against real data. Determine how close your model is to reality.  Test the data using hypothesis testing.

When creating machine learning models for any application, it is wise to follow a process flow such as the one outlined above.

In the following example, we use machine learning to determine the credit worthiness of prospective borrowers for a bank loan.

The loan data consists of the following inputs:

  1. Loan amount
  2. Interest rate
  3. Grade of credit
  4. Employment length of borrower
  5. Home ownership status
  6. Annual Income
  7. Age of borrower

The response variable, used to predict the default rate, is:

  1. Loan status (0 or 1).

After loading the data into R, we partition it into training and testing sets.

loan <- read.csv("loan.csv", stringsAsFactors = TRUE)

str(loan)

## Split the data into 70% training and 30% test datasets

library(rsample)
set.seed(634)

loan_split <- initial_split(loan, prop = 0.7)

loan_training <- training(loan_split)
loan_test <- testing(loan_split)

Create a balanced training dataset using the ROSE library; ovun.sample with method = "both" combines over- and under-sampling to even out the classes.

str(loan_training)

table(loan_training$loan_status)

library(ROSE)

loan_training_both <- ovun.sample(loan_status ~ ., data = loan_training, method = "both", p = 0.5)$data

table(loan_training_both$loan_status)

Build a logistic regression model and a classification tree to predict loan default.

loan_logistic <- glm(loan_status ~ . , data = loan_training_both, family = "binomial")

library(rpart)

loan_ctree <- rpart(loan_status ~ . , data = loan_training_both, method = "class")

library(rpart.plot)

rpart.plot(loan_ctree, cex=1)

Build the ensemble models (random forest, gradient boosting) to predict loan default.

library(randomForest)

loan_rf <- randomForest(as.factor(loan_status) ~ ., data = loan_training_both, ntree = 200, importance=TRUE)

plot(loan_rf)

varImpPlot(loan_rf)

library(gbm)

Summarize gradient boosting model

loan_gbm <- gbm(loan_status ~ ., data = loan_training_both, n.trees = 200, distribution = "bernoulli")
summary(loan_gbm)

Use the ROC (receiver operating characteristic) curve and compute the AUC (area under the curve) to check the specificity and sensitivity of the models.

# Step 1. Predicting on test data

predicted_logistic <- loan_logistic %>% 
  predict(newdata = loan_test, type = "response")

predicted_ctree <- loan_ctree %>% 
  predict(newdata = loan_test, type = "prob")

predicted_rf <- loan_rf %>% 
  predict(newdata = loan_test, type = "prob")

predicted_gbm <- loan_gbm %>% 
  predict(newdata = loan_test, n.trees = 200, type = "response")

# Step 2. Attach the predictions to the test set as the columns referenced by roc()

loan_test$.fitted_logistic <- predicted_logistic
loan_test$.fitted_ctree <- predicted_ctree[, 2]
loan_test$.fitted_rf <- predicted_rf[, 2]
loan_test$.fitted_gbm <- predicted_gbm

# Step 3. Create ROC and Compute AUC 

library(cutpointr)

roc_logistic <- roc(loan_test, x= .fitted_logistic, class = loan_status, pos_class = 1 , neg_class = 0)
roc_ctree<- roc(loan_test, x= .fitted_ctree, class = loan_status, pos_class = 1 , neg_class = 0)
roc_rf<- roc(loan_test, x= .fitted_rf, class = loan_status, pos_class = 1 , neg_class = 0)
roc_gbm<- roc(loan_test, x= .fitted_gbm, class = loan_status, pos_class = 1 , neg_class = 0)

plot(roc_logistic) + 
  geom_line(data = roc_logistic, color = "red") + 
  geom_line(data = roc_ctree, color = "blue") + 
  geom_line(data = roc_rf, color = "green") + 
  geom_line(data = roc_gbm, color = "black")

auc(roc_logistic)
auc(roc_ctree)
auc(roc_rf)
auc(roc_gbm)

These help you compare and score which model works best for the type of data presented in the test set. Looking at the ROC chart, you can see that the gradient boosting model has the best performance of all the models, as its curve is closest to 1.00. Classifiers whose curves push toward the top-left corner, where sensitivity is close to 1.00 and the false positive rate (1 − specificity) is close to 0.00, have the best performance.
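As a convenience (a small sketch, not part of the original workflow), the four AUC values can be gathered into one data frame for a side-by-side comparison:

# Collect the AUC values returned by auc() into a single table and sort by performance
model_auc <- data.frame(
  model = c("logistic", "classification tree", "random forest", "gbm"),
  auc = c(auc(roc_logistic), auc(roc_ctree), auc(roc_rf), auc(roc_gbm))
)
model_auc[order(-model_auc$auc), ]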

Storytelling with R Markdown, Storyboards and Knitr

Storytelling with data is a critical aspect of data visualization. The ability to take massive amounts of data and simplify it for an audience, creatively and with meaning and purpose, is a skill that is critical to data science. Among the plethora of tools available for effective storytelling (Tableau, Power BI, Data Driven Documents (D3), etc.), there are a few others that don't get mentioned as often.

One of my favorite tool combinations is R Markdown, storyboards, and knitr. R Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents using existing R code (for more information go to http://rmarkdown.rstudio.com).

When you create a markdown document with the extension *.Rmd, you are given a button called Knit. Once clicked, a document is generated that includes both the content and the output of any embedded R code chunks within the document.

In this example, we take R Markdown syntax with R code that summarizes the built-in cars dataset and plots the pressure dataset to demonstrate how this works:

---
title: " Markdown 2"
author: "Derek Moore"
date: "3/22/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## R Markdown


```{r cars}
summary(cars)
```

## Including Plots

You can also embed plots, for example:

```{r pressure, echo=FALSE}
plot(pressure)

```

The Knit button is located in the top left-hand corner of the RStudio source editor toolbar.

When pressed, the knitr package creates either a PDF, a Word document, or HTML, based on your settings. In my case, it's an HTML file.
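Knitting can also be triggered from the R console with rmarkdown::render(); the file name here is just a hypothetical example:

# Render an R Markdown file to the output format declared in its YAML header
rmarkdown::render("report.Rmd")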

Finally, a storyboard is a great storytelling format that can be created with R Markdown. By enabling the storyboard option in the YAML header (via the flexdashboard package), you can create dynamic storyboards in HTML.

---
title: "2008 Recession"
author: "Derek Moore"
output: 
  flexdashboard::flex_dashboard:
      storyboard: true
      theme: bootstrap
      orientation: rows
 
---

```{r setup, include = FALSE, echo=FALSE}
library(ggplot2)
library(dplyr)
library(readr)
library(DT)
library(flexdashboard)
library(tidyverse)
library(datasets)
library(ggplot2)
library(grid)
#library(png)
#library(imager)
#library(plyr)
#install.packages("tidycensus")
#library(tidycensus)
#install.packages("tmap")
#library(tmap)
#library(tmaptools)
#library(sf)
#install.packages("imager")
#library(imager)

knitr::opts_chunk$set(fig.width = 5, fig.asp = 1/3)

setwd("C:/Dev/ISM 646/Assignment2/")
load(file = "Assignment2_646.RData")

```

<font face="sans-serif" size="1" color="#000000">Percentage of Subprime borrowers<br>during the Great Recession (2005 to 2009) </font> 
====================================================================

Row {data-width=100}
------------------------------------------------------------------

### <br><br><br><br><br> <font face="sans-serif" size="4" color="#000000"> The Great Recession was one of the most turbulent economic periods of the past eighty years, lasting from December of 2007 to June of 2009.  It was a global economic recession that impacted hundreds of banks, some responsible for financing the Gross Domestic Product (GDP) of entire countries.  During this period, many banks closed.  Thousands lost jobs and fell into poverty due to the collapsing economy, which was fueled by rising federal interest rates and market speculation centered around mortgage-backed securities that failed as consumers defaulted on their mortgages.  This sent housing prices plummeting, causing more economic turmoil in local economies around the world. As banks failed, businesses and consumers lost money.  In the U.S., the Great Recession ended with a GDP decline of 4.3 percent and an unemployment rate of 10 percent. </font><br><br> <font face="sans-serif" size="4" color="#000000">  The largest crisis of the Great Recession was subprime mortgages. Hedge funds, insurance companies, banks and other financial institutions created or insured mortgage-backed securities in an attempt to make more money from the creation of credit default swaps (CDS), which tended to have higher rates of return.  In addition to this, the Federal Reserve raised rates. Adjustable-Rate Mortgages (ARMs) and Interest Only (IO) loans were being combined within the CDSs to give them high investor ratings, so they appeared safe. This created a huge incentive for banks to approve subprime (low-credit, high-risk) borrowers.  Derivatives spread the risk globally, contributing to the 2007 banking crisis and the Great Recession. </font>


Row {.tabset .tabset-fade}
--------------------------------------------------------------------

### % Subprime Borrowers in North Carolina (2005-2010)

```{r Subprime Borrowers in North Carolina}

ChartA <- ggplot(Subprime_NC, aes(x = `Year-Quarter`, y = Percent, color=`Percent`)) + 
  geom_point(size=3) + 
  geom_smooth(method="lm", se=FALSE) +
  ylab("% Receiving Subprime Loans (NC)") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1 ,size = 8), axis.title.y=element_text(size=5)) +
    scale_x_discrete(limit = c("2005 Q1","2005 Q2","2005 Q3","2005 Q4","2006 Q1","2006 Q2","2006 Q3","2006 Q4","2007 Q1","2007 Q2","2007 Q3","2007 Q4","2008 Q1","2008 Q2","2008 Q3","2008 Q4","2009 Q1","2009 Q2","2009 Q3","2009 Q4","2010 Q1"))



plot(ChartA)

```


### % Subprime Borrowers in Massachusetts (2005-2010)


```{r Subprime Borrowers in Massachusetts}

ChartA <- ggplot(Subprime_MA, aes(x = `Year-Quarter`, y = Percent, color=`Percent`)) + 
  geom_point(size=3) + 
  geom_smooth(method="lm", se=FALSE) +
  ylab("% Receiving Subprime Loans (MA)") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1 ,size = 8), axis.title.y=element_text(size=5)) +
  scale_x_discrete(limit = c("2005 Q1","2005 Q2","2005 Q3","2005 Q4","2006 Q1","2006 Q2","2006 Q3","2006 Q4","2007 Q1","2007 Q2","2007 Q3","2007 Q4","2008 Q1","2008 Q2","2008 Q3","2008 Q4","2009 Q1","2009 Q2","2009 Q3","2009 Q4","2010 Q1"))

plot(ChartA)

```


### % Subprime Borrowers in California (2005-2010)


```{r Subprime Borrowers in California}

ChartA <- ggplot(Subprime_CA, aes(x = `Year-Quarter`, y = Percent, color=`Percent`)) + 
  geom_point(size=3) + 
  geom_smooth(method="lm", se=FALSE) +
  ylab("% Receiving Subprime Loans (CA)") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1 ,size = 8), axis.title.y=element_text(size=5)) +
  scale_x_discrete(limit = c("2005 Q1","2005 Q2","2005 Q3","2005 Q4","2006 Q1","2006 Q2","2006 Q3","2006 Q4","2007 Q1","2007 Q2","2007 Q3","2007 Q4","2008 Q1","2008 Q2","2008 Q3","2008 Q4","2009 Q1","2009 Q2","2009 Q3","2009 Q4","2010 Q1"))

plot(ChartA)


```


### % Subprime Borrowers in Florida (2005-2010)



```{r Subprime Borrowers in Florida}

ChartA <- ggplot(Subprime_FL, aes(x = `Year-Quarter`, y = Percent, color=`Percent`)) + 
  geom_point(size=3) + 
  ylab("% Receiving Subprime Loans (FL)") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1 ,size = 8), axis.title.y=element_text(size=5)) +
  scale_x_discrete(limit = c("2005 Q1","2005 Q2","2005 Q3","2005 Q4","2006 Q1","2006 Q2","2006 Q3","2006 Q4","2007 Q1","2007 Q2","2007 Q3","2007 Q4","2008 Q1","2008 Q2","2008 Q3","2008 Q4","2009 Q1","2009 Q2","2009 Q3","2009 Q4","2010 Q1"))

plot(ChartA)


```

Creating a Twitter Sentiment Association Analysis Using Association Rules and Recommender System Methods

Contextual text mining methods extract information from documents, live data streams, and social media.  In this project, thousands of user tweets were extracted to generate sentiment analysis scores.

Sentiment analysis is a common text classification tool that analyzes streams of text data in order to ascertain the sentiment (subject context) of the text, which is typically classified as positive, negative or neutral. 

In the R sentiment analysis engine our team built, the sentiment score has a range of -5 to 5. Numbers within this range indicate the strength and direction of the sentiment. 

Sentiment    Score
Negative     -5
Neutral       0
Positive      5

Sentiment scores are determined by a text file of key words and scores called the AFINN lexicon.  It's a popular and simple lexicon used in sentiment analysis.  

New versions of the file are released in source repositories and contain over 3,300 words, each with a score based on its level of positivity or negativity.
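As an illustration of how such scores can be produced (a sketch using the tidytext package; the tweets data frame and its tweet_id/text columns are hypothetical):

library(dplyr)
library(tidytext)

# AFINN lexicon: one row per word with a value from -5 to 5
# (may prompt to download the lexicon via the textdata package on first use)
afinn <- get_sentiments("afinn")

# Tokenize each tweet and sum the AFINN values of its words
tweet_scores <- tweets %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") %>%
  group_by(tweet_id) %>%
  summarize(sentiment_score = sum(value))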

Twitter is an excellent example of sentiment analysis.

Here is an example of exploring the sentiment scores based on retweets filtered on the keywords:

  1. Trump
  2. Biden
  3. Republican
  4. Democrat
  5. Election

The data was created using a sentiment engine built in R.  It is mostly based on the political climate in the United States leading up to and after the 2020 United States election.

Each bubble's size represents the follower count of the user who retweeted, which gives a sense of the influence (impact) of those users. The Y-axis is the sentiment score, and the X-axis represents the retweet count for that account.

“Impact” is a measure of how often a twitter user is retweeted by users with high follower counts.

Using the Apriori Algorithm, you can build a sentiment association analysis in R. See my article on Apriori Association Analysis in R.

Applying the Apriori algorithm using the single format, we assigned the sentiment score as the transaction and retweeted_screen_name as the item ID.  

This measures the association between highly retweeted accounts based on their sentiment scores (negative, neutral, positive).  Support is the minimum support for an itemset; minimum support was set to 0.02.
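A minimal sketch of that setup with the arules package (the CSV name and column names are hypothetical stand-ins for the exported tweet data):

library(arules)

# Single format: each row pairs a transaction id (sentiment score) with an item (retweeted screen name)
tweet_trans <- read.transactions("tweet_sentiment.csv", format = "single", sep = ",",
                                 cols = c("sentiment_score", "retweeted_screen_name"), header = TRUE)

# Minimum support of 0.02, as described above
tweet_rules <- apriori(tweet_trans, parameter = list(supp = 0.02, conf = 0.8, minlen = 2))
tweet_rules <- tweet_rules[!is.redundant(tweet_rules)]
inspect(sort(tweet_rules, by = "lift"))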

The majority of the highly retweeted accounts had high-confidence associations based on sentiment values. We then focused on the highest-confidence associations that provided lift above 1.  After removing redundancy, we were able to see the accounts whose sentiment values are strongly associated with one another.

According to the scatter plot above, most of the rules overlap but have very good lift due to strong associations; this is also influenced by the limited number of transactions and the redundancy in the rules.

The analysis showed a large amount of redundancy, mostly due to the nearly nominal level of the sentiment values.  Filtering for high lift, using a larger minimum support, and removing redundant rules surfaces the most valuable rules.

Apriori Association Analysis using R

install.packages("tidyverse")
library(tidyverse)

# install.packages("arules")
library(arules)
# install.packages("arulesViz")
library(arulesViz)

# prepare for transaction data

my_basket1 <- read.transactions("GroceryStore_Basket.csv", format="basket", sep=",")

my_basket1

inspect(my_basket1)

my_basket2 <- read.transactions("GroceryStore_Single.csv", format="single", sep=",", cols = c("TransactionID","Item"), header= TRUE)

inspect(my_basket2)

## (1) Import the grocery store data as transactions 

summary(my_basket2)

itemFrequencyPlot(my_basket2)

rules <- apriori(my_basket2, parameter = list(supp=0.01, conf=0.8, maxlen =4))

summary(rules)
inspect(rules)

rules <- sort(rules, by = 'confidence', decreasing = TRUE)
inspect(rules[1:10])

itemFrequencyPlot(my_basket2)
## (2) Summarize and visualize transaction data 

rules <- apriori(my_basket2, parameter=list(supp=0.01, conf=0.8, maxlen=4, minlen=2))

summary(rules)
inspect(rules)

rules <- sort(rules, by = "confidence")

## (3) Apply the Apriori algorithm

## Remove redundant rules

is.redundant(rules)

inspect(rules[is.redundant(rules)])

rules2 <- rules[!is.redundant(rules)]
inspect(rules2)

plot(rules2)
plot(rules2, method = "graph")
plot(rules[1:10], method = "graph")


bread_rules <- apriori(my_basket2, parameter = list(supp=0.01, conf=0.8, maxlen=4), appearance=list(default="lhs", rhs = "BREAD"))

bread_rules <- sort(bread_rules, by = "confidence", decreasing = TRUE)
inspect(bread_rules)
plot(bread_rules, method="graph")

Using the American Community Survey API in R

The United States Census Bureau is one of the largest data collection and aggregation organizations in the United States. Its survey and collection processes allow district lines to be drawn for voting, help local, state, and municipal organizations determine how to allocate budgets, and give non-profit organizations insight into the changing demographics of the United States. Among these incredibly valuable datasets is the American Community Survey (ACS), which collects data on race, gender, household income, employment, education, and age of citizens within each U.S. state. The ACS API (Application Program Interface) is a valuable tool to collect and visualize the ACS data without having to store it locally. The API allows you to interface with the U.S. Census Bureau portal and load the data directly into R.

First, I load the necessary libraries and packages.

library(tidyverse)
library(tidyr)
library(ggplot2)
library(dplyr)
install.packages("broom")
library(readxl)
install.packages("stringi")
library(stringi)
install.packages("tidycensus")
library(tidycensus)
install.packages("tmap")
library(tmap)
library(tmaptools)
library(sf)
library(png)
install.packages("imager")
library(imager)

To load data from the ACS API, you have to apply for a U.S. Census API key. To find out more about using the U.S. Census API, see the Census API User's Guide.

census_api_key('<api_key>', install=TRUE, overwrite=TRUE)

Use get_acs() to pull the data into R.

ACS_2010 <- get_acs("state",  year=2010, variables="S1702_C02_001", output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2011 <- get_acs("state", variables="S1702_C02_001", year=2011, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2012 <- get_acs("state", variables="S1702_C02_001", year=2012, output="tidy", geometry=TRUE) %>%
  select(-moe)
  
ACS_2013 <- get_acs("state", variables="S1702_C02_001", year=2013, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2014 <- get_acs("state", variables="S1702_C02_001", year=2014, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2015 <- get_acs("state", variables="S1702_C02_001", year=2015, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2016 <- get_acs("state", variables="S1702_C02_001", year=2016, output="tidy", geometry=TRUE) %>%
  select(-moe)

ACS_2017 <- get_acs("state", variables="S1702_C02_001", year=2017, output="tidy", geometry=TRUE) %>%
  select(-moe)

The variable S1702_C02_001 is the ACS variable ID for the category of data that will be loaded; it is used here to map poverty levels by state (see the map titles below). Use tidyverse functions to organize and aggregate.

ACS_geo_2011 <- ACS_2011 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2012 <- ACS_2012 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2013 <- ACS_2013 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2014 <- ACS_2014 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2015 <- ACS_2015 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate)) 

ACS_geo_2016 <- ACS_2016 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))

ACS_geo_2017 <- ACS_2017 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))
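
The first map call below references ACS_geo_2010; it follows the same aggregation pattern as the other years:

ACS_geo_2010 <- ACS_2010 %>%
  select('GEOID','NAME','variable','estimate','geometry') %>%
  filter(variable=='S1702_C02_001') %>%
  group_by(GEOID, NAME) %>%
  summarize(estimate = sum(estimate))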

To generate the choropleth maps from the ACS data, use tmap calls.

jpeg(file="ACS_geo_2010.jpg")
tm_shape(ACS_geo_2010) + tm_polygons("estimate") + tm_layout(title.position=c("left","top"), title="Poverty Levels in U.S. Post-Recessions", asp=1)
dev.off()

plot(load.image("ACS_geo_2010.jpg"), axes=FALSE)


tm_shape(ACS_geo_2011) + tm_polygons("estimate")

tm_shape(ACS_geo_2012) + tm_polygons("estimate")

tm_shape(ACS_geo_2013) + tm_polygons("estimate")

tm_shape(ACS_geo_2014) + tm_polygons("estimate")

tm_shape(ACS_geo_2015) + tm_polygons("estimate")

tm_shape(ACS_geo_2016) + tm_polygons("estimate")

tm_shape(ACS_geo_2017) + tm_polygons("estimate")

Data from the ACS portal can also be used to compare home values by year for certain states (here a combined data frame, ACS_Data, holds these measures by state and year).


ACS_Data_Housing <- ACS_Data %>%
  select('Home Values','Household Income','Bankruptcies','Percent Homeownership','Percent People in Poverty','State','Year') %>%
  filter(State %in% c("North Carolina","Massachusetts","Florida","California")) %>%
  group_by(`Year`)

ggplot(data=ACS_Data_Housing, aes(x=Year, y=`Home Values`, group=as.factor(`State`), color=as.factor(`State`))) +
   geom_line() + geom_point() +
  ylab("Home Values") +
  labs(color = "States")

Using R to Create Decision Tree Classification

R is a great language for creating decision tree classifications for a wide array of applications. Decision trees are tree-like models used in machine learning and commonly applied in decision analysis. The technique is often used to create strategies for reaching a particular goal based on multi-dimensional datasets.

Decision trees are commonly used for applications such as determining which consumers are at higher risk of defaulting on a loan, which factors impact whether a company can retain its customers, and which students are most at risk of dropping out and require remediation, based on school attendance, grades, family structure, and so on.

Below are the typical libraries for building machine learning analyses, including decision trees and linear and logistic regression (the install.packages() calls only need to be run once):

install.packages("rsample")
install.packages("caTools")
install.packages("ROSE")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("yardstick")
install.packages("DescTools")
install.packages("margins")
install.packages("cutpointr")

library(tidyverse)
library(dplyr)
library(broom)
library(yardstick)
library(DescTools)
library(margins)
library(cutpointr)
library(caTools)
library(rsample)
library(ROSE)
library(rpart)
library(rpart.plot)
library(caret)

The following code blocks create regression and decision tree analyses for customer churn prediction.

# Import the customer_churn.csv and explore it.
# Drop all observations with NAs (missing values)


customers <- read.csv('customer_churn.csv')
summary(customers)
str(customers)
customers$customerID <- NULL
customers$tenure <- NULL
sapply(customers, function(x) sum(is.na(x)))
customers <- customers[complete.cases(customers),]


#===================================================================

# Build a logistic regression model to predict customer churn using predictor
# variables (you determine which ones to include), then calculate the Pseudo R2
# for the logistic regression.

customers <- customers %>%
 mutate(Churn=if_else(Churn=="No", 0, 1))


str(customers)
         
regression1 <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + TechSupport + MultipleLines + InternetService, data=customers, family="binomial")


# Calculate the Pseudo R2 for the logistic regression.


regression1 %>%
  PseudoR2()

#  Split data into 70% train and 30% test datasets.
#  Train the same logistic regression on only the "train" data.


set.seed(645)

customer_split <- initial_split(customers, prop=0.7)
train_customers <- training(customer_split)
test_customers <- testing(customer_split)

# Train the same logistic regression on only "train" data.

regression_train <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + TechSupport + MultipleLines + InternetService, data=train_customers, family="binomial")
#regression_test <- glm(Churn ~ Contract + MonthlyCharges + TotalCharges + tenure + TechSupport, data=test_customers)



#  For the "test" data, make predictions using the logistic regression trained above.
#  With a cutoff of 0.5, predict a binary classification ("Yes" or "No"),
#  then create a confusion matrix using the prediction result.

str(regression_train)
prediction <- regression_train %>%
  predict(newdata = test_customers, type = "response")
 
 


# With the cutoff of 0.5, predict a binary classification ("Yes" or "No") based on the cutoff value, 


#train_customers <- train_customers %>%
#  mutate(Churn=if_else(Churn=="0", "No", "Yes"))
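A minimal sketch of the cutoff step described in the comments above, using the prediction vector and caret's confusionMatrix() (the 0/1 coding follows the earlier mutate of Churn):

# Convert predicted probabilities to a 0/1 class at the 0.5 cutoff and tabulate against actual churn
predicted_class_logit <- if_else(prediction > 0.5, 1, 0)
confusionMatrix(as.factor(predicted_class_logit), as.factor(test_customers$Churn), positive = "1")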



The next code block creates the decision tree.


train_cust_dtree <- rpart(Churn ~ ., data=train_customers, method = "class")
rpart.plot(train_cust_dtree, cex=0.8)

To check the sensitivity and specificity of the classification tree, we create a confusion matrix and ROC charts. ROC charts are receiver operating characteristic curves that show the diagnostic ability of a binary classifier as its discrimination threshold is varied.

set.seed(1304)

train_cust_dtree_over <- ovun.sample(Churn ~., data=train_customers, method="over", p = 0.5)$data
train_cust_dtree_under <- ovun.sample(Churn ~., data=train_customers,  method="under", p=0.5)$data
train_cust_dtree_both <- ovun.sample(Churn ~., data=train_customers, method="both", p=0.5)$data

table(train_customers$Churn)
table(train_cust_dtree_over$Churn)
table(train_cust_dtree_under$Churn)
table(train_cust_dtree_both$Churn)

train_cust_dtree_over_A <- rpart(Churn ~ ., data = train_cust_dtree_over, method="class")
rpart.plot(train_cust_dtree_over_A, cex=0.8)

customers_dtree_prob <- train_cust_dtree_over_A %>%
  predict(newdata = test_customers, type = "prob")

#  Create a confusion matrix using the prediction result.


head(customers_dtree_prob)

table(customers_dtree_prob)

customers_dtree_class <- train_cust_dtree_over_A %>%
  predict(newdata = test_customers, type = "class")

table(customers_dtree_class)

test_customers <- test_customers %>%
    mutate(.fitted = customers_dtree_prob[, 2]) %>%
    mutate(predicted_class = customers_dtree_class)


confusionMatrix(as.factor(test_customers$predicted_class), as.factor(test_customers$Churn), positive = "1")
#===================================================================


#  Based on the prediction results above, draw a ROC curve and calculate AUC. 


roc <- roc(test_customers, x=.fitted, class=Churn, pos_class=1, neg_class=0) 

plot(roc)
auc(roc)

plot(roc) +
    geom_line(data = roc, color = "red") +
    geom_abline(slope = 1) +
    labs(title = "ROC Curve for Classification Tree")