Final Report

Data Analytics Using R

(Classification, Clustering & Association)

Submitted to- Prof. R.K. Jena

Submitted by- Group:4 (ABCD)

Pragya Mishra- 2017111033

Abhinav Jain- 2017132131

Abhiroop Dey Sarkar- 2017133004

Kaira Dhanda- 2017141084

1. CLASSIFICATION :

Introduction

Classification predicts class labels: it is a type of prediction in which we predict the group an observation belongs to. It constructs a model from a training set and then uses that model to classify new data (the testing set). It is a form of supervised learning in which the class definitions and labels are clearly specified.

Typical applications of classification include:

1. Approval of credit by banks

2. Marketing based on particular target

3. Diagnosis in medical cases

4. To check the effectiveness of prescribed treatment and its analysis

The goal of classification algorithms is to place items into specific categories and answer questions such as: Is the diagnosed tumor cancerous? Is the received email spam? What is the chance that a loan applicant defaults? Which demographic does a particular online customer fall into?

Here are some examples where classification can be used:

• In a hospital, some newly admitted patients are in the emergency room. A record is maintained for each by collecting 17 measures (e.g. age, blood pressure, diabetes, etc.). A decision has to be taken on whether to put a patient in the ICU. Because ICU care is expensive, patients who are likely to survive more than a month are given higher priority. The problem is to predict high-risk and low-risk patients, i.e. to find how to differentiate between them.

• A credit card company typically receives a plethora of applications for new cards. The problem is to categorize the applications into those with good credit, bad credit, or a grey area (thus requiring further human analysis). The decision is made on the basis of information provided about several attributes, such as annual salary, outstanding debts, age group, etc.

2. Classification Techniques (Decision Tree, SVM, NB, Random Forest, Logistic Regression etc.)

A. Decision Tree in R

Decision trees use a flowchart-like structure for making decisions. Because these structures are easy to understand, we can use them where transparency is needed, such as in banks for loan approval.

It is a supervised learning algorithm used for classification problems, and it works with both categorical and continuous input and output variables. In this technique, the population is split into two or more homogeneous sets based on the most significant splitter/differentiator among the input variables. The decision tree algorithm is a powerful non-linear classifier. It models the relationships among the features and the potential outcomes using a tree structure: a structure of branching decisions.

In classifying data, the decision tree follows the steps mentioned below:

• It puts all the training examples at the root.

• The training examples are then divided based on selected attributes.

• The attributes are selected using statistical measures.

• Recursive partitioning continues until a stopping criterion is met (e.g. all examples at a node belong to the same class).
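The steps above can be sketched in R with the rpart package (which this report's case-study code also loads). The example below uses R's built-in iris data set purely for illustration, not the report's data:

```r
# Illustrative sketch: a classification tree on the built-in iris data.
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))  # 70/30 train-test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a tree predicting Species from all other attributes
fit <- rpart(Species ~ ., data = train, method = "class")

# Classify the held-out examples and measure accuracy
pred <- predict(fit, test, type = "class")
acc  <- mean(pred == test$Species)
```

The fitted tree can be drawn with rpart.plot() from the rpart.plot package, which the case-study code below also loads.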

Important terminologies related to Decision Tree

• Root Node: It represents the entire population or sample, which gets divided into two or more homogeneous sets.

• Splitting: The process of dividing a node into two or more sub-nodes.

• Decision Node: A node is called a decision node when it splits into further sub-nodes.

• Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

• Pruning: Removing sub-nodes of a decision node is called pruning; it is the opposite of splitting.

• Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

• Parent and Child Node:

A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.

Types of Decision Tree-

a. Categorical (Classification) Variable Decision Tree: a decision tree with a categorical target variable.

b. Continuous (Regression) Variable Decision Tree: a decision tree with a continuous target variable.

Advantages of Decision Tree in R

• Easy to Understand:

No statistical knowledge is required to read and interpret them. The graphical representation is very intuitive, and users can easily relate it to their hypotheses.

• Less data cleaning required: It needs less data preparation than some other modeling techniques.

• It can handle both numerical and categorical variables.

• It handles nonlinearity.

• It is possible to validate the model using statistical tests, which gives confidence that it will work on new data sets.

• It performs well even if you slightly deviate from assumptions.

Disadvantages of R Decision Tree

• Overfitting: This is one of the most practical difficulties for decision tree models. It can be addressed by setting constraints on the model parameters and by pruning.

• Not well suited to continuous variables: when the tree bins continuous numerical variables into categories, it loses information.

B. Naïve Bayes classification

Naive Bayes uses principles from the field of statistics to make predictions. It applies Bayes' theorem, combining prior knowledge with current evidence:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B independently of each other, P(A|B) is the probability of A conditional on B, and P(B|A) is the probability of B conditional on A.
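As a small hedged illustration, the e1071 package (already loaded in this report's case-study code) provides naiveBayes(), which estimates class priors and class-conditional distributions from the training data and applies Bayes' theorem at prediction time. The iris data set here is only an example, not the report's data:

```r
# Illustrative sketch: Naive Bayes classification with e1071 on iris.
library(e1071)

set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Estimate priors P(class) and conditionals P(feature | class) from the training set
nb <- naiveBayes(Species ~ ., data = train)

# Predict the most probable class for each held-out example
pred <- predict(nb, test)
acc  <- mean(pred == test$Species)
```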

C. Support Vector Machines (SVM)

SVM is used to find the optimal hyperplane (a line in 2D, a plane in 3D, and a hyperplane in more than 3 dimensions) that maximizes the margin between the two classes. Support vectors are the observations that support the hyperplane on either side.

It solves a linear optimization problem to find the hyperplane with the largest margin. For instances that are not linearly separable, the "kernel trick" is used.

Terminologies related to R SVM

• Hyper plane-

It is a line in 2D and a plane in 3D. In higher dimensions (more than 3D), it is called a hyperplane. SVM finds the hyperplane that best separates the two classes.

• Margin-

The distance between the hyperplane and the closest data point is called the margin. Doubling this distance gives the total separation between the two classes.

How to find the optimal hyper plane?

First, select two hyperplanes that separate the data with no points between them, then maximize the distance between these two hyperplanes. That distance is the margin.

• Kernel-

It is a method that allows SVM to handle non-linearly separable data points. A kernel function transforms the data into a higher-dimensional feature space in which a linear separation becomes possible.

Working of SVM –

a. Choose an optimal hyperplane that maximizes the margin.

b. Apply a penalty for misclassifications (the cost 'c' tuning parameter).

c. If the data points are not linearly separable, transform the data to a high-dimensional space where it is easier to classify with linear decision surfaces (the kernel trick).
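The three steps above can be sketched with svm() from the e1071 package; the radial (RBF) kernel performs the non-linear transformation of step (c), and the cost argument is the penalty of step (b). The iris data is again only an illustrative stand-in:

```r
# Illustrative sketch: an RBF-kernel SVM with e1071 on iris.
library(e1071)

set.seed(7)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# kernel = "radial" applies the kernel trick; cost penalizes misclassifications
model <- svm(Species ~ ., data = train, kernel = "radial", cost = 1)

pred <- predict(model, test)
acc  <- mean(pred == test$Species)
```

Tuning cost (and the kernel's gamma) with tune.svm() from the same package is a common next step.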

Advantages of SVM in R

• With the kernel trick, it performs very well on non-linearly separable data.

• It works well in high-dimensional spaces, e.g. for text or image classification.

• It does not suffer from the multicollinearity problem.

Disadvantages of R SVM

• It takes more time on large-sized data sets.

• It does not return probability estimates.

• On linearly separable data, it behaves much like logistic regression.

When the dependent (target) variable is continuous, Support Vector Regression can be used instead. The aim of SVM regression is the same as in the classification problem: to find the largest margin.

D. Random Forests

Random Forests are similar to the well-known ensemble technique called bagging, but with an extra tweak. In Random Forests, the idea is to decorrelate the several trees generated from different bootstrapped samples of the training data, and then reduce the variance in the trees by averaging them.

Averaging the trees reduces the variance, improves the performance of the decision trees on the test set, and eventually helps avoid overfitting.

The idea is to build many trees in such a way that the correlation between the trees is small.

Another major difference is that we consider only a random subset of m predictors each time we make a split on the training examples, whereas ordinary trees consider all the predictors at each split and choose the best among them. Typically m ≈ √p, where p is the number of predictors.

It may seem wasteful to throw away so many predictors, but it works because each tree then uses different predictors to split the data at various points. Two trees generated on the same training data will have randomly different variables selected at each split; this is how the trees get decorrelated and become largely independent of each other. Another great property of Random Forests and bagging is that we can keep adding more big, bushy trees without hurting performance, because in the end we simply average them, which reduces the variance by a factor of the number of trees.
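A minimal sketch with the randomForest package (assumed to be installed; it is not loaded elsewhere in this report) follows, again on the illustrative iris data. For classification, the package's default for the number of predictors tried at each split is floor(√p), matching the m ≈ √p rule above:

```r
# Illustrative sketch: a Random Forest of 500 bootstrapped trees on iris.
library(randomForest)

set.seed(123)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

p  <- ncol(train) - 1                      # number of predictors
rf <- randomForest(Species ~ ., data = train,
                   ntree = 500,            # number of bootstrapped trees to average
                   mtry  = floor(sqrt(p))) # predictors considered per split, m ≈ √p

pred <- predict(rf, test)
acc  <- mean(pred == test$Species)
```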

E. Logistic Regression

Logistic regression involves fitting a curve to numeric data to make predictions about binary events, and it is arguably one of the most widely used machine learning methods. This classification algorithm predicts a binary outcome (1/0, Yes/No, True/False) from a set of independent variables. To represent a binary/categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression for a categorical outcome variable, where the log of the odds is used as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting the data to a logit function.

The fundamental equation of generalized linear model is:

g(E(y)) = β0 + β1x1 + β2x2

Here, g() is the link function, E(y) is the expectation of the target variable, and β0 + β1x1 + β2x2 is the linear predictor (β0, β1, β2 are to be estimated). The role of the link function is to 'link' the expectation of y to the linear predictor.

In logistic regression, we are only concerned with the probability of the outcome (dependent) variable.
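In base R, glm() with family = binomial fits exactly this model, using the logit as the link function g(). A hedged sketch on iris, with a made-up binary outcome (whether a flower is of species virginica) chosen purely for illustration:

```r
# Illustrative sketch: logistic regression via glm() on iris.
dat <- iris
dat$is_virginica <- as.integer(dat$Species == "virginica")  # binary 1/0 outcome

# family = binomial uses the logit link: log-odds = b0 + b1*x1 + b2*x2
fit <- glm(is_virginica ~ Sepal.Length + Petal.Width,
           data = dat, family = binomial)

# type = "response" returns the fitted probability P(outcome = 1)
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
acc  <- mean(pred == dat$is_virginica)
```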

3. Case Study Implementation:

i. Data Set Information:

This database (Absenteeism at work) was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil, and was used in academic research at the Universidade Nove de Julho – Postgraduate Program in Informatics and Knowledge Management. The data set allows for several new combinations of attributes and attribute exclusions, or modification of the attribute type (categorical, integer, or real), depending on the purpose of the research.

Attribute Information:

A. Individual identification (ID) – Categorical

B. Reason for absence (ICD). – Categorical

Absences attested by the International Code of Diseases (ICD) stratified into 21 categories as follows:

1. Certain infectious and parasitic diseases

2. Neoplasms

3. Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism

4. Endocrine, nutritional and metabolic diseases

5. Mental and behavioral disorders

6. Diseases of the nervous system

7. Diseases of the eye and adnexa

8. Diseases of the ear and mastoid process

9. Diseases of the circulatory system

10. Diseases of the respiratory system

11. Diseases of the digestive system

12. Diseases of the skin and subcutaneous tissue

13. Diseases of the musculoskeletal system and connective tissue

14. Diseases of the genitourinary system

15. Pregnancy, childbirth and the puerperium

16. Certain conditions originating in the perinatal period

17. Congenital malformations, deformations and chromosomal abnormalities

18. Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified

19. Injury, poisoning and certain other consequences of external causes

20. External causes of morbidity and mortality

21. Factors influencing health status and contact with health services.

A few other categories that do not come under the International Code of Diseases:

22. Patient follow-up

23. Medical consultation

24. Blood donation

25. Laboratory examination

26. Unjustified absence

27. Physiotherapy

28. Dental consultation

C. Month of absence- Real

D. Day of the week – Categorical

Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6)

E. Seasons- Categorical

Summer (1), Autumn (2), Winter (3), Spring (4)

F. Transportation expense- Real

G. Distance from Residence to Work (kilometers)- Real

H. Service time- Integer

I. Age- Integer

J. Work load Average/day – Real

K. Hit target- Real

L. Disciplinary failure (yes=1; no=0)- Categorical

M. Education (high school (1), graduate (2),

postgraduate (3), master and doctor (4))- Categorical

N. Children (number of children)- Real

O. Social drinker (yes=1; no=0)- Categorical

P. Social smoker (yes=1; no=0)- Categorical

Q. Pet (number of pet)- Real

R. Weight- Real

S. Height -Real

T. Body mass index- Real

U. Absenteeism time in hours (target)- Real

ii. Objective of the study-

To find out the cases of high absenteeism and low absenteeism in the company and the most possible reasons associated with it.

iii. R- Code

library(e1071)

library(rpart)

library(rpart.plot)

#Loading the Data

absent=read.csv(file.choose())

str(absent)

dim(absent)

names(absent)

head(absent)

sum(complete.cases(absent))

sapply(absent,function(x) sum(is.na(x)))

absent$Work.load.Average.day=as.numeric(absent$Work.load.Average.day)

summary( absent$Absenteeism.time.in.hours)

a1=mean(absent$Absenteeism.time.in.hours)

#label above-average absenteeism as 1 (high) and the rest as 2 (low)

absent$Absenteeism.time.in.hours=ifelse(absent$Absenteeism.time.in.hours>a1,1,2)

str(absent)

absent$Absenteeism.time.in.hours=as.factor(absent$Absenteeism.time.in.hours)

str(absent)

dim(absent)

names(absent)

head(absent)

absenteeism=absent[,-1] #drop the ID column

names(absenteeism)

dim(absenteeism)

str(absenteeism)

str(absenteeism$Absenteeism.time.in.hours)

res=cor(absenteeism[,1:19]) #cor b/w independent variables

round(res,2)

#removing highly correlated independent variables

absenteeism=absenteeism[,-c(7,8,17,19)]

names(absenteeism)

absenteeism$Absenteeism.time.in.hours=as.numeric(absenteeism$Absenteeism.time.in.hours)

cor(absenteeism[,1:15],absenteeism$Absenteeism.time.in.hours)

#deleting less correlated values

absenteeism=absenteeism[,-c(2,4:8,10:15)]

str(absenteeism)

absenteeism$Absenteeism.time.in.hours=factor(absenteeism$Absenteeism.time.in.hours,levels=c(1,2),labels=c("high absenteeism","low absenteeism"))

str(absenteeism)

summary(absenteeism)

#Dividing the data

set.seed(1234)

d = sort(sample(nrow(absenteeism), nrow(absenteeism)*.7))

length(d)

train