ISYE 6501 Introduction to Analytics Modeling
Homework 2
Georgia Institute of Technology
ISYE6501 Homework 2
Clear environment
rm(list = ls())
Question 3.1
Using the same data set (credit_card_data.txt or credit_card_data-headers.txt) as in Question 2.2, use the
ksvm or kknn function to find a good classifier: (a) using cross-validation (do this for the k-nearest-neighbors
model; SVM is optional); and (b) splitting the data into training, validation, and test data sets (pick either
KNN or SVM; the other is optional).
3.1 (a) Answer -
#Load the kernlab and kknn libraries (kknn provides the kknn function) and set the seed value.
#I have used k-fold cross-validation. In this method the data set is divided into k subsets (folds). In each
round the model is trained on a known data set (the training data) and then tested against data it has not
seen (the test data). K-fold cross-validation is a procedure for estimating how well the model will perform on
new data. For this data set I take k = 10, a commonly used value. This divides the data into 10 folds, of
which k-1 (9) are used for training and 1 is used for testing, and I fit the model for this combination.
I then fit the model for the other 9 combinations, each time holding out a different fold for testing and
training on the remaining k-1 folds. This ensures that every data point is used for both training and
testing. After all 10 rounds there are 10 models, each with its own test accuracy. I then compare the
accuracy of the 10 models to identify the best-performing one; the average across the folds is what estimates
performance on new data.
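The fold bookkeeping described above can be sketched in base R. This is only an illustration under stated assumptions: the row count of 654 is inferred from the 457 + 98 + 99 split reported in 3.1 (b), the seed is arbitrary, and the names fold_id, test_rows and train_rows are hypothetical, not taken from the original code.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 654      # assumed row count (457 + 98 + 99 from part (b))
k <- 10       # number of folds

# Give every row a fold label 1..k, then shuffle the labels
fold_id <- sample(rep(seq_len(k), length.out = n))

# In round i, rows labelled i are held out for testing; all other rows train the model
test_rows  <- lapply(seq_len(k), function(i) which(fold_id == i))
train_rows <- lapply(seq_len(k), function(i) which(fold_id != i))

fold_sizes <- lengths(test_rows)
print(fold_sizes)   # each fold holds 65 or 66 rows
```

Because each row carries exactly one fold label, every row is tested once and used for training in the other nine rounds.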
First, create 10 partitions of the data and store them in a matrix. Setting times = 1 means the data is split
once, selecting 90% of the rows. I have used the sapply function instead of an explicit loop because it is
more concise and avoids managing a result vector by hand. This gives 10 error values, one for each partition.
With this approach, the lowest error occurs in the 7th value, and the mean error is 0.2508625.
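A minimal sketch of the per-fold error calculation with sapply follows. It is not the original code: the data here is a small synthetic stand-in for the credit card file, the folds are assigned with base R's sample rather than whatever partitioning helper the original used, and the neighbor count of 7 and the helper name fold_error are assumptions made for illustration. Its numbers will therefore not match the 0.2508625 reported above.

```r
library(kknn)

set.seed(1)   # arbitrary seed
n <- 200
# Synthetic stand-in for the credit card data: two predictors, 0/1 response
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- as.integer(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0)

k_folds <- 10
fold_id <- sample(rep(seq_len(k_folds), length.out = n))

# Error rate on held-out fold i, using a kknn model built from the other nine folds
fold_error <- function(i, k_neighbors = 7) {
  train <- dat[fold_id != i, ]
  test  <- dat[fold_id == i, ]
  fit   <- kknn(y ~ ., train = train, test = test, k = k_neighbors, scale = TRUE)
  pred  <- as.integer(fitted(fit) >= 0.5)   # numeric response, so threshold at 0.5
  mean(pred != test$y)
}

errors <- sapply(seq_len(k_folds), fold_error)
print(errors)               # one error rate per fold
print(which.min(errors))    # index of the fold with the lowest error
print(mean(errors))         # cross-validated error estimate
```

The mean of the ten values is the figure to quote as the cross-validated error; the single lowest fold error tends to be optimistic because it reflects one favorable split.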
3.1 (b) Answer -
My approach is to first divide the data into training, validation, and test sets. I put 70% of the data into
the training set and 15% each into the validation and test sets. This gives 457 observations in the training
set, 98 in the validation set, and 99 in the test set.
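One way to reproduce a 70/15/15 split with these set sizes is sketched below in base R. The total of 654 rows is the sum of the three counts above; the seed, the use of floor for rounding, and the index names are assumptions, since the original splitting code is not shown.

```r
set.seed(1)   # arbitrary seed
n <- 654      # 457 + 98 + 99 observations

idx <- sample(n)   # random permutation of the row indices

n_train <- floor(0.70 * n)   # 457
n_valid <- floor(0.15 * n)   # 98

train_idx <- idx[1:n_train]
valid_idx <- idx[(n_train + 1):(n_train + n_valid)]
test_idx  <- idx[(n_train + n_valid + 1):n]   # the remaining 99 rows

print(c(train = length(train_idx), valid = length(valid_idx), test = length(test_idx)))
```

These index vectors can then subset the data frame: the training rows fit candidate models, the validation rows are used to choose among them, and the test rows are used once to report the final estimate.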