R project for building a k-NN model: All answers, including the answer to the last question, should be output by your R script. A simulated data set called blueOrangeIn.csv accompanies this question. The data set has two continuous feature variables, X1 and X2, and a categorical response variable Y with two values, “Blue” and “Orange”. The original data is drawn from random points in a 3 × 3 chessboard colored alternately blue and orange, with some noise and inaccuracy injected into it.
3a) Using the read.csv command, read this data into a data frame called Q3dat. Print the head and tail of the Q3dat data frame to make sure it was read correctly.
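A minimal sketch of the read step, assuming blueOrangeIn.csv sits in the working directory and has columns named X1, X2 and Y. So that the snippet runs on its own, it falls back to a small simulated stand-in file (hypothetical data, not the course data set) when the real file is missing.

```r
# Path to the data; assumed to be in the working directory.
csv_path <- "blueOrangeIn.csv"

# Fallback so the sketch runs without the course file: a tiny simulated
# 3 x 3 chessboard (hypothetical stand-in, not the real data).
if (!file.exists(csv_path)) {
  set.seed(1)
  sim <- data.frame(X1 = runif(200, 0, 3), X2 = runif(200, 0, 3))
  sim$Y <- ifelse((floor(sim$X1) + floor(sim$X2)) %% 2 == 0, "Blue", "Orange")
  csv_path <- tempfile(fileext = ".csv")
  write.csv(sim, csv_path, row.names = FALSE)
}

# stringsAsFactors = TRUE turns Y into a factor, which kknn needs in order
# to treat the problem as classification (R >= 4.0 reads strings as
# character by default).
Q3dat <- read.csv(csv_path, stringsAsFactors = TRUE)

print(head(Q3dat))
print(tail(Q3dat))
str(Q3dat)
```

The str() call is an extra check that Y was read as a factor with the two expected levels.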
3b) Using the kknn package, build six models for k = 1, 10, 100, 1000, 2500, 3500. For the test set, create a 40 × 40 grid by taking 40 equally spaced values across the range of each of X1 and X2. The 1600 new points will form the test set, which should also be used for graphing the results in the following questions. (Suggestion: Start with only a small portion of the data, say a random subset of 500 rows. Write your program for that small set. Once you are sure it works, run it on the full set. Each run on the full set may take several minutes.)
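The grid construction and the model fits could be sketched as follows. The Q3dat built here is a simulated 500-row stand-in (hypothetical, not the course data), the names x1_seq, x2_seq, grid, models and preds are illustrative, and kernel = "rectangular" (plain unweighted voting) is an assumption, since the question does not name a kernel.

```r
library(kknn)

# Hypothetical stand-in for Q3dat: 500 points on a 3 x 3 chessboard.
set.seed(1)
n <- 500
Q3dat <- data.frame(X1 = runif(n, 0, 3), X2 = runif(n, 0, 3))
Q3dat$Y <- factor(ifelse((floor(Q3dat$X1) + floor(Q3dat$X2)) %% 2 == 0,
                         "Blue", "Orange"))

# 40 equally spaced values across the observed range of each feature.
x1_seq <- seq(min(Q3dat$X1), max(Q3dat$X1), length.out = 40)
x2_seq <- seq(min(Q3dat$X2), max(Q3dat$X2), length.out = 40)

# All 40 * 40 = 1600 combinations; X1 varies fastest.
grid <- expand.grid(X1 = x1_seq, X2 = x2_seq)

k_values <- c(1, 10, 100, 1000, 2500, 3500)
# A model cannot use more neighbours than there are training rows, so on
# a small subset the largest k values are dropped.
k_values <- k_values[k_values < nrow(Q3dat)]

# One kknn fit per k, each predicting on the grid.
models <- lapply(k_values, function(k) {
  kknn(Y ~ X1 + X2, train = Q3dat, test = grid, k = k,
       kernel = "rectangular")
})
names(models) <- paste0("k", k_values)

# Predicted class of every grid point under each model.
preds <- lapply(models, fitted)
print(sapply(preds, table))
```

On the 500-row subset only k = 1, 10 and 100 survive the filter; the full data set is needed before the three larger values can be fitted.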
3c) For each value of k mentioned in question 3b), create a graph by coloring each test set point orange or blue based on the predicted value. Also draw the boundary between orange and blue points. You may use the knn.r file posted in the lecture notes as your template.
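One possible way to draw such a graph with base graphics is sketched below; it does not reproduce the knn.r template. The helper plot_grid_pred is hypothetical, and the predictions passed to it are a stand-in (the exact chessboard rule) rather than the output of a fitted model.

```r
# Hypothetical helper: colour each grid point by its predicted class and
# draw the boundary between the two classes.
plot_grid_pred <- function(x1_seq, x2_seq, pred, main = "") {
  grid <- expand.grid(X1 = x1_seq, X2 = x2_seq)
  cols <- ifelse(pred == "Orange", "orange", "blue")
  plot(grid$X1, grid$X2, col = cols, pch = 20, cex = 0.6,
       xlab = "X1", ylab = "X2", main = main)
  # 1 where the prediction is Orange, 0 where it is Blue. expand.grid
  # varies X1 fastest, so the rows of z follow x1_seq.
  z <- matrix(as.numeric(pred == "Orange"), nrow = length(x1_seq))
  # The 0.5 contour separates the 0 region from the 1 region.
  contour(x1_seq, x2_seq, z, levels = 0.5, add = TRUE,
          drawlabels = FALSE, lwd = 2)
  invisible(z)
}

# Stand-in predictions: the noise-free chessboard rule on a 40 x 40 grid.
x1_seq <- seq(0, 3, length.out = 40)
x2_seq <- seq(0, 3, length.out = 40)
grid <- expand.grid(X1 = x1_seq, X2 = x2_seq)
pred <- factor(ifelse((floor(grid$X1) + floor(grid$X2)) %% 2 == 0,
                      "Blue", "Orange"))

z <- plot_grid_pred(x1_seq, x2_seq, pred, main = "Stand-in predictions")
```

With a fitted kknn model, pred would instead be fitted(model), and calling par(mfrow = c(2, 3)) beforehand places the six panels on one page.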
3d) Use the validation set technique to find the near-optimal k for the k-NN method for this data. To this end, use values of k = 1 to k = 1991 with jumps of 10, that is, test k = 1, 11, 21, …, 1991. Two functions are available. One is the train.kknn function, which uses leave-one-out cross-validation (the “1-folding” method) to find the best value of k among a list of values given to it. The other is the cv.kknn function, which uses k-fold cross-validation, where you also have to provide the number of folds. For this exercise you will use the 1-folding version, that is, the train.kknn function. Read the documentation of this function, in particular its output and how to extract information from it. Use the Q3dat data frame as your training set. When done, find the best k, that is, the one resulting in the lowest error rate. Also print the corresponding lowest misclassification rate. Use the plot function on the output of the train.kknn function to draw the graph depicting the various k values against their misclassification rates.
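A sketch of the search, assuming an installed kknn version whose train.kknn accepts the ks argument and returns the MISCLASS and best.parameters components. The data is again a simulated stand-in (600 rows with 10% of labels flipped as noise), so the list of k values is trimmed to fit it; with the full data set that trimming line would not remove anything as long as the set has more than 1993 rows. The rectangular kernel is an assumption.

```r
library(kknn)

# Hypothetical stand-in for Q3dat: noisy 3 x 3 chessboard, 600 rows.
set.seed(1)
n <- 600
Q3dat <- data.frame(X1 = runif(n, 0, 3), X2 = runif(n, 0, 3))
Q3dat$Y <- factor(ifelse((floor(Q3dat$X1) + floor(Q3dat$X2)) %% 2 == 0,
                         "Blue", "Orange"))
flip <- sample(n, n / 10)  # flip 10% of the labels as noise
Q3dat$Y[flip] <- ifelse(Q3dat$Y[flip] == "Blue", "Orange", "Blue")

# Candidate values k = 1, 11, 21, ..., 1991.
ks <- seq(1, 1991, by = 10)
# Only matters for the small stand-in: keep k below the number of rows.
ks <- ks[ks < nrow(Q3dat) - 2]

# Leave-one-out cross-validation over all candidate k values.
fit <- train.kknn(Y ~ X1 + X2, data = Q3dat, ks = ks,
                  kernel = "rectangular")

# MISCLASS holds one misclassification rate per k (rows) and kernel
# (columns); best.parameters holds the winning combination.
best_k <- fit$best.parameters$k
best_err <- min(fit$MISCLASS)
cat("Best k:", best_k, "\n")
cat("Lowest misclassification rate:", best_err, "\n")

# Misclassification rate against k.
plot(fit)
```

The value in best_k can then be passed as the k argument of kknn to fit the final model on the 40 × 40 grid.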
3e) Using the optimal value of k, build the k-NN model for the data in parts 3a to 3c of this question. Use the 40 × 40 grid to predict and plot the outcome of applying the optimal k-NN model to this test set.