Classification Using the K-Nearest Neighbors Classifier with Scikit-Learn
K-NN is a very simple machine learning algorithm that classifies a point based on its nearest neighboring points. Let's take an example; look at the image below.
We have set k = 3, which means we will classify a point based on its three nearest points. In this case, two of the three points are orange, so the unknown point (the blue point) will be classified as an orange point.
OK, let's do a real classification task. In this example I am going to classify the famous Iris flower data set. This data set contains samples of three different types of Iris flower: Versicolor, Virginica, and Setosa, and I am going to build a model that is capable of classifying a new flower into one of these categories.
Let's code.
Step 1 - Import the necessary libraries and load the data (I am going to make use of Scikit-Learn's built-in Iris data set)
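The post's original code block is not reproduced in this text, so here is a minimal sketch of what this step looks like; the variable name iris_dataset matches the one referred to below.

import numpy as np
from sklearn.datasets import load_iris

# Load the built-in Iris data set as a dictionary-like Bunch object.
iris_dataset = load_iris()

# Inspect the keys of the loaded object.
print(iris_dataset.keys())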
Let's understand the data set. It consists of 150 records (50 records for each Iris type), each of which contains the flower's sepal length, sepal width, petal length, petal width, and the label/class.
If you look at the "iris_dataset" variable above, it is a dictionary-like Python object (a scikit-learn Bunch) whose main key-value pairs are:
1. DESCR - Description of the dataset
2. data - Feature values for all 150 records (sepal length, sepal width, petal length, petal width)
3. feature_names - the names of the features
4. target - classes of all the 150 records
5. target_names - the names of the classes/flowers.
But among these, I am only interested in data and target, in other words, the features and the labels.
Step 2 - Prepare the data set
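Again, the embedded code block is not reproduced here, so the following is a sketch of the preparation step explained below, assuming the iris_dataset variable from Step 1 (the intermediate names features, labels, and data are my own choices).

import numpy as np

# Get the features and labels into NumPy arrays.
features = np.array(iris_dataset['data'])
labels = np.array(iris_dataset['target'])

# Concatenate features and labels into one array, so that each
# row keeps its label, then shuffle the rows in place.
labels_column = labels.reshape(-1, 1)
data = np.concatenate((features, labels_column), axis=1)
np.random.shuffle(data)

# Split the shuffled array back into features (X) and labels (y).
X = data[:, :-1]
y = data[:, -1]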
Let's break down the code above.
The first step is obvious: getting the features and labels into variables. (Python has no built-in multi-dimensional array type, so we use NumPy arrays.)
Before discussing the second part, let's discuss why we need this step. In our data set, the labels and features are stored separately, which is convenient in a way, because in the end we will need them separated to train the model.
But the real problem is that, as you can see in the image above, the records in the data set are grouped by class: the first 50 records belong to class 0 (Setosa), the next 50 to class 1 (Versicolor), and the final 50 to class 2 (Virginica). If we split this ordered data into training and test sets, some classes could be badly under-represented in one of them, producing a biased model, which is not what we want. So we need to shuffle the data set first.
In order to shuffle the data set, we first concatenate the separated features and labels into one array, so that each label stays attached to its features, and then we shuffle that array.
In the final step, we split the features and labels apart again and store them in the X and y variables. That's it, we have finished preparing our data set.
Step 3 - Let's train and test our model.
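As with the earlier steps, the original code block is not shown in this text; here is a minimal sketch of what Step 3 looks like, assuming the X and y variables from Step 2.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 80% of the data for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# k = 5: classify each point based on its 5 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy of the model on the unseen test set.
print(model.score(X_test, y_test))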
1. First, we separate our data set into two parts: one will be used to train the model and the other one will be used to test it. The training-to-testing ratio can be anything, but it is common to use 80% of the data for training and 20% for testing the accuracy of the model. This can be done very easily with train_test_split from sklearn.model_selection.
2. After that, we create the KNeighborsClassifier model and fit (train) it on the training data. An important thing to note here is that I have passed an argument to the classifier. What is that? It is nothing but the K value I talked about earlier. In this case I have set it to 5 (the default value for this parameter is also five, but for the sake of explanation I specified it explicitly), which means the model will classify a point based on its 5 nearest neighbors. We have to be careful when choosing the value of K, so let's discuss why I chose 5 here (there is also a short sanity-check sketch after this list).
- Our data set has 3 different classes, so we cannot choose k = 3, because there is a possibility that each class gets exactly one of the three nearest points, resulting in a tie.
- What about k = 4? Not only four: using any even number is not recommended here, because there is a fairly high possibility that two classes end up with the same number of nearest points.
- We can pick any odd number from 5 onwards; since our data set is small, I chose 5.
3. Finally, we can either predict the class of a new feature set or find out how accurate our model is using the test set. In this case I am only interested in finding the accuracy of the model, so I used the score function of the classifier object and passed it the test set.
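As a side note, and not part of the original walkthrough, a quick way to sanity-check the choice of k is to sweep a few values and compare test accuracies; this sketch assumes the train/test split from above.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical sanity check: compare test accuracy for several k values.
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print("k =", k, "accuracy =", knn.score(X_test, y_test))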
That's it. Let's run and see the results.
Cool, I got an average accuracy of 0.97 (it varies slightly from run to run because of the shuffle), which is really good.
Enjoy coding 😉
The source code for this post is available here: https://github.com/CharlesRajendran/Blog/blob/master/MachineLearning/K_NN/k-nearest-iris.py