How Does the kNN Algorithm Work? (Cont.)


If X contains categorical data, or variables measured on a nominal or ordinal scale, you may need to compute a weighted distance across the multivariate variables. The next step is to find the K nearest neighbors. A training sample counts as a nearest neighbor if its distance to the query instance is less than or equal to the Kth smallest distance. In other words, we sort the distances from all training samples to the query instance and determine the Kth minimum distance. For every training sample whose distance does not exceed this Kth minimum, we gather its category Y. In MS Excel, the function SMALL(array, K) returns the Kth smallest value in the array. A special case arises in our example: the 4th through the 7th minimum distances happen to be the same, so any K from 4 to 7 yields the same threshold and therefore the same set of neighbors.
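The neighbor selection rule described above can be sketched in Python. This is only an illustration under stated assumptions: the function names kth_smallest and nearest_neighbors are hypothetical, and the distances and labels are made-up values chosen so that the 4th through 7th smallest distances are tied. They are not the data from the tutorial's example.

```python
def kth_smallest(values, k):
    """Return the k-th smallest value (1-based), similar to Excel's SMALL(array, k)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    return sorted(values)[k - 1]


def nearest_neighbors(distances, labels, k):
    """Return the labels of all samples whose distance is <= the k-th smallest distance."""
    threshold = kth_smallest(distances, k)
    return [y for d, y in zip(distances, labels) if d <= threshold]


# Hypothetical distances: the 4th to 7th smallest values are all 5.0.
distances = [1.0, 2.0, 3.0, 5.0, 5.0, 5.0, 5.0, 9.0]
labels = ["+", "-", "+", "-", "+", "+", "-", "-"]

# Because of the tie, K = 4 gathers seven neighbors rather than four.
print(nearest_neighbors(distances, labels, k=4))
```

Because the rule uses "less than or equal to", tied distances are all included, so the number of neighbors can exceed K. With distances computed in floating point, exact ties may not compare as equal, so a small tolerance might be needed in practice.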

The kNN prediction for the query instance is based on a simple majority of the categories of the nearest neighbors. In our example the category is binary, so the majority can be found simply by counting the number of ‘+’ and ‘-’ signs. If there are more plus signs than minus signs, we predict the query instance as plus, and vice versa. If the counts are equal, we can choose arbitrarily or fix one of plus or minus as the default. If Y in your training samples is categorical, take the simple majority among the neighbors' categories. If Y is quantitative, take the average or another measure of central tendency, such as the median or the geometric mean.
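The prediction step can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names predict_category and predict_quantity, the tie_default parameter, and the sample inputs are all assumptions introduced here. Only the arithmetic mean and the median are shown for the quantitative case.

```python
from collections import Counter
from statistics import mean, median


def predict_category(neighbor_labels, tie_default=None):
    """Return the majority label among the neighbors, or tie_default on a tie."""
    if not neighbor_labels:
        raise ValueError("at least one neighbor is required")
    counts = Counter(neighbor_labels).most_common()
    # A tie occurs when the two most frequent labels have equal counts.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_default
    return counts[0][0]


def predict_quantity(neighbor_values, use_median=False):
    """Return the mean (default) or the median of the neighbors' Y values."""
    return median(neighbor_values) if use_median else mean(neighbor_values)


# Hypothetical neighbor categories and values, for illustration only.
print(predict_category(["+", "-", "+"]))            # more plus than minus
print(predict_category(["+", "-"], tie_default="-"))  # tie resolved by the default
print(predict_quantity([1, 2, 6]))
print(predict_quantity([1, 2, 6], use_median=True))
```

Passing an explicit tie_default makes the tie-breaking choice visible to the caller instead of leaving it to the ordering of the input.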