
kNN stands for k-Nearest Neighbors, a supervised learning algorithm. It can be used for classification as well as for regression problems.

How does the kNN algorithm work?

kNN decides the class of a new data point by majority vote: the new point is assigned the class that the largest number of its nearest neighbors belong to.

If the neighbors of a new data point are as follows, NY: 7, NJ: 0, IN: 4, then the class of the new data point will be NY.
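The voting step can be sketched in a few lines of plain JavaScript. This is a hypothetical helper for illustration only, not part of any library:

```javascript
// Given the class labels of the k nearest neighbors,
// return the label that occurs most often (the majority vote).
function majorityVote(neighborLabels) {
  const counts = {};
  for (const label of neighborLabels) {
    counts[label] = (counts[label] || 0) + 1;
  }
  // Pick the label with the highest count.
  return Object.keys(counts).reduce((best, label) =>
    counts[label] > counts[best] ? label : best
  );
}

// 7 neighbors in NY, 4 in IN -> the new point is classified as NY.
console.log(majorityVote(['NY', 'NY', 'NY', 'NY', 'NY', 'NY', 'NY', 'IN', 'IN', 'IN', 'IN']));
// -> NY
```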

Let's say you work in a post office, and your job is to organize and distribute letters among the postmen to minimize the number of trips to the different neighborhoods. Since we are only imagining things, we can assume there are only seven different neighborhoods. This is a kind of classification problem: you need to divide the letters into classes, where the classes here are the Upper East Side, Downtown Manhattan, and so on.

If you like wasting time and resources, you can give each postman one letter from each neighborhood, and hope that they meet in the same neighborhood and discover your corrupt plan. That is the worst distribution you could achieve.

On the other hand, you can organize the letters based on which addresses are closest to each other.

You could start with "If it's within three blocks, give it to the same postman". That number of closest blocks is where k comes from. You can keep increasing the number of blocks until you reach an efficient layout; that is the most efficient value of k for your classification problem.

kNN in practice - Code

As we did in the last tutorial, we are going to use the KNN module from ml.js to train our k-Nearest-Neighbors classifier. Every machine learning problem needs data, and we are going to use the Iris dataset in this tutorial.

The Iris dataset consists of petal and sepal measurements for 3 different types of iris (Setosa, Versicolor and Virginica), along with a field indicating the respective type.

Install the libraries

$ yarn add ml-knn csvtojson prompt

Or if you prefer npm:

npm install ml-knn csvtojson prompt
  • ml-knn: k-Nearest Neighbors
  • csvtojson: To parse the data
  • prompt: To allow the user to request predictions

Initialize the library and load the data

The Iris dataset is provided by the University of California, Irvine and is available here. However, due to the way it is organized, you will have to copy the content in the browser (Select All, Copy) and paste it into a file called iris.csv. (You can name it whatever you want, as long as the extension is .csv.)

Now initialize the library and load the data.

const KNN = require('ml-knn');
const csv = require('csvtojson');
const prompt = require('prompt');

let knn;

const csvFilePath = 'iris.csv'; // Data
const names = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'type']; // For the header

let seperationSize; // To separate training and test data

let data = [], X = [], y = [];
let trainingSetX = [], trainingSetY = [], testSetX = [], testSetY = [];

The names of the headings are used for display and understanding. They will be removed later.

Further, seperationSize is used to split the data into training and test sets.

We have imported the csvtojson package, and now we are going to use its fromFile method to load the data. (Since our data does not have a header row, we provide our own header names.)

csv({noheader: true, headers: names})
  .fromFile(csvFilePath)
  .on('json', (jsonObj) => {
    data.push(jsonObj); // Push each object to data array
  })
  .on('done', (error) => {
    seperationSize = 0.7 * data.length;
    data = shuffleArray(data);
    dressData();
  });

We push each row to the data array, and when the process finishes, we set seperationSize to 0.7 times the number of samples in our dataset. Note that if the training set is too small, the classifier may not perform as well as it would with a larger one.

Since our dataset is ordered by type (console.log it to confirm), the shuffleArray function is used to shuffle the dataset before splitting. (If you don't shuffle, you may end up with a model that works well for the first two classes but fails on the third.)

This is how it is defined.

/**
 * https://stackoverflow.com/a/12646864
 * Randomize array elements in place.
 * Using the Durstenfeld shuffle algorithm.
 */
function shuffleArray(array) {
  for (var i = array.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var temp = array[i];
    array[i] = array[j];
    array[j] = temp;
  }
  return array;
}

Dress Data (once again)

Our data is organized as follows:

{sepalLength: '5.1', sepalWidth: '3.5', petalLength: '1.4', petalWidth: '0.2', type: 'Iris-setosa'}

There are two things we need to do with our data before handing it to the kNN classifier:

  • Convert the string values to floats. (parseFloat)
  • Convert the type to numbered classes.
function dressData() {
  /**
   * There are three different types of Iris flowers that this dataset
   * classifies:
   *
   * 1. Iris Setosa (Iris-setosa)
   * 2. Iris Versicolor (Iris-versicolor)
   * 3. Iris Virginica (Iris-virginica)
   *
   * Let's change these classes from Strings to numbers,
   * such that a type value of
   * 0 means setosa,
   * 1 means versicolor, and
   * 2 means virginica.
   */
  let types = new Set(); // To gather UNIQUE classes

  data.forEach((row) => {
    types.add(row.type);
  });

  const typesArray = [...types]; // To record the different types of classes

  data.forEach((row) => {
    let rowArray, typeNumber;

    rowArray = Object.keys(row).map(key => parseFloat(row[key])).slice(0, 4);
    typeNumber = typesArray.indexOf(row.type); // Convert type (String) to type (Number)

    X.push(rowArray);
    y.push(typeNumber);
  });

  trainingSetX = X.slice(0, seperationSize);
  trainingSetY = y.slice(0, seperationSize);
  testSetX = X.slice(seperationSize);
  testSetY = y.slice(seperationSize);

  train();
}

If you are not familiar with Sets, they are like their mathematical counterparts in that they cannot have duplicate elements, and their elements do not have an index. (Unlike Arrays.)

And they can be easily converted to Arrays using the spread operator or using the Set constructor.
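For example, here is a quick sketch you can run in Node to see the deduplication and the index lookup used for class numbering (the labels are the Iris class names from this tutorial):

```javascript
// A Set drops duplicates automatically; spreading it into an array
// gives the distinct classes, in insertion order.
const labels = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'];
const types = new Set(labels);
const typesArray = [...types]; // spread operator -> Array

console.log(typesArray);
// -> [ 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica' ]

// indexOf then maps each class name to its numeric label.
console.log(typesArray.indexOf('Iris-virginica'));
// -> 2
```

Array.from(types) works just as well as the spread operator.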

Train your model and then test it

function train() {
  knn = new KNN(trainingSetX, trainingSetY, { k: 7 });
  test();
}

The KNN constructor takes two required arguments: the input data (petal length, petal width, and so on) and the actual classes (Iris-setosa, and so on). It also takes an optional options parameter, a plain JS object that can be passed to adjust the internal parameters of the algorithm. We are passing the value of k as an option; the default value of k is 5.

Now that our model has been trained, let's see how it performs on the test set. We are primarily interested in the number of misclassifications that occur, that is, the number of times the model predicts one class when the input actually belongs to another.

function test() {
  const result = knn.predict(testSetX);
  const testSetLength = testSetX.length;
  const predictionError = error(result, testSetY);
  console.log(`Test Set Size = ${testSetLength} and number of Misclassifications = ${predictionError}`);
  predict();
}

The error is calculated as follows: we use a humble for loop over the dataset and check whether the predicted output differs from the actual output. Each mismatch counts as a misclassification.

function error(predicted, expected) {
  let misclassifications = 0;
  for (var index = 0; index < predicted.length; index++) {
    if (predicted[index] !== expected[index]) {
      misclassifications++;
    }
  }
  return misclassifications;
}

(Optional) Start predicting

Time for some prompts and predictions.

function predict() {
  let temp = [];
  prompt.start();

  prompt.get(['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], function (err, result) {
    if (!err) {
      for (var key in result) {
        temp.push(parseFloat(result[key]));
      }
      console.log(`With ${temp} - type = ${knn.predict(temp)}`);
    }
  });
}

Feel free to skip this step if you don't want to test the model on new input.

All finished!

If you followed all the steps, this is what your index.js should look like:

const KNN = require('ml-knn');
const csv = require('csvtojson');
const prompt = require('prompt');

let knn;

const csvFilePath = 'iris.csv'; // Data
const names = ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth', 'type']; // For the header

let seperationSize; // To separate training and test data

let data = [], X = [], y = [];
let trainingSetX = [], trainingSetY = [], testSetX = [], testSetY = [];

csv({noheader: true, headers: names})
  .fromFile(csvFilePath)
  .on('json', (jsonObj) => {
    data.push(jsonObj); // Push each object into data array
  })
  .on('done', (error) => {
    seperationSize = 0.7 * data.length;
    data = shuffleArray(data);
    dressData();
  });

function dressData() {
  let types = new Set(); // To gather UNIQUE classes

  data.forEach((row) => {
    types.add(row.type);
  });

  const typesArray = [...types]; // To record the different types of classes

  data.forEach((row) => {
    let rowArray, typeNumber;

    rowArray = Object.keys(row).map(key => parseFloat(row[key])).slice(0, 4);
    typeNumber = typesArray.indexOf(row.type); // Convert type (String) to type (Number)

    X.push(rowArray);
    y.push(typeNumber);
  });

  trainingSetX = X.slice(0, seperationSize);
  trainingSetY = y.slice(0, seperationSize);
  testSetX = X.slice(seperationSize);
  testSetY = y.slice(seperationSize);

  train();
}

function train() {
  knn = new KNN(trainingSetX, trainingSetY, { k: 7 });
  test();
}

function test() {
  const result = knn.predict(testSetX);
  const testSetLength = testSetX.length;
  const predictionError = error(result, testSetY);
  console.log(`Test Set Size = ${testSetLength} and number of Misclassifications = ${predictionError}`);
  predict();
}

function error(predicted, expected) {
  let misclassifications = 0;
  for (var index = 0; index < predicted.length; index++) {
    if (predicted[index] !== expected[index]) {
      misclassifications++;
    }
  }
  return misclassifications;
}

function predict() {
  let temp = [];
  prompt.start();

  prompt.get(['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'], function (err, result) {
    if (!err) {
      for (var key in result) {
        temp.push(parseFloat(result[key]));
      }
      console.log(`With ${temp} - type = ${knn.predict(temp)}`);
    }
  });
}

/**
 * https://stackoverflow.com/a/12646864
 * Randomize array element order in-place.
 * Using the Durstenfeld shuffle algorithm.
 */
function shuffleArray(array) {
  for (var i = array.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var temp = array[i];
    array[i] = array[j];
    array[j] = temp;
  }
  return array;
}

Run node index.js. It should show you something like this:

$ node index.js
Test Set Size = 45 and number of Misclassifications = 2
prompt: Sepal Length: 1.7
prompt: Sepal Width: 2.5
prompt: Petal Length: 0.5
prompt: Petal Width: 3.4
With 1.7,2.5,0.5,3.4 - type = 2

Well done. That's your kNN algorithm at work.

A huge aspect of the kNN algorithm is the value of k, which is called a hyperparameter. Hyperparameters are parameters that cannot be learned directly from the regular training process. They express "higher-level" properties of the model, such as its complexity or how quickly it should learn.
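One common way to tune this hyperparameter is to sweep several values of k and keep the one with the fewest misclassifications on the held-out set. The sketch below uses a tiny brute-force kNN on toy two-dimensional data (not ml-knn and not the Iris set) purely to illustrate the loop:

```javascript
// Brute-force kNN prediction for a single point:
// sort training points by Euclidean distance, take the k closest,
// and return the majority class among them.
function knnPredict(trainX, trainY, point, k) {
  const nearest = trainX
    .map((row, i) => ({
      dist: Math.hypot(...row.map((v, d) => v - point[d])),
      label: trainY[i],
    }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k);

  const counts = {};
  nearest.forEach(({ label }) => { counts[label] = (counts[label] || 0) + 1; });
  return Number(Object.keys(counts).reduce((a, b) => (counts[a] >= counts[b] ? a : b)));
}

// Toy 2-class data: class 0 clusters near (0,0), class 1 near (5,5).
const trainX = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]];
const trainY = [0, 0, 0, 1, 1, 1];
const testX = [[0.5, 0.5], [5.5, 5.5]];
const testY = [0, 1];

// Sweep candidate values of k and report the error for each.
for (const k of [1, 3, 5]) {
  let misclassifications = 0;
  testX.forEach((point, i) => {
    if (knnPredict(trainX, trainY, point, k) !== testY[i]) misclassifications++;
  });
  console.log(`k = ${k}, misclassifications = ${misclassifications}`);
}
```

In practice you would run a sweep like this against the dataset's own test split (or, better, with cross-validation) and pick the k that minimizes the error.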
