# Back to the Basics: How Does Machine Learning Actually Work?

Machine learning is a hot topic with many businesses investing in the technology, but often the businesses investing in this space don’t have a solid understanding of the basics of machine learning, which can lead to poor results.

Let’s remedy this problem by explaining machine learning in simple terms:

### Why Machine Learning?

In traditional software engineering, a problem is decomposed into smaller problems, and each one is solved using brute-force techniques with hard-coded rules. By hard-coded rules, I mean that for each case of inputs, a new block of logic (code) must be written to handle it.

As a working example throughout this post, let’s say we’re trying to determine whether a given email is spam, as Gmail does. Imagine that we’re using the number of all-caps words in the email as input. A traditional software solution might include writing an if statement like the one in figure 1. The problem with this approach is that you’d have to re-code your solution whenever spammers’ tactics change, for example if they catch on to the all-caps logic and start using misspelled title-case words instead.

```python
def predict_spam(count_all_caps_words):
    if count_all_caps_words > 15:
        spam_prediction = True
    else:
        spam_prediction = False
    return spam_prediction
```

Figure 1: A simple function to predict if an email is spam or not.

### What Logic Should We Use?

A machine learning solution to the same problem would implement a learning algorithm that receives the same data as the function in figure 1 but, rather than relying on hard-coded rules, learns the logic automatically. When the spammers change tactics, we would simply run the training algorithm again on newer data instead of re-coding the rules.

So, what exactly is a learning algorithm? Put simply, a learning algorithm is just a block of code that learns a mapping between inputs and outputs. Throughout this post, we’ll use logistic regression as our example of a learning algorithm. Logistic regression takes a weighted sum of its inputs and passes it through a special function called a sigmoid (also known as the logistic function), which produces numbers between 0 and 1 that can be read as probabilities. The weights start as small, random numbers, which obviously won’t produce good results on their own. Therefore, we’ll need to train the weights so that the algorithm returns a high number when it’s given a spam email and a low number otherwise. To do this, we’ll use a learning procedure called gradient descent.

Figure 2: Logistic regression decision boundary. X1 could be the number of all-caps words in the email, and X2 could be the number of misspelled words in the email. Pluses are spam emails and minuses are non-spam emails. The dashed line represents the decision boundary learned by logistic regression; for example, new emails that fall on the right-hand side of the boundary will be classified as spam.
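To make the weighted sum and the sigmoid concrete, here is a minimal Python sketch. The function names, the example email, and the weight values are all made up for illustration; a real model would arrive at its weights through training.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_spam_probability(features, weights, bias):
    """Weighted sum of the inputs, passed through the sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(weighted_sum)


# Hypothetical email: 20 all-caps words and 3 misspelled words.
email = [20, 3]
# Made-up values standing in for the small, random starting weights.
weights = [0.01, -0.02]
bias = 0.0

probability = predict_spam_probability(email, weights, bias)
```

With untrained weights like these, `probability` is just some number between 0 and 1; it only becomes meaningful once the weights have been trained.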

### How to Learn from Data?

To change our weights from small, random numbers to ones that produce the results we want, we need a way to update them so that our output is close to our ideal output.  Therefore, in addition to the inputs to our logistic regression algorithm, we’ll need ideal outputs, such as 1’s for spam emails and 0’s for non-spam emails.  We use this strategy so that our algorithm will learn to output a number as close as possible to 1 for spam, and a number as close as possible to 0 for non-spam.

For the first round of training, we’ll give logistic regression our inputs; it’ll use its random weights to produce a weighted sum of the inputs, then pass the result through the logistic (sigmoid) function to produce a number. Let’s assume we pass in a spam email, but since we still have random weights, our algorithm outputs 0.44. Since we know the input is a spam email, we’d like our algorithm to produce something closer to 1. We can use a measure of how wrong we are (here, 1 - 0.44 = 0.56; the function that computes this measure is often called a cost function) to update our weights in the right direction.
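Here is a small Python sketch of that error measure. The `absolute_error` function mirrors the 1 - 0.44 calculation; `log_loss` is a different cost function, cross-entropy, that is commonly paired with logistic regression in practice. Both function names are illustrative, and this post doesn’t commit to a particular cost function, so treat the second one as an assumption.

```python
import math


def absolute_error(ideal_output, prediction):
    """How far the prediction is from the ideal output (1 for spam, 0 otherwise)."""
    return abs(ideal_output - prediction)


def log_loss(ideal_output, prediction):
    """Cross-entropy cost for one example; confident wrong answers cost the most."""
    if ideal_output == 1:
        return -math.log(prediction)
    return -math.log(1.0 - prediction)


# The spam email from the text: ideal output 1, actual output 0.44.
error = absolute_error(1, 0.44)  # about 0.56, as in the text
loss = log_loss(1, 0.44)         # about 0.82
```

Either way, the cost shrinks as the prediction for a spam email moves toward 1, which is exactly what training aims for.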

This is where calculus comes in: we’ll need to take a partial derivative of our cost function with respect to each of our weights. A derivative simply measures the rate of change, in this case how quickly the cost changes as a weight changes. This will give us a positive or negative number, and we’ll use the sign of this number to know whether to make our weights bigger or smaller. Put simply, if I make a weight a little bigger, does my output get better? What if I make a weight a little smaller? We’ll run this procedure over and over, usually until the weights stop changing much (often called convergence). When we’re done, our weights will be vastly different from the original random ones. This process is called gradient descent.

$$\text{Prediction} = \frac{1}{1 + e^{-(W_1 \cdot \text{number\_of\_all\_caps\_words} + W_2 \cdot \text{number\_of\_misspelled\_words} + \text{bias})}}$$

Figure 3: The logistic regression formula.  The weights we update during training are W1, W2 and bias.
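Putting the pieces together, the sketch below trains the two weights and the bias with gradient descent. It rests on several assumptions that aren’t in this post: it uses log loss (cross-entropy) as the cost function, for which the partial derivative with respect to each weight works out to (prediction - label) × input; the eight emails are made-up data; and the learning rate and number of rounds are arbitrary choices that suit this tiny data set.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias):
    """The logistic regression formula: sigmoid of the weighted sum."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def average_cost(emails, labels, weights, bias):
    """Mean log loss over the data set (lower is better)."""
    total = 0.0
    for features, label in zip(emails, labels):
        p = predict(features, weights, bias)
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(emails)


def train(emails, labels, learning_rate=0.01, rounds=1000, seed=0):
    rng = random.Random(seed)
    # Start from small, random weights.
    weights = [rng.uniform(-0.01, 0.01) for _ in emails[0]]
    bias = 0.0
    n = len(emails)
    for _ in range(rounds):
        # Partial derivatives of the average cost for each weight and the bias.
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(emails, labels):
            error = predict(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x / n
            grad_b += error / n
        # Move each parameter a small step against its derivative.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Made-up data: [all-caps words, misspelled words]; label 1 = spam, 0 = not spam.
emails = [[20, 8], [25, 5], [18, 10], [30, 12], [1, 0], [3, 1], [0, 2], [5, 1]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(emails, labels)
```

After training, `predict([30, 12], weights, bias)` should come out higher than `predict([1, 0], weights, bias)`. When the spammers change tactics, we would call `train` again on newer emails rather than rewriting any rules.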

### Why Data is So Important

Now, we have a logistic regression function that returns a number close to 1 if the input email is likely spam and a number close to 0 if not. That’s a great start, but when we try our algorithm on new data, it may not perform as well as it did during training. This is why it’s important to include a test set in our procedure.

Ideally, we should randomly split our available data into a training set and a test set:  we should run our training procedure on the training set and evaluate it on the test set.  The test set performance should be a proxy for how our algorithm might perform when analyzing emails it’s never seen before.
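A random split can be sketched in a few lines of Python. The helper names, the 80/20 split, and the example data are assumptions for illustration; libraries such as scikit-learn offer their own version of this step. The `rule` function is only a stand-in for a trained model, reusing the hard-coded check from figure 1.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict_fn, examples):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(1 for features, label in examples if predict_fn(features) == label)
    return correct / len(examples)


# Made-up (features, label) pairs: [all-caps words, misspelled words], 1 = spam.
examples = [([count, count // 3], 1 if count > 15 else 0) for count in range(0, 40, 2)]
train_set, test_set = train_test_split(examples)


def rule(features):
    # Stand-in for a trained model: the hard-coded check from figure 1.
    return 1 if features[0] > 15 else 0


test_accuracy = accuracy(rule, test_set)
```

The important part is that the test emails are set aside before training and are only used for evaluation afterwards.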

If our performance on the training set is much better than our performance on the test set, our algorithm is overfitting, or too complex for the task. To fix this, we need to make our algorithm simpler by running fewer training rounds, using fewer features, or regularizing our algorithm. If our error is high on both the training set and the test set, we are underfitting, meaning our algorithm is too simple. To remedy this, we could run more training rounds, use more features, or reduce the regularization on our algorithm.

Another way to improve our algorithm’s performance is to get more data, as much of it as possible. Even simple algorithms can outperform more complex ones if they are trained on vastly more data, because more examples expose the algorithm to more of the patterns it needs to learn.

Figure 4: Plot showing the effect more data has on the performance of different models (alternatives to logistic regression). Note that the performance of each one is greatly improved as more data is added. The X axis is the amount of data (millions of words). The Y axis is performance (test accuracy).

Now, we have a logistic regression algorithm that has been trained on plenty of data and evaluated on a test set. Once the algorithm’s performance is satisfactory, we’ll be ready to move it into production and reduce the time wasted deleting spam emails.

Want to learn more about how machine learning can be applied? Check out our blog, “3 Machine Learning Trends That Will Benefit Predictive Analytics,” to learn about a few of the machine learning trends we’re most excited about.