Machine Learning Foundations

Value Estimations

Notes from the course: Machine Learning and AI Foundations: Value Estimations on LinkedIn Learning

Machine Learning: a process where (e.g.) a computer learns how to do new things without having a human explicitly writing a program for it.

Supervised Learning: a part/subset of machine learning (ML) where (e.g.) a computer learns how to perform a new function by analyzing labeled or annotated training data.

The example the author uses to explain the process of creating, training, testing and evaluating a model is estimating the price of real estate based on some known parameters or attributes (e.g. size, bedrooms, year, location, etc.).


The market price of a house with an area of 3800 sqft and 5 bedrooms is around 450,000.00. We need to model this, so we are able to predict the price of other houses with similar properties.

The formula that “accurately” describes the model would be:

value = base_line_value + (size_in_sqft*weight1) + (number_of_rooms*weight2)

base_line_value = 50,000.00

where $50,000 is the expected minimum value of a house

weight1 = 92

is a known correction factor that takes the surface area of a house into account

weight2 = 10,000

is a known correction factor that takes the number of rooms in a house into account

=> value = 449,600.00

Actually not that far from the estimated market value!
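The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula from these notes; the function name `estimate_value` is made up for illustration.

```python
# Values taken from the example above.
base_line_value = 50_000   # expected minimum value of a house
weight1 = 92               # correction factor per square foot
weight2 = 10_000           # correction factor per room


def estimate_value(size_in_sqft, number_of_rooms):
    """Apply the linear formula from the notes to one house."""
    return base_line_value + size_in_sqft * weight1 + number_of_rooms * weight2


value = estimate_value(size_in_sqft=3800, number_of_rooms=5)
print(value)  # 449600
```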

But we need to generalize this further, and find the value of the weight factors and thus the model that best evaluates the price of a house on the market with a given set of attributes.

For the purpose of illustration, let’s consider we have the following “training” data available:

No. of bedrooms   Size (sqft)   Est. value
      5              3800         400,000
      4              2200         150,000
      2              1150         300,000

By substituting this in the generic equation above, we get the 3 formulas for the value of each house in the training set, that is value1, value2 and value3 and from here, the goal is to find the best possible values for weight1 and weight2 that bring us as close as possible to the estimated values of each of the houses.

As you can imagine, doing this manually for a training data set with thousands of data points is virtually impossible!

The way it’s done on a computer is to take totally random starting values for the weight factors and compute each value, then compare this value with the estimated one from the training data set and adjust the weights accordingly, repeating until the results stop improving!

In mathematical parlance, the process of modeling the value of something with fixed weights is called linear regression, while the (demonstrated) algorithm for estimating the best values for those weights is called gradient descent!

Cost Function: an equation that shows how far off (how wrong) an estimator function is with the current weight factors.

For our given example above, the cost function could be written as:

Example Cost function

From here, it’s easy to write the cost function in a more generic form as:

Generic Cost function

where m is the number of data points in the training set.
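The two cost functions referred to above appeared as images in the original notes. A plausible reconstruction is shown here, assuming the usual mean squared error form. The 1/m averaging is an assumption (some texts divide by 2m instead), value_i stands for the value the model calculates for house i, and estimate_i for its estimated value in the training data.

```latex
% Example: the three houses from the training data
\text{cost} = \frac{(\text{value}_1 - 400{,}000)^2 + (\text{value}_2 - 150{,}000)^2 + (\text{value}_3 - 300{,}000)^2}{3}

% Generic form for m data points
\text{cost} = \frac{1}{m} \sum_{i=1}^{m} \left( \text{value}_i - \text{estimate}_i \right)^2
```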

We square the difference between the estimated and the calculated value for two reasons: positive and negative errors cannot cancel each other out, and large errors are penalized much more heavily than small ones. In other words, we want each estimate to be as little wrong as possible, rather than having one estimate be way off!
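A small sketch of why squaring matters. The error values below are made up for illustration: both sets have the same average absolute error, but the set with one estimate that is way off gets a much higher squared error.

```python
def mean_absolute_error(errors):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Average of the squared errors; big errors dominate."""
    return sum(e ** 2 for e in errors) / len(errors)


# Hypothetical prediction errors in dollars (made up for illustration).
evenly_wrong = [10_000, -10_000, 10_000]   # every estimate slightly off
one_way_off = [0, 0, 30_000]               # two perfect, one far off

print(mean_absolute_error(evenly_wrong), mean_absolute_error(one_way_off))
print(mean_squared_error(evenly_wrong), mean_squared_error(one_way_off))
```

With a squared cost, an optimizer therefore prefers the first situation over the second.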

The Cost Function is called this way by convention and the goal is to minimize it as much as possible!

Gradient Descent: a common mathematical optimization algorithm for finding the weight factors that give the lowest possible value of the cost function. It’s an iterative one – meaning we start with “some” values and “slowly” adjust them while minimizing the error, until we cannot improve any further. By ‘we’, I meant ‘the computer’ 🙂

In summary, the process of value estimation using ML would be:

  1. Create a model i.e. create an equation which (best) describes the problem
  2. Create a (or rather calculate the) cost function to quantify the model error
  3. Use an optimization algorithm (e.g. gradient descent) to find the model parameters (e.g. the correction factors) while minimizing the cost function (ideally 0!)
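The three steps above can be sketched with NumPy on the three training houses from these notes. This is a minimal illustration, not the course's code. It assumes a mean squared error cost, and it standardizes the features first (my own addition) so that a single learning rate works for both weights. The function names, the learning rate and the number of steps are arbitrary choices.

```python
import numpy as np

# Training data from the notes: [size in sqft, bedrooms] -> est. value.
X = np.array([[3800.0, 5.0], [2200.0, 4.0], [1150.0, 2.0]])
y = np.array([400_000.0, 150_000.0, 300_000.0])

# Assumption: standardize the features so one learning rate suits both weights.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)


def predict(X, base, weights):
    """Step 1, the model: value = base + X . weights."""
    return base + X @ weights


def cost(X, y, base, weights):
    """Step 2, the cost function: mean squared error (assumed form)."""
    return np.mean((predict(X, base, weights) - y) ** 2)


def gradient_descent(X, y, learning_rate=0.1, steps=500, seed=0):
    """Step 3: start from random weights and repeatedly step downhill."""
    rng = np.random.default_rng(seed)
    base, weights = 0.0, rng.normal(size=X.shape[1])
    history = [cost(X, y, base, weights)]
    for _ in range(steps):
        error = predict(X, base, weights) - y
        # Partial derivatives of the mean squared error.
        base -= learning_rate * 2 * error.mean()
        weights -= learning_rate * 2 * (X.T @ error) / len(y)
        history.append(cost(X, y, base, weights))
    return base, weights, history


base, weights, history = gradient_descent(X_scaled, y)
print(f"cost before: {history[0]:.3e}, cost after: {history[-1]:.3e}")
```

Note that the learned weights apply to the standardized features, so they are not directly comparable to the 92 and 10,000 used earlier.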

For ML, the Python programming language seems to be the go-to language. For a good reason!

Python comes with modules that support data scientists in their ML endeavors. For example:

  • NumPy – an efficient array handling and linear algebra module/library
  • scikit-learn – a popular ML module/library
  • pandas – a module/library for working with big data sets in a worksheet-like, in-memory representation (the name comes from “panel data”)

What does NumPy bring?

Think of vectors… or arrays. Now think how you would (traditionally) e.g. raise each element of the array to the power of 2? By iterating through all of the elements and raising each element individually to the power of 2.

But modern processors can apply the same operation to several elements at once, using dedicated vector instructions instead of a loop over single elements. Well, kind of.

This is SIMD, or Single Instruction, Multiple Data. You declare an array as a NumPy array, and you can raise each element to the power of 2 with one single expression, which is far more efficient than going through the elements one at a time in Python!

Vectorizing our code: replacing iterative constructs with vectors so that some (mathematical) operations could be executed (almost) in parallel. Such code is far more efficient, or at least for very large data sets it is.
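A short sketch of the difference. Both versions give the same result; whether NumPy actually uses SIMD instructions under the hood depends on how it was built and on the processor, so treat the performance claim as a general tendency.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Iterative version: one element at a time.
squared_loop = []
for v in values:
    squared_loop.append(v ** 2)

# Vectorized version: one expression applied to the whole array.
squared_vectorized = np.array(values) ** 2

print(squared_loop)                 # [1, 4, 9, 16, 25]
print(squared_vectorized.tolist())  # [1, 4, 9, 16, 25]
```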

The basic workflow for training a supervised ML model would be:

  1. get the (all) data you will be working with (upon which you will create your model!)
  2. clean and pre-process the data (almost a given in any circumstance – convert all text to useful numbers, and filter out only the relevant data – i.e. the data that is useful for the model)
  3. avoid patterns in your data by shuffling your data – i.e. change the order in which it was collected
  4. split the data into two groups: (1) 70% training data and (2) 30% test data (for validating the model i.e. checking its accuracy)
  5. set the model’s hyper-parameters (e.g. how fast to learn, the desired complexity of the model, etc.)
  6. train the model (using the 1st data set from step 4)
  7. evaluate the model (against the 2nd data set from step 4)
  8. start using the model outside of the initial data set (from step 4)
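The workflow above could look roughly like this with scikit-learn. The data is synthetic, generated from the example formula in these notes plus random noise, so steps 1 and 2 are trivial here. It assumes the first (70%) group is used for training, as steps 6 and 7 imply. The hyper-parameter values are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Steps 1-2: synthetic, already numeric data (made up for this sketch).
rng = np.random.default_rng(42)
size_sqft = rng.uniform(800, 4000, size=500)
bedrooms = rng.integers(1, 6, size=500)
X = np.column_stack([size_sqft, bedrooms])
y = 50_000 + 92 * size_sqft + 10_000 * bedrooms + rng.normal(0, 5_000, size=500)

# Steps 3-4: shuffle, then split 70% for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

# Step 5: hyper-parameters (values chosen arbitrarily for illustration).
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)

# Step 6: train on the first group.
model.fit(X_train, y_train)

# Step 7: evaluate on the second group.
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test set mean absolute error: {test_mae:.0f}")

# Step 8: use the model on a house it has never seen.
new_house = np.array([[3800, 5]])
print(model.predict(new_house))
```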

Gradient Boosting algorithm

  • models (complex) data & data relationships as trees a.k.a. decision trees
  • uses an ensemble of smaller decision trees (each essentially a smaller ML model) to better estimate a value (better than what a single smaller model could do on its own)
  • iteratively improves predictions by introducing and adding additional decision trees to improve the predictions from previous trees
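The idea in the bullets above can be sketched by hand: each new small tree is trained on what the previous trees still get wrong. This is a simplified version for squared error only, not scikit-learn's actual implementation, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): the value depends non-linearly on one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 5, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())   # start from a constant guess
trees, errors = [], []

for _ in range(20):
    residual = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)      # a small, weak model
    tree.fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add its correction
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print(f"error after 1 tree: {errors[0]:.1f}, after 20 trees: {errors[-1]:.1f}")
```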

Features, or data attributes, are the values given to a prediction model and are derived from the available data set.

Features (X values) → Supervised Learning Model → Value to Predict (Y value(s))

  • capture as many combinations of features as possible
  • at least 10 times more data points than the number of features (one data point = one ‘row’ per given set of features)
  • the more data though – the better!

Feature Engineering is the process of determining which of the existing features correlate strongly with the value we want to predict, or coming up with new features that do. A good amount of time should be allocated to this activity as it’s paramount to the accuracy of our model. Refrain from including useless features! In other words – represent the data in the simplest way possible and do not overdo it!

To accomplish that:

  • add/drop features
  • combine two or more features into one, or
  • apply binning i.e. replacing one numerical feature with a broader category such as (e.g.) TRUE or FALSE
  • use one-hot encoding i.e. representing categorical data as numerical data in a form more suitable for ML (e.g. Is North (1 or 0), Is North-West (1 or 0) and so on)
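Binning and one-hot encoding from the list above could be done with pandas along these lines. The rows and the column names (`garage_sqft`, `direction`, `has_garage`) are hypothetical.

```python
import pandas as pd

# Made-up rows; the column names are hypothetical.
houses = pd.DataFrame({
    "garage_sqft": [0, 250, 400],
    "direction": ["North", "North-West", "North"],
})

# Binning: replace the exact garage size with a yes/no feature.
houses["has_garage"] = houses["garage_sqft"] > 0
houses = houses.drop(columns=["garage_sqft"])

# One-hot encoding: one 0/1 column per category.
encoded = pd.get_dummies(houses, columns=["direction"], dtype=int)
print(encoded)
```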

The Curse of Dimensionality
The more features we add, the more data points (rows) we need to cover the possible combinations of feature values – and that number grows very quickly, most likely exponentially. As a result, use as few features as possible while still accurately modeling the problem.

Mean Absolute Error
A measurement of the average prediction error across your data set.
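As a quick worked example, here is the mean absolute error of the example formula from earlier in these notes (base 50,000, weights 92 and 10,000) on the three training houses, assuming the number of bedrooms is used as the number of rooms.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Estimated values from the training table, and the values the example
# formula calculates for the same three houses.
actual = [400_000, 150_000, 300_000]
predicted = [449_600, 292_400, 175_800]

print(mean_absolute_error(actual, predicted))  # 105400.0
```

On average, the example weights are off by about 105,400 per house on this tiny data set.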

Too complex (overfitting) – the model follows every single data point in the training set (literally memorizing each training data point without figuring out the pattern). In this case, the training set error is very low, while the test set error is very high. To reduce the complexity, we reduce the number of decision trees, or make each decision tree smaller or simpler.

Too simple (underfitting) – the model doesn’t follow the data points closely enough (it doesn’t see the pattern). In this case, both the training and test set errors are very high. This is a sign of a model that is too simple and therefore inaccurate. We may improve this by incorporating more decision trees, or by making them bigger or more complex.

A good fit
Follows the general trend of the data points in the set. In this case, both the training and test set errors are very low.
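One way to spot these cases is to print the training and test errors side by side for different model complexities. A rough sketch on synthetic, noisy data; the two complexity settings are arbitrary extremes picked to make the effect visible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, noisy data (made up for this sketch).
rng = np.random.default_rng(1)
X = rng.uniform(800, 4000, size=(300, 1))
y = 50_000 + 92 * X[:, 0] + rng.normal(0, 20_000, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def errors_for(n_estimators, max_depth):
    """Return (training error, test error) for one model complexity."""
    model = GradientBoostingRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    ).fit(X_train, y_train)
    return (
        mean_absolute_error(y_train, model.predict(X_train)),
        mean_absolute_error(y_test, model.predict(X_test)),
    )


settings = {"simple": (2, 1), "complex": (500, 8)}
results = {label: errors_for(n, depth) for label, (n, depth) in settings.items()}
for label, (train_err, test_err) in results.items():
    print(f"{label:8s} train={train_err:10.0f} test={test_err:10.0f}")
```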

To fine-tune, we can opt to further tweak the hyper-parameters of the model. Since there are so many of them, this can prove to be a never-ending activity. It’s best in this case if we apply a Grid Search approach (basically a brute force one), where we provide a range of values for each hyper-parameter – for all of them (!) – and scikit-learn does the job of automatically testing all possible combinations. Of course, it’s our job to set a meaningful range, or we end up overdoing it again.
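A minimal grid search sketch with scikit-learn's `GridSearchCV`, on synthetic data. The grid is deliberately tiny and its values are arbitrary; a real search would use ranges that make sense for the problem.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (made up for this sketch).
rng = np.random.default_rng(2)
X = rng.uniform(800, 4000, size=(200, 1))
y = 50_000 + 92 * X[:, 0] + rng.normal(0, 10_000, size=200)

# A deliberately small grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence "neg"
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Every extra value multiplies the number of combinations, which is why the range has to stay meaningful.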

Feature Selection
tells us the following (only after we have a trained model!):

  • which features our model uses the most to predict values
  • thus, which features are most likely the most important for our model
  • and which features might be removed as they are seldom used – if ever – to predict values
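Assuming the trained model is a scikit-learn gradient boosting model, this information is available in its `feature_importances_` attribute. A sketch on synthetic data, where the third feature is pure noise on purpose; the feature names are made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (made up): the third feature is pure noise.
rng = np.random.default_rng(3)
feature_names = ["size_sqft", "bedrooms", "random_noise"]
size_sqft = rng.uniform(800, 4000, size=500)
bedrooms = rng.integers(1, 6, size=500)
noise = rng.normal(size=500)
X = np.column_stack([size_sqft, bedrooms, noise])
y = 50_000 + 92 * size_sqft + 10_000 * bedrooms

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances sum to 1; a value near 0 means the feature is rarely used.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {score:.3f}")
```

A feature that scores close to 0 here is a candidate for removal.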

Retrain your model
The data that we initially fed into our model might change over time. In this case, we need to re-train our model with an updated data set – while our code remains the same!