(Machine)Learning = Representation + Evaluation + Optimization

Anni Sapountzi
6 min read · Feb 4, 2018

A learning algorithm is an algorithm that learns the unknown model parameters from patterns in the data. But how are data linked to models? Generally, a learning algorithm can be viewed from three different angles, i.e., components.

In this post we briefly discuss the different components of the learning algorithms involved in a machine learning process. Let’s look at each of them separately.

The three components of every learning algorithm

1. Representation: what model do you choose to represent the data?

Find the mapping (i.e., function) that relates the input to the output f(x) = y. The function f maps data from the input space x to some output values y.

Note that f isn’t necessarily assumed to have a specific functional form (i.e., the model may be non-parametric).

We have to think about what the variables are, how they are structured in space, whether they take discrete or continuous values, and whether they are nodes in a directed graph or elements of vectors.

1.1: What is the hypothesis space (i.e., the set of candidate functions f that could have generated your data)? The function f is chosen from a hypothesis space, the set of all hypotheses that the learner can possibly consider. Let’s say we are trying to find the shape (e.g., a line in linear regression) that fits (represents) the data.

1.2: Decide on the model that will structure the problem if you don’t assume a specific functional form:

  • decision theory in RL algorithms,
  • proximity in KNN,
  • rules/split points in decision trees etc.

Representations include models of states (e.g., dynamic Bayesian networks, Markov decision processes), models of instances (e.g., KNN, SVM), rules (e.g., decision trees), hyperplanes (e.g., Naive Bayes, logistic regression, linear regression), or artificial neural networks (a combination of the previous). That distinction is not strict, and representations can be combined.
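To make the idea of an instance-based representation a bit more concrete, here is a minimal k-nearest-neighbours sketch. The function name `knn_predict`, the toy points, the Euclidean distance and k = 3 are all assumptions made for this illustration; the post does not prescribe any of them.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored instance.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest instances and let them vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small clusters in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.05, 0.05)))
```

Here the "model" is simply the stored data plus a notion of proximity: no functional form for f is assumed.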

Feature engineering is a representation problem that turns the inputs x into ‘things’ the algorithm can understand. How should x be represented in a feature space or a state space? A good set of features brings you closer to the true underlying structure of the problem that you are trying to learn. If you follow the thought ‘I’ll include many features in my model so as to provide it with all the important information about the problem’, you will probably end up with an overfitted model.

Feature engineering, adapted from Machine Learning Crash course.

It’s recommended to incorporate domain knowledge into feature engineering. Handling overfitting and feature engineering are rich topics in their own right and are studied in far greater depth in the machine learning community. We only touch on them briefly here.
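As a small sketch of what "turning inputs into things the algorithm can understand" can look like, the snippet below maps a raw record to a numeric feature vector using a one-hot encoding for a categorical field. The field names (`num_rooms`, `street`), the vocabulary and the helper names are hypothetical, chosen only for illustration.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def to_feature_vector(record, street_vocabulary):
    """Turn a raw record (a dict) into a flat list of floats."""
    features = [float(record["num_rooms"])]                    # numeric value used as-is
    features += one_hot(record["street"], street_vocabulary)   # categorical -> one-hot
    return features

streets = ["Main St", "Oak Ave", "Pine Rd"]    # hypothetical vocabulary
house = {"num_rooms": 3, "street": "Oak Ave"}  # hypothetical raw input

print(to_feature_vector(house, streets))
```

Every extra category adds a dimension, which is one simple way to see how "include everything" quickly inflates the feature space.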

The system where the analysis is performed (e.g., distributed or parallel systems) is related to the choice of representation, as the latter determines how well one can decompose the data set into smaller components so that analysis can be performed independently on each component.

Bias vs Variance Trade-off

There are modeling steps that help you find out whether you have an incorrect description of the problem that you are solving with the algorithm. These steps are revisited repeatedly during the learning process. They include:

  1. Compare the bias (underfitting) and variance (overfitting) of the model.
  2. Balance the bias and variance of the model.
  3. Analyze the performance of the model.

The balance step prompts you to tune the model’s parameters (in practice, its hyperparameters) via the bootstrap (usually used by statisticians), cross-validation, or the hold-out method to prevent overfitting. In cross-validation, all observations are used for both training and validation, and each observation is used for validation exactly once. K-fold is a cross-validation technique that breaks the data into k equal-sized subsamples (folds): the model is trained on k-1 folds, the error is recorded on the remaining fold, and the k errors are averaged.
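The k-fold procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for a library implementation: the helper names `k_fold_indices` and `cross_validate`, the fixed seed, and the choice of a straight-line model fitted with `np.polyfit` are assumptions made for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Split the shuffled indices 0..n_samples-1 into k folds of near-equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(x, y, k=5):
    """Average validation MSE of a least-squares line over k folds."""
    folds = k_fold_indices(len(x), k)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[val_idx] + intercept
        # Record the error on the held-out fold.
        errors.append(np.mean((y[val_idx] - predictions) ** 2))
    return float(np.mean(errors))
```

Each index lands in exactly one validation fold, so every observation is validated on exactly once and trained on k-1 times.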

Overfitting: your model latches onto random patterns (noise) instead of the true process that generated the data set. Although an overfitted model predicts well on the current data set, it won’t generalize to new data sets.
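A quick way to see this is to fit polynomials of different degrees to noisy data that was actually generated by a line. The sketch below assumes a made-up data-generating process (y = 2x + 1 plus Gaussian noise), a fixed seed, and the hypothetical helper name `train_and_test_mse`.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    """The (assumed) true process behind the data: a straight line."""
    return 2.0 * x + 1.0

def train_and_test_mse(degree, seed=0):
    """Fit a polynomial of the given degree to noisy linear data; return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, 12)
    x_test = np.linspace(0.04, 0.96, 12)  # unseen points from the same range
    y_train = true_f(x_train) + rng.normal(0.0, 0.2, x_train.size)
    y_test = true_f(x_test) + rng.normal(0.0, 0.2, x_test.size)
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse

for degree in (1, 9):
    print(degree, train_and_test_mse(degree))
```

The degree-9 fit can never have a larger training error than the line, because the line is a special case of it. Its error on the unseen points is typically, though not necessarily, worse, which is the gap that overfitting refers to.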

2. Evaluation: evaluate the loss function given different values for the parameters

The evaluation function is also called an objective, scoring, utility, or reward function. A loss function is an objective that is to be minimized, whereas a utility or reward function is one that is to be maximized. The function may be different for the internal algorithm (e.g., log loss) and the external model (e.g., the area under the ROC curve, AUC). The evaluation function of the external model is typically called a performance metric. The function being learned depends on the representation you choose, including the output type of the function:

  1. Classification: outputs discrete values (class labels), usually with the log (likelihood) loss as the objective. The model may also output class probabilities, in which case a cross-entropy loss (another name for the log loss) is used.
  2. Regression: outputs continuous values, usually with the mean squared error (least squares) as the objective.
  3. Density estimation: outputs a probability density function of the data.
  4. Reinforcement learning: in the simplest form the output is a scalar value and the objective function is the value function.
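The two most common losses from the list above can be written out directly. This is a plain-Python sketch; the function names and the clipping constant `eps` (used to avoid taking log of 0) are choices made for this example.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy: y_true holds 0/1 labels, p_pred the predicted P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(log_loss([1, 0], [0.9, 0.1]))
```

Both are quantities to be minimized: confident and correct probabilities give a log loss close to zero, while confident and wrong ones are penalized heavily.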

Maximum likelihood estimation (MLE) translates to the loss function in machine learning. In statistics, we almost always use maximum likelihood estimates, while in machine learning we usually phrase the problem differently: we minimize the squared error, although theoretically we could also maximize the likelihood. It turns out that, when the noise is assumed to be Gaussian, finding the optimal parameters with MLE gives the same result as minimizing the squared loss function.
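The link between MLE and the squared loss can be written out in a few lines. The derivation below assumes independent observations with additive Gaussian noise of constant variance; the symbols (f_θ for the model, σ² for the noise variance, n for the number of observations) are introduced here for the sketch and are not notation from the post.

```latex
\begin{aligned}
y_i &= f_\theta(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
\log L(\theta) &= \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right] \\
&= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 \\
\hat{\theta}_{\mathrm{MLE}} &= \arg\max_\theta \log L(\theta)
    = \arg\min_\theta \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2
\end{aligned}
```

The first term does not depend on θ and the factor 1/(2σ²) is a positive constant, so maximizing the log-likelihood and minimizing the sum of squared errors select the same parameters. With a different noise model the equivalence no longer holds in this form.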

Probabilistic and qualitative evaluation metrics, taken from a Knowledge Tracing literature review paper

3. Optimization: a (solver) function that searches for the parameter values that optimize the loss function. It estimates the parameters and selects those values that result in the lowest error (minimum) or the highest reward (maximum).

Combinatorial optimization (e.g., local search, greedy search), unconstrained continuous optimization (e.g., gradient descent), and constrained continuous optimization (e.g., linear programming) are the main types of optimization methods. For an intro to optimization and a visualized taxonomy of optimization methods, see here.

Analytic vs Numerical methods

For some objectives, the optimal parameters can be found exactly, which is known as the analytic (closed-form) solution. For other objectives, the parameters need to be approximated with a computational algorithm. To illustrate this with an example, an analytic method for a linear regression model is to solve the least squares (normal) equations, whilst a numerical method is to approximate the optimal values iteratively, e.g., by computing the maximum likelihood estimates with gradient descent.
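The two routes can be compared side by side on a toy linear regression. In this sketch the function names, the learning rate, the number of steps and the noise-free data are assumptions chosen so that the example stays short and converges.

```python
import numpy as np

def fit_analytic(X, y):
    """Least squares via the normal equations: solve (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, learning_rate=0.1, steps=5000):
    """Approximate the same minimizer by gradient descent on the mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of mean((X w - y)^2) with respect to w.
        gradient = (2.0 / len(y)) * (X.T @ (X @ w - y))
        w -= learning_rate * gradient
    return w

# Toy data (hypothetical): y = 1 + 2x, with a column of ones for the intercept.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

print(fit_analytic(X, y))
print(fit_gradient_descent(X, y))
```

The analytic route gives the answer in one linear solve, while gradient descent only approaches it step by step. The iterative route is the one that still works when no closed form exists, as in logistic regression.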

Gradient descent optimization method for finding the optima of f(x)

Some Notes:

Bayesian networks != Naive Bayes, as the latter assumes that the features are independent of each other given the class.

Probabilistic graphical models != artificial neural nets, as the latter assume some nodes to be non-linear computational units. In a hidden layer, each node applies a sigmoid, a hyperbolic tangent, or a rectified linear unit (ReLU) nonlinearity to its inputs from the previous layer. The nodes are usually stacked in at least three layers: an input layer, one or more hidden layers, and an output layer.

Below, you see a fairly complicated representation of variables as states. The data are linked to a Bayesian network model, which belongs to a family of models called probabilistic graphical models. It combines Bayes’ theorem, as used in decision theory, with network theory. The goal of a Bayesian model is to estimate the posterior probability distribution of the output given some prior knowledge or beliefs about the data values (i.e., the prior distribution) and the likelihood of observing these data (i.e., the likelihood function). Network theory uses nodes and edges to represent the relationships between the input and output data.

Bayesian Network Representation: the joint distribution is the product of P(node|parents(node)) over all nodes, i.e., the marginal distributions of the variables without parents and the conditional distributions of the dependent variables (children) given their parents. In this case, the joint probability distribution of the input data and the hidden (latent) random variables is all we care about. Each layer includes latent random variables (or events, in the language of probability) that are connected to each other by edges, which represent the conditional dependencies.
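A tiny network makes the factorization and the posterior computation tangible. The sketch below uses three binary variables (rain and sprinkler as parents, wet grass as their child). The structure, the variable names and all probability values are hypothetical, picked only for illustration.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_RAIN = {True: 0.2, False: 0.8}        # prior of a parent node
P_SPRINKLER = {True: 0.1, False: 0.9}   # prior of the other parent node
P_WET_GIVEN = {                         # P(wet = True | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = P(rain) * P(sprinkler) * P(wet | rain, sprinkler)."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)

def posterior_rain_given_wet():
    """P(rain = True | wet = True), obtained by summing the joint over the sprinkler."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence

print(posterior_rain_given_wet())
```

With these made-up numbers, observing wet grass raises the probability of rain from the prior of 0.2 to about 0.74, which is Bayes’ theorem at work on top of the graph structure.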

The post draws ideas from the paper: A few useful things to know about machine learning which you can find online.

Did this article help you classify the parts that compose a learning algorithm?
