Probability Theory in Machine Learning: An example with exponential distributions
Every day we deal with the uncertainty and randomness that life events present us with. To make rational decisions and form better expectations about future outcomes, we have probability theory.
Although the level of math required of machine learning engineers has decreased (due to the increased exploitation of empirical data), a good command of linear algebra, probability theory, and calculus is still necessary for building an intuition about why models work the way they do, or why they fail.
Let me give you a simple example of how probability distributions played a central role in a machine learning classification problem.
Roadmap:
Part 0: My Bayesian-classification example
With this machine learning problem in mind, I gathered all the truly basic concepts into a page divided into four parts (you can find them below). The terms discussed (briefly) in this example are the following:
- Product distribution
- Optimal Stopping Problem
- Exponential Family
- Conjugate Prior
Part 1: Physical Probabilities
Experiment or Hypothesis || Sample Space & Random Variable || Event
Part 2: Probability Distribution Function
Probability Distribution Table || Probability Mass Function || Probability Density Function || Behavior of a Distribution
Part 3: Evidential Probabilities
Bayesian Probability || Bayesian & Frequentist Interpretation of Probability
Part 4: Fundamental Rules of Probability
Sum || Product
Next Post: Computing Probability Distributions in Python
Although Bayesian statistics are the ones commonly used in machine learning, I thought it worthwhile to compare them with the frequentist interpretation of probability.
Bayesian Setting
We have two independent random variables, one Bernoulli-distributed (with parameter θ ∈ [0, 1]) and one exponentially distributed (with parameter λ), as inputs to simulation experiments. The former represents a sequence of independent, identically distributed indicators of an action's correctness; the latter reflects how quickly the action is completed.
The product distribution of these two defines a third variable that reflects one's mastery of a goal (a set of actions) and also has the interval [0, 1] as its domain. This variable is then fed into an optimal stopping problem whose output is modeled similarly to a coin toss, as described below.
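For a concrete picture, here is a minimal numpy sketch that draws samples from the two input distributions and multiplies them elementwise. The parameter values are placeholders, and the squashing of the completion time into [0, 1] via exp(−t) is my own assumption for illustration; the post does not fix how that mapping is done.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

theta, lam, n = 0.7, 2.0, 10_000                 # hypothetical parameter values

correct = rng.binomial(n=1, p=theta, size=n)     # Bernoulli(theta): action correct or not
times = rng.exponential(scale=1 / lam, size=n)   # Exponential(lambda): completion time

# Assumed mapping of the completion time into [0, 1] (faster -> closer to 1),
# so that the elementwise product also lives in [0, 1] like the mastery variable above.
speed_score = np.exp(-times)
mastery = correct * speed_score

print("mean mastery over the simulated actions:", mastery.mean())
```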
The optimal stopping problem: you wish to maximise the amount you get paid by choosing a stopping rule (optimization). You have a set of i.i.d. actions that you solve repeatedly. Each time, before receiving a new action, the stopping rule lets you choose either to stop, get paid (say, in correct-action points), and progress to the second goal, or to continue practicing the first goal (binary classification).
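As a toy illustration of such a stopping rule (not the exact rule used in this project), the sketch below keeps practicing until the recent fraction of correct actions crosses a hypothetical threshold, then stops and collects the accumulated correct-action points.

```python
import numpy as np

def practice_until_mastery(rng, theta=0.7, threshold=0.8, window=20, max_actions=500):
    """Toy stopping rule; every name and value here is illustrative, not the post's rule."""
    history, points = [], 0
    for _ in range(max_actions):
        outcome = rng.binomial(1, theta)      # was this action solved correctly?
        history.append(outcome)
        points += outcome                     # payment in correct-action points
        # stop once the recent success rate looks good enough, then move to the next goal
        if len(history) >= window and np.mean(history[-window:]) >= threshold:
            break
    return points, len(history)

rng = np.random.default_rng(seed=1)
print(practice_until_mastery(rng))            # (points earned, actions practiced)
```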
The free parameters θ and λ, which shape the behavior of these two distributions (and hence also of the third), are unknown. We know that both distributions belong to the exponential family, which lets us exploit their conjugate priors. Conjugacy gives a closed-form solution (one that can be evaluated in a finite number of operations) for the posterior. In a Bayesian setting, we learn/estimate the parameters as incoming data updates their values via a simple update rule, instead of computing gradients.
If the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions.
The parameters of the Bernoulli and the Exponential will be estimated by drawing them from their conjugate priors, Beta and Gamma, respectively:
p(θ | x): Posterior (Beta) ∝ Bernoulli(n, θ) likelihood × Beta(a, b) prior on θ, where a − 1 counts prior correct actions and b − 1 prior incorrect ones.
p(λ | y): Posterior (Gamma) ∝ Exponential(λ) likelihood × Gamma(α, β) prior on λ, where α counts prior observations whose total time is β.
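In code, both closed-form updates are simple count-and-sum operations. Below is a minimal scipy sketch with placeholder hyperparameters (a, b, α, β) and simulated data: the Beta prior on θ is updated with the counts of correct and incorrect actions, and the Gamma prior on λ with the number of completed actions and their total time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

# --- Beta-Bernoulli: Beta(a, b) prior on theta ----------------------------
a, b = 1.0, 1.0                                    # placeholder hyperparameters
outcomes = rng.binomial(1, 0.7, size=50)           # simulated correctness indicators
a_post = a + outcomes.sum()                        # + number of correct actions
b_post = b + len(outcomes) - outcomes.sum()        # + number of incorrect actions
theta_posterior = stats.beta(a_post, b_post)

# --- Gamma-Exponential: Gamma(alpha, beta) prior on lambda ----------------
alpha, beta = 1.0, 1.0                             # placeholder hyperparameters
times = rng.exponential(scale=1 / 2.0, size=50)    # simulated completion times
alpha_post = alpha + len(times)                    # + number of observations
beta_post = beta + times.sum()                     # + total observed time
lambda_posterior = stats.gamma(a=alpha_post, scale=1 / beta_post)

print("posterior mean of theta :", theta_posterior.mean())
print("posterior mean of lambda:", lambda_posterior.mean())
```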
Exponential family distributions:
1. Gaussian,
2. Exponential,
3. Beta,
4. Gamma,
5. Binomial,
6. Poisson.
My next step is to review the continuous and discrete probability distributions that belong to the exponential family, together with some of their inherent properties, such as the memoryless property (see the small check after the list below) and conjugate priors. Although these distributions don't necessarily have an intuitive utility, I'll go through simple examples to gain some intuition. Besides that, I'll try to understand the following:
- how the distributions model random variables to form a likelihood function,
- a property related to each distribution that makes them “desirable” (i.e. poisson point process, central limit theorem),
- the parameter space,
- & the mass or density functions.
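For instance, the memoryless property of the exponential, P(X > s + t | X > s) = P(X > t), can be checked numerically with scipy's survival function (the values of s, t, and λ below are arbitrary):

```python
from scipy import stats

lam = 2.0
expon = stats.expon(scale=1 / lam)     # Exponential with rate lambda

s, t = 0.5, 1.2                        # arbitrary time points
lhs = expon.sf(s + t) / expon.sf(s)    # P(X > s + t | X > s)
rhs = expon.sf(t)                      # P(X > t)
print(lhs, rhs)                        # the two agree: memorylessness
```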
To liven up this theory-heavy job, I'll use Python libraries (scipy, pandas, numpy, matplotlib) to
(i) make plots showing different behaviors for each of the above distributions and (ii) simulate random variables generated by them.
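As a small taste of that, here is a sketch that plots a few of the densities listed above with scipy and matplotlib; the parameter choices are arbitrary and only meant to show different shapes.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Arbitrary parameter choices, just to show different shapes
x = np.linspace(0, 3, 300)
plt.plot(x, stats.expon(scale=1 / 2.0).pdf(x), label="Exponential(rate = 2)")
plt.plot(x, stats.gamma(a=3.0, scale=0.3).pdf(x), label="Gamma(shape = 3, scale = 0.3)")

u = np.linspace(0, 1, 300)
plt.plot(u, stats.beta(2, 5).pdf(u), label="Beta(2, 5)")

plt.xlabel("x")
plt.ylabel("density")
plt.legend()
plt.show()
```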
Next steps:
0. Computing Probability Distributions in Python (coming soon)
1. Reading:
- The best intro: Chapter 1.2, Pattern Recognition and Machine Learning
- A quick reminder: Probability Cheatsheet
2. In a short video: Mathematics of Machine Learning
3. Detailed Course: MIT
Until then, Hap-py-coding ⇧⏎
Feel free to respond, ask a question, express a request or correct a mistake you may find.