Role of probability

A robot uses sensors to obtain information about its state.
In certain cases, such as measuring joint positions and velocities, direct state feedback can be obtained.
However, in general, sensor information is noisy and/or provides only partial information about the environment.

In this case, we cannot determine the exact state, but instead try to find the state that is “most likely” based on the evidence.

To do this, we use the conditional probability of the state $X$ given the measurement $Y$: $$ p(X = x|Y = y) $$

For a known value of measurement $y$, this is a function $f(x) = p(x|y)$.

There are two ways to choose a “best guess” $x^*$:

  1. Maximise the probability distribution: $$ x^* = \textrm{argmax}_x[p(X=x|Y=y)] $$
  2. Find the (conditional) expectation over the probability distribution: $$ x^* = \mathbb{E}[x|y] = \int_x x\cdot p(X=x|Y=y) dx $$

Finding the maximum is better in general, since the expectation is only suitable for “unimodal” distributions, where there is a single “clump” of probability density, and the mean characterises the centre of this. For example, if there are two concentrations of probability density, the expectation will give a value between them, which is actually very unlikely.
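This drawback can be sketched numerically (the bimodal distribution below is a made-up example): with two Gaussian “clumps” of probability density, the maximiser lands near one of the modes, while the expectation lands between them, where the density is low.

```python
import numpy as np

# Hypothetical bimodal posterior p(x|y): a mixture of two Gaussian "clumps".
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 2.0) ** 2)
p /= p.sum() * dx  # normalise numerically so p integrates to 1

x_map = x[np.argmax(p)]      # maximiser: lands near one of the modes (close to ±2)
x_mean = (x * p).sum() * dx  # expectation: lands between the modes (close to 0)

print(x_map, x_mean)
```

Here `x_mean` is near zero, a value the state is very unlikely to actually take.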

However, the expectation can still be useful, for example with Monte-Carlo methods (the particle filter), which can better handle this drawback.

Probability models

The term “probability model” will refer to the form of the function $f(x) = p(x|…)$, which may depend on other random variables.

For example, a simple probability model for a GPS sensor is: $$ p(y|x) = \mathcal{N}(y; x, \Sigma) $$

Interpretation:

  • The “true” position of the sensor is $x$.
  • The sensor returns a position measurement $y$.
  • This measurement follows a Gaussian distribution (normal distribution) with mean $x$ and covariance $\Sigma$.
  • Writing $p(y|x) = \mathcal{N}(y;…)$ means that the normal distribution is being evaluated at $y$. i.e., if $x$ and $y$ are $k$-dimensional: $$ p(y|x) = \frac{1}{\sqrt{(2\pi)^k|\Sigma|}} \exp\left({-\frac{1}{2}(y - x)^T\Sigma^{-1}(y - x)}\right)$$
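This density can be evaluated directly. A minimal sketch in Python (the positions and covariance below are made-up values):

```python
import numpy as np

def gaussian_pdf(y, x, cov):
    """Evaluate p(y|x) = N(y; x, cov) for k-dimensional y and x."""
    k = len(x)
    diff = y - x
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

# Hypothetical 2D GPS model: true position x, identity covariance (1 m std per axis).
x_true = np.array([10.0, 5.0])
cov = np.eye(2)

# Density is highest when the measurement y equals the true position x.
print(gaussian_pdf(x_true, x_true, cov))  # peak value: 1/(2*pi) ≈ 0.159
```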

Collections of random variables

Random variable $X_i$ represents the state at time $t = t_i$.
For a collection of datapoints, defined over times $\{t_i\}_{i=1}^n = t_{1:n}$, these values are represented by the collection of random variables $\{X_i\}_{i=1}^n = X_{1:n}$.

Like any other random variable, this has a probability distribution $$ p(X_{1:n} = x_{1:n}) $$

With collections of random variables, the probability models can be much more complex. However, there is usually a way to factorise the probability distribution so it is much more manageable.

For example, in the extreme case that all random variables are independent, the probability distribution becomes: $$ p(x_{1:n}) = p(x_1)\ldots p(x_n) = \prod_i p(x_i)$$

In general, the factorisation depends on “conditional independence” between variables, which can be visually represented using graphical models.
See: (todo: notes on graphical models)

Log-likelihood

Log-likelihood $L(x)$ is defined as the natural log of the probability: $$ L(x) = \ln(p(x)) $$

This maps from $p(x) \in (0, 1]$ to $L(x) \in (-\infty, 0]$.

This is useful because the log converts products into sums: $$ p(x) = \prod_i p_i(x) \implies L(x) = \sum_i L_i(x), \quad L_i(x) = \ln(p_i(x)) $$

Additionally, the logarithm is monotonic, meaning that: $$ \textrm{argmax}_x[p(x)] = \textrm{argmax}_x[L(x)] $$ so maximising the probability is equivalent to maximising the log-likelihood. However, since the log-likelihood is a sum instead of a product, it is a nicer function to optimise.
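There is also a numerical reason to prefer the sum: a product of many small probabilities underflows in floating point, while the corresponding sum of logs stays finite. A small sketch (the density values here are made-up):

```python
import numpy as np

# Hypothetical: 1000 independent measurements, each contributing a small density value.
rng = np.random.default_rng(0)
densities = rng.uniform(1e-4, 1e-2, size=1000)

direct = np.prod(densities)          # product of ~1000 tiny values underflows to 0.0
log_lik = np.sum(np.log(densities))  # sum of logs stays finite

print(direct, log_lik)
```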

It also gives a nice result for the Gaussian distribution (up to an additive constant that does not depend on $x$): $$ p(x) = \mathcal{N}(x;\mu,\Sigma) \implies L(x) = -\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) + \textrm{const} $$

When a Gaussian log-likelihood appears, maximising it amounts to minimising a quadratic term.
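Concretely, taking the log of the Gaussian density from earlier makes the quadratic term and the constant explicit: $$ L(x) = \ln\mathcal{N}(x;\mu,\Sigma) = -\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) - \frac{1}{2}\ln\left((2\pi)^k|\Sigma|\right) $$ Since the second term does not depend on $x$, maximising $L(x)$ is equivalent to minimising the quadratic $(x - \mu)^T\Sigma^{-1}(x - \mu)$.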

Monte-Carlo methods

Monte-Carlo methods provide a way to find the expectation of a random variable using random sampling: $$ \begin{align*} \mathbb{E}_p[f(x)] &= \int_x f(x) \cdot p(x) dx \\ &= \int_x f(x)\frac{p(x)}{q(x)}q(x) dx \\ &= \int_x f(x)w(x)q(x) dx \\ &= \mathbb{E}_q[f(x)w(x)] \\ &\approx \frac{\sum_i f(X_i)w(X_i)}{\sum_i w(X_i)} \end{align*}$$

In the final line, $X_i$ denotes a sample of $X$ drawn from the distribution $q(x)$. The weighted mean of $f(X_i)$, with weights $w(X_i) = p(X_i)/q(X_i)$, then estimates the original expectation.

The reason this is useful is because:

  • For a complex $p(x)$, it is difficult to draw samples $X_i \sim p(x)$.
  • However, the distribution can still easily be evaluated at a given value.
  • A simpler distribution $q(x)$ can be used to generate samples, and the weighting $w(x)$ corrects the resulting mean for the difference between $q(x)$ and $p(x)$, i.e. for which samples $X_i$ each distribution makes more likely.
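The derivation above can be sketched numerically. In this made-up example, the target $p$ is a standard normal, the proposal $q$ is a wider normal that is easy to sample from, and the quantity estimated is $\mathbb{E}_p[x^2] = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p(x): standard normal. Proposal q(x): normal with std 2 (easy to sample).
def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

samples = rng.normal(0.0, 2.0, size=100_000)  # X_i ~ q
w = p(samples) / q_pdf(samples)               # importance weights w(X_i)

# Self-normalised estimate of E_p[f(X)] with f(x) = x^2 (true value: 1).
f = samples**2
estimate = np.sum(f * w) / np.sum(w)
print(estimate)
```

Because the estimate divides by the sum of the weights, only the ratio $p/q$ matters, which is what makes the unnormalised extension mentioned below possible.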

This can be extended to distributions $p(x)$ and $q(x)$ that aren’t normalised, and to random sequences, which is the basis of the particle filter.