# Lecture II – The Problem of Statistical Machine Learning. Formal description of the HiggsML challenge.

Consider the following problem of searching for functional dependency which we call the model of learning from examples or the supervised learning. In this model we are given a set of pairs $(x, y)$ called the training set. The elements $x$ are coming from some space $X$ which we call the input set. In what follows we will always assume that $x$ are i.i.d. and generated according to some fixed but unknown probability distribution $F(x)$. The elements $y$ are the elements of the set $Y$ which we call the output set.

There is some dependence between the elements of $X$ and $Y$ which is given by the unknown probability distribution $F(x, y)$ on the product space $X \times Y$.

The elements of the training set are obtained as follows. Given the input $x$ the operator generates an output response according to the conditional distribution function $F(y|x)$ and we observe the sample of these responses:  $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ – the training set (this includes the case when the operator uses a function $y=f(x)$ to generate a response). The learning machine observes this output and creates an approximation of an unknown operator. In the context of statistical machine learning by approximation we usually mean “selection of the best function from the given class of functions”.

Each time the problem of selecting a function from a given class of functions arises we will use the following model to find the best one. From the totality of all the possible functions we select the one which maximizes some quality criteria. More formally, given the set of functions $f_\alpha, \alpha \in \Lambda$ we are looking for such $\alpha^\ast$ that will minimize the value of the risk functional $R(f_\alpha)$. In the case when the family of functions and the functional are explicitly given this is a problem of calculus of variations.

We will be mostly interested in the functionals of the special form: $R(\alpha) = \int_{X \times Y} L(f_\alpha(x), y) dF(x, y)$. The function $L$ is a so-called loss function. This functional has a very simple interpretation, the value $L(f_\alpha(x), y)$ is an amount of “loss” resulting from the realization of vector $x$.

Our first important observation is that the integral is taken with respect to an unknown distribution which is in fact the one we are trying to estimate. Let us, instead of minimizing the functional $R$ minimize the functional $R^{EMP} = \frac{1}{N} \sum_{i=1}^N L(y_i, f_\alpha (x_i))$.

As a result, we have the following:

Definition: The problem of statistical machine learning consists of

1. The input space $X$ and the output space $Y$ with an unknown probability density function $F(x, y)$,
2. Observed sample $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ called the training set,
3. The class of functions $f_\alpha, \ \alpha \in \Lambda$ called a learning machine or a model,
4. The risk functional $R(\alpha) = \int_{X \times Y} L(f_\alpha(x), y) dF(x, y)$ and an empirical risk functional $R^{EMP} = \frac{1}{N} \sum_{i=1}^N L(y_i, f_\alpha (x_i))$.

Our task is to find the value of $\alpha$ such that empirical risk functional is minimized.

There are two important problems with this definition which we will address later in the course.

1. (Model Selection) How do we select the class of functions $f_\alpha$?
2. (Generalization/Overfitting) How do we guarantee that the value of $\alpha$ that minimizes the empirical risk functional would also minimize the risk functional?

To answer the first question there is a host of models that were brought to machine learning in the last 50 years: linear models, neural networks, support vector machines, Hopfield networks, etc are all the examples of the classes of functions that are usually being considered.

In general, criteria for the selection of the model is a combination of flexibility and good generalization ability. It is usually desirable that the function class is flexible enough to model any function on one hand but does not overfit to training data.

Take neural networks as an example. As follows from the theorem of Cybenko any function on a hypercube can be approximated by some neural network (with one hidden layer) thus making neural networks good at approximating unknown functions. On the other hand neural networks with fixed architecture have small VC dimension making them good at generalization.

While we can infer about the quality of the model using some sophisticated methods like VC-dimension, in practice, we do something very different. Instead, we divide our sample into three parts:

1. Training set. Used to find the value of $\alpha$ that minimizes the value of $R^{EMP}$ for a given model.
2. Validation set. Used to tune the hyperparameters of the model. For example, the architecture of the model.
3. Test set. Used to assess the accuracy of the model. Here, we assume that $R^{EMP}$ computed on the test set is a good estimate of $R$.

We use training set to find the parameters of the model, validation set to fine-tune the parameters and architecture and also to help with the stopping rule and testing set as a final benchmark.

Let us now return back to our toy data set and specify the our classification problem as a problem of statistical machine learning, pick some simple model and see how it works out.

We will start with the definition of the loss function for the Higgs boson challenge. The background process is a decay of Z-bozons into two tau mesons and the signal process is the decay of a Higgs boson into two taus. Let us formulate the statistical hypothesis about our data:

$H_0$ – “background hypothesis”. All data comes from the decay of Z-bosons.

$H_1$ – “signal hypothsis”. Oserved data is a combination of signal and background processes.

To understand the result of the search one quantifies the agreement of an observed data with a null hypothesis by computing the p-value $p = P(X|H)$, i.e. the probability of observing the data given there is only background process. If this probability is very small we reject the hypothesis. The measure of incompatibility is based for example on the number of certain events found in the designated region of the search space.

p-values are usually converted into the significance levels $Z = \Phi^{-1}(1-p)$ where $\Phi$ is standard Gaussian. We will reject a “background hypothesis” if the level of significance $Z \ge 5$ which corresponds to $p = 2.87 \times 10^{-7}$.

Let $\mathcal{D} = \{(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_n, y_n, w_n)\}$ be the training set where $x_i \in \mathbb{R}^d$, $y_i \in \{b, s\}$, and $w_i \in \mathbb{R}^{+}$. Denote by $\mathcal{S} = \{ y_i = s\}$ and by $\mathcal{B} = \{y_i = b\}$ the indices of the signal and background signals and let $n_s$ and $n_b$ be the numbers of simulated signal and background events. Note that since the dataset from the Higgs challenge is a simulated set the ratio $n_s/n_b$ will not in general be equal to the ratio $P(y=s)/P(y=b)$. This is actually very good for us since $P(y=s)<.

The last thing that needs an explanation in the training set are the weights. We will see later that objective function depends on the unnormalized weights. In order to make it independent on the number of simulated signal and background events the sum of the weights is kept fixed, i.e. $\sum_{i \in \mathcal{S}} w_i = N_s$ and $\sum_{i \in \mathcal{B} }w_i = N_b$ where $N_s$ and $N_b$ are for the expected number of background and signal events in 2012.

Let $g: \mathbb{R}^d \to \{b, s\}$ be any classifier. Selection region is a set $\mathcal{G} = \{ x : g(x) = s\}$ and denote by $\overline{\mathcal{G}}$ the indices of points in $\mathcal{D}$ selected by $g$.

We see that the number $s = \sum_{i \in \mathcal{S} \cap \overline{\mathcal{G}}} w_i$ is an unbiased estimator of the expected number of signal events selected by $g$

$\mu_s = N_s \int_\mathcal{G} p_s(x) dx$

Similarly, the number $b = \sum_{i \in \mathcal{B} \cap \overline{\mathcal{G}}} w_i$ is an unbiased estimate of the expected number of background events

$\mu_s = N_b \int_\mathcal{G} p_b(x) dx$.

We are now in the position to define the (empirical) risk functional:

$AMS = \sqrt{2(s+b+b_{reg})\ln(1+\frac{s}{b + b_{reg}}) - s)}$.

called absolute median significance. The term $b_{reg}$ is a regularization term used to reduce the variance of the AMS. We will discuss regulatization in the next lectures.

Finally, we are in the position to formally define the HiggsML challenge as a machine learning problem. Given the training set $\mathcal{D}$ find an optimal value of parameters in some class of functions $f_\alpha$ that would minimize the value of the empirical risk functional AMS and when tested on the testing set produce the same significance of the test AMS.

Anyone interested in the derivation of the AMS I refer to the paper [Claire Adam-Bourdarios et. al. “Learning to discover: the Higgs boson machine learning challenge”, 2014]