## Lecture IV – Our first model.

After we have finished with the theoretical discussion of the Neural Networks it is time to work on our first model.

#Load the data


This is a Higgs ML challenge. For anyone interested in detailed data analysis please read Lecture I – Exploratory Data Analysis of these lecture series.

We start by dropping all the NaN values from the dataset. It is not right to do so in this particular situation as these missing values are not random and most of them are due to the number of jets produced after the particle decay. However, to make things easier we will not deal with this issue here.


#Remove missing data
#Ideally split by jets here
data = data.replace(-999.0, np.nan)
data.dropna(inplace = True)



Now, we split our dataset into the training and testing parts. We are not splitting into the validation set as this is done automatically by the algorithm that trains the neural net

from sklearn.model_selection import train_test_split

#Split the data into training, validation and test set
X = data[range(1,31)]
y = data['Label']
y[y=='s'] = 1
y[y=='b'] = 0
w = data['KaggleWeight']
sum_weights = sum(w)

N = X.shape[0]
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(X, y, w, test_size=0.2, random_state=42)


On the way we rename the states values to 0 and 1 and isolate the weights as a
separate variable. The split is 80% train set and 20% test set.
As we know already, the key metric for Higgs ML challenge is AMS – Absolute Median Significance of discovery.


def ams(s, b):
return sqrt(2 * ((s + b + 10) * log(1.0 + s/(b + 10)) - s))

def get_ams_score(W, Y, Y_pred):
multiplier = sum_weights/sum(W) #normalization of weight
s = W * (Y == 1) * (Y_pred == 1)
b = W * (Y == 0) * (Y_pred == 1)
s = np.sum(s)
b = np.sum(b)

return ams(s, b)

#Check for random prediction
y_pred = np.random.binomial(1, 0.5, y_test.shape[0])

print(get_ams_score(w_test, y_test, y_pred))



We have tested how much we score in the AMS by randomly assigning the labels to the input variables. The value of AMS is roughly $AMS = 0.5$.

We are in position now to traing our first model.


#Now we may train the Neural Network
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(early_stopping = True)
mlp.fit(X_train, np.array(y_train, dtype='int'))



and predict the values of labels on the test set.


y_pred = mlp.predict(X_test)
print(get_ams_score(w_test, y_test, y_pred))
print(1.*sum(y_pred == y_test)/y_test.shape[0])



As a result we score $AMS = 1.21$ and quess the right labels ~80% of the time.

Before we proceed with data transformation I will describe a trick that can help make the AMS slightly better. As we know, MLP (multi-layer perceptron) will optimize the cross-entropy not directly the AMS. The labels to observations are assigned according to the condition $P(y=1|x)>0.5$. Since the fraction of Higgs bosons is much less than 50 percent, it might be advantageous to label only those particles as Higgs which has much higher probability $P(y=1|x)>\theta, \theta >0.5$.


def AMS_optim(theta):
y_pred = (y_proba > theta)
return get_ams_score(w_test, y_test, y_pred)

theta = 0.8
y_proba = mlp.predict_proba(X_test)[:,1]
y_pred = (y_proba > theta)
print(get_ams_score(w_test, y_test, y_pred))
print(1.*sum(y_pred == y_test)/y_test.shape[0])
Theta = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
scores = map(AMS_optim, Theta)

plt.plot(Theta, scores)
plt.show()



As a result we get the following dependence between the AMS and $\theta:$

We have thus increased the value of AMS to about 1.3

Let us now improve the model by applying the transformation of the input set:


X_mod = X
columns_negative = X_mod.apply(lambda x: (x<=0).sum(), axis = 0)
print columns_negative

X_mod['DER_mass_MMC'] = np.log(X_mod['DER_mass_MMC'])

X_mod['DER_mass_MMC'] = np.log(X_mod['DER_mass_MMC'])
X_mod['DER_mass_vis'] = np.log(X_mod['DER_mass_vis'])
X_mod['DER_pt_h'] = np.log(X_mod['DER_pt_h'])
X_mod['DER_mass_jet_jet'] = np.log(X_mod['DER_mass_jet_jet'])
X_mod['DER_deltar_tau_lep'] = np.log(X_mod['DER_deltar_tau_lep'])
X_mod['DER_pt_tot'] = np.log(X_mod['DER_pt_tot'])
X_mod['DER_sum_pt'] = np.log(X_mod['DER_sum_pt'])
X_mod['DER_pt_ratio_lep_tau'] = np.log(X_mod['DER_pt_ratio_lep_tau'])
X_mod['PRI_tau_pt'] = np.log(X_mod['PRI_tau_pt'])
X_mod['PRI_lep_pt'] = np.log(X_mod['PRI_lep_pt'])
X_mod['PRI_met'] = np.log(X_mod['PRI_met'])
X_mod['PRI_met_sumet'] = np.log(X_mod['PRI_met_sumet'])



apply the modifications and rerun the model.


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_mod = scaler.fit_transform(X_mod)

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(X_mod, y, w, test_size=0.2, random_state=42)

mlp = MLPClassifier(early_stopping = True)
mlp.fit(X_train, np.array(y_train, dtype='int'))

y_pred = mlp.predict(X_test)
print(get_ams_score(w_test, y_test, y_pred))
print(1.*sum(y_pred == y_test)/y_test.shape[0])

y_proba = mlp.predict_proba(X_test)[:, 1]
scores = map(AMS_optim, Theta)
plt.plot(Theta, scores)
plt.show()



finally getting to the value of AMS of more than 2.3.

## Comments to Lecture III – “Neural Networks: Tricks of the Trade.”

In the last lecture we’ve discussed backpropagation algorithm for training of Neural Networks. It is all good and well in theory but practical implementation of the neural network might be quite challenging. In this lecture we will discuss  potential issues as well as provide some references.

• Activation functions on the output and hidden layers.

Backpropagation and training by gradient descent will work with different kinds of activation functions. However, some activation functions might perform better than the others. Typical good choices are $\tanh$, sigmoid functions, relu units, etc.

• Initialisation of weights.

This is an important question as setting the initial weights right might greatly increse the speed of learning. The weights are selected in such a way that most of the neurons will be activated in the regions of the most gradient. For the normalized and scaled data the usual choice is $U[-\frac{1}{\sqrt(N)}, \frac{1}{\sqrt(N)}]$ where $N$ is the number of neurons at the hidden layer.

• Choice of the risk function.

There is evidence that in classification tasks cross-entropy is a better choice for the risk function compared with mean-squared error. To avoid overfitting regularization might be used.

• The choice of the learning rate.

This is a classical topic of numerical analysis. Depending on the available information methods that involve second derivatives of the risk functional should be used.

• Transformation of data.

It is always assumed that data is scaled and transformed to the zero mean and unit volatility.

[1] Bengio, Yoshua. “Practical recommendations for gradient-based training of deep architectures.” Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.

[2] LeCun, Y., Bottou, L., Orr, G. B., and Muller, K. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade.

[3] Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks. “International conference on artificial intelligence and statistics. 2010.

## Kelly Betting and Ergodic Theorem.

In the last Kelly post we have considered the simplest case when the outcomes of the bets are the i.i.d Bernoulli random variables. How do we compute the optimal $f$‘s if the outcome of the bet depends on the outcome of the last bet?  Our main goal for this post would be to understand what the optimal Kelly is in the following example: given that the last flip of a coin is “heads” the next flip will be “heads” with the probability $0.8$ and “tails” with the probability $0.2.$ While the probability of “tails”  and “heads” after “tails” is 0.6 and 0.4 correspondingly.

The example above is a typical example of what is called the Markov chain. The Markov chain consists of the state space $S$ and a transition probability matrix $P$. For every pair of states $i, j\in S$ the probability of transition from $i$ to $j$ is given by $P(j|i) = p_{ji}$.

If $f = (f_1, f_2, f_3, \ldots, f_t, \ldots)$ is any sequence of bet sizes chosen from some bounded set (in reality it is always bounded because of margins) the growth with respect to this betting policy would be defined as $g(f) = \lim_{T\to\infty}\frac{1}{T} \sum_{t=0}^{T-1} \log(1+f_t \xi_t)$ where $\xi_t$ are the results of the bets. Here we always assume we bet on the favorable outcome.

We would like to find such sequence $f$ that will maximize $g(f)$. Without going too much into technical details we can prove the following theorems

Theorem 1: There exists an optimal policy $f$ that maximizes the growth. This optimal policy may be selected to be stationary, i.e. depends only on the state.

Theorem 2: If we denote by $a_t$ the means of $x_t$, if the limit $g(f) = \lim_{T\to\infty}\frac{1}{T} \sum_{t=0}^{T-1} \log(1+f_t \xi_t)$ exists it is equal to $g(f) = \mathbb{E}_\pi a$ where $a$ is the vector of means of $\xi$ and $\pi$ is a stationary distribution.

Let’s see how the theorems work for our example. There are two states which we will denote H and T. The stationary distribution of this Markov chain is given by the vector $\pi = (\frac{2}{3}, \frac{1}{3})$. From the Theorem 1 the optimal policy is defined by two optimal f’s – one for H and one for T. Denote them by $f$ and $h$.

From the Theorem 2 the log growth is given by $g(f, h) = \frac{2}{3} (0.8 (1+f) + 0.2 (1-f)) + \frac{1}{3} (0.6 (1+h) + 0.4(1-h))$. To find its maximum we differentiate with respect to $f$ and $h$ and set the partial derivatives to zero. We can see that in order to maximize the growth we need to maximize each of the Kelly fractions separately – one for H and one for T. We already know that in this case $f=0.6$ and $h = 0.1$.

## Lecture III. Neural Networks and Back-propagation Algorithm.

In this lecture we will discuss the back propagation algorithm for training of neural networks.

Definition: (Feed forward ) Neural network (NN) is a function that is encoded in a graph. It is defined on the subset $X \subset \mathbb{R}^d$  and maps to $Y = \mathbb{R}$ in the case of regression or $Y =\{0, 1\}$ in the case of classification.

The first layer of the network is an input layer and accepts the elements of the vector $(x_1, x_2, \ldots, x_d)$. The last layer is an output layer. All the remaining layers are called hidden layers. NN is feed-forward in a sense that information in a graph flows from one layer to the other and there are no directed cycles.

There is also a numeric weight associated with every edge of the network and a smooth activation function $h$.

We will use the following notations: to every node $i$ of the graph there are two numbers $a_i = \sum_{k} w_k z_k$ where $k$ goes through all edges that point into $i$ and $z_i = h(a_i)$. For the nodes of the input layer $z$ is equal to the argument of the function.

Given the input vector $(x_1, x_2, \ldots, x_d)$ we propagate it into hidden layers using the formulas above. Finally, the value of the function is given by $y=h(\sum_{k} w_k z_k)$ for all nodes that point into the output node.

Thus, neural network is simply a function from the input space into output space. The totality of all the weights constitutes the set of model parameters.

Note, that biases could be included into the neural network by creating extra input layers with the value of $1$.

### Symmetries of the Neural Networks.

One of the most common choices of the activation function is a hyperbolic tan function. Since it is anti-symmetric, it is easy to see that by simultaneously changing the signs of the weights on the edges pointing to and from any node of some node in the hidden layer will not change the function that the neural network represents. Moreover, shuffling the nodes in any hidden layer does not change the function either. One can show easily that if the hidden layer contains $M$ nodes the number of equivalent functions is at least $M! 2^M$. This might get to be a major problem because this might mean that in general position the number of local minima for a neural network is growing exponentially with the size of the network.

NN in the problems of regression. As usual we are minimizing the mean squared error when dealing with regression problems.

NN in the problems of classification. In this case, we are using the function $y = \frac{1}{1+e^{-a}}$ as the activation function in the output layer. The functional of risk that we are trying to minimize is the cross-entropy

$R^{EMP} (w) = - \sum_{n=1}^N (y_n \log f(x_n, w) + (1-y_n) \log(1 - f(x_n, w))$

The output of the network is interpreted in this case as the class probability $P(y=1 | x)$ and the inputs are assumed to be conditionally independent.

### Backpropagation Algorithm.

Backpropagation algorithm is at the core of most training algorithms for neural networks. It computes the gradient of a neural network at a given point.

In most cases there is no analytical solution to the problem of minimization of the empirical risk and it is attained by gradient descent. At every step of the iteration the weight is updated in the direction of $-\nabla R^{EMP}$ according to $w^{(\tau + 1)} = w^{(\tau)} - \eta \nabla R^{EMP}$ where $\eta >0$ is defines the step size. This procedure is guaranteed to terminate in the local minima of the empirical risk.

We will briefly describe the idea of backpropagation without giving complete proofs.

Step 1.

Let $w$ denotes a weight over some edge $ij$ where $i$ is the start of the node and $j$ is the end. Let $a = \sum_k w_k z_k$.

Lets start with the computation of the partial derivative $\frac{\partial R^{EMP}}{\partial w} = \frac{\partial R^{EMP}}{\partial a}\frac{\partial a}{\partial w}$ because empirical risk depends on $w$ only through $a$.

If we denote $\delta = \frac{\partial R^{EMP}}{\partial a}$ and compute $\frac{\partial a}{\partial w} = z$ we can conclude that $\frac{\partial R^{EMP}}{\partial w} = \delta z$. We know $z$ by simply evaluating the neural net at the current value of the weights. Therefore, to compute the gradient we need to know the value of $\delta$ for all nodes.

Step 2.

Let us take some node and index by $k$ all the nodes that are connected with it and are the ends of the corresponding edges. We can write $\delta = \frac{\partial R^{EMP}}{\partial a} = \sum_k \frac{\partial R^{EMP}}{\partial a_k} \frac{\partial a_k}{\partial a} = \sum_k \delta_k \frac{\partial a_k}{\partial a}$. Now, from the definition of $a_k$ it is easy to see that $\frac{\partial a_k}{\partial a} = w_k h'(a)$. Therefore, in order to compute delta we need to know the values of delta for the next layer. If we do, we may recursively apply these computations and move from the output layer to the input layer one by one. In particular, it suffices to compute the value of deltas on the output layer only. This is left as an exersise for to the reader.

## Kelly Betting and the Law of Large Numbers.

Kelly optimal solves for the fraction of capital one needs to bet in an unfair game to maximize its asymptotic (in a long run) rate of capital growth. It was first derived in the 1956 paper [Kelly J.L., “A new interpretation of the information rate”, Bell System Technical Journal, 1956] for the case of a biased coin. Since then, many academics generalized this result in many different settings: horse racing, black jack, stock market, etc. In a series of blog posts I will try to generalize some of these results to solve the following problem: given an open portfolio of trades, if I want to open a new position what is the fraction of my capital I should bet?

I will start with the simplest case of the biased coin. You start with a capital of \$100 and you are offered to play in a following game: there is a flawed coin which comes up “tails” with a known probability $p>0.5$. You make a bet and then flip a coin. If it comes up “tails” you win and in this case you get one dollar per each one dollar bet. If you lose banks takes your bet. You can make bets any number of times, how much should you bet to maximize your capital?

First, we need to understand what does this mean to maximize the capital. Reasonable criterion would be to maximize your capital after one or several bets relative to an extra risk constraint. This is a subject of mean-variance portfolio optimization. Here, we will follow another route and maximize the rate of growth of my capital.

Denote by $\xi$ the random variable which is “tails” with the probability $p$ and “heads” with the probability $(1-p)$. Assume, each time we bet we bet a fixed fraction of our current capital. This is a strong assumption but in what follows we will see that we can in fact prove that it has to be constant. By the definition, the growth after $T$ bets is defined as $G(T) = (\prod_{t=1}^T (1 + f \xi_t))^{\frac{1}{T}}$. Our goal is to maximize it with respect to $f$ when $T$ goes to infinity.

To maximize $\lim_{T \to \infty} G(T)$ is the same as to maximize the logarithm $\lim_{T \to \infty} \log(G(T)) = \lim_{T \to \infty} T^{-1} \sum_{t=1}^T \log (1+f \xi_t).$ It is a direct application of the Law of Large Numbers that this limit is equal to the expected value $\mathbb{E} \log (1+f\xi)$.

Finally, in order to find the value of $f$ we can write $\mathbb{E} \log (1+f\xi) = p \log(1+f) + (1-p)\log(1-f)$ since $\xi$ take only on two values. By differentiating with respect to $f$ and setting the result to zero we obtain $\frac{d}{df} \mathbb{E} \log(1+f\xi) = \frac{p}{1+f} - \frac{1-p}{1-f} = 0$. This equation has a unique solution $f = 2p-1$.

In the next post I will explain why we can always assume the $f$ is constant and how this generalizes to a more general situation.

## Lecture II – The Problem of Statistical Machine Learning. Formal description of the HiggsML challenge.

Consider the following problem of searching for functional dependency which we call the model of learning from examples or the supervised learning. In this model we are given a set of pairs $(x, y)$ called the training set. The elements $x$ are coming from some space $X$ which we call the input set. In what follows we will always assume that $x$ are i.i.d. and generated according to some fixed but unknown probability distribution $F(x)$. The elements $y$ are the elements of the set $Y$ which we call the output set.

There is some dependence between the elements of $X$ and $Y$ which is given by the unknown probability distribution $F(x, y)$ on the product space $X \times Y$.

The elements of the training set are obtained as follows. Given the input $x$ the operator generates an output response according to the conditional distribution function $F(y|x)$ and we observe the sample of these responses:  $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ – the training set (this includes the case when the operator uses a function $y=f(x)$ to generate a response). The learning machine observes this output and creates an approximation of an unknown operator. In the context of statistical machine learning by approximation we usually mean “selection of the best function from the given class of functions”.

Each time the problem of selecting a function from a given class of functions arises we will use the following model to find the best one. From the totality of all the possible functions we select the one which maximizes some quality criteria. More formally, given the set of functions $f_\alpha, \alpha \in \Lambda$ we are looking for such $\alpha^\ast$ that will minimize the value of the risk functional $R(f_\alpha)$. In the case when the family of functions and the functional are explicitly given this is a problem of calculus of variations.

We will be mostly interested in the functionals of the special form: $R(\alpha) = \int_{X \times Y} L(f_\alpha(x), y) dF(x, y)$. The function $L$ is a so-called loss function. This functional has a very simple interpretation, the value $L(f_\alpha(x), y)$ is an amount of “loss” resulting from the realization of vector $x$.

Our first important observation is that the integral is taken with respect to an unknown distribution which is in fact the one we are trying to estimate. Let us, instead of minimizing the functional $R$ minimize the functional $R^{EMP} = \frac{1}{N} \sum_{i=1}^N L(y_i, f_\alpha (x_i))$.

As a result, we have the following:

Definition: The problem of statistical machine learning consists of

1. The input space $X$ and the output space $Y$ with an unknown probability density function $F(x, y)$,
2. Observed sample $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ called the training set,
3. The class of functions $f_\alpha, \ \alpha \in \Lambda$ called a learning machine or a model,
4. The risk functional $R(\alpha) = \int_{X \times Y} L(f_\alpha(x), y) dF(x, y)$ and an empirical risk functional $R^{EMP} = \frac{1}{N} \sum_{i=1}^N L(y_i, f_\alpha (x_i))$.

Our task is to find the value of $\alpha$ such that empirical risk functional is minimized.

There are two important problems with this definition which we will address later in the course.

1. (Model Selection) How do we select the class of functions $f_\alpha$?
2. (Generalization/Overfitting) How do we guarantee that the value of $\alpha$ that minimizes the empirical risk functional would also minimize the risk functional?

To answer the first question there is a host of models that were brought to machine learning in the last 50 years: linear models, neural networks, support vector machines, Hopfield networks, etc are all the examples of the classes of functions that are usually being considered.

In general, criteria for the selection of the model is a combination of flexibility and good generalization ability. It is usually desirable that the function class is flexible enough to model any function on one hand but does not overfit to training data.

Take neural networks as an example. As follows from the theorem of Cybenko any function on a hypercube can be approximated by some neural network (with one hidden layer) thus making neural networks good at approximating unknown functions. On the other hand neural networks with fixed architecture have small VC dimension making them good at generalization.

While we can infer about the quality of the model using some sophisticated methods like VC-dimension, in practice, we do something very different. Instead, we divide our sample into three parts:

1. Training set. Used to find the value of $\alpha$ that minimizes the value of $R^{EMP}$ for a given model.
2. Validation set. Used to tune the hyperparameters of the model. For example, the architecture of the model.
3. Test set. Used to assess the accuracy of the model. Here, we assume that $R^{EMP}$ computed on the test set is a good estimate of $R$.

We use training set to find the parameters of the model, validation set to fine-tune the parameters and architecture and also to help with the stopping rule and testing set as a final benchmark.

Let us now return back to our toy data set and specify the our classification problem as a problem of statistical machine learning, pick some simple model and see how it works out.

We will start with the definition of the loss function for the Higgs boson challenge. The background process is a decay of Z-bozons into two tau mesons and the signal process is the decay of a Higgs boson into two taus. Let us formulate the statistical hypothesis about our data:

$H_0$ – “background hypothesis”. All data comes from the decay of Z-bosons.

$H_1$ – “signal hypothsis”. Oserved data is a combination of signal and background processes.

To understand the result of the search one quantifies the agreement of an observed data with a null hypothesis by computing the p-value $p = P(X|H)$, i.e. the probability of observing the data given there is only background process. If this probability is very small we reject the hypothesis. The measure of incompatibility is based for example on the number of certain events found in the designated region of the search space.

p-values are usually converted into the significance levels $Z = \Phi^{-1}(1-p)$ where $\Phi$ is standard Gaussian. We will reject a “background hypothesis” if the level of significance $Z \ge 5$ which corresponds to $p = 2.87 \times 10^{-7}$.

Let $\mathcal{D} = \{(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_n, y_n, w_n)\}$ be the training set where $x_i \in \mathbb{R}^d$, $y_i \in \{b, s\}$, and $w_i \in \mathbb{R}^{+}$. Denote by $\mathcal{S} = \{ y_i = s\}$ and by $\mathcal{B} = \{y_i = b\}$ the indices of the signal and background signals and let $n_s$ and $n_b$ be the numbers of simulated signal and background events. Note that since the dataset from the Higgs challenge is a simulated set the ratio $n_s/n_b$ will not in general be equal to the ratio $P(y=s)/P(y=b)$. This is actually very good for us since $P(y=s)<.

The last thing that needs an explanation in the training set are the weights. We will see later that objective function depends on the unnormalized weights. In order to make it independent on the number of simulated signal and background events the sum of the weights is kept fixed, i.e. $\sum_{i \in \mathcal{S}} w_i = N_s$ and $\sum_{i \in \mathcal{B} }w_i = N_b$ where $N_s$ and $N_b$ are for the expected number of background and signal events in 2012.

Let $g: \mathbb{R}^d \to \{b, s\}$ be any classifier. Selection region is a set $\mathcal{G} = \{ x : g(x) = s\}$ and denote by $\overline{\mathcal{G}}$ the indices of points in $\mathcal{D}$ selected by $g$.

We see that the number $s = \sum_{i \in \mathcal{S} \cap \overline{\mathcal{G}}} w_i$ is an unbiased estimator of the expected number of signal events selected by $g$

$\mu_s = N_s \int_\mathcal{G} p_s(x) dx$

Similarly, the number $b = \sum_{i \in \mathcal{B} \cap \overline{\mathcal{G}}} w_i$ is an unbiased estimate of the expected number of background events

$\mu_s = N_b \int_\mathcal{G} p_b(x) dx$.

We are now in the position to define the (empirical) risk functional:

$AMS = \sqrt{2(s+b+b_{reg})\ln(1+\frac{s}{b + b_{reg}}) - s)}$.

called absolute median significance. The term $b_{reg}$ is a regularization term used to reduce the variance of the AMS. We will discuss regulatization in the next lectures.

Finally, we are in the position to formally define the HiggsML challenge as a machine learning problem. Given the training set $\mathcal{D}$ find an optimal value of parameters in some class of functions $f_\alpha$ that would minimize the value of the empirical risk functional AMS and when tested on the testing set produce the same significance of the test AMS.

Anyone interested in the derivation of the AMS I refer to the paper [Claire Adam-Bourdarios et. al. “Learning to discover: the Higgs boson machine learning challenge”, 2014]

## Lecture I – Exploratory Data Analysis

I will talk about the supervised learning. In short the supervised learning is a problem or task of estimation of an unknown function $f$ that maps some input $x$ into some output $y$ using labeled training examples $(x_i, y_i)$. Two main problems that we usually solve in the context of supervised learning are classification and regression.

Classification: Given the set of labeled training examples $(x_i, y_i)$ with $y_i$ from the set of $\{0, 1\}$ learn how to assign a label $0$ or $1$ to a given input $x$

Regression: Loosely speaking regression is about estimating the value of an unknown real-valued function. This is not always true as this “function” might not even exist.

Some notable examples of the supervised learning problems are linear regression, logistic regression, discriminant analysis, support vector machines, neural networks and many other methods.

Every machine learning task starts from the exploratory data analysis. Up to 80% of the data scientist’s time is spent analyzing data and this is one of the main non systematic tasks it needs to do. We will use Higgs Boson Machine Learning Challenge dataset as our toy dataset.

“The ATLAS experiment and the CMS experiment recently claimed the discovery of the Higgs boson. The discovery was acknowledged by the 2013 Nobel prize in physics given to Franc¸ois Englert and Peter Higgs. This particle was theorized almost 50 years ago to have the role of giving mass to other elementary particles. It is the final ingredient of the Standard Model of particle physics, ruling subatomic particles and forces. The experiments are running at the Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research), Geneva, which began operating in 2009 after about 20 years of design and construction, and which will continue operating for at least the next 10 years…

…In the LHC, proton bunches are accelerated on a circular trajectory in both directions. When these bunches cross in the ATLAS detector, some of the protons collide, producing hundreds of millions of proton-proton collisions per second. The up to hundreds of particles resulting from each bunch crossing (called an event) are detected by sensors, producing a sparse vector of about a hundred thousand dimensions (roughly corresponding to an image or speech signal in classical machine learning applications). From this raw data, the type, the energy, and the 3D direction of each particle are estimated. In the last step of this feature construction phase, this variable length list of four-tuples is digested into a fixed-length vector of features containing up to tens of real-valued variables. Some of these variables are first used in a real-time multi-stage cascade classifier (called the trigger) to discard most of the uninteresting events (called the background). The selected events (roughly four hundred per second) are then written on disks by a large CPU farm, producing petabytes of data per year. The saved events still, in large majority, represent known processes (called background): they are mostly produced by the decay of particles which are exotic in everyday terms, but known, having been discovered in previous generations of experiments. The goal of the offline analysis is to find a (not necessarily connected) region in the feature space in which there is a significant excess of events (called signal) compared to what known background processes can explain. Once the region has been fixed, a statistical (counting) test is applied to determine the significance of the excess. If the probability that the excess has been produced by background processes falls below a limit, the new particle is deemed to be discovered…”

[Claire Adam-Bourdarios et al., Learning to discover: the Higgs boson machine learning challenge, 2014]

The data for the kaggle challenge is available free on the CERN OpenData portal here. It consists of a .csv file 186 Mb size and contains ~820,000 observations of proton-proton collision events.





The data has the following list of columns in it:

0                         EventId
1                    DER_mass_MMC
2     DER_mass_transverse_met_lep
3                    DER_mass_vis
4                        DER_pt_h
5            DER_deltaeta_jet_jet
6                DER_mass_jet_jet
7             DER_prodeta_jet_jet
8              DER_deltar_tau_lep
9                      DER_pt_tot
10                     DER_sum_pt
11           DER_pt_ratio_lep_tau
12         DER_met_phi_centrality
13         DER_lep_eta_centrality
14                     PRI_tau_pt
15                    PRI_tau_eta
16                    PRI_tau_phi
17                     PRI_lep_pt
18                    PRI_lep_eta
19                    PRI_lep_phi
20                        PRI_met
21                    PRI_met_phi
22                  PRI_met_sumet
23                    PRI_jet_num
30                 PRI_jet_all_pt
31                         Weight
32                          Label
33                      KaggleSet
34                   KaggleWeight

There are two prefixes in the data PRI and DER. PRI means primitives – the values of this parameter are obtained directly from the collider. DER means that data is derived from primitives. PRI and DER parameters are our input features. Except for the PRI_jet_num variables are real. PRI_jet_num is a factor variable taking on four possible values 0, 1, 2, or 3.

The Label parameter has two values ‘s’ for signal and ‘b’ for background. This is our target variable. Signal means that decay of the Higgs boson is observed. Background is some other decay not caused by the Higgs boson.

### Handling Missing Values.

We can see that some of the values in the dataset are equal to -999.0. These values are the missing values. It happens from time to time that for some inputs we either do not observe certain parameters (like with our toy dataset) or the numbers are too noisy to rely on them in which case it is better to label the observations as missing.

There are three potential strategies how to deal with the observations that contain missing data:

1. Drop observations with the missing inputs. However, be cautious as it might happen, for instance, that region with a lot of signal contains a lot of missing values as well.
2. We may replace the missing value with the most probable value. For example we may look at spatially closest points in the input set and take an average of the corresponding parameter that is missing.
3. We may try to split the dataset into two subsets: one subset with all the missing values another one with all the rest of data. Splitting across multiple dimensions might lead to several models each trained independently.

In the code below I simply drop observations with the missing values. It is a Pandas standard to represent missing values as NaN.


data1 = data.replace(-999.0, np.nan)
data1.dropna()



### Data visualisation: histograms and aggregated statistics.

As we will see later, we are mainly interested in exploring an unknown joint probability distribution $F(x, y)$ and conditional distributions $F(x|y=s")$ and $F(x|y=b")$. Obviously we cannot visualize the full joint probability density but can visualize it along any coordinate (in general, any projection on a 1-dimensional subspace of $X$.

For example, let’s compare the conditional densities of the factor ‘PRI_met_sumet’ between the sample densities conditioned on $y=s"$ versus $y=b"$:


signal = data1[data1.Label=='s']
background = data1[data1.Label=='b']
plt.figure(figsize = (20,10))
plt.hist(signal['PRI_met_sumet'], bins=100, normed = True, alpha=0.5, facecolor='green', label='signal')
plt.hist(background['PRI_met_sumet'], bins=100, normed = True, facecolor='blue', alpha=0.5, label='background')
plt.legend(loc='upper right')
plt.show()



Along with the comparisons between the groups it is also useful to compute aggregated stats of the data: minimal and maximal values, mean, median, standard deviation etc.

### Normalization of Data.

It is often useful to perform data preprocessing before applying machine learning models. The simplest normalization method is to put all the variables on the same scale. The simplest normalization is achieved by the transformation

$x \to (x-\overline{x})/std(x),$

where $\overline{x}$ is a sample mean and $std(x)$ is a sample standard deviation. This transformation cancels all the means and puts ones on the diagonal of the covariance matrix of $x$. In particular, this suggests that normalization above should be done after the rotation of data to the basis of eigenvectors of covariance matrix. We will explore this in greater detail below.

Another useful transformation is a log-transform of positive random variables.

Finally, the third useful transformation is uniformization. Given a sample $(x_1, x_2, \ldots, x_N)$, denote by $\overline{F}$ an empirical cdf of $x$. It is easy to show that $\overline{F}(x)$ is uniformly distributed on $[0,1]$.

Note, that none of the transformations above is an isometry and the model trained post transformation will be different from the model that can be trained using an original dataset. So, why do we want to do these transformations?

1. Sometimes our models are sensitive to the magnitude of values along some axis ignoring all the other. For example, in the PCA we may set a threshold to accept certain eigenvalue for dimensionality reduction. In this case most of the dimensions will be ignored except for the one.
2. In generalized linear models some of the data transformations may lead to Gaussian errors which is a much wanted feature.
3. In the models that are being trained by the local search after the normalization we may start from a better point compared to the initial one to guarantee faster convergence.

### Dimensionality Reduction.

It is always beneficial for a model if we drop the dimension of the feature space by removing some of the features without loss of information. For instance, if we know there is a linear relationship between the input variables we may always drop one per relationship because it bears no information.

For example,

“From very fundamental physics principles, all the distributions should be invariant for a rotation around the z axis (the direction along which the two protons are coming), and invariant for a symmetry wrt x-y plane.

This should mean that:
– if PRI_tau_eta <0, then PRI_tau_eta, PRI_lep_eta, PRI_jet_leading_eta, PRI_jet_subleading_eta can all be multiplied by -1 (unless -999 of course) , thus reducing by a factor two the volume of the feature space…”

That is why expert knowledge might be very important for modelling because it allows to make it much easier for a machine to train.

Sometimes we are OK reducing the dimension of the feature space even at the cost of information loss. One of the most widely spread methods is the principal component analysis (PCA for short).

We start with the covariance matrix $\Sigma$ of $x$. This matrix is a non-negative symmetric matrix. In particular, it has an orthogonal basis of non-negative eigenvalues. PCA sorts these eigenvalues in a descending order and selects the top few of them. All the feature vectors are then projected to the linear subspace of the feature space spanned by the corresponding eigenvectors. The idea is that if we pick enough top eigenvalues for this projection we will not loose much relevant information as the remaining eigenvalues and their corresponding eigenvectors will be the noise.

To illustrate how the PCA works in practice let’s look at the scatterplot of ‘DER_mass_MMC’ vs ‘DER_mass_vis’. First, compute the eigenvectors and eigenvalues of the covariance matrix.


cov_mat = np.cov(eta_data_log.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)


Eigenvectors
[[-0.73679439 -0.67611688]
[ 0.67611688 -0.73679439]]

Eigenvalues
[ 0.01941171  0.27251878]

and plot the scatterplot with the eigenvectors on it:


m = eta_data_log.mean()
plt.figure(figsize = (20,20))
plt.scatter(x = eta_data_log['DER_mass_MMC'], y = eta_data_log['DER_mass_vis'])
from_x = m[0]
from_y = m[1]
to1_x = m[0]+eig_vecs[0,0]
to1_y = m[1]+eig_vecs[1,0]
to2_x = m[0]+eig_vecs[0,1]
to2_y = m[1]+eig_vecs[1,1]

plt.plot([from_x, to2_x], [from_y, to2_y], color = 'r')
plt.plot([from_x, to1_x], [from_y, to1_y], color = 'r')
plt.show()



We may now project the two-dimensional array onto a one-dimensional eigenspace that corresponds to a second eigenvalue of $\Sigma$.