Lecture I – Exploratory Data Analysis

I will talk about the supervised learning. In short the supervised learning is a problem or task of estimation of an unknown function $f$ that maps some input $x$ into some output $y$ using labeled training examples $(x_i, y_i)$. Two main problems that we usually solve in the context of supervised learning are classification and regression.

Classification: Given the set of labeled training examples $(x_i, y_i)$ with $y_i$ from the set of $\{0, 1\}$ learn how to assign a label $0$ or $1$ to a given input $x$

Regression: Loosely speaking regression is about estimating the value of an unknown real-valued function. This is not always true as this “function” might not even exist.

Some notable examples of the supervised learning problems are linear regression, logistic regression, discriminant analysis, support vector machines, neural networks and many other methods.

Every machine learning task starts from the exploratory data analysis. Up to 80% of the data scientist’s time is spent analyzing data and this is one of the main non systematic tasks it needs to do. We will use Higgs Boson Machine Learning Challenge dataset as our toy dataset.

“The ATLAS experiment and the CMS experiment recently claimed the discovery of the Higgs boson. The discovery was acknowledged by the 2013 Nobel prize in physics given to Franc¸ois Englert and Peter Higgs. This particle was theorized almost 50 years ago to have the role of giving mass to other elementary particles. It is the final ingredient of the Standard Model of particle physics, ruling subatomic particles and forces. The experiments are running at the Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research), Geneva, which began operating in 2009 after about 20 years of design and construction, and which will continue operating for at least the next 10 years…

…In the LHC, proton bunches are accelerated on a circular trajectory in both directions. When these bunches cross in the ATLAS detector, some of the protons collide, producing hundreds of millions of proton-proton collisions per second. The up to hundreds of particles resulting from each bunch crossing (called an event) are detected by sensors, producing a sparse vector of about a hundred thousand dimensions (roughly corresponding to an image or speech signal in classical machine learning applications). From this raw data, the type, the energy, and the 3D direction of each particle are estimated. In the last step of this feature construction phase, this variable length list of four-tuples is digested into a fixed-length vector of features containing up to tens of real-valued variables. Some of these variables are first used in a real-time multi-stage cascade classifier (called the trigger) to discard most of the uninteresting events (called the background). The selected events (roughly four hundred per second) are then written on disks by a large CPU farm, producing petabytes of data per year. The saved events still, in large majority, represent known processes (called background): they are mostly produced by the decay of particles which are exotic in everyday terms, but known, having been discovered in previous generations of experiments. The goal of the offline analysis is to find a (not necessarily connected) region in the feature space in which there is a significant excess of events (called signal) compared to what known background processes can explain. Once the region has been fixed, a statistical (counting) test is applied to determine the significance of the excess. If the probability that the excess has been produced by background processes falls below a limit, the new particle is deemed to be discovered…”

[Claire Adam-Bourdarios et al., Learning to discover: the Higgs boson machine learning challenge, 2014]

Loading the data

The data for the kaggle challenge is available free on the CERN OpenData portal here. It consists of a .csv file 186 Mb size and contains ~820,000 observations of proton-proton collision events.


data = pd.read_csv('atlas-higgs-challenge-2014-v2.csv')



The data has the following list of columns in it:

0                         EventId
1                    DER_mass_MMC
2     DER_mass_transverse_met_lep
3                    DER_mass_vis
4                        DER_pt_h
5            DER_deltaeta_jet_jet
6                DER_mass_jet_jet
7             DER_prodeta_jet_jet
8              DER_deltar_tau_lep
9                      DER_pt_tot
10                     DER_sum_pt
11           DER_pt_ratio_lep_tau
12         DER_met_phi_centrality
13         DER_lep_eta_centrality
14                     PRI_tau_pt
15                    PRI_tau_eta
16                    PRI_tau_phi
17                     PRI_lep_pt
18                    PRI_lep_eta
19                    PRI_lep_phi
20                        PRI_met
21                    PRI_met_phi
22                  PRI_met_sumet
23                    PRI_jet_num
24             PRI_jet_leading_pt
25            PRI_jet_leading_eta
26            PRI_jet_leading_phi
27          PRI_jet_subleading_pt
28         PRI_jet_subleading_eta
29         PRI_jet_subleading_phi
30                 PRI_jet_all_pt
31                         Weight
32                          Label
33                      KaggleSet
34                   KaggleWeight

There are two prefixes in the data PRI and DER. PRI means primitives – the values of this parameter are obtained directly from the collider. DER means that data is derived from primitives. PRI and DER parameters are our input features. Except for the PRI_jet_num variables are real. PRI_jet_num is a factor variable taking on four possible values 0, 1, 2, or 3.

The Label parameter has two values ‘s’ for signal and ‘b’ for background. This is our target variable. Signal means that decay of the Higgs boson is observed. Background is some other decay not caused by the Higgs boson.

Handling Missing Values.

We can see that some of the values in the dataset are equal to -999.0. These values are the missing values. It happens from time to time that for some inputs we either do not observe certain parameters (like with our toy dataset) or the numbers are too noisy to rely on them in which case it is better to label the observations as missing.

There are three potential strategies how to deal with the observations that contain missing data:

1. Drop observations with the missing inputs. However, be cautious as it might happen, for instance, that region with a lot of signal contains a lot of missing values as well.
2. We may replace the missing value with the most probable value. For example we may look at spatially closest points in the input set and take an average of the corresponding parameter that is missing.
3. We may try to split the dataset into two subsets: one subset with all the missing values another one with all the rest of data. Splitting across multiple dimensions might lead to several models each trained independently.

In the code below I simply drop observations with the missing values. It is a Pandas standard to represent missing values as NaN.


data1 = data.replace(-999.0, np.nan)
data1.dropna()



Data visualisation: histograms and aggregated statistics.

As we will see later, we are mainly interested in exploring an unknown joint probability distribution $F(x, y)$ and conditional distributions $F(x|y=s")$ and $F(x|y=b")$. Obviously we cannot visualize the full joint probability density but can visualize it along any coordinate (in general, any projection on a 1-dimensional subspace of $X$.

For example, let’s compare the conditional densities of the factor ‘PRI_met_sumet’ between the sample densities conditioned on $y=s"$ versus $y=b"$:


signal = data1[data1.Label=='s']
background = data1[data1.Label=='b']
plt.figure(figsize = (20,10))
plt.hist(signal['PRI_met_sumet'], bins=100, normed = True, alpha=0.5, facecolor='green', label='signal')
plt.hist(background['PRI_met_sumet'], bins=100, normed = True, facecolor='blue', alpha=0.5, label='background')
plt.legend(loc='upper right')
plt.show()



Along with the comparisons between the groups it is also useful to compute aggregated stats of the data: minimal and maximal values, mean, median, standard deviation etc.

Normalization of Data.

It is often useful to perform data preprocessing before applying machine learning models. The simplest normalization method is to put all the variables on the same scale. The simplest normalization is achieved by the transformation

$x \to (x-\overline{x})/std(x),$

where $\overline{x}$ is a sample mean and $std(x)$ is a sample standard deviation. This transformation cancels all the means and puts ones on the diagonal of the covariance matrix of $x$. In particular, this suggests that normalization above should be done after the rotation of data to the basis of eigenvectors of covariance matrix. We will explore this in greater detail below.

Another useful transformation is a log-transform of positive random variables.

Finally, the third useful transformation is uniformization. Given a sample $(x_1, x_2, \ldots, x_N)$, denote by $\overline{F}$ an empirical cdf of $x$. It is easy to show that $\overline{F}(x)$ is uniformly distributed on $[0,1]$.

Note, that none of the transformations above is an isometry and the model trained post transformation will be different from the model that can be trained using an original dataset. So, why do we want to do these transformations?

1. Sometimes our models are sensitive to the magnitude of values along some axis ignoring all the other. For example, in the PCA we may set a threshold to accept certain eigenvalue for dimensionality reduction. In this case most of the dimensions will be ignored except for the one.
2. In generalized linear models some of the data transformations may lead to Gaussian errors which is a much wanted feature.
3. In the models that are being trained by the local search after the normalization we may start from a better point compared to the initial one to guarantee faster convergence.

Dimensionality Reduction.

It is always beneficial for a model if we drop the dimension of the feature space by removing some of the features without loss of information. For instance, if we know there is a linear relationship between the input variables we may always drop one per relationship because it bears no information.

For example,

“From very fundamental physics principles, all the distributions should be invariant for a rotation around the z axis (the direction along which the two protons are coming), and invariant for a symmetry wrt x-y plane.

This should mean that:
– that the 5 variables PRI_tau_phi, PRI_lep_phi, PRI_met_phi, PRI_jet_leading_phi, PRI_jet_subleading_phi can be replaced by the four variables PRI_lep_phi-PRI_tau_phi, PRI_met_phi-PRI_tau_phi, PRI_jet_leading_phi-PRI_tau_phi, PRI_jet_subleading_phi-PRI_tau_phi (calculation to be made modulo 2pi, and PRI_jet_leading_phi and PRI_jet_subleading_phi should not be used if -999.)
– if PRI_tau_eta <0, then PRI_tau_eta, PRI_lep_eta, PRI_jet_leading_eta, PRI_jet_subleading_eta can all be multiplied by -1 (unless -999 of course) , thus reducing by a factor two the volume of the feature space…”

That is why expert knowledge might be very important for modelling because it allows to make it much easier for a machine to train.

Sometimes we are OK reducing the dimension of the feature space even at the cost of information loss. One of the most widely spread methods is the principal component analysis (PCA for short).

We start with the covariance matrix $\Sigma$ of $x$. This matrix is a non-negative symmetric matrix. In particular, it has an orthogonal basis of non-negative eigenvalues. PCA sorts these eigenvalues in a descending order and selects the top few of them. All the feature vectors are then projected to the linear subspace of the feature space spanned by the corresponding eigenvectors. The idea is that if we pick enough top eigenvalues for this projection we will not loose much relevant information as the remaining eigenvalues and their corresponding eigenvectors will be the noise.

To illustrate how the PCA works in practice let’s look at the scatterplot of ‘DER_mass_MMC’ vs ‘DER_mass_vis’. First, compute the eigenvectors and eigenvalues of the covariance matrix.


cov_mat = np.cov(eta_data_log.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)


Eigenvectors
[[-0.73679439 -0.67611688]
[ 0.67611688 -0.73679439]]

Eigenvalues
[ 0.01941171  0.27251878]

and plot the scatterplot with the eigenvectors on it:


m = eta_data_log.mean()
plt.figure(figsize = (20,20))
plt.scatter(x = eta_data_log['DER_mass_MMC'], y = eta_data_log['DER_mass_vis'])
from_x = m[0]
from_y = m[1]
to1_x = m[0]+eig_vecs[0,0]
to1_y = m[1]+eig_vecs[1,0]
to2_x = m[0]+eig_vecs[0,1]
to2_y = m[1]+eig_vecs[1,1]

plt.plot([from_x, to2_x], [from_y, to2_y], color = 'r')
plt.plot([from_x, to1_x], [from_y, to1_y], color = 'r')
plt.show()



We may now project the two-dimensional array onto a one-dimensional eigenspace that corresponds to a second eigenvalue of $\Sigma$.