This vignette provides a step-by-step tutorial for using the interaction graphs package “integr”. The package implements Aleks Jakulin’s Interaction Analysis methodology (http://stat.columbia.edu/~jakulin/Int/) and is inspired by the implementation in the Orange 2 data mining software (https://orange.biolab.si/).
In the context of supervised machine learning, an interaction (i.e. a statistically relevant dependence) between two attributes \(X\) and \(Y\) in the presence of the context (i.e. class) attribute \(C\) is called a 3-way interaction. The strength of such an interaction is measured with the 3-way Interaction gain: \(I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C) = I(X;Y|C) - I(X;Y)\). Here, \(I(X,Y;C)\) is the Information gain of the joint attribute \((X,Y)\) with respect to \(C\); \(I(X;Y|C) = H(X|C) + H(Y|C) - H(X,Y|C)\) is the conditional Information gain (i.e. conditional Mutual information) between \(X\) and \(Y\) in the context \(C\); and \(I(X;Y) = H(X) + H(Y) - H(X,Y)\) is a measure of dependence (i.e. “correlation”) between \(X\) and \(Y\) regardless of context. \(H(X) = -\sum_{i} P_i \log_{2} P_i\) is Shannon’s entropy measured in bits, where \(P_i\) is the probability of the \(i\)-th value of \(X\). The 2-way Interaction gains of the single attributes \(X\) and \(Y\) are \(I(X;C) = InfoGain_{c}(X) = \sum_{x}\sum_{c}P(x,c)\log_{2}\frac{P(x,c)}{P(x)P(c)}\) and \(I(Y;C) = InfoGain_{c}(Y) = \sum_{y}\sum_{c}P(y,c)\log_{2}\frac{P(y,c)}{P(y)P(c)}\), respectively.
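The definitions above can be checked on a tiny example. The following base R sketch is not taken from the package; the helper names `entropy` and `mutual_info` are hypothetical. It computes the 3-way Interaction gain for a class that is the XOR of two independent binary attributes, a case where neither attribute is informative alone but the pair determines the class.

```r
# Shannon entropy (in bits) of the joint distribution of one or more vectors
entropy <- function(...) {
  p <- as.vector(table(...)) / length(..1)
  p <- p[p > 0]                 # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Mutual information I(A;C) = H(A) + H(C) - H(A,C)
mutual_info <- function(a, cls) entropy(a) + entropy(cls) - entropy(a, cls)

# Toy data: the class is the XOR of two independent binary attributes
x   <- c(0, 0, 1, 1)
y   <- c(0, 1, 0, 1)
cls <- as.integer(xor(x, y))

xy    <- interaction(x, y)      # joint attribute (X,Y)
ig_x  <- mutual_info(x, cls)    # I(X;C)
ig_y  <- mutual_info(y, cls)    # I(Y;C)
ig_xy <- mutual_info(xy, cls)   # I(X,Y;C)

# 3-way Interaction gain: I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C)
interaction_gain <- ig_xy - ig_x - ig_y
print(interaction_gain)         # 1 bit: a purely positive interaction
```

Both 2-way gains are zero here, so all of the information about the class comes from the interaction itself.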
Interaction graphs (Figure 1) are a graphical representation of the \(k\) most significant 3-way interactions (\(2 \leq k \leq 20\)). The graph consists of nodes, which represent the interacting attributes (with their 2-way interaction gains indicated below the name), and weighted edges, which represent the strength of the 3-way interaction. There are two types of edges:
Hence, interaction graphs can be used as a tool for understanding the most important interactions and for selecting attributes suitable for grouping or inclusion in a machine learning model.
In this tutorial, the ‘Golf’ toy-dataset will be used. It is included in the package, and its structure is presented in the table below. It is a 14-row discrete data.frame (i.e. all columns are factors) with 6 attributes, of which 5 are inputs and 1 is the class attribute. The input attributes are used to determine whether a game of golf was played given the conditions, and the decision is recorded in the class attribute:
Outlook | Temperature | Humidity | Windy | Others | Play |
---|---|---|---|---|---|
overcast | hot | high | FALSE | yes | yes |
overcast | cool | normal | TRUE | yes | yes |
overcast | mild | high | TRUE | yes | yes |
overcast | hot | normal | FALSE | yes | yes |
rainy | mild | high | FALSE | yes | yes |
rainy | cool | normal | FALSE | yes | yes |
rainy | cool | normal | TRUE | no | no |
rainy | mild | normal | FALSE | yes | yes |
rainy | mild | high | TRUE | no | no |
sunny | hot | high | FALSE | no | no |
sunny | hot | high | TRUE | no | no |
sunny | mild | high | FALSE | no | no |
sunny | cool | normal | FALSE | yes | yes |
sunny | mild | normal | TRUE | yes | yes |
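As an illustration of the 2-way interaction gains shown on the graph nodes, the table above can be rebuilt by hand in base R and the gain of each input attribute computed directly. This is only a sketch: the package ships its own copy of the data, and the names `golf_tbl`, `entropy` and `info_gain` are hypothetical helpers, not part of integr.

```r
# Rebuild the table above as a data.frame of factors
golf_tbl <- data.frame(
  Outlook     = rep(c("overcast", "rainy", "sunny"), times = c(4, 5, 5)),
  Temperature = c("hot", "cool", "mild", "hot",
                  "mild", "cool", "cool", "mild", "mild",
                  "hot", "hot", "mild", "cool", "mild"),
  Humidity    = c("high", "normal", "high", "normal",
                  "high", "normal", "normal", "normal", "high",
                  "high", "high", "high", "normal", "normal"),
  Windy       = c("FALSE", "TRUE", "TRUE", "FALSE",
                  "FALSE", "FALSE", "TRUE", "FALSE", "TRUE",
                  "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
  Others      = c(rep("yes", 6), "no", "yes", rep("no", 4), "yes", "yes"),
  Play        = c(rep("yes", 6), "no", "yes", rep("no", 4), "yes", "yes"),
  stringsAsFactors = TRUE
)

# Shannon entropy (in bits) of the joint distribution of one or more factors
entropy <- function(...) {
  p <- as.vector(table(...)) / length(..1)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# 2-way interaction gain I(A;C) = H(A) + H(C) - H(A,C)
info_gain <- function(a, cls) entropy(a) + entropy(cls) - entropy(a, cls)

inputs <- setdiff(names(golf_tbl), "Play")
gains  <- sapply(golf_tbl[inputs], info_gain, cls = golf_tbl$Play)
print(round(gains, 4))
```

Because the `Others` column is identical to `Play` in this table, its gain equals the full entropy of the class attribute.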
First, the ‘integr’ package and a dataset need to be loaded. The dataset must be discrete and must have a class attribute. Here the ‘Golf’ toy-dataset will be used:
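A minimal sketch of this step is given here. It assumes the bundled dataset is exposed under the name `golf` (the name used in the calls later in this tutorial), and it is wrapped in a guard so that it does nothing when the package is not installed.

```r
# Guarded so the snippet is a no-op when integr is not installed
if (requireNamespace("integr", quietly = TRUE)) {
  library(integr)
  # Assumption: the toy dataset is stored in the package under the name "golf"
  data("golf", package = "integr")
  str(golf)   # inspect the columns; all of them should be factors
}
```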
When the data is loaded, an interaction graph object needs to be created. A data.frame containing the data needs to be provided, as well as the name of the class attribute as a string:
#create an Interaction graph object
g <- interactionGraph(golf, classAtt = "Play", intNo = 10, speedUp = FALSE)
The additional parameters intNo (integer) and speedUp (boolean) are optional. intNo sets the number of interactions to be displayed on the interaction graph (2 <= intNo <= 20, default 16). speedUp controls whether attributes whose 2-way interaction gain is equal to zero (to the 4th decimal place) are pruned before the interactions are computed; this speeds up computation for larger datasets but can lead to less precise results, so it is turned off (i.e. set to FALSE) by default.
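The pruning rule can be pictured with a short base R sketch. The gain values below are made up for illustration, and reading “zero to the 4th decimal place” as rounding to four decimals is an assumption about the package's behaviour, not its documented implementation.

```r
# Hypothetical 2-way interaction gains (in bits) for four attributes
gains <- c(A = 0.3120, B = 0.00004, C = 0.0121, D = 0)

# Attributes whose gain rounds to zero at the 4th decimal would be pruned
pruned <- names(gains)[round(gains, 4) == 0]
kept   <- setdiff(names(gains), pruned)

print(pruned)   # "B" "D"
print(kept)     # "A" "C"
```

An attribute with zero 2-way gain can still take part in a strong 3-way interaction (a class defined as the XOR of two attributes is the classic case), which is why pruning may cost some precision.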
If the intNo parameter is set to an inappropriate value (i.e. < 2, > 20, or larger than the theoretically possible number of interactions for the given dataset), it is automatically adjusted to fit and a warning message is printed.
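The adjustment can be sketched as a simple clamp. The helper `clamp_int_no` below is hypothetical and not part of integr; it assumes that the theoretically possible number of interactions is the number of distinct pairs of input attributes, and that this limit takes precedence over the lower bound of 2.

```r
# Hypothetical helper, not part of integr: clamp a requested number of interactions
clamp_int_no <- function(intNo, nInputs, lower = 2, upper = 20) {
  maxPairs <- choose(nInputs, 2)   # assumption: one interaction per attribute pair
  adjusted <- min(max(intNo, lower), upper, maxPairs)
  if (adjusted != intNo) {
    warning("intNo adjusted from ", intNo, " to ", adjusted)
  }
  adjusted
}

# The Golf data has 5 input attributes, hence choose(5, 2) = 10 possible pairs
print(suppressWarnings(clamp_int_no(16, nInputs = 5)))   # 10
print(clamp_int_no(10, nInputs = 5))                     # 10, no warning
```

Under this assumption, intNo = 10 is the largest meaningful request for the Golf data.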
After the interaction graph object has been obtained, it can be plotted using plotIntGraph():
It only requires an interaction graph object as input. Here the result of the previous step is used.
The result of this command is Figure 1.
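Put together, the steps up to this point might look like the following sketch. It uses only the calls named in this tutorial, assumes the bundled dataset is available as `golf`, and is guarded so that it does nothing when the package is not installed.

```r
# Guarded so the snippet is a no-op when integr is not installed
if (requireNamespace("integr", quietly = TRUE)) {
  library(integr)
  # Assumption: the toy dataset is stored in the package under the name "golf"
  data("golf", package = "integr")

  # Build the interaction graph object for the class attribute "Play"
  g <- interactionGraph(golf, classAtt = "Play", intNo = 10, speedUp = FALSE)

  # Plot it; the only required input is the interaction graph object
  plotIntGraph(g)
}
```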
The integr package allows interaction graphs to be exported to a file. The supported formats are: Graphviz graph, SVG image, PNG image, PostScript (PS) file, and PDF. The code for exporting to each format is provided below.
#export an Interaction graph object to a Graphviz file
igToGrViz(g, path = "myFolder", fName = "myInteractionGraph")
g is the interaction graph object;
path is a string indicating the path (folder) in which the output should be saved;
fName is a string indicating the name of the output file. It should be given without an extension and without spaces. Defaults to ‘InteractionGraph’.
#export an Interaction graph object to an SVG image
igToSVG(g, path = "myFolder", fName = "myInteractionGraph", h = 2000)
g is the interaction graph object;
path is a string indicating the path (folder) in which the output should be saved;
fName is a string indicating the name of the output file. It should be given without an extension and without spaces. Defaults to ‘InteractionGraph’;
h is the desired height of the output image in pixels. Defaults to 2000.
#export an Interaction graph object to a PNG image
igToPNG(g, path = "myFolder", fName = "myInteractionGraph", h = 2000)
g is the interaction graph object;
path is a string indicating the path (folder) in which the output should be saved;
fName is a string indicating the name of the output file. It should be given without an extension and without spaces. Defaults to ‘InteractionGraph’;
h is the desired height of the output image in pixels. Defaults to 2000.
#export an Interaction graph object to a PDF file
igToPDF(g, path = "myFolder", fName = "myInteractionGraph", h = 2000)
g is the interaction graph object;
path is a string indicating the path (folder) in which the output should be saved;
fName is a string indicating the name of the output file. It should be given without an extension and without spaces. Defaults to ‘InteractionGraph’;
h is the desired height of the output image in pixels. Defaults to 2000.
#export an Interaction graph object to a PostScript (PS) file
igToPS(g, path = "myFolder", fName = "myInteractionGraph", h = 2000)
g is the interaction graph object;
path is a string indicating the path (folder) in which the output should be saved;
fName is a string indicating the name of the output file. It should be given without an extension and without spaces. Defaults to ‘InteractionGraph’;
h is the desired height of the output image in pixels. Defaults to 2000.
See http://stat.columbia.edu/~jakulin/Int/ for more details on the methodology.