zebu: Local Association Measures

Abstract

Association measures can be local or global. Local association measures quantify the association between specific values of random variables (e.g. chi-squared residuals). Global association measures yield a single value used to summarize the association for all values taken by random variables (e.g. chi-squared). Classical data analysis has focused on global association and overlooked local association. Consequently, software presently available only allows computation of global association measures. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association.
The zebu R package allows estimation of local association measures and implements local association subgroup analysis. It is of interest to a wide range of scientific disciplines such as health and computer sciences and can be used by anyone with a basic knowledge of the R language. It is available in the CRAN and its source code is available at https://github.com/oliviermfmartin/zebu.
Keywords: measure of association, statistical independence, local association, pointwise mutual information, Lewontin’s D, Ducher’s Z, chi-squared residuals.

Summary

Introduction
Background on Association and Independence
Local Association Measures
User’s Guide - An Example with Simulated Data
Future Research and Development
References

Introduction

Association measures can be local or global (Van de Cruys 2011). Local association measures quantify the association between specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-squared residuals that are computed when constructing a chi-squared test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-squared statistic, the sum of squared residuals (Sheskin 2007).

Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-squared residuals is uncommon practice when compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association (Anselin 1995). Accordingly, analysis of association should not limit itself to the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other when one takes values above a certain critical level. In this case, local association measures allow pinpointing values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient’s sensibility to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive than for those that are resistant to the drug.

The rest of the vignette is organized as follows. We first give the reader the necessary intuition and mathematical background about global and local associations. This leads to the description of chi-square residuals, Lewontin’s $D$ (Lewontin 1964), Ducher’s $Z$ (Ducher et al. 1994), and pointwise mutual information (Van de Cruys 2011). We also introduce a multivariate and normalized measure of local association. Subsequently, we illustrate the usage of local association measures using the zebu R package. The vignette ends with a discussion about future development and research.

Background on Association and Independence

Throughout the vignette, we will suppose that all random variables are discrete and write them in capital letters, such as $A$ and $B$. Lower letters, such as $a$ and $b$, will denote possible values taken by these random variables (i.e. events).

One way to think about a statistical association is as events co-occurring. For example, if event $a$ always occurs with event $b$, then these events are said to be associated. An intuitive measure of association could be the joint probability: $p(a, b)$, the long-term frequency of events showing up together. However, this measure fails if $a$ or $b$ is a rare event. Indeed, joint probabilities are always as small as its individual events are rare: $p(a, b) \leq \min p(a), p(b)$. As a consequence, it is necessary to compare observed probabilities $p(a, b)$ to expected probabilities in which the variables are considered independent. The expected probability, if events are independent, is the factor of marginalized probabilities of events: $p(a) \, p(b)$. Independence is then defined by the following mathematical relation, $p(a, b) = p(a) \, p(b)$, and local association measures are defined to be equal to zero.

Independence implies that knowing one or more variables does not give us any information about the others. This is what we are not interested in. It is, however, possible to define two cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence is defined as events showing up more often than expected: $p(a, b) > p(a) \, p(b)$ and local association measures are positive. Mutual exclusivity is defined as events showing up less often than expected: $p(a, b) < p(a) \, p(b)$ and local association measures are negative.

Statistical independence is, however, not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance such as Pearson’s r. These former measures are parametric and suppose linear or at least monotone relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.

Local Association Measures

Derivation of Bivariate Forms

For two random variables, $A$ and $B$, we can estimate the local association for each combination of events $A = a$ and $B = b$. This is accomplished by comparing the observed from the expected probability of events $a$ and $b$. If these probabilities are equal, then events $a$ and $b$ are independent. If not, these events are associated; the sign of the measure indicates the orientation of the relationship, and the absolute value indicates its strength.

There are different measures to compare observed and expected probabilities, for example, by using subtraction and division. Hereunder, we define the difference or Lewontin’s $D$ (Lewontin 1964) and the pointwise mutual information $pmi$ (Van de Cruys 2011). To simplify notation, and to show similarities between local association measures, we define $h(a) = - \log p(a)$ as the self-information of $a$.

\[ \begin{aligned} D(a, b) & = p(a, b) - p(a) \, p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned} \]

The bounds of these two measures depend on the frequency of events, which makes it difficult to compare local association values for different combinations of events. For this reason, it is desirable to express association relative to the frequency of events. One common way to do this is using chi-squared residuals $r_{\chi}$ as follows where $N$ is the sample size.

\[ r_{\chi}(a,b) = \sqrt{N} \; \frac{p(a, b) - p(a) \, p(b)}{\sqrt{p(a) \, p(b)}} \]

We may wish to normalize local association so that values are between -1 and 1 included. This can be done by using dividing the non-normalized values by their minimal or maximal values. To identify the theoretical minimal and maximal values of $D$, we will find the bounds of the observed bivariate probability $p(a, b)$ as a function of the marginal probabilities $p(a)$ and $p(b)$.

Using the inclusion-exclusion principle, we know that:

\[ p(a, b) = p(a \cap b) = p(a) + p(b) - p(a \cup b) \]

The intersection probability $p(a, b)$ will be maximized when the union probability $p(a \cup b)$ is equal to zero and minimized when the union probability will be equal to one.

\[ p(a) + p(b) - 1 \le p(a, b) \le p(a) + p(b) \]

Given the intersection probability can not be smaller than zero and can not be larger than the smallest marginal probability, we have:

\[ \max[0, \, p(a) + p(b) - 1] \le p(a, b) \le min[p(a), \, p(b)] \]

Using this result, we can divide $D$ by its theoretical minimal or maximal value which leads to Lewontin’s $D'$ (Lewontin 1964). However, this removes the sign of the association. To preserve the sign, in the case where Lewontin’s $D$ is negative, we divide it by the negative theoretical minimal value. We call this measure Ducher’s $Z$. It should however be noted that the original definition of Ducher’s $Z$ did not probably consider the lower bound of the intersection probability as we do here (Ducher et al. 1994).

\[ Z(a, b) = \begin{cases} \frac{ p(a, b) - p(a) \, p(b) }{ \min[p(a), \, p(b)] - p(a) \, p(b) } & D(a, b) > 0 \\ \\ \frac{ p(a, b) - p(a) \, p(b) }{p(a) p(b) - \max[0, \, p(a) + p(b) - 1]} & D(a, b) < 0 \\ \\ 0 & D(a, b) = 0 \end{cases} \]

Normalization of case of $pmi$ is more subtle because $pmi(a, b)$ tends to $\infty$ when $p(a, b)$ tends to 0. Nonetheless, dividing $pmi(a, b)$ by $- h(a, b)$ solves this problem by making $npmi(a, b)$ tend to -1 when $p(a, b)$ tends to 0 and equal to 1 when $p(a, b) = \min[p(a), p(b)]$ (Bouma 2009).

\[ npmi(a, b) = \frac{ pmi(a, b) }{- h(a, b) } = \frac{ h(a, b) - h(a) - h(b) }{ h(a, b) } \]

The zebu package includes a function called lassie allowing estimation of Lewontin’s $D$, Ducher’s $Z$, $pmi$, $npmi$, and $r_{\chi}$.

Global Association

Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum for all events of their observed probability times their pointwise mutual information. Most global association measures in zebu are defined likewise.

\[ \begin{aligned} GD(A, B) &= \sum_{a, b} p(a, b) D(a, b) \\ GZ(A, B) &= \sum_{a, b} p(a, b) Z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned} \]

The global association measure related to chi-squared residuals is the chi-squared $\chi^2$. It is defined as the sum of its squared residuals.

\[ \chi^2 = \sum_{a, b} r_{\chi}(a,b)^2 \]

Statistical Significance Tests

Distinguishing the strength of association from its statistical significance is important. Indeed, a strong association can be non-significant (e.g. some physical law with small sample size) and a weak association can be significant (e.g. epidemiological risk factor with big sample size). Significance can be accessed using p-values estimated using the theoretical null distribution or by resampling techniques (Sheskin 2007). Because the theoretical null distribution of local association measures is unknown, the zebu package resorts to estimating p-values by a permutation test. This can be undertaken using the permtest function of the package.

The null hypothesis $H_0$ being tested is that the association measure $L$ is equal to 0, that is, there is no association. The observed association is $L_{obs}$ and the permuted associations are denoted by the set $L_{perm}$. Moreover, we write $\#(\ldots)$ as the number of times and $|\ldots|$ as the absolute value. The two-sided p-value can then be estimated as follows.

\[ p = \frac{\#(|L_{obs}| < |L_{perm}|)}{\#(L_{perm})} \]

With chi-squared residuals in a two-dimensional setting, an alternative analytical method for estimating the p-values can be employed as implemented by the chisqtest function. This function calculates two-sided p-values for each local chi-squared residual, assuming that these residuals are distributed according to a standard normal distribution. Thus, the two-sided p-value $p$ can be computed as follows where $\Phi(\ldots)$ is the cumulative distribution function of the standard normal distribution, and $L_{obs}$ denotes the observed chi-squared residuals:

\[ p = 2 \times (1 - \Phi(|L_{obs}|)) \]

As these local association measures involve conducting multiple statistical tests, it is advisable to apply corrections for multiple testing, such as the method proposed by Benjamini and Hochberg.

Derivation of Multivariate Forms

Multivariate association measures may help identify complex association relationships that cannot be detected only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually (Jakulin and Bratko 2003). The association is only revealed when the two inputs and the output are taken together.

To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for $M$ random variables $X_1, \ldots, X_M$, independence is defined by: $p(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i)$. We can thus define the following measures.

\[ \begin{aligned} D(x_1, \ldots, x_M) & = p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) \\ pmi(x_1, \ldots, x_M) & = - [h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) ] \end{aligned} \]

By dividing $D(x_1, \ldots, x_M)$ by the expected probability we obtain multivariate chi-squared residuals

\[ r_{\chi}(x_1, \ldots, x_M) = \sqrt{N} \; \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \sqrt{\prod_{i=1}^{M} p(x_i)} } \]

To obtain a multivariate measure of Ducher’s $Z$, we need to find the bounds of the observed probability $p(x_1, \ldots, x_M)$. We know that the upper bound will be the minimal marginal probability. Additionally, we find a formula to express the lower bound (proof at the end of the section). This leads to the following bounds.

\[ \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \le p(x_1, \ldots, x_M) \le \min[x_1, \ldots, x_M]$. \]

We thus propose the following multivariate form of Ducher’s $Z$.

\[ Z(x_1, \ldots, x_M) = \begin{cases} \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \min[p(x_1), \ldots, p(x_M)] - \prod_{i=1}^{M} p(x_i) } & D(x_1, \ldots, x_M) > 0 \\ \\ \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{\prod_{i=1}^{M} p(x_i)- \max[0, - M - 1 + \sum_{i=1}^M p(x_i)]} & D(x_1, \ldots, x_M) < 0 \\ \\ 0 & D(x_1, \ldots, x_M) = 0 \end{cases} \]

For pointwise mutual information, the normalization technique suggested by Bouma (2009) is not bounded by 1 for more than two variables. To solve this, we suggest the following normalization scheme which we call $npmi_2$.

\[ npmi_2(x_1, \ldots, x_M) = \begin{cases} \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{ \min[h(x_1), \ldots, h(x_M)] - \sum_{i=1}^{M} h(x_i) } & pmi(x_1, \ldots, x_M) > 0 \\ \\ \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{h(x_1, \ldots, x_M)} & pmi(x_1, \ldots, x_M) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_M) = 0 \end{cases} \]

Proof of lower bound formula

Using induction and the inclusion-exclusion principle we give a formula for the lower bound of the observed intersection probability of $M$ events.

\[ \min[ p(x_1, \ldots, x_M) ] = \max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \]

We first show that this is true for the base case $M=2$. In this case, $p(x_1) + p(x_2) -1 \le p(x_1, x_2)$. We proved this using the inclusion-exclusion principle in the section where we derive a bivariate form of Ducher’s Z.

We now show that the induction step is true: let’s assume that this formula is true for $M$ variables. For $M+1$ variables, the inclusion-exclusion principle tells us that:

\[ \begin{align} p(x_1, \ldots, x_{M+1}) &= p(\{\cap_{i=1}^M x_i \} \cap x_{M+1}) \\ &= p(\cap_{i=1}^M x_i) + p(x_{M+1}) - p(\{\cap_{i=1}^M x_M \} \cup x_{M+1}) \\ \end{align} \]

We assumed that the lower bound $p(\cap_{i=1}^M x_i)$ is $-M - 1 + \sum_{i=1}^M p(x_i)$ and we know that the upper bound of $p(\{\cap_{i=1}^M x_M \} \cup x_{M+1})$ is one. Replacing these values in the last line leads to

\[ \min[ p(x_1, \ldots, x_{M+1}) ] = -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i) \]

However, given that probabilities can not be lower than zero, we can write the following equation which completes the proof.

\[ \min[ p(x_1, \ldots, x_{M+1}) ] = \max[0, -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i)] \]

User’s Guide - An Example with Simulated Data

Once R is installed, the first step is to install the zebu package. You can install the released version from CRAN

install.packages("zebu")

We can then load the zebu R package.

library(zebu)

Simulating the Dataset

To illustrate local association measures and the zebu package, we’ll simulate a small culinary dataset.

Each row corresponds to a client of a restaurant. We record the choices made by the client. There are three choices for each of the plates: starters, main dish, and dessert.

The clients take the starter with equal probability. The choice of the following dish depends only on the previous dish. The clients tend to avoid a dish with an ingredient that had in the previous dish. We define these probabilities hereunder.

starter_prob <- c("Tomato Mozzarella Salad" = 1/3, "Rice Tuna Salad" = 1/3, "Lentil Salad" = 1/3)
starter_prob

Tomato Mozzarella Salad         Rice Tuna Salad            Lentil Salad 
              0.3333333               0.3333333               0.3333333

main_given_starter_prob <- matrix(c(5/11, 1/11, 5/11,
                                    5/11, 5/11, 1/10,
                                    1/11, 5/11, 5/11),
                                  3, 3, byrow = TRUE)
rownames(main_given_starter_prob) <- names(starter_prob)
colnames(main_given_starter_prob) <- c("Sausage and Lentil Stew", "Pizza Margherita", "Pilaf Rice")
main_given_starter_prob

                        Sausage and Lentil Stew Pizza Margherita Pilaf Rice
Tomato Mozzarella Salad              0.45454545       0.09090909  0.4545455
Rice Tuna Salad                      0.45454545       0.45454545  0.1000000
Lentil Salad                         0.09090909       0.45454545  0.4545455

dessert_given_main <- matrix(c(2/6, 2/6, 2/6,
                               7/12, 1/12, 2/6, 
                               1/12, 7/12, 2/6),
                             3, 3, byrow = TRUE)
rownames(dessert_given_main) <- colnames(main_given_starter_prob)
colnames(dessert_given_main) <- c("Rice Pudding", "Apple Pie", "Fruit Salad")
dessert_given_main

                        Rice Pudding  Apple Pie Fruit Salad
Sausage and Lentil Stew   0.33333333 0.33333333   0.3333333
Pizza Margherita          0.58333333 0.08333333   0.3333333
Pilaf Rice                0.08333333 0.58333333   0.3333333

We now simulate a dataset of 1000 clients given these probabilities.

set.seed(0)
sample_size <- 1000
df <- t(sapply(seq_len(sample_size), function(i) {
  
  starter <- sample(names(starter_prob), size = 1, prob = starter_prob)
  main <- sample(colnames(main_given_starter_prob), size = 1, prob = main_given_starter_prob[starter, ])
  dessert <- sample(colnames(dessert_given_main), size = 1, prob = dessert_given_main[main, ])
  
  c(Starter = starter, Main = main, Dessert = dessert)
}))
df <- as.data.frame(df)

head(df)

                  Starter                    Main      Dessert
1 Tomato Mozzarella Salad Sausage and Lentil Stew  Fruit Salad
2            Lentil Salad              Pilaf Rice    Apple Pie
3 Tomato Mozzarella Salad        Pizza Margherita  Fruit Salad
4            Lentil Salad        Pizza Margherita Rice Pudding
5         Rice Tuna Salad Sausage and Lentil Stew  Fruit Salad
6 Tomato Mozzarella Salad              Pilaf Rice  Fruit Salad

We look at the contingency table.

table(df)

, , Dessert = Apple Pie

                         Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
  Lentil Salad                    77               17                      13
  Rice Tuna Salad                 14               14                      44
  Tomato Mozzarella Salad         81                2                      47

, , Dessert = Fruit Salad

                         Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
  Lentil Salad                    54               48                      17
  Rice Tuna Salad                  7               49                      47
  Tomato Mozzarella Salad         39               11                      58

, , Dessert = Rice Pudding

                         Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
  Lentil Salad                    11               95                      15
  Rice Tuna Salad                  3              101                      52
  Tomato Mozzarella Salad         13               12                      59

Bivariate Association

The local (and global) association between these variables can be estimated using the lassie function. This function takes at least one argument: a data.frame. Columns are selected using the select arguments (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the continuous argument and the number of discretization bins with the breaks argument (as in the cut function). The local association measure that we use here is Ducher’s Z as specified by setting the measure argument equal to "z". We’ll look at the relationship between the main dish and the dessert.

las <- lassie(df, select = c("Main", "Dessert"), measure = "z")

The permtest function accesses the significance of local (and global) association using a permutation test. The number of iterations is specified by nb and the adjustment method of p-values for multiple comparisons by p_adjust (as in the p.adjust function).

las <- permtest(las, 
                nb = 5000, 
                p_adjust = "BH")

The lassie and permtest functions return a lassie S3 object, as well as permtest for permtest. lassie objects can be visualized using the plot and print methods. The plot function returns a heatmap with the local association and p-values displayed between parenthesis.

print(las)

Measure: Ducher's Z
Global: 0.091 (p-value: <1/5000)
                         Dessert
Main                        Apple Pie  Fruit Salad Rice Pudding
  Pilaf Rice               0.38531235  0.006639046 -0.749858716
  Pizza Margherita        -0.69399394 -0.062255796  0.367744192
  Sausage and Lentil Stew -0.04383642  0.027310138 -0.008436162

plot(las)

Alternatively, for two dimensional chi-squared analysis, significance can be estimated using the chisqtest function.

Results can be saved in CSV format using write.lassie. To access the documentation of these functions, please type help("print.lassie"), help("plot.lassie") and help(write.lassie) in the R console.

In this example, we can see that the global association between the two variables is statistically significant. However, we see that there is only a local association between these two variables only for certain combinations. More specifically, knowing that a client took the lentils and sausages does not convey information about what dessert s/he will take, and likewise for the fruit salad. A significant global association can hide a non-significant local association.

Multivariate Association

The number of variables that can be handled in the zebu package is not limited. We can use the lassie function to access the association between the three simulated variables.

In this case, we obtain a multidimensional local association array. Because of this, results cannot be plotted as a tile plot; the plot method is not available. The print method allows visualizing results by melting the array into a data.frame, by default sorted by decreasing local association.

las2 <- lassie(df, measure = "z")
las2 <- permtest(las2, nb = 5000)
print(las2, what_sort = "local_p", decreasing = FALSE)

Measure: Ducher's Z
Global: -0.008 (p-value: 1e+00)
                   Starter                    Main      Dessert       local
2          Rice Tuna Salad              Pilaf Rice    Apple Pie -0.54220571
4             Lentil Salad        Pizza Margherita    Apple Pie -0.54570753
5          Rice Tuna Salad        Pizza Margherita    Apple Pie -0.60779228
6  Tomato Mozzarella Salad        Pizza Margherita    Apple Pie -0.94240428
7             Lentil Salad Sausage and Lentil Stew    Apple Pie -0.65556067
11         Rice Tuna Salad              Pilaf Rice  Fruit Salad -0.78566904
15 Tomato Mozzarella Salad        Pizza Margherita  Fruit Salad -0.70338204
16            Lentil Salad Sausage and Lentil Stew  Fruit Salad -0.57824247
19            Lentil Salad              Pilaf Rice Rice Pudding -0.70631293
20         Rice Tuna Salad              Pilaf Rice Rice Pudding -0.91603179
21 Tomato Mozzarella Salad              Pilaf Rice Rice Pudding -0.62596772
24 Tomato Mozzarella Salad        Pizza Margherita Rice Pudding -0.70420369
25            Lentil Salad Sausage and Lentil Stew Rice Pudding -0.65981754
23         Rice Tuna Salad        Pizza Margherita Rice Pudding  0.20497105
22            Lentil Salad        Pizza Margherita Rice Pudding  0.16908965
1             Lentil Salad              Pilaf Rice    Apple Pie  0.16835345
3  Tomato Mozzarella Salad              Pilaf Rice    Apple Pie  0.19034384
10            Lentil Salad              Pilaf Rice  Fruit Salad  0.07463891
18 Tomato Mozzarella Salad Sausage and Lentil Stew  Fruit Salad  0.07237082
27 Tomato Mozzarella Salad Sausage and Lentil Stew Rice Pudding  0.06433270
8          Rice Tuna Salad Sausage and Lentil Stew    Apple Pie  0.02929618
9  Tomato Mozzarella Salad Sausage and Lentil Stew    Apple Pie  0.04371432
12 Tomato Mozzarella Salad              Pilaf Rice  Fruit Salad  0.02704901
13            Lentil Salad        Pizza Margherita  Fruit Salad  0.02770694
14         Rice Tuna Salad        Pizza Margherita  Fruit Salad  0.03727140
26         Rice Tuna Salad Sausage and Lentil Stew Rice Pudding  0.03439883
17         Rice Tuna Salad Sausage and Lentil Stew  Fruit Salad  0.02932948
     obs        exp           local_p
2  0.014 0.03058142           <1/5000
4  0.017 0.03742083           <1/5000
5  0.014 0.03569537           <1/5000
6  0.002 0.03472480           <1/5000
7  0.013 0.03774250           <1/5000
11 0.007 0.03265977           <1/5000
15 0.011 0.03708474           <1/5000
16 0.017 0.04030752           <1/5000
19 0.011 0.03745483           <1/5000
20 0.003 0.03572781           <1/5000
21 0.013 0.03475636           <1/5000
24 0.012 0.04056846           <1/5000
25 0.015 0.04409398           <1/5000
23 0.101 0.04170236 0.120342857142857
22 0.095 0.04371818            0.1746
1  0.077 0.03205968         0.2203875
3  0.081 0.02974990 0.224258823529412
10 0.054 0.03423849 0.424326315789474
18 0.058 0.03740352 0.424326315789474
27 0.059 0.04091718           0.45117
8  0.044 0.03600221 0.481638461538462
9  0.047 0.03502330 0.481638461538462
12 0.039 0.03177174 0.481638461538462
13 0.048 0.03996399 0.481638461538462
14 0.049 0.03812127 0.481638461538462
26 0.052 0.04206083 0.481638461538462
17 0.047 0.03844896            0.4898

In this case, we see that there is no significant global association. However, we see that for certain combinations of variables, there is a significant local association. A non-significant global association can hide a significant local association.

Future Research and Development

Local association measures are issued from empirical research. Although these have proven their interest in diverse applications, theoretical studies of their mathematical properties are sparse. A more theoretical approach to these measures could be of interest. For example, by determining the theoretical null distribution of these measures. Also, we have assumed mutual exclusivity of events for the multivariate association measures. This assumption may be too stringent for certain variables and usage of other independence models such as conditional independence may prove to be worthwhile.

In zebu, discretization is a necessary step for studying continuous variables. We have restrained ourselves to simple discretization methods: equal-width and user-defined. Other discretization algorithms exist (Dash, Paramguru, and Dash 2011) and may be more adapted for the computation of association measures. Moreover, kernel methods could also be used to handle continuous variables better. Secondly, estimation of probabilities is done from the frequentist maximum-likelihood procedure which requires sufficiently large datasets. Unfortunately, in fields such as health sciences, datasets are sparse. Bayesian estimation methods are more robust to small sample sizes by not relying on asymptomatic assumptions and by allowing integration of prior knowledge (Wilkinson 2007). Such an implementation may also prove to be of interest.

References

Anselin, Luc. 1995. “Local Indicators of Spatial Association—LISA.” Geographical Analysis 27 (2): 93–115.

Bouma, Gerlof. 2009. “Normalized (Pointwise) Mutual Information in Collocation Extraction.” Proceedings of the Biennial GSCL Conference, 31–40.

Dash, Rajashree, Rajib Lochan Paramguru, and Rasmita Dash. 2011. “Comparative Analysis of Supervised and Unsupervised Discretization Techniques.” International Journal of Advances in Science and Technology 2 (3).

Ducher, M., C. Cerutti, M. P. Gustin, and C. Z. Paultre. 1994. “Statistical Relationships Between Systolic Blood Pressure and Heart Rate and Their Functional Significance in Conscious Rats.” Medical and Biological Engineering and Computing 32 (6): 649–55.

Jakulin, Aleks, and Ivan Bratko. 2003. “Analyzing Attribute Dependencies.” In Knowledge Discovery in Databases: PKDD 2003, edited by Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, and Hendrik Blockeel, 229–40. Lecture Notes in Computer Science 2838. Springer Berlin Heidelberg.

Lewontin, R. C. 1964. “The Interaction of Selection and Linkage. I. General Considerations; Heterotic Models.” Genetics 49 (1): 49.

Sheskin, David J. 2007. Handbook of Parametric and Nonparametric Statistical Procedures. 4th ed. Chapman & Hall/CRC.

Van de Cruys, Tim. 2011. “Two Multivariate Generalizations of Pointwise Mutual Information.” In Proceedings of the Workshop on Distributional Semantics and Compositionality, 16–20. DiSCo ’11. Stroudsburg, PA, USA: Association for Computational Linguistics.

Wilkinson, Darren J. 2007. “Bayesian Methods in Bioinformatics and Computational Systems Biology.” Briefings in Bioinformatics 8 (2): 109–16.