Kickstarting R - Formulae

Formulae in R bear a passing resemblance to other formulae with which you may be familiar. Their primary purpose is to specify a statistical model, but they may also be used to specify other sorts of relationships between variables. This helps to simplify the user interface, although it does put some pressure on the beginner to learn the syntax of formulae.

The most straightforward use of formulae is in specifying a linear model, such as the following:

y = ax1 + bx2 + cx3 + i

where the x terms are variables and the a, b, c and i terms are numeric constants. Requesting a computation of this model in R from the lm() function would look something like this:

lm(y~x1+x2+x3,data=mydata.df)

Note that the variable on the left of the tilde ('~') is the response (or dependent) variable, and those on the right are the terms of the model (sometimes called the independent variables). The intercept i does not appear in the formula because R includes it by default. So far, so good.
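To see the call above run from start to finish, here is a minimal sketch. The data are simulated purely for illustration: the name mydata.df mirrors the example, while the coefficient values (2, -1, 0.5 and an intercept of 3) are arbitrary assumptions, not taken from any real data set.

```r
# Simulated data; column names mirror the example, coefficients are arbitrary
set.seed(42)
mydata.df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
mydata.df$y <- 2 * mydata.df$x1 - 1 * mydata.df$x2 + 0.5 * mydata.df$x3 + 3 +
  rnorm(50, sd = 0.1)

# Fit the linear model described by the formula
fit <- lm(y ~ x1 + x2 + x3, data = mydata.df)

# Estimates of i (the intercept), a, b and c, in that order
coef(fit)
```

The estimates should land close to the values used to simulate y. Calling summary(fit) gives the usual table of standard errors and tests.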

Because formulae are central to specifying models, the syntax is rich and sometimes confusing. It allows the usual interactions between variables, as well as terms that control details of the calculation, such as dropping the intercept.
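One way to get a feel for the richer syntax is to ask R how it expands a formula, without fitting anything. The sketch below uses terms() from the stats package for this. The variable names are the placeholders from the example above and need not exist, since no data are touched.

```r
# x1 * x2 is shorthand for both main effects plus their interaction
star <- attr(terms(y ~ x1 * x2), "term.labels")
star      # "x1" "x2" "x1:x2"

# (...)^2 asks for all main effects and all two-way interactions
pairs <- attr(terms(y ~ (x1 + x2 + x3)^2), "term.labels")
pairs     # "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"

# - 1 removes the intercept, a detail of the calculation rather than a variable
no.int <- attr(terms(y ~ x1 + x2 - 1), "intercept")
no.int    # 0

# I() protects arithmetic, so ^ here means "squared" and not "interaction"
squared <- attr(terms(y ~ x1 + I(x2^2)), "term.labels")
squared   # "x1" "I(x2^2)"
```

The point to take away is that operators such as +, *, ^ and - have special meanings inside a formula, which is why ordinary arithmetic has to be wrapped in I().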

The reader will have noticed that a simple formula was used to specify how to break down a variable by a number of factors in the brkdn() function. In this case, the formula served as a convenient way to describe the breakdown to the function rather than to specify a linear model. As the xtab() function also shows, formulae may be used to specify a number of relationships between variables in R. It is best to make sure you know how a particular function interprets a formula, as simply sticking one together and sending it to the function often results in particularly confusing error messages.
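The same idea can be seen with functions that ship with R. The sketch below uses aggregate() and xtabs() from the stats package as stand-ins. They are not the brkdn() and xtab() functions mentioned above, and the data frame survey.df is invented for illustration.

```r
# A small made-up data set
survey.df <- data.frame(
  sex   = c("F", "M", "F", "M", "F", "M"),
  group = c("a", "a", "b", "b", "a", "b"),
  score = c(10, 12, 14, 16, 18, 20)
)

# A formula is an object in its own right, whatever it is later used for
class(score ~ sex)

# Here the formula means "break score down by sex", not "fit a model"
means <- aggregate(score ~ sex, data = survey.df, FUN = mean)
means   # mean score is 14 for F and 16 for M

# A one-sided formula: nothing on the left, so the cells are simply counted
counts <- xtabs(~ sex + group, data = survey.df)
counts  # F/a = 2, F/b = 1, M/a = 1, M/b = 2
```

Each function reads the formula in its own way, so the help page for the function is the place to check what the left and right sides are taken to mean.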

For more information, see An Introduction to R: Defining statistical models; formulae.
