Monte Carlo studies are a common tool in statistics and related fields. They are used for everything from the evaluation of the finite sample properties of new statistical methods to the generation of probability distributions for risk management.
The MonteCarlo package for the R language provides tools to create simulation studies quickly and easily and it also allows to summarize the results in LaTeX tables. This vignette gives details on the implementation of the MonteCarlo package and presents examples for its application.
There are only two main functions in the package:
MonteCarlo() runs a simulation study for a user defined parameter grid. It handles the generation of loops over these parameter grid and parallelizes the computation on a user specified number of CPUs.
MakeTable() creates LaTeX tables from the output of MonteCarlo(). It stacks high dimensional output arrays into tables with a user specified ordering of rows and columns.
To run a simulation study, the user has to nest both - the generation of a sample and the calculation of the desired statistics from this sample - in a single function. This function is passed to MonteCarlo(). No additional programming is required.
The idea behind this approach is to allow the user full control and flexibility with regard to the design of the Monte Carlo experiment. It also makes MonteCarlo() very versatile. Finally, it is very intuitive. The user formulates his experiment as if he/she was only interested in a single draw.
The best way to get working with the MonteCarlo package is to look at an example. Here we evaluate the performance of a standard t-test of the hypothesis \(H_0: \mu=0\) vs \(H_1: \mu\neq0\). We are interested to see how the size and power of the test change with the sample size (n), the distance from the null hypothesis (loc - for location) and the standard deviation of the distribution (scale).
The test statistic is given by
\(t=\frac{\bar x}{\hat \sigma}\),
where \(\bar x\) and \(\hat \sigma\) are the arithmetic mean and the estimated standard deviation. The sample is generated from a normal distribution.
To conduct this analysis, we proceed as follows. First, we load the MonteCarlo package.
library(MonteCarlo)
Then we define the following function.
#########################################
## Example: t-test
# Define function that generates data and applies the method of interest
ttest<-function(n,loc,scale){
# generate sample:
sample<-rnorm(n, loc, scale)
# calculate test statistic:
stat<-sqrt(n)*mean(sample)/sd(sample)
# get test decision:
decision<-abs(stat)>1.96
# return result:
return(list("decision"=decision))
}
As discussed above, ttest() is formulated in a way as if we only want to generate a single test decision. The arguments of the function are the parameters we are interested in.
Our ttest() function carries out 4 steps:
We then define the combinations of parameters that we are interested in and collect them in a list. The elements of this list must have the same names as the parameters for which we want to supply grids.
# define parameter grid:
n_grid<-c(50,100,250,500)
loc_grid<-seq(0,1,0.2)
scale_grid<-c(1,2)
# collect parameter grids in list:
param_list=list("n"=n_grid, "loc"=loc_grid, "scale"=scale_grid)
To run the simulation, the function ttest() and the parameter grid (param_list) are passed to MonteCarlo(), together with the desired number of Monte Carlo repetitions (here nrep=1000).
# run simulation:
MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)
There is no further coding required. All the mechanics of the Monte Carlo experiment are handled by the MonteCarlo() function.
Calling summary produces a short information on the simulation.
summary(MC_result)
## Simulation of function:
##
## function(n,loc,scale){
##
## # generate sample:
## sample<-rnorm(n, loc, scale)
##
## # calculate test statistic:
## stat<-sqrt(n)*mean(sample)/sd(sample)
##
## # get test decision:
## decision<-abs(stat)>1.96
##
## # return result:
## return(list("decision"=decision))
## }
## <bytecode: 0x000000000bda0e68>
##
## Required time: 20.51 secs for nrep = 1000 repetitions on 1 CPUs
##
## Parameter grid:
##
## n : 50 100 250 500
## loc : 0 0.2 0.4 0.6 0.8 1
## scale : 1 2
##
##
## 1 output arrays of dimensions: 4 6 2 1000
As one can see from the summary, the simulation results are stored in an array of dimension c(4,6,2,1000)
. The Monte Carlo repetitions are collected in the last dimension of the array.
To summarize the results in a reasonable way and to include them as a table in a paper or report, we have to represent them in a matrix. This is handled by the MakeTable() function, that stacks the submatrices collected in the array in the rows and columns of a new and larger matrix and prints the result in the form of LaTeX code to generate the desired table.
To determine in which order the results are stacked in rows and columns, we supply the function arguments rows and cols that are vectors of the names of the parameters in the order in which we want them to appear in the table, sorted from the inside to the outside.
# generate table:
MakeTable(output=MC_result, rows="n", cols=c("loc","scale"), digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrrrrrrrr }
## \hline\hline\\\\
## scale && \multicolumn{ 6 }{c}{ 1 } & & \multicolumn{ 6 }{c}{ 2 } \\
## n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\
## & & & & & & & & & & & & & & \\
## 50 & & 0.06 & 0.28 & 0.81 & 0.99 & 1.00 & 1.00 & & 0.05 & 0.12 & 0.28 & 0.56 & 0.80 & 0.94 \\
## 100 & & 0.04 & 0.49 & 0.97 & 1.00 & 1.00 & 1.00 & & 0.06 & 0.19 & 0.53 & 0.83 & 0.98 & 1.00 \\
## 250 & & 0.05 & 0.88 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.05 & 0.36 & 0.89 & 1.00 & 1.00 & 1.00 \\
## 500 & & 0.05 & 0.99 & 1.00 & 1.00 & 1.00 & 1.00 & & 0.05 & 0.59 & 1.00 & 1.00 & 1.00 & 1.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision }
## \end{table}
To change the ordering, just change the vectors rows and cols. As can be seen below, the results change accordingly.
# generate table:
MakeTable(output=MC_result, rows=c("n","scale"), cols="loc", digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## scale & n/loc & & 0 & 0.2 & 0.4 & 0.6 & 0.8 & 1 \\
## & & & & & & & & \\
## \multirow{ 4 }{*}{ 1 } & 50 & & 0.06 & 0.28 & 0.81 & 0.99 & 1.00 & 1.00 \\
## & 100 & & 0.04 & 0.49 & 0.97 & 1.00 & 1.00 & 1.00 \\
## & 250 & & 0.05 & 0.88 & 1.00 & 1.00 & 1.00 & 1.00 \\
## & 500 & & 0.05 & 0.99 & 1.00 & 1.00 & 1.00 & 1.00 \\
## & & & & & & & & \\
## \multirow{ 4 }{*}{ 2 } & 50 & & 0.05 & 0.12 & 0.28 & 0.56 & 0.80 & 0.94 \\
## & 100 & & 0.06 & 0.19 & 0.53 & 0.83 & 0.98 & 1.00 \\
## & 250 & & 0.05 & 0.36 & 0.89 & 1.00 & 1.00 & 1.00 \\
## & 500 & & 0.05 & 0.59 & 1.00 & 1.00 & 1.00 & 1.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ decision }
## \end{table}
Now we can simply copy the LaTeX code and add it to our paper, report or presentation. That is all. The user can focus on the design of the experiment that he is interested in and on the interpretation of the results.
We will now discuss the details of MonteCarlo() and MakeTable() and then give further examples.
The most important argument of MonteCarlo() is func. It is a user defined function that handles the generation of data, the application of the method of interest and the evaluation of the result for a single repetition and parameter combination.
MonteCarlo() handles the generation of loops over the desired parameter grids, the repetition of the Monte Carlo experiment for each of the parameter constellations and (if desired) the parallelization of this process.
There are two important formal requirements that func has to fulfill.
In addition to that, func has to work in the current workspace. That means that the required packages, data, other function, etc., have to be loaded.
The second most important argument of MonteCarlo() is param_list, that also has to fulfill some conventions:
The only other required argument of MonteCarlo() is nrep that determines the desired number of repetitions. Apart from that, there is a number of optional arguments that are listed in the table below.
Argument | Description |
---|---|
func | The function to be evaluated. |
nrep | An integer that specifies the desired number of Monte Carlo repetitions. |
param_list | A list whose components are named after the parameters of func and each component is a vector containing the desired grid values for that parameter. |
ncpus | An integer specifying the number of CPUs to be used. Default is ncpus=1. For ncpus>1 the simulation is parallized automatically using ncpus CPU units. |
max_grid | Integer that specifies for which grid size to throw an error, if grid becomes to large. Default is max_grid=1000. |
time_n_test | Boolean that specifies whether the required simulation time should be estimated (useful for large simulations or slow functions). See details. Default is time_n_test=FALSE. |
save_res | Boolean that specifies whether the results of time_n_test should be saved to the current directory. Default is time_n_test=FALSE. |
If ncpus is set to an integer value larger than one, the simulation is conducted parallized on a snow cluster with the respective number of slaves. Make sure that the machine you are using has the respective number of CPU units available. The parallelization builds on the snowfall package. The setup of the cluster including the export of the necessary functions and packages to the workers are handled automatically. If func uses objects in the global workspace, such as lookup tables or data sets, these are also exported. MonteCarlo() autmatically analyzes func to determine the required functions and packages. It can even deal with nested dependencies, where func relies on functions that make use of other functions, and so on. The only precondition is that these functions are available in the workspace - as they have to be if one would just want to run func on its own.
Large parameter grids quickly lead to exessive time requirements for the simulation study. To prevent this from happening unintentionally (for example by thoughtless definition of param_list) an error is thrown, if there are more than 1000 parameter constellations. If it is desired to run the simulation nonetheless, max_grid has to be modified accordingly.
For the estimation of the required simulation time, a separate simulation is run on a reduced grid that only contains the extreme points for each parameter, e.g. the smallest and the largest sample size. This test simulation is carried out with nrep/10 repetitions and the required simulation time is estimated by a linear interpolation. Since the computational complexity is usually a convex function of the sample size and the dimension of the process, this approach tends to overestimate the time required. Nevertheless, it provides a reasonable estimate of the order of magnitude of the time required (minutes, hours or days?).
If time_n_test is set to TRUE, save_res determines whether the results of the smaller test simulation should be saved in the current working directory. This can be useful for larger simulations, where the simulation takes a lot of time and it can be useful to have a look at the preliminary results to stop the main simulation if something is not right.
Sometimes it is not intended to run a large simulation study and to document the results in a LaTeX document. Instead one might want to conduct a little experiment and immediately see the result. If this is desired, one can set raw=FALSE and the output array will be aggregated by taking averages of the results for all nrep repetitions. For small parameter grids these can be viewed and interpreted directly.
Different from functions and packages, MonteCarlo() currently does not support the automatic export of datasets used in func to the cluster. If this is desired, the respective dataset has to be supplied as export_also$data. In addition to that, export_also opens a backdoor in case of bugs that might hamper the automatic function and package export. To manually export a function or dataset or to load a package, pass a list to export_also where the list element is named “functions”, “data” and/or “packages”. For example: export_also=list("data"=rnorm(100), "functions"=c("function_name_1", "function_name_2"), "packages"="package_name")
.
As discussed above, the MakeTable() function handles the generation of LaTeX tables from the output of MonteCarlo(). Like the latter, it only has three required arguments. The first one - output - is a MonteCarlo object returned by the equally named function. The second and third are rows and cols that determine the ordering of the resulting table (from the inside to the outside).
If func returns a list with more than one element, each list element is shown in a separate table by default. This behavior can be modified by including the string “list” at the desired position in either rows or cols.
To compile .tex-files that contain tables generated by MakeTable(), make sure that is included in the preamble.
Similar to MonteCarlo(), MakeTable() has a number of optional arguments that allow to modify the appearance of the generated table and the way the simulation results are summarized. An overview of the function arguments is given in the table below. Details will be discussed thereafter.
Argument | Description |
---|---|
output | A list of class MonteCarlo generated by MonteCarlo(). |
rows | A vector of parameter names to be stacked in the rows of the table. Ordered from the inside to the outside. |
cols | Vector of parameter names to be stacked in the cols of the table. Ordered from the inside to the outside. |
digits | Number of digits displayed in table. Default is digits=4. |
collapse | Optional list of the same length as output giving the names of functions to be applied to the repective components of output when collapsing the results to a table. By default mean() is used. Another example could be sd(). |
transform | Optional argument to transform the output table (for example from MSE to RMSE). If a function is supplied, it is applied to all tables. Alternatively, a list of functions can be supplied that has the same length as output. For tables that are supposed to stay unchanged set the respective list elements to NULL. |
include_meta | Boolean that determines whether the meta data provided by summary() is included in comments below the table. Default is include_meta=TRUE. |
width_mult | Scaling factor for width of the output table. Default is width_mult=1. |
partial_grid | An optional list with the elements named after the parameters for which only a part of the grid values is supposed to be included in the table. Each component of the list is a vector that specifies the grid values of interest. |
digits simply specifies the number of digits to which results are rounded in the table. In contrast to the standard behavior in R, the results will be printed with exactly this number of digits. Trailing zeros will not be dropped. For example, if digits=2, 1 will be displayed as 1.00. This makes for nicer formatting of the resulting tables.
For a given parameter constellation, there will be nrep simulation results in output. By default these are aggregated using mean(). Therefore, if func returns a test decision with the result either specified as 0 (for non-rejection) or 1 (for rejection), as it was the case in the ttest()-example, the results will be aggregated to represent the rejection frequency (size or power - depending on the DGP and the parameter constellation). Similarly, if func returns a point estimate, the mean of the estimated parameter is given in the table. If func returns the deviation of the estimated parameter from its true value, then the table will represent the bias. By the same principle, MSEs can be calculated if func returns squares of these values.
Even though the standard behavior is very versatile, one might be interested in a different aspect of the simulated distribution - for example a quantile such as the median. Therefore the collapse argument allows to supply other functions that can be used to aggregate the output. An example could be median() or sd(). These functions are supplied to collapse in form of a list of the same length as the output of func. Note also that the functions that are specified have to return scalars.
An example for the use of collapse is given below in the section “Further Examples”.
Another option is to apply a transformation directly to the table. This can be useful in situations where the resulting numbers are very small or very large so that one wants to scale them by a power of 10 or 1/10. Another example would be if one wants to show root mean squared errors instead of MSEs.
As for collapse, an example for the use of transform is given in the section “Further Examples”.
By default the information returned by summary() - such as the specification of func, the parameter grids used in the simulation, the number of repetitions nrep and the required simulation time - are included in comments below the table. This is meant as a form of automatic documentation that allows for the replication of results in reports, papers or presentations, even if they were generated some time ago.
If this is not in the interest of the user - or the comments are just perceived as lengthy, this feature can be switched off by setting include_meta=FALSE.
Without further modifications the LaTeX tables returned by MakeTable() are scaled to have the same width as the text in the document. This can be modified directly in the LaTeX code that is returned, but if the user already knows in advance that the table should be smaller, this can already specified by setting width_mult to the desired fraction of the text width.
Sometimes situations will occur in which one is only interested in a subset of the parameter grid originally considered and passed to MonteCarlo(). For example because it turns out that the resulting table is very large, or because the results are not very informative for some values of the grid (e.g. the power of a test remains at 1 if the sample size is increased further).
To drop unwanted parts of the table, a list can be supplied to partial_grid, that specifies which elements of which parameter vector should remain part of the table. The list has to have as many elements as there are parameters in param_list and each element has to be named after the respective parameter. Each list element has to be a vector of natural numbers specifying the positions of desired values in the respective parameter vector. Again, an example is given below in the section “Further Examples”.
MergeResults() is a utility function that allows to merge the MonteCarlo objects generated from multiple simulations using the same func and param_list. This can be useful, if a first test simulation is run with a small nrep and more reliable results are produced later with a larger nrep, or if multiple machines are used.
There are only two required arguments. The first one, path specifies the path of the directory that contains the files that are supposed to be merged. The second one, identifier is a string that identifies which files should be merged. The string has to be a substring of the variable names of all files that are supposed to be merged, but it cannot be contained in the filenames of other files in the specified directory.
It is often usefull to visualize the results of simulation studies with help of density plots, box plots and so on. To facilitate this, the MonteCarlo package does not provide its own suite of plotting functions. Instead, the function MakeFrame() converts the simulation results contained in the output of the MonteCarlo() function into a data.frame that is suitable to be manipulated and plotted with powerful standard tools such as dplyr, ggplot2 and of course base graphics. The structure of the dataframe is such that each row contains the parameters used and the results of the simulation for one repetition. This is illustrated below.
ttest<-function(n,loc,scale){
sample<-rnorm(n, loc, scale)
stat<-sqrt(n)*mean(sample)/sd(sample)
return(list("stat"=stat))
}
MC_result<-MonteCarlo(func=ttest, nrep=1000, param_list=param_list)
df<-MakeFrame(MC_result)
head(df)
## n loc scale stat
## 1 50 0.0 1 0.06306346
## 2 100 0.0 1 -0.14677467
## 3 250 0.0 1 1.24365601
## 4 500 0.0 1 -0.69706901
## 5 50 0.2 1 2.08853104
## 6 100 0.2 1 0.67448822
If func returns a list with more than one element, then each of the list elements has its own column in the dataframe
One can now use the dataframe to conduct further analyses. For example, we can plot estimated densities for the distribution of the t-test with different sample sizes n.
library(dplyr)
library(ggplot2)
tbl <- tbl_df(df)
ggplot(filter(tbl, loc==0, scale==1)) + geom_density(aes(x=stat, col=factor(n)))
In the previous sections we focused on the t-test as an example for the use of MonteCarlo() and MakeTable(). Here, we consider a second example: the evaluation of estimators. In particular, we compare the mean() and the median() as estimators for the expected value of a Gaussian random variable. This example serves to illustrate the use of the optinal arguments of the MakeTable() function.
First, we specify the function of interest: mean_vs_median() (defined below) generates a sample of n normal distributed random variables with standard deviation scale. It then calulates the mean() and the median() and returns both in a list. Therefore, in contrast to the t-test example used above, this function returns a list with two elements.
mean_vs_median<-function(n,scale){
# generate sample
sample<-rnorm(n, 0, scale)
# calculate estimators
mean_sample<-mean(sample)
median_sample<-median(sample)
# return results
return(list("mean"=mean_sample, "median"=median_sample))
}
n_grid<-c(50, 250, 500)
scale_grid<-c(1, 2, 4)
param_list=list("n"=n_grid, "scale"=scale_grid)
# run simulation:
erg_mean_median<-MonteCarlo(func=mean_vs_median, nrep=1000, param_list=param_list)
Like before, we pass the function and the parameter grid to MonteCarlo() and then feed the results to MakeTable().
MakeTable(output=erg_mean_median, rows="n", cols="scale", digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrr }
## \hline\hline\\\\
## n/scale & & 1 & 2 & 4 \\
## & & & & \\
## 50 & & -0.01 & 0.01 & -0.01 \\
## 250 & & 0.00 & 0.00 & 0.00 \\
## 500 & & 0.00 & 0.00 & 0.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean }
## \end{table}
##
##
##
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrr }
## \hline\hline\\\\
## n/scale & & 1 & 2 & 4 \\
## & & & & \\
## 50 & & -0.01 & 0.01 & 0.00 \\
## 250 & & 0.00 & 0.00 & 0.00 \\
## 500 & & 0.00 & 0.00 & 0.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ median }
## \end{table}
As one can see, the default behavior of MakeTable() is to print separate tables for each value in the list returned by func.
To include both, the estimates for the mean and median in the same table, we simply include the string “list” in either rows or cols. Here, we choose to include it as the last element of cols, to have the results for mean() and median() next to each other.
MakeTable(output=erg_mean_median, rows=c("n"), cols=c("scale","list"), digits=2, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## list && \multicolumn{ 3 }{c}{ mean } & & \multicolumn{ 3 }{c}{ median } \\
## n/scale & & 1 & 2 & 4 & & 1 & 2 & 4 \\
## & & & & & & & & \\
## 50 & & -0.01 & 0.01 & -0.01 & & -0.01 & 0.01 & 0.00 \\
## 250 & & 0.00 & 0.00 & 0.00 & & 0.00 & 0.00 & 0.00 \\
## 500 & & 0.00 & 0.00 & 0.00 & & 0.00 & 0.00 & 0.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
For large parameter grids, one might want to focus on a subset of the results. To achive this, just include the optional argument partial_grid. Here, the grids for n and scale both had three values and we only want to use the largest and the smallest values in the grids. We therefore supply a list with the names of the parameters and vectors specifying the positions of the values that we want to keep in the grid.
# use partial_grid
MakeTable(output=erg_mean_median, rows="n", cols=c("scale","list"), digits=2,
partial_grid=list("n"=c(1,3), "scale"=c(1,3)), include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrr }
## \hline\hline\\\\
## list && \multicolumn{ 2 }{c}{ mean } & & \multicolumn{ 2 }{c}{ median } \\
## n/scale & & 1 & 4 & & 1 & 4 \\
## & & & & & & \\
## 50 & & -0.01 & -0.01 & & -0.01 & 0.00 \\
## 500 & & 0.00 & 0.00 & & 0.00 & 0.00 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
The results in the tables above are hard to interpret, due to their scale. Both the mean() and median() have a bias that is close to zero, and there is barely a difference in the rounded values. However, if we change the number of digits displayed, we see that the bias of the mean() tends to be smaller than that of the median().
MakeTable(output=erg_mean_median, rows=c("n"), cols=c("scale","list"), digits=4, include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## list && \multicolumn{ 3 }{c}{ mean } & & \multicolumn{ 3 }{c}{ median } \\
## n/scale & & 1 & 2 & 4 & & 1 & 2 & 4 \\
## & & & & & & & & \\
## 50 & & -0.0051 & 0.0071 & -0.0052 & & -0.0084 & 0.0142 & -0.0024 \\
## 250 & & -0.0034 & 0.0012 & -0.0022 & & -0.0013 & 0.0040 & -0.0003 \\
## 500 & & -0.0004 & 0.0032 & 0.0044 & & -0.0019 & 0.0026 & 0.0041 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
A simple way to make these differences visible without displaying to many digits is to change the unit, for example by multiplying all results with 100. This is where the transform argument of MakeTable() comes in handy.
MakeTable(output=erg_mean_median, rows=c("n"), cols=c("scale","list"), digits=2, transform=list(function(x){x*100}, function(x){x*100}), include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## list && \multicolumn{ 3 }{c}{ mean } & & \multicolumn{ 3 }{c}{ median } \\
## n/scale & & 1 & 2 & 4 & & 1 & 2 & 4 \\
## & & & & & & & & \\
## 50 & & -0.51 & 0.71 & -0.52 & & -0.84 & 1.42 & -0.24 \\
## 250 & & -0.34 & 0.12 & -0.22 & & -0.13 & 0.40 & -0.03 \\
## 500 & & -0.04 & 0.32 & 0.44 & & -0.19 & 0.26 & 0.41 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
Since both estimators - the mean() and the median() are unbiased in this setup, comparing them by their bias might not be the most interesting analysis. Instead, we might be interested in other aspects of their distributions - for example their standard deviations. To modify the function used by MakeTable() to summarize the result from the nrep repetitions of the experiment, a list can be passed to collapse that contains the names of the functions that are supposed to be used to collapse the realisations of the estimators to a meaningful table. Here we use sd(), since we know that mean() is the more efficient estimator in this setup.
MakeTable(output=erg_mean_median, rows="n", cols=c("scale", "list"), digits=2, collapse=list("sd", "sd"), include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrr }
## \hline\hline\\\\
## list && \multicolumn{ 3 }{c}{ mean } & & \multicolumn{ 3 }{c}{ median } \\
## n/scale & & 1 & 2 & 4 & & 1 & 2 & 4 \\
## & & & & & & & & \\
## 50 & & 0.14 & 0.29 & 0.56 & & 0.17 & 0.35 & 0.69 \\
## 250 & & 0.06 & 0.13 & 0.26 & & 0.08 & 0.16 & 0.33 \\
## 500 & & 0.04 & 0.09 & 0.18 & & 0.05 & 0.11 & 0.23 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
Now, it is easy to see, that even though the bias terms are very similar, the mean() has a smaller variance than the median().
One of the key concepts behind the design of MonteCarlo() is that it is generally assumed that the samples are generated from a data generated process that nests all the situations of interest. This is also the case in the examples discussed so far, where the data is generated from a normal distribution and only the standard deviation of the distribution and the number of observations change.
However, this does not mean that MonteCarlo() cannot handle non-nested data generating processes. As an example, we consider the case where we want to compare the standard deviation of the mean() and the median() as we did above, but for samples from a normal distribution and from a uniform distribution. These are clearly non-nested. To deal with this situation, we include DGP as an additional argument in mean_vs_median() and create a case distinction, so that the value of DGP determines from which distribution the sample is generated. We then include the possible values of DGP in an additional parameter grid.
mean_vs_median<-function(n,scale,DGP){
# generate sample
if(DGP=="normal"){sample<-rnorm(n, 0, scale)}
if(DGP=="uniform"){
b<-scale/(2*sqrt(1/12))
sample<-runif(n, -b, b)
}
# calculate estimators
mean_sample<-mean(sample)
median_sample<-median(sample)
# return results
return(list("mean"=mean_sample, "median"=median_sample))
}
n_grid<-c(50, 250, 500)
scale_grid<-c(1, 2, 4)
DGP_grid<-c("normal", "uniform")
param_list=list("n"=n_grid, "scale"=scale_grid, "DGP"=DGP_grid)
Note that the width of the uniform distribution is specified so that both the Gaussian and the uniform distribution have a standard deviation of scale.
# run simulation:
erg_mean_median<-MonteCarlo(func=mean_vs_median, nrep=1000, param_list=param_list)
After running the simulation, we obtain the following results.
MakeTable(output=erg_mean_median, rows=c("n","DGP"), cols=c("scale", "list"), digits=2, collapse=list("sd", "sd"), include_meta=FALSE)
## \begin{table}[h]
## \centering
## \resizebox{ 1 \textwidth}{!}{%
## \begin{tabular}{ rrrrrrrrrr }
## \hline\hline\\\\
## & list && \multicolumn{ 3 }{c}{ mean } & & \multicolumn{ 3 }{c}{ median } \\
## DGP & n/scale & & 1 & 2 & 4 & & 1 & 2 & 4 \\
## & & & & & & & & & \\
## \multirow{ 3 }{*}{ normal } & 50 & & 0.14 & 0.29 & 0.56 & & 0.17 & 0.36 & 0.69 \\
## & 250 & & 0.06 & 0.13 & 0.26 & & 0.08 & 0.16 & 0.33 \\
## & 500 & & 0.04 & 0.09 & 0.18 & & 0.06 & 0.11 & 0.23 \\
## & & & & & & & & & \\
## \multirow{ 3 }{*}{ uniform } & 50 & & 0.15 & 0.29 & 0.57 & & 0.24 & 0.48 & 0.98 \\
## & 250 & & 0.06 & 0.12 & 0.24 & & 0.11 & 0.21 & 0.42 \\
## & 500 & & 0.05 & 0.09 & 0.19 & & 0.08 & 0.15 & 0.31 \\
## \\
## \\
## \hline\hline
## \end{tabular}%
## }
## \caption{ mean, median }
## \end{table}
We can see that the standard deviation of the median() is far larger for the uniform distribution whereas that of the mean() remains nearly unchanged.
This example shows that MonteCarlo() is very versatile. Depending on the specification of func, it is possible to deal with non-nested DGPs. Similarly, multivariate outputs could be handled by splitting them in several elements and returning them seperately.