The DataFakeR package lets you customize each step of its workflow by setting options with the set_faker_opts function (and option-specific methods).
All configurable options are stored, with their default values, in the default_faker_opts object.
str(default_faker_opts, max.level = 1)
#> List of 27
#> $ opt_pull_character :List of 5
#> $ opt_pull_numeric :List of 5
#> $ opt_pull_integer :List of 5
#> $ opt_pull_logical :List of 2
#> $ opt_pull_date :List of 3
#> $ opt_pull_table :List of 1
#> $ opt_default_character :List of 7
#> $ opt_simul_spec_character :List of 1
#> $ opt_simul_restricted_character :List of 2
#> $ opt_simul_default_fun_character:function (n, not_null, unique, default, nchar, type, na_ratio, levels_ratio,
#> ...)
#> $ opt_default_numeric :List of 8
#> $ opt_simul_spec_numeric :List of 1
#> $ opt_simul_restricted_numeric :List of 3
#> $ opt_simul_default_fun_numeric :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)
#> $ opt_default_integer :List of 6
#> $ opt_simul_spec_integer :List of 1
#> $ opt_simul_restricted_integer :List of 3
#> $ opt_simul_default_fun_integer :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)
#> $ opt_default_logical :List of 6
#> $ opt_simul_spec_logical :List of 1
#> $ opt_simul_restricted_logical :List of 1
#> $ opt_simul_default_fun_logical :function (n, not_null, unique, default, type, na_ratio, levels_ratio, ...)
#> $ opt_default_date :List of 9
#> $ opt_simul_spec_date :List of 1
#> $ opt_simul_restricted_date :List of 2
#> $ opt_simul_default_fun_date :function (n, not_null, unique, default, type, min_date, max_date, format,
#> na_ratio, levels_ratio, ...)
#> $ opt_default_table :List of 1
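Customizable options can be divided into three main groups: schema pulling options (opt_pull_*), default parameters (opt_default_*) and simulation method options (opt_simul_*).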
All the parameters in set_faker_opts prefixed with opt_pull:
opt_pull_character - specifying what information to pull for character columns,
opt_pull_numeric - specifying what information to pull for numeric columns,
opt_pull_integer - specifying what information to pull for integer columns,
opt_pull_logical - specifying what information to pull for logical columns,
opt_pull_date - specifying what information to pull for date columns,
opt_pull_table - specifying what information to pull for tables.
See Sourcing structure from database for more details.
Consider a single column specification in the configuration YAML file:
columns:
  column_a1:
    type: char(8)
    not_null: true
    unique: true
    ...
Each column has a list of parameters attached to it. These parameters are passed to each simulation method and can be used to shape the resulting column.
When the number of columns is large, defining such parameters for every column in the configuration file can be inconvenient. To simplify the configuration, you can define default parameters for each column type with the opt_default_<column-type> method.
Simply put:
my_opts <- set_faker_opts(
opt_default_<column-type> = opt_default_<column-type>(...)
)
The default parameters in DataFakeR can be accessed with default_faker_opts$opt_default_<column-type>.
For example, for character columns we have:
default_faker_opts$opt_default_character
#> $regexp
#> [1] "text|char|factor"
#>
#> $nchar
#> [1] 10
#>
#> $not_null
#> [1] FALSE
#>
#> $unique
#> [1] FALSE
#>
#> $default
#> [1] ""
#>
#> $na_ratio
#> [1] 0.05
#>
#> $levels_ratio
#> [1] 1
This means that whenever a character column is simulated and these parameters are not defined in the schema YAML file, nchar = 10, not_null = FALSE, unique = FALSE and default = "" are passed to the simulation methods.
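The fallback behaviour described above can be sketched in base R. This only illustrates the precedence rule and is not DataFakeR's internal code: the defaults list copies the values printed above, while column_spec and the use of modifyList are assumptions made for the example.

```r
# Defaults copied from default_faker_opts$opt_default_character
# (regexp is left out, since it is not passed to simulation methods).
defaults <- list(
  nchar = 10, not_null = FALSE, unique = FALSE,
  default = "", na_ratio = 0.05, levels_ratio = 1
)

# Hypothetical column entry, as it might be read from the schema YAML file.
column_spec <- list(type = "varchar", not_null = TRUE)

# Values given for the column override the defaults; the rest falls back.
params <- utils::modifyList(defaults, column_spec)

params$not_null  # TRUE, taken from the column specification
params$nchar     # 10, taken from the defaults
```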
Column type mapping
The default parameters list also contains a parameter named regexp. This parameter is an exception: it is not passed to the simulation methods, but maps the column type defined in the configuration YAML file to the target R type.
For example, default_faker_opts$opt_default_character$regexp = "text|char|factor" means that whenever a column type matches the regular expression "text|char|factor", the column is treated in R as a character class column.
You can modify this regular expression to extend the mapping between source column types and target R column classes.
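A minimal base R sketch of such a mapping is shown below. It assumes the match is a plain regular expression test (here done with grepl); the column types in yaml_types are hypothetical, and the extended expression is only an example of what a modified regexp could look like.

```r
# Regular expression copied from default_faker_opts$opt_default_character$regexp.
char_regexp <- "text|char|factor"

# Hypothetical column types, as they could appear in a schema YAML file.
yaml_types <- c("char(8)", "varchar(20)", "text", "integer", "string")

# Assumption: a type maps to character when the expression matches it.
is_char <- grepl(char_regexp, yaml_types)
names(is_char) <- yaml_types
is_char

# Extending the expression makes "string" map to character as well.
extended <- grepl("text|char|factor|string", yaml_types)
```

With the default expression the first three types match; after the extension "string" matches too, while "integer" still does not.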
When simulating data, besides column-specific parameters you may also want to pass parameters to each table. One example is the number of rows the resulting table should contain.
Such parameters are configured with the opt_default_table method. Each parameter specified this way is attached to every table and used in the simulation process.
Each parameter passed to opt_default_table should be either a constant value or a function that iterates over all the tables and returns the proper parameter value for each one.
So, specifying:
set_faker_opts(opt_default_table = opt_default_table(nrows = 10))
attaches nrows = 10 to each table, and as a result (based on DataFakeR functionality) each simulated table will have 10 rows.
Setting up (the default setting):
set_faker_opts(opt_default_table = opt_default_table(nrows = nrows_simul_constant(10)))
attaches nrows = 10 to each table whenever nrows was not specified in the configuration.
DataFakeR also provides a second method for defining the number of rows, nrows_simul_ratio, which calculates the number of rows from the provided ratio and the total number of rows in all tables together. For example, specifying nrows = nrows_simul_ratio(0.1, 100) results in:
0.1 * 100 rows when the table doesn’t have nrows specified in the YAML file,
nrows * 100 rows when nrows is specified for the table and is between 0 and 1,
nrows rows when nrows is specified in the YAML file but is larger than 1.
To understand how to create custom methods, please check the definitions of nrows_simul_constant() and nrows_simul_ratio().
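The three cases above can be sketched as a small base R function. resolve_nrows is a hypothetical helper written for this illustration, not the definition of nrows_simul_ratio(); in particular, treating nrows equal to exactly 1 as a ratio is an assumption, since the text does not cover that boundary.

```r
# Hypothetical helper mirroring the rules described for nrows_simul_ratio(0.1, 100).
resolve_nrows <- function(nrows = NULL, ratio = 0.1, total = 100) {
  # Case 1: nrows not specified in the YAML file.
  if (is.null(nrows)) {
    return(ratio * total)
  }
  # Case 3: nrows larger than 1 is used as the row count directly.
  if (nrows > 1) {
    return(nrows)
  }
  # Case 2: nrows between 0 and 1 is treated as a ratio of the total
  # (assumption: nrows == 1 also lands here).
  nrows * total
}

resolve_nrows()      # 10
resolve_nrows(0.25)  # 25
resolve_nrows(40)    # 40
```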
Note: the only supported opt_default_table parameter is nrows. In future releases, the option to set up custom parameters and actively use them in the simulation process will be enabled.
The last group of configuration parameters lets you customize simulation methods. As presented on the simulation methods page, there are four types of simulation: deterministic, special, restricted and default.
All simulation method types (except the deterministic one) can be configured through set_faker_opts using:
opt_simul_spec_<column-type> parameter and method, to specify the list of possible special simulation methods for the selected column type:
set_faker_opts(
  opt_simul_spec_<column-type> = opt_simul_spec_<column-type>(
    <spec-method-name> = <spec-function>
  )
)
opt_simul_restricted_<column-type> parameter and method, to specify the list of possible restricted simulation methods for the selected column type:
set_faker_opts(
  opt_simul_restricted_<column-type> = opt_simul_restricted_<column-type>(
    <restricted-method-name> = <restricted-function>
  )
)
opt_simul_default_fun_<column-type> parameter, to specify the default simulation method for the selected column type:
set_faker_opts(
  opt_simul_default_fun_<column-type> = <default-function>
)
Examples showing how to define custom methods, and what each method type means, are presented on the simulation methods page.
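As a rough illustration of what a custom default function could look like, the sketch below uses the argument list printed for opt_simul_default_fun_numeric in str(default_faker_opts). The function name my_default_numeric, the value range and the way na_ratio is interpreted are assumptions made for this example; the package's own default function may behave differently.

```r
# Hypothetical default simulation function for numeric columns. The arguments
# mirror the signature shown for opt_simul_default_fun_numeric; unique, default,
# type and levels_ratio are accepted but ignored in this sketch.
my_default_numeric <- function(n, not_null, unique, default, type,
                               na_ratio, levels_ratio, ...) {
  values <- round(stats::runif(n, min = 0, max = 100), 2)
  # Assumption: na_ratio is the share of values replaced by NA
  # when missing values are allowed for the column.
  if (!isTRUE(not_null) && na_ratio > 0) {
    n_missing <- floor(n * na_ratio)
    values[sample.int(n, n_missing)] <- NA_real_
  }
  values
}

set.seed(1)
x <- my_default_numeric(
  n = 20, not_null = FALSE, unique = FALSE, default = NULL,
  type = "numeric", na_ratio = 0.1, levels_ratio = 1
)
length(x)      # 20
sum(is.na(x))  # 2

# Following the pattern above, it would then be registered with:
# set_faker_opts(opt_simul_default_fun_numeric = my_default_numeric)
```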