The goal of the rolap package is to define transformations that allow you to easily obtain ROLAP star databases, composed of fact and dimension tables, from operational tables. The starting point for defining a star database with the rolap package is a flat table.
A flat database or flat-file database is a database that contains only a single table. A flat table is a generally denormalized table that is not related to other tables. It is not necessarily tidy data (in the sense of the tidyverse package), but in each column all the data must be of the same type, so that it can be easily stored in an RDBMS (Relational Database Management System). It is common for transactional systems to allow their data to be exported as a flat table.
The rolap package includes a set of operations that allow defining, transforming and integrating tables to obtain a flat table from which we can define star databases. To work with tables and obtain a flat table, the flat_table class is defined: its main objective is to allow the integration of tables from different sources to obtain a flat table with all the data. It is described in this document using a practical example.
First, the datasets used in the example are presented. Then, the functions to define flat tables from various sources are shown. The document continues with a section dedicated to flat table join operations and another to flat table transformation operations. It finishes with the functions to define star databases and constellations from the flat tables and, finally, with the conclusions.
This section describes the data sets used in the example.
To illustrate how the package works, we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System dataset.
We have developed the examples in this document using the dataset downloaded from the source. To make them easier to reproduce, we have stored in the package a file with the same format as the original but that includes only 1% of its data, selected at random, which is accessed below.
library(rolap)

file <-
  system.file(
    "extdata/mrs",
    "mrs_122_us_cities_1962_2016.csv",
    package = "rolap"
  )

mrs_ft <-
  read_flat_table_file(name = 'mrs', file, unknown_value = "Not available")
Using the read_flat_table_file() function we read a table stored in a text file and create a flat_table object with a name. The unknown_value parameter defines the value to use when a value is not available.
Below are the first records of the table. We access the table using the get_table() function for objects of the flat_table class.
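For example, the records shown below can be obtained as follows (the use of head() to limit the output is an assumption):

mrs_ft |>
  get_table() |>
  head()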
Year | WEEK | Week Ending Date | REGION | State | City | Pneumonia and Influenza Deaths | All Deaths | <1 year (all cause deaths) | 1-24 years (all cause deaths) | 25-44 years | 45-64 years (all cause deaths) | 65+ years (all cause deaths) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 1 | 01/06/1962 | 7 | OK | Tulsa | 2 | 69 | 5 | 1 | 5 | 17 | 41 |
1962 | 2 | 01/13/1962 | 1 | MA | Lynn | 1 | 28 | 1 | 0 | 2 | 11 | 14 |
1962 | 2 | 01/13/1962 | 1 | MA | Somerville | 2 | 21 | 2 | 0 | 1 | 5 | 13 |
1962 | 2 | 01/13/1962 | 5 | DC | Washington | 10 | 220 | 20 | 8 | 20 | 72 | 100 |
1962 | 2 | 01/13/1962 | 8 | CO | Colorado Springs | 2 | 12 | 0 | 1 | 0 | 1 | 10 |
1962 | 5 | 02/03/1962 | 2 | NJ | Elizabeth | 2 | 32 | 5 | 0 | 0 | 11 | 16 |
This data set has mortality data from 122 US cities. The geographic data that it includes corresponds to the REGION, State and City attributes of the table.
Apart from this data, we may be interested in adding additional geographical information, such as data on the population of these cities or their specific location.
In the maps package, we find the us.cities dataset, a database of US cities with the information that interests us.
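The first rows shown below can be obtained, for example, as follows (assuming the maps package is installed):

head(maps::us.cities)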
name | country.etc | pop | lat | long | capital |
---|---|---|---|---|---|
Abilene TX | TX | 113888 | 32.45 | -99.74 | 0 |
Akron OH | OH | 206634 | 41.08 | -81.52 | 0 |
Alameda CA | CA | 70069 | 37.77 | -122.3 | 0 |
Albany GA | GA | 75510 | 31.58 | -84.18 | 0 |
Albany NY | NY | 93576 | 42.67 | -73.8 | 2 |
Albany OR | OR | 45535 | 44.62 | -123.1 | 0 |
The content of the table fields is obvious, except for the capital field, which is described in the help as containing a capital status indication (0 for non-capital, 1 for capital, 2 for state capital).
Our goal is to have all the available data, possibly from multiple data sources, in a single table, a flat table. We have defined the flat_table class to represent the tables, transform them and integrate them.
This section describes, for the data sets used, the functions to create a flat_table object and to access and modify its characteristics.
flat_table objects

For the mortality data in US cities, we have created the flat_table object by directly reading the data from the file where it was stored, using the read_flat_table_file() function. The package also includes the read_flat_table_folder() function to read and integrate all text files in a folder into a flat_table object.
From a dataset in a data frame, using the flat_table()
function, we obtain a flat_table
object, as shown below for
the US cities dataset.
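A minimal sketch of this definition; the argument names and the flat table name are assumptions:

usc_ft <-
  flat_table(name = 'us_cities', instances = maps::us.cities)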
Using this function, we are going to create an additional flat_table object that allows us to associate the capital status codes with their description. To do this, we first define a data frame with the codes and their descriptions and, from it, we create the object, as shown below.
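A hedged sketch of this definition: the cs_ft object name is illustrative, the codes follow the description of the capital field above, and the status descriptions match those shown in the join result later.

capital_status <- data.frame(
  code = c('0', '1', '2'),
  status = c('non-capital', 'capital', 'state capital'),
  stringsAsFactors = FALSE
)

cs_ft <-
  flat_table(name = 'capital_status', instances = capital_status)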
flat_table attributes and measures

Following the criteria of OLAP system design, we distinguish between attributes and measures in the flat_table class, considering the focus of attention of the data:
Measures: They answer the question “how much?”, referring to the focus of attention.
Attributes: They define the rest of the context of the focus of attention.
Not all numerical data are necessarily measures; some should be considered attributes, depending mainly on the focus of attention that we consider.
When we define a flat table, its fields are classified as attributes or measures based on their type: numeric fields are considered measures and the rest of the fields are attributes. No functions are used to infer the type; only the type defined in the starting data frame is considered. When data is obtained from text files, all fields are read as character type, that is, they are classified as attributes.
For this example, our focus is the Mortality Reporting System data. We have to define the attributes and measures with this point of view.
Using the functions shown below we consult the attributes and measures.
mrs_ft |>
  get_attribute_names()
#> [1] "Year" "WEEK"
#> [3] "Week Ending Date" "REGION"
#> [5] "State" "City"
#> [7] "Pneumonia and Influenza Deaths" "All Deaths"
#> [9] "<1 year (all cause deaths)" "1-24 years (all cause deaths)"
#> [11] "25-44 years" "45-64 years (all cause deaths)"
#> [13] "65+ years (all cause deaths)"
mrs_ft |>
  get_measure_names()
#> NULL
Since the data comes from a text file, all fields have been defined as attributes.
This is a starting point; next we will see how to change the initial classification. Using the transform_to_measure() function, we transform attributes into measures. In some cases the data must first be transformed (removing the thousands separator or changing the decimal separator); this can be indicated through the function parameters, although it is not necessary in this case.
mrs_ft <- mrs_ft |>
  transform_to_measure(
    attributes = c(
      'Pneumonia and Influenza Deaths',
      'All Deaths',
      '<1 year (all cause deaths)',
      '1-24 years (all cause deaths)',
      '25-44 years',
      '45-64 years (all cause deaths)',
      '65+ years (all cause deaths)'
    )
  )
mrs_ft |>
  get_attribute_names()
#> [1] "Year" "WEEK" "Week Ending Date" "REGION"
#> [5] "State" "City"
mrs_ft |>
  get_measure_names()
#> [1] "Pneumonia and Influenza Deaths" "All Deaths"
#> [3] "<1 year (all cause deaths)" "1-24 years (all cause deaths)"
#> [5] "25-44 years" "45-64 years (all cause deaths)"
#> [7] "65+ years (all cause deaths)"
We can change the format of numeric attributes. The main motivation is usually to make the alphabetical order coincide with the numerical order, which is often important in user query tools (e.g., Power BI).
Below is the change for the WEEK field. It is enough to indicate the width: the value is padded with white spaces on the left to fill it, so that, for example, the output is " 2" instead of "2" and it is placed before "10" when the data is ordered alphabetically, as shown in the following.
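A hedged sketch of this change; the transform_attribute_format() function name and its parameters are an assumption based on the surrounding text:

mrs_ft <- mrs_ft |>
  transform_attribute_format(attributes = 'WEEK', width = 2)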
In this case, the starting data was in a data frame, where each column has a data type associated with it. The initial classification is as follows.
usc_ft |>
  get_attribute_names()
#> [1] "name" "country.etc"
usc_ft |>
  get_measure_names()
#> [1] "pop" "lat" "long" "capital"
From the point of view of the Mortality Reporting System, all of this
data can be considered as attributes. Using the parameters of the
transform_to_attribute()
function, we define the format of
the transformation from numeric data to string, as shown below.
usc_ft <- usc_ft |>
  transform_to_attribute(measures = 'capital') |>
  transform_to_attribute(measures = 'pop',
                         width = 5) |>
  transform_to_attribute(measures = c('lat', 'long'),
                         width = 2,
                         decimal_places = 1)
In particular, by using the width
parameter, as done
before to transform the numerical attributes, we can make the order of
the string data match the numerical order.
We currently have the data of interest in three tables and our goal is to have all the data in a single flat table: we have to join the tables.
To join tables represented by flat_table objects, we must define a lookup table; the other table must contain a foreign key that allows them to be related.
Using the lookup_table() function we define the attributes that make up the primary key of a lookup table, as shown below for the capital status table. The primary key is made up of the code field.
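A hedged sketch of this definition, using the illustrative cs_ft object defined above (the pk_attributes argument name is an assumption):

cs_ft <- cs_ft |>
  lookup_table(pk_attributes = 'code')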
In the definition of the lookup table, it is verified that the indicated set of attributes actually forms a primary key. If they do not form a primary key, the same lookup_table() function can be used to group the data, using the aggregation functions indicated for the rest of the attributes and measures in the table.
Once we have a lookup table, we can join any other table to it using
its primary key, which will be the foreign key of the other table, as
shown below for the US cities table using the
join_lookup_table()
function. We only have to indicate the
attributes that make up the foreign key and the lookup table.
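A hedged sketch of this join: the foreign key of the US cities table is the capital attribute, which we assume is matched to the code primary key of the lookup table:

usc_ft <- usc_ft |>
  join_lookup_table(fk_attributes = 'capital', lookup = cs_ft)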
Below is the data table with the result obtained.
name | country.etc | capital | pop | lat | long | status |
---|---|---|---|---|---|---|
Abilene TX | TX | 0 | 113,888 | 32.5 | -99.7 | non-capital |
Akron OH | OH | 0 | 206,634 | 41.1 | -81.5 | non-capital |
Alameda CA | CA | 0 | 70,069 | 37.8 | -122.3 | non-capital |
Albany GA | GA | 0 | 75,510 | 31.6 | -84.2 | non-capital |
Albany NY | NY | 2 | 93,576 | 42.7 | -73.8 | state capital |
Albany OR | OR | 0 | 45,535 | 44.6 | -123.1 | non-capital |
First of all, we must define the US cities table as a lookup table. Its primary key is formed exclusively by the name attribute.
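A sketch of this definition (the pk_attributes argument name is an assumption):

usc_ft <- usc_ft |>
  lookup_table(pk_attributes = 'name')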
If we consult the Mortality Reporting System table again (for convenience it is shown again below), we verify that we do not have a field that corresponds directly to the primary key: we must merge two fields (City and State) to obtain it.
Year | WEEK | Week Ending Date | REGION | State | City | Pneumonia and Influenza Deaths | All Deaths | <1 year (all cause deaths) | 1-24 years (all cause deaths) | 25-44 years | 45-64 years (all cause deaths) | 65+ years (all cause deaths) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 1 | 01/06/1962 | 7 | OK | Tulsa | 2 | 69 | 5 | 1 | 5 | 17 | 41 |
1962 | 2 | 01/13/1962 | 1 | MA | Lynn | 1 | 28 | 1 | 0 | 2 | 11 | 14 |
1962 | 2 | 01/13/1962 | 1 | MA | Somerville | 2 | 21 | 2 | 0 | 1 | 5 | 13 |
1962 | 2 | 01/13/1962 | 5 | DC | Washington | 10 | 220 | 20 | 8 | 20 | 72 | 100 |
1962 | 2 | 01/13/1962 | 8 | CO | Colorado Springs | 2 | 12 | 0 | 1 | 0 | 1 | 10 |
1962 | 5 | 02/03/1962 | 2 | NJ | Elizabeth | 2 | 32 | 5 | 0 | 0 | 11 | 16 |
We can add new fields to a table using the
add_custom_column()
function. It is given the name of the
new field and a function that returns its values: the function receives
the original table as a parameter. Depending on the type of data
returned, it will be classified as an attribute or measure in the
flat_table object.
Below is the definition of the city_state() function and the new field obtained through it. We will make the changes in a temporary variable until we are sure that they are appropriate.
# function to define a derived column
city_state <- function(table) {
  paste0(table$City, ' ', table$State)
}

mrs_ft_TMP <- mrs_ft |>
  add_custom_column(name = 'city_state', definition = city_state)
Once we have the foreign key, before carrying out the join operation,
we can check if all the instances will find a match in the lookup table.
This operation is performed by the check_lookup_table()
function, with the same parameters as the
join_lookup_table()
function.
mrs_ft_TMP |>
  check_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)
#> # A tibble: 3 × 1
#> city_state
#> <chr>
#> 1 Washington DC
#> 2 Wilimington DE
#> 3 NA NA
In addition to the undefined data, we see that there are two instances that do not have a correspondence in the lookup table.
First, let's define a value for the undefined data. The replace_empty_values() function assigns them the value that we indicated in the unknown_value parameter when creating the flat_table object. If we do not indicate any attributes, it considers all the attributes of the table.
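A sketch of this step; since no attributes are indicated, all of them are considered:

mrs_ft <- mrs_ft |>
  replace_empty_values()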
We define the custom column again.
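This time we add it directly to the mrs_ft object, reusing the city_state() function defined above:

mrs_ft <- mrs_ft |>
  add_custom_column(name = 'city_state', definition = city_state)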
Once these changes have been made, we carry out the rest of the modifications to the data. We have considered it appropriate to correct the first value in the lookup table and the second in the table where the error is. Corrections are carried out using the replace_attribute_values() function.
usc_ft <- usc_ft |>
  replace_attribute_values(
    attributes = 'name',
    old = c('WASHINGTON DC'),
    new = c('Washington DC')
  )

mrs_ft <- mrs_ft |>
  replace_attribute_values(
    attributes = c('City', 'city_state'),
    old = c('Wilimington', 'Wilimington DE'),
    new = c('Wilmington', 'Wilmington DE')
  )
If we perform the check again, we observe that all the defined data has a correspondence in the lookup table and that the unknown values have the value indicated in the definition of the flat_table object.
mrs_ft |>
  check_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)
#> # A tibble: 1 × 1
#> city_state
#> <chr>
#> 1 Not available Not available
We perform the join operation using the
join_lookup_table()
function.
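A sketch of the join, with the city_state attribute as foreign key matching the name primary key of the lookup table:

mrs_ft <- mrs_ft |>
  join_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)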
Let’s check the attributes of the result. Using the
get_attribute_names()
function, we can indicate that we
want them in vector definition format.
mrs_ft |>
  get_attribute_names(as_definition = TRUE)
#> [1] "c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City', 'city_state', 'country.etc', 'capital', 'pop', 'lat', 'long', 'status')"
There are attributes that we do not need. Although it is not strictly necessary for our objective, we can exclude them using the select_attributes() function. We can use the vector definition obtained with the previous function to perform the selection.
mrs_ft <- mrs_ft |>
  select_attributes(
    attributes = c(
      'Year',
      'WEEK',
      'Week Ending Date',
      'REGION',
      'State',
      'City',
      'city_state',
      'status',
      'pop',
      'lat',
      'long'
    )
  )
As a result, we have a flat table with all the data we need.
We have made some transformations on the flat_table
objects in the previous section in order to carry out the join
operations. In this section we are going to perform some additional
transformations.
The original Mortality Reporting System data includes details related to the cause of death and others related to the age group of the people, without distinguishing the cause. That is, we have data at two different granularities or levels of detail. Let's separate them into different tables.
The separation is carried out by means of the separate_measures() function, in which we can indicate the lists of measures and the name of each new flat table that will form the result.
l_mrs_ft <- mrs_ft |>
  separate_measures(
    measures = list(
      c('Pneumonia and Influenza Deaths',
        'All Deaths'),
      c(
        '<1 year (all cause deaths)',
        '1-24 years (all cause deaths)',
        '25-44 years',
        '45-64 years (all cause deaths)',
        '65+ years (all cause deaths)'
      )
    ),
    names = c('mrs_cause', 'mrs_age')
  )

mrs_cause_ft <- l_mrs_ft[['mrs_cause']]
mrs_age_ft <- l_mrs_ft[['mrs_age']]
The result is that we have two flat_table
objects, one
with the data referring to the causes and the other with the age
data.
One of the transformations that we can make on the names of the table columns is to change them according to the snake case criterion. We apply this transformation to the table with cause data, as shown below.
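A hedged sketch of this step; we assume a snake_case() transformation function, consistent with the snake case attribute names shown later (e.g., week_ending_date):

mrs_cause_ft <- mrs_cause_ft |>
  snake_case()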
We can also rename attributes and measures using the
set_attribute_names()
and set_measure_names()
functions. In this case we don’t need to use them.
Below are the first instances of the flat table with data on people’s ages.
Year | WEEK | Week Ending Date | REGION | State | City | city_state | status | pop | lat | long | <1 year (all cause deaths) | 1-24 years (all cause deaths) | 25-44 years | 45-64 years (all cause deaths) | 65+ years (all cause deaths) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 1 | 01/06/1962 | 7 | OK | Tulsa | Tulsa OK | non-capital | 379,833 | 36.1 | -95.9 | 5 | 1 | 5 | 17 | 41 |
1962 | 2 | 01/13/1962 | 1 | MA | Lynn | Lynn MA | non-capital | 89,786 | 42.5 | -71.0 | 1 | 0 | 2 | 11 | 14 |
1962 | 2 | 01/13/1962 | 1 | MA | Somerville | Somerville MA | non-capital | 74,802 | 42.4 | -71.1 | 2 | 0 | 1 | 5 | 13 |
1962 | 2 | 01/13/1962 | 5 | DC | Washington | Washington DC | non-capital | 548,359 | 38.9 | -77.0 | 20 | 8 | 20 | 72 | 100 |
1962 | 2 | 01/13/1962 | 8 | CO | Colorado Springs | Colorado Springs CO | non-capital | 375,744 | 38.9 | -104.8 | 0 | 1 | 0 | 1 | 10 |
1962 | 5 | 02/03/1962 | 2 | NJ | Elizabeth | Elizabeth NJ | non-capital | 124,552 | 40.7 | -74.2 | 5 | 0 | 0 | 11 | 16 |
Number of rows in the table: 3342
Additionally, the number of rows in the table is shown.
The names of the measures can be considered values of a new attribute called age. The numerical values would correspond to a new measure that we can call all_deaths.
This is precisely the transformation carried out by the
transform_to_values()
function. It is not necessary to
indicate more parameters apart from the new names of the attribute and
the measure because it considers all the available measures to carry out
the described transformation.
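A hedged sketch of this transformation (the attribute and measure argument names are assumptions):

mrs_age_ft <- mrs_age_ft |>
  transform_to_values(attribute = 'age', measure = 'all_deaths')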
Once we have obtained, as values of the new attribute, what used to be column names, we can adapt the rest of the names according to the snake case criterion.
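Again assuming the snake_case() function:

mrs_age_ft <- mrs_age_ft |>
  snake_case()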
Finally, we can remove from the literals that describe the age the part that is now described by the new measure. To do this, we use the replace_string() function, which allows us to replace a string with the indicated value for the selected attributes. Unlike the replace_attribute_values() function, it does not require the instances to be equal to the indicated value, only to contain it.
mrs_age_ft <- mrs_age_ft |>
  replace_string(
    attributes = 'age',
    string = ' (all cause deaths)',
    replacement = ''
  )
The result can be seen in the table records shown below.
year | week | week_ending_date | region | state | city | city_state | status | pop | lat | long | age | all_deaths |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 1 | 01/06/1962 | 7 | OK | Tulsa | Tulsa OK | non-capital | 379,833 | 36.1 | -95.9 | <1 year | 5 |
1962 | 2 | 01/13/1962 | 1 | MA | Lynn | Lynn MA | non-capital | 89,786 | 42.5 | -71.0 | <1 year | 1 |
1962 | 2 | 01/13/1962 | 1 | MA | Somerville | Somerville MA | non-capital | 74,802 | 42.4 | -71.1 | <1 year | 2 |
1962 | 2 | 01/13/1962 | 5 | DC | Washington | Washington DC | non-capital | 548,359 | 38.9 | -77.0 | <1 year | 20 |
1962 | 2 | 01/13/1962 | 8 | CO | Colorado Springs | Colorado Springs CO | non-capital | 375,744 | 38.9 | -104.8 | <1 year | 0 |
1962 | 5 | 02/03/1962 | 2 | NJ | Elizabeth | Elizabeth NJ | non-capital | 124,552 | 40.7 | -74.2 | <1 year | 5 |
Number of rows in the table: 16565
With this transformation we would have the flat table with the appropriate format to define the star database from it.
In some cases we may be interested in performing the reverse
operation: transforming attribute values into variable names, taking the
value of the available measure. This operation can be performed using
the transform_from_values()
function, as shown below.
The result is shown below, along with the number of rows in the table.
year | week | week_ending_date | region | state | city | city_state | status | pop | lat | long | 1-24 years | 25-44 years | 45-64 years | 65+ years | <1 year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 1 | 01/06/1962 | 7 | OK | Tulsa | Tulsa OK | non-capital | 379,833 | 36.1 | -95.9 | 1 | 5 | 17 | 41 | 5 |
1962 | 2 | 01/13/1962 | 1 | MA | Lynn | Lynn MA | non-capital | 89,786 | 42.5 | -71.0 | 0 | 2 | 11 | 14 | 1 |
1962 | 2 | 01/13/1962 | 1 | MA | Somerville | Somerville MA | non-capital | 74,802 | 42.4 | -71.1 | 0 | 1 | 5 | 13 | 2 |
1962 | 2 | 01/13/1962 | 5 | DC | Washington | Washington DC | non-capital | 548,359 | 38.9 | -77.0 | 8 | 20 | 72 | 100 | 20 |
1962 | 2 | 01/13/1962 | 8 | CO | Colorado Springs | Colorado Springs CO | non-capital | 375,744 | 38.9 | -104.8 | 1 | 0 | 1 | 10 | 0 |
1962 | 5 | 02/03/1962 | 2 | NJ | Elizabeth | Elizabeth NJ | non-capital | 124,552 | 40.7 | -74.2 | 0 | 0 | 11 | 16 | 5 |
Number of rows in the table: 3342
If we compare the number of rows in this table with the original one, we see that they coincide.
The definition of the star schemas and star databases for the flat
tables obtained is shown in the following sections and is discussed in
detail in the vignette titled Basics and getting started with the
rolap package, vignette("rolap")
.
We consult the names of the fields to define the star schema.
mrs_cause_ft |>
  get_attribute_names(as_definition = TRUE)
#> [1] "c('year', 'week', 'week_ending_date', 'region', 'state', 'city', 'city_state', 'status', 'pop', 'lat', 'long')"
mrs_cause_ft |>
  get_measure_names(as_definition = TRUE)
#> [1] "c('pneumonia_and_influenza_deaths', 'all_deaths')"
We define dimensions and facts. We define the dimensions as variables to be able to share them more easily.
when <- dimension_schema(
  name = 'when',
  attributes = c(
    'year',
    'week',
    'week_ending_date'
  )
)

where <- dimension_schema(
  name = "where",
  attributes = c(
    'region',
    'state',
    'city',
    'city_state',
    'status',
    'pop',
    'lat',
    'long'
  )
)
s_cause <- star_schema() |>
  define_facts(fact_schema(
    name = 'mrs_cause',
    measures = c('pneumonia_and_influenza_deaths', 'all_deaths')
  )) |>
  define_dimension(when) |>
  define_dimension(where)
From the schema, we use the as_star_database()
function
that allows us to obtain a star_database
object from a
flat_table
object.
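A sketch of this step (the mrs_cause_db variable name is illustrative):

mrs_cause_db <- mrs_cause_ft |>
  as_star_database(s_cause)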
Below is the representation of the star database tables that we can
export to any RDBMS through the functionality of the dm
package.
In this case, we define the additional dimension of this design and the schema by reusing the previous definition of the other dimensions.
who <- dimension_schema(
  name = 'who',
  attributes = c(
    'age'
  )
)

s_age <- star_schema() |>
  define_facts(fact_schema(
    name = 'mrs_age',
    measures = c('all_deaths')
  )) |>
  define_dimension(when) |>
  define_dimension(where) |>
  define_dimension(who)
We obtain the star_database object from the flat_table object and the defined schema.
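A sketch (the mrs_age_db variable name is illustrative):

mrs_age_db <- mrs_age_ft |>
  as_star_database(s_age)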
We also show the graphical representation of the tables.
From the two star databases we define a constellation that shares common dimensions.
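A hedged sketch; we assume a constellation() function that takes a name and the star databases to integrate (the exact signature is not shown in this document):

mrs_db <- constellation("mrs", mrs_cause_db, mrs_age_db)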
And we show the graphic representation of the tables.
From the data obtained from the operational systems, we want to implement ROLAP star databases. The intermediate step that we consider is the flat table: the data from which a star database is obtained must be in the form of a flat table.
This document has shown, using an example, the functions available in the rolap package to define, transform and integrate flat tables from tables obtained from various data sources.