Obtaining and transforming flat tables

Introduction

The goal of the rolap package is to define transformations that allow you to easily obtain ROLAP star databases, composed by fact and dimension tables, from operational tables. The starting point for defining a star database with the rolap package is a flat table.

A flat database or flat-file database is a database that only contains a single table. A flat table is a generally denormalized table that is not related to other tables. It is not necessarily tidy data (in the sense of the tidyverse package) but in each column all the data must be of the same type, so that it can be easily stored in a RDBMS (Relational Database Management System). It is common for transactional systems to allow their data to be exported as a flat table.

The rolap package includes a set of operations that allow defining, transforming and integrating tables to obtain a flat table from which we can define star databases. To work with tables to obtain a flat table, the flat_table class is defined: Its main objective is to allow the integration of tables from different sources to obtain a flat table with all the data. It is described in this document using a practical example.

First, the datasets used in the example are presented. Then, functions are shown to define flat tables from various sources. It continues with a section dedicated to flat table join operations and another to flat table transformation operations. It finishes with the functions to define star databases and constellations from the flat tables and, finally, with the conclusions.

Data sets

This section describes the data sets used in the example.

122 US Cities Mortality Reporting System

To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System dataset.

We have developed the examples in this document using the dataset downloaded from the source. To reproduce them more easily, from the original file, we have stored in the package a file with the same format as the original file but that includes only 1% of its data, selected at random1, which is accessed below.

library(rolap)

file <-
  system.file(
    "extdata/mrs",
    "mrs_122_us_cities_1962_2016.csv",
    package = "rolap"
  )

mrs_ft <-
  read_flat_table_file(name = 'mrs', file, unknown_value = "Not available")

Using the read_flat_table_file() function we read a table stored in a text file and create a flat_table object with a name. Through the parameter unknown_value we can define the value to use when there is no value available.

Below are the first records of the table. We access the table using the get_table() function for the object of the flat_table class.

ft <- mrs_ft |> 
  get_table()

pander::pandoc.table(head(ft), split.table = Inf)
Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 1 01/06/1962 7 OK Tulsa 2 69 5 1 5 17 41
1962 2 01/13/1962 1 MA Lynn 1 28 1 0 2 11 14
1962 2 01/13/1962 1 MA Somerville 2 21 2 0 1 5 13
1962 2 01/13/1962 5 DC Washington 10 220 20 8 20 72 100
1962 2 01/13/1962 8 CO Colorado Springs 2 12 0 1 0 1 10
1962 5 02/03/1962 2 NJ Elizabeth 2 32 5 0 0 11 16

This data set has mortality data from 122 US cities. The geographic data that it includes corresponds to the REGION, State and City attributes of the table.

Apart from this data, we may be interested in adding additional geographical information, such as data on the population of these cities or their specific location.

Database of US cities

In package maps, we find the data set us.cities corresponding to a database of US cities with the information that interests us.

pander::pandoc.table(head(maps::us.cities), split.table = Inf)
name country.etc pop lat long capital
Abilene TX TX 113888 32.45 -99.74 0
Akron OH OH 206634 41.08 -81.52 0
Alameda CA CA 70069 37.77 -122.3 0
Albany GA GA 75510 31.58 -84.18 0
Albany NY NY 93576 42.67 -73.8 2
Albany OR OR 45535 44.62 -123.1 0

The content of the table fields is obvious, except for field capital, which is described in the help as containing capital status indication (0 for non-capital, 1 for capital, 2 for state capital).

Flat tables

Our goal is to have all available data, possibly from multiple data sources, in a single table, a flat table. We have defined the flat_table class to represent the tables, transform them and be able to integrate them.

This section, for the data sets that are used, describes the functions to create a flat_table object and to access and modify its characteristics.

Create flat_table objects

For the mortality data in US cities, we have created the flat_table object by directly reading the data from the file where it was stored, using the read_flat_table_file() function. The package also includes the read_flat_table_folder() function to read and integrate all text files in a folder into a flat_table object.

From a dataset in a data frame, using the flat_table() function, we obtain a flat_table object, as shown below for the US cities dataset.

usc_ft <-
  flat_table(name = 'us_cities', instances = maps::us.cities)

Using this function, we are going to create an additional flat_table object that allows us to associate the capital status codes with their description. To do this, we define before a data frame with the codes and their descriptions and, from it, we create the object, as shown below.

capital_status <- data.frame(
  code = c('0', '1', '2'),
  status = c('non-capital', 'capital', 'state capital')
)

cs_ft <-
  flat_table(name = 'capital_status', instances = capital_status)

flat_table attributes and measures

Following the criteria of OLAP system design, we distinguish between attributes and measures in the flat_table class, considering the focus of attention of the data:

Not all numerical data are necessarily measures, some should be considered as attributes, it mainly depends on the focus of attention that we consider.

When we define a flat table, its fields are classified based on their type as attributes or measures. Numeric fields are considered measures, the rest of the fields are attributes. No functions are used to find out the type, only the type defined in the starting data frame is considered. In the case of obtaining data from text files, all data is defined as character type, that is, it is classified as attributes.

For this example, our focus is the Mortality Reporting System data. We have to define the attributes and measures with this point of view.

Mortality Reporting System

Using the functions shown below we consult the attributes and measures.

mrs_ft |> 
  get_attribute_names()
#>  [1] "Year"                           "WEEK"                          
#>  [3] "Week Ending Date"               "REGION"                        
#>  [5] "State"                          "City"                          
#>  [7] "Pneumonia and Influenza Deaths" "All Deaths"                    
#>  [9] "<1 year (all cause deaths)"     "1-24 years (all cause deaths)" 
#> [11] "25-44 years"                    "45-64 years (all cause deaths)"
#> [13] "65+ years (all cause deaths)"

mrs_ft |> 
  get_measure_names()
#> NULL

Since the data comes from a text file, all fields have been defined as attributes.

Measures

This is a starting point, next we will see how to change the initial classification. Using the transform_to_measure() function, we transform attributes into measures. In some cases it is required to previously transform the data (remove the thousands separator or change the decimal separator), this can be indicated by the function parameters, although it is not necessary in this case.

mrs_ft <- mrs_ft |>
  transform_to_measure(
    attributes = c(
      'Pneumonia and Influenza Deaths',
      'All Deaths',
      '<1 year (all cause deaths)',
      '1-24 years (all cause deaths)',
      '25-44 years',
      '45-64 years (all cause deaths)',
      '65+ years (all cause deaths)'
    )
  )

mrs_ft |> 
  get_attribute_names()
#> [1] "Year"             "WEEK"             "Week Ending Date" "REGION"          
#> [5] "State"            "City"

mrs_ft |> 
  get_measure_names()
#> [1] "Pneumonia and Influenza Deaths" "All Deaths"                    
#> [3] "<1 year (all cause deaths)"     "1-24 years (all cause deaths)" 
#> [5] "25-44 years"                    "45-64 years (all cause deaths)"
#> [7] "65+ years (all cause deaths)"

Attributes

We can change the format of numeric attributes. The main motivation is usually that the alphabetical order coincides with the numerical order. This is often important in user query tools (e.g. Power BI).

Below is the change for the WEEK field. It is enough to indicate the width. The result is that it fills the width with white spaces on the left, thus achieving the stated objective (for example, output ” 2” instead of “2” to come before “10” when data is presented).

mrs_ft <- mrs_ft |>
  transform_attribute_format(attributes = c('WEEK'),
                             width = 2)

For example, in this way we will have as output ” 2” instead of “2”, so that alphabetically ordered it is placed before “10” instead of after it, as it’s shown in the following.

table <- mrs_ft |>
  get_table()

sort(unique(table[['WEEK']]))[1:10]
#>  [1] " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10"

US cities

In this case, the starting data was in a data frame, where each column has a data type associated with it. The initial classification is as follows.

usc_ft |> 
  get_attribute_names()
#> [1] "name"        "country.etc"

usc_ft |> 
  get_measure_names()
#> [1] "pop"     "lat"     "long"    "capital"

From the point of view of the Mortality Reporting System, all of this data can be considered as attributes. Using the parameters of the transform_to_attribute() function, we define the format of the transformation from numeric data to string, as shown below.

usc_ft <- usc_ft |>
  transform_to_attribute(measures = 'capital') |>
  transform_to_attribute(measures = 'pop',
                         width = 5) |>
  transform_to_attribute(measures = c('lat', 'long'),
                         width = 2,
                         decimal_places = 1)

In particular, by using the width parameter, as done before to transform the numerical attributes, we can make the order of the string data match the numerical order.

Capital status

For the capital status table, the columns have been defined as string, therefore both are attributes, which is appropriate for the focus of attention.

cs_ft |> 
  get_attribute_names()
#> [1] "code"   "status"

cs_ft |> 
  get_measure_names()
#> NULL

Joining flat tables

We currently have the data of interest in three tables and our goal is to have all the data in a single flat table: we have to join the tables.

To join tables represented by flat_table objects, we must define a lookup table, so that the other table must contain a foreign key that allows them to be related.

Joining US cities and capital status tables

Using the lookup_table() function we define the attributes that make up the primary key of a loop table, as shown below for the capital status table.

cs_ft <- cs_ft |>
  lookup_table(pk_attributes = 'code')

The primary key is made up of the code field.

In the definition of the lookup table, it is verified that the set of attributes that are indicated actually form a primary key. Likewise, if they do not form a primary key, the same lookup_table() function can be used to group the data using the aggregation functions indicated for the rest of the attributes and measures in the table.

Once we have a lookup table, we can join any other table to it using its primary key, which will be the foreign key of the other table, as shown below for the US cities table using the join_lookup_table() function. We only have to indicate the attributes that make up the foreign key and the lookup table.

usc_ft <- usc_ft |>
  join_lookup_table(fk_attributes = 'capital', lookup = cs_ft)

Below is the data table with the result obtained.

ft <- usc_ft |> 
  get_table()

pander::pandoc.table(head(ft), split.table = Inf)
name country.etc capital pop lat long status
Abilene TX TX 0 113,888 32.5 -99.7 non-capital
Akron OH OH 0 206,634 41.1 -81.5 non-capital
Alameda CA CA 0 70,069 37.8 -122.3 non-capital
Albany GA GA 0 75,510 31.6 -84.2 non-capital
Albany NY NY 2 93,576 42.7 -73.8 state capital
Albany OR OR 0 45,535 44.6 -123.1 non-capital

Joining Mortality Reporting System and US cities tables

First of all we must define table X as a lookup table. Its primary key is formed exclusively by the name attribute.

usc_ft <- usc_ft |>
  lookup_table(pk_attributes = 'name')

If we consult the Mortality Reporting System table again (for convenience it is shown again below), we verify that we do not have a field that corresponds directly to the primary key: we must merge two fields (City and State) to obtain it.

Year WEEK Week Ending Date REGION State City Pneumonia and Influenza Deaths All Deaths <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 1 01/06/1962 7 OK Tulsa 2 69 5 1 5 17 41
1962 2 01/13/1962 1 MA Lynn 1 28 1 0 2 11 14
1962 2 01/13/1962 1 MA Somerville 2 21 2 0 1 5 13
1962 2 01/13/1962 5 DC Washington 10 220 20 8 20 72 100
1962 2 01/13/1962 8 CO Colorado Springs 2 12 0 1 0 1 10
1962 5 02/03/1962 2 NJ Elizabeth 2 32 5 0 0 11 16

We can add new fields to a table using the add_custom_column() function. It is given the name of the new field and a function that returns its values: the function receives the original table as a parameter. Depending on the type of data returned, it will be classified as an attribute or measure in the flat_table object.

Below is the definition of function city_state() and the new field obtained through it. We will make the changes in a temporary variable, until we are sure that they are appropriate.

# function to define a derived column
city_state <- function(table) {
  paste0(table$City, ' ', table$State)
}

mrs_ft_TMP <- mrs_ft |>
  add_custom_column(name = 'city_state', definition = city_state)

Once we have the foreign key, before carrying out the join operation, we can check if all the instances will find a match in the lookup table. This operation is performed by the check_lookup_table() function, with the same parameters as the join_lookup_table() function.

mrs_ft_TMP |>
  check_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)
#> # A tibble: 3 × 1
#>   city_state    
#>   <chr>         
#> 1 Washington DC 
#> 2 Wilimington DE
#> 3 NA NA

In addition to the data that is not defined, we check that there are two instances that do not have a correspondence in the lookup table.

First, let’s define a value for the undefined data. Through the replace_empty_values() function, it assigns them the value that we indicated at the time of creating the flat_table object in the unknown_value parameter.

mrs_ft <- mrs_ft |>
  replace_empty_values()

If we do not indicate any attributes, it considers all the attributes of the table.

We define the custom column again.

mrs_ft <- mrs_ft |>
  add_custom_column(name = 'city_state', definition = city_state)

Once these changes have been made, we will carry out the rest of the modifications to the data. We have considered it appropriate to correct the first value in the lookup table and the second in the table where the error is. Corrections are carried out using the replace_attribute_values() function.

usc_ft <- usc_ft |>
  replace_attribute_values(
    attributes = 'name',
    old = c('WASHINGTON DC'),
    new = c('Washington DC')
  )

mrs_ft <- mrs_ft |>
  replace_attribute_values(
    attributes = c('City', 'city_state'),
    old = c('Wilimington', 'Wilimington DE'),
    new = c('Wilmington', 'Wilmington DE')
  )

If we perform the check again, we observe that all the defined data corresponds in the lookup table and that the unknown values have the value indicated in the definition of the flat_table object.

mrs_ft |>
  check_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)
#> # A tibble: 1 × 1
#>   city_state                 
#>   <chr>                      
#> 1 Not available Not available

We perform the join operation using the join_lookup_table() function.

mrs_ft <- mrs_ft |>
  join_lookup_table(fk_attributes = 'city_state', lookup = usc_ft)

Let’s check the attributes of the result. Using the get_attribute_names() function, we can indicate that we want them in vector definition format.

mrs_ft |>
  get_attribute_names(as_definition = TRUE)
#> [1] "c('Year', 'WEEK', 'Week Ending Date', 'REGION', 'State', 'City', 'city_state', 'country.etc', 'capital', 'pop', 'lat', 'long', 'status')"

There are attributes that we do not need, although it is not strictly necessary for our objective, we can exclude them using the select_attributes() function. We can use the definition of the vector obtained by the previous function to perform the selection.

mrs_ft <- mrs_ft |>
  select_attributes(
    attributes = c(
      'Year',
      'WEEK',
      'Week Ending Date',
      'REGION',
      'State',
      'City',
      'city_state',
      'status',
      'pop',
      'lat',
      'long'
    )
  )

As a result, we have a flat table with all the data we need.

Flat table transformation

We have made some transformations on the flat_table objects in the previous section in order to carry out the join operations. In this section we are going to perform some additional transformations.

Separate measures

The original Mortality Reporting System data includes details related to the cause of death and others to the age group of the people, but without including the cause. That is, we have data at two different granularities or levels of detail. Let’s separate them into different tables.

The separation is carried out by means of the separate_measures() function, in which we can indicate lists of measures and the name2 of each new flat table that will form the result.

l_mrs_ft <- mrs_ft |>
  separate_measures(measures = list(
    c('Pneumonia and Influenza Deaths',
      'All Deaths'),
    c(
      '<1 year (all cause deaths)',
      '1-24 years (all cause deaths)',
      '25-44 years',
      '45-64 years (all cause deaths)',
      '65+ years (all cause deaths)'
    )
  ),
  names = c('mrs_cause', 'mrs_age'))

mrs_cause_ft <- l_mrs_ft[['mrs_cause']]
mrs_age_ft <- l_mrs_ft[['mrs_age']]

The result is that we have two flat_table objects, one with the data referring to the causes and the other with the age data.

Transform column names

One of the transformations that we can do on the names of the table columns is to change them according to the snake case criterion. We apply this transformation for the table with cause data, as shown below.

mrs_cause_ft <- mrs_cause_ft |>
  snake_case()

We can also rename attributes and measures using the set_attribute_names() and set_measure_names() functions. In this case we don’t need to use them.

Transform measure names into attribute values

Below are the first instances of the flat table with data on people’s ages.

Year WEEK Week Ending Date REGION State City city_state status pop lat long <1 year (all cause deaths) 1-24 years (all cause deaths) 25-44 years 45-64 years (all cause deaths) 65+ years (all cause deaths)
1962 1 01/06/1962 7 OK Tulsa Tulsa OK non-capital 379,833 36.1 -95.9 5 1 5 17 41
1962 2 01/13/1962 1 MA Lynn Lynn MA non-capital 89,786 42.5 -71.0 1 0 2 11 14
1962 2 01/13/1962 1 MA Somerville Somerville MA non-capital 74,802 42.4 -71.1 2 0 1 5 13
1962 2 01/13/1962 5 DC Washington Washington DC non-capital 548,359 38.9 -77.0 20 8 20 72 100
1962 2 01/13/1962 8 CO Colorado Springs Colorado Springs CO non-capital 375,744 38.9 -104.8 0 1 0 1 10
1962 5 02/03/1962 2 NJ Elizabeth Elizabeth NJ non-capital 124,552 40.7 -74.2 5 0 0 11 16

Number of rows in the table: 3342

Additionally, the number of rows in the table is shown.

The names of the measures can be considered values of a new attribute called age. The numerical values would correspond to a new measure that we can call all_deaths.

This is precisely the transformation carried out by the transform_to_values() function. It is not necessary to indicate more parameters apart from the new names of the attribute and the measure because it considers all the available measures to carry out the described transformation.

mrs_age_ft <- mrs_age_ft |>
  transform_to_values(attribute = 'age',
                      measure = 'all_deaths')

Once we have obtained the values of the new variable that were in the form of column names, we can adapt the rest of the name according to the snake case criterion.

mrs_age_ft <- mrs_age_ft |>
  snake_case()

Finally, we can eliminate from the literals that describe the age, the part that is now described by the new measure. To do this, we use the replace_string() function that allows us to replace a string with the value indicated for the selected attributes. This function, unlike function replace_attribute_values(), does not look for the instances to be equal to the indicated value but rather to contain it.

mrs_age_ft <- mrs_age_ft |>
  replace_string(
    attributes = 'age',
    string = ' (all cause deaths)',
    replacement = ''
  )

The result can be seen in the table records shown below.

year week week_ending_date region state city city_state status pop lat long age all_deaths
1962 1 01/06/1962 7 OK Tulsa Tulsa OK non-capital 379,833 36.1 -95.9 <1 year 5
1962 2 01/13/1962 1 MA Lynn Lynn MA non-capital 89,786 42.5 -71.0 <1 year 1
1962 2 01/13/1962 1 MA Somerville Somerville MA non-capital 74,802 42.4 -71.1 <1 year 2
1962 2 01/13/1962 5 DC Washington Washington DC non-capital 548,359 38.9 -77.0 <1 year 20
1962 2 01/13/1962 8 CO Colorado Springs Colorado Springs CO non-capital 375,744 38.9 -104.8 <1 year 0
1962 5 02/03/1962 2 NJ Elizabeth Elizabeth NJ non-capital 124,552 40.7 -74.2 <1 year 5

Number of rows in the table: 16565

With this transformation we would have the flat table with the appropriate format to define the star database from it.

In some cases we may be interested in performing the reverse operation: transforming attribute values into variable names, taking the value of the available measure. This operation can be performed using the transform_from_values() function, as shown below.

mrs_age_ft_TMP <- mrs_age_ft |>
  transform_from_values(
    attribute = 'age'
  )

The result is shown below, along with the number of rows in the table.

year week week_ending_date region state city city_state status pop lat long 1-24 years 25-44 years 45-64 years 65+ years <1 year
1962 1 01/06/1962 7 OK Tulsa Tulsa OK non-capital 379,833 36.1 -95.9 1 5 17 41 5
1962 2 01/13/1962 1 MA Lynn Lynn MA non-capital 89,786 42.5 -71.0 0 2 11 14 1
1962 2 01/13/1962 1 MA Somerville Somerville MA non-capital 74,802 42.4 -71.1 0 1 5 13 2
1962 2 01/13/1962 5 DC Washington Washington DC non-capital 548,359 38.9 -77.0 8 20 72 100 20
1962 2 01/13/1962 8 CO Colorado Springs Colorado Springs CO non-capital 375,744 38.9 -104.8 1 0 1 10 0
1962 5 02/03/1962 2 NJ Elizabeth Elizabeth NJ non-capital 124,552 40.7 -74.2 0 0 11 16 5

Number of rows in the table: 3342

If we compare the number of rows in the table with the original, we see that it coincides.

Definition of the star databases

The definition of the star schemas and star databases for the flat tables obtained is shown in the following sections and is discussed in detail in the vignette titled Basics and getting started with the rolap package, vignette("rolap").

MRS Cause star database

We consult the names of the fields to define the star schema.

mrs_cause_ft |>
  get_attribute_names(as_definition = TRUE)
#> [1] "c('year', 'week', 'week_ending_date', 'region', 'state', 'city', 'city_state', 'status', 'pop', 'lat', 'long')"

mrs_cause_ft |>
  get_measure_names(as_definition = TRUE)
#> [1] "c('pneumonia_and_influenza_deaths', 'all_deaths')"

We define dimensions and facts. We define the dimensions as variables to be able to share them more easily.

when <- dimension_schema(
  name = 'when',
  attributes = c(
    'year', 
    'week', 
    'week_ending_date'
  )
)
where <- dimension_schema(
  name = "where",
  attributes = c(
    'region',
    'state',
    'city',
    'city_state',
    'status',
    'pop',
    'lat',
    'long'
  )
)
s_cause <- star_schema() |>
  define_facts(fact_schema(
    name = 'mrs_cause',
    measures = c('pneumonia_and_influenza_deaths', 'all_deaths')
  )) |>
  define_dimension(when) |>
  define_dimension(where)

From the schema, we use the as_star_database() function that allows us to obtain a star_database object from a flat_table object.

mrs_cause_db <- mrs_cause_ft |>
  as_star_database(s_cause)

Below is the representation of the star database tables that we can export to any RDBMS through the functionality of the dm package.

db_dm <- mrs_cause_db |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

MRS Age star database

In this case, we define the additional dimension of this design and the schema by reusing the previous definition of the other dimensions.

who <- dimension_schema(
  name = 'who',
  attributes = c(
    'age'
  )
)
s_age <- star_schema() |>
  define_facts(fact_schema(
    name = 'mrs_age',
    measures = c('all_deaths')
  )) |>
  define_dimension(when) |>
  define_dimension(where) |>
  define_dimension(who)

We obtain the star_dabase object from the flat_table object and the defined schema.

mrs_age_db <- mrs_age_ft |>
  as_star_database(s_age)

We also show the graphical representation of the tables.

db_dm <- mrs_age_db |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

Constellation

From the two star databases we define a constellation that shares common dimensions.

mrs_db <- constellation("mrs", mrs_cause_db, mrs_age_db)

And we show the graphic representation of the tables.

db_dm <- mrs_db |>
  as_dm_class(pk_facts = FALSE)
db_dm |> 
  dm::dm_draw(rankdir = "LR", view_type = "all")

Conclusions

From the data obtained from the operational systems we want to implement ROLAP star databases. The intermediate step that we consider are flat tables: the data from which a star database is obtained must be in the form of a flat table.

This document shows, using an example, the functions available in the rolap package to define, transform and integrate flat tables from tables obtained from various data sources.


  1. Additionally, data from Boston and Baltimore have been excluded to show the incremental data refresh.↩︎

  2. If we indicate more names than groups of measures, the measures not included in other groups are also included in a new group.↩︎