Harmonizing Product Codes with R
Christoph Baumgartner
Stjepan Srhoj
Janette Walde
Abstract
Innovation is a major engine of economic growth. To compare products
over time, harmonization of product codes is mandatory. This package
provides an easy-to-use approach to harmonize product codes.
Moreover, it offers an application that allows finding all new and
dropped products for given firm-level data based on harmonized product
codes.
 
This package provides several functions to harmonize CN8
product codes (Combined
Nomenclature 8 digits) as well as PC8 product codes (Production
Communautaire 8 digits), HS6 (Harmonized
System 6 digits) and BEC (Broad
Economic Categories). All functions are listed below:
Idea behind harmonization
The basic idea behind the harmonization is to keep track of every
single code over a given time period. In the simplest case, a code does
not change during the examined period, i.e. no harmonization is needed.
In any other case, all changes associated with a specific code need to
be documented. There are different kinds of changes: 1 to 1, 1 to many,
many to 1, many to many, 1 to none, and none to 1. The last two mean
that a code was dropped or that a new code was created, respectively.
These two kinds are only possible for the PC8 and not for the CN8
classification. 1 to many and many to 1 changes occur if a code is split
into several codes or several codes are merged into one, respectively.
A ‘mixture’ of changes is also possible, e.g. a code can merge with
another code and keep its notation at the same time.
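The kinds of changes listed above can be told apart by counting, for each entry of a change list, how many replacements the old code has and how many predecessors the new code has. The following is a minimal sketch in base R. It uses the first rows of the default CN8 concordance list shown in the Data Sets section. The helper `classify_changes()` is hypothetical and not part of the package, and it labels each row locally rather than looking at the whole group of connected codes.

```r
# Hypothetical helper (not part of the package): label every row of a
# change list by the kind of change it represents, seen from that row.
classify_changes <- function(conc) {
  n_new <- table(conc$obsolete)  # number of replacements per obsolete code
  n_old <- table(conc$new)       # number of predecessors per new code
  out_deg <- as.integer(n_new[conc$obsolete])
  in_deg  <- as.integer(n_old[conc$new])
  ifelse(out_deg == 1 & in_deg == 1, "1 to 1",
  ifelse(out_deg >  1 & in_deg == 1, "1 to many",
  ifelse(out_deg == 1 & in_deg >  1, "many to 1", "many to many")))
}

# First rows of the default CN8 concordance list (1988 to 1989).
conc <- data.frame(
  from     = 1988,
  to       = 1989,
  obsolete = c("02012011", "02012011", "02012019", "03036010", "03036010"),
  new      = c("02012021", "02012029", "02012029", "03036011", "03036019"),
  stringsAsFactors = FALSE
)
conc$type <- classify_changes(conc)
print(conc)
```

Codes are kept as character strings throughout so that leading zeros survive.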
More technically, an (n + m) x k matrix, referred to below as the
‘history matrix’, is designed to store all the information, where n is
the number of observations, m is the number of rows added due to
changes and k the number of years. In this matrix every row represents
the history of a particular code. For every code that has multiple
replacements in one year, a new row has to be added to the matrix. This
is necessary since one has to keep track of the changed codes as well.
History matrices are designed for both the CN8 and the PC8
classification. They are used as an input for the final harmonization.
In the further process, the history matrices are extended by new
variables, such as additional classification systems (HS6, BEC), binary
indicator variables and the harmonized code itself.
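To illustrate how rows are added to a history matrix, the sketch below extends a list of codes by one year with a left join against a change list. It is a simplified illustration in base R with toy data, not the package's implementation. The column names `CN8_1988` and `CN8_1989` follow the CN8_xxxx pattern used in this vignette, and the selection of codes is an assumption made for the example.

```r
# Codes assumed to exist in the first year (toy selection for illustration).
codes_1988 <- data.frame(
  CN8_1988 = c("01011100", "02012011", "02012019"),
  stringsAsFactors = FALSE
)

# Changes between 1988 and 1989 in the "obsolete"/"new" layout.
changes <- data.frame(
  obsolete = c("02012011", "02012011", "02012019"),
  new      = c("02012021", "02012029", "02012029"),
  stringsAsFactors = FALSE
)

# Left join: a code with several replacements yields several rows,
# which is where the m additional rows come from.
history <- merge(codes_1988, changes,
                 by.x = "CN8_1988", by.y = "obsolete", all.x = TRUE)

# Codes without an entry in the change list carry over unchanged.
unchanged <- is.na(history$new)
history$new[unchanged] <- history$CN8_1988[unchanged]
names(history)[names(history) == "new"] <- "CN8_1989"
print(history)
```

Here n = 3 codes lead to a matrix with n + m = 4 rows, because the code 02012011 has two replacements.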
The goal of the harmonization is to make a comparison possible for
the given time period. Therefore a baseline is created, defined in the
last year of the period. This information is stored in the variable
CN8plus (or PC8plus respectively). To obtain it, the harmonization is
done in several steps. Firstly, one has to check whether a code did not
change, i.e. the code is the same in every year between the first and
the last year of interest and no additional change occurred. If this is
the case, the code does not need further harmonization and is used as
CN8plus directly. Secondly, all connections among codes need to be
documented. This means that if a code was split or merged into several
codes, for example, these codes are grouped together from then on
(referred to below as a family). If a mixture of changes is present,
e.g. a code was split and kept its notation in the same year, this
information is stored in the variable ‘flag’. However, codes with flag
equal to 1 are also part of a certain family. In summary, taking all
this information into account yields CN8plus, which therefore includes
all codes that did not change and all families. Furthermore, the
variable flag can be 2 or 3. A flag equal to 2 means that the code is
either new or was dropped during the period of interest. If it is 3,
the code had at least one simple change but is not associated with a
family.
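Grouping connected codes into families can be viewed as finding connected components in the graph whose edges are the entries of the change list. The sketch below does this in base R by repeatedly passing the smallest group label along every link. It is only one possible way to form such groups. The helper `find_families()` is hypothetical, and the sketch ignores the distinction between families and simple changes (flag 3) described above.

```r
# Hypothetical helper: give the same label to all codes that are linked,
# directly or indirectly, through entries of the change list.
find_families <- function(obsolete, new) {
  codes <- unique(c(obsolete, new))
  label <- setNames(seq_along(codes), codes)  # every code starts alone
  repeat {
    old_label <- label
    for (i in seq_along(obsolete)) {
      # Both ends of a link receive the smaller of their two labels.
      m <- min(label[[obsolete[i]]], label[[new[i]]])
      label[[obsolete[i]]] <- m
      label[[new[i]]] <- m
    }
    if (identical(label, old_label)) break    # nothing changed: done
  }
  label
}

obsolete <- c("02012011", "02012011", "02012019", "03036010", "03036010")
new      <- c("02012021", "02012029", "02012029", "03036011", "03036019")
fam <- find_families(obsolete, new)
print(split(names(fam), fam))
```

With these rows, the four 0201 codes end up in one group and the three 0303 codes in another.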
The additional classification systems are based on CN8plus
(or PC8plus respectively). Since CN8 and HS6 are closely
connected (HS6 codes are the first six digits of CN8 codes), this
transformation is mostly straightforward and is stored in the variable
HS6. However, HS6 also changes its classification over time according
to a separate change list. Taking these lists into account yields the
variable HS6plus. For PC8 codes it is not that easy. Here, all PC8
codes first need to be translated into CN8 codes (if possible), and
afterwards the same procedure as for CN8 codes is used. Note that not
all PC8 codes have a corresponding CN8 code. Therefore, some codes
might be lost in this step.
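Both steps can be sketched in a few lines of base R. The first part derives HS6 as the first six characters of a CN8 code. The second part translates PC8 codes through a PC8 to CN8 concordance and shows how a missing concordance leads to a lost code. The column names PRCCODE and CNCODE and the rows are taken from the 2010 example in the Data Sets section. The object names are assumptions made for this sketch.

```r
# HS6 is the first six digits of CN8; codes must be character strings.
cn8plus <- c("01011100", "02012029", "25151200")
hs6 <- substr(cn8plus, 1, 6)
print(hs6)

# PC8 to CN8 concordance; NA marks a PC8 code without a CN8 counterpart.
pc8_cn8 <- data.frame(
  PRCCODE = c("10131430", "8111136", "8111150"),
  CNCODE  = c(NA, "25151200", "25152000"),
  stringsAsFactors = FALSE
)
pc8 <- data.frame(PC8 = c("8111136", "10131430"), stringsAsFactors = FALSE)

# Left join keeps every PC8 code; unmatched translations stay NA.
translated <- merge(pc8, pc8_cn8,
                    by.x = "PC8", by.y = "PRCCODE", all.x = TRUE)
translated$HS6 <- substr(translated$CNCODE, 1, 6)
print(translated)
```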
The BEC classification system is also based on CN8plus.
Concordance lists between CN8 and BEC are used to derive this
classification. This information is stored in several ways: higher (one
digit) and lower (up to 3 digits) aggregation as well as the basic class
of the good. If CN8plus contains a family, it is not possible
to assign either an HS6 or a BEC classification, because a family can
include several different codes, each with several different related
HS6 and BEC codes. The resulting matrix, which we will call the
harmonization matrix from now on, is therefore an (n + m) x (k +
l) matrix, where the additional parameter l describes the
number of added columns.
Concordance files between HS6 and BEC Rev. 4 exist for 1996, 2002,
and 2007. For 2012 and 2017, there exists a concordance between HS6 and
BEC Rev. 5. Therefore, we provide BEC codes from Rev. 4 until 2011 and
BEC codes from Rev. 5 thereafter. Moreover, BEC codes can be classified
into three classes defined by the System
of National Accounts (SNA), which focus on the end-use of the
product.
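The rule for the BEC revision can be written down directly. In the sketch below, the BEC revision follows the statement above (Rev. 4 until 2011, Rev. 5 thereafter). Choosing the most recent concordance year that is not later than the year of interest is an assumption made for illustration. The helper `pick_concordance()` is hypothetical and is not a function of the package.

```r
# Years for which an HS6 to BEC concordance is mentioned in the text.
hs_breaks <- c(1996, 2002, 2007, 2012, 2017)

# Hypothetical helper: for each year, pick the latest concordance year
# that is not later than it (an assumption) and the BEC revision.
pick_concordance <- function(year) {
  idx <- findInterval(year, hs_breaks)
  data.frame(
    year         = year,
    hs_version   = ifelse(idx == 0, NA, hs_breaks[pmax(idx, 1)]),
    bec_revision = ifelse(year <= 2011, "Rev. 4", "Rev. 5"),
    stringsAsFactors = FALSE
  )
}

res <- pick_concordance(c(2005, 2011, 2012, 2020))
print(res)
```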
 
Main Functions
harmonize_cn8() 
provides for a given time period a data frame that contains all
CN8 product codes and their history, harmonized
CN8plus codes, harmonized HS6plus codes, and
BEC classification. The “plus-codes” are the main outcome of
the function. They provide harmonized information of the product codes,
i.e. comparable codes. Every harmonization refers to the last/first year
of interest. The following table offers an overview of all provided
variables.
| CN8_xxxx | a specific CN8 code in a given year | 
| CN8plus | the harmonization code for CN8, which refers to the
last/first year of the time period | 
| HS6plus | the harmonization code of HS6, which refers to the
last/first year of the time period | 
| BEC | provides the BEC classification at a high aggregation
level (1 digit) | 
| BEC_agr | provides the BEC classification at a lower aggregation
level (up to 3 digits) | 
| SNA | provides information if the code is classified as
consumption, capital or intermediate good in SNA | 
| flag | integer from 0 to 3; 1 indicates that this code
remained the same in notation over the whole time period but was split
or merged in addition; 2 indicates that this code is either new or was
dropped during the period of interest; 3 indicates the code had at least
one simple change, but is not associated with a family | 
| flagyear | indicates the first year in which the flag was set | 
For more application details, see ?harmonize_cn8.
harmonize_pc8() 
provides for a given time period a data frame that contains all
PC8 product codes and their history, harmonized
PC8plus codes, harmonized HS6plus codes, and
BEC classification. The “plus-codes” are the main outcome of
the function. They provide harmonized information of the product codes,
i.e. comparable codes. Every harmonization refers to the last/first year
of interest. The following table offers an overview of all provided
variables.
| PC8_xxxx | a specific PC8 code in a given year | 
| PC8plus | the harmonization code for PC8, which refers to the
last/first year of the time period | 
| HS6plus | the harmonization code of HS6, which refers to the
last/first year of the time period | 
| BEC | provides the BEC classification at a high aggregation
level (1 digit) | 
| BEC_agr | provides the BEC classification at a lower aggregation
level (up to 3 digits) | 
| SNA | provides information if the code is classified as
consumption, capital or intermediate good in SNA | 
| flag | integer from 0 to 3; 1 indicates that this code
remained the same in notation over the whole time period but was split
or merged in addition; 2 indicates that this code is either new or was
dropped during the period of interest; 3 indicates the code had at least
one simple change, but is not associated with a family | 
| flagyear | indicates the first year in which the flag was set | 
For more application details, see ?harmonize_pc8.
 
Support Functions
All support functions are used within the main functions. They
provide intermediate steps to harmonize the data. However, they can be
used as stand-alone functions as well.
history_cn8() 
provides a data frame that contains all CN8 product codes
and their history over time for the demanded time period. This dataset
is the basis for the main function harmonize_cn8() and can
be obtained therewith as well. The following table offers an overview of
all provided variables.
| CN8_xxxx | a specific CN8 code in a given year | 
| flag | integer from 0 to 2; 1 indicates that this code
remained the same in notation over the whole time period but was split
or merged in addition; 2 indicates that this code is either new or was
dropped during the period of interest | 
| flagyear | indicates the first year in which the flag was set | 
For more application details, see ?history_cn8.
history_pc8() 
provides a data frame that contains all PC8 product codes
and their history over time for the demanded time period. This dataset
is the basis for the main function harmonize_pc8() and can
be obtained therewith as well. The following table offers an overview of
all provided variables.
| PC8_xxxx | a specific PC8 code in a given year | 
| flag | integer from 0 to 2; 1 indicates that this code
remained the same in notation over the whole time period but was split
or merged in addition; 2 indicates that this code is either new or was
dropped during the period of interest | 
| flagyear | indicates the first year in which the flag was set | 
For more application details, see ?history_pc8.
cn8_to_bec() 
provides a data frame that contains all CN8 product codes
and related BEC and HS6 codes in a given time period.
Therefore, this data serves as a connection between CN8 and
BEC classification and between CN8 and HS6
classification. It forms the basis of some output of the main function,
namely: BEC, BEC_agr, SNA and
HS6plus. The following table offers an overview of all provided
variables.
| CN8 | a specific CN8 code | 
| HS6 | provides the HS6 classification of the CN8plus
code | 
| BEC | provides the BEC classification on a high aggregation
level (1 digit) | 
| BEC_agr | provides the BEC classification on a lower aggregation
level (up to 3 digits) | 
For more application details, see ?cn8_to_bec.
pc8_to_bec() 
provides a data frame that contains all PC8 product codes
and related BEC and HS6 codes in a given time period.
Therefore, this data serves as a connection between PC8 and
BEC classification and between PC8 and HS6
classification. It forms the basis of some output of the main function,
namely: BEC, BEC_agr, SNA and
HS6plus. The following table offers an overview of all provided
variables.
| PC8_xxxx | a specific PC8 code | 
| HS6 | provides the HS6 classification of the PC8plus
code | 
| BEC | provides the BEC classification on a high aggregation
level (1 digit) | 
| BEC_agr | provides the BEC classification on a lower aggregation
level (up to 3 digits) | 
For more application details, see ?pc8_to_bec.
get_data_directory() 
provides the directory where custom data must be stored and the used
data (e.g., concordance lists, list of codes) can be edited. However,
before editing the employed data or using additional concordance lists
for example, it is highly recommended to read first the instructions in
this vignette carefully (also see section Data
Sets and Custom Data). The directory is
provided in the R console. Further features (such as opening a file
explorer or printing the available data in the console) are only
available if the directory path does not contain any blanks.
For more application details, see ?get_data_directory.
 
Additional Functions
These functions go beyond the primary purpose of this package. The
additional functions provide an application of the data frames obtained
by the main functions. To use them, firm-level data is required, which
is not provided by the package. The firm-level data must contain
columns with the following names: ID, year and CN8 or PC8.
Other columns may exist; however, they will not be used by the function.
The following table summarizes the variables that need to be included in
the firm-level data.
| ID | specific code that describes a firm over the years
(this code does not change over time) | 
| year | year in which the firm produced a product | 
| CN8 | CN8 code of firm product | 
| PC8 | PC8 code of firm product | 
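Before passing firm-level data on, it can be useful to check that the required columns are present. The following sketch does this in base R. The helper `check_firm_data()` is hypothetical and not part of the package. The checks for character storage and for a length of eight characters are assumptions based on the advice on leading zeros in the Custom Data section.

```r
# Hypothetical helper: verify that firm-level data has the required columns.
check_firm_data <- function(firm, code_col = c("CN8", "PC8")) {
  code_col <- match.arg(code_col)
  required <- c("ID", "year", code_col)
  missing_cols <- setdiff(required, names(firm))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  codes <- firm[[code_col]]
  # Assumption: codes are stored as character so leading zeros are kept.
  if (!is.character(codes)) {
    stop(code_col, " must be stored as character to keep leading zeros")
  }
  # Assumption: every code is expected to be eight characters long.
  if (any(nchar(codes) != 8, na.rm = TRUE)) {
    stop(code_col, " codes must be eight characters long")
  }
  invisible(TRUE)
}

# Toy firm-level data with the required columns ID, year and CN8.
firm <- data.frame(
  ID   = c("A", "A", "B"),
  year = c(2010, 2011, 2010),
  CN8  = c("01011100", "01011910", "02012029"),
  stringsAsFactors = FALSE
)
check_firm_data(firm, "CN8")
```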
utilize_cn8() 
may provide two data frames:
a. A data frame that contains all changed CN8 product codes per
firm per year. In more detail, it reports how many products remained
the same, were added, or were dropped, how many products were produced
by a certain firm in a given year, and how many products were produced
in the year after. CN8plus codes or HS6plus codes can be used as the
basis of this computation.
b. A data frame that is based on the entered firm data. The entered
firm data is extended by harmonized data (that is CN8plus,
flag, flagyear, HS6plus, BEC,
BEC_agr, SNA).
The tables at the end of this section offer an overview of all
provided variables.
utilize_pc8() 
may provide two data frames:
a. A data frame that contains all changed PC8 product codes per
firm per year. In more detail, it reports how many products remained
the same, were added, or were dropped, the value of the
same/added/dropped products, how many products were produced by a
certain firm in a given year, and how many products were produced in
the year after. PC8plus codes or HS6plus codes can be used as the basis
of this computation.
b. A data frame that is based on the entered firm data. The entered
firm data is extended by harmonized data (that is PC8plus,
flag, flagyear, HS6plus, BEC,
BEC_agr, SNA).
The tables at the end of this section offer an overview of all
provided variables.
Since the provided data frames do not differ between
utilize_cn8() and utilize_pc8(), in terms of
notation, the tables are only provided once here.
Table that summarizes the output, described by the notation
a. above:
| firmID | specific code that describes a firm over the years
(this code does not change over time) | 
| period_UL | upper limit of the time period | 
| period | time period in which the product was produced | 
| gap | indicating if the time period is greater than one
(i.e. upper limit - lower limit > 1) | 
| same_products | number of products that were produced in both years
(i.e. remained in the product portfolio of this firm) | 
| value_same_products | value of products that were produced in both years
(i.e. remained in the product portfolio of this firm); the value is
calculated in the upper limit of the time period | 
| new_products | number of added products in the upper limit of the time
period (i.e. added to the product portfolio of this firm) | 
| value_new_products | value of added products in the upper limit of the time
period (i.e. added to the product portfolio of this firm) | 
| dropped_products | number of dropped products in the upper limit of the
time period (i.e. removed from the product portfolio of this firm) | 
| value_dropped_products | value of dropped products in the upper limit of the
time period (i.e. removed from the product portfolio of this firm); the
value is calculated in the lower limit of the time period | 
| nbr_of_products_period_LL | number of all products produced in the lower limit of
the time period (i.e. entire product portfolio of this firm) | 
| nbr_of_products_period_UL | number of all products produced in the upper limit of
the time period (i.e. entire product portfolio of this firm) | 
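The counts in this table can be reproduced for a single firm and a single pair of years with set operations on harmonized codes. The sketch below uses toy data and base R only. It illustrates the definitions in the table and is not the package's implementation. The object names are assumptions.

```r
# Toy firm-level data that already carries harmonized codes.
firm <- data.frame(
  ID      = "A",
  year    = c(2010, 2010, 2010, 2011, 2011),
  CN8plus = c("01011100", "01011910", "01011990", "01011910", "01012010"),
  stringsAsFactors = FALSE
)

ll <- unique(firm$CN8plus[firm$year == 2010])  # portfolio at the lower limit
ul <- unique(firm$CN8plus[firm$year == 2011])  # portfolio at the upper limit

summary_row <- data.frame(
  firmID                    = "A",
  same_products             = length(intersect(ll, ul)),
  new_products              = length(setdiff(ul, ll)),
  dropped_products          = length(setdiff(ll, ul)),
  nbr_of_products_period_LL = length(ll),
  nbr_of_products_period_UL = length(ul)
)
print(summary_row)
```

In this example one product remains, one is added, and two are dropped.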
Table that summarizes the output, described by the notation
b. above:
| firmID | specific code that describes a firm over the years
(this code does not change over time, provided by user) | 
| year | year in which the firm produced a product (provided by
user) | 
| CN8 | CN8 code of firm product (provided by user) | 
| PC8 | PC8 code of firm product (provided by user) | 
| (value) | value of the corresponding product code (may be
provided by user) | 
| … | additional columns from original firm data (provided by
user) | 
| CN8plus | final harmonization, which refers to the last year of
the time period | 
| PC8plus | final harmonization, which refers to the last year of
the time period | 
| flag | integer from 0 to 3; 1 indicates that this code
remained the same in notation over the whole time period but was split
or merged in addition; 2 indicates that this code is either new or was
dropped during the period of interest; 3 indicates the code had at least
one simple change, but is not associated with a family | 
| HS6 | provides the HS6 classification of the PC8plus /
CN8plus code | 
| HS6plus | also adjusts for the change lists of HS6 | 
| BEC | provides the BEC classification on a high aggregated
level (1 digit) | 
| BEC_agr | provides the BEC classification on a less aggregated
level (up to 3 digits) | 
| SNA | provides information if the code is classified as
consumption, capital or intermediate good in SNA | 
 
Data Sets
By default, the package provides several data sets for CN8-, PC8-,
HS6- and BEC-classification. This data allows for harmonization of CN8
product codes between 1995 and 2022 and PC8 product codes between 2007
and 2017. All available data sets are stored within the package. The
function get_data_directory() provides support to access
the data more easily. All data included in the package was originally
downloaded from the EU server Ramon
and altered if needed.
Provided data in more detail:
- CN8 data  is provided in
the corresponding CN8 folder. This folder contains two different types
of files. Firstly, a list of all existing CN8 codes for every year,
e.g. for the year 2000, CN8_2000.rds. More technically
speaking, these files provide a data frame with one column and
n rows, where n is the number of existing CN8 codes in
a given year. An example (first six lines) of the year 2000 is the
following:
     group
1 01011100
2 01011910
3 01011990
4 01012010
5 01012090
6 01021010
Secondly, the CN8 folder contains a concordance list of all CN8 codes
over time, a .csv file with a semicolon (“;”) as separator. A header is
necessary. The header names must be the
following: from, to, obsolete, new.
The period between “from” and “to” is always one year and describes when
the code changed. The “obsolete” and “new” codes represent the outdated
code and its replacement, respectively. The first six lines of the
default csv-file look like the following:
from;to;obsolete;new
1988;1989;02012011;02012021
1988;1989;02012011;02012029
1988;1989;02012019;02012029
1988;1989;03036010;03036011
1988;1989;03036010;03036019
 
- PC8 data  is provided in
the corresponding PC8 folder. This folder contains two different types
of files. Firstly, a list of all existing PC8 codes for every year,
e.g. for the year 2010, PC8_2010.rds. More technically
speaking, these files provide a data frame with one column and
n rows, where n is the number of existing PC8 codes in
a given year. An example (first six lines) of the year 2010 is the
following:
      2010
1 07101000
2 07291100
3 07291200
4 07291300
5 07291400
6 07291500
Secondly, a concordance between every pair of consecutive years is
necessary. These files contain two years in their filenames, with a
period of one year in between, e.g. between 2010 and 2011 this results in
PC8_2010_2011.rds. More technically speaking, these files are
data frames with two columns, which must be named “new” and “old”, and
n rows, where n is the number of changes in a given
year. An example (first six lines) of the changes between 2010 and 2011
is the following:
       new      old
1 07101000 07101000
2 07291100 07291100
3 07291200 07291200
4 07291300 07291300
5 07291400 07291400
6 07291500 07291500
Thirdly, the PC8 folder contains concordance lists between the PC8 and
CN8 classifications for every year. This data is needed to translate
PC8 into BEC. An example for the year 2010 would be
PC8_CN8_2010.rds. Technically, a data frame with two
columns, named “PRCCODE” for PC8 codes and “CNCODE” for CN8 codes, and
n rows, where n is the number of concordances between
specific codes, is provided for every year. However, a concordance
between a PC8 code and a CN8 code may not exist. In this case, the
missing value is filled with NA. Some examples from the associated file
for the year 2010 can be found below:
   PRCCODE CNCODE
1 10131430   <NA>
2 10139100   <NA>
3 10399100   <NA>
4 13301110   <NA>
5 13301121   <NA>
6 13301122   <NA>
     PRCCODE   CNCODE
2400 8111136 25151200
2401 8111150 25152000
2402 8111233 25161100
2403 8111236 25161200
2404 8111250 25162000
2405 8111290 25169000
 
- HS6 data  is provided in
the corresponding HS6 folder. This folder only contains one type of
file, which are correspondence lists between the changes of HS6 codes
over time. Those changes happened in several years: 1992, 1996, 2002,
2007, 2012 and 2017. For every period, a separate concordance list is
necessary. csv-files provide this data, where the separator is a
semicolon, i.e. “;”, and the filenames contain both years. For
example, between 1992 and 1996, the file is called
HS_1996_to_HS_1992.csv. Headers are also included in this
file. For this specific case, they are “HS 1996” and “HS 1992”. For
other periods the headers change accordingly. An example (first six
lines) of the changes between 1996 and 1992 is the following:
HS 1996;HS 1992
10111;10111
10119;10119
10120;10120
10210;10210
10290;10290
 
- BEC data is provided
in the HS6toBEC folder. This folder contains only one type of file,
which are correspondence lists between HS6- and BEC-classification in
the years HS6 codes changed (i.e. 2002, 2007, 2012, 2017). For each
year, a separate concordance list is necessary. csv-files are used for
this data, where the separator is a semicolon, i.e. “;”, and the
filenames contain the year. For example, in 2002, the file is called
HS2002toBEC.csv. Headers are also included in this file,
namely “HS” for the HS6 codes and “BEC” for the BEC codes. An example
(first six lines) of the concordance in 2002 is the following:
HS;BEC
10110;41
10190;111
10210;41
10290;111
10310;41
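When reading one of the semicolon-separated concordance files described above, the code columns should be read as character, otherwise leading zeros are lost. The following sketch writes the first rows of the default CN8 concordance list to a temporary file and reads them back with base R. The temporary file stands in for a real file of the package database, and the object names are assumptions.

```r
# First rows of the default CN8 concordance list, written to a temp file.
csv_lines <- c(
  "from;to;obsolete;new",
  "1988;1989;02012011;02012021",
  "1988;1989;02012011;02012029",
  "1988;1989;02012019;02012029"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Read the years as integers and the codes as character strings.
conc <- read.csv(path, sep = ";", fileEncoding = "UTF-8",
                 colClasses = c("integer", "integer",
                                "character", "character"))
print(conc)

# For comparison: without colClasses the codes are parsed as numbers
# and the leading zero disappears.
naive <- read.csv(path, sep = ";")
print(naive$obsolete)
```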
 
Custom Data 
It is possible to use additional concordance lists, for example, or to
alter the provided data. However, it is highly recommended to read the
instructions in this vignette carefully first. If new data is added,
there are some mandatory aspects and some valuable aspects to
consider.
Mandatory aspects:
- New data must be stored inside the package. This can be easily
done by adding new files in the appropriate subfolder of the package
database. get_data_directory() may help to find the
correct folder to store new data.
 
- Chosen filenames must be analogous to those of already existing
files. 
- The structure of the new data is crucial. The section Data Sets may provide more details. In short:
file-type, header names, column numbers and datatype (numeric,
character, …) are very important. 
- All new added .csv-files must be encoded using
UTF-8. 
Valuable aspects:
- It is highly recommended to download new data from the EU server Ramon
and only alter content-related data if necessary.
- Product codes need to have the correct length, e.g. CN8 codes must
be eight digits long. Some programs tend to interpret codes as numeric
values and cut off leading zeros, which leads to completely wrong
results.
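If codes have already been turned into numbers by another program, the leading zeros can be restored by padding to the expected length. This is a minimal sketch in base R with toy values. It only helps when the missing zeros are the only damage, so reading the codes as character from the start is the safer route.

```r
# CN8 codes that were stored as numbers and lost their leading zeros.
damaged <- c(1011100, 2012029, 25151200)

# Pad with zeros on the left to a width of eight digits.
repaired <- sprintf("%08d", as.integer(damaged))
print(repaired)
```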