The flexibility of text
gives researchers considerable freedom
to choose among different options. For example, a researcher can
select among many different layers (12 in BERT base and 24 in BERT
large), and these layers can be aggregated in different ways, including
the mean, minimum or maximum. It is also possible to use different
numbers of PCA components (or no PCA at all) in training, and to
select among different regression algorithms (e.g., multiple linear
regression or ridge regression). All these options are great for learning
more about these methods. However, when testing hypotheses it is important
not to fall prey to researcher degrees of freedom and to avoid the
risk of (unconsciously) p-hacking (e.g., see Simmons, Nelson, &
Simonsohn, 2011).
Researcher degrees of freedom refers to the inherent flexibility
involved in conducting research including carrying out experiments as
well as analyzing the data. Researchers can choose among many ways of
analyzing their data, and these ways can, for example, be selected
arbitrarily or on the basis that certain ways result in more desirable
outcomes such as a statistically significant result (Simmons, Nelson,
& Simonsohn, 2011). Put another way, the flexibility in
text
is a double-edged sword, where abusing the options
leads to p-hacking: the analytic process of consciously or unconsciously
trying several types of analyses until achieving the desired
results.
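To make the risk concrete, the following is a simplified illustration rather than a result from the cited paper. It assumes that a researcher tries k analysis variants, that each is tested at significance level α, and that the tests are independent. The value k = 6 is a hypothetical example (three layer aggregation methods crossed with two regression algorithms). Analyses run on the same data are correlated, so the real inflation will differ, but the direction of the effect is the same.

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k}

% With \alpha = 0.05 and k = 6:
1 - (1 - 0.05)^{6} = 1 - 0.95^{6} \approx 0.26
```

Under these assumptions, choosing the most favourable of six analyses raises the chance of a spurious significant result from 5% to roughly 26%.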
To avoid this, specify in advance the language model, which layers will be used and how they will be aggregated.
Examples of aspects to consider in a pre-registration of
hypothesis testing
This is not an exhaustive list; rather, think through your analyses as
carefully as possible and consider which decisions can
appropriately be made in advance. For example:
Type of model (e.g., BERT-base, BERT-large, multilingual BERT, RoBERTa, XLNet, etc.)
Which layers (e.g., all layers, or only layers 11 and 12)
Layer aggregation method (e.g., mean, minimum or maximum)
Exclusion of some tokens (e.g., [CLS] and [SEP])
Type of ML algorithm (e.g., ridge regression, random forest, etc.)
Number of cross validation folds in textTrain
Criteria for plotting (e.g., number of words to significance-test, which plots to produce, etc.)
Number of permutations (e.g., in textSimilarityTest, textProjection)
Not changing (or noting any change of) the random seed. The computer science literature has recently discussed that different random seeds can give very different results (e.g., see Mosbach et al., 2020), so consider stating that seeds will not be changed, or commit to a specific seed.
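The decisions listed above can be written down in machine-readable form before any outcome data are analysed. The following is a minimal sketch in base R, not part of the text package: the list `prereg`, its field names, all of its values and the helper `find_deviations()` are hypothetical and chosen only for illustration.

```r
# Hypothetical pre-registration record. Every value is an illustrative
# placeholder, not a recommendation and not a default of any package.
prereg <- list(
  model             = "bert-base-uncased",
  layers            = c(11, 12),
  layer_aggregation = "mean",
  excluded_tokens   = c("[CLS]", "[SEP]"),
  ml_algorithm      = "ridge",
  cv_folds          = 10,
  n_permutations    = 10000,
  seed              = 2020
)

# Commit to the pre-registered seed once, before running the analyses.
set.seed(prereg$seed)

# Return the names of all settings that differ between the pre-registered
# plan and the settings that were actually used.
find_deviations <- function(planned, used) {
  keys <- union(names(planned), names(used))
  changed <- vapply(
    keys,
    function(k) !identical(planned[[k]], used[[k]]),
    logical(1)
  )
  keys[changed]
}

# Example: the aggregation method was changed after the fact.
used <- modifyList(prereg, list(layer_aggregation = "max"))
print(find_deviations(prereg, used))
```

Any setting reported by `find_deviations()` is a departure from the plan and would need to be disclosed as such when the results are reported.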
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science.
Mosbach, M., Andriushchenko, M., & Klakow, D. (2020). On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines.