Templates for Reproducible Research: Documentation
Introduction
An empirical or computational research project only becomes a useful building block for science and policy when all steps can be easily repeated and modified by others.
Hence, some actions should be avoided as much as possible: copying and pasting, pointing and clicking with a mouse, and any other form of interactive input that is not stored as part of the project.
The idea behind these templates is that the researcher specifies a set of tasks, which are executed in the correct order as required. The only input for (re-)producing results will be the action setting this pipeline to run.
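As a minimal sketch of what "a set of tasks" can look like, the following hypothetical file defines two small tasks in the style pytask uses from version 0.4 onwards (an assumption about the installed version). The file name, the bld directory, and the toy data are invented for illustration and are not taken from the template. The second task names the first task's product as an input, and that is the information pytask uses to work out the order of execution.

```python
# task_example.py
# Hypothetical file name; pytask collects modules named task_*.py and
# functions named task_*.
from pathlib import Path

BLD = Path("bld")  # hypothetical output directory, not prescribed by the template


def task_clean_data(produces: Path = BLD / "clean.csv") -> None:
    """Write a tiny cleaned data set (stand-in for a real cleaning step)."""
    produces.parent.mkdir(parents=True, exist_ok=True)
    produces.write_text("x,y\n1,2\n2,4\n3,6\n")


def task_summarise(
    data: Path = BLD / "clean.csv",  # product of the task above, so this task runs second
    produces: Path = BLD / "summary.txt",
) -> None:
    """Compute the mean of column y from the cleaned data."""
    rows = data.read_text().splitlines()[1:]  # skip the header line
    ys = [float(row.split(",")[1]) for row in rows]
    produces.parent.mkdir(parents=True, exist_ok=True)
    produces.write_text(f"mean_y={sum(ys) / len(ys)}\n")
```

With pytask installed, running the command pytask in the directory containing this file should execute both tasks in dependency order; a later run should skip tasks whose inputs and code have not changed.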
This code base aims to provide two stepping stones to assist you in achieving this goal:
A sensible directory structure. This saves you from rethinking the layout time and again, which typically happens when a new project is built up incrementally. Put differently, instead of starting from scratch, you modify an example to fit your needs.
A pre-configured computational environment including useful tools such as pytask and pre-commit hooks. These tools help you automate the workflow of your project and maintain a clean code base.
The first should lure you in quickly. The second should convince you to stick with the tools in the long run. Unless you have fought with large research projects before, you may think at this point that all of this is overkill and far more difficult than necessary. It is not (although I am always happy to hear about easier alternatives).
The example also uses Python code for the “research part”. However, pytask supports several other popular languages (R, Julia, Stata). Since pytask does not require much Python knowledge, you may find the template useful for making your pipeline reproducible in a language you are more comfortable with. It is also an easy way to mix languages in your project. In fact, until version 0.9, the template included its worked example in R, too. We dropped it purely for lack of resources to maintain it.