R / Stata Example

Note that these instructions are written for the R example. The Stata example works analogously.

Design rationale

The design of the project templates is guided by the following main thoughts:

  1. Separation of logical chunks: A minimal requirement for a project to scale.

  2. Only execute required tasks, automatically: Again required for scalability; this means that the machine needs to know what constitutes a “required task”.

  3. Re-use of code and data instead of copying and pasting: Else you will forget the copy & paste step at some point down the road. At best, this leads to errors; at worst, to misinterpreting the results.

  4. Be as language-agnostic as possible: Make it easy to use the best tool for a particular task and to mix tools in a project.

  5. Separation of inputs and outputs: Required to find your way around in a complex project.

I will not touch upon the last point until the Organisation section below. The remainder of this page introduces an example and a general concept of how to think about the first four points.

Running example

To fix ideas, let’s look at the example of Albouy’s [Alb12] replication study of Acemoglu, Johnson, and Robinson’s (AJR) [AJR01] classic 2001 paper. In his replication, Albouy [Alb12] raises two main issues: lack of appropriate clustering and measurement error in the instrument (settler mortality) that is correlated with expropriation risk and GDP. To keep it simple, the example only replicates figure 1 and parts of tables 2 and 3 of Albouy [Alb12].

Figure 1 visualizes the relationship between expropriation risk and settler mortality. Table 2 replicates the first-stage results (the effect of settler mortality on expropriation risk), estimated using the original mortality rates of AJR (Panel A) and one alternative proposed by Albouy, namely the conjectured mortality data (Panel B). For each panel, several specifications with varying geographic controls are estimated. Table 3 contains the second-stage estimates for Panel A and Panel B; here, different standard error adjustments proposed by Albouy are estimated additionally.

This replication exercise requires two main steps.

  1. Combine Albouy’s (2012) and AJR’s (2005) data (Data Management)

  2. Estimate the first and the second stage for each panel and create the figure (Analysis, Final)

In these instructions, we will focus on the replication of the tables; creating the figure is straightforward. For each panel, one has to follow four steps:

  1. Compute the first stage estimates considering different geographic controls. (Analysis)

  2. Compute the second stage estimates considering different geographic controls and different standard error specifications (Analysis)

  3. Create nice tables for the results of 1 and 2. (Final)

  4. Include the figure and the tables in a final LaTeX document and write some text. (Paper)

It is very useful to explicitly distinguish between steps 1./2. and 3., because computation time in 1. and 2. (the actual estimation) can become an issue: If you just want to change the layout of a table or the color of a line in a graph, you do not want to wait for days. Not even for the 3 minutes or 30 seconds it takes in this example.

How to organise the workflow?

A naïve way to ensure reproducibility is to have a master-script (do-file, m-file, …) that runs each file one after the other. One way to implement that for the above setup would be to have code for each step of the analysis and a loop over the different subsamples within each step:

_images/steps_only_full1.png
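For concreteness, here is a sketch of such a master script in Python. All file names are hypothetical, and the R scripts are assumed to take the model name as a command-line argument:

    # master.py -- a naive, hypothetical master script.  It re-runs every
    # step for every model, every single time it is executed.
    import subprocess

    MODELS = ["baseline", "rmconj"]

    # Step 1: data management, done once for all models.
    subprocess.run(["Rscript", "clean_data.r"], check=True)

    # Steps 2 and 3: loop over the models within each step.
    for step in ["first_stage_estimation.r", "second_stage_estimation.r"]:
        for model in MODELS:
            subprocess.run(["Rscript", step, model], check=True)

    # Step 4: compile the paper.
    subprocess.run(["latexmk", "-pdf", "ajrcomment.tex"], check=True)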

You will still need to manually keep track of whether a particular step needs to be run after making changes, though. Or you run everything at once, all the time. Alternatively, you may have code that runs one step after the other for each mortality series/specification:

_images/model_steps_full1.png

The equivalent comment applies here: Either manually keep track of which model needs to be re-run after making changes, or run everything at once.

Ideally though, you want to be even more fine-grained than this and only run individual elements. This is particularly true when your computations take a long time. In that case, running all steps every time via the master-script simply is not an option. All my research projects ended up running for a long time, no matter how simple they were… The figure shows that even in this simple example, there are quite a few parts to remember:

_images/model_steps_select1.png

This figure assumes that your data management is done for all models at once, which is usually a good choice for me. Even with only two models, we need to remember six ways to start different programs and how the different tasks depend on each other. This does not scale to serious projects!

Directed Acyclic Graphs (DAGs)

The way to specify, for a computer, the dependencies between data, code, and the tasks to perform is a directed acyclic graph. A graph is simply a set of nodes (files, in our case) and edges that connect pairs of nodes (tasks to perform). Directed means that the order in which we connect a pair of nodes matters; we thus add arrows to all edges. Acyclic means that there are no directed cycles: When you traverse the graph in the direction of the arrows, there is no way to end up at the same node again.
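To make the definition concrete, here is a toy illustration in Python using only the standard library. The graph below is made up, although the file names echo the example that follows:

    from graphlib import TopologicalSorter

    # A DAG as a mapping from each node to the set of nodes it depends on.
    dag = {
        "ajrcomment.dta": {"clean_data.r"},
        "table_first_stage_est.tex": {"ajrcomment.dta", "first_stage_estimation.r"},
        "paper.pdf": {"table_first_stage_est.tex"},
    }

    # One valid order in which to build the targets, dependencies first.
    print(list(TopologicalSorter(dag).static_order()))

    # Making clean_data.r depend on paper.pdf would introduce a directed
    # cycle, and static_order() would raise graphlib.CycleError.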

This is the dependency graph for a simplified version of Albouy’s replication study [Alb12] as implemented in the R example of the project template:

_images/ajrcomment_dependencies.png

To keep the dependency graph simple, we ignore the figure for now. baseline.json contains the sample specification for panel A and rmconj.json for panel B.

The arrows of the graph have different colors in order to distinguish the steps of the analysis, from left to right:

  • Blue for data management (=combining the data sets in this case)

  • Orange for the main estimation

  • Teal for the visualisation of results

  • Red for compiling the pdf of the paper

Bluish nodes are pure source files – they do not depend on any other file, and hence no edge points to any of them. In contrast, brownish nodes are targets; they are generated by the code. Some may serve as intermediate targets only – e.g., there is not much you would want to do with ajrcomment.dta except process it further.

In a first run, all targets have to be generated, of course. In later runs, a target only needs to be re-generated if one of its direct dependencies changes. E.g., when we make changes to baseline.json, we will need to rerun first_stage_estimation.r and second_stage_estimation.r using this subsample/specification. Then we will need to rerun table_first_stage_est.r and table_second_stage_est.r to renew table_first_stage_est.tex and table_second_stage_est.tex. Lastly, we need to re-compile the pdf. We will dissect this example in more detail in the next section. The only important thing at this point is to understand the general idea.
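Build tools differ in how they detect such changes (timestamps, file hashes, a database of previous runs), but the underlying rule is simple. A minimal timestamp-based sketch of the logic, not pytask's actual mechanism:

    from pathlib import Path

    def needs_rebuild(target: Path, dependencies: list[Path]) -> bool:
        """Return True if the target is missing or older than any dependency."""
        if not target.exists():
            return True
        target_mtime = target.stat().st_mtime
        return any(dep.stat().st_mtime > target_mtime for dep in dependencies)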

Of course this is overkill for a textbook example – we could easily keep the code closer together than this. But such a strategy does not scale to serious papers with many different specifications. As a case in point, consider the DAG for an early version of [vG15]:

_images/pfefficiency1.jpg

Do you want to keep those dependencies in your head? Or would it be useful to specify them once and for all in order to have more time for thinking about research? The next section shows you how to do that.

Introduction to pytask

pytask is our tool of choice to automate the dependency tracking via a DAG structure. It was written by Uni Bonn alumnus Tobias Raabe out of frustration with other tools.

pytask is inspired by pytest and leverages the same plugin system. If you are familiar with pytest, getting started with pytask should be a very smooth process.

pytask will look for Python scripts named task_[specifier].py in all subdirectories of your project. Within those scripts, it will execute functions that start with task_.
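A minimal, hypothetical task file illustrating these conventions (the decorator-based API is described further below; all paths are made up):

    # Contents of src/data_management/task_build_data.py
    from pathlib import Path

    import pytask

    BLD = Path(__file__).resolve().parents[2] / "bld"  # assumed layout

    # pytask collects this function because the file name matches
    # task_*.py and the function name starts with task_.
    @pytask.mark.produces(BLD / "data" / "greeting.txt")
    def task_build_data(produces):
        produces.parent.mkdir(parents=True, exist_ok=True)
        produces.write_text("Hello, pytask!")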

Have a look at its excellent documentation. At present, there are additional plugins to run R scripts and Stata do-files, and to compile documents via LaTeX.

We will have more to say about the directory structure in the Organisation section below. For now, we note that one step towards the goal of clearly separating inputs and outputs is to specify a separate build directory. All output files go there (including intermediate output), it is never kept under version control, and it can be safely removed – everything in it will be reconstructed automatically the next time you run pytask.

pytask overview

From a high-level perspective, pytask works in the following way:

  1. pytask reads your instructions and sets the build order.

    • Think of a dependency graph here.

    • It stops when it detects a circular dependency or ambiguous ways to build a target.

    • Both are major advantages over a master-script, let alone doing the dependency tracking in your mind.

  2. pytask decides which tasks need to be executed and performs the required actions.

    • Minimal rebuilds are a huge speed gain compared to a master-script.

    • These gains can be large enough to make the difference between a project failing and succeeding.

We have just scratched the surface here; pytask has many more goodies to offer. Its documentation is an excellent source.
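In day-to-day work, all of this is triggered by a single command, executed from the project root:

    $ pytask

pytask then collects the task_*.py modules, builds the dependency graph, and executes only those tasks whose targets are missing or whose dependencies have changed since the last run.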

Wanted: Feedback

This is very fresh; please let us know what you would like to see here and what needs better explanation.

Organisation

On this page, we describe how the files are distributed in the directory hierarchy.

Directory structure

[The pictures are a little outdated, but you get the idea]

The left node of the following graph shows the contents of the project root directory after executing pytask:

_images/project_hierarchies_big_pic1.png

Files and directories in brownish colours are constructed by pytask; those with a bluish background are added directly by the researcher. You immediately see the separation of inputs and outputs (one of our guiding principles) at work:

  • All source code is in the src directory.

  • All outputs are constructed in the bld directory.

  • The paper and presentation are put there so they can be opened easily.

The contents of both the root/bld and the root/src directories directly follow the steps of the analysis from the workflow section.

The idea is that everything that needs to be run during the, say, analysis step, is specified in root/src/analysis and all its output is placed in root/bld/analysis.

Some differences:

  • Because they are accessed frequently, figures and tables get extra directories in root/bld.

  • The directory root/src contains many more subdirectories:

    • original_data is the place to store the data in its raw form, as downloaded / transcribed / … The original data should never be modified and saved under the same name.

    • model_code contains source files that might differ by model and that are potentially used at various steps of the analysis.

    • model_specs contains JSON files with model specifications. The choice of JSON is motivated by the attempt to be language-agnostic: JSON is quite expressive, and parsers exist for nearly all languages (for Stata, there is a converter in the root/src/model_specs/task_models.py file of the Stata version of the template). A sketch of what such a file might contain follows after this list.

    • library provides code that may be used by different steps of the analysis. Little code snippets for input / output or stuff that is not directly related to the model would go here. The distinction from the model_code directory is a bit arbitrary, but I have found it useful in the past.
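For illustration, here is a hypothetical specification file and how any language with a JSON parser can read it, shown in Python; the keys are made up and will differ in the actual template:

    import json

    # Hypothetical contents of a file like baseline.json.
    spec_text = '{"model_name": "baseline", "mortality_series": "original"}'
    spec = json.loads(spec_text)
    print(spec["mortality_series"])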

As an example of how things look further down in the hierarchy, consider the analysis step:

_images/project_hierarchies_analysis1.png

The same function (task_estimate) is run twice, once for each of the models baseline and rmconj. All specification of file paths is done in the pytask task files.

It is imperative that you do all the path handling inside the task_xxx.py scripts, using the pathlib library. This ensures that your project can be used on different machines and minimises the potential for cross-platform errors.
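A short sketch of what this looks like in practice. The src/bld layout is the template's, but anchoring paths via __file__ at an assumed directory depth is just one way to do it:

    from pathlib import Path

    # Build paths relative to the project root instead of hard-coding
    # platform-specific strings like "src\\analysis".
    ROOT = Path(__file__).resolve().parents[2]  # assumed depth of this file
    SRC = ROOT / "src"
    BLD = ROOT / "bld"

    spec = SRC / "model_specs" / "baseline.json"
    table = BLD / "tables" / "table_first_stage_est.tex"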

For running scripts in languages other than Python, pass all required files (inputs, log files, outputs) as arguments to the @pytask.mark.[x] decorator. You can then read them in within the script. Check this R template for examples.
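A heavily hedged sketch of what this can look like with the pytask-r plugin. The decorator signature has changed across plugin versions, so treat the argument passing below as an assumption and consult the plugin documentation:

    import pytask

    from src.config import BLD, SRC  # path constants, assumed to exist

    # Hypothetical task running an R script.  The script is assumed to read
    # the spec and output paths from its command-line arguments.
    @pytask.mark.r([str(SRC / "model_specs" / "baseline.json"),
                    str(BLD / "analysis" / "first_stage_baseline.rds")])
    @pytask.mark.depends_on(SRC / "analysis" / "first_stage_estimation.r")
    @pytask.mark.produces(BLD / "analysis" / "first_stage_baseline.rds")
    def task_first_stage_baseline():
        pass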

For running Python source code from pytask, simply include depends_on and produces as inputs to your function.
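For instance, a pure-Python task turning an intermediate data set into a table of summary statistics could be sketched like this (file names hypothetical):

    import pandas as pd
    import pytask

    from src.config import BLD  # path constant, assumed to exist

    @pytask.mark.depends_on(BLD / "data" / "ajrcomment.csv")
    @pytask.mark.produces(BLD / "tables" / "summary_stats.csv")
    def task_summary_stats(depends_on, produces):
        # depends_on and produces arrive as pathlib.Path objects.
        df = pd.read_csv(depends_on)
        df.describe().to_csv(produces)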