Templates for Reproducible Research Projects in Economics¶
An empirical or computational research project only becomes a useful building block for science when all steps can be easily repeated and modified by others. This means that we should automate as much as possible, as opposed to pointing and clicking with a mouse or, more generally, keeping track of the required steps yourself.
This is a collection of templates where much of this automation is pre-configured via describing the research workflow as a directed acyclic graph (DAG) using pytask. You just need to:
Install the template for the main language in your project (Python, R, Stata, Matlab, …)
Move your programs to the right places and change the placeholder scripts
Run pytask, which will build your entire project the first time you run it. Later, it will automatically figure out which parts of the project need to be rebuilt.
Getting Started¶
Here, we first describe in Preparing your system how you need to set up your computer so that everything plays well together. In dialogue, you will find detailed explanations on what you may want to choose when configuring the templates for your needs. Once you are done with that, you may want to check the Tips and tricks for starting a new project or Suggestions for porting an existing project.
So, …
If you want to first get an idea of whether this is the right thing for you, start by reading through the Introduction to the Example Code and the Python / Matlab Example or the R / Stata Example, whichever is most relevant for you.
If you are hooked already and want to try it out, continue right here with Preparing your system.
If you have done this before, you can jump directly to dialogue.
Preparing your system¶
Make sure you have the following programs installed and that these can be found on your path. This template requires
Miniconda or Anaconda. Windows users: please consult Tips and Tricks for Windows Users
Note
This template is tested with python 3.6 and higher and conda version 4.7.12 and higher. Use conda 4.6-4.7.11 at your own risk; conda versions 4.5 and below will not work under any circumstances.
a modern LaTeX distribution (e.g. TeXLive, MacTex, or MikTex)
Git, windows users please also consult Integrating git tab completion in Windows Powershell
The text editor VS Code, unless you know what you are doing.
If you are on Windows, please open the Windows Powershell. On Mac or Linux, open a terminal. As everything will be started from the Powershell/Terminal, you need to make sure that all programmes you need in your project (for sure Anaconda Python, Git, and LaTeX; potentially VS Code, Stata, R, Matlab) can be found on your PATH. That is, these need to be accessible from your shell. This often requires a bit of manual work, in particular on Windows.
To see which programmes can be found in your path, type (leave out the leading dollar sign, this is just standard notation for a command line prompt):
Windows
$ echo $env:path
Mac/Linux
$ echo $PATH
This gives you a list of directories that are available on your PATH.
Check that this list contains the paths to the programs you want to use in your project, in particular Anaconda (this contains your Python distribution), a LaTeX distribution, the text editor VS Code, Git, and any other program that you need for your project (Stata, R, Matlab). Otherwise, add them by looking up their paths on your computer and following the steps described in PATH environmental variable in Windows or Adding directories to the PATH: MacOS and Linux.
If you added any directory to PATH, you need to close and reopen your shell so that the change takes effect.
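If you prefer doing this from the command line, a directory can be appended to the PATH of the current session as sketched below. The TeX Live location is a placeholder, not a recommendation; substitute whatever directory you actually need:

```shell
# Append a directory to the PATH of the current bash session (macOS/Linux).
# /opt/texlive/bin is a placeholder; substitute the actual install location.
export PATH="$PATH:/opt/texlive/bin"

# Verify that the directory now shows up on the PATH.
echo "$PATH" | tr ':' '\n' | grep -x '/opt/texlive/bin'
```

In Windows PowerShell, the session-local equivalent would be `$env:Path += ";C:\texlive\bin"` (again with a placeholder directory). Both variants only last until you close the shell; for a permanent change, follow the guides linked above.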
To be on the safe side regarding your paths, you can check directly whether you can launch the programmes. For Python, type:
$ python
>>> exit()
This starts python in your shell and exits from it again. The top line should indicate that you are using a Python distribution provided by Anaconda. Here is an example output obtained using Windows PowerShell:
Python 3.7.4 (default, Aug 9 2019, 18:34:1) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
For Git, type:
$ git status
Unless you are in a location where you expect a Git repository, this should yield the output:
fatal: not a git repository (or any of the parent directories): .git
To start and exit pdflatex, type:

$ pdflatex
X
An editor window should open after typing:
$ code
If required, do the same for Stata, R, or Matlab — see here for the precise commands you may need.
In the Powershell/Terminal, navigate to the parent folder of your future project.
Now type

$ pwd

which prints the absolute path to your present working directory. There must not be any spaces or special characters in the path (for instance ä, ü, é, Chinese or Cyrillic characters). If your path contains any of these, change to a folder that does not (e.g., on Windows, create a directory C:\projects; do not rename your home directory).

Now type

$ git status

Unless you are in a location where you expect a Git repository, this should yield the output:

fatal: not a git repository (or any of the parent directories): .git
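A quick way to screen a candidate path for problematic characters is a whitelist check. The following sketch accepts only ASCII letters, digits, and common separators; the exact character set is our assumption, not part of the template:

```python
import re

def path_is_safe(path: str) -> bool:
    """True if the path contains only ASCII letters, digits, and the
    separators . _ - / \\ : -- in particular no spaces, umlauts, or
    non-Latin scripts."""
    return re.fullmatch(r"[A-Za-z0-9_\-./:\\]+", path) is not None

print(path_is_safe(r"C:\projects"))       # True
print(path_is_safe(r"C:\My Projects"))    # False: contains a space
print(path_is_safe("/home/olga/проект"))  # False: Cyrillic characters
```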
The template uses cookiecutter to enable personalized installations. Before you start, install cookiecutter on your system.
$ pip install cookiecutter

All additional dependencies will be installed into a newly created conda environment upon project creation.
Warning
If you do not opt for the conda environment later on, you need to take care of these dependencies by yourself. A list of additional dependencies can be found under Prerequisites if you decide not to have a conda environment.
If you intend to use a remote Git repository, create it if necessary and hold the URL ready.
Configuring your new project¶
If you are on Windows, please open the Windows Powershell. On Mac or Linux, open a terminal.
Navigate to the parent folder of your future project and type (i.e., copy & paste):
$ cookiecutter https://github.com/OpenSourceEconomics/econ-project-templates/archive/v0.4.4.zip
The dialogue will move you through the installation. Make sure to keep this page open side-by-side during the process, because if an answer is invalid, the whole process will break off (see When cookiecutter exits with an error on how to recover from there, but better not to let it get that far).
author – Separate multiple authors by commas
email – Just use one in case of multiple authors
affiliation – Separate by commas for multiple authors with different affiliations
project_name – The title of your project as it should appear in papers / presentations. Must not contain underscores or anything that would be an invalid LaTeX title.
project_slug – This will become your project identifier (i.e., the directory will be called this way). The project slug must be a valid Python identifier, i.e., no spaces, hyphens, or the like. Just letters, numbers, underscores. Do not start with a number. There must not be a directory of this name in your current location.
project_short_description – Briefly describe your project.
python_version – Default is 3.9. Please use python 3.7 or higher.
create_conda_environment_with_name – Just accept the default. If you don’t, the same caveat applies as for the project_slug. If you really do not want a conda environment, type “x”.
set_up_git – Set up a fresh Git repository.
git_remote_url – Paste your remote URL here if applicable.
make_initial_commit – Usually yes.
add_basic_pre_commit_hooks – Choose yes if you are using Python. This implements black and some basic checks as pre-commit hooks. Pre-commit hooks run before every commit and block the commit until the issues they flag are resolved. For a full list of the pre-commit hooks implemented here, take a look at the Pre-Commit Hooks.
add_intrusive_pre_commit – adds flake8 to the pre-commit hooks. flake8 is a python code linting tool. It checks your code for style guide (PEP8) adherence.
example_to_install – This should be the dominant language you will use in your project. A working example will be installed in the language you choose; the easiest way to get going is simply to adjust the examples for your needs.
configure_running_matlab – Select “y” if and only if you intend to use Matlab in your project and the Matlab executable may be found on your path.
configure_running_r – Select “y” if and only if you intend to use R in your project and the R executable may be found on your path.
configure_running_stata – Select “y” if and only if you intend to use Stata in your project and the Stata executable may be found on your path.
use_biber_biblatex_for_tex_bibliographies – This is a modern replacement for bibtex, but often this does not seem to be stable in MikTeX distributions. Choose yes only if you know what you are doing.
open_source_license – Whatever you prefer.
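The constraint on project_slug described above is exactly Python's notion of an identifier, so you can test a candidate slug before starting the dialogue:

```python
def is_valid_slug(slug: str) -> bool:
    """A valid slug is a Python identifier: letters, digits, and
    underscores only, not starting with a digit."""
    return slug.isidentifier()

print(is_valid_slug("schelling_replication"))  # True
print(is_valid_slug("my-project"))             # False: hyphen
print(is_valid_slug("2nd_attempt"))            # False: leading digit
```

Note that str.isidentifier also accepts some non-ASCII letters that Python allows in identifiers; sticking to plain ASCII letters, digits, and underscores is the safer choice here.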
After successfully answering all the prompts, a folder named according to your project_slug will be created in your current directory. If you run into trouble, please follow the steps explained in When cookiecutter exits with an error.
Skip this step if you did not opt for the conda environment. Type:
$ conda activate <env_name>

This will activate the newly created conda environment. You have to repeat this step any time you want to run your project from a new terminal window.
Skip this step if you did not opt for the pre-commit hooks. Pre-commit hooks have to be installed in order to take effect. This step has to be repeated every time you work on your project on a new machine. To install the pre-commit hooks, navigate to the project’s folder in the shell and type:
$ pre-commit install
Navigate to the folder in the shell and type the following commands into your command line to see whether the examples are working:
$ conda develop .
$ pytask

All programs used within this project template need to be found on your path, see above (Preparing your system and the Frequently Answered Questions / Troubleshooting).
If all went well, you are now ready to adapt the template to your project.
Tips and tricks for starting a new project¶
Your general strategy should be one of divide and conquer. If you are not used to thinking in computer science / software engineering terms, it will be hard to wrap your head around a lot of the things going on. So write one bit of code at a time, understand what is going on, and move on.
Install the template for the language of your choice as described in dialogue
I suggest you leave the examples in place.
Now add your own data and code bit by bit, amending the task_xxx files as necessary. To see what is happening, it might be useful to comment out some steps.
Once you have got the hang of how things work, remove the examples (both the files and the code in the task_xxx files).
Suggestions for porting an existing project¶
Your general strategy should be one of divide and conquer. If you are not used to thinking in computer science / software engineering terms, it will be hard to wrap your head around a lot of the things going on. So move one bit of code at a time to the template, understand what is going on, and move on.
Assuming that you use Git, first move all the code in the existing project to a subdirectory called old_code. Commit.
Now set up the templates.
Start with the data management code and move your data files to the spot where they belong under the new structure.
Move (the first steps of) your data management code to the folder under the templates. Modify the task_xxx files accordingly or create new ones.
Run pytask, adjusting the code for the errors you’ll likely see.
Move on step-by-step like this.
Delete the example files and the corresponding sections of the task_xxx files / the entire files in case you created new ones.
Introduction to the Example Code¶
An empirical or computational research project only becomes a useful building block of science when all steps can be easily repeated and modified by others. This means that we should automate as much as possible, as opposed to pointing and clicking with a mouse. This code base aims to provide two stepping stones to assist you in achieving this goal:
Provide a sensible directory structure that saves you from a bunch of annoying steps and thoughts that need to be performed sooner or later when starting a new project
Facilitate the reproducibility of your research findings from the beginning to the end by letting the computer handle the dependency management
The first should lure you in quickly, the second should convince you to stick to the tools in the long run. Unless you are familiar with the programs already, you may think at this point that all of this is overkill and far more difficult than necessary. It is not [although I am always happy to hear about easier alternatives].
The templates support a variety of programming languages already and are easily extended to cover any other. Everything is tied together by pytask, which is written in Python. You do not need to know Python to use these tools, though.
If you are a complete novice, you should read through the entire document instead of jumping directly to the Getting Started section. First, let me expand on the reproducibility part.
The case for reproducibility¶
The credibility of (economic) research is undermined if erroneous results appear in respected journals. To quote McCullough and Vinod [MV03]:
Replication is the cornerstone of science. Research that cannot be replicated is not science, and cannot be trusted either as part of the profession’s accumulated body of knowledge or as a basis for policy. Authors may think they have written perfect code for their bug-free software package and correctly transcribed each data point, but readers cannot safely assume that these error-prone activities have been executed flawlessly until the authors’ efforts have been independently verified. A researcher who does not openly allow independent verification of his results puts those results in the same class as the results of a researcher who does share his data and code but whose results cannot be replicated: the class of results that cannot be verified, i.e., the class of results that cannot be trusted.
It is sad when not the substance of research, but controversies about the replicability of its results, make it to the front page of the Wall Street Journal [WallSJournal05], covering the exchange between Hoxby and Rothstein ([Hox00] – [Rot07a] – [Hox07] – [Rot07b]). There are other well-known cases from top journals, see for example Levitt and McCrary ([Lev97] – [McC02] – [Lev02]) or the experiences reported in McCullough and Vinod [MV03]. The Reinhart and Rogoff controversy is another case in point; Google is your friend in case you do not remember it. Assuming that the incentives for replication are much smaller in lower-ranked journals, this is probably just the tip of the iceberg. As a consequence, many journals have implemented relatively strict replication policies, see this figure taken from [McC09]:

Economic Journals with Mandatory Data + Code Archives, Figure 1 in McCullough (2009)¶
Exchanges such as those above are a huge waste of time and resources. Why waste? Because it is almost costless to ensure reproducibility from the beginning of a project — much is gained by just following a handful of simple rules. They just have to be known. The earlier, the better. From my own experience, I can confirm that replication policies are enforced these days — and that it is rather painful to ensure ex-post that you can follow them. The number of journals implementing replication policies is likely to grow further — if you aim at publishing in any of them, you should seriously think about reproducibility from the beginning. And I did not even get started on research ethics…
Python / Matlab Example¶
Note that these instructions are written for the Python example. The Matlab example works analogously.
Design rationale¶
The design of the project templates is guided by the following main thoughts:
Separation of logical chunks: A minimal requirement for a project to scale.
Only execute required tasks, automatically: Again required for scalability. It means that the machine needs to know what is meant by a “required task”.
Re-use of code and data instead of copying and pasting: Else you will forget the copy & paste step at some point down the road. At best, this leads to errors; at worst, to misinterpreting the results.
Be as language-agnostic as possible: Make it easy to use the best tool for a particular task and to mix tools in a project.
Separation of inputs and outputs: Required to find your way around in a complex project.
I will not touch upon the last point until the pyorganisation section below. The remainder of this page introduces an example and a general concept of how to think about the first four points.
Running example¶
To fix ideas, let’s look at the example of Schelling’s (1969, [Sch69]) segregation model, as outlined here in Stachurski’s and Sargent’s online course [SS19]. Please look at their description of the Schelling model. Say we are thinking of two variants for the moment:
Replicate the figures from Stachurski’s and Sargent’s course.
Check what happens when agents are restricted to two random moves per period; after that they have to stop, regardless of whether they are happy or not.
For each of these variants (called models in the project template and the remainder of this document), you need to perform various steps:
Draw a simulated sample with initial locations (this is taken to be the same across models, partly for demonstration purposes, partly because it assures that the initial distribution is the same across both models)
Run the actual simulation
Visualise the results
Pull everything together in a paper.
It is very useful to explicitly distinguish between steps 2. and 3. because computation time in 2. becomes an issue: If you just want to change the layout of a table or the color of a line in a graph, you do not want to wait for days. Not even for 3 minutes or 30 seconds as in this example.
How to organise the workflow?¶
A naïve way to ensure reproducibility is to have a master-script (do-file, m-file, …) that runs each file one after the other. One way to implement that for the above setup would be to have code for each step of the analysis and a loop over both models within each step:
You will still need to manually keep track of whether you need to run a particular step after making changes, though. Or you run everything at once, all the time. Alternatively, you may have code that runs one step after the other for each model:
The equivalent comment applies here: Either keep track of which model needs to be run after making changes manually, or run everything at once.
Ideally though, you want to be even more fine-grained than this and only run individual elements. This is particularly true when your entire computations take some time. In this case, running all steps every time via the master-script simply is not an option. All my research projects ended up running for a long time, no matter how simple they were… The figure shows you that even in this simple example, there are now quite a few parts to remember:
This figure assumes that your data management is being done for all models at once, which is usually a good choice for me. Even with only two models, we need to remember 6 ways to start different programs and how the different tasks depend on each other. This does not scale to serious projects!
Directed Acyclic Graphs (DAGs)¶
The way to specify dependencies between data, code and tasks to perform for a computer is a directed acyclic graph. A graph is simply a set of nodes (files, in our case) and edges that connect pairs of nodes (tasks to perform). Directed means that the order in which we connect a pair of nodes matters; we thus add arrows to all edges. Acyclic means that there are no directed cycles: when you traverse the graph in the direction of the arrows, there must be no way to end up at the same node again.
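The practical payoff of acyclicity is that the graph admits a topological order, i.e. an order in which every file is built after everything it depends on. A minimal sketch with Python's standard library (the file names mirror the Schelling example; paper.pdf stands in for the compiled paper and is not the template's actual file name):

```python
from graphlib import TopologicalSorter

# Toy version of the Schelling dependency graph: each target maps to the
# set of files it is built from. Source files never appear as keys.
graph = {
    "schelling_baseline.pickle": {"initial_locations.csv", "baseline.json"},
    "schelling_baseline.png": {"schelling_baseline.pickle"},
    "paper.pdf": {"schelling_baseline.png"},
}

# static_order lists dependencies before the targets that need them.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

A build tool such as pytask essentially walks such an order and skips every target that is already up to date.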
This is the dependency graph for the modified Schelling example from Stachurski and Sargent, as implemented in the Python branch of the project template:
The arrows have different colors in order to distinguish the steps of the analysis, from left to right:
Blue for data management (=drawing a simulated sample, in this case)
Orange for the main simulation
Teal for the visualisation of results
Red for compiling the pdf of the paper
Bluish nodes are pure source files – they do not depend on any other file and hence none of the edges originates from any of them. In contrast, brownish nodes are targets, they are generated by the code. Some may serve as intermediate targets only – e.g. there is not much you would want to do with the raw simulated sample (initial_locations.csv) except for processing it further.
In a first run, all targets have to be generated, of course. In later runs, a target only needs to be re-generated if one of its direct dependencies changes. E.g. when we make changes to baseline.json, we will need to build schelling_baseline.pickle and schelling_baseline.png anew. Depending on whether schelling_baseline.png actually changes, we need to re-compile the pdf as well. We will dissect this example in more detail in the next section. The only important thing at this point is to understand the general idea.
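The rebuild decision just described can be sketched with timestamps: rebuild a target if it is missing or older than any of its direct dependencies. This is only an illustration; pytask's actual bookkeeping is more sophisticated than a plain mtime comparison.

```python
import os
import tempfile
from pathlib import Path

def needs_rebuild(target: Path, dependencies: list[Path]) -> bool:
    """Rebuild if the target is missing or any dependency is newer."""
    if not target.exists():
        return True
    t = target.stat().st_mtime
    return any(dep.stat().st_mtime > t for dep in dependencies)

tmp = Path(tempfile.mkdtemp())
dep = tmp / "baseline.json"
dep.write_text("{}")
tgt = tmp / "schelling_baseline.pickle"

print(needs_rebuild(tgt, [dep]))   # True: target does not exist yet

tgt.write_text("simulated")
os.utime(dep, (100, 100))          # set the dependency older ...
os.utime(tgt, (200, 200))          # ... than the target
print(needs_rebuild(tgt, [dep]))   # False: target is up to date

os.utime(dep, (300, 300))          # dependency changed afterwards
print(needs_rebuild(tgt, [dep]))   # True: target must be regenerated
```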
Of course this is overkill for a textbook example – we could easily keep the code closer together than this. But such a strategy does not scale to serious papers with many different specifications. As a case in point, consider the DAG for an early version of [vG15]:
Do you want to keep those dependencies in your head? Or would it be useful to specify them once and for all in order to have more time for thinking about research? The next section shows you how to do that.
Introduction to pytask¶
pytask is our tool of choice to automate the dependency tracking via a DAG (directed acyclic graph) structure. It has been written by Uni Bonn alumnus Tobias Raabe out of frustration with other tools.
pytask is inspired by pytest and leverages the same plugin system. If you are familiar with pytest, getting started with pytask should be a very smooth process.
pytask will look for Python scripts named task_[specifier].py in all subdirectories of your project. Within those scripts, it will execute functions that start with task_.
Have a look at its excellent documentation. At present, there are additional plugins to run R scripts, Stata do-files, and to compile documents via LaTeX.
We will have more to say about the directory structure in the pyorganisation section. For now, we note that a step towards achieving the goal of clearly separating inputs and outputs is that we specify a separate build directory. All output files go there (including intermediate output), it is never kept under version control, and it can be safely removed – everything in it will be reconstructed automatically the next time you run pytask.
Pytask Overview¶
From a high-level perspective, pytask works in the following way:
pytask reads your instructions and sets the build order.
Think of a dependency graph here.
It stops when it detects a circular dependency or ambiguous ways to build a target.
Both are major advantages over a master-script, let alone doing the dependency tracking in your mind.
pytask decides which tasks need to be executed and performs the required actions.
Minimal rebuilds are a huge speed gain compared to a master-script.
These gains are large enough to make projects break or succeed.
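The circular-dependency check from the list above can be illustrated with the standard library as well: a graph in which two tasks each wait for the other's output has no valid build order (the file names here are invented for the sketch).

```python
from graphlib import TopologicalSorter, CycleError

# Two tasks that each depend on the other's output -- no valid build order.
cyclic = {"estimates.csv": {"table.tex"}, "table.tex": {"estimates.csv"}}

try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError as err:
    print("circular dependency:", err.args[1])
```

Detecting this at collection time, before anything runs, is exactly what saves you from a half-built project.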
We have just touched upon the tip of the iceberg here; pytask has many more goodies to offer. Its documentation is an excellent source.
Wanted: Feedback¶
This is very fresh; please let us know what you would like to see here and what needs better explanation.
Organisation¶
On this page, we describe how the files are distributed in the directory hierarchy.
Directory structure¶
[The pictures are a little outdated, but you get the idea]
The left node of the following graph shows the contents of the project root directory after executing pytask:
Files and directories in brownish colours are constructed by pytask; those with a bluish background are added directly by the researcher. You immediately see the separation of inputs and outputs (one of our guiding principles) at work:
All source code is in the src directory.
All outputs are constructed in the bld directory.
The paper and presentation are put there so they can be opened easily.
The contents of both the root/bld and the root/src directories directly follow the steps of the analysis from the workflow section.
The idea is that everything that needs to be run during the, say, analysis step, is specified in root/src/analysis and all its output is placed in root/bld/analysis.
Some differences:
Because they are accessed frequently, figures and tables get extra directories in root/bld
The directory root/src contains many more subdirectories:
original_data is the place to store the data in its raw form, as downloaded / transcribed / … The original data should never be modified and saved under the same name.
model_code contains source files that might differ by model and that are potentially used at various steps of the analysis.
model_specs contains JSON files with model specifications. The choice of JSON is motivated by the attempt to be language-agnostic: JSON is quite expressive and there are parsers for nearly all languages (for Stata there is a converter in root/src/model_specs/task_models.py file of the Stata version of the template)
library provides code that may be used by different steps of the analysis. Little code snippets for input / output or stuff that is not directly related to the model would go here. The distinction from the model_code directory is a bit arbitrary, but I have found it useful in the past.
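To illustrate the language-agnostic point: a model specification stored as JSON can be read by any language in the project. The field names below are invented for this sketch, not taken from the template's baseline.json:

```python
import json

# A hypothetical model specification -- the keys are made up for illustration.
spec_text = """
{
    "model_name": "baseline",
    "n_agents": 250,
    "max_moves": null
}
"""

spec = json.loads(spec_text)
print(spec["model_name"])        # baseline
print(spec["max_moves"] is None) # True: JSON null maps to Python's None
```

R (e.g. via jsonlite), Matlab, or a Stata converter could consume the same file, which is why JSON is the interchange format of choice here.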
As an example of how things look further down in the hierarchy, consider the analysis step:
The same function (task_schelling) is run twice for the models baseline and max_moves_2. All specification of files is done in pytask.
It is imperative that you do all the task handling inside the task_xxx.py-scripts, using the pathlib library. This ensures that your project can be used on different machines and it minimises the potential for cross-platform errors.
For running Python source code from pytask, simply include depends_on and produces as inputs to your function.
For running scripts in other languages, pass all required files (inputs, log files, outputs) as arguments to the @pytask.mark.[x]-decorator. You can then read them in. Check the other templates for examples.
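The depends_on / produces convention can be sketched as follows. In a real project pytask injects these paths based on the task's declared dependencies and products; here the function is called by hand, and all names are invented for the sketch:

```python
import tempfile
from pathlib import Path

def task_plot_locations(depends_on: Path, produces: Path):
    """A pytask-style task: read the dependency, write the product.
    pytask would normally pass depends_on and produces itself."""
    n_rows = len(depends_on.read_text().splitlines()) - 1  # minus header
    produces.write_text(f"plotted {n_rows} agents")

# Simulate what pytask would do: provide the input, call the task.
tmp = Path(tempfile.mkdtemp())
src = tmp / "initial_locations.csv"
src.write_text("agent_id,x,y\n0,0.1,0.9\n1,0.4,0.2\n")
out = tmp / "locations_plot.txt"

task_plot_locations(src, out)
print(out.read_text())   # plotted 2 agents
```

Because the task only touches the paths it is handed, pytask can track exactly which files it reads and writes, which is what makes the minimal rebuilds possible.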
R / Stata Example¶
Note that these instructions are written for the R example. The Stata example works analogously.
Design rationale¶
The design of the project templates is guided by the following main thoughts:
Separation of logical chunks: A minimal requirement for a project to scale.
Only execute required tasks, automatically: Again required for scalability. It means that the machine needs to know what is meant by a “required task”.
Re-use of code and data instead of copying and pasting: Else you will forget the copy & paste step at some point down the road. At best, this leads to errors; at worst, to misinterpreting the results.
Be as language-agnostic as possible: Make it easy to use the best tool for a particular task and to mix tools in a project.
Separation of inputs and outputs: Required to find your way around in a complex project.
I will not touch upon the last point until the Organisation section below. The remainder of this page introduces an example and a general concept of how to think about the first four points.
Running example¶
To fix ideas, let’s look at the example of Albouy’s [Alb12] replication study of Acemoglu, Johnson, and Robinson’s (AJR) [AJR01] classic 2001 paper. In his replication, Albouy [Alb12] raises two main issues: lack of appropriate clustering and measurement error in the instrument (settler’s mortality) that is correlated with expropriation risk and GDP. To keep it simple, the example only replicates figure 1 and part of table 2 and table 3 of Albouy [Alb12].
Figure 1 is supposed to visualize the relationship between expropriation risk and settler’s mortality. In table 2, the first stage results are replicated (the effect of settler’s mortality on expropriation risk). This is estimated using the original mortality rates of AJR (Panel A) and one alternative proposed by Albouy, namely using the conjectured mortality data (Panel B). For each panel, several specifications are supposed to be estimated using varying geographic controls. Table 3 contains the second stage estimates for Panel A and Panel B. For that, different standard error adjustments, as proposed by Albouy, are estimated additionally.
This replication exercise requires three main steps:
Combine Albouy’s (2012) and AJR’s (2005) data (Data Management)
Estimate the first and the second stage for each panel and create the figure (Analysis, Final)
Include the figure and the tables in a final LaTeX document and write some text (Paper)
In these instructions, we will focus on the replication of the tables. Creating the figure is straightforward. For each Panel, one has to follow four steps:
Compute the first stage estimates considering different geographic controls. (Analysis)
Compute the second stage estimates considering different geographic controls and different standard error specifications (Analysis)
Create nice tables for the results of steps 1 and 2 (Final)
Including the figure and the tables in a final LaTeX document and writing some text. (Paper)
It is very useful to explicitly distinguish between steps 1./2. and 3. because computation time in 1. and 2. (the actual estimation) can become an issue: If you just want to change the layout of a table or the color of a line in a graph, you do not want to wait for days. Not even for 3 minutes or 30 seconds as in this example.
How to organise the workflow?¶
A naïve way to ensure reproducibility is to have a master-script (do-file, m-file, …) that runs each file one after the other. One way to implement that for the above setup would be to have code for each step of the analysis and a loop over the different subsamples within each step:
You will still need to manually keep track of whether you need to run a particular step after making changes, though. Or you run everything at once, all the time. Alternatively, you may have code that runs one step after the other for each mortality series/specification:
The equivalent comment applies here: Either keep track of which model needs to be run after making changes manually, or run everything at once.
Ideally though, you want to be even more fine-grained than this and only run individual elements. This is particularly true when your entire computations take some time. In this case, running all steps every time via the master-script simply is not an option. All my research projects ended up running for a long time, no matter how simple they were… The figure shows you that even in this simple example, there are now quite a few parts to remember:
This figure assumes that your data management is being done for all models at once, which is usually a good choice for me. Even with only two models, we need to remember 6 ways to start different programs and how the different tasks depend on each other. This does not scale to serious projects!
Directed Acyclic Graphs (DAGs)¶
The way to specify dependencies between data, code and tasks to perform for a computer is a directed acyclic graph. A graph is simply a set of nodes (files, in our case) and edges that connect pairs of nodes (tasks to perform). Directed means that the order in which we connect a pair of nodes matters; we thus add arrows to all edges. Acyclic means that there are no directed cycles: when you traverse the graph in the direction of the arrows, there must be no way to end up at the same node again.
This is the dependency graph for a simplified version of the Albouy’s replication study [Alb12] as implemented in the R example of the project template:
To keep the dependency graph simple, we ignore the figure for now. baseline.json contains the sample specification for panel A and rmconj.json for panel B.
The arrows of the graph have different colors in order to distinguish the steps of the analysis, from left to right:
Blue for data management (=combining the data sets in this case)
Orange for the main estimation
Teal for the visualisation of results
Red for compiling the pdf of the paper
Bluish nodes are pure source files: they do not depend on any other file, and hence no edge points into any of them. In contrast, brownish nodes are targets; they are generated by the code. Some serve as intermediate targets only; e.g., there is not much you would want to do with ajrcomment.dta except process it further.
On the first run, all targets have to be generated, of course. In later runs, a target only needs to be re-generated if one of its direct dependencies changes. For example, when we make changes to baseline.json, we will first need to rerun first_stage_estimation.r and second_stage_estimation.r using this subsample/specification. Then we will need to rerun table_first_stage_est.r and table_second_stage_est.r to renew table_first_stage_est.tex and table_second_stage_est.tex. Lastly, we need to re-compile the pdf as well. We will dissect this example in more detail in the next section; the only important thing at this point is to understand the general idea.
Of course this is overkill for a textbook example – we could easily keep the code closer together than this. But such a strategy does not scale to serious papers with many different specifications. As a case in point, consider the DAG for an early version of [vG15]:
Do you want to keep those dependencies in your head? Or would it be useful to specify them once and for all in order to have more time for thinking about research? The next section shows you how to do that.
Introduction to pytask¶
pytask is our tool of choice to automate the dependency tracking via a DAG (directed acyclic graph) structure. It has been written by Uni Bonn alumnus Tobias Raabe out of frustration with other tools.
pytask is inspired by pytest and leverages the same plugin system. If you are familiar with pytest, getting started with pytask should be a very smooth process.
pytask will look for Python scripts named task_[specifier].py in all subdirectories of your project. Within those scripts, it will execute functions that start with task_.
Have a look at its excellent documentation. At present, there are additional plugins to run R scripts, Stata do-files, and to compile documents via LaTeX.
We will have more to say about the directory structure in the Organisation section. For now, we note that one step towards the goal of clearly separating inputs and outputs is to specify a separate build directory. All output files go there (including intermediate output); it is never kept under version control, and it can be safely removed, since everything in it will be reconstructed automatically the next time you run pytask.
Pytask Overview¶
From a high-level perspective, pytask works in the following way:
pytask reads your instructions and sets the build order.
Think of a dependency graph here.
It stops when it detects a circular dependency or ambiguous ways to build a target.
Both are major advantages over a master-script, let alone doing the dependency tracking in your mind.
pytask decides which tasks need to be executed and performs the required actions.
Minimal rebuilds are a huge speed gain compared to a master-script.
These gains can be large enough to make the difference between a project failing and succeeding.
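Conceptually, the minimal-rebuild decision can be pictured as a timestamp comparison; a target must be rebuilt if it is missing or older than any of its dependencies. This is only a sketch of the idea, assuming timestamp-based checks; pytask's actual bookkeeping is more sophisticated:

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if ``target`` is missing or if any dependency has
    been modified more recently than the target."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

A build tool applies this check node by node along the dependency graph, which is how it skips everything that is still up to date.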
We have just touched upon the tip of the iceberg here; pytask has many more goodies to offer. Its documentation is an excellent source.
Wanted: Feedback¶
This is very fresh; please let us know what you would like to see here and what needs better explanation.
Organisation¶
On this page, we describe how the files are distributed in the directory hierarchy.
Directory structure¶
[The pictures are a little outdated, but you get the idea]
The left node of the following graph shows the contents of the project root directory after executing pytask:
Files and directories in brownish colours are constructed by pytask; those with a bluish background are added directly by the researcher. You immediately see the separation of inputs and outputs (one of our guiding principles) at work:
All source code is in the src directory.
All outputs are constructed in the bld directory.
The paper and presentation are put there so they can be opened easily.
The contents of both the root/bld and the root/src directories directly follow the steps of the analysis from the workflow section.
The idea is that everything that needs to be run during the, say, analysis step, is specified in root/src/analysis and all its output is placed in root/bld/analysis.
Some differences:
Because they are accessed frequently, figures and tables get extra directories in root/bld.
The directory root/src contains many more subdirectories:
original_data is the place to store the data in its raw form, as downloaded / transcribed / … Never modify the original data and save it under the same name.
model_code contains source files that might differ by model and that are potentially used at various steps of the analysis.
model_specs contains JSON files with model specifications. The choice of JSON is motivated by the attempt to be language-agnostic: JSON is quite expressive, and parsers exist for nearly all languages (for Stata, there is a converter in the root/src/model_specs/task_models.py file of the Stata version of the template).
library provides code that may be used by different steps of the analysis. Little code snippets for input / output or stuff that is not directly related to the model would go here. The distinction from the model_code directory is a bit arbitrary, but I have found it useful in the past.
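As an illustration of the model_specs idea, a file like baseline.json might hold nothing more than a name and a switch for the sample restriction, which any language can parse. The field names below are invented for illustration:

```python
import json

# Hypothetical contents of src/model_specs/baseline.json: a model name
# plus whatever switches the estimation step needs.
spec_text = """
{
    "model_name": "baseline",
    "drop_conjectured_mortality": false
}
"""

# Any task, regardless of language, can load the specification and
# branch on its contents.
spec = json.loads(spec_text)
```

Because the specification lives in data rather than code, adding a new model is a matter of adding one more JSON file, not of copying scripts.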
As an example of how things look further down in the hierarchy, consider the analysis step:
The same function (task_estimate) is run twice, once for the baseline model and once for rmconj. All file specifications are handled in pytask.
It is imperative that you do all the path handling inside the task_xxx.py scripts, using the pathlib library. This ensures that your project can be used on different machines, and it minimises the potential for cross-platform errors.
For running scripts in languages other than Python, pass all required files (inputs, log files, outputs) as arguments to the @pytask.mark.[x]-decorator. You can then read them in. Check this R template for examples.
For running Python source code from pytask, simply include depends_on and produces as inputs to your function.
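A rough sketch of what such a task file might look like (all paths and file names are invented for illustration, and the exact decorator syntax depends on your pytask version, so consult its documentation). The decorators are shown as comments so that the sketch runs stand-alone:

```python
from pathlib import Path

SRC = Path("src")
BLD = Path("bld")


# In a real task file, this function would carry decorators such as
#     @pytask.mark.depends_on(SRC / "original_data" / "data.csv")
#     @pytask.mark.produces(BLD / "data" / "data_clean.csv")
# so that pytask passes the paths in as ``depends_on`` and ``produces``.
def task_clean_data(depends_on, produces):
    """Drop empty lines from the raw data file and store the result."""
    lines = [
        line for line in depends_on.read_text().splitlines() if line.strip()
    ]
    produces.parent.mkdir(parents=True, exist_ok=True)
    produces.write_text("\n".join(lines) + "\n")
```

pytask collects this function because its name starts with task_ and the file name starts with task_; the declared paths are what the DAG is built from.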
Project-specific Program Environments¶
Programs change. Nothing is as frustrating as coming back to a project after a long time and spending the first {hours, days} updating your code to work with a new version of your favourite data analysis library. The same holds for debugging errors that occur only because your coauthor uses a slightly different setup.
The solution is to have isolated environments on a per-project basis. Conda environments allow you to do precisely this. This page describes them a little bit and explains their use.
The following commands can either be executed in a terminal or the Anaconda prompt (Windows).
Using the environment¶
During the installation of the template, a new environment was created unless you explicitly declined it. It took its specification from the environment.yml file in your project's root folder.
To activate it, execute:
$ conda activate <env_name>
Repeat this step every time you want to run your project from a new terminal window.
Updating packages¶
Make sure you have activated the environment via conda activate <env_name>. Then use conda or pip directly:
$ conda update [package]
or
$ pip install -U [package]
To update all conda packages, replace [package] by --all.
Installing additional packages¶
To list installed packages, type
$ conda list
If you want to add a package to your environment, run
$ conda install [package]
or
$ pip install [package]
Choosing between conda and pip
Generally, it is recommended to use conda whenever possible. This is necessary for most scientific packages: roughly speaking, they are usually not pure-Python code, and pure Python is all that pip can handle. For pure-Python packages, we sometimes fall back on pip.
Saving your environment¶
After updating or changing your environment you should save the status in the environment.yml file to avoid version conflicts and maintain coherent environments in a project with multiple collaborators. Just make sure your environment is activated and run the following in the project’s root directory:
$ conda env export -f environment.yml
After exporting, manually delete the last line in the environment file, as it is system specific.
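The system-specific last line is the prefix: entry that conda env export appends, pointing at the environment's location on your machine. You can simply delete it in your editor; as a convenience, a small helper like the following (a sketch, not part of the template) does the same:

```python
from pathlib import Path


def strip_prefix_line(path):
    """Remove the machine-specific ``prefix:`` line that
    ``conda env export`` appends to an environment file."""
    lines = Path(path).read_text().splitlines()
    lines = [line for line in lines if not line.startswith("prefix:")]
    Path(path).write_text("\n".join(lines) + "\n")
```

Run it on environment.yml after each export so that the file stays identical across collaborators' machines.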
Setting up a new environment¶
If you want to create a clean environment, execute:
$ conda create --name myenv
For setting up an environment from a specification file (like environment.yml), type:
$ conda env create --name <myenv> -f <filename>
Information about your conda environments¶
For listing your installed conda environments, type
$ conda info --envs
The currently activated one will be marked.
Pre-Commit Hooks¶
Pre-commit hooks are checks and syntax formatters that run upon every commit. If one of the hooks fails, the commit is aborted, and you have to commit again after you have resolved the issues raised by the hooks. Pre-commit hooks are defined in the .pre-commit-config.yaml file. If you opt for the basic pre-commit hooks, the following checks will be installed into your project:
reorder-python-imports: Reorders your python imports according to PEP8 guidelines.
check-yaml: Checks whether all .yaml and .yml files within your project are valid YAML files.
check-added-large-files: Checks that committed files do not exceed 100MB in size, the maximum file size allowed by GitHub.
check-byte-order-marker: Fails if a file has a UTF-8 byte-order marker.
check-json: Checks whether all files that end with .json are indeed valid json files.
pyupgrade: Converts Python code to make use of newer syntax.
pretty-format-json: Reformats your json files to be more readable.
trailing-whitespace: Removes trailing whitespaces in all your text files.
black: Runs the Python code formatter black on all your committed Python files.
blacken-docs: Formats python code (according to black’s formatting style) that occurs within documentation files.
If you additionally opt for intrusive pre-commit hooks, the Python linter flake8 will be installed as a pre-commit hook as well. It is important to note that flake8 is quite strict regarding adherence to the PEP8 style guide and, as opposed to black, it only raises issues but does not resolve them automatically. You have to fix the issues yourself.
Note
If you want to skip the pre-commit hooks for a particular commit, you can run:
$ git commit -am <your commit message> --no-verify
For more advanced usages of pre-commit please consult its website.
Frequently Answered Questions / Troubleshooting¶
Tips and Tricks for Windows Users¶
Anaconda Installation Notes for Windows Users¶
Please follow these steps unless you know what you are doing.
Download the Graphical Installer for Python 3.x.
Start the installer and click yourself through the menu. If you have administrator privileges on your computer, it is preferable to install Anaconda for all users. Otherwise, you may run into problems when running Python from your Powershell.
Make sure to tick (only) the following box: "Register Anaconda as my default Python 3.x". Finish the installation.
Navigate to the folder containing your Anaconda distribution. This folder contains multiple subfolders. Please add the path to the folder called condabin to your PATH environmental variable. This path should end in Anaconda3/condabin. You can add paths to your PATH by following these instructions.
Please start Windows Powershell in administrator mode, and execute the following:
$ set-executionpolicy remotesigned
Now (re-)open Windows Powershell and initialize it for full conda use by running
$ conda init
Warning
If you still run into problems when running conda and python from powershell, it is advisable to use the built-in Anaconda Prompt instead.
Adding directories to the PATH: MacOS and Linux¶
Open the program Terminal. You will need to add a line to the file .bash_profile, potentially creating the file first. This file lives in your home directory; in the Finder, it is hidden from your view by default.
Linux users: For most distributions, everything here applies to the file .bashrc instead of .bash_profile.
I will now provide a step-by-step guide of how to create / adjust this file using the editor code. If you are familiar with editing text files, just use your editor of choice.
Open a Terminal and type:
$ code ~/.bash_profile
If you use an editor other than VS Code, replace code by the respective editor. If .bash_profile already existed, you will see some text at this point. If so, use the arrow keys to scroll all the way to the bottom of the file.
Add the following line at the end of the file:
export PATH="${PATH}:/path/to/program/inside/package"
Example for Stata:
# Stata directory
export PATH="${PATH}:/Applications/Stata/StataMP.app/Contents/MacOS/"
In /Applications/Stata/StataMP.app, you may need to replace bits and pieces as appropriate for your installation (e.g., you might not have StataMP but StataSE). Similarly for Matlab or the like.
Press Return and then ctrl+o (= WriteOut = save) and Return once more.
When cookiecutter exits with an error¶
If cookiecutter breaks off, you will get a lengthy error message. It is important that you work through this and try to understand the error (the language used might seem funny, but it is precise…).
Then type:
$ code ~/.cookiecutter_replay/econ-project-templates-0.4.5.json
If you are not using VS Code as your editor of choice, adjust the line accordingly.
This command should open your editor and show you a JSON file containing your answers to the previously filled-out dialogue. You can fix your faulty settings in this file. If your path contains spaces or special characters, you need to adjust it.
When done, launch a new shell if necessary and type:
$ cookiecutter --replay https://github.com/OpenSourceEconomics/econ-project-templates/archive/v0.4.5.zip
Starting stats/maths programmes from the shell¶
pytask needs to be able to start your favourite (data) analysis programme from the command line; it might be worthwhile trying that out yourself, too. These are the programme names that pytask looks for:
R: RScript, Rscript
Stata:
Windows: StataMP-64, StataMP-ia, StataMP, StataSE-64, StataSE-ia, StataSE, Stata-64, Stata-ia, Stata, WMPSTATA, WSESTATA, WSTATA
MacOS: Stata64MP, StataMP, Stata64SE, StataSE, Stata64, Stata
Linux: stata-mp, stata-se, stata
Matlab: matlab
Remember that Mac/Linux are case-sensitive and Windows is not. If you get errors that the programme is not found for all of the possibilities on your platform, the most likely cause is that your path is not set correctly yet. You may check that by typing echo $env:path (Windows) or echo $PATH (Mac/Linux). If the path to the programme you need is not included, you can adjust it as detailed above (Windows, Mac/Linux).
If the name of your programme is not listed among the possibilities above, please open an issue on GitHub.
Prerequisites if you decide not to have a conda environment¶
This section lists the dependencies that would otherwise be installed via the conda environment; install them manually if you opted out of it.
General:¶
$ conda install pandas python-graphviz=0.8
$ pip install matplotlib click==7.0
For sphinx users:¶
$ pip install sphinx nbsphinx sphinx-autobuild sphinx-rtd-theme sphinxcontrib-bibtex
For Matlab and sphinx users:¶
$ pip install sphinxcontrib-matlabdomain
For pre-commit users:¶
$ pip install pre-commit
For R users:¶
R packages can, in general, also be managed via conda environments. The environment of the template contains the following R-packages necessary to run the R example of this template:
AER
aod
car
foreign
lmtest
rjson
sandwich
xtable
zoo
Quick ‘n’ dirty command in an R shell:
install.packages(
c(
"foreign",
"AER",
"aod",
"car",
"lmtest",
"rjson",
"sandwich",
"xtable",
"zoo"
)
)
Stata failure: FileNotFoundError¶
The following failure:
FileNotFoundError: No such file or directory: '/Users/xxx/econ/econ-project templates/bld/add_variables.log'
has a simple solution: Get rid of all spaces in the path to the project (i.e., econ-project-templates instead of econ-project templates in this case). To do so, do not rename your user directory; that will cause havoc. Rather, move the project folder to a different location.
I have not been able to get Stata working with spaces in the path in batch mode, so this has nothing to do with Python/pytask. If anybody finds a solution, please let me know.
Stata failure: missing file¶
If you see an error like this one:
-> missing file: '/Users/xxx/econ/econ-project/templates/bld/add_variables.log'
check that you have a license for the Stata version that is found (the Stata tool checks availability top-down, i.e., MP-SE-IC; if an MP version is found and you only have a license for SE, Stata will silently refuse to start up).
The solution is to remove all versions of Stata from its executable directory (e.g., /usr/local/stata) that cost more than your license did.
Feedback welcome¶
I have had a lot of feedback from former students who found this helpful. But in-class exposure to material is always different than reading up on it and I am sure that there are difficult-to-understand parts. I would love to hear about them! Please drop me a line or, if you have concrete suggestions, file an issue on GitHub.
References¶
- AJR01
Daron Acemoglu, Simon Johnson, and James A. Robinson. The Colonial Origins of Comparative Development: An Empirical Investigation. American Economic Review, 91(5):1369–1401, December 2001.
- Alb12
David Y. Albouy. The Colonial Origins of Comparative Development: An Investigation of the Settler Mortality Data. American Economic Review, 102(6):3059–3076, 2012.
- Hox00
Caroline M. Hoxby. Does Competition among Public Schools Benefit Students and Taxpayers? American Economic Review, 90(5):1209–1238, December 2000.
- Hox07
Caroline M. Hoxby. Does Competition among Public Schools Benefit Students and Taxpayers? Reply. American Economic Review, 97(5):2038–2055, December 2007.
- Lev97
Steven D. Levitt. Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime. American Economic Review, 87(3):270–290, June 1997.
- Lev02
Steven D. Levitt. Using Electoral Cycles in Police Hiring to Estimate the Effects of Police on Crime: Reply. American Economic Review, 92(4):1244–1250, September 2002.
- McC02
Justin McCrary. Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on Crime: Comment. American Economic Review, 92(4):1236–1243, September 2002.
- McC09
B. D. McCullough. Open Access Economics Journals and the Market for Reproducible Economic Research. Economic Analysis & Policy, 39(1):117–126, March 2009.
- MV03
B. D. McCullough and Hrishikesh D. Vinod. Verifying the Solution from a Nonlinear Solver: A Case Study. American Economic Review, 93(3):873–892, June 2003.
- Rot07a
Jesse Rothstein. Does Competition among Public Schools Benefit Students and Taxpayers? Comment. American Economic Review, 97(5):2026–2037, December 2007.
- Rot07b
Jesse Rothstein. Rejoinder to Hoxby. Available at http://gsppi.berkeley.edu/faculty/jrothstein/hoxby/rejoinder.pdf, November 2007.
- Sch69
Thomas C. Schelling. Models of Segregation. American Economic Review, 59(2):488–493, 1969.
- SS19
John Stachurski and Thomas J. Sargent. Quantitative Economics. http://quant-econ.net/index.html, 2019.
- vG15
Hans-Martin von Gaudecker. How Does Household Portfolio Diversification Vary with Financial Sophistication and Financial Advice? Journal of Finance, 70(2):489–507, April 2015.
- WallSJournal05
Wall Street Journal. Novel Way to Assess School Competition Stirs Academic Row. Available at http://gsppi.berkeley.edu/faculty/jrothstein/hoxby/wsj.pdf, October 2005.
Release Notes¶
v0.4 – January 2021¶
Move from Waf to Pytask (#86, @tobiasraabe, @hmgaudecker)
Move to GitHub Actions for CI (@janosg, WIP)
v0.3 – October 2019¶
Much improved documentation (@raholler)
Extensive instructions for use on Windows (@raholler)
Re-use previously-entered data when cookiecutter fails (@tobiasraabe, @raholler)
Fix Stata template by setting --shell-escape=1 (#63, @raholler)
Add pyupgrade to pre-commit hooks (#59)
Thanks to students at LMU for pointing lots of this out!
v0.2 – September 2019¶
Full continuous integration testing on the Azure platform
R example completely working in Miniconda environment out of the box (@raholler)
Documentation for Stata / R examples (@raholler)
Much improved instructions for usage on Windows (@raholler)
Improved structure of docs
v0.1 – October 2018¶
First version with cookiecutter (thanks, @tobiasraabe and @julienschat)
All the stuff that accumulated over the years with the help of many. I wish my memory was better so I would be able to list the contributions separately. Thanks, @PKEuS, @philippmuller, @julienschat, @janosg, @tdrerup and many more who provided feedback!
For Developers¶
This part is only for developers of the project template.
Pre-Release Tasks/Checks¶
Attach version numbers to the packages in environment.yml.
Update all pre-commit hooks to their newest version.
Check whether template works with the most current conda version on Windows by
3.1 Running the tests after updating conda.
3.2 Separately creating an example project and activating the environment.
All other OS are tested via Azure CI.
Check that the documentation is correctly built by navigating to the docs folder and executing waf and sphinx-build html.
Releasing the template¶
Check out the branch / commit with the template version to be released and create a tag with a version number and a description:
$ git tag -a [version] -m "Description"
Push the tag to your remote git repository:
$ git push origin [version]
The release will be available here
Check that the documentation is correctly built by readthedocs.
How to compile the documentation on Windows¶
Install ImageMagick. When installing, check the box "Install legacy components (convert.exe etc.)".
Add Imagemagick to PATH.
Go to the folder which contains Imagemagick and rename the convert executable to imgconvert.
Now you can compile the documentation by navigating in the docs folder and running waf.