Avoid Stata include

The problematic code

For example, you may have in your main.do file the following:

//constants
local PROJECT_CODE "myProject"
local DATA_DIR "P:/VDEC/statatemp/`PROJECT_CODE'"
local LIB_DIR "P:/VDEC/LIB"
local STUDY_START_YEAR = 2001
local STUDY_END_YEAR = 2015
...

//load dependencies
include "`LIB_DIR'/lib1.do"
include "`LIB_DIR'/lib2.do"

//call my subroutines
include "extractData.do"
include "analyzeData.do"
...

Then, in extractData.do, you may have something like:

local var1 "diab"
local var2 "asthma"
...

use `DATA_DIR'/data1
drop if year(indexDate) <  `STUDY_START_YEAR'
...

The problems

In a nutshell, Stata concatenates all your code, including library routines and your project-specific subroutines, into one single gigantic file. As a result:

  • all your “local” macros/variables (e.g., `var1' in extractData.do) are now visible to all the code you ‘included’ after defining that macro, which could result in subtle and hard-to-find bugs,
  • the behaviour of the code in your project-specific subroutine files depends on the order of their inclusion in main.do, because they rely, either intentionally or unintentionally, on code defined in other subroutine files,
  • there is no true encapsulation or isolation of concerns, and
  • there is limited potential for reusing the code.
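The leakage of local macros can be seen in a few lines. The following is a minimal sketch, meant to be run as a single do-file; the temporary helper file is a hypothetical stand-in for a subroutine file such as extractData.do.

```stata
// Write a one-line helper do-file to a temporary file
// (hypothetical stand-in for a subroutine file such as extractData.do)
tempfile helper
tempname fh
file open `fh' using "`helper'", write text replace
file write `fh' "local var1 diab" _n
file close `fh'

// include: the helper's lines run in the caller's scope, so var1 leaks out
include "`helper'"
// prints: after include: var1 = [diab]
display "after include: var1 = [`var1']"

// clear the leaked macro before the second experiment
local var1

// do: the helper runs in its own scope, so the caller's var1 stays empty
do "`helper'"
// prints: after do: var1 = []
display "after do: var1 = [`var1']"
```

With include, any later code can silently pick up (or overwrite) var1; with do, the macro disappears when the helper finishes.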

The solution

  • Instead of including files, do/run them (same thing, except that run does not echo its output):

//include "`LIB_DIR'/lib1.do"
do "`LIB_DIR'/lib1.do"

  • Convert code in subroutine files into programs. This is as trivial as adding 3 lines of code. Pass the parameters required by the subroutines as arguments.
// in extractData.do
capture program drop extractData
program define extractData
    // syntax stores each option in a local macro with a lowercase name,
    // so the option names are written in lowercase throughout
    syntax, datafilename(string) studystartyear(integer) ...etc
    local var1 "diab"
    local var2 "asthma"
    ...

    use "`datafilename'"
    drop if year(indexDate) < `studystartyear'
    ...
end

Now, at the top of main.do:

do extractData.do

When you need to call extractData:

extractData, datafilename("`DATA_DIR'/data1") studystartyear(`STUDY_START_YEAR')
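For readers who want to try the pattern end to end, here is a self-contained sketch, meant to be run as a single do-file. The program name keepFromYear, its options datafile() and startyear(), and the toy dataset are all hypothetical and exist only for illustration.

```stata
// Build a toy dataset with years 2000..2004 and save it to a temporary file
clear
set obs 5
generate year = 1999 + _n
tempfile data1
save "`data1'"

// Define the subroutine as a program with named, typed options
capture program drop keepFromYear
program define keepFromYear
    syntax, datafile(string) startyear(integer)
    // this macro is local to the program and invisible to the caller
    local var1 "diab"
    use "`datafile'", clear
    drop if year < `startyear'
end

// Call it like any other Stata command
keepFromYear, datafile("`data1'") startyear(2002)

// prints 3 (the years 2002, 2003 and 2004 remain)
display _N
// prints [] because var1 never left the program
display "[`var1']"
```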

Advantages

  • More readable self-documenting code (through the use of expressive program and argument names)
  • The local macros in extractData.do are now truly local to that program.
  • Easier to test because of the better defined interface.
  • The command ‘syntax’ checks the types of the arguments for you, and provides for default values (but handle with care).
  • Additional validation of arguments and other preconditions can be performed within program extractData, ideally with errors reported as a non-zero return code (the ‘r(#)’ that Stata prints, available to the caller in _rc after capture) for the caller to check.
  • Because programs are loaded once, this is generally more efficient than including source files.
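The points about type checking and validation can be sketched as follows. This is one possible approach, assuming errors are signalled through Stata's return code and picked up by the caller with capture and _rc; the program checkYears and its options are hypothetical.

```stata
capture program drop checkYears
program define checkYears
    // syntax rejects missing options and non-integer values on its own
    syntax, startyear(integer) endyear(integer)

    // additional precondition that syntax cannot express
    if `startyear' > `endyear' {
        display as error "startyear() must not exceed endyear()"
        exit 198
    }
    display "study period: `startyear'-`endyear'"
end

// valid call: runs normally
checkYears, startyear(2001) endyear(2015)

// invalid call: capture stores the return code in _rc instead of aborting
capture noisily checkYears, startyear(2015) endyear(2001)
if _rc != 0 {
    display as error "checkYears failed with return code " _rc
    // a real caller might propagate the failure here with: exit _rc
}

// wrong type: syntax itself raises the error, so _rc is non-zero
capture checkYears, startyear(abc) endyear(2015)
display _rc
```

Without capture, the failing call would stop the calling do-file immediately, which is often the desired default.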

Tips

  • Avoid using local macros except within ‘program define’ blocks. Constants in main.do are better declared as global to convey the intention that these are global variables. Consider grouping these constants in a Stata class.
  • All subroutines must return an error status that can be checked by the caller. The caller must check the error status of called subroutines and handle any errors appropriately (typically by propagating the error up the chain of callers or aborting with a clear message).
  • Do not over-engineer. Do not attempt to convert a subroutine into a reusable (VDEC-wide) library routine until you are sure of its utility for other projects and have consulted with Christiaan. If your project code is nicely structured into well-defined subroutines (using the above approach), it should be easy later on to extract and generalize any useful routines. Typically a piece of code should not be a VDEC routine if only one project uses it, even if it is foreseeable to be used by future projects.
  • The syntax command is very powerful and allows for validating the existence and types of variables passed as arguments, automatic handling of if qualifiers, correct (OS-independent) handling of filenames and much more. It also makes your programs easier to use because they follow the same rules as built-in Stata commands. For passing 1 or 2 parameters, you could also use the simpler command args (although you will likely regret that once you need to edit the program).
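
As a rough illustration of the last tip, the sketch below contrasts syntax with args. The programs meanOf and sayYears and the option label() are hypothetical; the example uses the auto dataset that ships with Stata.

```stata
capture program drop meanOf
program define meanOf, rclass
    // one numeric variable, optional if/in qualifiers, optional label() option
    syntax varname(numeric) [if] [in] [, label(string)]
    // touse marks the observations selected by if/in with non-missing values
    marksample touse
    quietly summarize `varlist' if `touse'
    if "`label'" == "" local label "`varlist'"
    display "`label': mean = " %9.2f r(mean) " (n = " r(N) ")"
    return scalar mean = r(mean)
end

sysuse auto, clear
meanOf price
meanOf price if foreign == 1, label("Foreign cars")

// syntax rejects a string variable and a non-existent variable by itself
capture meanOf make
display _rc
capture meanOf nosuchvar
display _rc

// args: positional parameters only, with no type checking or validation
capture program drop sayYears
program define sayYears
    args startyear endyear
    display "study period: `startyear'-`endyear'"
end
sayYears 2001 2015
```

The args version is shorter, but the caller has to remember the order of the parameters, and nothing stops a call such as sayYears abc.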
24/08/2018