Vision: use configuration files to organize data analysis projects

Proposed change:

Use configuration files (in a standard format such as TOML or JSON) to direct (parameterize) the execution of the macros and other scripts currently used for cleaning and analyzing data. See the example configuration file below.
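
As a rough illustration, the sketch below shows how such a configuration file could drive existing code. It is a minimal sketch in Python, assuming the configuration is stored as project.toml and that the existing steps are Stata do-files run in batch mode; the file names, section names and the run_step() helper are hypothetical, not part of an existing code base. Note that in strict TOML a table name containing spaces must be quoted, e.g. ["data files"].

```python
# Minimal sketch (hypothetical names throughout): read the project configuration
# and use it to parameterize the existing cleaning/analysis scripts.
import subprocess
import tomllib  # standard library in Python 3.11+

with open("project.toml", "rb") as f:
    config = tomllib.load(f)

# Paths come from the ["data files"] table, so each path is declared exactly once.
subjects_path = config["data files"]["subjects"]
hospital_path = config["data files"]["hospital"]

def run_step(script: str, **params: str) -> None:
    """Run an existing do-file in batch mode, passing config values as arguments."""
    args = [f"{key}={value}" for key, value in params.items()]
    subprocess.run(["stata", "-b", "do", script, *args], check=True)

# The driver, not the analyst, wires configuration values into the existing scripts.
run_step("step1_subject_identification.do", subjects=subjects_path)
run_step("step2_merging.do", subjects=subjects_path, hospital=hospital_path)
```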

Objectives

  • Encourage reuse of the existing code base.
  • Simplify interactions with study leads.
  • Simplify documenting the main (high-level) features of a project implementation.
  • Eventually increase productivity by abstracting low-level details and reducing boilerplate code.

Rationale:

  • Reusing macros and scripts is difficult because changes must be made in several places, which requires intimate knowledge of the code base and is especially hard for new analysts and students.
  • Breakdowns in communication between the study lead, analysts and other team members occur often, and the resulting discrepancies are hard to detect. Having all salient features of the study (e.g. exclusion criteria, variable groupings) in an easy-to-read format ensures agreement and simplifies documentation and handover.
  • Most projects are similar in essence and involve a lot of boilerplate code and documentation.
  • Using macro parameters has several limitations: parameter lists can become unwieldy and error-prone, they encourage duplication (e.g. the path to the same dataset is specified in several places), which can lead to bugs, and the mechanism differs between SAS and Stata (and potentially other software tools).
  • Configuration files simplify tooling, e.g. code generators, code formatters, automatic project archiving.
  • Configuration files can be created and edited using a GUI, reducing errors.

Challenges

  • How to integrate project-specific code. One solution is to provide a hook to execute additional scripts (e.g. Import “path/to…” After “Step 2”); see the sketch after this list.
  • How to ensure that analysts do not overlook important errors and exceptions. One solution is to require test/assert scripts, also illustrated in the sketch below.
  • How to achieve the right balance between code reuse and simplicity of the framework. Overly complicated configuration files may hamper adoption, whereas too simplistic a framework may not add much.
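
One possible shape for the hook and test/assert mechanisms mentioned above, again as a hypothetical Python sketch (the keys "hooks", "after" and "asserts", the shell scripts and the helper functions are illustrative, not an existing implementation): the driver runs any script registered to execute after a given step, and aborts the pipeline when an assert script exits with a non-zero code.

```python
# Hypothetical sketch of the hook and test/assert mechanisms discussed above.
import subprocess

def run_script(path: str) -> None:
    """Run an external script; delegated to the shell here for brevity."""
    subprocess.run(["bash", path], check=True)

def run_hooks(config: dict, finished_step: str) -> None:
    """Run every hook registered to execute after the step that just finished."""
    for hook in config.get("hooks", []):
        if hook["after"] == finished_step:
            run_script(hook["path"])

def run_asserts(config: dict) -> None:
    """Run test/assert scripts; a failing script raises and stops the pipeline."""
    for test in config.get("asserts", []):
        run_script(test)

# Example usage with an in-memory config (real values would come from the TOML file):
config = {
    "hooks": [{"path": "scripts/extra_cleaning.sh", "after": "Step 2"}],
    "asserts": ["tests/check_no_duplicate_ids.sh"],
}
run_hooks(config, finished_step="Step 2")
run_asserts(config)
```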

## Example configuration file

[data files]
subjects = "path/to/file/containing/subject info"
hospital = "path/to/file/containing/hospital admissions"
....

[Variable names]
ID = scrphin
Entrydate = doe
Exitdate = dox
Indexdate = dodx
Gender = sex
Birthdate = dob
....


[Design]
Studydesign = Cohort


[Step 1 subject identification]
Studystartdate = 01Jan1999
Studyenddate = 31Dec2010

[Exclusions]
Ex1 = Gender == 1  # male
Ex2 = Age < 1
....

[Step 2 Merging]
# Merge subjects with xx
Matching = Import "path/to/custom or standard script"

[Step xx Generate derived variables]
Chronicdisease = Import "path/to/ChronicDiseaseIdentificationConfigfile"
Outcomes = [Asthma, IBD, ...]
DrugIdentification = Import "path/to/DrugIdentificationConfigfile"
CustomCode1 = Import "path/to/custom scripts" After "Step xx"
DrugClasses = [Statins, ...]
Exposure = EverStatins
Demographics = [AgeGroup, Gender, Region, Urban, Income5]
AgeGroup = [1-16, 17-25, ...]
....

[Step x Tables]
Tab1 = EverStatins x [Demographics, Chronicdisease, ...]
....
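
To make the example more concrete, the hypothetical Python/pandas sketch below shows one way a driver could interpret the [Variable names] and [Exclusions] sections: raw column names are renamed to the standard names used in the configuration, and each exclusion expression is evaluated against the data. The file subjects.csv and the assumption that an Age column has already been derived are illustrative only.

```python
# Hypothetical sketch: apply the [Variable names] and [Exclusions] sections.
import pandas as pd

variable_names = {"ID": "scrphin", "Gender": "sex", "Birthdate": "dob"}  # standard -> raw
exclusions = {"Ex1": "Gender == 1", "Ex2": "Age < 1"}  # as written in the config

subjects = pd.read_csv("subjects.csv")  # illustrative file name

# Rename raw dataset columns to the standard names, so exclusion expressions
# can be written against the standard names regardless of the source data.
subjects = subjects.rename(columns={raw: std for std, raw in variable_names.items()})

# Apply each exclusion in turn (assumes an Age column has already been derived);
# pandas.DataFrame.query evaluates the expression strings from the config.
for label, expression in exclusions.items():
    before = len(subjects)
    subjects = subjects.query(f"not ({expression})")
    print(f"{label}: excluded {before - len(subjects)} subjects ({expression})")
```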