The X commandments for data analysis

thou shalt understand your project

"There is nothing more frightful than ignorance in action." Goethe

  • read the project proposal, analysis plans etc.
  • review UpToDate to learn about the condition/disease of interest.
  • scan recent systematic reviews/meta-analyses about the association of interest: where is the field at? what results are expected?
  • review a few seminal papers and note how studies were designed, what analyses and sensitivity analyses were done, what data were presented (and how), and what biases were identified.
  • develop a causal diagram (DAG) to reflect your understanding of the topic. Use http://www.dagitty.net/.
  • verify with the principal investigator (PI) and team your understanding of the plan.
  • develop a statistical analysis plan (SAP) and get it reviewed by the PI and team. Document all decisions.
  • find out if a similar project was implemented before. Borrow ideas and code liberally.
  • read about any statistical techniques or epidemiologic designs that you are not familiar with. Do not wing it.

thou shalt know your data

  • understand how the data was generated: administrative, electronic health records, chart review, prospectively collected etc.
  • understand the limitations of the sources and processes of data gathering. What populations were not included? What variables were not measured accurately? What periods were not covered?
  • read the data dictionary: understand what each dataset contains, what identifiers are used, and what the primary and foreign keys are; draw an entity-relationship diagram; know what each variable means and what codes, classification systems or lookup tables were used.
  • check the quality of your data for missing values, gaps in coverage, etc. (also see the next commandment).
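
A basic quality scan can be automated rather than done by eye. The sketch below is plain Python with hypothetical column names ("age", "sex", "admit_date"); in a real project the same idea would be expressed in your analysis package (e.g., Stata's codebook or misstable):

```python
# Sketch of a basic data-quality scan over a list-of-dicts dataset.
# Column names and values here are hypothetical.
from collections import Counter

def missing_report(rows, columns):
    """Count missing (None or empty-string) values per column."""
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                missing[col] += 1
    return dict(missing)

rows = [
    {"age": 67, "sex": "F", "admit_date": "2021-03-02"},
    {"age": None, "sex": "M", "admit_date": ""},
    {"age": 54, "sex": "", "admit_date": "2021-05-11"},
]
# Each column has one missing value in this toy dataset.
print(missing_report(rows, ["age", "sex", "admit_date"]))
```

Running such a report after every import or refresh makes silent data problems visible early.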

thou shalt not lose track of your N

  • Do not violate the law of Preservation of Cohort Identity: no one joins or leaves a cohort without first meeting all entry and exit criteria. Once formed, a cohort's size (N) and composition must never change.
  • check (ideally using asserts in your code) that you have the correct N after each operation that might (often inadvertently) change your N, e.g., importing data, refreshing the data, merging datasets (including SQL joins), dropping records, or even recoding important variables (coding an important variable as missing in an observation is equivalent to dropping that observation).
  • know what observations are excluded and for what reason. Draw an attrition diagram.
  • be careful when merging datasets. Always check which records merged successfully and which remained unmerged, and document these numbers in your code.
  • avoid inner joins in SQL (and similar merge techniques). Use full joins and then check for unmerged records.
  • always check for duplicate records (duplicate primary keys, fully duplicated records, duplicated essential fields).
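
The full-join-then-check pattern above can be sketched in plain Python (the document's examples lean on Stata and SQL; the IDs and fields below are made up for illustration):

```python
# Sketch: full outer join of two keyed datasets, then assert on N and
# inspect unmerged records instead of silently dropping them.
def full_join(left, right):
    """Return (merged, left_only, right_only) for two {key: record} dicts."""
    merged, left_only, right_only = {}, {}, {}
    for key in left.keys() | right.keys():
        if key in left and key in right:
            merged[key] = {**left[key], **right[key]}
        elif key in left:
            left_only[key] = left[key]
        else:
            right_only[key] = right[key]
    return merged, left_only, right_only

cohort = {101: {"age": 70}, 102: {"age": 58}, 103: {"age": 64}}
labs   = {101: {"ldl": 3.1}, 103: {"ldl": 2.4}, 999: {"ldl": 5.0}}

merged, cohort_only, labs_only = full_join(cohort, labs)

# Preservation of Cohort Identity: every cohort member is accounted for.
assert len(merged) + len(cohort_only) == len(cohort)
print(f"merged={len(merged)}, cohort_only={len(cohort_only)}, labs_only={len(labs_only)}")
```

An inner join here would have dropped subject 102 without a trace; the full join forces you to look at the unmerged records (including lab results for 999, who is not in the cohort) and document the counts.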

thou shalt not repeat yourself

  • as much as possible, use the commands, functions, macros, etc. built into your software package (know your software; do not recode existing functionality).
  • use VDEC macros when available. Never duplicate an already written function. Ask before you create your own.
  • use well-tested third-party macros/ados/add-ons. Search using Stata’s findit command and Google; ask in discussion lists, etc.
  • do not write a macro without checking with the team and without clear specifications.
  • in your project, organize your code into clearly named subroutines (macros, programs, etc.) to avoid repetition, improve quality, facilitate testing, and make the code self-documenting.
  • whenever possible, automatically generate repetitive commands using, e.g., Excel's CONCATENATE or template processors.
  • use lookup files (in a standard format, e.g., CSV) in preference to hard-coding classifications and categories.
  • be consistent; use the same variable names and labels throughout your project.
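
As one way to picture the lookup-file advice: keep the mapping in a shared CSV and load it, rather than hard-coding categories in every script. The sketch below uses Python's csv module with hypothetical diagnosis codes (a real project would read a shared file such as a dx_groups.csv):

```python
# Sketch: load a category lookup from CSV instead of hard-coding it.
# The codes and groups below are hypothetical.
import csv
import io

# Stand-in for a shared lookup file, e.g. dx_groups.csv.
lookup_csv = """code,group
I21,MI
I63,Stroke
E11,Diabetes
"""

with io.StringIO(lookup_csv) as f:
    dx_group = {row["code"]: row["group"] for row in csv.DictReader(f)}

print(dx_group.get("I63", "Other"))   # prints "Stroke"
```

When the classification changes, you edit one file and every script that uses the lookup stays consistent.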

thou shalt keep it simple

  • do not over-engineer: do not write code for all possible scenarios and do not try to anticipate the needs of future projects.
  • use the simplest implementation possible. Avoid clever code.
  • name your routines, datasets and variables using short descriptive names. See footnote 1 below.
  • do not spend time optimizing or beautifying the code until you have proven through testing that the code is correct, and through benchmarking and profiling that it is not performant enough and actually needs optimization.
  • do not hard-code magic numbers.
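
A minimal sketch of the magic-number rule, with hypothetical study parameters (the thresholds below are illustrative, not recommendations):

```python
# Sketch: name your constants so the intent of each number is explicit.
# All values here are hypothetical study parameters.
MIN_AGE_YEARS = 18           # cohort entry criterion
ELEVATED_SBP_MMHG = 140      # threshold for "elevated" systolic BP

def is_eligible(age_years, sbp_mmhg):
    # Compare with `if age >= 18 and sbp >= 140`: the same logic, but the
    # reader must guess what 18 and 140 mean and where to change them.
    return age_years >= MIN_AGE_YEARS and sbp_mmhg >= ELEVATED_SBP_MMHG

print(is_eligible(45, 150), is_eligible(16, 150))
```

Named constants also give you a single place to change a threshold for a sensitivity analysis.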

thou shalt test your code

  • build a test rig as you go. Do not wait until the end to start testing.
  • use both black-box and clear-box testing.
  • use Golden Files to ensure that routines continue to produce the same results.
  • generate thorough descriptive tables; assess whether your results are consistent with what is known about the topic.
  • build tests into your code to simplify running the testing code whenever changes are made.
  • do not do any significant refactoring without first ensuring good test coverage.
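
One way to implement the golden-file idea: store a reference copy of a routine's output and assert that reruns still match it. This Python sketch uses a hypothetical results table and a temporary path; a real project would keep the golden file under version control:

```python
# Sketch of a golden-file check: rerun a routine and compare its output
# to a stored reference copy. Paths and the routine are hypothetical.
import json
import pathlib
import tempfile

def make_table():
    """Stand-in for an analysis routine that produces a results table."""
    return {"n": 1250, "events": 87, "rate_per_1000": 69.6}

golden_path = pathlib.Path(tempfile.mkdtemp()) / "table1.golden.json"

if not golden_path.exists():
    # First run: record the current output as the golden copy.
    golden_path.write_text(json.dumps(make_table(), sort_keys=True))

golden = json.loads(golden_path.read_text())
assert make_table() == golden, "results changed -- investigate before accepting"
print("golden-file check passed")
```

When results legitimately change, you update the golden file deliberately (and note why in the changelog), rather than discovering a drift months later.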

thou shalt maintain a good workflow

  • learn how to use a version control system (e.g., Git); commit changes frequently and push changes to your remote repo at least once every day or two.
  • create branches when planning major changes (e.g., refactoring, implementing a new algorithm) to simplify comparison and, if necessary, reverting to older working code.
  • use VDEC’s standard directory structure to store your data, code and docs.
  • use standard and consistent filenames.
  • keep your code files reasonably short.

thou shalt document your code

  • comment your code early, revise comments as you make changes, and avoid stale comments and documentation.
  • comments should explain the intent rather than the mechanics of the code (unless you are documenting a tricky algorithm or an unusual feature). See footnote 2.
  • use documenting tools to automate collation of your comments but do not go overboard with structure and conventions. A simple and clear comment written in grammatically correct English is superior to a page of fragmented “machine-readable” comments.
  • learn Markdown.
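
To illustrate intent versus mechanics, here is a small Python sketch; the new-user washout rule is a hypothetical example, and dates are simplified to integer day offsets:

```python
# Sketch: a comment that explains intent rather than mechanics.
# The washout rule below is hypothetical; dates are integer day offsets.

# A mechanics-style comment merely restates the code, e.g.:
#   "loop over rx_dates and check each difference exceeds washout_days"
# The intent-style comment inside the function explains *why* the code exists.

def washout_ok(rx_dates, index_date, washout_days=365):
    # New-user design: exclude anyone with a statin prescription in the
    # year before cohort entry, so "initiation" is well defined.
    return all(index_date - d > washout_days for d in rx_dates)

print(washout_ok([100, 200], index_date=700))   # no Rx in the prior year
```

The intent comment survives refactoring: however the check is rewritten, the reason for it stays accurate.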

thou shalt make your results reproducible

  • never copy and paste tables and figures.
  • use standard output formats, e.g., Markdown, SVG.
  • never throw away earlier results; always compare them with new results and understand the reason for any changes.
  • keep a changelog file to explain why results have changed.
  • never change the source data, manually or otherwise; derive a new dataset using scripts.
  • use confidence intervals in preference to p values.
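
As a reminder of what the last point asks for in practice, here is a minimal sketch of reporting an estimate with a 95% confidence interval (normal approximation with z = 1.96; the data are made up):

```python
# Sketch: report an estimate with a 95% CI rather than a bare p value.
# Values are made up; uses the normal approximation (z = 1.96).
from math import sqrt
from statistics import mean, stdev

values = [5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 5.3, 6.0, 5.1, 5.4]
m = mean(values)
se = stdev(values) / sqrt(len(values))   # standard error of the mean
lo, hi = m - 1.96 * se, m + 1.96 * se

print(f"mean = {m:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The interval conveys both the estimate and its precision, which a p value alone does not.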

thou shalt be productive

  • avoid distractions and wild-goose chases.
  • check early on with the PI and team regarding the overall direction.
  • do not refactor code (yours or others') unless absolutely necessary.

thou shalt have a systematic way to fit your models

Footnotes

  1. Variable naming: While x and y are not ideal, i for a loop variable is preferable to index, subjectID is preferable to subject_identifier, and definitely no average_number_of_statin_prescription_per_person! In data analysis, the standard should not be that variable names stand on their own, since the context can never be ignored when analyzing data and because we have other mechanisms to document variables (e.g., labels for dataset variables). Instead, use short names (e.g., avgStatinRxs) and, when possible (e.g., for dataset variables), add short, accurate and grammatically correct labels and value labels. Long dataset variable names discourage interactive exploration of the data and make models harder to understand (when they are spread over multiple lines). Global constants (often implemented as variables in statistical software) can still have longer names, but it is best to replace them (whenever possible, e.g., using classes in Stata) and/or pass them as parameters to well-defined routines. Generally, when you find that you are using long variable names (e.g., for name-spacing and to prevent conflicts), the solution is usually to embed these variables in small routines where they will have a well-defined context and, usually, a local scope.

  2. A comment like ``