Methods for Considering social risk alongside genetic risk for bipolar disorder in the All of Us Research Program
The pipeline runs in four stages:
| Stage | Directory | Where it runs | What it produces |
|---|---|---|---|
| 01 | 01_LocalPRSComputation/ |
Your HPC cluster (SLURM) | PRS-CS-auto posterior SNP weights |
| 02 | 02_PRSonAoU/ |
AoU Workbench (Hail / Dataproc / PLINK) | Per-person BD PRS + covariate table |
| 03 | 03_socialSurveys/ |
AoU Workbench (R) | Six scored SDoH surveys |
| 04 | 04_mainPRS_Pipeline/ |
AoU Workbench (R) | Cohorts, PRS performance, SDoH associations, PRS×SDoH interaction |
Stages 01–02 compute the PRS (method-agnostic from Step 2 on). Stages 03–04 perform the statistical analyses presented in the manuscript.
PRS-CS-auto runs on your HPC cluster (public GWAS + reference panels, no AoU data). The rest runs in the AoU Workbench: liftover to hg38, harmonize against the AoU WGS MatrixTable, score with PLINK, and build covariates.
| Step | Script | Where to run |
|---|---|---|
| 1 | runPRScs_array.sh |
HPC cluster (SLURM) |
| 2 | 001a_LiftoverWeightsFiles.py |
AoU Workbench (Hail) |
| 3 | 001b_PRS_byChr.py |
AoU Workbench (Dataproc) |
| 4 | 001c_runPlink.py |
AoU Workbench (standard) |
| — | getCovariates.py |
AoU Workbench (standard) |
Runs PRS-CS-auto per chromosome on the BD GWAS summary statistics files, producing posterior SNP weights. Concatenate into one genome-wide weights file.
PRS-CSx is a multi-ancestry alternative to PRS-CS-auto. You can run it using the same starting script as is provided for PRS-CS-auto
and multi-population inputs. More details provided in runPRScs_array, or you can swap the call for:
python PRScsx/PRScsx.py \
--ref_dir=<1000G multi-population LD panels> \
--bim_prefix=<multi-population HapMap3, e.g. hm3_multipop> \
--sst_file=eur_sumstats.txt,eas_sumstats.txt,afr_sumstats.txt \
--n_gwas=220416,16917,8999 \ # per-population effective N: EUR, EAS, AFR
--pop=EUR,EAS,AFR \
--meta=True \ # inverse-variance-weighted meta posterior
--n_iter=10000 --n_burnin=5000 \
--out_dir=<out> --out_name=OConnellPRSCSX --chrom=${CHROM}Changes vs PRS-CS-auto: per-population summary statistics and effective Ns (the O'Connell BD GWAS provided EUR/EAS/AFR, no AMR), a multi-population HapMap3/LD reference, and --meta=True for a single inverse-variance-weighted meta posterior. Concatenate the per-chromosome meta files to get the genome-wide PRS-CSx weights.
Lifts the posterior weights GRCh37 → GRCh38 (the AoU WGS build) with Hail's liftover, then writes a clean weights CSV and a keyed Hail Table checkpoint for Step 3.
The compute-heavy step. For each chromosome it filters the AoU WGS ACAF MatrixTable to the PRS variants, harmonizes alleles, and exports PLINK-format files. We used 16 workers (4 CPU / 15 GB / 250 GB each; 8 CPU / 30 GB / 500 GB master) in a Hail environment.
Per-chromosome processing is deliberate: the AoU WGS MatrixTable has ~145,000 Hail partitions, so a naive genome-wide filter crashes the Spark driver, and re-reading the full MT for all 22 chromosomes is slow. This script reads the MT fresh for each chromosome (avoiding Spark staleness), uses filter_intervals to restrict to that chromosome's partitions, harmonizes, checkpoints, and exports. Completed chromosomes are auto-detected and skipped, so the script is resumable across one or more runs.
Set SCORE_PREFIX (a short analysis label, OConnell for this paper) in the config; the paths work if you followed Steps 1–2. Run it in a tmux session so it survives browser disconnects, with cluster auto-pause set to NEVER:
tmux new -s prs
python3 ~/001b_PRS_byChr.py 2>&1 | tee ~/001b_run.log
# Ctrl+B, then D to detach; tmux attach -t prs to re-attachScores each chromosome with plink --score, then sums across chromosomes to a genome-wide PRS per person. A standard AoU environment is fine. Set SCORE_PREFIX (same label as Step 3). For each chromosome it downloads the PLINK trio plus matched weights, runs plink --bfile ... --score <id> <effect_allele> <weight> sum (dosage-weighted additive sum, columns 3/4/5), reads the .profile output, and merges into a running frame; after chromosome 22 it sums to one score per person.
Assembles demographic, ancestry, and EHR-density covariates from AoU controlled-tier data, independent of the scoring pipeline.
Variables produced:
- Demographics: race, ethnicity, sex at birth, gender.
SexGender: three-level concordance (Cis_female,Cis_male,Non-cis).- Genetic ancestry (
ancestry_pred:eur/afr/amr/eas) and PCs, kept asPC_1…PC_16. ehr_length: days between first and last EHR event.record_depth: count of distinct condition + observation visits.age_at_last_event: age at most recent EHR record.
Inclusion: participants with ehr_length ≤ 1095 days (3 years) or record_depth ≤ 2 are removed (insufficient EHR depth).
From here on everything is R. The six social-determinants-of-health (SDoH) surveys are extracted from AoU survey responses and scored into per-participant scale scores
This is where the manuscript's results come from. Every script sources 00_config.R
| Script | What it does |
|---|---|
00_config.R |
Shared functions + SDoH strata definitions |
01_defineCohorts.R |
EHR phenotyping → BD case/control labels |
02_masterTable.R |
Merge + PC-residualize + global-standardize the PRS |
03_prsMain.R |
PRS performance (AUC, liability R², decile ORs) |
04_socialVarSetup.R |
Merge surveys, z-score with directionality, build ACE/CRMN binaries |
05_sdohAssociations.R |
Per-SDoH logistic associations + incremental AUC + LOO-LRT |
06_interactionAnalysis.R |
PRS×SDoH interaction — multiplicative (BH-FDR) + additive (RERI) |
- Swap the PRS: any per-variant weights with columns
CHR SNP BP A1 A2 BETAwork from Step 2 on. UpdateWEIGHTS_FILENAME(Step 2) andSCORE_PREFIX(Steps 3–4), and confirm the AoU WGS path points to your release. - Swap the phenotype: replace the four concept sets in
01_defineCohorts.Rby either rebuilding from our code list provided in SupplementalInformation, or substituting your own phenotype. - Swap the prevalence: change
K = 0.02in00_config.Rto your disorder's prevalence — it affects both the liability-scale R² and the absolute-risk curves.
This research was conducted using data from the All of Us Research Program. We gratefully acknowledge All of Us participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health's All of Us Research Program for making available the participant data examined in this study.
The All of Us Research Program is supported by the National Institutes of Health, Office of the Director.
The code in this repository reads All of Us Controlled Tier data inside the secure Researcher Workbench and contains no participant-level data, genomic data, or individual records. All reported and derived values respect the All of Us small-cell rule (no participant count of 1–20 is published or back-calculable); in the code this surfaces as the n ≥ 50 / n_cases ≥ 20 guards.
Please contact rsharp@unc.edu with any questions