Methods for Considering social risk alongside genetic risk for bipolar disorder in the All of Us Research Program

The pipeline runs in four stages:

Stage	Directory	Where it runs	What it produces
01	`01_LocalPRSComputation/`	Your HPC cluster (SLURM)	PRS-CS-auto posterior SNP weights
02	`02_PRSonAoU/`	AoU Workbench (Hail / Dataproc / PLINK)	Per-person BD PRS + covariate table
03	`03_socialSurveys/`	AoU Workbench (R)	Six scored SDoH surveys
04	`04_mainPRS_Pipeline/`	AoU Workbench (R)	Cohorts, PRS performance, SDoH associations, PRS×SDoH interaction

Stages 01–02 compute the PRS (method-agnostic from Step 2 on). Stages 03–04 perform the statistical analyses presented in the manuscript.

Stage 01–02 — Computing the BD PRS

PRS-CS-auto runs on your HPC cluster (public GWAS + reference panels, no AoU data). The rest runs in the AoU Workbench: liftover to hg38, harmonize against the AoU WGS MatrixTable, score with PLINK, and build covariates.

Step	Script	Where to run
1	`runPRScs_array.sh`	HPC cluster (SLURM)
2	`001a_LiftoverWeightsFiles.py`	AoU Workbench (Hail)
3	`001b_PRS_byChr.py`	AoU Workbench (Dataproc)
4	`001c_runPlink.py`	AoU Workbench (standard)
—	`getCovariates.py`	AoU Workbench (standard)

Step 1 — Estimate posterior weights (`runPRScs_array.sh`)

Runs PRS-CS-auto per chromosome on the BD GWAS summary statistics files, producing posterior SNP weights. Concatenate into one genome-wide weights file.

Multi-ancestry variant — PRS-CSx (alternative to PRS-CS-auto)

PRS-CSx is a multi-ancestry alternative to PRS-CS-auto. You can run it using the same starting script as is provided for PRS-CS-auto and multi-population inputs. More details provided in runPRScs_array, or you can swap the call for:

python PRScsx/PRScsx.py \
    --ref_dir=<1000G multi-population LD panels> \
    --bim_prefix=<multi-population HapMap3, e.g. hm3_multipop> \
    --sst_file=eur_sumstats.txt,eas_sumstats.txt,afr_sumstats.txt \
    --n_gwas=220416,16917,8999 \        # per-population effective N: EUR, EAS, AFR 
    --pop=EUR,EAS,AFR \
    --meta=True \                        # inverse-variance-weighted meta posterior
    --n_iter=10000 --n_burnin=5000 \
    --out_dir=<out> --out_name=OConnellPRSCSX --chrom=${CHROM}

Changes vs PRS-CS-auto: per-population summary statistics and effective Ns (the O'Connell BD GWAS provided EUR/EAS/AFR, no AMR), a multi-population HapMap3/LD reference, and --meta=True for a single inverse-variance-weighted meta posterior. Concatenate the per-chromosome meta files to get the genome-wide PRS-CSx weights.

Step 2 — Liftover to hg38 (`001a_LiftoverWeightsFiles.py`)

Lifts the posterior weights GRCh37 → GRCh38 (the AoU WGS build) with Hail's liftover, then writes a clean weights CSV and a keyed Hail Table checkpoint for Step 3.

Step 3 — Per-chromosome filtering and PLINK export (`001b_PRS_byChr.py`)

The compute-heavy step. For each chromosome it filters the AoU WGS ACAF MatrixTable to the PRS variants, harmonizes alleles, and exports PLINK-format files. We used 16 workers (4 CPU / 15 GB / 250 GB each; 8 CPU / 30 GB / 500 GB master) in a Hail environment.

Per-chromosome processing is deliberate: the AoU WGS MatrixTable has ~145,000 Hail partitions, so a naive genome-wide filter crashes the Spark driver, and re-reading the full MT for all 22 chromosomes is slow. This script reads the MT fresh for each chromosome (avoiding Spark staleness), uses filter_intervals to restrict to that chromosome's partitions, harmonizes, checkpoints, and exports. Completed chromosomes are auto-detected and skipped, so the script is resumable across one or more runs.

Set SCORE_PREFIX (a short analysis label, OConnell for this paper) in the config; the paths work if you followed Steps 1–2. Run it in a tmux session so it survives browser disconnects, with cluster auto-pause set to NEVER:

tmux new -s prs
python3 ~/001b_PRS_byChr.py 2>&1 | tee ~/001b_run.log
# Ctrl+B, then D to detach; tmux attach -t prs to re-attach

Step 4 — PLINK scoring and aggregation (`001c_runPlink.py`)

Scores each chromosome with plink --score, then sums across chromosomes to a genome-wide PRS per person. A standard AoU environment is fine. Set SCORE_PREFIX (same label as Step 3). For each chromosome it downloads the PLINK trio plus matched weights, runs plink --bfile ... --score <id> <effect_allele> <weight> sum (dosage-weighted additive sum, columns 3/4/5), reads the .profile output, and merges into a running frame; after chromosome 22 it sums to one score per person.

Covariate construction (`getCovariates.py`)

Assembles demographic, ancestry, and EHR-density covariates from AoU controlled-tier data, independent of the scoring pipeline.

Variables produced:

Demographics: race, ethnicity, sex at birth, gender.
SexGender: three-level concordance (Cis_female, Cis_male, Non-cis).
Genetic ancestry (ancestry_pred: eur/afr/amr/eas) and PCs, kept as PC_1…PC_16.
ehr_length: days between first and last EHR event.
record_depth: count of distinct condition + observation visits.
age_at_last_event: age at most recent EHR record.

Inclusion: participants with ehr_length ≤ 1095 days (3 years) or record_depth ≤ 2 are removed (insufficient EHR depth).

Stage 03 — Scoring the six SDoH surveys (`03_socialSurveys/`)

From here on everything is R. The six social-determinants-of-health (SDoH) surveys are extracted from AoU survey responses and scored into per-participant scale scores

Stage 04 — The statistical pipeline (`04_mainPRS_Pipeline/`)

This is where the manuscript's results come from. Every script sources 00_config.R

Script	What it does
`00_config.R`	Shared functions + SDoH strata definitions
`01_defineCohorts.R`	EHR phenotyping → BD case/control labels
`02_masterTable.R`	Merge + PC-residualize + global-standardize the PRS
`03_prsMain.R`	PRS performance (AUC, liability R², decile ORs)
`04_socialVarSetup.R`	Merge surveys, z-score with directionality, build ACE/CRMN binaries
`05_sdohAssociations.R`	Per-SDoH logistic associations + incremental AUC + LOO-LRT
`06_interactionAnalysis.R`	PRS×SDoH interaction — multiplicative (BH-FDR) + additive (RERI)

Adapting for your project

Swap the PRS: any per-variant weights with columns CHR SNP BP A1 A2 BETA work from Step 2 on. Update WEIGHTS_FILENAME (Step 2) and SCORE_PREFIX (Steps 3–4), and confirm the AoU WGS path points to your release.
Swap the phenotype: replace the four concept sets in 01_defineCohorts.R by either rebuilding from our code list provided in SupplementalInformation, or substituting your own phenotype.
Swap the prevalence: change K = 0.02 in 00_config.R to your disorder's prevalence — it affects both the liability-scale R² and the absolute-risk curves.

Acknowledgments and data access

This research was conducted using data from the All of Us Research Program. We gratefully acknowledge All of Us participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health's All of Us Research Program for making available the participant data examined in this study.

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director.

The code in this repository reads All of Us Controlled Tier data inside the secure Researcher Workbench and contains no participant-level data, genomic data, or individual records. All reported and derived values respect the All of Us small-cell rule (no participant count of 1–20 is published or back-calculable); in the code this surfaces as the n ≥ 50 / n_cases ≥ 20 guards.

Contact

Please contact rsharp@unc.edu with any questions

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
01_ComputePRSWeights		01_ComputePRSWeights
02_PRSonAoU		02_PRSonAoU
03_socialSurveys		03_socialSurveys
04_mainPRSPipeline		04_mainPRSPipeline
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Methods for Considering social risk alongside genetic risk for bipolar disorder in the All of Us Research Program

Stage 01–02 — Computing the BD PRS

Step 1 — Estimate posterior weights (`runPRScs_array.sh`)

Multi-ancestry variant — PRS-CSx (alternative to PRS-CS-auto)

Step 2 — Liftover to hg38 (`001a_LiftoverWeightsFiles.py`)

Step 3 — Per-chromosome filtering and PLINK export (`001b_PRS_byChr.py`)

Step 4 — PLINK scoring and aggregation (`001c_runPlink.py`)

Covariate construction (`getCovariates.py`)

Stage 03 — Scoring the six SDoH surveys (`03_socialSurveys/`)

Stage 04 — The statistical pipeline (`04_mainPRS_Pipeline/`)

Adapting for your project

Acknowledgments and data access

Contact

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Methods for Considering social risk alongside genetic risk for bipolar disorder in the All of Us Research Program

Stage 01–02 — Computing the BD PRS

Step 1 — Estimate posterior weights (runPRScs_array.sh)

Multi-ancestry variant — PRS-CSx (alternative to PRS-CS-auto)

Step 2 — Liftover to hg38 (001a_LiftoverWeightsFiles.py)

Step 3 — Per-chromosome filtering and PLINK export (001b_PRS_byChr.py)

Step 4 — PLINK scoring and aggregation (001c_runPlink.py)

Covariate construction (getCovariates.py)

Stage 03 — Scoring the six SDoH surveys (03_socialSurveys/)

Stage 04 — The statistical pipeline (04_mainPRS_Pipeline/)

Adapting for your project

Acknowledgments and data access

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Step 1 — Estimate posterior weights (`runPRScs_array.sh`)

Step 2 — Liftover to hg38 (`001a_LiftoverWeightsFiles.py`)

Step 3 — Per-chromosome filtering and PLINK export (`001b_PRS_byChr.py`)

Step 4 — PLINK scoring and aggregation (`001c_runPlink.py`)

Covariate construction (`getCovariates.py`)

Stage 03 — Scoring the six SDoH surveys (`03_socialSurveys/`)

Stage 04 — The statistical pipeline (`04_mainPRS_Pipeline/`)

Packages