Skip to content

thelovelab/bipolarPRSAOU

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Methods for Considering social risk alongside genetic risk for bipolar disorder in the All of Us Research Program

The pipeline runs in four stages:

Stage Directory Where it runs What it produces
01 01_LocalPRSComputation/ Your HPC cluster (SLURM) PRS-CS-auto posterior SNP weights
02 02_PRSonAoU/ AoU Workbench (Hail / Dataproc / PLINK) Per-person BD PRS + covariate table
03 03_socialSurveys/ AoU Workbench (R) Six scored SDoH surveys
04 04_mainPRS_Pipeline/ AoU Workbench (R) Cohorts, PRS performance, SDoH associations, PRS×SDoH interaction

Stages 01–02 compute the PRS (method-agnostic from Step 2 on). Stages 03–04 perform the statistical analyses presented in the manuscript.


Stage 01–02 — Computing the BD PRS

PRS-CS-auto runs on your HPC cluster (public GWAS + reference panels, no AoU data). The rest runs in the AoU Workbench: liftover to hg38, harmonize against the AoU WGS MatrixTable, score with PLINK, and build covariates.

Step Script Where to run
1 runPRScs_array.sh HPC cluster (SLURM)
2 001a_LiftoverWeightsFiles.py AoU Workbench (Hail)
3 001b_PRS_byChr.py AoU Workbench (Dataproc)
4 001c_runPlink.py AoU Workbench (standard)
getCovariates.py AoU Workbench (standard)

Step 1 — Estimate posterior weights (runPRScs_array.sh)

Runs PRS-CS-auto per chromosome on the BD GWAS summary statistics files, producing posterior SNP weights. Concatenate into one genome-wide weights file.

Multi-ancestry variant — PRS-CSx (alternative to PRS-CS-auto)

PRS-CSx is a multi-ancestry alternative to PRS-CS-auto. You can run it using the same starting script as is provided for PRS-CS-auto and multi-population inputs. More details provided in runPRScs_array, or you can swap the call for:

python PRScsx/PRScsx.py \
    --ref_dir=<1000G multi-population LD panels> \
    --bim_prefix=<multi-population HapMap3, e.g. hm3_multipop> \
    --sst_file=eur_sumstats.txt,eas_sumstats.txt,afr_sumstats.txt \
    --n_gwas=220416,16917,8999 \        # per-population effective N: EUR, EAS, AFR 
    --pop=EUR,EAS,AFR \
    --meta=True \                        # inverse-variance-weighted meta posterior
    --n_iter=10000 --n_burnin=5000 \
    --out_dir=<out> --out_name=OConnellPRSCSX --chrom=${CHROM}

Changes vs PRS-CS-auto: per-population summary statistics and effective Ns (the O'Connell BD GWAS provided EUR/EAS/AFR, no AMR), a multi-population HapMap3/LD reference, and --meta=True for a single inverse-variance-weighted meta posterior. Concatenate the per-chromosome meta files to get the genome-wide PRS-CSx weights.

Step 2 — Liftover to hg38 (001a_LiftoverWeightsFiles.py)

Lifts the posterior weights GRCh37 → GRCh38 (the AoU WGS build) with Hail's liftover, then writes a clean weights CSV and a keyed Hail Table checkpoint for Step 3.

Step 3 — Per-chromosome filtering and PLINK export (001b_PRS_byChr.py)

The compute-heavy step. For each chromosome it filters the AoU WGS ACAF MatrixTable to the PRS variants, harmonizes alleles, and exports PLINK-format files. We used 16 workers (4 CPU / 15 GB / 250 GB each; 8 CPU / 30 GB / 500 GB master) in a Hail environment.

Per-chromosome processing is deliberate: the AoU WGS MatrixTable has ~145,000 Hail partitions, so a naive genome-wide filter crashes the Spark driver, and re-reading the full MT for all 22 chromosomes is slow. This script reads the MT fresh for each chromosome (avoiding Spark staleness), uses filter_intervals to restrict to that chromosome's partitions, harmonizes, checkpoints, and exports. Completed chromosomes are auto-detected and skipped, so the script is resumable across one or more runs.

Set SCORE_PREFIX (a short analysis label, OConnell for this paper) in the config; the paths work if you followed Steps 1–2. Run it in a tmux session so it survives browser disconnects, with cluster auto-pause set to NEVER:

tmux new -s prs
python3 ~/001b_PRS_byChr.py 2>&1 | tee ~/001b_run.log
# Ctrl+B, then D to detach; tmux attach -t prs to re-attach

Step 4 — PLINK scoring and aggregation (001c_runPlink.py)

Scores each chromosome with plink --score, then sums across chromosomes to a genome-wide PRS per person. A standard AoU environment is fine. Set SCORE_PREFIX (same label as Step 3). For each chromosome it downloads the PLINK trio plus matched weights, runs plink --bfile ... --score <id> <effect_allele> <weight> sum (dosage-weighted additive sum, columns 3/4/5), reads the .profile output, and merges into a running frame; after chromosome 22 it sums to one score per person.

Covariate construction (getCovariates.py)

Assembles demographic, ancestry, and EHR-density covariates from AoU controlled-tier data, independent of the scoring pipeline.

Variables produced:

  • Demographics: race, ethnicity, sex at birth, gender.
  • SexGender: three-level concordance (Cis_female, Cis_male, Non-cis).
  • Genetic ancestry (ancestry_pred: eur/afr/amr/eas) and PCs, kept as PC_1…PC_16.
  • ehr_length: days between first and last EHR event.
  • record_depth: count of distinct condition + observation visits.
  • age_at_last_event: age at most recent EHR record.

Inclusion: participants with ehr_length ≤ 1095 days (3 years) or record_depth ≤ 2 are removed (insufficient EHR depth).


Stage 03 — Scoring the six SDoH surveys (03_socialSurveys/)

From here on everything is R. The six social-determinants-of-health (SDoH) surveys are extracted from AoU survey responses and scored into per-participant scale scores


Stage 04 — The statistical pipeline (04_mainPRS_Pipeline/)

This is where the manuscript's results come from. Every script sources 00_config.R

Script What it does
00_config.R Shared functions + SDoH strata definitions
01_defineCohorts.R EHR phenotyping → BD case/control labels
02_masterTable.R Merge + PC-residualize + global-standardize the PRS
03_prsMain.R PRS performance (AUC, liability R², decile ORs)
04_socialVarSetup.R Merge surveys, z-score with directionality, build ACE/CRMN binaries
05_sdohAssociations.R Per-SDoH logistic associations + incremental AUC + LOO-LRT
06_interactionAnalysis.R PRS×SDoH interaction — multiplicative (BH-FDR) + additive (RERI)

Adapting for your project

  1. Swap the PRS: any per-variant weights with columns CHR SNP BP A1 A2 BETA work from Step 2 on. Update WEIGHTS_FILENAME (Step 2) and SCORE_PREFIX (Steps 3–4), and confirm the AoU WGS path points to your release.
  2. Swap the phenotype: replace the four concept sets in 01_defineCohorts.R by either rebuilding from our code list provided in SupplementalInformation, or substituting your own phenotype.
  3. Swap the prevalence: change K = 0.02 in 00_config.R to your disorder's prevalence — it affects both the liability-scale R² and the absolute-risk curves.

Acknowledgments and data access

This research was conducted using data from the All of Us Research Program. We gratefully acknowledge All of Us participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health's All of Us Research Program for making available the participant data examined in this study.

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director.

The code in this repository reads All of Us Controlled Tier data inside the secure Researcher Workbench and contains no participant-level data, genomic data, or individual records. All reported and derived values respect the All of Us small-cell rule (no participant count of 1–20 is published or back-calculable); in the code this surfaces as the n ≥ 50 / n_cases ≥ 20 guards.

Contact

Please contact rsharp@unc.edu with any questions

About

PRS Calculation Pipeline performed in the All of Us Research Workbench

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors