Genome // How to Run

01 / 10 · Overview

What this builds

You drop a 23andMe v5 raw genotype download into a folder, run three commands, and end up with an interactive dashboard at http://localhost:8732 covering:

Pharmacogenomics: PharmCAT 3.2 haplotype calls for 16+ drug-response genes (CYP2C19, VKORC1, TPMT, DPYD, SLCO1B1, etc.)
Polygenic risk scores: 10 traits via PGS Catalog (CAD, T2D, AD, breast/prostate cancer, BMI, LDL, MDD, etc.)
ClinVar / ACMG SF v3.2: monogenic findings + recessive carrier panel
Curated nutrition + lifestyle SNP panel: ~60 well-known variants (MTHFR, APOE, ALDH2, FTO, etc.)
TOPMed-imputed VCF: the chip's ~640K variants expanded to ~9 M high-confidence variants (R²≥0.8)

data/genome.zip                              23andMe download (you provide)
  │
  ▼  pipeline/01_23andme_to_tsv.py
data/raw_calls.tsv                           (drops --, II, DD; expand hemizygous)
  │
  ▼  pipeline/02_tsv_to_vcf.sh               (uses GRCh37 FASTA)
data/raw_grch37.vcf.gz
  │
  ▼  pipeline/04_split_for_topmed.sh
data/topmed_input/chr*.vcf.gz
  │
  ▼  pipeline/04b_pad_with_1000g.sh          (merge 30 public 1000G samples)
data/topmed_input_padded_autosomes/
  │  → MANUAL: upload to TOPMed Imputation Server
  ▼  ← MANUAL: download encrypted .zip results
data/topmed_output/chr_*.zip
  │
  ▼  pipeline/run_post_imputation_pipeline.sh
data/imputed_grch38_r2_0.8.vcf.gz            (~9M variants, R²≥0.8)
  │
  ├──▶ PharmCAT          → output/pharmcat/
  ├──▶ ClinVar/ACMG      → output/raw_findings/imputed_grch38/
  ├──▶ rsID panels       → output/raw_findings/imputed_panels.tsv
  └──▶ PGS Catalog PRS   → output/raw_findings/prs_scores.tsv
  │
  ▼
output/web/                                  (interactive dashboard)
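The first step in the diagram (dropping --, II, DD and expanding hemizygous calls) might look roughly like the sketch below. It is not the repo's 01_23andme_to_tsv.py. The function name parse_23andme is hypothetical, the input layout is the usual 23andMe raw format ('#' comment lines, then tab-separated rsid, chromosome, position, genotype), and reading "expand hemizygous" as doubling a single-allele call is an assumption.

```python
import io

# Genotype strings the overview says are dropped: no-calls and indel placeholders.
DROPPED = {"--", "II", "DD"}


def parse_23andme(lines):
    """Yield (rsid, chrom, pos, genotype) tuples for usable calls."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' comment header
        rsid, chrom, pos, genotype = line.split("\t")
        if genotype in DROPPED:
            continue
        if len(genotype) == 1:
            if genotype in "ID-":
                continue  # single-character indel or no-call: also skipped
            # Assumption: a hemizygous call such as "C" is written as "CC".
            genotype = genotype * 2
        yield rsid, chrom, int(pos), genotype


# Tiny made-up input for illustration only.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs1\t1\t1000\tAG\n"
    "rs2\t1\t2000\t--\n"
    "rs3\tX\t3000\tC\n"
    "rs4\t2\t4000\tII\n"
)
calls = list(parse_23andme(sample))
print(calls)
```

Only rs1 and the expanded rs3 survive in this toy input. The no-call and the insertion placeholder are filtered out.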

02 / 10 · Prerequisites

What you need before starting

Operating system: macOS / Linux (arm64 or x86_64)
Free disk: ~10 GB (3 GB ref + 7 GB imputation)
RAM: ~16 GB (PRS computation peaks 4 GB)
Homebrew: Required (brew.sh)
23andMe data: v5 chip .zip (Browse Raw Data → Download)
TOPMed account: Free (5 min to register)
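A few of these prerequisites can be checked from a script before running setup. The sketch below is an illustration, not part of the repo: the function name preflight and its injectable parameters are hypothetical, and the 10 GB threshold is taken from the list above.

```python
import platform
import shutil

MIN_FREE_GB = 10  # free-disk figure from the prerequisites above


def preflight(path=".", which=shutil.which, disk_usage=shutil.disk_usage,
              system=platform.system, machine=platform.machine):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if system() not in ("Darwin", "Linux"):
        problems.append(f"unsupported OS: {system()}")
    if machine() not in ("arm64", "aarch64", "x86_64"):
        problems.append(f"unsupported CPU architecture: {machine()}")
    free_gb = disk_usage(path).free / 1e9
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.1f} GB free, need ~{MIN_FREE_GB} GB")
    if which("brew") is None:
        problems.append("Homebrew (brew) not found on PATH")
    return problems


for problem in preflight():
    print("WARNING:", problem)
```

The system calls are passed in as parameters only so the check can be exercised with fake values. Calling preflight() with no arguments inspects the real machine.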

03 / 10 · Setup

One-time environment setup

Clone the repo, then run the setup script. It installs bcftools, samtools, p7zip, openjdk, plink2, and PharmCAT 3.2, plus the Python dependencies. The script is idempotent, so it is safe to re-run.

git clone https://github.com/jlgao2/personal-genome-pipeline.git
cd personal-genome-pipeline
bash pipeline/00_setup.sh

Time: ~5 min, most of which is the Homebrew install of openjdk (Java for PharmCAT).

04 / 10 · Phase 1

Convert 23andMe TSV → VCF, prep for imputation

01 · Drop in zip (~30 sec)
Move your 23andMe download into ./data/. The script auto-detects any .zip file.

02 · Set name (~5 sec)
SAMPLE_NAME appears in VCF headers and the dashboard. Pick anything.

03 · Run phase 1 (~30 min)
On the first run, the script downloads the 3 GB GRCh37 FASTA, then converts, QCs, and pads with public 1000G samples for TOPMed compliance.

cp ~/Downloads/genome_*.zip data/
export SAMPLE_NAME="Your_Name"
bash pipeline/00_run_phase1.sh

What you'll have at the end

data/topmed_input_padded_autosomes/ — 22 small .vcf.gz files (~11 MB total) ready to upload to TOPMed Imputation Server.

Why "padded"? TOPMed requires ≥20 samples per submission (a privacy/QC safeguard). The 04b_pad_with_1000g.sh script automatically merges your sample with 30 public 1000 Genomes reference samples, restricted to your chip's positions (~30 MB streamed from the EBI FTP). After imputation, the pipeline extracts only your sample and discards the others.
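To confirm that a padded file clears the 20-sample minimum, you can count the sample columns in its #CHROM header line. This is a minimal sketch, not a script from the repo: the function name sample_names is hypothetical and the PAD01..PAD30 sample IDs are placeholders, not real 1000 Genomes IDs.

```python
MIN_SAMPLES = 20  # minimum per submission stated above


def sample_names(vcf_lines):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # The nine fixed columns end at FORMAT; the rest are samples.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached data records without seeing a header line
    return []


# Hypothetical header: your sample plus 30 padding samples.
columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
samples = ["Your_Name"] + [f"PAD{i:02d}" for i in range(1, 31)]
header = ["##fileformat=VCFv4.2\n", "\t".join(columns + samples) + "\n"]

names = sample_names(header)
print(len(names), "samples; meets minimum:", len(names) >= MIN_SAMPLES)
```

For a real file, pass an open text handle instead of the list, for example gzip.open(path, "rt"). Python's gzip module can read bgzip-compressed VCFs.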

05 / 10 · Phase 2

TOPMed imputation (manual web step)

This is the only step the pipeline can't automate — you have to register and click through the TOPMed Imputation Server's web UI. ~5 min hands-on, then 3-12 hours wall-clock for the server to process.

Register at imputation.biodatacatalyst.nhlbi.nih.gov — verify your email.
Click Run → Genotype Imputation (Minimac4).
Settings:
- Reference Panel: TOPMed r3
- Array Build: GRCh37/hg19
- Phasing: Eagle v2.4
- Population: vs. TOPMed Panel (mixed)
- Mode: Quality Control & Imputation
Drag in all 22 chr*.vcf.gz files from data/topmed_input_padded_autosomes/. Do NOT include the .tbi files — TOPMed builds its own indexes server-side.
Submit. You'll get two emails over the next 3-12 hours: a curl-command download script, and the encryption password.
Run the curl command from data/topmed_output/ to fetch the ~10 GB of chr_*.zip files.
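Before dragging files into the upload form, it can help to list exactly what will be sent. The sketch below is illustrative: the function name upload_candidates is hypothetical, and the demo file names (chr1.vcf.gz and so on) are an assumption about the naming, which this guide only gives as chr*.vcf.gz. The glob pattern must match the whole file name, so the .tbi indexes are excluded automatically.

```python
import tempfile
from pathlib import Path


def upload_candidates(directory):
    """Return chr*.vcf.gz files; names ending in .tbi do not match the pattern."""
    return sorted(Path(directory).glob("chr*.vcf.gz"))


# Demo on a throwaway directory holding 22 VCFs and their 22 index files.
with tempfile.TemporaryDirectory() as tmp:
    for n in range(1, 23):
        (Path(tmp) / f"chr{n}.vcf.gz").touch()
        (Path(tmp) / f"chr{n}.vcf.gz.tbi").touch()
    found = upload_candidates(tmp)

ready = len(found) == 22
print(len(found), "files to upload; complete set:", ready)
```

On a real run, point it at data/topmed_input_padded_autosomes/ and check that the count is 22 before submitting.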

Why TOPMed? The server is free and NIH-hosted, and its r3 reference panel contains more than 130,000 multi-ancestry whole genomes. It statistically infers genotypes at ~50 M positions you didn't directly type, using haplotype patterns from the reference. The panel is more ancestrally diverse than those on the older Michigan / 1000G servers.

If you get a "ChrX nonPAR ambiguous" error: TOPMed rejects a male sample's hemizygous chrX calls when they are merged with the mixed-sex 1000G padding samples. Solution: upload only the autosomes (chr1-chr22). The 00_run_phase1.sh script already isolates these in topmed_input_padded_autosomes/, so use that directory and not the raw topmed_input_padded/.

06 / 10 · Phase 3

Annotation & report generation

Once you have all 22 chr_*.zip files in data/topmed_output/, run:

TOPMED_PASS='your-decryption-password' bash pipeline/00_run_phase3.sh

01 · Decrypt (~10 min)
4-way parallel 7zip AES-256 decrypt of all 22 zips.

02 · Concat + extract (~5 min)
Concatenate per-chrom imputed VCFs, then extract just your sample (drop the 30 1000G padding samples).

03 · R² ≥ 0.8 filter (~3 min)
Drop low-quality imputed positions. Keeps ~9 M high-confidence variants.

04 · PharmCAT (~2 min)
Star-allele haplotype calling for 16+ PGx genes. Generates HTML + JSON reports.

05 · ClinVar / ACMG (~3 min)
Annotates against current ClinVar (4.4 M entries), filters to P/LP variants in ACMG SF v3.2 + carrier-panel genes.

06 · PRS (~15 min)
Downloads 10 PGS Catalog scoring files, computes weighted scores, and reports the top contributing variants per trait.
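The R² filter and the PRS step above come down to two small operations: read the R2 tag from each record's INFO field, then sum effect weight × effect-allele dosage over the variants that remain. The sketch below illustrates both under stated assumptions. The function names are hypothetical, records without an R2 tag are dropped (a choice made here, not confirmed by the repo), and the rsIDs and weights are invented. The real pipeline may use bcftools and plink2 for this.

```python
R2_THRESHOLD = 0.8


def info_r2(info):
    """Extract R2 from a VCF INFO string such as 'AF=0.1;R2=0.93;IMPUTED'."""
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "R2":
            return float(value)
    return None


def passes_r2(info, threshold=R2_THRESHOLD):
    """Keep a record only if it carries an R2 tag at or above the threshold."""
    r2 = info_r2(info)
    return r2 is not None and r2 >= threshold


def polygenic_score(genotypes, weights):
    """Sum weight * effect-allele count over variants present in both inputs.

    genotypes: {rsid: (allele1, allele2)}
    weights:   {rsid: (effect_allele, effect_weight)}
    Returns (score, number_of_variants_used).
    """
    score, used = 0.0, 0
    for rsid, (effect_allele, weight) in weights.items():
        alleles = genotypes.get(rsid)
        if alleles is None:
            continue  # variant absent (e.g. filtered out): contributes nothing
        score += weight * alleles.count(effect_allele)
        used += 1
    return score, used


# Invented example data: three scored variants, one missing from the genome.
weights = {"rs1": ("A", 0.5), "rs2": ("T", -0.25), "rs3": ("G", 0.125)}
genotypes = {"rs1": ("A", "A"), "rs2": ("C", "T")}

score, used = polygenic_score(genotypes, weights)
print(f"score={score} from {used} of {len(weights)} variants")
```

A production scorer also has to match alleles and strands between the scoring file and the VCF, and usually reports how many scoring-file variants were found, since a low match rate makes the score hard to interpret.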

07 / 10 · Dashboard

View the interactive report

python3 -m http.server 8732 --directory output/web   # keeps running until you press Ctrl-C
open http://localhost:8732                           # run in a second terminal; on Linux use xdg-open

Features in the dashboard:

Tier filter chips — show only Tier-A (high-confidence + actionable) findings
Search box — find by rsID, gene, or drug name
Click-to-expand findings — each opens a side panel with rsID, genotype, position, trait allele
Top action cards — the highest-priority findings, click any to jump to its detail
Cross-reference cards — where genotype prediction matches a lab value or clinical event
Lab table — paste your own labs to highlight low/high flags
PCP visit checklist — saves checked state in localStorage
Print-friendly mode — auto-expands all accordions, hides nav

Customizing for yourself: the dashboard's data lives in output/web/js/data.js. The current version is hand-curated for one specific subject. To regenerate from your own pipeline outputs, you'd write a pipeline/13_build_report.py that reads the TSVs in output/raw_findings/ and rewrites data.js. (Marked TODO in the repo — open a PR if you build it.)
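As a starting point for that TODO, the core of such a script is "read TSV rows, serialize them as JSON, wrap them in a JavaScript assignment". The sketch below shows only that mechanism. The variable name DASHBOARD_DATA and the trait/score columns are placeholders. The structure the dashboard's data.js actually expects is not documented here, so the output would need to be reshaped to match it.

```python
import csv
import io
import json


def build_data_js(tsv_text, variable="DASHBOARD_DATA"):
    """Turn TSV text into a JavaScript file that assigns the rows to a constant."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return f"const {variable} = {json.dumps(rows, indent=2)};\n"


# Hypothetical two-column TSV; the real files in output/raw_findings/ may differ.
example_tsv = "trait\tscore\nCAD\t0.75\n"
js = build_data_js(example_tsv)
print(js)
```

In a real script, the text would come from files such as output/raw_findings/prs_scores.tsv and the result would be written to output/web/js/data.js.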

08 / 10 · Privacy

What touches the network

Local-first: all raw genotype processing happens on your machine, and the 23andMe file itself never leaves it.
Phase 2 is the only upload — to NIH-hosted TOPMed Imputation Server. Federal funding, encrypted-at-rest, results encrypted with a password emailed only to you. They don't sell or research your individual data without consent.
Phase 3 lookups (ClinVar, gnomAD via myvariant.info, PGS Catalog) query public databases by rsID — no genotypes leave your machine.
The dashboard is a static site: no backend, no analytics, no telemetry. You can also open output/web/index.html directly.

If even TOPMed feels too exposed: skip Phase 2. You'll lose the ~14× variant expansion (~640K to ~9 M variants) and the polygenic risk scores, but Phase 1 outputs are still useful: run the panel scripts directly on data/raw_grch37.vcf.gz.

If you fork this repo: the .gitignore already excludes data/, output/raw_findings/, output/qc/, output/pharmcat/, output/web/js/data.js, and refs/. Don't disable any of these — they prevent committing your personal genotype, lab values, and clinical findings. Double-check before any push.
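One way to double-check before a push is to confirm that every path listed above still appears in .gitignore. The sketch below does a plain text comparison and does not interpret gitignore glob rules; for that, git check-ignore is the authoritative tool. The function name missing_ignores is hypothetical.

```python
REQUIRED = [
    "data/",
    "output/raw_findings/",
    "output/qc/",
    "output/pharmcat/",
    "output/web/js/data.js",
    "refs/",
]


def missing_ignores(gitignore_text, required=REQUIRED):
    """Return required entries that have no literal match in the .gitignore text."""
    present = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue  # skip blanks, comments, and negation rules
        present.add(line.strip("/"))  # ignore leading/trailing slashes when comparing
    return [entry for entry in required if entry.strip("/") not in present]


# Example .gitignore content with one required entry removed.
example = """# personal data
data/
output/raw_findings/
output/qc/
output/pharmcat/
output/web/js/data.js
"""
missing = missing_ignores(example)
print("missing entries:", missing)
```

For the real check, read the repo's .gitignore (for example with Path(".gitignore").read_text()) and treat any non-empty result as a reason to stop before pushing.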

09 / 10 · Limits

What this pipeline cannot do

Detect rare variants — chip + imputation handles common variants (>1% MAF). For BRCA1/2, Lynch genes, ACMG actionable rare variants → clinical NGS panel ($250 from Color/Invitae).
Call CYP2D6 — copy number / hybrid alleles defeat any chip-based method. Targeted PGx panel ($200 from Mayo PGx, GeneSight) needed.
SMN1 (SMA carrier) — copy-number based, needs MLPA assay.
Triplet-repeat disorders (Fragile X FMR1, Huntington's HTT, myotonic dystrophy DMPK) — all invisible to SNP arrays.
Mitochondrial heteroplasmy % — only homoplasmic calls.

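For most of these, 30× whole-genome sequencing ($200–600 from Nebula, Dante Labs, etc.) is the right next step, although repeat expansions and copy-number calls (SMN1, CYP2D6) still need specialized callers or a dedicated assay even with WGS data. The same pipeline runs on WGS-derived VCFs with massively expanded coverage.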

10 / 10 · Troubleshooting

Known gotchas

Phase 1 fails on bcftools convert --tsv2vcf

The GRCh37 FASTA was not indexed. Index it manually: samtools faidx refs/human_g1k_v37.fasta.

TOPMed rejects "ChrX nonPAR ambiguous"

Use data/topmed_input_padded_autosomes/ (chr1-22 only), not data/topmed_input_padded/ which includes chrX. Mixed-sex padding breaks chrX.

TOPMed rejects "minimum 20 samples"

You forgot the 1000G padding step. Re-run bash pipeline/04b_pad_with_1000g.sh.

PharmCAT crashes on "Java not found"

export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"   # Apple Silicon
export PATH="/usr/local/opt/openjdk/bin:$PATH"      # Intel Mac

PharmCAT pipeline hangs on "Downloading reference FASTA"

Zenodo download is flaky. 00_run_phase3.sh uses the JAR-direct approach, which skips that download — use it.

Disk full during imputation download

22 zips × ~500 MB = ~10 GB. Free ~15 GB before downloading. Or download in batches by editing the curl script TOPMed gives you.