PIPELINE // FROM 23ANDME ZIP TO INTERACTIVE DASHBOARD
RUN IT / YOURSELF
01 / 10 · Overview
What this builds
You drop a 23andMe v5 raw genotype download into a folder, run three commands, and end up with an interactive dashboard at http://localhost:8732 covering:
- Pharmacogenomics — PharmCAT 3.2 haplotype calls for 16+ drug-metabolism genes (CYP2C19, VKORC1, TPMT, DPYD, SLCO1B1, etc.)
- Polygenic risk scores — 10 traits via PGS Catalog (CAD, T2D, AD, breast/prostate cancer, BMI, LDL, MDD, etc.)
- ClinVar / ACMG SF v3.2 — monogenic findings + recessive carrier panel
- Curated nutrition + lifestyle SNP panel — ~60 well-known variants (MTHFR, APOE, ALDH2, FTO, etc.)
- TOPMed-imputed VCF — chip's ~640K variants expanded to ~9 M high-confidence (R²≥0.8)
02 / 10 · Prerequisites
What you need before starting
03 / 10 · Setup
One-time environment setup
Clone the repo, then run the setup script. It installs bcftools, samtools, p7zip, openjdk, plink2, and PharmCAT 3.2, plus the Python deps. The script is idempotent, so it is safe to re-run.
git clone https://github.com/jlgao2/personal-genome-pipeline.git
cd personal-genome-pipeline
bash pipeline/00_setup.sh
openjdk (Java for PharmCAT).
04 / 10 · Phase 1
Convert 23andMe TSV → VCF, prep for imputation
Move your 23andMe download into ./data/. The script auto-detects any .zip file.
SAMPLE_NAME appears in VCF headers and the dashboard. Pick anything.
On the first run the script downloads the 3 GB GRCh37 FASTA. It then converts your genotype file, runs QC, and pads it with public 1000G samples for TOPMed compliance.
cp ~/Downloads/genome_*.zip data/
export SAMPLE_NAME="Your_Name"
bash pipeline/00_run_phase1.sh
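To make the input format concrete, here is a minimal sketch of reading a 23andMe raw-data file: `#` comment lines, then tab-separated rsID, chromosome, position, and genotype, with `--` marking a no-call. This is not the pipeline's converter (that job is done by `bcftools convert --tsv2vcf`). The `Call` type, the `read_23andme` name, and the sample rows are made up for illustration.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    genotype: str


def read_23andme(lines: Iterable[str]) -> Iterator[Call]:
    """Yield called genotypes, skipping comment lines and no-calls."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # header / comment block
        rsid, chrom, pos, genotype = line.split("\t")
        if "-" in genotype:
            continue  # "--" means the chip made no call at this site
        yield Call(rsid, chrom, int(pos), genotype)


# Made-up rows in the 23andMe layout (IDs and genotypes are not real data).
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t82154\tAA\n"
    "rs0000002\t1\t752566\t--\n"
    "rs0000003\t1\t752721\tAG\n"
)
calls = list(read_23andme(sample))
print(calls)
```

Dropping no-calls at read time keeps later steps from having to special-case missing genotypes.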
What you'll have at the end
data/topmed_input_padded_autosomes/ — 22 small .vcf.gz files (~11 MB total) ready to upload to TOPMed Imputation Server.
04b_pad_with_1000g.sh script auto-merges your sample with 30 public 1000 Genomes reference samples — just at your chip's positions, ~30 MB streamed from the EBI FTP. After imputation, we extract only your sample and discard the others.
05 / 10 · Phase 2
TOPMed imputation (manual web step)
This is the only step the pipeline can't automate — you have to register and click through the TOPMed Imputation Server's web UI. ~5 min hands-on, then 3-12 hours wall-clock for the server to process.
- Register at imputation.biodatacatalyst.nhlbi.nih.gov — verify your email.
- Click Run → Genotype Imputation (Minimac4).
- Settings:
- Reference Panel: TOPMed r3
- Array Build: GRCh37/hg19
- Phasing: Eagle v2.4
- Population: vs. TOPMed Panel (mixed)
- Mode: Quality Control & Imputation
- Drag in all 22 chr*.vcf.gz files from data/topmed_input_padded_autosomes/. Do NOT include the .tbi files; TOPMed builds its own indexes server-side.
- Submit. You'll get two emails over the next 3-12 hours: a curl-command download script, and the encryption password.
- Run the curl command from data/topmed_output/ to fetch the ~10 GB of chr_*.zip files.
00_run_phase1.sh script already isolates these in topmed_input_padded_autosomes/ — use that directory, not the raw topmed_input_padded/.
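Before dragging files into the web UI, it can help to confirm that the upload set is exactly the 22 autosome files, with no index files mixed in. The sketch below assumes the files are named `chr1.vcf.gz` through `chr22.vcf.gz`. That naming is a guess based on the `chr*.vcf.gz` pattern above, and `files_to_upload` is a hypothetical helper, not part of the repo.

```python
import tempfile
from pathlib import Path


def files_to_upload(directory):
    """Return chr1..chr22 .vcf.gz paths in order; raise if any is missing."""
    d = Path(directory)
    # The glob matches whole file names, so chrN.vcf.gz.tbi is excluded.
    present = {p.name for p in d.glob("chr*.vcf.gz")}
    expected = [f"chr{n}.vcf.gz" for n in range(1, 23)]  # assumed naming
    missing = [name for name in expected if name not in present]
    if missing:
        raise FileNotFoundError("missing: " + ", ".join(missing))
    return [d / name for name in expected]


# Demo on a throwaway directory holding empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for n in range(1, 23):
        (Path(tmp) / f"chr{n}.vcf.gz").touch()
        (Path(tmp) / f"chr{n}.vcf.gz.tbi").touch()
    names = [p.name for p in files_to_upload(tmp)]

print(len(names), names[0], names[-1])
```

On a real run you would pass `data/topmed_input_padded_autosomes/` instead of the temporary directory.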
06 / 10 · Phase 3
Annotation & report generation
Once you have all 22 chr_*.zip files in data/topmed_output/, run:
TOPMED_PASS='your-decryption-password' bash pipeline/00_run_phase3.sh
4-way parallel 7zip AES-256 decrypt of all 22 zips.
Concatenate per-chrom imputed VCFs, then extract just your sample (drop the 30 1000G padding samples).
Drop low-quality imputed positions. Keeps ~9 M high-confidence variants.
Star-allele haplotype calling for ~16 PGx genes. Generates HTML + JSON reports.
Annotates against current ClinVar (4.4 M entries), filters to P/LP variants in ACMG SF v3.2 + carrier-panel genes.
Downloads 10 PGS Catalog scoring files, computes weighted scores, top contributing variants per trait.
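Two of the steps above reduce to a few lines of logic, sketched here on toy data: the imputation-quality cut (R² ≥ 0.8) and the weighted polygenic score. The sketch assumes the imputed VCF carries an `R2=` tag in its INFO column, as Minimac4 output typically does. It also assumes a score is the sum of effect weight times effect-allele dosage. The function names, variant IDs, and numbers are hypothetical, and the repo's scripts may do this differently (for example with bcftools or plink2).

```python
def info_r2(vcf_line):
    """Return the R2 value from a VCF data line's INFO column, or None."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    for field in info.split(";"):
        if field.startswith("R2="):
            return float(field[len("R2="):])
    return None


def passes_quality(vcf_line, threshold=0.8):
    """Keep a variant only if it has an R2 at or above the threshold."""
    r2 = info_r2(vcf_line)
    return r2 is not None and r2 >= threshold


def polygenic_score(weights, dosages):
    """Sum weight * dosage over variants present in both inputs.

    weights: {variant_id: effect_weight}
    dosages: {variant_id: effect-allele dosage, 0..2}
    Returns (total, contributions sorted by absolute size).
    """
    contribs = [(v, w * dosages[v]) for v, w in weights.items() if v in dosages]
    total = sum(c for _, c in contribs)
    top = sorted(contribs, key=lambda vc: abs(vc[1]), reverse=True)
    return total, top


# Toy VCF lines (made up): one well-imputed site, one poorly imputed site.
good = "chr1\t1000\t.\tA\tG\t.\tPASS\tAF=0.12;MAF=0.12;R2=0.93;ER2=0.5;IMPUTED\tGT\t0|1"
bad = "chr1\t2000\t.\tC\tT\t.\tPASS\tAF=0.02;MAF=0.02;R2=0.41;IMPUTED\tGT\t0|0"
kept = [line for line in (good, bad) if passes_quality(line)]

# Toy weights and dosages (made up); v3 has no genotype and is skipped.
total, top = polygenic_score(
    {"v1": 0.5, "v2": -0.2, "v3": 0.1},
    {"v1": 2, "v2": 1},
)
print(len(kept), total, top)
```

Sorting contributions by absolute value is one simple way to get the "top contributing variants per trait" list. A raw total like this only becomes interpretable once it is compared against a reference distribution.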
07 / 10 · Dashboard
View the interactive report
python3 -m http.server 8732 --directory output/web
open http://localhost:8732
Features in the dashboard:
- Tier filter chips — show only Tier-A (high-confidence + actionable) findings
- Search box — find by rsID, gene, or drug name
- Click-to-expand findings — each opens a side panel with rsID, genotype, position, trait allele
- Top action cards — the highest-priority findings, click any to jump to its detail
- Cross-reference cards — where genotype prediction matches a lab value or clinical event
- Lab table — paste your own labs to highlight low/high flags
- PCP visit checklist — saves checked state in localStorage
- Print-friendly mode — auto-expands all accordions, hides nav
output/web/js/data.js. The current version is hand-curated for one specific subject. To regenerate from your own pipeline outputs, you'd write a pipeline/13_build_report.py that reads the TSVs in output/raw_findings/ and rewrites data.js. (Marked TODO in the repo — open a PR if you build it.)
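As a starting point for that TODO, here is a rough sketch of what such a build script could look like: read every TSV in a findings directory and emit one JavaScript file that assigns the rows to a global. The schema the dashboard actually expects in data.js is not described here, so the `REPORT_DATA` variable name, the one-key-per-TSV layout, and the `build_data_js` function are all assumptions to adapt.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_data_js(findings_dir, out_path, var_name="REPORT_DATA"):
    """Collect every *.tsv in findings_dir into one JS assignment."""
    tables = {}
    for tsv in sorted(Path(findings_dir).glob("*.tsv")):
        with tsv.open(newline="", encoding="utf-8") as fh:
            # One list of row dicts per file, keyed by the file stem.
            tables[tsv.stem] = list(csv.DictReader(fh, delimiter="\t"))
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(tables, indent=2)
    out.write_text(f"const {var_name} = {payload};\n", encoding="utf-8")
    return tables


# Demo in a throwaway directory with a made-up findings file.
with tempfile.TemporaryDirectory() as tmp:
    findings = Path(tmp) / "raw_findings"
    findings.mkdir()
    (findings / "pgx.tsv").write_text(
        "gene\tdiplotype\nGENE_A\t*1/*2\n", encoding="utf-8"
    )
    js_path = Path(tmp) / "web" / "js" / "data.js"
    tables = build_data_js(findings, js_path)
    js_text = js_path.read_text(encoding="utf-8")

print(js_text)
```

A real version would map each TSV's columns onto whatever fields the dashboard's cards and tables read, which means inspecting the existing hand-curated data.js first.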
08 / 10 · Privacy
What touches the network
- Local-first: all raw genotype processing is on your machine. Your 23andMe TSV never leaves it after Phase 1.
- Phase 2 is the only upload — to NIH-hosted TOPMed Imputation Server. Federal funding, encrypted-at-rest, results encrypted with a password emailed only to you. They don't sell or research your individual data without consent.
- Phase 3 lookups (ClinVar, gnomAD via myvariant.info, PGS Catalog) query public databases by rsID; no genotypes leave your machine.
- The dashboard is a static site: no server, no analytics, no telemetry. Open output/web/index.html directly.
data/raw_grch37.vcf.gz.
.gitignore already excludes data/, output/raw_findings/, output/qc/, output/pharmcat/, output/web/js/data.js, and refs/. Don't disable any of these — they prevent committing your personal genotype, lab values, and clinical findings. Double-check before any push.
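One way to do that double-check mechanically is to confirm that each protected path still appears in .gitignore before pushing. The sketch below does a plain exact-line comparison against the six entries listed above. It does not evaluate gitignore pattern semantics, so an equivalent pattern written differently (for example `/data/`) would be reported as missing. `missing_ignores` is a hypothetical helper, not part of the repo.

```python
REQUIRED = [
    "data/",
    "output/raw_findings/",
    "output/qc/",
    "output/pharmcat/",
    "output/web/js/data.js",
    "refs/",
]


def missing_ignores(gitignore_text, required=REQUIRED):
    """Return required entries that do not appear as a line in .gitignore."""
    present = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            present.add(line)
    return [entry for entry in required if entry not in present]


# Demo: a .gitignore where someone commented out one of the entries.
example = """\
# personal data
data/
output/raw_findings/
output/qc/
# output/pharmcat/
output/web/js/data.js
refs/
"""
problems = missing_ignores(example)
print(problems)
```

In practice you would read the real file with `Path(".gitignore").read_text()` and refuse to push if the returned list is not empty.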
09 / 10 · Limits
What this pipeline cannot do
- Detect rare variants — chip + imputation handles common variants (>1% MAF). For BRCA1/2, Lynch genes, ACMG actionable rare variants → clinical NGS panel ($250 from Color/Invitae).
- Call CYP2D6 — copy number / hybrid alleles defeat any chip-based method. Targeted PGx panel ($200 from Mayo PGx, GeneSight) needed.
- SMN1 (SMA carrier) — copy-number based, needs MLPA assay.
- Triplet-repeat disorders (Fragile X FMR1, Huntington's HTT, myotonic dystrophy DMPK) — all invisible to SNP arrays.
- Mitochondrial heteroplasmy % — only homoplasmic calls.
For all of these, 30× whole-genome sequencing ($200–600 from Nebula, Dante Labs, etc.) is the right next step. The same pipeline runs on WGS-derived VCFs with massively expanded coverage.
10 / 10 · Troubleshooting
Known gotchas
Phase 1 fails on bcftools convert --tsv2vcf
The GRCh37 FASTA was not indexed. Index it manually: samtools faidx refs/human_g1k_v37.fasta.
TOPMed rejects "ChrX nonPAR ambiguous"
Use data/topmed_input_padded_autosomes/ (chr1-22 only), not data/topmed_input_padded/ which includes chrX. Mixed-sex padding breaks chrX.
TOPMed rejects "minimum 20 samples"
You forgot the 1000G padding step. Re-run bash pipeline/04b_pad_with_1000g.sh.
PharmCAT crashes on "Java not found"
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH" # Apple Silicon
export PATH="/usr/local/opt/openjdk/bin:$PATH" # Intel Mac
PharmCAT pipeline hangs on "Downloading reference FASTA"
Zenodo download is flaky. 00_run_phase3.sh uses the JAR-direct approach, which skips that download — use it.
Disk full during imputation download
22 zips × ~500 MB = ~10 GB. Free ~15 GB before downloading. Or download in batches by editing the curl script TOPMed gives you.
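A quick free-space check before starting the download can save a failed transfer. This is a minimal sketch using the standard library. The 15 GB default mirrors the figure above, and `enough_space` is a hypothetical helper name.

```python
import shutil


def enough_space(path=".", needed_gb=15):
    """Return (ok, free_gb) for the filesystem that holds path."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes to decimal GB
    return free_gb >= needed_gb, free_gb


ok, free_gb = enough_space(".")
print(f"free: {free_gb:.1f} GB, enough for the TOPMed download: {ok}")
```

Point `path` at data/topmed_output/ (or wherever the zips will land) so the check covers the right volume.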