PIPELINE // FROM 23ANDME ZIP TO INTERACTIVE DASHBOARD

RUN IT / YOURSELF

Hands-on time: ~5 hrs (across 3 phases)
Wall-clock: ~24 hrs (includes TOPMed queue)
Cost: $0 (all reference data is free)
Disk needed: ~10 GB (mostly for imputation results)
Skill: Terminal (comfortable in bash)
Privacy: Local (only TOPMed sees data)

01 / 10 · Overview

What this builds

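You drop a 23andMe v5 raw genotype download into a folder, run three commands, and end up with an interactive dashboard at http://localhost:8732 covering pharmacogenomics (PharmCAT), ClinVar/ACMG findings, rsID panels, and polygenic risk scores. The data flows through the pipeline as follows: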

data/genome.zip                          23andMe download (you provide)
    │
    ▼  pipeline/01_23andme_to_tsv.py
data/raw_calls.tsv                       (drops --, II, DD; expand hemizygous)
    │
    ▼  pipeline/02_tsv_to_vcf.sh         (uses GRCh37 FASTA)
data/raw_grch37.vcf.gz
    │
    ▼  pipeline/04_split_for_topmed.sh
data/topmed_input/chr*.vcf.gz
    │
    ▼  pipeline/04b_pad_with_1000g.sh    (merge 30 public 1000G samples)
data/topmed_input_padded_autosomes/
    │   → MANUAL: upload to TOPMed Imputation Server
    ▼   ← MANUAL: download encrypted .zip results
data/topmed_output/chr_*.zip
    │
    ▼  pipeline/run_post_imputation_pipeline.sh
data/imputed_grch38_r2_0.8.vcf.gz        (~9M variants, R²≥0.8)
    │
    ├──▶ PharmCAT           → output/pharmcat/
    ├──▶ ClinVar/ACMG       → output/raw_findings/imputed_grch38/
    ├──▶ rsID panels        → output/raw_findings/imputed_panels.tsv
    └──▶ PGS Catalog PRS    → output/raw_findings/prs_scores.tsv
            │
            ▼
output/web/                              (interactive dashboard)

02 / 10 · Prerequisites

What you need before starting

Operating system: macOS / Linux (arm64 or x86_64)
Free disk: ~10 GB (3 GB ref + 7 GB imputation)
RAM: ~16 GB (PRS computation peaks at 4 GB)
Homebrew: Required (brew.sh)
23andMe data: v5 chip .zip (Browse Raw Data → Download)
TOPMed account: Free (5 min to register)

03 / 10 · Setup

One-time environment setup

Clone the repo, then run the setup script. It installs bcftools, samtools, p7zip, openjdk, plink2, and PharmCAT 3.2, plus Python deps. The script is idempotent, so it is safe to re-run.

git clone https://github.com/jlgao2/personal-genome-pipeline.git
cd personal-genome-pipeline
bash pipeline/00_setup.sh
Time: ~5 min. Most of that is the Homebrew install of openjdk (Java for PharmCAT).

04 / 10 · Phase 1

Convert 23andMe TSV → VCF, prep for imputation

01 · Drop in zip (~30 sec)

Move your 23andMe download into ./data/. The script auto-detects any .zip file.

02 · Set name (~5 sec)

SAMPLE_NAME appears in VCF headers and the dashboard. Pick anything.

03 · Run phase 1 (~30 min)

On the first run, downloads the 3 GB GRCh37 FASTA, then converts, QCs, and pads your data with public 1000G samples for TOPMed compliance.

cp ~/Downloads/genome_*.zip data/
export SAMPLE_NAME="Your_Name"
bash pipeline/00_run_phase1.sh
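
To make the first conversion step concrete, here is a minimal Python sketch of the cleaning rule described in the overview (drop --, II, DD; expand hemizygous calls). It is not the repo's 01_23andme_to_tsv.py. The function names are hypothetical, and two behaviours are assumptions: any genotype containing a non-ACGT character is dropped (which also removes DI calls), and a one-letter hemizygous call is expanded by doubling it.

```python
import csv

VALID_BASES = set("ACGT")

def clean_call(genotype):
    """Return a two-letter genotype, or None if the call should be dropped."""
    genotype = genotype.strip().upper()
    # No-calls (--) and indel codes (II, DD) contain non-ACGT characters.
    if not genotype or not set(genotype) <= VALID_BASES:
        return None
    # Assumption: a hemizygous call (one letter) is expanded by doubling it.
    if len(genotype) == 1:
        return genotype * 2
    return genotype if len(genotype) == 2 else None

def convert_rows(lines):
    """Yield (rsid, chrom, pos, genotype) for every usable 23andMe row."""
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for row in csv.reader(data, delimiter="\t"):
        if len(row) < 4:
            continue  # skip malformed rows
        rsid, chrom, pos, genotype = row[:4]
        cleaned = clean_call(genotype)
        if cleaned is not None:
            yield rsid, chrom, int(pos), cleaned
```

Passing an open file handle for the unzipped raw-data text file to convert_rows streams the rows without loading the whole file into memory.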

What you'll have at the end

data/topmed_input_padded_autosomes/ — 22 small .vcf.gz files (~11 MB total) ready to upload to TOPMed Imputation Server.

Why "padded"? TOPMed requires ≥20 samples per submission (for privacy and QC reasons). The 04b_pad_with_1000g.sh script automatically merges your sample with 30 public 1000 Genomes reference samples, restricted to your chip's positions (~30 MB streamed from the EBI FTP). After imputation, the pipeline extracts only your sample and discards the others.
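
A quick way to confirm the padding worked before uploading is to count the sample columns in one of the padded files. The sketch below is not part of the repo and its function names are hypothetical. It relies only on the VCF convention that the #CHROM header line has nine fixed columns followed by one column per sample.

```python
import gzip

FIXED_VCF_COLUMNS = 9  # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

def count_samples(lines):
    """Return the number of sample columns named on the #CHROM header line."""
    for line in lines:
        if line.startswith("#CHROM"):
            return len(line.rstrip("\n").split("\t")) - FIXED_VCF_COLUMNS
        if not line.startswith("#"):
            break  # reached data rows without seeing a header
    raise ValueError("no #CHROM header line found")

def count_samples_in_file(path):
    """Same check for a bgzip- or gzip-compressed VCF on disk."""
    with gzip.open(path, "rt") as handle:
        return count_samples(handle)
```

With 30 padding samples plus your own, a padded file should report 31, comfortably above the 20-sample minimum.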

05 / 10 · Phase 2

TOPMed imputation (manual web step)

This is the only step the pipeline can't automate — you have to register and click through the TOPMed Imputation Server's web UI. ~5 min hands-on, then 3-12 hours wall-clock for the server to process.

  1. Register at imputation.biodatacatalyst.nhlbi.nih.gov — verify your email.
  2. Click Run → Genotype Imputation (Minimac4).
  3. Settings:
    • Reference Panel: TOPMed r3
    • Array Build: GRCh37/hg19
    • Phasing: Eagle v2.4
    • Population: vs. TOPMed Panel (mixed)
    • Mode: Quality Control & Imputation
  4. Drag in all 22 chr*.vcf.gz files from data/topmed_input_padded_autosomes/. Do NOT include the .tbi files — TOPMed builds its own indexes server-side.
  5. Submit. You'll get two emails over the next 3-12 hours: a curl-command download script, and the encryption password.
  6. Run the curl command from data/topmed_output/ to fetch the ~10 GB of chr_*.zip files.
Why TOPMed? It's a free, NIH-hosted imputation service whose r3 reference panel contains more than 130,000 multi-ancestry whole genomes. The server statistically infers genotypes at ~50 M positions you didn't directly type, using haplotype patterns from the reference. Its panel offers broader ancestry diversity than the older Michigan / 1000G panels.
If you get a "ChrX nonPAR ambiguous" error: TOPMed rejects submissions in which a male's hemizygous chrX calls are combined with the mixed-sex 1000G padding samples. Solution: upload only the autosomes (chr1-chr22). The 00_run_phase1.sh script already isolates these in data/topmed_input_padded_autosomes/; use that directory, not the raw data/topmed_input_padded/.
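
The upload rules above (all 22 autosomes, no .tbi indexes, no chrX) can be checked before you open the browser. This is a minimal sketch, not a repo script. It assumes the files are named chr<N>.vcf.gz, which the guide implies but does not state exactly, and select_upload_files is a hypothetical helper.

```python
import re

# Assumption: per-chromosome files are named like "chr7.vcf.gz".
CHR_PATTERN = re.compile(r"chr(\d+)\.vcf\.gz$")

def select_upload_files(names):
    """Return (files to upload in chromosome order, missing chromosome numbers)."""
    found = {}
    for name in names:
        match = CHR_PATTERN.search(name)
        # .tbi indexes and chrX fail the pattern, so they are skipped.
        if match and 1 <= int(match.group(1)) <= 22:
            found[int(match.group(1))] = name
    missing = [c for c in range(1, 23) if c not in found]
    return [found[c] for c in sorted(found)], missing
```

Feed it the names from os.listdir("data/topmed_input_padded_autosomes") and upload only when the missing list comes back empty.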

06 / 10 · Phase 3

Annotation & report generation

Once you have all 22 chr_*.zip files in data/topmed_output/, run:

TOPMED_PASS='your-decryption-password' bash pipeline/00_run_phase3.sh
01 · Decrypt (~10 min)

4-way parallel 7zip AES-256 decrypt of all 22 zips.

02 · Concat + extract (~5 min)

Concatenate per-chrom imputed VCFs, then extract just your sample (drop the 30 1000G padding samples).

03 · R² ≥ 0.8 filter (~3 min)

Drop low-quality imputed positions. Keeps ~9 M high-confidence variants.
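
As an illustration of what the R² filter does, the sketch below reads the R2 value from each record's INFO column and keeps records at or above the threshold. It assumes the imputed VCF stores imputation quality under the INFO key R2, as Minimac4 output does. Keeping records that carry no R2 value is a policy choice made here, not something the guide specifies, and the function names are hypothetical.

```python
def info_value(info, key):
    """Return the value of one key in a VCF INFO string, or None if absent."""
    for field in info.split(";"):
        name, _, value = field.partition("=")
        if name == key:
            return value
    return None

def passes_r2(vcf_line, threshold=0.8):
    """True if the record's imputation R2 meets the threshold."""
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th column
    r2 = info_value(info, "R2")
    # Assumption: records without an R2 annotation are kept.
    return r2 is None or float(r2) >= threshold

def filter_r2(lines, threshold=0.8):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("#") or passes_r2(line, threshold):
            yield line
```

On a genome-scale file a bcftools include expression such as INFO/R2>=0.8 would be much faster; the Python version is only meant to show the rule.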

04 · PharmCAT (~2 min)

Star-allele haplotype calling for ~16 PGx genes. Generates HTML + JSON reports.

05 · ClinVar / ACMG (~3 min)

Annotates against current ClinVar (4.4 M entries), filters to P/LP variants in ACMG SF v3.2 + carrier-panel genes.
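
The P/LP-in-panel filter can be sketched as a test on two ClinVar INFO fields, CLNSIG (clinical significance) and GENEINFO (gene symbol and ID pairs). This is an illustration under assumptions, not the repo's implementation. The is_reportable helper is hypothetical, and the gene panel passed to it is whatever list you supply.

```python
# Exact ClinVar significance terms treated as pathogenic / likely pathogenic.
PATHOGENIC = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}

def parse_info(info):
    """Split a VCF INFO string into a dict of key -> value."""
    fields = {}
    for field in info.split(";"):
        key, _, value = field.partition("=")
        fields[key] = value
    return fields

def is_reportable(info, panel_genes):
    """True if the record is P/LP and sits in one of the panel genes."""
    fields = parse_info(info)
    # CLNSIG can hold several terms; exact matching avoids counting
    # "Conflicting_..." classifications as pathogenic.
    terms = set(fields.get("CLNSIG", "").replace(",", "|").split("|"))
    # GENEINFO looks like "BRCA1:672|OTHER:123"; keep the symbols only.
    genes = {g.split(":")[0] for g in fields.get("GENEINFO", "").split("|") if g}
    return bool(terms & PATHOGENIC) and bool(genes & set(panel_genes))
```

Matching whole terms rather than substrings is the main design choice: a substring test for "athogenic" would wrongly accept conflicting classifications.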

06 · PRS (~15 min)

Downloads 10 PGS Catalog scoring files, computes weighted scores, and reports the top contributing variants per trait.
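
A weighted polygenic score is the sum, over variants, of the effect weight times the number of effect alleles you carry. The sketch below shows that arithmetic and the ranking of top contributors. It is not the pipeline's own code. The function names and the dict-based inputs are hypothetical, and real use would first need to match variants and effect alleles between the scoring file and your VCF.

```python
def polygenic_score(dosages, weights):
    """Sum effect_weight * effect-allele dosage over shared variants.

    dosages: variant ID -> copies of the effect allele (0, 1, or 2)
    weights: variant ID -> effect weight from a scoring file
    """
    shared = dosages.keys() & weights.keys()
    contributions = {v: weights[v] * dosages[v] for v in shared}
    return sum(contributions.values()), contributions

def top_contributors(contributions, n=3):
    """Return the n variants with the largest absolute contribution."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:n]
```

Variants present in only one of the two inputs are silently skipped here; a fuller implementation would report how many weights went unmatched, since low overlap makes a score hard to interpret.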

07 / 10 · Dashboard

View the interactive report

python3 -m http.server 8732 --directory output/web
open http://localhost:8732

Features in the dashboard:

Customizing for yourself: the dashboard's data lives in output/web/js/data.js. The current version is hand-curated for one specific subject. To regenerate from your own pipeline outputs, you'd write a pipeline/13_build_report.py that reads the TSVs in output/raw_findings/ and rewrites data.js. (Marked TODO in the repo — open a PR if you build it.)
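
As a starting point for that TODO, here is a rough sketch of the core of such a script: read a TSV into records and serialize them into a JavaScript assignment. Everything about the output shape is an assumption. The real data.js schema is not described here, so the REPORT_DATA variable name and the section keys are placeholders you would need to align with what the dashboard's JavaScript expects.

```python
import csv
import io
import json

def tsv_to_records(text):
    """Parse TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def render_data_js(sections, variable="REPORT_DATA"):
    """Serialize a dict of sections as a JavaScript global assignment.

    The variable name is a placeholder, not the dashboard's real one.
    """
    payload = json.dumps(sections, indent=2, sort_keys=True)
    return f"window.{variable} = {payload};\n"
```

A full script would call tsv_to_records on each file under output/raw_findings/, collect the results into one dict, and write the render_data_js output to output/web/js/data.js.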

08 / 10 · Privacy

What touches the network

If even TOPMed feels too exposed: skip Phase 2. You'll lose the ~50× variant expansion and the polygenic risk scores, but Phase 1 outputs are still useful — run the panel scripts directly on data/raw_grch37.vcf.gz.
If you fork this repo: the .gitignore already excludes data/, output/raw_findings/, output/qc/, output/pharmcat/, output/web/js/data.js, and refs/. Don't disable any of these — they prevent committing your personal genotype, lab values, and clinical findings. Double-check before any push.
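
One way to double-check before a push is to confirm that every protected path is still listed in .gitignore. The sketch below does a literal text comparison only and does not reproduce git's full pattern matching, so treat it as a first-pass check. The missing_ignores helper is hypothetical, and git check-ignore remains the authoritative test.

```python
REQUIRED_IGNORES = [
    "data/",
    "output/raw_findings/",
    "output/qc/",
    "output/pharmcat/",
    "output/web/js/data.js",
    "refs/",
]

def missing_ignores(gitignore_text, required=REQUIRED_IGNORES):
    """Return the required entries that no active .gitignore line lists."""
    active = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Normalize "/data/" and "data/" to the same key.
            active.add(line.strip("/"))
    return [entry for entry in required if entry.strip("/") not in active]
```

An empty list means all six entries are present; anything else names the paths that are no longer protected.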

09 / 10 · Limits

What this pipeline cannot do

For all of these, 30× whole-genome sequencing ($200–600 from Nebula, Dante Labs, etc.) is the right next step. The same pipeline runs on WGS-derived VCFs with massively expanded coverage.

10 / 10 · Troubleshooting

Known gotchas

Phase 1 fails on bcftools convert --tsv2vcf

The GRCh37 FASTA was not indexed. Index it manually: samtools faidx refs/human_g1k_v37.fasta.

TOPMed rejects "ChrX nonPAR ambiguous"

Use data/topmed_input_padded_autosomes/ (chr1-22 only), not data/topmed_input_padded/ which includes chrX. Mixed-sex padding breaks chrX.

TOPMed rejects "minimum 20 samples"

You forgot the 1000G padding step. Re-run bash pipeline/04b_pad_with_1000g.sh.

PharmCAT crashes on "Java not found"

export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"   # Apple Silicon
export PATH="/usr/local/opt/openjdk/bin:$PATH"      # Intel Mac

PharmCAT pipeline hangs on "Downloading reference FASTA"

Zenodo download is flaky. 00_run_phase3.sh uses the JAR-direct approach, which skips that download — use it.

Disk full during imputation download

22 zips × ~500 MB ≈ 11 GB. Free ~15 GB before downloading, or download in batches by editing the curl script TOPMed gives you.
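
Free space can also be checked programmatically before starting the download. This is a small sketch using the standard library's shutil.disk_usage. The 15 GB default mirrors the figure above, and the function names and default path are illustrative.

```python
import shutil

GIB = 1024 ** 3

def enough_space(free_bytes, needed_gb=15):
    """True if the free byte count covers the requested number of GiB."""
    return free_bytes >= needed_gb * GIB

def check_download_dir(path="data/topmed_output", needed_gb=15):
    """Return (ok, free GiB) for the filesystem that holds path."""
    free = shutil.disk_usage(path).free
    return enough_space(free, needed_gb), free / GIB
```

Run check_download_dir() from the repo root; if it returns False, free space or split the TOPMed curl script into batches.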