Workflow Reference

Scallops provides two primary end-to-end pipelines written in WDL 1.0 (Workflow Description Language). These workflows are designed for scalability and reproducibility across various environments, including local machines, cloud infrastructure, and high-performance computing (HPC) clusters.


Stitching Workflow

File: stitch_workflow.wdl

This workflow performs illumination correction (flatfield estimation) followed by image stitching. It takes raw microscopy images (e.g., .nd2, .tiff) and converts them into OME-Zarr format.

Workflow Steps

  1. Grouping: The workflow scans the input directories (urls) for files matching image_pattern and groups the images by the groupby keys (default: plate, well, t).

  2. Illumination Correction (optional): For each group, the workflow calculates a flatfield image as a mean or median projection. This step is parallelized across groups.

  3. Stitching:

    • Applies the calculated flatfield to the raw tiles.

    • Corrects for radial distortion.

    • Aligns tiles using stage positions and cross-correlation.

    • Stitches tiles into an OME-Zarr image.
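The grouping step can be pictured with a short Python sketch. It is not the Scallops implementation: the helper names pattern_to_regex and group_files, the sample filenames, and the rule that each {name} field matches any run of non-slash characters are assumptions made for illustration. The {skip} token is treated as a field that is matched but discarded, which is how the example image_pattern values appear to use it.

```python
import re
from collections import defaultdict


def pattern_to_regex(image_pattern):
    """Convert a brace-style pattern into a compiled regex (illustrative only)."""
    # re.split with a capturing group yields: literal, field, literal, field, ...
    parts = re.split(r"\{(\w+)\}", image_pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))  # literal text between fields
        elif part == "skip":
            pieces.append(r"[^/]+?")  # matched but not captured
        else:
            pieces.append(rf"(?P<{part}>[^/]+?)")  # captured metadata field
    return re.compile("^" + "".join(pieces) + "$")


def group_files(filenames, image_pattern, groupby):
    """Group filenames by the values of the groupby fields parsed from each name."""
    regex = pattern_to_regex(image_pattern)
    groups = defaultdict(list)
    for name in filenames:
        match = regex.match(name)
        if match is None:
            continue  # files that do not match the pattern are ignored
        fields = match.groupdict()
        groups[tuple(fields[key] for key in groupby)].append(name)
    return dict(groups)


# Hypothetical filenames, for illustration only.
files = [
    "WellA1_Point0000.nd2",
    "WellA1_Point0001.nd2",
    "WellB2_Point0000.nd2",
    "notes.txt",
]
groups = group_files(files, "Well{well}_Point{point}.nd2", ["well"])
print(groups)
```

In this sketch each key of groups is one unit of work, which mirrors how the flatfield and stitching steps are described as running per group.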

Inputs

Minimal Configuration (Required)

These are the absolute minimum parameters required to run the stitching workflow.

| Parameter | Type | Description |
| --- | --- | --- |
| urls | Array[String] | List of directories containing raw images (e.g., S3 URLs). |
| image_pattern | String | Regex-like pattern to parse filenames (e.g., "Well{well}_Point{point}.nd2"). |
| output_directory | String | Base path for outputs. |
| docker | String | Workflow Docker image. |

Minimal Stitching JSON
{
   "urls": ["s3://your-bucket/experiment_data/"],
   "image_pattern": "20231010_10x_6W_SBS_c{t}/plate{plate}/Well{well}_Point{skip}_{skip}_Channel{skip}_Seq{skip}.nd2",
   "output_directory": "s3://your-bucket/experiment_data/stitch/iss/",
   "docker":"772311241819.dkr.ecr.us-west-2.amazonaws.com/scallops:1.0.0"
}

Full Parameter Reference (Advanced)

Below is the complete list of exposed options, including optional settings for grouping, distortion correction, and resource allocation.

| Parameter | Type | Description |
| --- | --- | --- |
| groupby | Array[String] | Metadata keys to group tiles by. Default: ["plate", "well", "t"]. |
| subset | Array[String] | Filter to process only specific groups (e.g., ["A-1", "B-2"]). |
| z_index | String | Specific Z-plane to stitch, or "focus" for auto-focus. |
| stitch_channel | Int | The reference channel index used for calculating stitching offsets. Default: 0. |
| stitch_radial_correction_k | String | Coefficient for barrel distortion correction. |
| stitch_max_shift | Float | Maximum allowed shift between tiles. |
| stitch_blend | String | Blending method for overlapping regions. |
| stitch_crop | Int | Pixels to crop from edges before stitching. |
| stitch_min_overlap_fraction | Float | Minimum overlap required between tiles. |
| run_illumination_correction | Boolean | Default true. Set to false if images are pre-corrected. |
| illumination_agg_method | String | Method for flatfield calculation. Default: "mean". |
| expected_images | Int | Expected number of images per group (useful for QC). |
| rename | String | Path to a 2-column CSV mapping image IDs to new IDs. |
| force_stitch | Boolean | Force re-run of stitching even if output exists. |
| Resources | Various | stitch_cpu, stitch_memory, etc. can be set to override defaults. |
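As one way to assemble an input file, the sketch below layers a few of the optional parameters listed above on top of the minimal configuration and serializes the result. The override values (stitch_channel 1, expected_images 100) and the variable names are hypothetical choices for illustration, not recommended settings.

```python
import json

# The four required inputs from the minimal configuration.
minimal = {
    "urls": ["s3://your-bucket/experiment_data/"],
    "image_pattern": "Well{well}_Point{point}.nd2",
    "output_directory": "s3://your-bucket/experiment_data/stitch/iss/",
    "docker": "772311241819.dkr.ecr.us-west-2.amazonaws.com/scallops:1.0.0",
}

# Optional settings; the values here are placeholders, not recommendations.
overrides = {
    "groupby": ["plate", "well", "t"],
    "stitch_channel": 1,
    "run_illumination_correction": False,  # images assumed to be pre-corrected
    "expected_images": 100,
    "force_stitch": True,
}

# Later keys win, so overrides replace any minimal value with the same name.
stitch_inputs = {**minimal, **overrides}
stitch_json = json.dumps(stitch_inputs, indent=2)
print(stitch_json)
```

Python booleans serialize to JSON true and false, which matches the Boolean type of run_illumination_correction and force_stitch.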

Outputs

The workflow generates the following directory structure in output_directory:

  • illumination_correction/: Contains calculated flatfield (and optionally darkfield) images in TIFF format.

  • stitch/: Contains the stitched images in OME-Zarr format.


OPS Workflow

File: ops_workflow.wdl

The Optical Pooled Screens (OPS) workflow is a comprehensive pipeline that integrates Phenotypic imaging (IF) with In-Situ Sequencing (ISS).

Workflow Steps

Phase 1: Phenotype Pre-processing
  1. Registration: Aligns multiple phenotypic rounds (if applicable) to a reference timepoint (e.g., “IF”).

  2. Segmentation: Segments Nuclei and Cells using the registered images.

  3. Object Discovery: Creates labeled object maps for Nuclei, Cells, and Cytosol.

Phase 2: ISS Pre-processing
  1. Registration: Aligns the ISS anchor round (t0) to the rest of the cycles to prepare the coordinate space.

Phase 3: Integration & Analysis
  1. Cross-Modality Registration: Aligns the Phenotype images to the ISS coordinate space.

  2. Feature Extraction: Calculates morphological and intensity features for Nuclei, Cells, and Cytosol.

  3. Sequencing Analysis: Detects spots in ISS channels and decodes the sequence (read calling).

  4. Merge: Combines phenotypic features with decoded barcodes into a single dataset.

Inputs

Minimal Configuration (Required)

These are the absolute minimum parameters required to run the OPS workflow.

| Parameter | Type | Description |
| --- | --- | --- |
| output_directory | String | Base path for outputs. |
| iss_url | String | Path to stitched ISS Zarr (Required if running ISS analysis). |
| phenotype_url | String | Path to stitched Phenotype Zarr (Required if running Phenotype analysis). |
| phenotype_dapi_channel | Int | Channel index for DAPI in phenotype images. |
| phenotype_cyto_channel | Array[Int] | Channel indices for Cytoplasm segmentation. |
| reads_labels | String | Which segmentation label to assign reads to (e.g., "cell" or "nuclei"). |
| docker | String | Workflow docker image. |

Minimal OPS JSON
{
   "output_directory": "s3://your-bucket/experiment/ops_results/",
   "iss_url": "s3://your-bucket/experiment/stitch/iss/stitch/stitch.zarr/",
   "phenotype_url": "s3://your-bucket/experiment/stitch/pheno/stitch/stitch.zarr/",
   "phenotype_dapi_channel": 4,
   "phenotype_cyto_channel": [6],
   "reads_labels": "cell",
   "docker":"772311241819.dkr.ecr.us-west-2.amazonaws.com/scallops:1.0.0"
}
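Before submitting, it can help to check the input file against the types in the minimal OPS configuration. The following is a rough sketch under stated assumptions: check_ops_inputs and OPS_REQUIRED are hypothetical names, and the sketch treats all seven parameters as mandatory even though iss_url and phenotype_url are only required when the corresponding analysis runs.

```python
def is_int(value):
    """True for integers, excluding booleans (bool is a subclass of int)."""
    return isinstance(value, int) and not isinstance(value, bool)


def is_str(value):
    return isinstance(value, str)


def is_int_list(value):
    return isinstance(value, list) and all(is_int(item) for item in value)


# Parameter name -> type check, following the documented WDL types.
OPS_REQUIRED = {
    "output_directory": is_str,
    "iss_url": is_str,
    "phenotype_url": is_str,
    "phenotype_dapi_channel": is_int,    # Int
    "phenotype_cyto_channel": is_int_list,  # Array[Int]
    "reads_labels": is_str,
    "docker": is_str,
}


def check_ops_inputs(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key, is_valid in OPS_REQUIRED.items():
        if key not in config:
            problems.append(f"missing: {key}")
        elif not is_valid(config[key]):
            problems.append(f"wrong type: {key}")
    return problems


ops_config = {
    "output_directory": "s3://your-bucket/experiment/ops_results/",
    "iss_url": "s3://your-bucket/experiment/stitch/iss/stitch/stitch.zarr/",
    "phenotype_url": "s3://your-bucket/experiment/stitch/pheno/stitch/stitch.zarr/",
    "phenotype_dapi_channel": 4,
    "phenotype_cyto_channel": [6],
    "reads_labels": "cell",
    "docker": "772311241819.dkr.ecr.us-west-2.amazonaws.com/scallops:1.0.0",
}
print(check_ops_inputs(ops_config))
```

A common slip this catches is passing phenotype_cyto_channel as a bare integer (6) rather than a one-element array ([6]).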

Full Parameter Reference (Advanced)

Below is the complete list of exposed options covering registration, feature extraction, spot detection, and library configuration.

Data & Grouping

| Parameter | Type | Description |
| --- | --- | --- |
| iss_image_pattern | String | Default: "{plate}-{well}-{t}". |
| phenotype_image_pattern | String | Default: "{plate}-{well}-{t}". |
| groupby | Array[String] | Default: ["plate", "well"]. |
| subset | Array[String] | Filter specific wells/plates. |

Segmentation & Registration

| Parameter | Type | Description |
| --- | --- | --- |
| reference_phenotype_time | String | Timepoint to use as reference (e.g., "IF"). |
| phenotype_dapi_channel_before_registration | Int | DAPI index before registration (for pheno-pheno alignment). |
| iss_dapi_channel | Int | DAPI index in ISS images. |
| nuclei_segmentation | String | Method (e.g., "stardist", "cellpose"). |
| cell_segmentation_method | String | Method (e.g., "watershed"). |
| cell_segmentation_extra_arguments | String | Extra flags (e.g., "--closing-radius 5"). |
| register_across_channels | Boolean | Enable cross-channel registration logic. |

Feature Extraction

| Parameter | Type | Description |
| --- | --- | --- |
| phenotype_nuclei_features | Array[String] | List of features (e.g., ["intensity_*"]). |
| phenotype_cell_features | Array[String] | List of features. |
| phenotype_cytosol_features | Array[String] | List of features. |
| features_cell_min_area | Int | Minimum area filter for cells. |
| features_nuclei_min_area | Int | Minimum area filter for nuclei. |

Sequencing (ISS)

| Parameter | Type | Description |
| --- | --- | --- |
| barcodes | String | Path to CSV containing the library design. |
| barcode_column | String | Column name in the barcode CSV. |
| iss_expected_cycles | Int | Number of sequencing cycles. |
| iss_channels | Array[Int] | Channels to use for spot detection. Default: [1,2,3,4]. |
| reads_bases | String | Bases order (e.g., "GTAC"). |
| spot_detection_sigma_log | Array[Float] | Sigma for Laplacian of Gaussian spot detection. |

Additional Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model_dir | String | Path containing deep learning model resources (see the FAQ for more details). |
| run_<task> | Boolean | Set to false (e.g., run_nuclei_segmentation) to skip a task. |
| force_<task> | Boolean | Set to true (e.g., force_segment_cell) to re-run a task even if its output exists. |
| Resources | Various | segment_nuclei_cpu, segment_nuclei_memory, etc. can be set to override defaults. |
| batch_size | Int | Number of groups to process in one batch. |
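The run_<task> and force_<task> switches follow a naming pattern, so they can be generated rather than typed by hand. The helper below is a small sketch; task_toggles is a hypothetical name, and the only task names used are the two that appear as examples in this reference. Check the workflow's input section for the full set of valid names.

```python
def task_toggles(skip=(), force=()):
    """Build run_<task>=false and force_<task>=true entries for an input JSON."""
    toggles = {f"run_{task}": False for task in skip}
    toggles.update({f"force_{task}": True for task in force})
    return toggles


# Skip nuclei segmentation and force cell segmentation to re-run.
toggles = task_toggles(skip=["nuclei_segmentation"], force=["segment_cell"])
print(toggles)

# The toggles can then be merged into an existing inputs dictionary.
inputs = {"reads_labels": "cell"}
inputs.update(toggles)
```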

Outputs

The output_directory will contain subdirectories for every major step:

  • segment.zarr: Nuclei and Cell labels.

  • pheno-to-iss-registered.zarr: Phenotype images transformed to align with ISS.

  • features-nuclei-<index>/, features-cell-<index>/, features-cytosol-<index>/: Parquet files containing calculated features. The <index> identifies a split of the data; the splits are processed in parallel.

  • spot-detect.zarr: Raw spot locations.

  • reads/: Decoded reads per cell.

  • merge/: Final Output. A merged Parquet dataset linking Cell IDs, Barcodes, and Phenotypic Features.
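Because features are written as one directory per split, downstream code usually needs to collect the splits for each label type. The sketch below shows one way to do that for a local copy of the output. It assumes that <index> is a non-negative integer and that the results are on a local filesystem; listing an S3 prefix would need a different client. The name feature_splits is hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: <index> is a non-negative integer, e.g. features-cell-0.
SPLIT_DIR = re.compile(r"^features-(nuclei|cell|cytosol)-(\d+)$")


def feature_splits(output_directory):
    """Map each label type to its split directories, ordered by numeric index."""
    found = {}
    for child in Path(output_directory).iterdir():
        match = SPLIT_DIR.match(child.name)
        if match and child.is_dir():
            found.setdefault(match.group(1), []).append((int(match.group(2)), child))
    return {
        label: [path for _, path in sorted(items, key=lambda item: item[0])]
        for label, items in found.items()
    }


# Demo on a throwaway local directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["features-cell-10", "features-cell-0", "features-cell-1",
                 "features-nuclei-0", "merge"]:
        (Path(tmp) / name).mkdir()
    splits = feature_splits(tmp)
    cell_split_names = [path.name for path in splits["cell"]]

print(cell_split_names)
```

Sorting by the parsed integer rather than by name keeps split 10 after split 2, which a plain string sort would not.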


Running on AWS HealthOmics

AWS HealthOmics provides a managed service for running bioinformatics workflows at scale. Scallops workflows (WDL) are fully compatible with HealthOmics. We recommend using the miniwdl-omics-run tool to simplify the submission process.

Prerequisites

  1. S3 Buckets: You must have S3 buckets for inputs (images) and outputs.

  2. IAM Role: An IAM role with permissions to read/write to your S3 buckets and execution permissions for HealthOmics.

  3. Docker Images: The Scallops Docker image must be in Amazon ECR (Elastic Container Registry).

Step 1: Configure Input JSON

Create a JSON file (e.g., ops_input.json) defining your inputs. Below is a minimal example for the OPS Workflow.

Note: Ensure all S3 paths end with a trailing slash / if they refer to directories.

{
  "iss_url": "s3://your-bucket/experiment/ISS/stitch.zarr/",
  "iss_image_pattern": "{plate}-{well}-{t}",
  "phenotype_url": "s3://your-bucket/experiment/Pheno/stitch.zarr/",
  "phenotype_image_pattern": "{plate}-{well}-{t}",

  "subset": ["A-1", "A-2"],
  "groupby": ["plate", "well"],
  "output_directory": "s3://your-output-bucket/results/experiment_name/",

  "reference_phenotype_time": "IF",
  "phenotype_dapi_channel": 4,
  "phenotype_cyto_channel": [6],

  "phenotype_nuclei_features": ["intensity_*", "sizeshape", "colocalization_*_*", "spots_1,2,3"],
  "phenotype_cell_features": ["intensity_*", "sizeshape", "colocalization_*_*", "spots_1,2,3"],
  "phenotype_cytosol_features": ["intensity_*", "sizeshape", "colocalization_*_*", "spots_1,2,3"],

  "barcodes": "s3://your-bucket/library/barcodes.csv",
  "barcode_column": "opsBarcode",
  "reads_labels": "cell",
  "iss_expected_cycles": 7,
  "reads_bases": "GTAC",

  "segment_cell_threshold_correction_factor": 1.0,
  "cell_segmentation_extra_arguments": "--closing-radius 5",

  "docker": "123456789012.dkr.ecr.us-region-1.amazonaws.com/scallops:latest"
}
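The trailing-slash note is easy to overlook, so a small pre-flight check can be useful. This is a sketch only: which keys hold directory-like paths is an assumption (DIRECTORY_KEYS below lists the ones that end in a slash in the examples), and file paths such as barcodes are deliberately left alone.

```python
# Assumption: these keys refer to directories (Zarr stores or output folders).
DIRECTORY_KEYS = ("iss_url", "phenotype_url", "output_directory")


def missing_trailing_slash(config, directory_keys=DIRECTORY_KEYS):
    """Return the directory-like keys whose value does not end with '/'."""
    return [
        key for key in directory_keys
        if key in config and not str(config[key]).endswith("/")
    ]


def add_trailing_slashes(config, directory_keys=DIRECTORY_KEYS):
    """Return a copy of config with '/' appended where it was missing."""
    fixed = dict(config)
    for key in missing_trailing_slash(config, directory_keys):
        fixed[key] = fixed[key] + "/"
    return fixed


config = {
    "iss_url": "s3://your-bucket/experiment/ISS/stitch.zarr",  # slash missing
    "phenotype_url": "s3://your-bucket/experiment/Pheno/stitch.zarr/",
    "output_directory": "s3://your-output-bucket/results/experiment_name/",
    "barcodes": "s3://your-bucket/library/barcodes.csv",  # a file, not checked
}
print(missing_trailing_slash(config))
fixed_config = add_trailing_slashes(config)
```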

Step 2: Run with miniwdl-omics-run

Use the miniwdl-omics-run utility to submit the workflow. This tool zips your local WDL files, uploads them to S3, and triggers the HealthOmics run.

miniwdl-omics-run \
  scallops/wdl/ops_workflow.wdl \
  -i ops_input.json \
  --role-arn arn:aws:iam::123456789012:role/YourHealthOmicsWorkflowRole \
  --output-uri s3://your-output-bucket/omics-logs/ \
  --name "OPS_Experiment_Run_01"

Arguments Explained:

  • Workflow File: The path to the local main WDL file (e.g., scallops/wdl/ops_workflow.wdl). The tool automatically bundles dependencies such as ops_tasks.wdl.

  • -i: The input JSON file you created in Step 1.

  • --role-arn: The AWS IAM role ARN that HealthOmics assumes to access S3 and CloudWatch.

  • --output-uri: The S3 location where HealthOmics will store execution logs (different from the workflow output_directory).

  • --name: A custom name for the run to identify it in the AWS Console.
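When submissions are scripted, the same command can be assembled as an argument list. The sketch below only builds and prints the command; it uses the flags shown above and nothing else. The function name omics_run_command is hypothetical, and the actual submission line is left commented out because it requires the tool and AWS credentials to be available.

```python
import shlex


def omics_run_command(wdl, inputs_json, role_arn, output_uri, name):
    """Assemble the miniwdl-omics-run argument list using the flags shown above."""
    return [
        "miniwdl-omics-run",
        wdl,
        "-i", inputs_json,
        "--role-arn", role_arn,
        "--output-uri", output_uri,
        "--name", name,
    ]


cmd = omics_run_command(
    wdl="scallops/wdl/ops_workflow.wdl",
    inputs_json="ops_input.json",
    role_arn="arn:aws:iam::123456789012:role/YourHealthOmicsWorkflowRole",
    output_uri="s3://your-output-bucket/omics-logs/",
    name="OPS_Experiment_Run_01",
)

# Print a shell-safe rendering for review before submitting.
print(shlex.join(cmd))

# To submit for real (requires miniwdl-omics-run and AWS credentials):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a run name contains spaces.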


Customizing Workflows

Scallops’ WDL architecture is modular. Key computational steps (such as stitching, registration, and segmentation) are defined as independent Tasks in files like ops_tasks.wdl and stitch_tasks.wdl. This design allows you to construct your own custom workflows by importing these tasks, rather than relying solely on the pre-built end-to-end pipelines.

You can mix and match Scallops tasks with your own custom tasks (e.g., for QC or specific file conversions) to create tailored analysis solutions.

Example: Building a Custom Registration Workflow

Suppose you only need to perform image registration without the full segmentation or sequencing analysis. You can create a simple WDL file that imports the Scallops tasks and calls only the registration step.

  1. Create a new WDL file (e.g., my_registration.wdl).

  2. Import the Scallops tasks file.

  3. Define a workflow that calls the specific task.

version 1.0

# Import the existing Scallops tasks
import "scallops/wdl/ops_tasks.wdl" as tasks

workflow my_custom_registration {
    input {
        String moving_image
        String fixed_image
        String output_dir
        String docker
    }

    # Call the existing Scallops registration task
    call tasks.register_elastix {
        input:
            moving = [moving_image],
            fixed = fixed_image,
            transform_output_directory = output_dir + "/transforms",
            moving_output_directory = output_dir + "/registered_images",
            # Pass through required runtime parameters
            docker = docker,
            cpu = 4,
            memory = "16 GiB",
            # ... (other required inputs like zones, disks, etc.)
    }
}

Modifying Existing Tasks

If the pre-built tasks do not perfectly fit your needs (e.g., you need to change the resource allocation or add a specific command-line flag not currently exposed), you can modify the task definitions directly:

  1. Copy the relevant task file (e.g., ops_tasks.wdl) to your local directory.

  2. Edit the runtime block to adjust memory/CPU, or the command block to add new flags.

  3. Point your workflow to import your modified task file instead of the standard one.

# In your workflow file
import "my_modified_tasks.wdl" as tasks
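If several workflow files need to point at a modified task file, the import line can be rewritten programmatically. This is a minimal sketch that operates on WDL source text held in a string; retarget_import is a hypothetical helper, and it assumes the import statement is written on a single line with double quotes, as in the examples here.

```python
import re


def retarget_import(wdl_text, old_path, new_path):
    """Replace the path in `import "<old_path>"` statements, keeping any alias."""
    pattern = re.compile(
        r'^([ \t]*import[ \t]+")' + re.escape(old_path) + r'(")',
        re.MULTILINE,
    )
    # A function replacement avoids backslash handling in new_path.
    return pattern.sub(lambda m: m.group(1) + new_path + m.group(2), wdl_text)


workflow_text = (
    "version 1.0\n"
    "\n"
    'import "scallops/wdl/ops_tasks.wdl" as tasks\n'
    "\n"
    "workflow my_custom_registration {\n"
    "}\n"
)
updated = retarget_import(
    workflow_text, "scallops/wdl/ops_tasks.wdl", "my_modified_tasks.wdl"
)
print(updated)
```

Because the alias (as tasks) is preserved, calls such as tasks.register_elastix in the workflow body do not need to change.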