Data Structures and data manipulation

In this tutorial we’ll walk you through some of the data structures we use in scallops, how to read it and manipulate them. We will intoduce you to some useful functions when using SCALLOPS API. Let’s start with some general imports:

[1]:

import skimage
import xarray as xr
from scallops.io import read_image
from scallops.datasets import feldman_2019_small

[2]:

experiment_c_path = feldman_2019_small()
image1 = read_image(
    experiment_c_path
    / "input"
    / "10X_c1-SBS-1"
    / "10X_c1-SBS-1_A1_Tile-102.sbs.tif"
)
image1.sizes

[2]:

Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})

We can see that this data, which comes from the public dataset experiment C. This particular image comes from well A1, tile 102 of the SBS experiment cycle 1 (see the c1 in the name). As you can see, the sizes of each dimension can be accessed in this data structure through the sizes attribute, showing a single round (cycle 1), 5 channels corresponding to DAPI + nucleotide channels, one z-plane, and x and y contain 1024 pixels each. Some images will contain the names of each element in those channels, simply by accessing the c attribute of image:

[3]:

image1.c

[3]:

<xarray.DataArray 'c' (c: 5)> Size: 220B
array(['Channel:0:0', 'Channel:0:1', 'Channel:0:2', 'Channel:0:3',
       'Channel:0:4'], dtype='<U11')
Coordinates:
  * c        (c) <U11 220B 'Channel:0:0' 'Channel:0:1' ... 'Channel:0:4'

Some cases (like this), is not very informative (channels names metadata was not provided) so you’ll have to check the experiment for order of said channels. In this case: DAPI, G, T, A, and C.

Indexing and slicing the image

For a thorough description of how to index and slice this data structure, please check the xarray documentation. Briefly, you can select a particular dimension index by using isel or name by using sel:

[4]:

print(image1.isel(c=1).sizes)  # select channel 1 by index
print(image1.sel(c="Channel:0:1").sizes)  # select channel 1 by name

Frozen({'t': 1, 'z': 1, 'y': 1024, 'x': 1024})
Frozen({'t': 1, 'z': 1, 'y': 1024, 'x': 1024})

You can also use isel and slice to slice one of the dimensions:

[5]:

image1.isel(x=slice(30, 40), y=slice(100, 150)).sizes

[5]:

Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 50, 'x': 10})

Concatenate Images

Sometimes we would like to combine a set of images in an specific dimension. An example of that would be to concatenate all cycles into a single multi t image for example. Let’s read a second cycle and concatenate it to the first to have a 2 cycle datastructure:

[6]:

image2 = read_image(
   experiment_c_path
    / "input"
    / "10X_c2-SBS-2"
    / "10X_c2-SBS-2_A1_Tile-102.sbs.tif"
)
image2.sizes

[6]:

Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})

[7]:

xr.concat((image1, image2), dim="t").sizes

[7]:

Frozen({'t': 2, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})

Using xarray’s concat function we can concatenate in any dimension, providing other dimensions broadcast appropriately

Reading a bigger than memory image

Sometimes the images we want to read are too big to fit in memory. For those cases SCALLOPS provides dask option, to lazy load the image:

[8]:

image_dask = read_image(
   experiment_c_path
    / "input"
    / "10X_c2-SBS-2"
    / "10X_c2-SBS-2_A1_Tile-102.sbs.tif",
    dask=True,
)
image_dask.data

[8]:

	Array	Chunk
Bytes	10.00 MiB	2.00 MiB
Shape	(1, 5, 1, 1024, 1024)	(1, 1, 1, 1024, 1024)
Dask graph	5 chunks in 19 graph layers
Data type	uint16 numpy.ndarray

In this case the slicing, and operations with the xarray remain the same, but only will be executed after a computing the operations (see dask for more details).

Read Directory of Images

Reading each cycle separate and then concatenating them is one way to create a single data structure containing the combination of multiple images. However, it becomes cumbersome when not only you might have multiple cycles but also different folders. For example, in our data_path, in experimentC/input we have a structure like this:

├── 10X_c1-SBS-1
│   ├── 10X_c1-SBS-1_A1_Tile-102.sbs.tif
│   └── 10X_c1-SBS-1_A1_Tile-103.sbs.tif
├── 10X_c10-SBS-10
│   ├── 10X_c10-SBS-10_A1_Tile-102.sbs.tif
│   └── 10X_c10-SBS-10_A1_Tile-103.sbs.tif
├── 10X_c2-SBS-2
│   ├── 10X_c2-SBS-2_A1_Tile-102.sbs.tif
│   └── 10X_c2-SBS-2_A1_Tile-103.sbs.tif
├── 10X_c3-SBS-3
│   ├── 10X_c3-SBS-3_A1_Tile-102.sbs.tif
│   └── 10X_c3-SBS-3_A1_Tile-103.sbs.tif
├── 10X_c4-SBS-4
│   ├── 10X_c4-SBS-4_A1_Tile-102.sbs.tif
│   └── 10X_c4-SBS-4_A1_Tile-103.sbs.tif
├── 10X_c5-SBS-5
│   ├── 10X_c5-SBS-5_A1_Tile-102.sbs.tif
│   └── 10X_c5-SBS-5_A1_Tile-103.sbs.tif
├── 10X_c7-SBS-7
│   ├── 10X_c7-SBS-7_A1_Tile-102.sbs.tif
│   └── 10X_c7-SBS-7_A1_Tile-103.sbs.tif
├── 10X_c8-SBS-8
│   ├── 10X_c8-SBS-8_A1_Tile-102.sbs.tif
│   └── 10X_c8-SBS-8_A1_Tile-103.sbs.tif
└── 10X_c9-SBS-9
    ├── 10X_c9-SBS-9_A1_Tile-102.sbs.tif
    └── 10X_c9-SBS-9_A1_Tile-103.sbs.tif

9 cycles (missing 6) of SBS data in two tiles of the well A1. There are multiple ways we might want to group these, but for now let’s assume that we want to access each tile independently with well A1. SCALLOPS provides a nice reading function read_experiment, that allows you to read all images at once, based on a parent folder path, and a file pattern:

[9]:

from scallops.io import read_experiment

experiment = read_experiment(
   experiment_c_path/ "input",
    group_by=("well", "tile"),
    files_pattern="{prefix}/{mag}X_c{t}-{skip}_{well}_Tile-{tile}.{data_type}.tif",
)
experiment

[9]:

Experiment with 2 images and 0 labels

The read_experiment function returns an Experiment object, which contains images or labels. In this particular case, we only see images:

[10]:

experiment.images.keys()

[10]:

dict_keys(['A1-102', 'A1-103'])

We access those images through their names:

[11]:

experiment.images["A1-102"].sizes

[11]:

Frozen({'t': 9, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})

Notice how the pattern follows the f-string, or formatting string literals, where the pattern is done by giving a group name to the curly brackets or {skip}. The latter case is akin to the * in bash. An important obsevation is that the literal t is special, as it will be collected to group by it. In our example above, you can see that t got grouped into the t dimension, while the well/tile grouping allows you to index by a dash-separated string (e.g. A1-102 for well A1 tile 102)

Apply a Function To All Images In An Experiment

Now that we have an experiment object, let’s explore the how to apply functions to every image in an experiment. To showcase this we can use a simple high_pass_filter filter:

[12]:

from itertools import product

from scallops.experiment.util import map_images


def high_pass_filter(image: xr.DataArray, sigma: float) -> xr.DataArray:
    """High pass filter typically used to remove background using gaussian filters.

    :param image: Input image to filter.
    :param sigma: Standard deviation for Gaussian kernel.
    :return: Filtered image.
    """
    im = image.copy()
    for t, c, z in product(
        range(image.t.size), range(image.c.size), range(image.z.size)
    ):
        data = im.isel(t=t, c=c, z=z).data.squeeze()
        lowpass = skimage.filters.gaussian(data, sigma=sigma, preserve_range=True)
        highpass = data - lowpass
        highpass[lowpass > data] = 0
        im.data[t, c, z, ...] = highpass
    return im


gaussian_filtered_experiment = map_images(experiment, high_pass_filter, sigma=3)
gaussian_filtered_experiment.images.keys()

[12]:

dict_keys(['A1-102', 'A1-103'])

The above function applies the high_pass filter to every slice of t, c and z dimensions, returning an experiment with the transformed data.

Apply a Function To All Common Keys In Multiple Experiments

You can also group multiple experiments and apply a function to all common keys. For example, let’s say you want to compute the differences between the gaussian_filtered_experiment and the original experiment. You can simply use map_images like so:

[13]:

def diff(data: xr.DataArray, filtered_data: xr.DataArray):
    return data - filtered_data


diff_experiment = map_images((experiment, gaussian_filtered_experiment), diff)
diff_experiment.images.keys()

[13]:

dict_keys(['A1-102', 'A1-103'])

This will return a new experiment with the same keys, after applying the function. Note that the function pass must take the key-sharing xarrays as inputs.

Read and Write an Experiment

Now, after we have modified our experiment, you want want to save it. Fortunately SCALLOPS Experiment class comes with a method called save it in a zarr format:

[14]:

experiment.save("test.zarr")

The test.zarr will contain the images of the two tiles:

└── images
    ├── A1-102
    └── A1-103

can can be read as before:

[15]:

read_experiment("test.zarr", group_by=("well", "tile"), files_pattern="{well}-{tile}")

[15]:

Experiment with 2 images and 0 labels