Data Structures and data manipulationο
In this tutorial weβll walk you through some of the data structures we use in scallops, how to read it and manipulate them. We will intoduce you to some useful functions when using SCALLOPS API. Letβs start with some general imports:
[1]:
import skimage
import xarray as xr
from scallops.io import read_image
from scallops.datasets import feldman_2019_small
[2]:
experiment_c_path = feldman_2019_small()
image1 = read_image(
experiment_c_path
/ "input"
/ "10X_c1-SBS-1"
/ "10X_c1-SBS-1_A1_Tile-102.sbs.tif"
)
image1.sizes
[2]:
Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})
We can see that this data, which comes from the public dataset experiment C. This particular image comes from well A1, tile 102 of the SBS experiment cycle 1 (see the c1 in the name). As you can see, the sizes of each dimension can be accessed in this data structure through the sizes attribute, showing a single round (cycle 1), 5 channels corresponding to DAPI + nucleotide channels, one z-plane, and x and y contain 1024 pixels each. Some
images will contain the names of each element in those channels, simply by accessing the c attribute of image:
[3]:
image1.c
[3]:
<xarray.DataArray 'c' (c: 5)> Size: 220B
array(['Channel:0:0', 'Channel:0:1', 'Channel:0:2', 'Channel:0:3',
'Channel:0:4'], dtype='<U11')
Coordinates:
* c (c) <U11 220B 'Channel:0:0' 'Channel:0:1' ... 'Channel:0:4'Some cases (like this), is not very informative (channels names metadata was not provided) so youβll have to check the experiment for order of said channels. In this case: DAPI, G, T, A, and C.
Indexing and slicing the imageο
For a thorough description of how to index and slice this data structure, please check the xarray documentation. Briefly, you can select a particular dimension index by using isel or name by using sel:
[4]:
print(image1.isel(c=1).sizes) # select channel 1 by index
print(image1.sel(c="Channel:0:1").sizes) # select channel 1 by name
Frozen({'t': 1, 'z': 1, 'y': 1024, 'x': 1024})
Frozen({'t': 1, 'z': 1, 'y': 1024, 'x': 1024})
You can also use isel and slice to slice one of the dimensions:
[5]:
image1.isel(x=slice(30, 40), y=slice(100, 150)).sizes
[5]:
Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 50, 'x': 10})
Concatenate Imagesο
Sometimes we would like to combine a set of images in an specific dimension. An example of that would be to concatenate all cycles into a single multi t image for example. Letβs read a second cycle and concatenate it to the first to have a 2 cycle datastructure:
[6]:
image2 = read_image(
experiment_c_path
/ "input"
/ "10X_c2-SBS-2"
/ "10X_c2-SBS-2_A1_Tile-102.sbs.tif"
)
image2.sizes
[6]:
Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})
[7]:
xr.concat((image1, image2), dim="t").sizes
[7]:
Frozen({'t': 2, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})
Using xarrayβs concat function we can concatenate in any dimension, providing other dimensions broadcast appropriately
Reading a bigger than memory imageο
Sometimes the images we want to read are too big to fit in memory. For those cases SCALLOPS provides dask option, to lazy load the image:
[8]:
image_dask = read_image(
experiment_c_path
/ "input"
/ "10X_c2-SBS-2"
/ "10X_c2-SBS-2_A1_Tile-102.sbs.tif",
dask=True,
)
image_dask.data
[8]:
|
||||||||||||||||
In this case the slicing, and operations with the xarray remain the same, but only will be executed after a computing the operations (see dask for more details).
Read Directory of Imagesο
Reading each cycle separate and then concatenating them is one way to create a single data structure containing the combination of multiple images. However, it becomes cumbersome when not only you might have multiple cycles but also different folders. For example, in our data_path, in experimentC/input we have a structure like this:
βββ 10X_c1-SBS-1
βΒ Β βββ 10X_c1-SBS-1_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c1-SBS-1_A1_Tile-103.sbs.tif
βββ 10X_c10-SBS-10
βΒ Β βββ 10X_c10-SBS-10_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c10-SBS-10_A1_Tile-103.sbs.tif
βββ 10X_c2-SBS-2
βΒ Β βββ 10X_c2-SBS-2_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c2-SBS-2_A1_Tile-103.sbs.tif
βββ 10X_c3-SBS-3
βΒ Β βββ 10X_c3-SBS-3_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c3-SBS-3_A1_Tile-103.sbs.tif
βββ 10X_c4-SBS-4
βΒ Β βββ 10X_c4-SBS-4_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c4-SBS-4_A1_Tile-103.sbs.tif
βββ 10X_c5-SBS-5
βΒ Β βββ 10X_c5-SBS-5_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c5-SBS-5_A1_Tile-103.sbs.tif
βββ 10X_c7-SBS-7
βΒ Β βββ 10X_c7-SBS-7_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c7-SBS-7_A1_Tile-103.sbs.tif
βββ 10X_c8-SBS-8
βΒ Β βββ 10X_c8-SBS-8_A1_Tile-102.sbs.tif
βΒ Β βββ 10X_c8-SBS-8_A1_Tile-103.sbs.tif
βββ 10X_c9-SBS-9
βββ 10X_c9-SBS-9_A1_Tile-102.sbs.tif
βββ 10X_c9-SBS-9_A1_Tile-103.sbs.tif
9 cycles (missing 6) of SBS data in two tiles of the well A1. There are multiple ways we might want to group these, but for now letβs assume that we want to access each tile independently with well A1. SCALLOPS provides a nice reading function read_experiment, that allows you to read all images at once, based on a parent folder path, and a file pattern:
[9]:
from scallops.io import read_experiment
experiment = read_experiment(
experiment_c_path/ "input",
group_by=("well", "tile"),
files_pattern="{prefix}/{mag}X_c{t}-{skip}_{well}_Tile-{tile}.{data_type}.tif",
)
experiment
[9]:
Experiment with 2 images and 0 labels
The read_experiment function returns an Experiment object, which contains images or labels. In this particular case, we only see images:
[10]:
experiment.images.keys()
[10]:
dict_keys(['A1-102', 'A1-103'])
We access those images through their names:
[11]:
experiment.images["A1-102"].sizes
[11]:
Frozen({'t': 9, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})
Notice how the pattern follows the f-string, or formatting string literals, where the pattern is done by giving a group name to the curly brackets or {skip}. The latter case is akin to the * in bash. An important obsevation is that the literal t is special, as it will be collected to group by it. In our example above, you can see that t got grouped into the t dimension, while the well/tile grouping allows you
to index by a dash-separated string (e.g. A1-102 for well A1 tile 102)
Apply a Function To All Images In An Experimentο
Now that we have an experiment object, letβs explore the how to apply functions to every image in an experiment. To showcase this we can use a simple high_pass_filter filter:
[12]:
from itertools import product
from scallops.experiment.util import map_images
def high_pass_filter(image: xr.DataArray, sigma: float) -> xr.DataArray:
"""High pass filter typically used to remove background using gaussian filters.
:param image: Input image to filter.
:param sigma: Standard deviation for Gaussian kernel.
:return: Filtered image.
"""
im = image.copy()
for t, c, z in product(
range(image.t.size), range(image.c.size), range(image.z.size)
):
data = im.isel(t=t, c=c, z=z).data.squeeze()
lowpass = skimage.filters.gaussian(data, sigma=sigma, preserve_range=True)
highpass = data - lowpass
highpass[lowpass > data] = 0
im.data[t, c, z, ...] = highpass
return im
gaussian_filtered_experiment = map_images(experiment, high_pass_filter, sigma=3)
gaussian_filtered_experiment.images.keys()
[12]:
dict_keys(['A1-102', 'A1-103'])
The above function applies the high_pass filter to every slice of t, c and z dimensions, returning an experiment with the transformed data.
Apply a Function To All Common Keys In Multiple Experimentsο
You can also group multiple experiments and apply a function to all common keys. For example, letβs say you want to compute the differences between the gaussian_filtered_experiment and the original experiment. You can simply use map_images like so:
[13]:
def diff(data: xr.DataArray, filtered_data: xr.DataArray):
return data - filtered_data
diff_experiment = map_images((experiment, gaussian_filtered_experiment), diff)
diff_experiment.images.keys()
[13]:
dict_keys(['A1-102', 'A1-103'])
This will return a new experiment with the same keys, after applying the function. Note that the function pass must take the key-sharing xarrays as inputs.
Read and Write an Experimentο
Now, after we have modified our experiment, you want want to save it. Fortunately SCALLOPS Experiment class comes with a method called save it in a zarr format:
[14]:
experiment.save("test.zarr")
The test.zarr will contain the images of the two tiles:
βββ images
βββ A1-102
βββ A1-103
can can be read as before:
[15]:
read_experiment("test.zarr", group_by=("well", "tile"), files_pattern="{well}-{tile}")
[15]:
Experiment with 2 images and 0 labels