{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "# Data Structures and Data Manipulation" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we'll walk you through some of the data structures used in SCALLOPS and show how to read and manipulate them. We will also introduce some useful functions of the SCALLOPS API. Let's start with some general imports:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:24:59.001235Z", "start_time": "2023-06-28T13:24:51.855297Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:55.570916Z", "iopub.status.busy": "2026-01-22T20:20:55.570406Z", "iopub.status.idle": "2026-01-22T20:20:59.211182Z", "shell.execute_reply": "2026-01-22T20:20:59.210863Z", "shell.execute_reply.started": "2026-01-22T20:20:55.570877Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import skimage\n", "import xarray as xr\n", "from scallops.io import read_image\n", "from scallops.datasets import feldman_2019_small" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:24:59.071480Z", "start_time": "2023-06-28T13:24:59.001764Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:59.211957Z", "iopub.status.busy": "2026-01-22T20:20:59.211637Z", "iopub.status.idle": "2026-01-22T20:20:59.245400Z", "shell.execute_reply": "2026-01-22T20:20:59.245044Z", "shell.execute_reply.started": "2026-01-22T20:20:59.211942Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Frozen({'t': 1, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment_c_path = feldman_2019_small()\n", "image1 = read_image(\n", "    experiment_c_path\n", "    / \"input\"\n", "    / \"10X_c1-SBS-1\"\n", "    / \"10X_c1-SBS-1_A1_Tile-102.sbs.tif\"\n", ")\n", "image1.sizes" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This data comes from the public dataset [experiment C](https://dspace.mit.edu/handle/1721.1/128137). This particular image comes from well A1, tile 102 of SBS cycle 1 (see the `c1` in the name). As you can see, the size of each dimension can be accessed through the `sizes` attribute, which here shows a single round (cycle 1), 5 channels corresponding to DAPI plus the nucleotide channels, one z-plane, and 1024 pixels each in x and y. Some images also carry a name for each channel, which you can see by accessing the `c` attribute of the image:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-01-22T20:20:59.247954Z", "iopub.status.busy": "2026-01-22T20:20:59.247811Z", "iopub.status.idle": "2026-01-22T20:20:59.257236Z", "shell.execute_reply": "2026-01-22T20:20:59.256815Z", "shell.execute_reply.started": "2026-01-22T20:20:59.247944Z" } }, "outputs": [ { "data": { "text/html": [ "
<pre>&lt;xarray.DataArray 'c' (c: 5)&gt; Size: 220B\n",
       "array(['Channel:0:0', 'Channel:0:1', 'Channel:0:2', 'Channel:0:3',\n",
       "       'Channel:0:4'], dtype='&lt;U11')\n",
       "Coordinates:\n",
       "  * c        (c) &lt;U11 220B 'Channel:0:0' 'Channel:0:1' ... 'Channel:0:4'</pre>" ], "text/plain": [ "<xarray.DataArray 'c' (c: 5)> Size: 220B\n", "array(['Channel:0:0', 'Channel:0:1', 'Channel:0:2', 'Channel:0:3',\n", "       'Channel:0:4'], dtype='<U11')\n", "Coordinates:\n", "  * c        (c) <U11 220B 'Channel:0:0' 'Channel:0:1' ... 'Channel:0:4'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image1.c" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<table>\n",
 "  <thead>\n",
 "    <tr><td></td><th>Array</th><th>Chunk</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>Bytes</th><td>10.00 MiB</td><td>2.00 MiB</td></tr>\n",
 "    <tr><th>Shape</th><td>(1, 5, 1, 1024, 1024)</td><td>(1, 1, 1, 1024, 1024)</td></tr>\n",
 "    <tr><th>Dask graph</th><td colspan='2'>5 chunks in 19 graph layers</td></tr>\n",
 "    <tr><th>Data type</th><td colspan='2'>uint16 numpy.ndarray</td></tr>\n",
 "  </tbody>\n",
 "</table>" ], "text/plain": [ "dask.array" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image_dask = read_image(\n", "    experiment_c_path\n", "    / \"input\"\n", "    / \"10X_c2-SBS-2\"\n", "    / \"10X_c2-SBS-2_A1_Tile-102.sbs.tif\",\n", "    dask=True,\n", ")\n", "image_dask.data" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "In this case, slicing and other operations on the xarray remain the same, but they are only executed when the result is explicitly computed (see [dask](https://www.dask.org/) for more details).\n", "\n", "## Read Directory of Images\n", "Reading each cycle separately and then concatenating them is one way to create a single data structure that combines multiple images. However, it becomes cumbersome when you have not only multiple cycles but also multiple folders. 
For example, in our `data_path`, under `experimentC/input`, we have a structure like this:\n", "\n", "```\n", "├── 10X_c1-SBS-1\n", "│   ├── 10X_c1-SBS-1_A1_Tile-102.sbs.tif\n", "│   └── 10X_c1-SBS-1_A1_Tile-103.sbs.tif\n", "├── 10X_c10-SBS-10\n", "│   ├── 10X_c10-SBS-10_A1_Tile-102.sbs.tif\n", "│   └── 10X_c10-SBS-10_A1_Tile-103.sbs.tif\n", "├── 10X_c2-SBS-2\n", "│   ├── 10X_c2-SBS-2_A1_Tile-102.sbs.tif\n", "│   └── 10X_c2-SBS-2_A1_Tile-103.sbs.tif\n", "├── 10X_c3-SBS-3\n", "│   ├── 10X_c3-SBS-3_A1_Tile-102.sbs.tif\n", "│   └── 10X_c3-SBS-3_A1_Tile-103.sbs.tif\n", "├── 10X_c4-SBS-4\n", "│   ├── 10X_c4-SBS-4_A1_Tile-102.sbs.tif\n", "│   └── 10X_c4-SBS-4_A1_Tile-103.sbs.tif\n", "├── 10X_c5-SBS-5\n", "│   ├── 10X_c5-SBS-5_A1_Tile-102.sbs.tif\n", "│   └── 10X_c5-SBS-5_A1_Tile-103.sbs.tif\n", "├── 10X_c7-SBS-7\n", "│   ├── 10X_c7-SBS-7_A1_Tile-102.sbs.tif\n", "│   └── 10X_c7-SBS-7_A1_Tile-103.sbs.tif\n", "├── 10X_c8-SBS-8\n", "│   ├── 10X_c8-SBS-8_A1_Tile-102.sbs.tif\n", "│   └── 10X_c8-SBS-8_A1_Tile-103.sbs.tif\n", "└── 10X_c9-SBS-9\n", "    ├── 10X_c9-SBS-9_A1_Tile-102.sbs.tif\n", "    └── 10X_c9-SBS-9_A1_Tile-103.sbs.tif\n", "```\n", "That is 9 cycles (cycle 6 is missing) of SBS data for two tiles of well A1. There are multiple ways we might want to group these, but for now let's assume that we want to access each tile of well A1 independently. 
SCALLOPS provides a convenient reading function, `read_experiment`, that reads all images at once based on a parent folder path and a file pattern:" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:24:59.197558Z", "start_time": "2023-06-28T13:24:59.156720Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:59.345261Z", "iopub.status.busy": "2026-01-22T20:20:59.345163Z", "iopub.status.idle": "2026-01-22T20:20:59.353331Z", "shell.execute_reply": "2026-01-22T20:20:59.352962Z", "shell.execute_reply.started": "2026-01-22T20:20:59.345252Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Experiment with 2 images and 0 labels" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scallops.io import read_experiment\n", "\n", "experiment = read_experiment(\n", "    experiment_c_path / \"input\",\n", "    group_by=(\"well\", \"tile\"),\n", "    files_pattern=\"{prefix}/{mag}X_c{t}-{skip}_{well}_Tile-{tile}.{data_type}.tif\",\n", ")\n", "experiment" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The `read_experiment` function returns an `Experiment` object, which can contain images and labels. 
In this particular case, we only have images:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:24:59.230801Z", "start_time": "2023-06-28T13:24:59.197763Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:59.353833Z", "iopub.status.busy": "2026-01-22T20:20:59.353722Z", "iopub.status.idle": "2026-01-22T20:20:59.356221Z", "shell.execute_reply": "2026-01-22T20:20:59.355879Z", "shell.execute_reply.started": "2026-01-22T20:20:59.353823Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['A1-102', 'A1-103'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.images.keys()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We access those images by name:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:24:59.434118Z", "start_time": "2023-06-28T13:24:59.231482Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:59.356865Z", "iopub.status.busy": "2026-01-22T20:20:59.356775Z", "iopub.status.idle": "2026-01-22T20:20:59.561247Z", "shell.execute_reply": "2026-01-22T20:20:59.560892Z", "shell.execute_reply.started": "2026-01-22T20:20:59.356856Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Frozen({'t': 9, 'c': 5, 'z': 1, 'y': 1024, 'x': 1024})" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.images[\"A1-102\"].sizes" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "Notice how the pattern follows the syntax of [f-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings), or formatted string literals: each variable part of the file path is captured by giving a group name inside curly brackets, or ignored with `{skip}`. 
The `{skip}` case is akin to `*` in bash. An important observation is that the group name `t` is special: its values are collected and stacked into the `t` dimension. In our example above, you can see that the cycles were grouped into the `t` dimension, while the well/tile grouping lets you index by a dash-separated string (e.g. A1-102 for well A1, tile 102).\n", "\n", "## Apply a Function To All Images In An Experiment\n", "Now that we have an experiment object, let's explore how to apply a function to every image in an experiment. To showcase this we can use a simple [high_pass_filter](https://gred-cumulus.pages.roche.com/scallops/scallops.utils.high_pass_filter.html#scallops.utils.high_pass_filter) filter:" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:25:05.094391Z", "start_time": "2023-06-28T13:24:59.435752Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:20:59.563233Z", "iopub.status.busy": "2026-01-22T20:20:59.563097Z", "iopub.status.idle": "2026-01-22T20:21:07.265785Z", "shell.execute_reply": "2026-01-22T20:21:07.265327Z", "shell.execute_reply.started": "2026-01-22T20:20:59.563222Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['A1-102', 'A1-103'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from itertools import product\n", "\n", "from scallops.experiment.util import map_images\n", "\n", "\n", "def high_pass_filter(image: xr.DataArray, sigma: float) -> xr.DataArray:\n", "    \"\"\"High pass filter typically used to remove background using gaussian filters.\n", "\n", "    :param image: Input image to filter.\n", "    :param sigma: Standard deviation for Gaussian kernel.\n", "    :return: Filtered image.\n", "    \"\"\"\n", "    im = image.copy()\n", "    # Filter every 2D (y, x) plane independently.\n", "    for t, c, z in product(\n", "        range(image.t.size), range(image.c.size), range(image.z.size)\n", "    ):\n", "        data = im.isel(t=t, c=c, z=z).data.squeeze()\n", "        # Subtract the Gaussian-blurred background and clip negative values to zero.\n", "        lowpass = skimage.filters.gaussian(data, sigma=sigma, preserve_range=True)\n", "        highpass = data - lowpass\n", "        highpass[lowpass > data] = 0\n", "        im.data[t, c, z, ...] = highpass\n", "    return im\n", "\n", "\n", "gaussian_filtered_experiment = map_images(experiment, high_pass_filter, sigma=3)\n", "gaussian_filtered_experiment.images.keys()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "The call above applies the high-pass filter to every slice along the t, c and z dimensions of each image, returning a new experiment with the transformed data.\n", "\n", "## Apply a Function To All Common Keys In Multiple Experiments\n", "You can also group multiple experiments and apply a function to all common keys. For example, let's say you want to compute the difference between the original `experiment` and the `gaussian_filtered_experiment`. You can simply use `map_images` like so:" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:25:06.978737Z", "start_time": "2023-06-28T13:25:05.089807Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:21:07.268743Z", "iopub.status.busy": "2026-01-22T20:21:07.266662Z", "iopub.status.idle": "2026-01-22T20:21:12.233623Z", "shell.execute_reply": "2026-01-22T20:21:12.233245Z", "shell.execute_reply.started": "2026-01-22T20:21:07.268727Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['A1-102', 'A1-103'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def diff(data: xr.DataArray, filtered_data: xr.DataArray):\n", "    return data - filtered_data\n", "\n", "\n", "diff_experiment = map_images((experiment, gaussian_filtered_experiment), diff)\n", "diff_experiment.images.keys()" ] },
{ "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "This will 
return a new experiment with the same keys after applying the function. Note that the function you pass must take the key-sharing xarrays as inputs.\n", "\n", "## Read and Write an Experiment\n", "Now that we have modified our experiment, you may want to save it. Fortunately, the SCALLOPS `Experiment` class comes with a `save` method that writes it in [zarr format](https://zarr.dev/):" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:25:07.373811Z", "start_time": "2023-06-28T13:25:06.979466Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:21:12.234275Z", "iopub.status.busy": "2026-01-22T20:21:12.234154Z", "iopub.status.idle": "2026-01-22T20:21:12.997200Z", "shell.execute_reply": "2026-01-22T20:21:12.996831Z", "shell.execute_reply.started": "2026-01-22T20:21:12.234263Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "experiment.save(\"test.zarr\")" ] },
{ "cell_type": "markdown", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The `test.zarr` store will contain the images of the two tiles:\n", "\n", "```\n", "└── images\n", "    ├── A1-102\n", "    └── A1-103\n", "```\n", "and can be read as before:" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2023-06-28T13:25:07.410639Z", "start_time": "2023-06-28T13:25:07.374512Z" }, "collapsed": false, "execution": { "iopub.execute_input": "2026-01-22T20:21:12.997708Z", "iopub.status.busy": "2026-01-22T20:21:12.997615Z", "iopub.status.idle": "2026-01-22T20:21:13.000996Z", "shell.execute_reply": "2026-01-22T20:21:13.000748Z", "shell.execute_reply.started": "2026-01-22T20:21:12.997699Z" }, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "Experiment with 2 images and 0 labels" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "read_experiment(\"test.zarr\", 
group_by=(\"well\", \"tile\"), files_pattern=\"{well}-{tile}\")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.12" } }, "nbformat": 4, "nbformat_minor": 4 }