Optical Pooled Screening Example
Before starting any full scale experiment, it is always recommended to explore a few tiles such that thresholds are set, segmentation is evaluated and in general have a better understanding of the data. This notebook walks you through some of the steps that you can use to do such a task
Load Experiment
The first step in any pipeline is to load the experiment. SCALLOPS can load single image with read_image, or an entire experiment (or a subset if requested) with read_experiment:
[1]:
from scallops.datasets import feldman_2019_small
from scallops.io import read_experiment
experiment_c_path = feldman_2019_small()
image_path = experiment_c_path / "input"
image_pattern = "{prefix}/{mag}X_c{t}-{experiment}-{t}_{well}_Tile-{tile}.{datatype}.tif" # Pattern that is followed by the images
In this particular example we will use several tiles of the public dataset experiment C. Its directory structure is (from the input directory):
├── 10X_c1-SBS-1
│ ├── 10X_c1-SBS-1_A1_Tile-102.sbs.tif
│ └── 10X_c1-SBS-1_A1_Tile-103.sbs.tif
├── 10X_c10-SBS-10
│ ├── 10X_c10-SBS-10_A1_Tile-102.sbs.tif
│ └── 10X_c10-SBS-10_A1_Tile-103.sbs.tif
├── 10X_c2-SBS-2
│ ├── 10X_c2-SBS-2_A1_Tile-102.sbs.tif
│ └── 10X_c2-SBS-2_A1_Tile-103.sbs.tif
├── 10X_c3-SBS-3
│ ├── 10X_c3-SBS-3_A1_Tile-102.sbs.tif
│ └── 10X_c3-SBS-3_A1_Tile-103.sbs.tif
├── 10X_c4-SBS-4
│ ├── 10X_c4-SBS-4_A1_Tile-102.sbs.tif
│ └── 10X_c4-SBS-4_A1_Tile-103.sbs.tif
├── 10X_c5-SBS-5
│ ├── 10X_c5-SBS-5_A1_Tile-102.sbs.tif
│ └── 10X_c5-SBS-5_A1_Tile-103.sbs.tif
├── 10X_c7-SBS-7
│ ├── 10X_c7-SBS-7_A1_Tile-102.sbs.tif
│ └── 10X_c7-SBS-7_A1_Tile-103.sbs.tif
├── 10X_c8-SBS-8
│ ├── 10X_c8-SBS-8_A1_Tile-102.sbs.tif
│ └── 10X_c8-SBS-8_A1_Tile-103.sbs.tif
└── 10X_c9-SBS-9
├── 10X_c9-SBS-9_A1_Tile-102.sbs.tif
└── 10X_c9-SBS-9_A1_Tile-103.sbs.tif
that is 9 cycles (missing cycle #6) of SBS data in two tiles of the well A1. The pattern that we have provided above ({prefix}/{mag}X_c{t}-{experiment}-{t}_{well}_Tile-{tile}.{datatype}.tif) contains:
prefix: this will store any parent directory name, e.g. 10X_c8-SBS-8, etc.
mag: Will store only the integer part of the magnification, 10 in this case.
t: Will store the t dimension. Since this is SBS data, it constitutes the cycle information.
experiment: The type of experiment.
well: from which well it comes from.
tile: which tile we are dealing with.
datatype: which kind of data it is. In this case, sbs.
All of these patterns can be arbitrary names that makes sense to you with the exception of the t, c, or z handles, as SCALLOPS will look for this name to store it appropriately.
Now that we know the structure of our data, let’s read it:
[2]:
iss_experiment = read_experiment(image_path, image_pattern, group_by=("well", "tile"))
print(iss_experiment)
print(iss_experiment.images.keys())
print(
read_experiment(
image_path, image_pattern, group_by=("well", "tile"), subset=["A1-102"]
)
) # If you want to subset your experiment
Experiment with 2 images and 0 labels
dict_keys(['A1-102', 'A1-103'])
Experiment with 1 image and 0 labels
Now we have our experiment loaded and ready to go. Now let’s access some images, and deal with the barcodes.
Load ISS Image and Barcodes
For some of the examples we will show you, we will use the image A1-102 as key. For experiment C the order of the bases was G, T, A C, and the dapi channel was capture first (index 0). All these parameters are important to do some of the processing of the data:
[3]:
import numpy as np
from scallops.io import read_barcodes
image_key = "A1-102" # Set the image key
dapi_channel = 0 # set a single dapi channel index
barcodes_path = experiment_c_path / "barcodes.csv"
iss_image = iss_experiment.images[image_key].max(
dim="z", keep_attrs=True
) # Get the image, and use the max z if any exist
iss_channels = np.delete(
np.arange(iss_image.sizes["c"]), dapi_channel
) # get the ISS only channels (remove dapi)
df_barcode = read_barcodes(
barcodes_path, iss_image.t.values - 1
) # Read the barcodes using the actual t indexes (missing cycle 6)
print("Image dimensions:", iss_image.sizes)
df_barcode.head()
Image dimensions: Frozen({'t': 9, 'c': 5, 'y': 1024, 'x': 1024})
[3]:
| barcode | sgRNA | gene_symbol | |
|---|---|---|---|
| 0 | CATTGACGC | CATTGCACGCCACAGCATTG | non-targeting |
| 1 | GAGTTCTAG | GAGTTTCTAGTAGTGGTAGG | non-targeting |
| 2 | AACGCGTCG | AACGCTGTCGTACGTGTATA | non-targeting |
| 3 | GCGGGCGGT | GCGGGGCGGTGACTTTCAAG | non-targeting |
| 4 | TTCTAGCCA | TTCTAAGCCACGTGTGGTAC | non-targeting |
Note that the barcodes contain all cycles except 6, as we expected. The read_barcodes is a utility simple function that also allows you to have a different name for the barcode by passing the barcode_column argument.
ISS Image
Now that we have a single image loaded into memory (iss_image), it is time to explore how it looks. There are a number of things that you might want to check, starting with what your data looks like in each channel. Luckily SCALLOPS’ plotting submodules will help with that. Let’s start with a montage plot of the channels with montage_plot:
[4]:
from scallops.visualize.composite import montage_plot
montage_plot(iss_image)
The figure above shows you every channel and you can make inferences about the quality, etc. However, it tells us very little about the alignment of the channels. During capture, channels and cycles can get misaligned, and that can create issues for downstream analyses.
Registration
The act of aligning channels and cycles is called registration. This is a crucial step, as it will be needed for base calling, etc. But before we go to the registration functions, let’s first verify how badly mis-aligned things are. SCALLOPS provides two ways to check this: imcomposite and diagnose_registration.
imcomposite
This shows a composite of all channels in a single image over a particular dimension. It is the most flexible and complete function for this task:
[5]:
from scallops.visualize.composite import imcomposite
imcomposite(iss_image.isel(c=dapi_channel), dim="t", figsize=(10, 10));
Here you can clearly see misalignment.
diagnose_registration
The other lightweight option is called diagnose_registration which is a simpler function. It has more limitations (only up to 6 images can be combined), but provides a bit more contrast than the imcomposite:
[6]:
import matplotlib.pyplot as plt
from scallops.visualize.registration import diagnose_registration
# Let's get some selectors
selectors = [
{"t": 1, "c": "Channel:0:0"}, # DAPI in first cycle
{"t": 1, "c": "Channel:0:1"}, # G in first cycle
{"t": 1, "c": "Channel:0:2"}, # T in first cycle
{"t": 1, "c": "Channel:0:3"}, # A in first cycle
{"t": 1, "c": "Channel:0:4"}, # C in first cycle
{"t": 2, "c": "Channel:0:0"}, # DAPI in second cycle
]
f, ax = plt.subplots(figsize=(10, 10))
axes = diagnose_registration(iss_image, *selectors, title="Before alignment")
The reason for the cap in the number of selectors is to keep the colormaps more dissimilar.
Both above cases showed us that the cycles and channels are not fully aligned, and we need to deal with this. This process is called registration. You can register an images with align_image:
[7]:
from scallops.registration.crosscorrelation import align_image
iss_image = align_image(
iss_image,
align_within_time_channels=iss_channels, # Ensure we align all channels within a cycle
align_between_time_channel=dapi_channel, # Ensure we align all cycles
filter_percentiles=[0, 90], # reduces potential noise during aligning (is optional)
)
imcomposite(iss_image.isel(c=dapi_channel), dim="t", figsize=(10, 10));
Now our ISS image is aligned.
Segmentation
The next step after registration is to segment the cellular compartments, starting with nuclei. SCALLOPS provides multiple algorithms for segmentation.
Segment Nuclei
Most OPS experiments start with segementing the nuclei. In fact most of our cellular segmentation options are seeded with the nuclear mask. For this tutorial we will focus on the stardist method, as we find it usually gives a good segmentation. That being said, the proper segmentation algorithm is very dataset dependent, so we encourage you to explore multiple algorithms. In SCALLOPS, stardist nuclear segmentation is called segment_nuclei_stardist.
[8]:
from scallops.segmentation.stardist import segment_nuclei_stardist
all_nuclei = segment_nuclei_stardist(image=iss_image, nuclei_channel=dapi_channel)
print(f"Nuclei, {len(np.unique(all_nuclei)) - 1:,}")
bioimageio_utils.py (2): pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
tf2onnx_lib.py (8): In the future `np.object` will be defined as the corresponding NumPy scalar.
Nuclei, 2,968
After acquiring the nuclei mask we would like to:
Visualize the mask
Check some geometric features
These two steps will help you verify and filter potential artifact of bad segmented areas, and define thresholds. For the first one we can use imcomposite again:
[9]:
imcomposite(
iss_image.isel(c=dapi_channel), dim="t", figsize=(10, 10), labels_contour=all_nuclei
);
Now lets look at the distribution of nuclei areas and set some thresholds. To extract this feature we will be using skimage.measure.regionprops_table:
[10]:
from skimage.measure import regionprops_table
nuclei_props = regionprops_table(all_nuclei, properties=["area"])
plt.hist(nuclei_props["area"], bins=100);
The distribution shows us a long tail, which is something that you would want to be inspected. From the visualization we can also see that there are a set of nuclei that seem too small. For the purposes of this example, we will assume that objects smaller than 30 and bigger than 200 are either artifacts, cells in mitotic state, or in general something we would like to filter out. SCALLOPS has a utility function filter_labels_by_area:
[11]:
from scallops.segmentation.util import remove_labels_by_area
nuclei_area_min = 30
nuclei_area_max = 200
nuclei = remove_labels_by_area(all_nuclei, nuclei_area_min, nuclei_area_max)
nuclei_props = regionprops_table(nuclei, properties=["area"])
f, axs = plt.subplots(ncols=2, figsize=(20, 10), layout="constrained")
axs[0].hist(nuclei_props["area"], bins=100)
imcomposite(iss_image.isel(c=dapi_channel), dim="t", labels_contour=nuclei, ax=axs[1]);
Segment cells
Now that we nuclei, we segment the cells. In some datasets, we found that given good nuclei seeds, watershed does an good job in segmenting the cells if the cell staining channel is well saturated. For this example we will be using segment_cells_watershed. In order to so, however, we do need to provide a threshold. There are a number of automatic algorithms available in this function (see API), but for simplicity we will pass a single threshold value. To do this, we need to explore the
distribution of the intensities, including and excluding nuclei. Since our data so far is ISS, we do not have a cytosolic channel (such as ConA for example). In these cases, we can create an average mask by using SCALLOPS’ function cyto_channel_summary:
[12]:
from scallops.segmentation.util import cyto_channel_summary
cell_mask = cyto_channel_summary(
iss_image.isel(t=0, c=iss_channels)
) # get an average mask for cells
f, ax = plt.subplots(figsize=(6, 6))
cell_mask_clipped = cell_mask.clip(
max=cell_mask.quantile(0.98)
) # Avoid really bright spots
in_nuc = cell_mask_clipped.data[nuclei > 0].flatten() # Get values in nuclei
out_nucl = cell_mask_clipped.data[nuclei == 0].flatten() # Get values outside nuclei
ax.boxplot([in_nuc, out_nucl], tick_labels=["In Nuclei", "Not in Nuclei"])
ax.set_ylabel("Intensity in Cell Mask");
Judging by the above plot we set the cell threshold to 650:
[13]:
from scallops.segmentation.watershed import segment_cells_watershed
cell_segmentation_threshold = 650 # Set the threshold
all_cells, _ = segment_cells_watershed(
iss_image, nuclei, threshold=cell_segmentation_threshold
) # segment the cells
imcomposite(
iss_image.isel(c=dapi_channel),
dim="t",
labels=nuclei, # visualize nuclei solid
labels_contour=all_cells, # visualize cells as boundaries
figsize=(10, 10),
)
plt.title(f"Cells, {len(np.unique(all_cells)) - 1:,}");
We can see a pretty good segmentation (despite some holes and lack of definition here and there) given that there is no cytosolic mask. Let us assume that this segmentation is good, and lets clean it up a little. Similar to the nuclei, we can visualize and inspect the cellular area, and then filter based on set thresholds:
[14]:
cell_props = regionprops_table(all_cells, properties=["area"])
plt.hist(cell_props["area"], bins=100);
Now we can set the minimum and maximum areas allowed to 40 and 500:
[15]:
cell_area_min = 40
cell_area_max = 500
cells = remove_labels_by_area(all_cells, cell_area_min, cell_area_max)
cell_props = regionprops_table(cells, properties=["area"])
fig, axes = plt.subplots(1, 2, figsize=(10, 5), layout="constrained")
axes[0].set_title("Cell Area")
axes[0].hist(cell_props["area"], bins=100)
axes[1].set_title(f"Cells, {len(np.unique(cells)) - 1:,}")
imcomposite(image=iss_image.isel(c=dapi_channel), dim="t", labels=cells, ax=axes[1])
axes[1].axis("off");
Now we have both cells and nuclei segmented, we can move to barcode calling.
Barcode calling
In OPS to call the barcodes we need to:
Find peaks: Identify the intensity peaks that are shared across all ISS channels
Filter peaks: Identify false positive and filter them
Cross-talk correction: Compute the cross-talk and corrected
Call Bases: Based on the max corrected intensities in each cycle, call the bases
Call reads: Call reads by calling the base with the highest intensity
Find Peaks
Finding the peaks requires to denoise the image with Laplacian of Gaussian (LoG) transform, followed by computing the standard deviation across sequencing channels, to identify where peaks are. This can be done with transform_log, std, and find_peaks, respectively:
[16]:
from scallops.spots import find_peaks, std, transform_log
loged = transform_log(iss_image.isel(c=iss_channels))
std_arr = std(loged)
peaks = find_peaks(std_arr)
bases_array = find_peaks(std_arr)
Call Bases
To call bases we need to compute the maximum filter over the LoG transform using the max_filter function followed by the calling function peaks_to_bases:
[17]:
from scallops.reads import peaks_to_bases
from scallops.spots import max_filter
maxed = max_filter(loged)
bases_array = peaks_to_bases(maxed=maxed, peaks=peaks, labels=cells)
bases_array.sizes
[17]:
Frozen({'read': 10557, 't': 9, 'c': 4})
Now we have all unfiltered peaks. We divide the unfiltered peaks into high quality and low quality peaks by computing the probability of the bases in the read at the peak location. We choose the peak threshold with the maximimum F0.5 score using . peak_thresholds_from_bases
[18]:
from scallops.spots import peak_thresholds_from_bases
threshold_peaks_crosstalk_correction_df = peak_thresholds_from_bases(
bases_array=bases_array
)
threshold_peaks_crosstalk_correction = threshold_peaks_crosstalk_correction_df.iloc[0][
"threshold"
]
f"Crosstalk peak threshold: {threshold_peaks_crosstalk_correction}"
[18]:
'Crosstalk peak threshold: 103.40227203832168'
Cross-talk correction
Now we can visualize the cross-talk with pairwise_channel_scatter_plot:
[19]:
from scallops.visualize.crosstalk import pairwise_channel_scatter_plot
fig, _ = pairwise_channel_scatter_plot(
bases_array.where(
bases_array.peak > threshold_peaks_crosstalk_correction, drop=True
)
)
fig.suptitle("Uncorrected Intensity (In Cells)")
fig.tight_layout()
We can see that there is some cross-talk, especially betwen A and C and T and G. Now we need to compute the cross-talk matrix with channel_crosstalk_matrix, correct the cross-talk with apply_channel_crosstalk_matrix, and again visualize the result of the correction.
[20]:
from scallops.reads import apply_channel_crosstalk_matrix, channel_crosstalk_matrix
w = channel_crosstalk_matrix(
bases_array.where(
bases_array.peak > threshold_peaks_crosstalk_correction, drop=True
)
)
corrected_bases_array = apply_channel_crosstalk_matrix(bases_array, w)
fig, _ = pairwise_channel_scatter_plot(corrected_bases_array)
fig.suptitle("Corrected Intensity (In Cells)")
fig.tight_layout()
Call Reads
Now that we have a corrected set of base intensities, we can call the reads using SCALLOPS’ decode-max function, and we can see some descriptive read statistics with read_statistics:
[21]:
from scallops.reads import decode_max, read_statistics
from scallops.spots import peak_thresholds_from_reads
df_reads = decode_max(corrected_bases_array, barcodes=df_barcode)
peak_thresholds_lower_df = peak_thresholds_from_reads(df_reads.query("barcode_match"))
threshold_peaks = peak_thresholds_lower_df.iloc[0]["threshold"]
f"Lower peak threshold: {threshold_peaks:.2f}"
[21]:
'Lower peak threshold: 49.46'
[22]:
df_reads = df_reads.query(f"peak>{threshold_peaks}")
read_statistics(df_reads)
[22]:
{'mapped_reads': 5981,
'number_of_reads': 7632,
'mapping_rate': 0.7836740041928721,
'mapping_rate_within_labels': 0.7836740041928721,
'mapped_reads_within_labels': 5981,
'average_reads_per_label': 3.0974025974025974,
'average_mapped_reads_per_label': 2.427353896103896,
'number_of_unique_barcodes_in_labels': 825,
'mean_barcode_count_in_labels': 7.24969696969697,
'labels_with_reads': 2464,
'labels_with_mapped_reads': 2202}
Visualize Single Base Calling Errors
After base calling we want to visualize the biases and mismatches across our bases. We can obtain the summary of the mismatches using SCALLOPS’ summarize_base_call_mismatches, and then visualize with base_call_mismatches_heatmap:
[23]:
from scallops.reads import summarize_base_call_mismatches
from scallops.visualize.heatmap import base_call_mismatches_heatmap
base_call_mismatches_df = summarize_base_call_mismatches(df_reads, df_barcode)
display(base_call_mismatches_df.head())
base_call_mismatches_heatmap(base_call_mismatches_df);
| called_base | whitelist_base | read_position | count | fraction | |
|---|---|---|---|---|---|
| 0 | A | C | 0 | 2 | 0.003868 |
| 1 | A | C | 1 | 2 | 0.003868 |
| 2 | A | C | 2 | 2 | 0.003868 |
| 3 | A | C | 3 | 2 | 0.003868 |
| 4 | A | C | 4 | 8 | 0.015474 |
Correct Reads With One Mismatch
The figure above showed us that there are mismatches from our expected barcode list. We can try fliping bases to match the whitelist. This can be done by computing the hamming distance and restricting the flipping to a desired number of mismatches you want fixed. This can easily be achieved by SCALLOPS’ correct_mismatches function:
[24]:
from scallops.reads import correct_mismatches
df_reads = correct_mismatches(df_reads, df_barcode)
read_statistics(df_reads)
[24]:
{'mapped_reads': 6498,
'number_of_reads': 7632,
'mapping_rate': 0.8514150943396226,
'mapping_rate_within_labels': 0.8514150943396226,
'mapped_reads_within_labels': 6498,
'average_reads_per_label': 3.0974025974025974,
'average_mapped_reads_per_label': 2.637175324675325,
'number_of_unique_barcodes_in_labels': 949,
'mean_barcode_count_in_labels': 6.847207586933615,
'labels_with_reads': 2464,
'labels_with_mapped_reads': 2297}
[25]:
df_reads.head()
[25]:
| y | x | sigma | peak | label | read | barcode | Q | Q_mean | Q_min | barcode_match | mismatches | closest_match | mismatches2 | closest_match2 | barcode_uncorrected | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 5 | 756 | 1 | 1189.094209 | 924 | 19 | AAGCCAATT | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | True | NaN | NaN | NaN | NaN | NaN |
| 20 | 5 | 791 | 1 | 488.943090 | 2258 | 20 | AGCATGCCA | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | True | NaN | NaN | NaN | NaN | NaN |
| 21 | 5 | 796 | 1 | 372.566588 | 2258 | 21 | AGCATGCCA | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | True | NaN | NaN | NaN | NaN | NaN |
| 22 | 5 | 809 | 1 | 187.319520 | 2258 | 22 | AAGAGTTCT | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | True | NaN | NaN | NaN | NaN | NaN |
| 39 | 6 | 481 | 1 | 58.776617 | 1506 | 39 | AAGGGGTCT | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | False | 1.0 | AAGGCGTCT | 1.0 | AAGGGCTCT | NaN |
Our overall mapping rate is now higher and we also find more cells with mapped reads than before the correction.
Interactive visualization of Base Calls
We can use Napari to interactively visualize the base calls and the underlying data with SCALLOPS’ add_bases:
[26]:
import napari
from scallops.visualize.napari import add_bases
viewer = napari.Viewer()
viewer.add_image(
iss_image.squeeze(),
channel_axis=1,
visible=[False, True, True, True, True],
name=["DAPI"] + ["G", "T", "A", "C"],
)
add_bases(viewer, df_reads)
viewer.add_labels(
cells,
name="cells",
colormap=napari.utils.DirectLabelColormap(color_dict={None: (1, 1, 1, 0.5)}),
).contour = 1
The above code should open a new window with napari, and the bases should be in the left-hand side panel. A zoom in should look like this:
Assign Barcodes To Cells
So far we have successfully called the barcodes, but we still need to assign the barcode to each label. This can be achieved by calling assign_barcodes_to_labels:
[27]:
from scallops.reads import assign_barcodes_to_labels
df_cells = assign_barcodes_to_labels(df_reads).set_index("label")
df_cells
[27]:
| mismatch | barcode_Q_mean | barcode_Q_min | barcode_peak | barcode_count | barcode_0 | barcode_Q_mean_0 | barcode_Q_min_0 | barcode_peak_0 | barcode_count_0 | barcode_Q_0 | barcode_1 | barcode_Q_mean_1 | barcode_Q_min_1 | barcode_peak_1 | barcode_count_1 | barcode_Q_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| label | |||||||||||||||||
| 924 | False | 60.000000 | 60.000000 | 1189.094209 | 1 | AAGCCAATT | 60.000000 | 60.000000 | 1189.094209 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2258 | False | 300.000000 | 300.000000 | 1801.712283 | 5 | AGCATGCCA | 180.000000 | 180.000000 | 1479.391942 | 3 | [180.0, 180.0, 180.0, 180.0, 180.0, 180.0, 180... | AAGAGTTCT | 60.000000 | 60.000000 | 187.319520 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... |
| 1506 | True | 60.000000 | 60.000000 | 58.776617 | 1 | AAGGGGTCT | 60.000000 | 60.000000 | 58.776617 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2900 | False | 120.000000 | 120.000000 | 1040.681147 | 2 | AAGCCAATT | 120.000000 | 120.000000 | 1040.681147 | 2 | [120.0, 120.0, 120.0, 120.0, 120.0, 120.0, 120... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2664 | False | 238.460434 | 226.906937 | 1075.447078 | 4 | TCCATTGCA | 238.460434 | 226.906937 | 1075.447078 | 4 | [240.0, 226.90694, 240.0, 240.0, 240.0, 240.0,... | 0.000000 | 0.000000 | NaN | 0 | None | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1291 | True | 106.944305 | 2.498775 | 1709.110210 | 2 | TTTGGGCAG | 53.472153 | 1.249387 | 1355.472936 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | GGGGTGCAG | 53.472153 | 1.249387 | 353.637274 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... |
| 2826 | True | 53.472153 | 1.249387 | 87.979374 | 1 | ACAACCCAG | 53.472153 | 1.249387 | 87.979374 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 60.... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2584 | True | 47.288986 | 1.249387 | 232.257161 | 1 | GCCATGCTG | 47.288986 | 1.249387 | 232.257161 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 4.35147, ... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2539 | True | 47.441696 | 1.249387 | 458.368261 | 1 | GGCAGGCTG | 47.441696 | 1.249387 | 458.368261 | 1 | [60.0, 60.0, 60.0, 60.0, 60.0, 60.0, 5.725899,... | 0.000000 | 0.000000 | NaN | 0 | None | |
| 2768 | True | 47.338661 | 1.249387 | 54.133295 | 1 | ATTTTTCTG | 47.338661 | 1.249387 | 54.133295 | 1 | [58.82297, 60.0, 60.0, 60.0, 60.0, 60.0, 5.975... | 0.000000 | 0.000000 | NaN | 0 | None |
2464 rows × 17 columns
Now the label ID is in the index of df_cells.
Extracting phenotype information
Now that we have our barcode called, we can extract the phenotypes. These steps are independent of the ISS steps and can be run in parallel up to the merging step.
Loading phenotypic data
The loading of the pheno data proceeds in the same fashion as the ISS data:
[28]:
phenotype_image_path = experiment_c_path / "10X_c0-DAPI-p65ab"
phenotype_image_pattern = (
"{mag}X_c{t}-{experiment}-{t}_{well}_Tile-{tile}.{datatype}.tif"
)
phenotype_experiment = read_experiment(phenotype_image_path, phenotype_image_pattern)
print(phenotype_experiment)
print(phenotype_experiment.images.keys())
Experiment with 2 images and 0 labels
dict_keys(['A1-102', 'A1-103'])
[29]:
phenotype_experiment.images[image_key].sizes # using the same image key as in ISS
[29]:
Frozen({'t': 1, 'c': 2, 'z': 1, 'y': 1024, 'x': 1024})
Note that this image contains only one “cycle” or timepoint, and only 2 channels of phenotyping. The channels correspond to DAPI and to the NF-κB protein expression.
Alignment with ISS
Since our goal is to estimate features and match it with the ISS data, is important to have a reliable registration between the two modalities (ISS and phenotype). In this case both modalities are in 10X and the alignment between them is not too off, therefore, we can use the cross correlation alignment again. Note that SCALLOPS also provides ITK registration.
[30]:
from scallops.registration.crosscorrelation import ( # note that this is a different function than align_image
align_images,
)
pheno_image = phenotype_experiment.images[image_key].max(dim="z", keep_attrs=True)
pheno_aligned = align_images(iss_image.isel(t=0), pheno_image.isel(t=0))
pheno_aligned.sizes
[30]:
Frozen({'c': 2, 'y': 1024, 'x': 1024})
We can check how good the alignment is by either plotting a phenotype composite with the ISS labels or we can merge the two images and visualize the composite:
[31]:
imcomposite(pheno_aligned, labels=nuclei, labels_contour=cells, figsize=(10, 10));
[32]:
import xarray as xr
combined = xr.concat(
[iss_image, pheno_aligned.expand_dims(["t"])], dim="c", join="outer"
)
imcomposite(combined.isel(c=dapi_channel), dim="t", figsize=(10, 10));
Compute Features
To compute the features you need to pass one of the features you want with the parameters for said features. In this example we will get the area, eccentricity and the correlation between DAPI (channel index 0) and NF-κB (channel index 1):
[33]:
from scallops.features.find_objects import find_objects
from scallops.features.generate import label_features
objects_df = find_objects(cells)
phenotype_results = label_features(
objects_df=objects_df,
intensity_image=pheno_aligned.transpose(
*("y", "x", "c")
).data, # data needs to be transposed so that channel is last
label_image=cells, # You can pass any segmentation label
features=["intensity_*", "size-shape"],
)
df_phenotype = objects_df.join(phenotype_results)
df_phenotype
measureobjectsizeshape.py (637): divide by zero encountered in divide
zernike.py (181): invalid value encountered in divide
zernike.py (184): invalid value encountered in divide
[33]:
| AreaShape_BoundingBoxMinimum_Y | AreaShape_BoundingBoxMaximum_Y | AreaShape_BoundingBoxMinimum_X | AreaShape_BoundingBoxMaximum_X | AreaShape_Center_Y | AreaShape_Center_X | AreaShape_Area | AreaShape_BoundingBoxArea | AreaShape_ConvexArea | AreaShape_EquivalentDiameter | ... | Intensity_IntegratedIntensityEdge_Channel0 | Intensity_MeanIntensityEdge_Channel0 | Intensity_StdIntensityEdge_Channel0 | Intensity_MinIntensityEdge_Channel0 | Intensity_MaxIntensityEdge_Channel0 | Intensity_IntegratedIntensityEdge_Channel1 | Intensity_MeanIntensityEdge_Channel1 | Intensity_StdIntensityEdge_Channel1 | Intensity_MinIntensityEdge_Channel1 | Intensity_MaxIntensityEdge_Channel1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 117 | 136 | 698 | 718 | 125.938525 | 708.340164 | 244.0 | 380.0 | 261.0 | 17.625846 | ... | 0.515236 | 0.010103 | 0.003856 | 0.007797 | 0.026612 | 1.439216 | 0.028220 | 0.007859 | 0.019028 | 0.053559 |
| 2 | 418 | 440 | 355 | 378 | 427.787645 | 367.528958 | 259.0 | 506.0 | 352.0 | 18.159544 | ... | 0.871229 | 0.009470 | 0.002003 | 0.007767 | 0.017273 | 2.553872 | 0.027759 | 0.003378 | 0.021912 | 0.036515 |
| 3 | 264 | 280 | 491 | 510 | 271.485000 | 500.485000 | 200.0 | 304.0 | 223.0 | 15.957691 | ... | 0.490776 | 0.009438 | 0.001078 | 0.008225 | 0.012848 | 1.407126 | 0.027060 | 0.003307 | 0.020035 | 0.034806 |
| 4 | 208 | 228 | 663 | 676 | 217.015000 | 668.940000 | 200.0 | 260.0 | 221.0 | 15.957691 | ... | 0.606455 | 0.011891 | 0.004497 | 0.007965 | 0.027039 | 1.345937 | 0.026391 | 0.003446 | 0.019196 | 0.035111 |
| 5 | 509 | 534 | 896 | 916 | 521.124113 | 906.379433 | 282.0 | 500.0 | 322.0 | 18.948708 | ... | 0.674846 | 0.009780 | 0.002776 | 0.007721 | 0.021027 | 1.410269 | 0.020439 | 0.003635 | 0.013596 | 0.031800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2963 | 42 | 51 | 534 | 544 | 45.354167 | 539.437500 | 48.0 | 90.0 | 59.0 | 7.817640 | ... | 0.257374 | 0.010295 | 0.002209 | 0.008026 | 0.016495 | 0.550072 | 0.022003 | 0.003705 | 0.016495 | 0.031434 |
| 2964 | 816 | 823 | 627 | 642 | 819.115942 | 634.565217 | 69.0 | 105.0 | 76.0 | 9.373021 | ... | 0.358755 | 0.011958 | 0.003057 | 0.008164 | 0.017624 | 0.744167 | 0.024806 | 0.005798 | 0.014466 | 0.035080 |
| 2965 | 798 | 813 | 410 | 427 | 805.957265 | 418.205128 | 117.0 | 255.0 | 189.0 | 12.205287 | ... | 0.674815 | 0.011063 | 0.004951 | 0.007691 | 0.025071 | 1.199924 | 0.019671 | 0.004311 | 0.014527 | 0.032441 |
| 2967 | 259 | 270 | 175 | 185 | 263.964286 | 180.571429 | 56.0 | 110.0 | 75.0 | 8.444016 | ... | 0.391089 | 0.013036 | 0.003799 | 0.008179 | 0.020432 | 0.852430 | 0.028414 | 0.004930 | 0.018250 | 0.038239 |
| 2968 | 487 | 499 | 918 | 928 | 492.149254 | 922.970149 | 67.0 | 120.0 | 81.0 | 9.236182 | ... | 0.390738 | 0.013025 | 0.005218 | 0.008042 | 0.026551 | 0.798428 | 0.026614 | 0.006177 | 0.019364 | 0.039643 |
2896 rows × 148 columns
[34]:
import pandas as pd
from scallops.reads import merge_sbs_phenotype
merged_df = merge_sbs_phenotype(
df_labels=df_cells,
df_phenotype=df_phenotype,
df_barcode=pd.read_csv(barcodes_path),
sbs_cycles=iss_image.coords["t"].values,
)
merged_df
[34]:
| mismatch | barcode_Q_mean | barcode_Q_min | barcode_peak | barcode_count | barcode_0 | barcode_Q_mean_0 | barcode_Q_min_0 | barcode_peak_0 | barcode_count_0 | ... | Intensity_MinIntensityEdge_Channel1 | Intensity_MaxIntensityEdge_Channel1 | barcode | sgRNA | gene_symbol | duplicate_prefix | barcode_1_barcode_1 | sgRNA_1 | gene_symbol_1 | duplicate_prefix_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | False | 360.0 | 360.0 | 3272.963489 | 6.0 | AAATGATCT | 360.0 | 360.0 | 3272.963489 | 6.0 | ... | 0.019028 | 0.053559 | AAATGGATCTATGTTCACAA | AAATGGATCTATGTTCACAA | FBXO32 | False | NaN | NaN | NaN | NaN |
| 2 | True | 180.0 | 180.0 | 1098.411341 | 3.0 | AGGATTTAG | 180.0 | 180.0 | 1098.411341 | 3.0 | ... | 0.021912 | 0.036515 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | False | 60.0 | 60.0 | 268.992267 | 1.0 | GCTGTAGAC | 60.0 | 60.0 | 268.992267 | 1.0 | ... | 0.020035 | 0.034806 | GCTGTGAGACCCAGCACCTG | GCTGTGAGACCCAGCACCTG | DHX58 | False | NaN | NaN | NaN | NaN |
| 4 | False | 300.0 | 300.0 | 1994.343939 | 5.0 | GCTCTGCAG | 300.0 | 300.0 | 1994.343939 | 5.0 | ... | 0.019196 | 0.035111 | GCTCTTGCAGGAATTTCCAG | GCTCTTGCAGGAATTTCCAG | USP17L2 | False | NaN | NaN | NaN | NaN |
| 5 | False | 300.0 | 300.0 | 2196.703452 | 5.0 | ACTCAAAAC | 300.0 | 300.0 | 2196.703452 | 5.0 | ... | 0.013596 | 0.031800 | ACTCACAAACAACATTGCTG | ACTCACAAACAACATTGCTG | IL6R | False | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2963 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.016495 | 0.031434 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2964 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.014466 | 0.035080 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2965 | False | 180.0 | 180.0 | 1285.524961 | 3.0 | CTCACTGCA | 180.0 | 180.0 | 1285.524961 | 3.0 | ... | 0.014527 | 0.032441 | CTCACCTGCAATTAGGTGGA | CTCACCTGCAATTAGGTGGA | UBE2E1 | False | NaN | NaN | NaN | NaN |
| 2967 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.018250 | 0.038239 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2968 | False | 60.0 | 60.0 | 64.093509 | 1.0 | GCACGGTGG | 60.0 | 60.0 | 64.093509 | 1.0 | ... | 0.019364 | 0.039643 | GCACGTGTGGCAGATCACGT | GCACGTGTGGCAGATCACGT | SPSB1 | False | NaN | NaN | NaN | NaN |
2896 rows × 173 columns
Post-processing results
After getting all the features and barcode results, we still need to process the data so it is clean and trustworthy.
Keep cells in which top barcode accounts for > 50% of aligned reads
The so-called 50% rule allows you to have more trust in your data. When you have multiple barcodes in a cell, you’d like that the top barcode represents more than 50% of all barcodes in the cell, such that you are sure the main effect in that cell is due to the intended guide. This filtering can easily be done in pandas:
[35]:
filtered_merged_df = merged_df.query("barcode_count_0/barcode_count>0.5")
f"Kept {len(filtered_merged_df):,} / {len(merged_df):,} cells"
[35]:
'Kept 2,146 / 2,896 cells'
Now lets visualize some important stats that allows you to gauge the reliability of the data.
Percent of cells that contain barcode reads
We might want to know how many exact matches are respresented in cells. We can do this by using in_situ_barcode_hist_plot on the df_reads dataframe we computed earlier:
[36]:
import matplotlib.pyplot as plt
from scallops.visualize.histogram import in_situ_barcode_hist_plot
in_situ_barcode_hist_plot(df_reads);
Cellular read distribution categorized by read identity
To visualize how the top barcode reads match the total number of reads, we can use the in_situ_identity_matrix_plot in the df_cells dataframe computed earlier:
[37]:
from scallops.visualize.heatmap import in_situ_identity_matrix_plot
in_situ_identity_matrix_plot(df_cells.rename(columns={"barcode_0": "barcode"}));
Whitelist and Called Barcodes Distribution
Once you have the called reads, you might want to compare them with your white list, especially in composition. We can use pandas and seaborn to that end:
[38]:
import seaborn as sns
from scallops.reads import base_counts
barcode_counts_df = base_counts(
df_barcode, normalize=True
) # compute counts per base per cycle
barcode_counts_df["source"] = "whitelist"
called_barcode_counts_df = base_counts(df_reads, normalize=True)
called_barcode_counts_df["source"] = "called barcodes"
combined_counts = pd.concat((barcode_counts_df, called_barcode_counts_df))
sns.catplot(
combined_counts,
x="t",
y="count",
col="source",
hue="base",
kind="bar",
sharey=False,
);
Fraction of neighbors with same target
You might also be wondering what is the proportion of neighbors have the same guide. This might be interesting to see the potential biases in guide uptake, and to see how uniform the experiment is. We can achieve this with scipy’s KDTree to find nearest neighbors based on coordinates of the filtered dataframe:
[39]:
from scipy.spatial import KDTree
kdt = KDTree(
filtered_merged_df[["AreaShape_Center_Y", "AreaShape_Center_X"]].values
) # Create a K dimensional tree
neighbors = kdt.query_ball_tree(
kdt, r=40
) # Find all pairs of points between self and other whose distance is at most r.
neighbors_df = filtered_merged_df[["gene_symbol"]].copy()
fraction_same_gene = []
number_of_neighbors = []
for i in range(len(neighbors)):
indices = neighbors[i]
n = len(indices) - 1
target_gene = neighbors_df["gene_symbol"].values[i]
n_same = (neighbors_df["gene_symbol"].values[indices] == target_gene).sum() - 1
fraction_same_gene.append(n_same / n if n > 0 else 0)
number_of_neighbors.append(n)
neighbors_df["number_of_neighbors"] = number_of_neighbors
neighbors_df["fraction_same_gene"] = fraction_same_gene
neighbors_df = neighbors_df.sort_values("fraction_same_gene", ascending=False)
neighbors_df
[39]:
| gene_symbol | number_of_neighbors | fraction_same_gene | |
|---|---|---|---|
| 2631 | non-targeting | 8 | 0.875000 |
| 1777 | non-targeting | 7 | 0.857143 |
| 1196 | non-targeting | 6 | 0.833333 |
| 1645 | non-targeting | 11 | 0.727273 |
| 1941 | non-targeting | 10 | 0.700000 |
| ... | ... | ... | ... |
| 2301 | NaN | 3 | -0.333333 |
| 359 | NaN | 2 | -0.500000 |
| 2878 | NaN | 2 | -0.500000 |
| 867 | NaN | 2 | -0.500000 |
| 2826 | NaN | 1 | -1.000000 |
2146 rows × 3 columns
[ ]: