Automatically Score Tissue Images Like a Pathologist by Transfer Learning

Cancer is the second leading cause of death in the world. Diagnosing cancer early can save many lives. Pathologists have to examine tissue microarray (TMA) images manually to identify tumors, which can be time-consuming, inconsistent, and subjective. Existing algorithms that automatically detect tumors have either not achieved the accuracy level of a pathologist or require substantial human involvement. A major challenge is that TMA images with different shapes, sizes, and locations can have the same score. Learning staining patterns in TMA images requires a huge number of images, which are severely limited due to privacy concerns and regulations in medical organizations. TMA images from different cancer types may share characteristics that provide valuable information, but using them directly harms accuracy. By selective transfer learning from multiple small auxiliary sets, the proposed algorithm is able to extract knowledge from tissue images showing a "similar" scoring pattern but with different cancer types. Remarkably, transfer learning has made it possible for the algorithm to break the critical accuracy barrier: the proposed algorithm reports an accuracy of 75.9% on breast cancer TMA images from the Stanford Tissue Microarray Database, reaching the 75% accuracy level of pathologists. This will allow pathologists to confidently use automatic algorithms to assist them in recognizing tumors consistently, with higher accuracy, in real time.


Introduction
Cancer continues to affect millions of people, yet no known cure exists, leaving treatments such as surgery, chemotherapy, and radiation therapy as the only options. As early cancer treatment significantly improves survival rates, the need for early diagnosis is imperative. Despite the importance and necessity of early diagnosis, it can take up to weeks before pathology reports are delivered.
The tissue microarray (TMA) is an emerging technology for analyzing tissue samples. It uses thin slices of tissue core samples arranged in an array format in paraffin blocks [16]. Biomarkers are applied, and typically biomarker-specific darker yellow stains appear when tumors are present. TMA images are then captured with a high-resolution microscope. The scoring of TMA images is based on the severity of tumors, and a common system for scoring TMA images uses a scale from 0 to 3, where a score of 0 indicates no tumors and a score of 3 implies very severe tumors. TMA technology makes it possible to efficiently analyze many tissue samples together, thus normalizing conditions for comparative studies. These benefits give TMA the potential to be widely used as an effective technique for diagnosis and prognosis in oncology [21,26].
The diagnosis of tumors, that is, identifying tumors from staining patterns in tissue images manually, is time-consuming and can easily be inconsistent [7,33]. Although structured procedures have been proposed, manual scoring can still be fairly subjective [35,4], and the same pathologist may score the same image differently at different sessions. Because of these issues, a number of algorithms, including ACIS, Ariol, TMALab, AQUA [6] and TACOMA [35], have been developed to automate this process. These algorithms, although transformative, have not been widely adopted due to limits in their capabilities. They require substantive involvement of pathologists in tasks such as TMA image background subtraction, feature segmentation, setting thresholds on hue or pixel intensity, and the provision of representative TMA image patches.
The variability of staining patterns in TMA images and the scarcity of training samples make it particularly challenging to develop an automatic algorithm. Staining patterns in TMA images can have very different sizes, locations, shapes, and colors, despite having the same score. Figure 1 demonstrates this variability. The high variability in staining patterns also increases the need for larger sample sizes when training algorithms: having more images would allow an algorithm to capture more of the variability, leading to more consistent results. However, the number of TMA images is severely limited due to privacy concerns and regulations in medical organizations. Moreover, the TMA images currently available cover over 100 different types of cancer, with very few images available for each individual cancer type. In recent years, deep learning [14] has become the method of choice for many image recognition tasks, but it is not feasible for TMA image scoring due to the small sample size available for a given cancer type. This work aims to develop an automatic algorithm for the scoring of TMA images at the accuracy level of pathologists by augmenting the training sample with transfer learning [8]. The key observation is that some TMA images of a different cancer type may look similar (i.e., have a similar staining pattern) to those of the given cancer type with the same score, thus making transfer learning possible. However, the usual algorithms for transfer learning (many based on deep neural networks [30]) are not applicable, since we do not have a large dataset from similar problems as the basis for knowledge transfer: while there exist many different cancer types, the number of TMA images of each individual type is small. This gives rise to a new transfer learning setting: there are multiple similar learning problems around, but each has only a small training sample, and we wish to design an algorithm that can effectively transfer knowledge from all of the similar learning problems. The approach we take is instance-based transfer learning [8], where we propose to selectively include TMA images of other cancer types with staining patterns similar to those of the cancer type of interest. With an enlarged training set, we can expect improved accuracy on the scoring of TMA images of the original cancer type. Given the huge success of transfer learning in application domains such as natural language processing and image recognition, we expect it to enable the automatic evaluation of cancer tumors in TMA images at the level of pathologists.
The remainder of this paper is organized as follows. In Section 2, we describe our approach and algorithm design. Experiments and results are presented in Section 3, along with a discussion of the connections of our algorithm to recent theoretical developments in transfer learning. Finally, we conclude in Section 4.

Methods
Our approach can be summarized as follows. A type of tumor-specific spatial histogram is used to capture key features characterizing the staining patterns in tissue images; transfer learning is then adopted to increase the training sample size by selectively including images of other cancer types. The enlarged training set is input to Random Forests [2] to re-fit the classification model, and the results are reported. The goal is to achieve or exceed the accuracy level of pathologists.
One major challenge in the automatic scoring of TMA images is the variability in the staining patterns, which may have different shapes, sizes, and locations despite having the same score. The variability in staining patterns is encoded by spatial histograms, following work in [35]. The spatial histogram also greatly reduces the data dimensionality, another challenge in the automatic scoring of TMA images caused by the large image size (i.e., 1504 × 1440). We formulate the scoring of TMA images as a classification problem, and use Random Forests (RF) [2] as the classifier for its built-in feature selection capability. To implement instance-based transfer learning, our main effort lies in evaluating whether an image of another cancer type is conformal to the hypothesis (or decision rule) induced by images of the original cancer type. Figure 2 shows the overall architecture of our approach. Note that, for illustration purposes, we use 'ER', the name of a biomarker for breast cancer, to indicate the cancer type of interest, while using 'NMB', 'CK56', or 'CD117' for other cancer types. In the rest of Section 2, we present a detailed description of the spatial histogram, RF, transfer learning, and the algorithmic implementation of our approach.

The spatial histogram
The staining patterns on TMA images can vary highly, even for those with the same score. Thus, it is desirable to look for image features that are relatively stable across images of the same score. The feature we use is the spatial histogram matrix (or gray level co-occurrence matrix in the remote sensing literature [17]), which is commonly used for textured images. It is a suitable image feature because TMA images belong to one of the two major classes of textured images: images taken from very far away (for example, remote sensing images) and those taken under a high-resolution microscope. Indeed, the spatial histogram has been used in the literature for TMA images [35] and fares well.
Similar to the conventional histogram, the spatial histogram is a collection of counting statistics about pairs of adjacent (or neighboring) pixels in an image, represented as a matrix. In generating a spatial histogram, we only consider neighboring pixels that have a predefined spatial relationship. That is, the two pixels need to have a given pair of pixel values and be neighbors along a certain direction. A typical choice for the direction is the 45° direction in the image grid. Figure 3 shows an example of such neighboring pixels, where the pairs of pixels of interest are marked by green rectangles and both have gray value 1. The counting statistic is the total number of times such a pair occurs in the image along the 45° direction, and this gives the value of the (1,1)-entry of the spatial histogram matrix. Similarly, pairs of pixels along the same direction but with pixel values i and j, respectively, produce the counting statistic for the (i,j)-entry of the matrix, and so on.
One can choose to use a different spatial relationship, such as pairs along the 0° or 135° direction in the image grid, and the difference in TMA image scoring would be small. One can also extend the distance between the neighboring pixels. The default distance is 1, indicating immediate neighbors. The two example pairs of pixels in Figure 3 both have distance 1, while a distance of 2 or larger would capture a larger staining pattern. In this work, a spatial distance of 2 is used.
One nice feature of the spatial histogram is dimension reduction. The TMA images are large in size; those from the Stanford TMA image database [24] have a size of 1504 × 1440. Directly working with such images would require enormous computing power and memory and, worse still, would lead to the curse of dimensionality, as each image would be treated as a huge vector of dimension more than 2 million (i.e., 1504 × 1440). As the gray values of TMA image pixels range between 0 and 255, the spatial histogram reduces the data dimension to 256 × 256, much smaller than that of the original image. One step further is to apply a quantization to the image gray values, which leads to an even smaller spatial histogram matrix. We follow work in [35] and apply a linear quantization of the gray values into 51 levels. That is, a gray value of 5x + y, for 0 ≤ x ≤ 50 and 0 ≤ y ≤ 4, is transformed to x + 1 (255 is converted to 51 for simplicity). Thus the spatial histogram matrix now has dimension 51 × 51, and computation can be done very efficiently. Importantly, this makes the resulting image features stable against small variations due to varying physical conditions such as lighting or illumination.
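The quantization and counting steps above can be sketched in a few lines. The following is a minimal illustration (the function name and NumPy layout are ours, not from the paper), counting co-occurrences of quantized gray values along the 45° direction at a given pixel distance:

```python
import numpy as np

def spatial_histogram(img, levels=51, distance=2):
    """Counts of quantized gray-value pairs along the 45-degree direction
    at the given pixel distance; levels are 0-indexed here (the paper's
    x + 1 minus 1)."""
    q = np.minimum(img // 5, levels - 1)   # linear quantization: 0..255 -> 0..50
    h = np.zeros((levels, levels), dtype=np.int64)
    # the 45-degree neighbor of pixel (r, c) at distance d is (r - d, c + d)
    a = q[distance:, :-distance]           # value at pixel (r, c)
    b = q[:-distance, distance:]           # value at its 45-degree neighbor
    np.add.at(h, (a.ravel(), b.ravel()), 1)
    return h
```

For a 1504 × 1440 image, this reduces more than two million pixel values to a 51 × 51 matrix of counts.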

Random Forests
Random Forests (RF) [2] is used as the classification engine of our TMA image scoring algorithm. RF is an ensemble of decision trees, each grown by recursively splitting the data. At each node split, RF randomly samples a number, mtry, of features and selects the one leading to the best partition of that node, according to some criterion. Each tree progressively narrows down the decision for an instance. The node splitting continues until there is only one point in the node (for classification) or the node is pure (i.e., all the points in the node have the same label). At classification time, an instance receives a vote from each tree in RF, and the final decision given by RF is a majority vote over the class labels, according to the number of votes each label gets from all trees.
Many studies have reported excellent performance of RF [2,9]. For TMA image classification, previous studies [19,35] also show that RF outperforms competitors like SVM [10] or boosting [13]. Compared to its competitors, RF scales well against some of the main challenges in TMA image scoring, including high dimensionality and label noise, thanks to its strong feature selection ability and ensemble nature. RF is easy to use, with very few tuning parameters: often one just needs to set the number of trees and the mtry parameter.
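The paper's experiments use the R package randomForest; purely as an illustration, a roughly equivalent setup in Python's scikit-learn (where mtry corresponds to the `max_features` parameter) might look as follows, here on synthetic stand-in data rather than real spatial histograms:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
p = 51 * 51                              # flattened 51x51 spatial histogram
X = rng.random((200, p))                 # synthetic stand-in features
y = rng.integers(0, 4, size=200)         # TMA scores 0..3
y[:4] = np.arange(4)                     # ensure all four scores appear

# mtry ~ max_features; sqrt(p) is a common choice, with T = 100 trees
rf = RandomForestClassifier(n_estimators=100,
                            max_features=int(np.sqrt(p)),
                            random_state=0).fit(X, y)
votes = rf.predict_proba(X[:1])          # per-class vote fractions for one image
```

The per-class vote fractions returned by `predict_proba` are what the transfer step later uses to measure prediction confidence.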

Transfer learning
Transfer learning [8] is an emerging learning paradigm that addresses the problem of insufficient training data when there is a large set of auxiliary data (called the auxiliary set) that entails knowledge helpful for solving the original problem. Transfer learning algorithms can be classified as instance-based, mapping-based, representation-based, or feature-based [30]. Instance-based transfer learning [11] transfers knowledge by enlarging the original training set with instances from the auxiliary set that are consistent with the hypothesis entailed by the original training set (such instances are called transferable). Mapping-based transfer learning [32] learns a semantically sensible invariant representation across the original and auxiliary sets. Feature-based approaches [23] learn features that help the learning of the original problem. Representation-based approaches [15,25] try to find representations that can be transferred. Recently, deep transfer learning [14] has become very popular and achieves impressive performance in a number of domains, for example large natural language processing systems such as BERT [12] and GPT-3 [3], and pre-trained image models [31]. The literature on transfer learning is enormous, and we can only mention a few works here; more discussion can be found in [27,30,1] and the references therein.
The lack of a large auxiliary dataset makes transfer learning particularly challenging for TMA image scoring. The big family of transfer learning algorithms based on deep neural networks is not applicable here, due to its reliance on training large deep networks, which inevitably requires a huge training set. In the scoring of TMA images, there are typically multiple auxiliary sets available, as images from a number of other cancer types can look very similar to those of the cancer type of interest. However, none of the auxiliary sets is large enough for typical deep-neural-network-based approaches to be feasible. Thus we have a new problem setting for transfer learning, in which we wish to enable knowledge transfer from multiple small auxiliary sets.
The approach we take is instance-based transfer learning. From each auxiliary set, we try to identify TMA images that are consistent with the original hypothesis (roughly, images that look similar to some in the original set while having the same label). Clearly, we do not require a large auxiliary set to achieve this. We then add those transferable images to enlarge the original training set. For a small training set, increasing its size will likely improve performance on the test set. In Figure 4, the left 3 columns show images for the target cancer type, breast cancer (indicated by ER), where each row corresponds to a different score. The right columns show images from different cancer types (marked by NMB, CK56 and CD117, respectively) that look similar, to a certain extent, to the breast cancer images with the same score.

An algorithmic description
In our algorithm, transfer learning is implemented as a function, tmaTransfer(), which finds transferable images in a given auxiliary set. The main function, tmaScore(), calls tmaTransfer() to obtain transferable images from other cancer types, fits a model on the enlarged training set via RF, and then reports final results on the test set.
tmaTransfer() is implemented as follows. We first fit a classification model, M0, by RF on the original training set T0. Then we apply M0 to the set of auxiliary images, W. If the predicted label for an image W ∈ W is the same as its given label (the labels for images in the auxiliary set are known, since they come with the TMA images in the Stanford database), then we say that image W (along with its label) is consistent with the original model hypothesis M0. As there might be label noise for image W, we only transfer images that are predicted more confidently. The confidence in prediction can be estimated from the number of votes W receives on different class labels (or scores) from trees in RF. If the majority class gets substantially more votes than the other classes, then we say the instance is predicted with high confidence. Here, the majority class is the one that receives the most votes. An easy way to estimate the confidence is to use the difference in the fraction of votes received by the top and the second majority class (the class that receives the second most votes). Let n1(W) and n2(W) be the number of votes W gets on the top and second majority class, respectively, and let T be the number of trees in RF. We estimate the confidence in predicting instance W as

β(W) = (n1(W) − n2(W)) / T.

So β(W) has a value in the range [0, 1], and a TMA image W in the auxiliary set is selected if it is classified with a confidence larger than a predetermined level β0. The choice of β0 seeks to include TMA image instances that are valuable to the original problem. Singh and his coauthors [29] study the value of data points in the semi-supervised classification problem, and find that data points along the decision boundary barely help, while the value of a data point increases when the data point is slightly away from the boundary. Our definition of confidence aims to avoid data points that are on or very close to the decision boundary (such data points would be classified with very low confidence), while including data points that are slightly away from it. Note that too large a value of β0 is not desirable either, as that would cause the inclusion of only data points very far away from the decision boundary. Let F denote the transferable set (the set of images transferred from the auxiliary sets). The tmaTransfer() function is implemented as Algorithm 1.
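The margin-based confidence above is simple to compute from the per-class vote counts of a Random Forest. A minimal sketch in Python (the function names are ours, for illustration only):

```python
import numpy as np

def confidence(votes, n_trees):
    """beta(W) = (n1(W) - n2(W)) / T: the margin between the vote counts
    of the top and second majority classes."""
    top2 = np.sort(votes)[::-1][:2]
    return (top2[0] - top2[1]) / n_trees

def transferable(votes, given_label, n_trees, beta0=0.10):
    """An auxiliary image is transferable when the model predicts its
    given label and does so with confidence above beta0."""
    predicted = int(np.argmax(votes))
    return bool(predicted == given_label
                and confidence(votes, n_trees) > beta0)
```

For example, with T = 100 trees and vote counts (70, 20, 5, 5), β(W) = 0.5, well above a threshold of β0 = 10%; vote counts (51, 49, 0, 0) give β(W) = 0.02, so the image would not be transferred.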

Algorithm 1 tmaTransfer()
1: Initialize the transferable set F ← ∅;
2: while there are unprocessed images in W do
3: Take the next image W ∈ W;
4: Apply the original model M0 to W;
5: if the predicted label on W is different from its given label then
6: Skip to the next round of the loop;
7: end if
8: Calculate the prediction confidence β(W) for image W;
9: if β(W) > β0 then
10: Add W to the transferable set, F ← F ∪ {W};
11: end if
12: end while
13: return(F);

To describe the algorithm for the main function tmaScore(), assume for simplicity that the other cancer types used for transfer learning are associated with the biomarkers NMB, CK56 and CD117. Let the sets of auxiliary images for these cancer types be denoted by Wnmb, Wck56, Wcd117, respectively. Function tmaScore() first fits a prediction model, M0, on the original training set using RF. Then it identifies the transferable set from each auxiliary set in {Wnmb, Wck56, Wcd117}. It adds the transferable sets to the original training set T0, re-fits the prediction model, then applies it to the test set and reports results. The main function is implemented as Algorithm 2.
Algorithm 2 tmaScore()
1: Let the number of trees in Random Forests be T;
2: Apply RF to the original training set T0;
3: Let the fitted Random Forests model be M0;
4: Pick a predefined confidence level β0;
5: Initialize the transferable set F ← ∅;
6: for W in {Wnmb, Wck56, Wcd117} do
7: Apply transfer learning (Algorithm 1) to image set W, obtaining Ft;
8: Add Ft to the transfer set, F ← F ∪ Ft;
9: end for
10: Add the transferable set F to the original training set T0 and re-fit RF;
11: Apply the re-fitted model to the test set Ts and report accuracy;
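Assuming images are represented by their flattened spatial histograms, the two functions can be sketched in Python with scikit-learn (function names are ours; note that scikit-learn's `predict_proba` averages per-tree class probabilities rather than counting hard votes, so the margin below plays the role of β(W)):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def tma_transfer(model, X_aux, y_aux, beta0=0.10):
    """Sketch of Algorithm 1: keep auxiliary images whose given score
    matches the model's prediction with vote margin above beta0."""
    proba = model.predict_proba(X_aux)           # per-class vote fractions
    order = np.sort(proba, axis=1)
    margin = order[:, -1] - order[:, -2]         # plays the role of beta(W)
    agree = model.classes_[np.argmax(proba, axis=1)] == y_aux
    keep = agree & (margin > beta0)
    return X_aux[keep], y_aux[keep]

def tma_score(X0, y0, aux_sets, X_test, y_test, n_trees=100, beta0=0.10):
    """Sketch of Algorithm 2: fit on the original set, enlarge it with
    transferable instances from each auxiliary set, re-fit, and report
    test-set accuracy."""
    m0 = RandomForestClassifier(n_estimators=n_trees,
                                random_state=0).fit(X0, y0)
    Xs, ys = [X0], [y0]
    for X_aux, y_aux in aux_sets:
        Xt, yt = tma_transfer(m0, X_aux, y_aux, beta0)
        Xs.append(Xt)
        ys.append(yt)
    m1 = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    m1.fit(np.vstack(Xs), np.concatenate(ys))
    return m1.score(X_test, y_test)
```

The auxiliary sets are passed as a list of (features, labels) pairs, one per biomarker, mirroring the loop over {Wnmb, Wck56, Wcd117} in Algorithm 2.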

Experiments and results
We conduct experiments using TMA images from the Stanford TMA image database [24], available at https://tma.im/tma_portal/. The cancer type we choose to work with is breast cancer, as it is one of the best understood cancer types to date. The associated biomarker is the estrogen receptor (ER). There are 690 images in total for ER in the database, and the training and test sets are split evenly. The reported results are averaged over 100 runs.
TMA images in the Stanford TMA database come from several dozen different cancer types. One could use TMA images from all other cancer types, but we take a more conservative approach. Transfer learning requires the training set to be similar to the auxiliary set. We browse through the Stanford TMA image database and determine that TMA images associated with the biomarkers NMB, CK56, and CD117 are visually, to a certain extent, similar to TMA images for ER.
We use the R package randomForest for our experiments. The tuning parameter mtry is chosen from {√p, 2√p}, where p is the input data dimension. Since we work on the spatial histogram matrix of TMA images, we have p = 51 · 51 = 2601. The number of trees in RF is fixed at T = 100. The confidence level β0 for instance transfer is set to 10%, meaning that only instances whose top majority class leads the second majority class by at least 10% of the votes (out of T = 100 trees) are considered transferable.

Results
The evaluation metric is the test set accuracy, i.e., the percentage of test images whose predicted class label agrees with the given one (that is, the label that comes with the database).
The results are shown in Figure 5. The accuracy achieved with transfer learning over the auxiliary sets associated with NMB, CK56 and CD117 is 75.9%, outperforming the algorithm without transfer learning (shown as the first bar in the figure). The accuracy of pathologists is estimated to be around 75-84% [35], so transfer learning allows our algorithm to reach the level of pathologists. It is interesting to note that the achieved accuracy increases progressively as we apply transfer learning over more auxiliary sets, e.g., over Wnmb only, over the two auxiliary sets {Wnmb, Wck56}, and over the three auxiliary sets {Wnmb, Wck56, Wcd117}. We also conduct experiments that simply combine the training set of TMA images for ER with all images in the auxiliary sets for NMB, CK56 and CD117, respectively. This actually leads to a decrease in accuracy compared to that without transfer learning, as shown in the second through the fourth bars in Figure 5. Although directly combining data from the auxiliary sets greatly enlarges the training set, it also makes the data much more heterogeneous and thus more challenging for classification, as the model now has to accommodate images from different sub-populations within the same class label. In comparison, transfer learning with our approach over images from other cancer types, even with different distributions, improves the accuracy when we properly control the confidence level (so that transferred instances are conformal to the original hypothesis).

Understanding the transfer learning scheme
We adopt instance-based transfer learning, and the particular scheme we propose improves the accuracy of TMA image classification. Our algorithm can be understood through recent theoretical developments in transfer learning, and is also supported empirically by experiments.
Transfer learning typically requires similarity in distribution between the original and the auxiliary set. Recent work towards understanding transfer learning focuses mainly on relaxing this ideal condition along two lines. One is the covariate-shift model, where the marginal distributions of the original and the auxiliary data are different but their induced decision rules are similar. Here the marginal distribution refers to the probability distribution of the TMA images or their spatial histograms, while the induced decision rule is the rule that decides which class (score) a TMA image gets given its pixel values or spatial histogram. Kpotufe and Martinet [22] study, under the covariate-shift model, how much the target performance is impacted by the sample size and the difference between the original and the auxiliary distributions. The other is the posterior-drift model, where the marginal distributions of the original and the auxiliary data are similar but their induced decision rules may be very different. Cai and Wei [5] obtain the speed of convergence of the estimated decision rule to its limit in terms of the difference in the induced decision rules between the target and auxiliary data. For the scoring of TMA images, clearly the distributions of the original and the auxiliary data are different, and so are the induced decision rules. Our approach can be viewed as trying to satisfy the assumption of the covariate-shift model: it finds a subset of the auxiliary data such that the induced decision rule agrees with that from the original data. This is achieved by selecting from the auxiliary set those TMA images whose given label matches the one predicted under the decision rule learned from the original data (i.e., images conformal to the original hypothesis). This effectively overcomes the difficulty of requiring a similar induced decision rule between the original and the auxiliary data. Thus, our approach gives a solution to the challenging problem of enabling knowledge transfer from multiple small auxiliary sets, each inducing a potentially very different decision rule from that on the original data.
Next, we conduct some experiments. We first produce a visualization of the original training set (corresponding to breast cancer) and the set enhanced by transfer learning from other cancer types, including those associated with NMB, CK56 and CD117. Each image in the training set corresponds to a point in a high-dimensional space, and the points are plotted along the first and second components of a principal component analysis [18] of the data.
From Figure 6, it can be seen that for the enhanced training set, the separation between points becomes larger. In particular, the orange points now become visible (previously they were mostly hidden among points with other colors or labels); some blue points are also better separated from the green and red point clouds. A better separation between classes makes the classification task easier, so better accuracy can be expected. Indeed, we can get a more precise characterization of the amount of class separation by the class separation ratio

ρ = Σ_{all pairs of classes i,j} (SSW_i + SSW_j) / SSB_{i,j},

which is the ratio of the total within-class distances to the total between-class distances, calculated over all pairs of classes, where SSW_i and SSB_{i,j} are defined as

SSW_i = Σ_{a,b both with label i} distance(a, b),    SSB_{i,j} = Σ_{a with label i, b with label j} distance(a, b),

and distance(a, b) is the Euclidean distance between points a and b. The class separation ratio ρ is commonly used in the clustering literature [20,34] as a measure of clustering quality, so it hints at the difficulty of separating different classes. A smaller value of ρ indicates that the within-class distances are small relative to the between-class distances, thus a better separation of classes. The mean class separation ratio over 100 runs is 15.6668 on the original training set and 12.1379 on the transfer-learning-enhanced one. This implies that the enhanced training set is better separated among data from different classes. This is consistent with our visualization and experimental results, thus giving empirical support to the transfer learning scheme we propose. Further experiments are expected to better understand this, which we leave for future work.
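The class separation ratio can be computed directly from its definition; a small sketch in Python (function names are ours; each unordered within-class pair is counted once):

```python
import numpy as np
from itertools import combinations

def total_distance(A, B):
    """Sum of Euclidean distances between all rows of A and all rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).sum()

def class_separation_ratio(X, y):
    """rho = sum over class pairs (i, j) of (SSW_i + SSW_j) / SSB_ij."""
    labels = np.unique(y)
    # within-class: divide by 2 so each unordered pair is counted once
    ssw = {i: total_distance(X[y == i], X[y == i]) / 2 for i in labels}
    return sum((ssw[i] + ssw[j]) / total_distance(X[y == i], X[y == j])
               for i, j in combinations(labels, 2))
```

Tighter clusters that sit farther apart yield a smaller ρ, matching the interpretation above that a smaller ratio indicates better class separation.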
Conclusion

We proposed a selective, instance-based transfer learning scheme for the scoring of TMA images, designed so that the covariate-shift model applies, i.e., the selected subset from the auxiliary set is conformal to the original hypothesis.
Empirically, experiments have also been carried out to understand the algorithm. Data visualization shows that our algorithm increases the class separation, and a larger class separation often makes the classification problem easier, so improved accuracy can be expected. This is corroborated by the empirical class (cluster) separation ratio, a commonly used cluster quality measure: the enhanced training set has a better class separation ratio. One possibility for future work is to explore how to exclude unnecessary parts (or patterns) of TMA images, or to find the most important features in TMA images, to further increase the accuracy. Additionally, different notions of confidence may be explored for instance transfer, for example those using the concept of conformal classification [28].

Figure 1 :
Figure 1: The staining patterns vary highly across TMA images. Images with the same score can look drastically different.

Figure 2 :
Figure 2: Overall architecture of our approach.

Figure 3 :
Figure 3: Illustration of a toy image and its spatial histogram matrix. The left panel shows the original image, where the numbers are the gray values; the right panel is the resulting spatial histogram matrix, with dimension 4 × 4. The two diagonally (i.e., along the 45° direction) neighboring pixels that both have gray level 1 occur twice in the image, so the (1,1)-entry of the spatial histogram matrix has the value 2.

Figure 4 :
Figure 4: Transferable images from other cancer types. The left 3 columns are TMA images for breast cancer, indicated by the associated biomarker estrogen receptor (ER). The right 3 columns are TMA images for cancer types indicated by the biomarkers NMB, CK56 and CD117, respectively, which have a similar appearance to those for ER with the same label.

Figure 5 :
Figure 5: Comparison of accuracy. 'T' and 'NT' stand for with and without transfer learning, respectively. Cancer types I, II, III, IV indicate TMA images associated with the biomarkers ER, NMB, CK56, and CD117, respectively.

Figure 6 :
Figure 6: The original and transfer-learning-enhanced training sets visualized along their first and second principal directions via principal component analysis. The top panel shows the original training set; the bottom panel shows the training set enhanced by transfer learning. Different colors correspond to TMA images with different class labels (or scores).
