COVID-19 Infection and Lung Segmentation using CT Scans

Kanishka Pandey
8 min read · Nov 22, 2021

Recently, I came across an interesting dataset while searching for project ideas for my end-of-semester Computer Science project assignment. It contains 20 detailed, full-resolution lung CT scans of patients diagnosed with SARS-CoV-2, along with segmentations of the lungs and infections made by medical experts.

2-dimensional render of a CT scan showing the segmentation of healthy and infected pulmonary tissue

Reading through the Python scripts and notebooks people have built with this dataset got me thinking: what could I make with this data that could be applied in real life while also satisfying my academic requirements? After discussing it over multiple Discord meetings with my teammates, we decided to attempt an application that could estimate your chances of being infected with the coronavirus before you rush to a hospital demanding invasive procedures from already overworked medical staff around the world.

Before we started work on how the user would interact with our application, we had to select and train deep learning models on the dataset I found, then test and validate them on new scans so that our program could predict, classify, and segment the infected parts of the lung.

After some research, we found that models designed specifically for medical scan images, like UNet and the newer UNet++, gave much higher mean accuracy than generalized CNN models or machine learning models that work directly on numeric data, such as linear SVMs and logistic regression. So we divided our tasks: I ended up working on CT scans while the others in my group worked on X-rays and the frontend functionality of the final application.

In this post, I will try to explain the basic methodology and steps that go into preparing the raw data for such projects and how image segmentation problems (our first task) can be solved using Python and popular deep learning techniques. Also, I have included snippets of code and outputs wherever possible to help understand the process being followed.

We start by importing the required libraries and downloading the entire dataset to our environment of choice. (I am using Google’s Colab platform for the extra RAM and GPU benefits.)

Downloading the dataset as a .zip
Extracting and accessing the metadata file
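
For reference, here is a minimal sketch of what these two steps could look like in a Colab notebook; the archive and CSV file names are placeholders, not the exact paths in the dataset:

```python
import zipfile
import pandas as pd

# Extract the downloaded archive (file name assumed for illustration)
with zipfile.ZipFile("covid19-ct-scans.zip", "r") as zf:
    zf.extractall("data")

# The metadata CSV lists, for every patient, the paths to the original CT,
# the lung mask, the infection mask, and the combined lung-and-infection mask
metadata = pd.read_csv("data/metadata.csv")
metadata.head()
```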

The metadata of our dataset covers 4 major categories of images for every scan: the original scans, lung masks, infection masks, and combined masks. Running the shape command, we get:

20 full scans, each with a corresponding copy under each of the 4 labels highlighting different aspects of the original image.

After looking through the data, the first major step in every ML/DL problem is analysis and adequate pre-processing, which helps reduce common problems such as bias, code complexity, and long training times. The main steps I followed in the pre-processing stage are:

1. Removing incomplete and faulty images

This is a major step when using a large amount of unverified data from multiple sources, as many of the downloaded images may turn out to be cropped, low-resolution, or unevenly coloured, which can cause problems while training the model. We can interpolate to account for missing numeric data, but faulty image data is usually discarded to avoid problems in the subsequent steps.
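
A minimal sketch of this kind of sanity check (the minimum size threshold and the raw_volumes list are purely illustrative):

```python
import numpy as np

def is_usable(volume, min_side=256):
    """Reject scans that are cropped, too small, or effectively blank."""
    if volume is None or volume.ndim < 2:
        return False
    if min(volume.shape[:2]) < min_side:   # low-resolution or heavily cropped scan
        return False
    if np.ptp(volume) == 0:                # constant intensity, no usable information
        return False
    return True

# raw_volumes is assumed to be a list of already-loaded NumPy arrays
clean_volumes = [v for v in raw_volumes if is_usable(v)]
```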

2. Separate model for empty mask prediction

To ensure that our model does not overcompensate for slices that contain no infection at all, we will create a separate model to detect empty masks in the input CT and train our segmentation model only on slices with visible infection. Having found that 497 slices had completely black infection masks, we exclude these from pre-processing, since we do not want to burden the segmentation model with them.
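
A sketch of how the all-black infection masks can be set aside before training (variable names are illustrative):

```python
# infection_masks and ct_slices are assumed to be aligned lists of 2-D arrays
empty_idx = {i for i, m in enumerate(infection_masks) if m.max() == 0}
print(f"{len(empty_idx)} slices have completely black infection masks")  # 497 in our data

# Train the segmentation model only on slices that actually contain infection;
# a separate model later decides whether a new slice has an empty mask at all.
seg_images = [s for i, s in enumerate(ct_slices) if i not in empty_idx]
seg_masks  = [m for i, m in enumerate(infection_masks) if i not in empty_idx]
```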

3. Use of enhancement algorithms to improve overall performance

Using popular image enhancement techniques on individual scans, we can drastically increase the performance of our model by helping it distinguish between healthy and infected tissue more easily. In this approach I have used Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the contrast between the infected cells intertwined with the surrounding pulmonary tissue.

CLAHE function
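
The function behind that caption can be written as a small OpenCV helper. This is a sketch; the clipLimit and tileGridSize values are assumptions rather than the exact settings used:

```python
import cv2
import numpy as np

def clahe_enhance(img, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Apply Contrast Limited Adaptive Histogram Equalization to one CT slice."""
    # CLAHE in OpenCV expects an 8-bit single-channel image, so rescale first
    img_8bit = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    return clahe.apply(img_8bit)
```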

Using a custom read function for the .nii format of the scans (read_nii) and plotting the original and enhanced scans along with their respective histograms, we can easily see the effect a single function has on separating the part of the image we need for our model.
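
A possible implementation of read_nii and the before/after comparison, assuming the nibabel and matplotlib libraries and reusing the clahe_enhance helper sketched above (the file path is a placeholder):

```python
import nibabel as nib
import matplotlib.pyplot as plt

def read_nii(path):
    """Load a .nii volume and return it as a NumPy array of shape (H, W, num_slices)."""
    return nib.load(path).get_fdata()

volume = read_nii("data/ct_scan_0.nii")          # illustrative path
slice_ = volume[:, :, volume.shape[2] // 2]      # take a middle slice
enhanced = clahe_enhance(slice_)

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
ax[0, 0].imshow(slice_, cmap="bone");     ax[0, 0].set_title("Original")
ax[0, 1].hist(slice_.ravel(), bins=64);   ax[0, 1].set_title("Original histogram")
ax[1, 0].imshow(enhanced, cmap="bone");   ax[1, 0].set_title("CLAHE enhanced")
ax[1, 1].hist(enhanced.ravel(), bins=64); ax[1, 1].set_title("Enhanced histogram")
plt.tight_layout()
plt.show()
```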

4. Cropping the Region of Interest (ROI) using Otsu’s binarization and other methods

The images contain a lot of black space with no infection at all, as well as parts we are not interested in, like the diaphragm below the lungs. These take up valuable RAM and unnecessary computing power. A possible solution is cropping the slices so they only contain the ROI relevant to the problem statement and use case. After cropping, we apply Otsu’s thresholding, which assigns the threshold value dynamically instead of requiring us to choose one by hand, to produce the infection and/or lung masks.
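
A sketch of the crop-then-threshold step with OpenCV; the crop margin is a placeholder, and the input is assumed to already be an 8-bit slice:

```python
import cv2

def crop_and_threshold(slice_8bit, margin=0.15):
    """Crop away the empty border around the lungs, then binarize with Otsu."""
    h, w = slice_8bit.shape
    dy, dx = int(h * margin), int(w * margin)
    roi = slice_8bit[dy:h - dy, dx:w - dx]       # keep only the central ROI

    # Otsu picks the threshold automatically from the image histogram,
    # so we do not have to hand-tune a discrete cut-off value
    _, mask = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return roi, mask
```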

5. Data Augmentation

Neural networks are only as good as the quality and quantity of data you provide them with. In image segmentation and classification problems specifically, if the ratios of images under different labels are skewed, or images under different training labels are too similar, the model can develop bias errors that lead to wildly incorrect classifications. Data augmentation aims to eliminate this problem by using the existing data to create new variations that differ slightly from their source, sensitizing the model to new variables and, in turn, improving its performance on data it has never encountered. It also increases the apparent number of training examples, since the model treats every new variation as a different example. To achieve efficient augmentation on our dataset, we define a pipeline that takes in our existing images and returns a sequence of scan slices after our user-defined transformations have been applied.

The Augmentation Pipeline
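
The pipeline itself could look roughly like this, assuming the albumentations library; the specific transforms, probabilities, and number of copies are illustrative:

```python
import albumentations as A

# Each call produces a slightly different version of the slice and keeps
# the infection mask aligned with the transformed image
augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.ElasticTransform(alpha=1, sigma=50, p=0.3),
])

def augment_pairs(images, masks, copies_per_slice=3):
    aug_images, aug_masks = [], []
    for img, msk in zip(images, masks):
        for _ in range(copies_per_slice):
            out = augment(image=img, mask=msk)
            aug_images.append(out["image"])
            aug_masks.append(out["mask"])
    return aug_images, aug_masks

aug_images, aug_masks = augment_pairs(seg_images, seg_masks)
```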

After execution, we can plot some random slices to see part of the new set of images created by passing our original dataset through the pipeline.

Newly Generated Slices

Our pre-processing stage is now finished. Next, we overlay the infection masks on their corresponding CT scans before creating and running our model on the augmented dataset.

Overlay Function
Scans with infection masks added on top
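
A minimal version of such an overlay function, using matplotlib’s alpha blending (the colour maps are just an example):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_overlay(ct_slice, infection_mask, alpha=0.4):
    """Show a CT slice with its infection mask drawn on top in a contrasting colour."""
    plt.figure(figsize=(6, 6))
    plt.imshow(ct_slice, cmap="bone")
    # Hide the zero (background) pixels of the mask so only the infection shows
    masked = np.ma.masked_where(infection_mask == 0, infection_mask)
    plt.imshow(masked, cmap="autumn", alpha=alpha)
    plt.axis("off")
    plt.show()
```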

Now, we split the data into training and test sets and define the loss function and metrics to be used for our model.
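
A sketch of the split plus a Dice coefficient metric and Dice loss, assuming scikit-learn and TensorFlow/Keras and the augmented arrays produced by the pipeline above:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

X = np.expand_dims(np.array(aug_images, dtype=np.float32), -1) / 255.0  # (N, H, W, 1), scaled to [0, 1]
y = (np.expand_dims(np.array(aug_masks), -1) > 0).astype(np.float32)    # binary infection masks

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def dice_coef(y_true, y_pred, smooth=1.0):
    y_true_f = tf.reshape(y_true, [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)
```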

All we must do now is define, compile, and fit our model on the data, then use the metrics of our choice to evaluate its performance. Tuning the model hyperparameters is essential here, as it can mean the difference between a highly accurate, efficient program and an inaccurate, slower model. The optimal learning rate depends on the topology of your loss landscape, which in turn depends on both your model architecture and your dataset. While the default learning rate set by your deep learning library may give decent results, you can often improve performance or speed up training by searching for an optimal learning rate. Exponentially decaying learning rates and cosine annealing schedulers are popular methods that produce good results.

Setting optimal learning rates
Running UNet on our data
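
Put together, these two steps might look like the sketch below, assuming Keras; build_unet() is a hypothetical helper standing in for the actual U-Net definition, which is omitted here, and the schedule and training parameters are illustrative:

```python
import tensorflow as tf

# Cosine-annealing style schedule; an ExponentialDecay schedule works similarly
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=5000,
)

model = build_unet(input_shape=X_train.shape[1:])   # hypothetical helper returning a Keras U-Net
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=dice_loss,
    metrics=[dice_coef],
)

history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    batch_size=16,
    epochs=30,
)
```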

After fitting, we can look at the model’s predicted infection masks on the test scans.
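
The predictions can be binarized and visualised with the overlay helper from earlier (the 0.5 cut-off is a common default, not a tuned value):

```python
preds = model.predict(X_test)
pred_masks = (preds > 0.5).astype("float32")   # binarize the sigmoid output

for i in range(3):                              # look at a few test slices
    plot_overlay(X_test[i, ..., 0], pred_masks[i, ..., 0])
```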

Further information can be obtained in post-processing by analyzing Dice and IoU scores using the metrics we defined earlier.

Other metrics like precision and recall may also be used to test the performance of the model.
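
A simple post-processing pass over the binarized test predictions, using only NumPy (the helper name is illustrative):

```python
import numpy as np

def evaluate_masks(y_true, y_pred, eps=1e-7):
    """Compute Dice, IoU, precision, and recall for binary masks."""
    y_true = y_true.astype(bool).ravel()
    y_pred = y_pred.astype(bool).ravel()
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    dice      = 2 * tp / (2 * tp + fp + fn + eps)
    iou       = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall    = tp / (tp + fn + eps)
    return {"dice": dice, "iou": iou, "precision": precision, "recall": recall}

print(evaluate_masks(y_test, pred_masks))
```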

As you can see, our model performs surprisingly well on a relatively small number of scans, thanks to the attention paid to scan quality during pre-processing and to augmenting the existing data, which lets the model train on a greater variety of images from a smaller dataset. Further improvements can be made to this approach to CT segmentation by using newer image enhancement techniques and by creating an even larger augmented dataset, either by increasing the number of copies created for each slice or by adding more transformation functions to the augmentation pipeline. Further work is needed to create a UNet++ model that classifies CT scans, indicating whether the patient has COVID-19 or some other pulmonary defect, using the infection masks predicted by the code defined here (Tasks 2 and 3 of our overall project). Finally, everything has to be linked to a Python web framework like Streamlit or Flask to create a user interface that anyone can use as a utility application. You could also try a transfer learning approach using multiple models for the classification task of this problem, which lets you improve every subsequent model by forwarding the errors made by the last generation.

As a third wave of cases has started popping up around the world since the start of winter in the northern hemisphere, the research conducted and the products built by the open-source community to support society and the medical industry are slowly becoming vital to maintaining quality of life around the globe.

The dataset used in this article can be found at:
