Deep Learning in neuroscience using HPC systems
Author: Christian Schiffer (firstname.lastname@example.org)
Who we are and what we do
The INM-1 (Institute for Neuroscience and Medicine - Functional and Structural organization of the brain) works on the creation of a three-dimensional multimodal and multilevel high-resolution atlas of the human brain. A brain atlas is a three-dimensional map of the brain, which incorporates information from different imaging techniques or studies in a common reference space. One important aspect of such an atlas is the analysis of cytoarchitectonic cortical brain areas. Cytoarchitectonic areas are specific regions in the outer surface of the brain - the cortex - which differ by their cytoarchitectonic structure. For example, different cytoarchitectonic areas differ in density, size, distribution or shape of the nerve cells (neurons).
To analyze cytoarchitectonic properties in different brains, the INM-1 cuts postmortem human brains into thin (20 micron) histological sections. An average adult human brain cut along the coronal plane ("parallel to the face") yields about 7000 histological sections. After cell-staining these sections to make individual nerve cells (neurons) visible, they are scanned using high-resolution and high-throughput light microscopes. As a result, we obtain one image with an in-plane resolution of 1 micron/pixel for each histological brain section. The resulting data volume of multiple terabytes necessitates the use of HPC systems for processing and analyzing the data.
Established semiautomatic methods (Schleicher et al. 1999) for boundary identification between cytoarchitectonic areas are precise, but they rely on a coarse localization of a region of interest and on delineations of the inner and outer boundaries of the cortical ribbon. Since these preprocessing steps are time-consuming and require profound expert knowledge, the currently used methods cannot keep up with the steadily increasing amount of data originating from high-throughput microscopy. To speed up this time- and labor-intensive process, members of the Big Data Analytics (BDA) group at the INM-1 are developing automated approaches based on deep convolutional neural networks. The team working on automated cytoarchitectonic mapping currently includes Hannah Spitzer and Christian Schiffer.
Schleicher, A., K. Amunts, S. Geyer, P. Morosan, and K. Zilles. 1999. “Observer-Independent Method for Microstructural Parcellation of Cerebral Cortex: A Quantitative Approach to Cytoarchitectonics.” NeuroImage 9 (1): 165–77. https://doi.org/10.1006/nimg.1998.0385.
The image data our deep learning methods operate on is acquired with high-resolution light microscopes. Scanning one histological brain section yields one image with an in-plane resolution of 1 micron/pixel. The size of the images varies with the size of the scanned brain section: sections from the center of the brain tend to be larger than sections from the poles. On average, images measure 120,000 x 80,000 pixels, which results in a file size of about 10 Gigabyte per image at 8-bit color depth. A complete human brain with about 7000 sections has a total data volume of around 50 Terabyte.
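The per-image figure can be checked with a short calculation. This is only a sketch: it assumes one 8-bit channel (1 byte per pixel) and no compression or pyramid overhead, and the function name is made up for illustration.

```python
def uncompressed_size_bytes(width_px, height_px, bytes_per_pixel=1):
    """Raw size of an uncompressed image.

    bytes_per_pixel=1 assumes a single 8-bit channel (an assumption,
    not something the scanner format is confirmed to use).
    """
    return width_px * height_px * bytes_per_pixel


# Average image size quoted in the text: 120,000 x 80,000 pixels.
size = uncompressed_size_bytes(120_000, 80_000)
print(f"{size / 1e9:.1f} GB")  # 9.6 GB
```

The result of 9.6 GB matches the "about 10 Gigabyte" quoted above.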
Scanned images are stored uncompressed as BigTIFF files. As opposed to regular TIFF files, BigTIFF files can hold data with a total size of over 4 Gigabyte. In addition to the original data coming from the scanner, a so-called image pyramid is stored inside the file. Each level of the pyramid contains a downsampled version of the original image. This increases the file size, but it makes working with the image files much easier. While a full image is typically very hard to open and view on a regular workstation due to memory limitations, the downscaled pyramid levels can be used as preview images or for applications which do not require full resolution data (e.g. image registration).
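The pyramid idea can be sketched with NumPy. This is an illustration under assumptions, not the scanner software's actual method: block averaging is one possible downscaling scheme, the factor of 4 per level follows the default pyramid layout of these files, and the function names and the small random test image are made up.

```python
import numpy as np


def downscale(image, factor):
    """Downscale a 2D image by averaging non-overlapping factor x factor blocks.

    Rows and columns that do not fill a complete block are cropped.
    The averaged values are truncated when cast back to the input dtype.
    """
    h, w = image.shape
    h_crop, w_crop = h - h % factor, w - w % factor
    blocks = image[:h_crop, :w_crop].reshape(
        h_crop // factor, factor, w_crop // factor, factor
    )
    return blocks.mean(axis=(1, 3)).astype(image.dtype)


def build_pyramid(image, factor=4, num_levels=3):
    """Return a list of images; level 0 is the original, each further level is coarser."""
    levels = [image]
    for _ in range(num_levels - 1):
        levels.append(downscale(levels[-1], factor))
    return levels


# Small random stand-in for a scanned section (real images are far larger).
image = np.random.default_rng(0).integers(0, 256, size=(1024, 768), dtype=np.uint8)
pyramid = build_pyramid(image)
for level, data in enumerate(pyramid):
    print(level, data.shape)
```

With a factor of 4 per axis, each level holds 1/16 of the pixels of the previous one, so the extra levels add only a few percent to the file size.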
BigTIFF images are not directly used for neural network training. Instead, they are converted to HDF5 prior to processing. This has several reasons:
- HDF5 files can easily store different kinds of metadata associated with the files. For example, we may save the physical spacing of the image, the number of the brain, the section number or even a complete history of preprocessing steps which have been applied to the image.
- One HDF5 file can contain multiple types of data. For example, it can store an image, a foreground mask of the image (with the same shape) and some annotations, which can be saved as list of x-y tuples.
- Specifically, we use the multi-dataset feature of HDF5 to port the image pyramid principle to HDF5, by storing each pyramid level as a separate dataset.
- HDF5 files can be easily accessed from Python using the h5py module. There is also a variety of command line tools to analyze or modify HDF5 files.
- In combination with MPI, HDF5 offers parallel read and write access from multiple processors at once.
- Data can be automatically compressed, which can greatly reduce the required storage space, especially for masks or label images.
- The storage layout of HDF5 datasets can be adjusted to offer optimal performance for a specific use case. Sequential file access may benefit from a contiguous layout, while random access may benefit from a chunked layout.
- While the original images have a resolution of 1 micron/pixel, our deep learning models operate at a resolution of 2 micron/pixel. The latter is not available in the default image pyramid, where the pixel spacing grows by a factor of 4 from level to level (1 micron/pixel, 4 micron/pixel, 16 micron/pixel, ...). We would therefore have to read parts of the images at full resolution and downscale them to 2 micron/pixel, which means reading four times as much data as we actually need. To prevent this, we choose 2 micron/pixel as the root resolution of our HDF5 files, which dramatically speeds up file access during training and prediction.
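Several of the points above can be sketched with h5py. This is a minimal illustration, not our actual file layout: the group and dataset names, the attribute names, the chunk size, the factor of 2 between pyramid levels and the small random image are all assumptions chosen for the example. The file is created in memory (driver="core" with backing_store=False), so nothing is written to disk.

```python
import h5py
import numpy as np


def halve(image):
    """Downscale a 2D image by 2 per axis by averaging 2 x 2 blocks (illustrative)."""
    h, w = image.shape
    blocks = image[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)  # stand-in for a 2 micron/pixel image
mask = (image > 128).astype(np.uint8)                          # stand-in foreground mask
annotations = np.array([[10, 20], [300, 400]], dtype=np.int64)  # list of x-y tuples

with h5py.File("section_sketch.h5", "w", driver="core", backing_store=False) as f:
    # Metadata as file attributes (hypothetical names and values).
    f.attrs["brain_id"] = "example-brain"
    f.attrs["section_number"] = 1234
    f.attrs["history"] = "converted from BigTIFF; downscaled to 2 micron/pixel"

    # One dataset per pyramid level; the root level has 2 micron/pixel spacing.
    pyramid = f.create_group("pyramid")
    level, spacing = image, 2.0
    for i in range(3):
        ds = pyramid.create_dataset(
            f"{i:02d}", data=level, chunks=(128, 128), compression="gzip"
        )
        ds.attrs["spacing_um"] = spacing
        level, spacing = halve(level), spacing * 2

    # Other data types live next to the image in the same file.
    f.create_dataset("mask", data=mask, chunks=(128, 128), compression="gzip")
    f.create_dataset("annotations", data=annotations)

    level_shapes = [pyramid[name].shape for name in sorted(pyramid)]
    level_spacings = [float(pyramid[name].attrs["spacing_um"]) for name in sorted(pyramid)]

    # Random access: only the chunks overlapping this crop are read and decompressed.
    crop = f["pyramid/00"][100:228, 50:178]

print(level_shapes)
print(level_spacings)
print(crop.shape)
```

A chunked layout suits this kind of random crop access, which is the typical pattern when sampling training patches. For purely sequential reads, a contiguous dataset (no chunks and no compression arguments) may be the better choice.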
Distributed Deep Learning on HPC
Coming soon @schiffer1