6-2012

Smart Memory Synthesis for Energy-Efficient Computed Tomography Reconstruction

Qiuling Zhu
*Carnegie Mellon University*

Larry Pileggi
*Carnegie Mellon University, pileggi@ece.cmu.edu*

Franz Franchetti
*Carnegie Mellon University, franzf@ece.cmu.edu*

---


This Conference Proceeding is brought to you for free and open access by the Carnegie Institute of Technology at Research Showcase @ CMU. It has been accepted for inclusion in Department of Electrical and Computer Engineering by an authorized administrator of Research Showcase @ CMU. For more information, please contact research-showcase@andrew.cmu.edu.
Abstract—As nanoscale lithography challenges mandate greater pattern regularity and commonality for logic and memory circuits, new opportunities are created to affordably synthesize more powerful smart memory blocks for specific applications. Leveraging the ability to embed logic inside the memory block boundary, we demonstrate the synthesis of smart memory architectures that exploit the inherent memory address patterns of the backprojection algorithm to enable efficient image reconstruction at minimum hardware overhead. An end-to-end design framework in sub-20nm CMOS technologies was constructed for the physical synthesis of smart memories and exploration of the huge design space. Our experimental results show that customizing memory for computerized tomography parallel backprojection can achieve more than 30% area and power savings with marginal sacrifice of image accuracy.

Index Terms—Smart Memory; Hardware Synthesis; Computed Tomography; Parallel Backprojection;

I. INTRODUCTION

Computationally intensive algorithms in medical image processing (e.g., computerized tomography (CT)) require rapid processing of large amounts of data and often rely on hardware acceleration [1], [8], [2]. Inherent parallelism in the algorithms is exploited to achieve the required performance by increasing the number of parallel functional units at a cost of power and area. The overall performance is often defined by the limited bandwidth of the on-chip memory as well as the high cost of memory access.

One approach to address these challenges is to optimize the on-chip memory organization by constructing a customized smart memory module that is optimized for a particular function for higher performance and/or energy efficiency [11], [10]. Recent studies of sub-20nm CMOS design indicate that memory and logic circuits can be implemented together using a small set of well-characterized pattern constructs [5], [6]. Our early silicon experiments in a commercial 14nm SOI CMOS process demonstrate that this construct-based design enables logic and bitcells to be placed in much closer proximity without yield or pattern-hotspot concerns. Moreover, such restrictive patterning enables the synthesis (not just compilation) of customized memory blocks with user control of flexible SRAM architectures and facilitates smart memory compilation.

To efficiently leverage this new technology, however, algorithms and hardware architectures need to be revisited. In this paper we revisit Shepp and Logan's backprojection algorithm, which is widely used in CT image reconstruction. We observe that in the parallel implementation of the algorithm, the memory address differences are fairly small for adjacent projection angles and adjacent pixels. We exploit this property via a customized memory structure that can feed image processing engines running in parallel with a large amount of required projection data in one clock cycle. The implementation is realized by embedding "intelligent" functionality into traditional interleaved memories and allowing multiple memory sub-banks to share the memory periphery. We further construct a smart memory design framework that provides the end user with finer control of the customized SRAM architecture parameters, thus enabling automatic generation of the specified implementation. Physical implementations were carried out in a commercial 14 nm SOI CMOS process. Our results indicate more than 40% area savings and 30% power savings. The marginal impact on accuracy is minimized with appropriate constraints on the algorithm.

II. ADDRESS PATTERN EXPLORATION

In a parallel-beam CT scanning system, the object to be scanned is placed between the evenly spaced array of a unidirectional X-ray source and the detector. Radiation beams from the source pass through the object and are measured at the detector. A complete set of projections is obtained by rotating the arrays and taking measurements for different angles over 180°, forming the Radon transform of the image (i.e., projection data). Inverting the projection data allows the tomographic images to be reconstructed (i.e., backprojection) [9], [1].

Shepp and Logan backprojection algorithm. The Shepp and Logan backprojection algorithm is the most well-known backprojection algorithm [2], [3]. For each pixel P located at (x, y) and each projection angle θi, the first step in backprojection is to locate the pixel in an appropriate beam (ray). If the center of P is not on a ray, the distance (d) to its adjacent rays is calculated and the contribution from the adjacent rays to the pixel (Qp) is computed according to the linear interpolation equation (1), assuming that the pixel is enclosed by the t-th and (t + 1)-th rays,

\[ Q_p(x, y, \theta_i) = R_t + (d/L) \cdot (R_{t+1} - R_t), \quad (1) \]

where \( R_t \) is the value of the t-th ray, d is the interpolation distance, and L is the ray interval. \( Q_p \) represents the contribution of the projection angle \( \theta_i \) to the current pixel value.

In the above equation, the address \( t \) to the projection data memory and the interpolation distance \( d \) are computed as follows (assuming the target image has the dimension size of \( r \times c \)):

\[ t_{x,y,\theta_i} = \left( x - \frac{r}{2} \right) \cdot \cos \theta_i - \left( y - \frac{c}{2} \right) \cdot \sin \theta_i + t_{offset}. \quad (2) \]

\[ d = t_{x,y,\theta_i} - \lfloor t_{x,y,\theta_i} \rfloor. \quad (3) \]
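
As a minimal sketch of equations (1) to (3), the following Python fragment computes the contribution of one projection angle to one pixel. The function names, the list-based projection memory, and the default ray interval L = 1 (so that d is measured in address units) are illustrative assumptions rather than part of the described hardware.

```python
import math

def ray_address(x, y, theta, r, c, t_offset):
    """Fractional address into the projection memory, equation (2)."""
    return (x - r / 2) * math.cos(theta) - (y - c / 2) * math.sin(theta) + t_offset

def backproject_pixel(projection, x, y, theta, r, c, t_offset, ray_interval=1.0):
    """Contribution Q_p of one projection angle to pixel (x, y)."""
    t_frac = ray_address(x, y, theta, r, c, t_offset)
    t = math.floor(t_frac)      # index of the lower enclosing ray
    d = t_frac - t              # interpolation distance, equation (3)
    r_t, r_t1 = projection[t], projection[t + 1]
    # Linear interpolation between the two adjacent rays, equation (1).
    return r_t + (d / ray_interval) * (r_t1 - r_t)

# Usage: with a ramp-shaped projection, the interpolated value equals the
# fractional address itself. t_offset must keep every address non-negative.
ramp = [float(i) for i in range(16)]
q = backproject_pixel(ramp, x=6, y=2, theta=math.pi / 6, r=8, c=8, t_offset=8)
```

Each call touches two adjacent memory words, which is the access pattern the consecutive access memory in Section III is built around.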

Address difference. The above procedure is repeated for every angle and for every pixel, which involves significant address computation and memory access operations. To illustrate the inherent address pattern, we show the address into the next projection memory, of angle \( \theta_{i+1} \), in (4):

\[ t_{x,y,\theta_{i+1}} = \left( x - \frac{r}{2} \right) \cdot \cos (\theta_{i+1}) - \left( y - \frac{c}{2} \right) \cdot \sin (\theta_{i+1}) + t_{offset}. \quad (4) \]

The address difference (\( \delta t_1 \)) between (2) and (4) can be written as

\[ \delta t_1 = \left( x - \frac{r}{2} \right) \cdot \delta \cos \theta_i + \left( \frac{c}{2} - y \right) \cdot \delta \sin \theta_i, \quad (5) \]

with \( \delta \cos \theta_i = \cos (\theta_{i+1}) - \cos (\theta_i) \) and \( \delta \sin \theta_i = \sin (\theta_{i+1}) - \sin (\theta_i) \). Using trigonometric identities, we can compute the bounds on (5) as follows (assuming \( r = c \) and \( N \) is the total number of projections):

\[
|\delta t_1| \leq \left| 2 \sin \left( \frac{\pi}{N} \right) \cdot \frac{r}{2} \left( \cos \left( \frac{(2i+1)\pi}{N} \right) - \sin \left( \frac{(2i+1)\pi}{N} \right) \right) \right|. \quad (6)
\]

(6) has a maximum bound of \( \sqrt{2} \pi \cdot \frac{r}{N} \approx 4.44 \frac{r}{N} \) for relatively large \( N \). This shows that \( \delta t_1 \) is limited to a fairly small range when an appropriate ratio of \( r \) and \( N \) is selected. For example, the bound is below one whenever \( \frac{r}{N} < \frac{1}{\sqrt{2} \pi} \approx 0.225 \).
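
The bound can be checked numerically. The sketch below evaluates (5) at the corners of the image square for every pair of adjacent angles and compares the largest magnitude with \( \sqrt{2} \pi \cdot r/N \). It assumes equally spaced angles \( \theta_i = i\pi/N \) over 180°, as in the scanning setup described above; the function names are hypothetical.

```python
import math

def max_adjacent_difference(r, n):
    """Largest |delta t_1| from equation (5) over the image square and all angles."""
    worst = 0.0
    for i in range(n):
        th0 = math.pi * i / n            # assumed angle spacing of pi / N
        th1 = math.pi * (i + 1) / n
        d_cos = math.cos(th1) - math.cos(th0)
        d_sin = math.sin(th1) - math.sin(th0)
        # delta t_1 is linear in x and y, so its extremes lie at the corners.
        for x in (0, r):
            for y in (0, r):
                dt = (x - r / 2) * d_cos + (r / 2 - y) * d_sin
                worst = max(worst, abs(dt))
    return worst

def delta_t1_bound(r, n):
    """Closed-form bound sqrt(2) * pi * r / N quoted in the text."""
    return math.sqrt(2) * math.pi * r / n

r, n = 256, 1024
measured = max_adjacent_difference(r, n)
limit = delta_t1_bound(r, n)
print(f"measured max |dt1| = {measured:.3f}, bound = {limit:.3f}")
```

Under this angle convention the measured maximum stays below the closed-form bound, so the bound is conservative for the sketch's assumptions.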

This observation can easily extend to two scenarios below:

(a) The address difference between the \( k \)-th subsequent projection memory, of angle \( \theta_k \), and the first memory, of angle \( \theta_0 \), for the same pixel \( P(x, y) \) increases proportionally to \( k \):

\[
|\delta t_k| = \left| t_{x,y,\theta_k} - t_{x,y,\theta_0} \right| \leq \sqrt{2} \pi \cdot \frac{r}{N} \cdot k \approx 4.44 \frac{r}{N} \cdot k. \quad (7)
\]

(b) The address differences when both the pixel coordinates and the projection angle are incremented are also bounded by a limited range. For demonstration purposes, we consider reconstructing four neighboring pixels concurrently from adjacent projection memories.

\[
\delta t_{max} = t_{x,y,\theta_{k+1}} - t_{x,y,\theta_0} = \cos \theta_1 + \sin \theta_0 + k \cdot \delta t_1. \quad (8)
\]

It is easy to show that (8) has a maximum value of \( \sqrt{2} + 4.44 \cdot \frac{r}{N} \cdot k \), which is limited to a small range; e.g., the value must be less than four when \( \frac{r}{N} \leq \frac{1}{8} \) and \( k = 4 \).
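
To make scenario (b) concrete, the following sketch measures how far the fractional addresses spread when a block of four neighboring pixels is backprojected over k adjacent angles, and compares the spread with \( \sqrt{2} + 4.44 \cdot (r/N) \cdot k \). The 2 × 2 pixel block, the angle spacing \( \theta_i = i\pi/N \), and all function names are assumptions made for illustration.

```python
import math

def ray_address(x, y, i, r, n):
    """Equation (2) with r = c and t_offset = 0, assuming theta_i = i * pi / N."""
    theta = math.pi * i / n
    return (x - r / 2) * math.cos(theta) - (y - r / 2) * math.sin(theta)

def block_address_spread(x, y, i, k, r, n):
    """max - min address over a 2 x 2 pixel block and k adjacent angles."""
    addrs = [
        ray_address(x + dx, y + dy, i + j, r, n)
        for dx in (0, 1)
        for dy in (0, 1)
        for j in range(k)
    ]
    return max(addrs) - min(addrs)

def spread_bound(k, r, n):
    """Maximum value stated for equation (8)."""
    return math.sqrt(2) + math.sqrt(2) * math.pi * (r / n) * k

# Usage: r / N = 1 / 8 and k = 4, the example quoted in the text.
r, n, k = 128, 1024, 4
worst = max(
    block_address_spread(x, y, i, k, r, n)
    for x in (0, 63, 126)
    for y in (0, 63, 126)
    for i in range(0, n, 16)
)
print(f"worst sampled spread = {worst:.3f}, bound = {spread_bound(k, r, n):.3f}")
```

The sampled spread only covers a subset of pixels and angles, so it illustrates the bound rather than proving it.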

The basic idea is that, because the address differences for adjacent projections and adjacent pixels are small, these addresses will activate the same or adjacent wordlines when the corresponding memories are placed horizontally in parallel with each other. This creates opportunities to share the memory decoder among these memories by programming "intelligent" logic functionality into the memory periphery.

III. SMART MEMORY CUSTOMIZATION FOR PARALLEL BACKPROJECTION ARCHITECTURE

In this section, we describe our approach to optimize the memory organization and backprojection architecture based on the observed memory access patterns.

A. Consecutive Access Memory

As we mentioned, linear interpolation is an important procedure of the algorithm. Linear interpolation requires the access to two adjacent array addresses of the projection memory in a single clock cycle.
memories are designed as the 1 × 4 consecutive access memories to output more elements than required. In this example, t₂, t₃ along with their nearest neighbors t₁ and t₄ are all read out from the memories. Then the configuration logic (output-mux) is used to select the appropriate two elements from the four outputs. In this approach, the active wordlines for the two memories are always the same.

Horizontal and vertical parallel backprojection. To exploit the proposed smart memory to obtain superior hardware efficiency of the parallel backprojection, we propose two parallel approaches, horizontal and vertical parallel backprojection.

The horizontal parallel backprojection can perform more than two backprojections in parallel, and all the involved projection memories share the same memory decoder using either the decoder-mux or the output-mux approach. Fig. 5 shows an example of accessing eight adjacent projection memories. Assuming that the pixels accessed in the first memory are at addresses t₃ and t₄, we highlight the possible locations of the two pixels accessed in the next seven memories. We observe that they are all clustered locally around t₃ and t₄, and are bounded by t₀ and t₇. The required pixels spread out further from t₃ and t₄ for memories that are further away from the first memory, as explained by formula (7): the address difference between the k-th subsequent projection memory and the first reference increases proportionally with k. Similar to the output-mux design shown in Fig. 4, we configure each projection memory as a 1 × 8 consecutive access memory to output all eight pixels shown, and use an 8-to-2 output-mux to select the appropriate two outputs from the eight outputs of each projection memory. In this way, all eight memories can share the same decoder, and seven memory decoders are saved.

However, as the projection memories output more pixels than required, many memory outputs are wasted. One way to use these wasted pixels is to apply a vertical parallel backprojection as well, which performs the backprojections of multiple neighboring pixels in parallel. For example, equation (8) describes the address differences for performing the backprojections of four neighboring pixels concurrently. Backprojection of each pixel per projection angle requires one linear interpolation and involves accessing two pixels in memory, so in total eight pixels must be accessed from each projection memory. (8) shows that these eight pixels will be contained in the outputs of the above 1 × 8 access memory in most situations. Therefore, the memory architecture needs no change for the vertical parallel backprojection, since we simply take advantage of the unused memory outputs from the horizontal parallel backprojection. By implementing both horizontal and vertical parallel backprojection concurrently using the modified consecutive access memory, all the memory outputs are utilized and a much higher throughput is achieved.
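
The shared-decoder read with an output-mux can be modeled behaviorally as follows. This is only a software sketch of the idea: the names `shared_read`, `WINDOW`, and `REF_OFFSET` are hypothetical, memories are plain Python lists, and the reference address is assumed to sit at position t₃ of the window t₀ to t₇, as in the example above.

```python
WINDOW = 8       # words delivered per access by each 1 x 8 memory
REF_OFFSET = 3   # the reference address sits at position t3 of t0..t7

def shared_read(memories, addresses):
    """Read one interpolation pair from every memory using one shared base.

    addresses[0] is the reference; it alone drives the shared decoder.
    Returns a list of (R_t, R_t+1) pairs, with None where the requested pair
    falls outside the window delivered by the shared wordline.
    Assumes the reference address is at least REF_OFFSET.
    """
    base = addresses[0] - REF_OFFSET
    pairs = []
    for mem, t in zip(memories, addresses):
        window = mem[base:base + WINDOW]       # 1 x 8 consecutive access
        offset = t - base
        if 0 <= offset <= WINDOW - 2:          # 8-to-2 output-mux select
            pairs.append((window[offset], window[offset + 1]))
        else:
            pairs.append(None)
    return pairs

# Usage: eight projection memories whose addresses cluster around the first.
mems = [[100 * m + a for a in range(32)] for m in range(8)]
pairs = shared_read(mems, [10, 10, 11, 11, 9, 12, 8, 13])
```

In this model only the first address is decoded; every other memory receives the same window, and the selection cost moves into the per-memory mux.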

IV. DESIGN AUTOMATION

Design tradeoff space analysis. Designing a CT image reconstruction system is a tradeoff problem involving algorithmic constraints, performance, hardware cost, and image accuracy. The discussion of address patterns in Section II shows that the ratio of the image dimension size (r) to the number of projections (N), r/N, is an important algorithm constraint. A smaller r/N indicates smaller adjacent address differences, which allows more adjacent projection memories to share the memory decoder, saving more hardware cost. However, it also limits the use of the method in applications with larger image size r and/or fewer projection angles N. For larger r/N, the correspondingly larger address difference will limit the number of projection memories that can share the decoder. For example, in Fig. 5, the last two projection memories of θᵣ₊₆ and θᵣ₊₇ may need to access two pixels at the two ends that are not available among the eight pixels delivered by the 1 × 8 consecutive access memory. To solve this problem we could increase the memory access width and apply more complicated configuration logic. However, this would increase the hardware cost. Alternatively, to lower hardware cost we could assign the nearest neighboring pixels if the requested pixels are not available, which unfortunately results in a loss of image accuracy. This shows that different design decisions result in different tradeoffs. The combination of these design choices constitutes a huge design space. Further, exploring the design tradeoff space requires customized memory designs, which are traditionally prohibitively expensive. Thus, a strong design automation tool is required to make the hardware synthesis feasible.
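
The nearest-neighbor fallback mentioned above can be sketched as a clamped selection. The function name and the returned `exact` flag are hypothetical; the flag is included only so that a simulation could count how often accuracy is traded away.

```python
def clamped_select(window, base, t):
    """Pick the pair (t, t+1) from a window, falling back to the nearest pair.

    Returns (low, high, exact), where exact is False when the request fell
    outside the window and the closest available pair was substituted.
    """
    last = len(window) - 2                 # highest offset with a valid pair
    offset = t - base
    nearest = min(max(offset, 0), last)    # clamp into the window
    return window[nearest], window[nearest + 1], nearest == offset

# Usage: a window holding addresses 7..14, read with base address 7.
window = list(range(7, 15))
inside = clamped_select(window, 7, 10)     # request lies inside the window
beyond = clamped_select(window, 7, 14)     # request lies past the upper end
```

Widening the window reduces how often the fallback triggers, at the price of a larger output-mux, which is the cost and accuracy tradeoff described above.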

End-to-end smart memory design framework. We have developed a smart memory design framework that provides designers with a graphical user interface to select design parameters and automatically generates the optimized smart memory hardware IP [11], [12]. As shown in Fig. 6, the tool frontend is built using the chip generator infrastructure “GENESIS” [7], [4]. It provides a user-configurable graphical interface that allows the user to input the design specification and generates the optimized RTL automatically. The tool backend is a smart memory compiler for the physical synthesis of customized smart memory, which is developed based on the logic and memory co-design methodology [5], [6], [12]. Using this tool, embedded random logic and memory periphery are synthesized together with the memory bitcells in one shot into a small set of pre-characterized layout patterns.

In Fig. 7 (a), we first compare the hardware cost of the two smart memory approaches (decoder-mux and output-mux) to the conventional memory. The memories studied here have a size of 4,096 words and a word length of 16 bits, and we only consider two memories implemented as 1 × 8 consecutive access memories sharing the decoder with each other. We observe that the output-mux approach is more cost-efficient. The reason is that in decoder-mux each wordline is accompanied by a set of configuration logic, and each set of logic communicates with its local wordline. This also explains why decoder-mux achieves relatively higher power efficiency compared to its area efficiency. In contrast, output-mux only requires a single large configuration logic block at the memory output. Due to the superiority of the output-mux method, it is used for our backprojection system in the following discussion.

In Fig. 7 (b) we evaluate the hardware cost of the proposed memory architecture for reconstructing a 256 × 256 image from 1,024 projections. The x-axis is the parallel degree \( P_d \), which is defined as the number of adjacent backprojections that are performed concurrently and share the same memory decoder. We vary \( P_d \) from two to eight to show its impact on the cost. The y-axis shows the area and dynamic power relative to the conventional design, where no memory sharing strategies are used. We see that more than 40% area savings and more than 30% power savings can be achieved as \( P_d \) increases. We also measure the mean square error (MSE) of the reconstructed image compared to the reference image (see Fig. 7 (c)). As expected, the error increases when either \( P_d \) or the algorithm parameter \( r/N \) increases, which allows us to trade off image accuracy against hardware cost in applications where minor distortion is acceptable.

VI. CONCLUSION

The emergence of construct-based design facilitates the robust synthesis of cost-effective smart memory blocks that are customized for specific applications. This creates opportunities to redesign algorithms and re-architect the hardware structure to match the advanced technology capabilities. In this paper we propose smart memory architectures and an end-to-end design framework to implement them for CT image reconstruction problems. The results in a 14nm CMOS process demonstrate significant improvements in area and power. Moreover, we present opportunities to trade off hardware cost against acceptable image accuracy based on appropriate algorithm tuning. This paper demonstrates that embedded memories in data-intensive computing can exploit the smart memory design methodology and the inherent address pattern of the algorithm to achieve superior power and performance efficiency.

ACKNOWLEDGEMENT

The authors acknowledge the support of the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.

REFERENCES