Thursday, May 21, 2015

Image convolution with the DFT multiplication property


Determine if image needs to be processed in sections or at once.
If in sections, need to keep track of boundaries (history).

Using FFT


The result of multiplying the DFTs of two matrices is equivalent to taking the circular convolution of one matrix with the other. The larger matrix is the image, the smaller one is the kernel.

Circular convolution in 2D wraps in the horizontal, vertical, and diagonal directions. It would be equivalent to linear convolution by the following image with a 3x3 kernel:


  78 | 2345678
  89 | 3456789
--------------------
  67 | 1234567
  78 | 2345678
  89 | 3456789


The largest square above is the original image. If this image were circularly convolved with this kernel:


1 1 1
1 1 1
1 1 1


The resulting image would be:

array([[39, 27, 36, 45, 54, 63, 51],
       [39, 27, 36, 45, 54, 63, 51],
       [39, 27, 36, 45, 54, 63, 51]])

Performing the convolution with the original image in the frequency domain gives:

array([[51, 39, 27, 36, 45, 54, 63],
       [51, 39, 27, 36, 45, 54, 63],
       [51, 39, 27, 36, 45, 54, 63]])
      
Note that the columns are rotated by one to the right. Linear convolution of images is usually done starting with the center element of the kernel in the upper left corner of the image. This is done so that the output image will have the same dimensions as the input. The boundaries are often zero-padded.

This is only a practical convention, however. True convolution starts with the upper left corner of the kernel - the kernel is flipped in both directions - in the upper left corner of the image. This is analogous to 1D convolution. We have no control over how the convolution is performed when applying the DFT multiplication property. We're contrained by the mathematics: y = IFFT(X*H). So we have to set up the input matrices correctly to get the desired result.

In this case, the linear convolution was performed with the kernel hanging off the edge by one element in every direction. Or in other words, so that the center element (the anchor) was aligned with each boundary element.

On the other hand, multiplication in the frequency domain is equivalent to convolving with the kernel only overlapping the top row or left-most column on the boundary, and with no boundary overlap on the bottom row and right-most column. Again, this is analogous to 1D convolution. Of course, this is circular convolution, so this is just one way of visualizing it. The kernel is really only overlapping the image at all times, and there is no boundary padding.

If linear convolution were performed this way, there would be a pronounced boundary effect only on the top and left edges of the output, resulting in a biasing effect. Typically, we want to balance the boundary effect to all edges.
      
So how do we set up the input matrices for convolution via the Y=X*H property? We want to the output image to be the same as for linear convolution with a zero-padded image, e.g.:


0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 0
0 2 3 4 5 6 7 8 0
0 3 4 5 6 7 8 9 0
0 0 0 0 0 0 0 0 0



We will also need to slice the output image, in order to remove unwanted boundary content and preserve the dimensions. This is directly analogous to removing filter delay from 1D signals.

First of all, the kernel needs to be zero-padded, so that it has the same dimensions as the input image. Otherwise, their DFTs couldn't be multiplied. Some scientific computing packages (e.g Matlab and Numpy) will zero-pad for you, according to the FFT order(s) passed to them.

The simplest way to zero-pad is to place the kernel in the upper left corner of a zeroed buffer. This is what the automatic zero-padding in NumPy will do. That way, after the matrix is flipped, convolution will start with the upper left corner of the kernel. There are other types of padding that can be used, but we'll use zeros for simplicity.

We need to zero-pad the input image in order to preserve the input dimensions in the output. The only difference here from direct convolution is that both image dimensions should be zero-padded to be powers of two, for FFT optimization.

Note that the result of DFT multiplication will be circular convolution with the zero-padded image, not the original, as the algorithm doesn't "know" what part is the original. Remember that circular convolution stops at the bottom and right edges, rather than sliding off them like linear convolution. So we'll want to zero-pad the bottom and right edges the same as for direct convolution: so that the center element of the kernel will overlap the edge elements.

A note here regarding terminology. We'll define "direct convolution" as convolution performed directly, i.e. without any transforms. In "linear convolution" the kernel slides off the edge of the image, without wrapping around. Unless otherwise stated, direct convolution is assumed to be linear.

So for an MxN kernel, the bottom and right edges should be padded with (M-1)/2 and (N-1)/2 (assuming M and N odd) rows/columns of zeros, respectively. We would then pad the upper and left boundaries to make the dimensions powers of two. This will make the slicing algorithm much simpler than if we tried to make the padding the same on all edges.

With linear convolution, we'd only want to remove (M-1)/2 rows and (N-1)/2 columns from the top and left of the output, respectively. This is, of course, analogous to removing filter delay from a 1D signal. It actually is filter delay, we just don't call it "delay" because it's an image instead of a time-varying signal.

For a KxL image, the vertical "delay" will be:
   
    nextpowof2(K + min_zero_pad - 1) - (K + min_zero_pad - 1) + 2*(linear_delay)
   
    nextpowof2(K+M-1) - (K+M-1) + 2*(M-1)/2

    = nextpowof2(K+M-1) - K
   
We start with minrows, which is the number of rows in the image plus the zero-padding, then extend it to a power of 2:

    minrows = (M-1)/2 + K + (M-1)/2 = K + M - 1
   
Not suprisingly, this is the formula for the length of a 1D convolution. Likewise, the horizontal "delay" will be:

    nextpowof2(L+N-1) - (L+N-1) + 2*(N-1)/2 = nextpowof2(L+N-1) - L
   
Again, this is only for odd kernel dimensions.

Note that if we hadn't zero-padded the input image before computing the FFT, it would have "chopped off" part of the right and bottom edges of the output image, when compared to direct convolution. It also would have shifted the image down and to the right, with some unintended boundary effect to the top and left. this boundary effect results from the kernel wrapping around the input image.


Understanding circular convolution on images


To better understand how circular convolution works in 2D, recall that circular convolution is based on the idea of periodicity. So imagine an infinite number of adjacent copies of the image:


  1234567 | 1234567 | 1234567
  2345678 | 2345678 | 2345678
  3456789 | 3456789 | 3456789
---------------------------------------------
  1234567 | 1234567 | 1234567
  2345678 | 2345678 | 2345678
  3456789 | 3456789 | 3456789
---------------------------------------------
  1234567 | 1234567 | 1234567
  2345678 | 2345678 | 2345678
  3456789 | 3456789 | 3456789



If we convolve a kernel with the center matrix, overlapping values from adjacent matrices will create the wrap-around effect discussed previously.