Design and Implementation of a Semi-Unified High Performance Signal Processing Coprocessor

Using the DFT, the DHT, the DCT or the DST is an obvious choice in the signal processing domain. This paper describes the implementation of a semi-unified, high-performance coprocessor of transform length 8 as a synchronous design in the Xilinx XC3S700AN-4FG484 FPGA device. An operating frequency of 20 MHz is achieved. The paper presents the trade-offs involved in designing the architecture, the design-for-performance issues and the possibilities for future development.


Introduction
Memory-based Field Programmable Gate Arrays (FPGAs) have the advantage of real-time, in-circuit re-configurability over other gate arrays of similar gate density. This advantage translates into unlimited in-circuit flexibility, re-configurability and reliability, facilitating the prototyping of complex electronic designs [1]. The high capacity and performance that FPGAs have achieved in recent years allow them to accelerate digital signal processing (DSP) tasks. FPGA devices have been used to implement custom DSPs since the beginning of this decade [2]. Usually, FPGAs are used as a VLSI replacement in low-volume production, or as prototyping devices for designs that are eventually implemented as ASICs. Their 100% testability and the possibility of achieving a high degree of fault coverage make them increasingly attractive for complex designs with multiple (and of course limited) iterations in their design cycles [1]. FPGA devices have benefited from the improvements in VLSI technology, leading to higher speed and capability as well as lower power consumption [2].
The discrete transform algorithms are very well known and, owing to their versatility and simple hardware implementation, are widely used in VLSI digital signal processing systems. The discrete Hartley transform (DHT) is similar to the DFT, the only difference being that it involves only real computation. The discrete cosine transform (DCT) has long been used in image and speech processing. The JPEG standards before JPEG 2000 used the DCT as their basis transform. The discrete sine transform (DST) is useful for spectrum analysis, data compression, speech processing, biomedical signal processing and many other applications. These basic transforms are required in almost all phases of image and signal processing and cover a wide range of biomedical signal and image processing tasks, including various imaging techniques and spectral analysis of signals [4].
A number of architectures have been proposed for the realization of these transforms [2][3][4][5][6][7]. However, a unified architecture that can compute all of these transforms can serve the purpose of a general DSP chip; therefore, a unified architecture has been adopted to obtain all the transforms in a single FPGA chip. The basic structures of the DFT, DCT, DHT and DST are almost equivalent, and this property has been exploited in the design of the unified architecture.

Discrete Transforms
This section presents the transforms in detail and the possibility of implementing them as the basic processing elements. For a real sample sequence x(n), where n = 0, 1, ..., N-1, the discrete transforms, namely the DFT, the DHT, the DCT and the DST, can be defined as follows.
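For reference, the standard unnormalized forms of the four transforms are sketched below. These are assumptions rather than the paper's exact equations: scale factors are omitted, and the DCT and DST are taken in their type-II forms, which is consistent with the substitution m = N - k used later to obtain the DST from the DCT. Here cas(θ) = cos(θ) + sin(θ).

```latex
\begin{align}
X(k) &= \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, & k &= 0, 1, \dots, N-1 && \text{(DFT)} \\
H(k) &= \sum_{n=0}^{N-1} x(n)\, \operatorname{cas}\!\left(\frac{2\pi k n}{N}\right), & k &= 0, 1, \dots, N-1 && \text{(DHT)} \\
C(m) &= \sum_{n=0}^{N-1} x(n)\, \cos\!\left(\frac{(2n+1)\, m\, \pi}{2N}\right), & m &= 0, 1, \dots, N-1 && \text{(DCT)} \\
S(k) &= \sum_{n=0}^{N-1} x(n)\, \sin\!\left(\frac{(2n+1)\, k\, \pi}{2N}\right), & k &= 1, 2, \dots, N && \text{(DST)}
\end{align}
```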

DHT based on Direct Algorithm
T is an 8 × 8 cas (cosine and sine) matrix [6], where cas(θ) = cos(θ) + sin(θ). The entries of the transform matrix for the 8-DHT are therefore of the form cas(2πkn/8), with k, n = 0, 1, ..., 7. We start by remarking that cas(θ + π) = -cas(θ), which follows from the addition of arcs formula. Clearly, the moduli of the components in the 2nd column are identical to those of the corresponding elements in the 6th column; the same is true for the 3rd and 7th columns. We can thus consider new variables that combine x(2) and x(6), instead of x(2) and x(6) themselves, and so on.
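The column symmetry can be checked numerically. The following Python sketch is illustrative only and is not the paper's VHDL: it assumes the unnormalized cas matrix T[k][n] = cas(2πkn/8), and the names cas, dht_direct and dht_folded are hypothetical. It pairs x(n) with x(n+4) so that each output needs four products instead of eight.

```python
import math

N = 8  # transform length used throughout the paper


def cas(theta):
    """cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


# Assumed unnormalized 8x8 cas matrix: T[k][n] = cas(2*pi*k*n/8).
T = [[cas(2 * math.pi * k * n / N) for n in range(N)] for k in range(N)]


def dht_direct(x):
    """Direct 8-point DHT as a full matrix-vector product."""
    return [sum(T[k][n] * x[n] for n in range(N)) for k in range(N)]


def dht_folded(x):
    """Same DHT using sums and differences of x(n) and x(n+4).

    Since cas(theta + pi*k) = (-1)**k * cas(theta), column n+4 equals
    column n on even rows and its negative on odd rows.
    """
    half = N // 2
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [x[n] - x[n + half] for n in range(half)]
    out = []
    for k in range(N):
        v = sums if k % 2 == 0 else diffs
        out.append(sum(T[k][n] * v[n] for n in range(half)))
    return out
```

The folding halves the number of multiplications per output, which is the kind of saving the column remark above points to.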

An Algorithm for the DFT Implemented by DHT
According to the definitions of the DFT and the DHT, the DFT data sequence is given by the following relation:
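A minimal Python sketch of one standard way to obtain the DFT of a real sequence from its DHT is given below. It assumes the unnormalized definitions and the e^{-j2πkn/N} sign convention, so that Re X(k) = [H(k) + H(N-k)]/2 and Im X(k) = [H(N-k) - H(k)]/2; the paper's own relation may differ in scaling. The function names are hypothetical.

```python
import cmath
import math


def dht(x):
    """Unnormalized DHT: H(k) = sum of x(n) * cas(2*pi*k*n/N)."""
    n_pts = len(x)
    return [
        sum(
            x[n] * (math.cos(2 * math.pi * k * n / n_pts)
                    + math.sin(2 * math.pi * k * n / n_pts))
            for n in range(n_pts)
        )
        for k in range(n_pts)
    ]


def dft_from_dht(x):
    """DFT of a real sequence computed from its DHT outputs only."""
    n_pts = len(x)
    h = dht(x)
    out = []
    for k in range(n_pts):
        hk = h[k]
        hnk = h[(n_pts - k) % n_pts]  # H(N-k), with H(N) taken as H(0)
        out.append(complex((hk + hnk) / 2, (hnk - hk) / 2))
    return out


def dft_reference(x):
    """Direct DFT from its definition, used only for comparison."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]
```

Only additions, subtractions and a halving are needed after the DHT, which is why a single DHT core can serve both transforms.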

Fast Cosine Transform based on Direct Algorithm
According to the definition of the DCT, the transform of a given data sequence x(n) is given by equation (3). The discrete cosine transform can be expressed as a matrix multiplication, as illustrated below [7][8].
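As an illustration of the matrix form, the following Python sketch builds an 8 × 8 DCT matrix and applies it to a data vector. It assumes the unnormalized type-II kernel cos((2n+1)mπ/(2N)); the paper's matrix may carry scale factors or a fast factorization that this sketch does not reproduce. The names dct_matrix and dct are hypothetical.

```python
import math


def dct_matrix(n_pts=8):
    """Assumed unnormalized DCT matrix: C[m][n] = cos((2n+1)*m*pi/(2N))."""
    return [
        [math.cos((2 * n + 1) * m * math.pi / (2 * n_pts)) for n in range(n_pts)]
        for m in range(n_pts)
    ]


def dct(x):
    """DCT as a plain matrix-vector product."""
    c = dct_matrix(len(x))
    return [sum(row[n] * x[n] for n in range(len(x))) for row in c]
```

For a constant input, only the m = 0 output is non-zero, which gives a quick sanity check of the matrix.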

An Algorithm for the DST Implemented by DCT
In this part, a method of computing the discrete sine transform from the discrete cosine transform is demonstrated. Let x(n), n = 0, 1, 2, ..., N-1, be a sequence of N data values [9]. Substituting m = N - k, k = 1, 2, ..., N, into the discrete cosine transform (equation (5)) turns the cosine kernel into a sine kernel multiplied by the alternating sign (-1)^n. The DCT of the sign-alternated data, evaluated at m = N - k, is therefore S(k), where S(k) is the discrete sine transform (DST) of the sequence x(n). The procedure for obtaining the sine transform of the sequence x(n) is thus composed of three steps.
1. Change the sign of every odd-numbered data value to form a new sequence x'(n). (Note that the sequence index is counted from zero.)
2. Compute the discrete cosine transform of the sequence x'(n).
3. Reverse the order of the data produced by step 2 to obtain the discrete sine transform of the sequence x(n).
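The reason the three steps work can be sketched as follows, assuming the unnormalized type-II DCT and DST kernels and writing x'(n) = (-1)^n x(n) for the sign-changed sequence and C'(m) for its DCT. The derivation uses cos(A - B) = cos A cos B + sin A sin B with A = (2n+1)π/2, for which cos A = 0 and sin A = (-1)^n.

```latex
\begin{align}
C'(N-k) &= \sum_{n=0}^{N-1} x'(n)\, \cos\!\left(\frac{(2n+1)(N-k)\pi}{2N}\right) \\
        &= \sum_{n=0}^{N-1} x'(n)\, \cos\!\left(\frac{(2n+1)\pi}{2} - \frac{(2n+1)k\pi}{2N}\right) \\
        &= \sum_{n=0}^{N-1} (-1)^n\, x'(n)\, \sin\!\left(\frac{(2n+1)k\pi}{2N}\right) \\
        &= \sum_{n=0}^{N-1} x(n)\, \sin\!\left(\frac{(2n+1)k\pi}{2N}\right) = S(k),
        \qquad k = 1, 2, \dots, N .
\end{align}
```

Reading the DCT outputs in reverse order (m = N-1 down to 0) therefore delivers S(1), S(2), ..., S(N).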
This procedure may be represented in the form of a matrix multiplication.
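One possible matrix form, given here as an assumption since the symbols are not taken from the paper, is S = J C D, where D is a diagonal matrix of alternating signs, C is the unnormalized DCT matrix and J is the order-reversal matrix. The Python sketch below builds these hypothetical matrices for N = 8 and compares the product with the DST matrix formed directly from the sine kernel.

```python
import math

N = 8


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Assumed unnormalized DCT matrix (rows m = 0..N-1).
C = [[math.cos((2 * n + 1) * m * math.pi / (2 * N)) for n in range(N)]
     for m in range(N)]
# Step 1 as a matrix: negate the odd-indexed inputs.
D = [[(-1) ** n if i == n else 0 for n in range(N)] for i in range(N)]
# Step 3 as a matrix: reverse the output order.
J = [[1 if i + n == N - 1 else 0 for n in range(N)] for i in range(N)]

S_FROM_DCT = matmul(J, matmul(C, D))

# DST matrix built directly from the sine kernel (rows k = 1..N).
S_DIRECT = [[math.sin((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
            for k in range(1, N + 1)]
```

Because D and J contain only 0 and ±1, they cost sign changes and wiring rather than multipliers.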

DCT_DST Block
The DCT block is first implemented according to the direct algorithm (equation (13)), and this DCT block is then used to implement the DCT_DST block (equation (19)). Figure 1 illustrates the proposed architecture for the DCT_DST block. If the "S" input signal has the logic value zero, the DCT is applied to the input data vector; if it has the logic value one, the DST is applied.
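The behaviour of such a block can be modelled in software as a shared DCT core with a selectable pre-stage and post-stage. The Python sketch below is a behavioural model under stated assumptions, not the architecture of Figure 1: it uses floating point rather than the 8-bit datapath, assumes the unnormalized type-II kernel, and the names dct8 and dct_dst_block are hypothetical.

```python
import math


def dct8(x):
    """Assumed unnormalized DCT core shared by both modes."""
    n_pts = len(x)
    return [
        sum(x[n] * math.cos((2 * n + 1) * m * math.pi / (2 * n_pts))
            for n in range(n_pts))
        for m in range(n_pts)
    ]


def dct_dst_block(x, s):
    """Behavioural model of the select input: s = 0 gives DCT, s = 1 gives DST."""
    if s not in (0, 1):
        raise ValueError("s must be 0 or 1")
    # Pre-stage: negate odd-indexed inputs only in DST mode.
    pre = [-v if (s == 1 and n % 2 == 1) else v for n, v in enumerate(x)]
    core = dct8(pre)
    # Post-stage: reverse the output order only in DST mode.
    return core[::-1] if s == 1 else core
```

In this model the DST mode adds only conditional negation and output reordering around the same core, so no second set of multipliers is needed.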

DHT_DFT Block
First, the DHT block is implemented according to the direct algorithm using its matrix form (equation (10)), and it is then used to implement the DHT_DFT block (equations (11) and (12)). Figure 2 shows the proposed architecture. The hardware is extracted from this data flow diagram. If the "S" input signal has the logic value zero, the DHT is applied to the input data vector; if it has the logic value one, the DFT is applied. The "I" signal selects the real or imaginary part of the DFT. This signal is included only to aid understanding of the block diagram and is ignored in the top module.
Figure 4 illustrates the simulation result of this module. During this simulation, all four transforms of the coprocessor were applied to an eight-bit data input. If the "T_SEL" signal has the hexadecimal value "00", the outputs are zero. The value "01" routes the DST to the output, and the value "02" routes the DCT to the output. The values "03" and "04" route the DFT and the DHT to the output, respectively.
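The T_SEL encoding described above can be summarized by a small behavioural model. The Python sketch below is an assumption-laden illustration rather than the paper's top module: it uses floating point and unnormalized kernels, returns the DFT as complex numbers instead of separate real and imaginary buses, and raises an error for codes above "04", whose hardware behaviour is not stated in the text. All function names are hypothetical.

```python
import math


def dct(x):
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * m * math.pi / (2 * n_pts))
                for n in range(n_pts)) for m in range(n_pts)]


def dst(x):
    # DST via the DCT: negate odd-indexed inputs, transform, reverse.
    pre = [-v if n % 2 == 1 else v for n, v in enumerate(x)]
    return dct(pre)[::-1]


def dht(x):
    n_pts = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * k * n / n_pts)
                        + math.sin(2 * math.pi * k * n / n_pts))
                for n in range(n_pts)) for k in range(n_pts)]


def dft(x):
    # DFT of a real sequence via the DHT.
    n_pts = len(x)
    h = dht(x)
    return [complex((h[k] + h[(n_pts - k) % n_pts]) / 2,
                    (h[(n_pts - k) % n_pts] - h[k]) / 2)
            for k in range(n_pts)]


# Selection codes as given in the text.
T_SEL_TABLE = {0x01: dst, 0x02: dct, 0x03: dft, 0x04: dht}


def coprocessor(x, t_sel):
    """Route the selected transform of x to the output."""
    if t_sel == 0x00:
        return [0] * len(x)  # outputs are zero for code "00"
    if t_sel not in T_SEL_TABLE:
        # Assumption: codes above "04" are treated as invalid in this model.
        raise ValueError("unsupported T_SEL code")
    return T_SEL_TABLE[t_sel](x)
```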

Implementation Results
The whole architecture, including the computation and data paths, is modelled at the register transfer level in VHDL, simulated and tested with a test bench in the ModelSim simulator, and implemented in the Xilinx XC3S700AN-4FG484 FPGA device. The simulation result of the proposed coprocessor is shown in Figure 4.
The hardware description of this architecture for the DCT, DST, DHT and DFT implementations of transform length 8 was synthesized using the Xilinx ISE tool and mapped onto the XC3S700AN-4FG484 FPGA chip. In the 8-bit coprocessor implementation, the worst-case delay is about 48 ns, and thus a frequency of about 20 MHz is achieved. The routed IP occupies a total of 3426 slices, which is 58 percent of the chip. The total number of I/Os used in the design is 328, which is 88 percent of the total I/Os of this chip.
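The reported figures can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable names are hypothetical, and the implied device totals are estimates derived from the rounded percentages, not datasheet values.

```python
# Figures quoted in the text.
WORST_DELAY_NS = 48.0
SLICES_USED = 3426
SLICE_UTILIZATION = 0.58
IOS_USED = 328
IO_UTILIZATION = 0.88

# Maximum clock frequency is the reciprocal of the worst-case delay.
max_freq_mhz = 1e3 / WORST_DELAY_NS

# Device totals implied by the rounded utilization percentages.
implied_total_slices = SLICES_USED / SLICE_UTILIZATION
implied_total_ios = IOS_USED / IO_UTILIZATION

print(f"max frequency ~ {max_freq_mhz:.2f} MHz")
print(f"implied slices ~ {implied_total_slices:.0f}")
print(f"implied I/Os ~ {implied_total_ios:.0f}")
```

A 48 ns critical path corresponds to slightly more than 20 MHz, so the quoted operating frequency leaves a small margin.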

Conclusions
This paper has proposed an efficient FPGA mapping of a common coprocessor. The DFT is implemented by means of the DHT, which is based on the direct algorithm. The direct fast DCT algorithm is presented, and a method of computing the discrete sine transform from the discrete cosine transform is then demonstrated. In future work, this coprocessor could be implemented using the DCT as the base transform for the other transforms, in order to reduce the area further.