# A 21-Gb/s 87-mW Transceiver with FFE/DFE/Linear Equalizer in 65-nm CMOS Technology

Huaide Wang<sup>1</sup>, Chao-Cheng Lee<sup>2</sup>, An-Ming Lee<sup>2</sup>, and Jri Lee<sup>1</sup>

<sup>1</sup>National Taiwan University, Taipei, Taiwan

<sup>2</sup>Realtek Semiconductor Corp., Hsinchu, Taiwan

#### Abstract

This paper presents an ultra-low-power transceiver for 20-Gb/s backplane communications. Incorporating a half-rate, power-saving transmitter and a full-rate, high-speed receiver with 2-stage equalization, this work achieves 21 Gb/s with BER < 10<sup>-12</sup> over a 40-cm (16-inch) regular FR4 channel while consuming a total power of only 87 mW from a 1.2-V supply.

#### I. INTRODUCTION

The ever-growing volume of backplane communication calls for transceivers with higher speed and lower power. Figure 1(a) illustrates the measured response of a typical 40-cm FR4 channel, which presents 21.2-dB and 30.5-dB loss at 10 GHz and 20 GHz, respectively. With such a high loss, the far-end data eye at 20 Gb/s is fully closed, as expected [Fig. 1(b)]. We thus need high-frequency boosting and proper amplification at both the transmitter (Tx) and the receiver (Rx).





To reduce power and save area, one efficient approach is to parallelize the data stream in the Tx and minimize the number of CML stages and inductors. On the other hand, full-rate operation in the Rx can significantly reduce the circuit complexity. This paper introduces a 21-Gb/s transceiver that combines the advantages of these two schemes. Employing a half-rate Tx and a full-rate Rx with a novel DFE design, the transceiver provides an error-free (BER < 10<sup>-12</sup>) data link over a 40-cm FR4 channel with a power of only 87 mW.

#### II. TRANSMITTER

The transmitter is shown in Fig. 2, where half-rate data processing with a 3-tap feedforward equalizer (FFE) is employed. A demultiplexer (DMUX) deserializes the input into two 10-Gb/s data streams, which are subsequently fed into two parallel latch chains. The multiplexers (MUXes) then pick alternately from the two demultiplexed and delayed streams, producing the appropriate bit sequences to be multiplied by the corresponding coefficients $\alpha_{-1}$, $\alpha_0$, and $\alpha_{+1}$. The output driver combines the three and delivers the pre-emphasized output. The 20-GHz clock is divided by 2 before driving the DMUX and FFE. Following the design in [1], a single-ended-to-differential (S/D) converter is placed in front to accommodate the limited testing equipment. The 10-GHz clock is buffered by delays $\Delta T_1$ and $\Delta T_2$ (made of CMOS inverters) to provide proper phases for the DMUX, the latches, and the MUXes.
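The half-rate data path described above can be illustrated with a purely behavioral sketch. It is not a model of the circuit in Fig. 2: the function names, the ±1 symbol mapping, and the example tap values are assumptions made for illustration. It only shows that three taps fed by MUXes alternating between the even and odd half-rate streams produce the same sequence as a conventional full-rate 3-tap FIR filter.

```python
def to_nrz(bits):
    """Map logic bits to NRZ symbols +1 / -1 (assumed mapping)."""
    return [1 if b else -1 for b in bits]


def full_rate_ffe(bits, a_pre, a_main, a_post):
    """Reference: y[n] = a_pre*d[n+1] + a_main*d[n] + a_post*d[n-1]."""
    d = to_nrz(bits)
    return [a_pre * d[n + 1] + a_main * d[n] + a_post * d[n - 1]
            for n in range(1, len(d) - 1)]


def half_rate_ffe(bits, a_pre, a_main, a_post):
    """Same filter, but every tap reads its bit through a 2:1 MUX."""
    d = to_nrz(bits)
    even, odd = d[0::2], d[1::2]      # DMUX: two half-rate streams

    def mux(k):
        # The MUX alternates between the two streams on successive bits.
        return even[k // 2] if k % 2 == 0 else odd[k // 2]

    return [a_pre * mux(n + 1) + a_main * mux(n) + a_post * mux(n - 1)
            for n in range(1, len(d) - 1)]


# Hypothetical tap weights (pre-cursor, main, post-cursor).
bits = [1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
taps = (-0.1, 0.7, -0.2)
print(half_rate_ffe(bits, *taps) == full_rate_ffe(bits, *taps))
```

The equivalence is what allows the latches and DMUX to run at 10 Gb/s while the line still carries a 20-Gb/s pre-emphasized stream.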

Since most of the blocks operate at 10 Gb/s, CMOS logic with rail-to-rail swing can be used to minimize the power. Here, TSPC latches are used in the data paths. (A TSPC flipflop can achieve an operation bandwidth of 16 GHz in 90-nm CMOS [2].) With a 10-GHz bandwidth, its power consumption is one tenth that of a CML flipflop. Such an arrangement requires a proper interface between the CML and CMOS logic. As shown in Fig. 3(a), a current-steering adaptor converts the CML levels into full-swing levels with a bandwidth of 18 GHz. The MUX design is of great importance as well. With the help of rail-to-rail data and clock, it is possible to realize a pseudo-NMOS MUX at 20 Gb/s [Fig. 3(b)]. Here, the sign-bit selection of the two data streams is accomplished by two-way switches made of transmission gates. Note that the MUX in Fig. 3(b) naturally restores the output signal back to CML levels. Such a semi-digital Tx saves at least 40 mW of power. The output driver combines the weighted inputs in current mode [1], generating an output swing of 200 mV and a maximum boost of 9.5 dB. Owing to the half-rate structure, no inductor is used throughout the Tx.
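To relate the quoted boost to tap weights, the sketch below computes the Nyquist-to-DC gain ratio of a generic 3-tap FIR filter. The tap normalization (absolute values summing to 1), the equal split between pre- and post-cursor, and the function names are assumptions; the actual coefficient settings of the chip are not given in the text.

```python
import math


def ffe_nyquist_boost_db(a_pre, a_main, a_post):
    """Boost of y[n] = a_pre*d[n+1] + a_main*d[n] + a_post*d[n-1]."""
    dc = abs(a_pre + a_main + a_post)      # response to a constant pattern
    nyq = abs(a_main - a_pre - a_post)     # response to a 1010... pattern
    return 20 * math.log10(nyq / dc)


def side_weight_for_boost(boost_db):
    """Total side-tap weight s (taps -s/2, 1-s, -s/2) giving the boost."""
    r = 10 ** (boost_db / 20)
    return (r - 1) / (2 * r)


s = side_weight_for_boost(9.5)             # 9.5 dB is the maximum quoted boost
print(f"side-tap weight: {s:.3f}")
print(f"boost check: {ffe_nyquist_boost_db(-s / 2, 1 - s, -s / 2):.2f} dB")
```

Under these assumptions, roughly one third of the total tap weight sits in the side taps at maximum boost, and setting both side taps to zero gives the 0-dB case.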





#### III. RECEIVER

Although the half-rate architecture has recently become attractive in high-speed receiver designs, it would suffer from circuit complexity if a decision-feedback equalizer (DFE) were included. A sub-rate DFE with loop unrolling relaxes the stringent speed requirement, but the extra hardware and difficult routing complicate the design. We therefore employ a full-rate architecture in this design. Here, two techniques are introduced to implement the receiver with very low power. We use an analog equalizer in front to boost the high-frequency part, and a 1-tap DFE after it to further compensate for the loss. As illustrated in Fig. 4, a single-stage transimpedance amplifier with shunt-shunt feedback and $50\text{-}\Omega$ on-chip termination is placed as the front-end buffer [1]. The analog equalizer provides a maximum boost of 12 dB at 10 GHz, relaxing the DFE design. Since each equalizer has only one tuning dimension, adaptability can easily be incorporated in the receiver even with an analog approach. The whole Rx operates with CML levels.
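As a rough first-order check, the boost figures quoted in the text can be set against the measured channel loss at 10 GHz. The sketch below simply adds decibel values, which assumes the stages cascade linearly and that each boost is referred to the same frequency; it ignores the DFE contribution, noise, and reflections.

```python
CHANNEL_LOSS_DB = 21.2   # 40-cm FR4 at 10 GHz, from the Introduction
TX_FFE_BOOST_DB = 9.5    # maximum Tx boost, from Section II
RX_EQ_BOOST_DB = 12.0    # maximum analog equalizer boost at 10 GHz


def db_to_ratio(db):
    """Convert a voltage gain in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)


def residual_loss_db(loss_db, *boosts_db):
    """Loss left at Nyquist after subtracting the linear boosts."""
    return loss_db - sum(boosts_db)


residual = residual_loss_db(CHANNEL_LOSS_DB, TX_FFE_BOOST_DB, RX_EQ_BOOST_DB)
print(f"channel attenuation: x{db_to_ratio(-CHANNEL_LOSS_DB):.3f}")
print(f"residual after FFE and analog equalizer: {residual:.1f} dB")
```

With these numbers the two linear equalizers together roughly match the 10-GHz loss, leaving the 1-tap DFE to handle the remaining post-cursor interference.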



Fig. 4. Receiver architecture.

However, a typical full-rate DFE suffers from inadequate settling time for the feedback signal at high speed. The proposed full-rate DFE (Fig. 5) merges the adder and the slicer into the flipflop to accelerate the operation. Here, the flipflop output feeds back to the input with the coefficient $\alpha$ realized in current mode, i.e., the additional pair $M_{11}$-$M_{12}$ creates the 1-tap feedback. This is equivalent to dynamically adjusting the threshold level of the sampler based on the previous result. Note that the total tail current of the adder and the master latch remains constant in order to keep a fixed swing. That is, the adder pair also steers the tail current (by $M_{13}$-$M_{14}$) synchronously with the master latch, resetting the feedback when the comparison (or "slicing") is accomplished. As a result, the master latch maintains a constant output swing in the locking state, since the regenerative pair $M_3$-$M_4$ carries all the tail current ($I_{SS}$). The boost at the Nyquist frequency (half the data rate) can be derived as $20\log_{10}[(1+\alpha)/(1-\alpha)]$. Simulation shows that this topology achieves 5-dB boost at Nyquist even for a data rate of 30 Gb/s without using inductive peaking. It also saves 14 mW as compared with a half-rate 1-tap DFE at 20 Gb/s.
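The boost expression and the "moving threshold" view of the 1-tap DFE can be checked numerically. The sketch below is a behavioral illustration only: the function names, the ±1 symbol mapping, and the single post-cursor channel model are assumptions, not a description of the circuit in Fig. 5.

```python
import math


def dfe_boost_db(alpha):
    """Equivalent Nyquist boost of a 1-tap DFE: 20*log10((1+a)/(1-a))."""
    return 20 * math.log10((1 + alpha) / (1 - alpha))


def alpha_for_boost(boost_db):
    """Invert the expression above to get alpha for a target boost."""
    r = 10 ** (boost_db / 20)
    return (r - 1) / (r + 1)


def one_tap_dfe(samples, alpha, first_prev=-1):
    """Slice each sample against a threshold shifted by alpha * previous decision."""
    prev, decisions = first_prev, []
    for y in samples:
        prev = 1 if y - alpha * prev > 0 else -1
        decisions.append(prev)
    return decisions


# Assumed toy channel: one post-cursor of weight h1, preceding symbol = -1.
h1 = 0.28
data = [1, -1, -1, 1, 1, -1, 1, -1]
received = [d + h1 * p for d, p in zip(data, [-1] + data[:-1])]

print(f"alpha for 5-dB boost: {alpha_for_boost(5.0):.2f}")
print(one_tap_dfe(received, alpha=h1) == data)
```

Under this model, a feedback coefficient of about 0.28 corresponds to the 5-dB boost quoted from simulation.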



#### IV. EXPERIMENTAL RESULTS

The transceiver has been designed and fabricated in 65-nm CMOS technology. Figure 6 shows photos of the dies and the testing board. Transceiver performance for different channel lengths (from 5 cm to 40 cm) of a conventional FR4 board is thoroughly examined with chip-on-board assembly. The input data pattern is a $2^{31}-1$ PRBS. The transmitter and the receiver consume 45 and 42 mW, respectively, from a 1.2-V supply. Figure 7 depicts the transmitter's output at 20 Gb/s with minimum (0 dB) and maximum (9.5 dB) Nyquist boost, presenting a maximum swing of 200 mV. The recovered data for different channel lengths at the receiver's output is shown in Fig. 8. With full boost at both ends (Tx and Rx), the transceiver is capable of delivering 21-Gb/s data over a 40-cm FR4 channel (which presents 21.2-dB loss at 10 GHz) with BER $< 10^{-12}$. Figure 9(a) plots the BER performance for different lengths. A bathtub BER test for different clock phase errors at the Rx is shown in Fig. 9(b), suggesting a tolerable range of 240°. Table I compares the performance of this work with prior art.
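The power and data-rate figures can be reduced to energy per bit for a like-for-like comparison. The sketch below uses only numbers stated in this section and in Table I; the energy-per-bit metric itself and the function name are additions for illustration and are not reported in the paper.

```python
def energy_per_bit_pj(power_mw, rate_gbps):
    """mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 b/s = 1e-12 J/b)."""
    return power_mw / rate_gbps


# (label, total Tx+Rx power in mW, data rate in Gb/s)
links = [
    ("[3]", 70 + 130, 10),                   # Table I
    ("[4]", 318, 20),                        # Table I
    ("This work (Table I)", 45 + 42, 20),    # Table I lists 20 Gb/s
    ("This work (21 Gb/s)", 45 + 42, 21),    # maximum rate reported above
]

for label, power_mw, rate_gbps in links:
    print(f"{label}: {energy_per_bit_pj(power_mw, rate_gbps):.2f} pJ/bit")
```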



Fig. 6. Die photos and testing board (with 40-cm channel).



Fig. 7. Transmitter's output with (a) minimum (0 dB), (b) maximum (9.5 dB) boost. (Data rate: 20 Gb/s, vertical scale: 50 mV/div, horizontal scale: 20 ps/div.)



Fig. 8. Receiver's output for (a) 10-cm, (b) 40-cm channels. (Data rate: 20 Gb/s, vertical scale: 50 mV/div, horizontal scale: 20 ps/div.)



Fig. 9. (a) BER for different channel lengths, (b) bathtub plot for clock phase.

TABLE I. Performance Summary

|                                   | [3]                 | [4]                  | This Work           |
|-----------------------------------|---------------------|----------------------|---------------------|
| Data Rate                         | 10 Gb/s             | 20 Gb/s              | 20 Gb/s             |
| Channel                           | 40-cm Tyco          | 5-cm FR4             | 40-cm FR4           |
| BER (2<sup>31</sup>-1 PRBS)       | < 10<sup>-9</sup>   | < 10<sup>-12</sup>   | < 10<sup>-12</sup>  |
| Power (mW)                        | Tx: 70, Rx: 130     | Tx+Rx: 318           | Tx: 45, Rx: 42      |
| Supply                            | 1.2/1.0 V           | 1.2 V                | 1.2 V               |
| Area (mm<sup>2</sup>)             | Tx: 0.43, Rx: 0.43  | Tx: 0.34, Rx: 0.22   | Tx: 0.03, Rx: 0.04  |
| Technology                        | 90-nm CMOS          | 90-nm CMOS           | 65-nm CMOS          |



#### ACKNOWLEDGEMENT

The authors thank TSMC and NSC for support.

#### REFERENCES

[1] Jri Lee et al., "Design and Comparison of Three 20-Gb/s Backplane Transceivers for Duobinary, PAM4, and NRZ Data," *JSSC*, 2008.

[2] Jri Lee and H. Wang, "Subharmonically Injection-Locked PLLs for Ultra-Low-Noise Clock Generation," *ISSCC*, 2009.

[3] J. F. Bulzacchelli et al., "A 10-Gb/s 5-Tap DFE/4-Tap FFE Transceiver in 90-nm CMOS Technology," *JSSC*, 2006.

[4] J. Jaussi et al., "A 20Gb/s Embedded Clock Transceiver in 90nm CMOS," *ISSCC*, 2006.

[5] E. Yeung et al., "Power/Performance/Channel Length Tradeoffs in 1.6 to 9.6Gbps I/O Links in 90nm CMOS for Server, Desktop, and Mobile Applications," *VLSI Symp.*, 2006.

[6] T. Masuda et al., "A 250mW Full-Rate 10Gb/s Transceiver Core in 90nm CMOS Using a Tri-State Binary PD with 100ps Gated Digital Output," *ISSCC*, 2007.

[7] K. Krishna et al., "A 0.6 to 9.6Gb/s Binary Backplane Transceiver Core in 0.13μm CMOS," *ISSCC*, 2005.

[8] M. Harwood et al., "A 12.5Gb/s SerDes in 65nm CMOS Using a Baud-Rate ADC with Digital Receiver Equalization and Clock Recovery," *ISSCC*, 2007.