The enable driven stream combiner (EDSC) combines information from different sources into one stream. You can use the EDSC to map different fields into the frequency domain of one OFDM symbol. This concept is illustrated in Figure 38 for 3 computational VIs.
Figure 38: Enable driven stream combiner
The Stream Generation VI generates control information for the computational VIs. This comprises an enable signal for each computational VI. Only one of these enable signals is asserted at a given time. The asserted enable signal defines the structure of the stream that is generated. Afterwards, an unlimited number of computational VIs provide their data on the output whenever the corresponding enable signal is asserted. When the enable signal is not asserted, the output is zero. Because of this constraint, a simple OR gate can be used to combine the streams.
This design pattern does not use a throttle control mechanism. The computational module can always provide data when the enable signal is asserted. The stream generation ensures that the module can provide data prior to starting.
If the computation of one of the VIs requires pipelining, the other paths between the Stream Generation VI and the OR gate must be delayed to equalize path latencies. Since the output of the computation usually has a wider bit width than the enable signal, add the delay before the computational modules.
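The combiner principle can be modeled in a few lines of software. The following is a minimal behavioral sketch, not the LabVIEW FPGA implementation: the names `DelayLine`, `stream_generation`, and `combine` are hypothetical, each computational VI is modeled as an iterator with a fixed pipeline latency, and the equalizing delay is placed on the enable signal before the computation, as recommended above.

```python
from collections import deque

class DelayLine:
    """Register chain with a fixed depth (hypothetical helper)."""
    def __init__(self, depth, fill):
        self.regs = deque([fill] * depth)

    def step(self, value):
        if not self.regs:              # depth 0 behaves like a wire
            return value
        self.regs.append(value)
        return self.regs.popleft()

def stream_generation(schedule, num_vis):
    """Yield one-hot enable tuples; schedule[t] names the VI that owns slot t."""
    for owner in schedule:
        yield tuple(i == owner for i in range(num_vis))

def combine(schedule, sources, latencies):
    """OR-combine the outputs of all VIs after equalizing the path latencies."""
    max_lat = max(latencies)
    # Narrow enable signals are delayed BEFORE the computation.
    pre = [DelayLine(max_lat - lat, False) for lat in latencies]
    # The pipeline latency of each computational VI.
    pipe = [DelayLine(lat, 0) for lat in latencies]
    idle = (False,) * len(sources)
    enables = list(stream_generation(schedule, len(sources))) + [idle] * max_lat
    out = []
    for en in enables:
        word = 0
        for i, src in enumerate(sources):
            enabled = pre[i].step(en[i])
            data = next(src) if enabled else 0   # output is zero when not enabled
            word |= pipe[i].step(data)           # simple OR gate
        out.append(word)
    return out[max_lat:]                         # drop the pipeline fill cycles
```

Because every path has the same total latency and at most one VI drives a non-zero word per cycle, the OR gate reproduces the slot order defined by the schedule.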
5.2 LTE PxSCH Channel Encoder
5.2.1 Functionalities and design considerations
The LTE PxSCH Channel Encoder comprises the following tasks.
Segmentation of incoming bits of shared channel to code blocks and transport blocks as described in 3GPP  and .
Calculation of code block and transport block CRCs and concatenation of CRCs
Figure 39 shows the block diagram of the LTE PxSCH Channel Encoder module. The module has the following features.
Dedicated for LTE PDSCH / PUSCH
LTE data channel (UL + DL) support only
75 Mbit/s throughput (20 MHz, SISO)
Supports all transport block sizes compliant to 3GPP 
Turbo Encoder requires no filler bits
All code blocks of a transport block are equal sized
Supports redundancy version RV=0 only (no HARQ)
Supports UE cat 5 only (no soft buffer limitation in rate matcher)
|Clock rate in LTE AF
||LTE 75 Mbit/s support
||Worst case latency for PRB=100, MCS=28; of minor importance without real time MAC layer
||numbers also include parameter computation for configuration
||Values in () refer to Xilinx Kintex-7 FPGA K410T as used in supported NI USRP and FlexRIO devices
|Block RAMs (36k)
Table 1: LTE PxSCH Channel Encoder facts
All inputs and outputs offer a 4-Wire handshake interface. The LTE PxSCH Channel Encoder requires a configuration prior to the data. This configuration comprises the following.
Number of resource elements used for transmission
Modulation (QPSK, 16QAM, 64QAM)
Redundancy version index (0)
Transport block size (Table 22.214.171.124.1-1 of )
The incoming data shall be given as Boolean values. The PxSCH Channel Encoder output is also given as Booleans. The mapping from Booleans to symbols is done such that a false equals 1 and true equals -1.
5.2.3 Implementation Overview
The LTE PxSCH Channel Encoder consists of four main modules, as shown in Figure 39. Internally, multiple stages parallelize the execution on a code block basis. Each stage can contain a different configuration. The state machines of the modules handshake with each other at the end of the operations to check if computation can continue. If computation cannot continue, the operation of the stages will be stalled. The nomenclature for the data samples on the blue path is based on section 5.1.3 of . The yellow arrows indicate the control information.
Figure 39: PxSCH Channel Encoder block diagram
In the Configuration Stage, the Parameter Calculation module derives the internal parameters from the given parameter cluster (see section 5.2.1). This calculation is performed once for each transport block.
The internal configuration cluster is consumed by the CRC module. This module calculates the 24-bit CRC checksum for the incoming transport block. Depending on the transport block size, it subsequently segments the transport block into code blocks. If the segmentation results in more than one code block, a 24-bit CRC checksum is calculated for each code block. The CRC checksums are calculated and mapped into the bit stream according to Sections 5.1.1 and 5.1.2 of . The CRC module generates a new configuration out of the given transport block parameters for each code block because all subsequent blocks are working on a code block base.
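The CRC and segmentation steps can be illustrated with a short software model. This is a sketch under stated assumptions, not the FPGA code: the generator polynomials are the 24-bit CRC polynomials of the 3GPP channel coding specification as commonly quoted (gCRC24A for the transport block, gCRC24B for code blocks), the maximum code block size of 6144 bits is assumed, and the function names `crc24` and `segment` are hypothetical. In line with the feature list above, the sketch only handles sizes that need no filler bits.

```python
CRC24A = 0x864CFB  # assumed gCRC24A without the leading D^24 term (transport block)
CRC24B = 0x800063  # assumed gCRC24B without the leading D^24 term (code block)

def crc24(bits, poly):
    """Bitwise CRC over a list of 0/1 values, MSB first, zero initial state."""
    reg = 0
    for b in bits:
        feedback = ((reg >> 23) & 1) ^ b
        reg = (reg << 1) & 0xFFFFFF
        if feedback:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(23, -1, -1)]

def segment(tb_bits, z=6144):
    """Attach the transport block CRC, then split into equal-sized code blocks."""
    b = list(tb_bits) + crc24(tb_bits, CRC24A)
    if len(b) <= z:
        return [b]                       # single code block, no code block CRC
    c = -(-len(b) // (z - 24))           # ceiling division: number of code blocks
    assert len(b) % c == 0, "sketch assumes sizes that need no filler bits"
    size = len(b) // c
    blocks = [b[i * size:(i + 1) * size] for i in range(c)]
    return [blk + crc24(blk, CRC24B) for blk in blocks]
```

Running the CRC over a block that already carries its checksum yields an all-zero remainder, which is how a receiver can check the block.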
The uncoded input bits c of each code block are fed into the Turbo Encoder block where the bit stream is duplicated. The first stream is fed directly into a Turbo Encoder, whereas the second copy is fed into an internal interleaver prior to encoding. The output of this interleaver is the bit stream c’. After interleaving, the resulting stream c’ is also encoded using an identical Turbo Encoder as for stream c before. The actual Turbo Encoding algorithm is implemented as defined in Section 126.96.36.199.1 of .
The termination of the trellis in the encoder is performed by feeding the bits from the feedback shift registers into the encoder after all information bits are encoded. The resulting termination bits are collected, reordered as defined in Section 188.8.131.52.2 of , and mapped into the encoder output streams, forming the d(0), d(1) and d(2) outputs of the encoder. The Encoder block has a separate output for the d(0) stream, while the d(1) and d(2) streams are using the same output port, so that d(0) and d(1) are transferred in parallel in the first chunk of data followed by a second chunk with the remaining d(2) bits. The transfer length of each chunk is based on the code block size K.
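The encoding and termination described above can be sketched in software as follows. The sketch assumes the usual LTE constituent code (feedback polynomial 1 + D^2 + D^3, feedforward polynomial 1 + D + D^3) and a QPP interleaver of the form Π(i) = (f1·i + f2·i²) mod K. The ordering of the termination bits follows the author's reading of the 3GPP specification and should be checked against it. All function names are hypothetical.

```python
def qpp(k, f1, f2):
    """Quadratic permutation polynomial interleaver addresses."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]

def constituent(bits):
    """Recursive systematic encoder; returns parity, tail systematic, tail parity."""
    s1 = s2 = s3 = 0
    parity = []
    for c in bits:
        a = c ^ s2 ^ s3                 # feedback, g0 = 1 + D^2 + D^3
        parity.append(a ^ s1 ^ s3)      # feedforward, g1 = 1 + D + D^3
        s1, s2, s3 = a, s1, s2
    tail_x, tail_z = [], []
    for _ in range(3):                  # termination: feed the register taps back
        tail_x.append(s2 ^ s3)          # this input forces the feedback sum to 0
        tail_z.append(s1 ^ s3)
        s1, s2, s3 = 0, s1, s2
    assert (s1, s2, s3) == (0, 0, 0)    # trellis ends in the all-zeros state
    return parity, tail_x, tail_z

def turbo_encode(c, f1, f2):
    """Return the three streams d0, d1, d2, each of length K + 4."""
    pi = qpp(len(c), f1, f2)
    c_int = [c[p] for p in pi]          # interleaved stream c'
    z, x_t, z_t = constituent(c)
    z2, x2_t, z2_t = constituent(c_int)
    # Assumed termination bit ordering (verify against the specification).
    d0 = list(c) + [x_t[0], z_t[1], x2_t[0], z2_t[1]]
    d1 = z + [z_t[0], x_t[2], z2_t[0], x2_t[2]]
    d2 = z2 + [x_t[1], z_t[2], x2_t[1], z2_t[2]]
    return d0, d1, d2
```

For example, `turbo_encode(c, 3, 10)` encodes a 40-bit code block, assuming f1 = 3 and f2 = 10 as the interleaver parameters for K = 40.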
The encoder output is written into the circular buffer of the rate matcher. The sub-block interleaving defined in Section 184.108.40.206 of  is performed by calculating interleaved write addresses for the circular buffer. When all bits of one code block are written into the circular buffer, the encoding stage is complete and is able to process the next code block. Meanwhile the output stage is able to read out the circular buffer. The readout begins at address k0 and stops after the length of the output sequence reaches E.
The timings of the different stages of the whole PxSCH Channel Encoder depend on their current configuration. Table 2 lists the processing time in clock cycles for each stage (compare to Figure 39). This list does not take into account the stage handshaking by the state machine.
||Processing Time [clock cycles]
Table 2: PxSCH Channel Encoder processing time per stage
The timing of the PxSCH Channel Encoder is visualized in Figure 40. The rectangles indicate the valid samples in each stage. The grey rectangle in the upper right corner serves as a scaling reference. The colors indicate the samples of one code block. The configurations belonging to different transport blocks are provided in Table 3.
Figure 40: PxSCH Channel Encoder timing for three contiguous transport blocks
||Processing Time [clock cycles]
Table 3: Configurations for PxSCH Channel Encoder timing figure (Figure 40)
Prior to the first transport block (blue), all modules are empty. Therefore, the single code block is passed from one stage to the next right after the stage is finished. At the end of the output stage the resulting bit sequence is available.
The single code block of the next transport block (green) has a lower code rate. Thus, the processing time of the Output Stage is much longer than the processing time of the Encoding Stage. In this case, the reading of the circular buffer of the Rate Matcher during the Output Stage determines the overall throughput. Upon completion of the Output Stage, the Encoding Stage of the next code block (red) is already complete, so the Output Stage of this code block as well as the Encoding Stage of the fourth code block (yellow) can begin immediately.
The third transport block consists of two code blocks (red and yellow). Only one configuration is required for both code blocks. The CRC module segments the input data stream into two successive code block streams and provides a corresponding configuration for each code block.
Due to the high code rate of the last configuration the Output Stage of the first code block (red) is much faster than the Encoding Stage of the next code block (yellow). In this case, the handover to the Output Stage is delayed and the Turbo Encoder limits the throughput of the whole subsystem.
5.2.5 Throughput and Latency
The throughput of the PxSCH Channel Encoder subsystem is limited by the stage with the longest processing time. Depending on the configuration, this is either the Encoding Stage, the transfer of the encoded data into the circular buffer of the rate matcher, or the Output Stage. Assuming that all modules are ready for input data, the throughput can be calculated with Equation 1, using the values from Table 2 as processing times (PT). The clock frequency is denoted fCLK. For the maximum number of resource blocks, PRB = 100, and the highest format, MCS = 28, the throughput equals 95.8 Mbit/s at a clock rate of 192 MHz (see also section 5.2.1).
Equation 1: Throughput calculation
The latency of the complete channel encoding process between the assertion of a valid configuration on the input and the availability of all bits on the Rate Matcher output can be calculated by Equation 2. Additional cycles are required for the stage handshaking, but they are negligible for larger code blocks. For the configuration of 100 PRBs and MCS 28 the latency L is about 0.8 ms.
Equation 2: Latency calculation
5.3 LTE PxSCH Channel Decoder
5.3.1 Functionalities and design considerations
The LTE PxSCH Channel Decoder comprises the following tasks as described in 3GPP  and .
Performs rate matching
Performs turbo decoding
Checks code block and transport block CRCs; concatenates and outputs decoded bits
Figure 41 shows the block diagram of the LTE PxSCH Channel Decoder module. The channel decoder is fully compliant with the 3GPP LTE standard  and . The supported transport block sizes are provided in Table 220.127.116.11.1-1 of . There is no circular buffer limitation (corresponding to UE category 4 or 5 for SISO transmission). Retransmissions are not supported by the LTE Application Framework. Thus, HARQ combining is not included in the channel decoder, and the parameter computation is implemented for redundancy version index 0 only. However, the interface already includes a field for the redundancy version, which is currently ignored. The signal processing itself can also cope with parameter values that result from redundancy version values other than 0.
Dedicated for LTE PDSCH / PUSCH
LTE data channel (UL + DL) support only
75 Mbit/s throughput support (20 MHz, SISO)
Support for all transport block sizes compliant to 3GPP 
Turbo decoder requires no filler bits
All code blocks of a transport block are equal sized
Redundancy version RV=0 support only (no HARQ)
Support for UE cat 5 only (no soft buffer limitation in rate matcher)
|Clock rate in LTE AF
||Depends on number of iterations n of Turbo decoder; LTE 75 Mbit/s support is achieved with n ≤ 4.5
|Decode performance: Implementation loss
||Depends on configuration of n; largest n providing processing time below 1 ms, required better SNR to achieve a FER=10% compared with an ideal Log-MAP decoder with n → ∞
||Worst case latency for n ≤ 4.5; of minor importance without real time MAC layer
||numbers also include parameter computation for configuration
||Values in () refer to Xilinx Kintex-7 FPGA K410T as used in supported NI USRP and FlexRIO devices
|Block RAMs (36k)
Table 4: LTE PxSCH Channel Decoder facts
All inputs and outputs offer a 4-Wire handshake interface. For its operation the LTE PxSCH Channel Decoder requires a configuration prior to the data. This configuration comprises the following.
Number of resource elements used for transmission
Modulation (QPSK, 16QAM, 64QAM)
Redundancy version index (0), other values are ignored
Transport block size (table 18.104.22.168.1-1 of )
The incoming data is given as log-likelihood ratios (LLRs) as defined in Equation 3. The quantization of the input LLRs also defines the required internal precision of the decoder processing unit. The fewer bits that are spent, the more the performance degrades compared to the floating-point model. The best tradeoff between precision and resource usage is achieved with 8 bits. The fixed-point format is a signed FXP5.3. The LLR input covers the range from -16 to +15.875; stronger LLRs have to be coerced to these limits. In case of puncturing, an LLR of 0 represents the maximum uncertainty.
Equation 3: Log-likelihood Ratios
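The signed FXP5.3 input format can be illustrated with a small quantizer. This is a minimal sketch: the function names are hypothetical, and round-to-nearest is an assumption, since the rounding mode of the LLR demapper is not stated here.

```python
def quantize_llr(llr, word_bits=8, frac_bits=3):
    """Convert a floating-point LLR to the raw integer of a signed FXP5.3 value."""
    scale = 1 << frac_bits                    # 3 fractional bits: step of 0.125
    lo = -(1 << (word_bits - 1))              # -128, represents -16.0
    hi = (1 << (word_bits - 1)) - 1           # +127, represents +15.875
    raw = round(llr * scale)                  # rounding mode is an assumption
    return max(lo, min(hi, raw))              # coerce stronger LLRs to the limits

def to_float(raw, frac_bits=3):
    """Interpret the raw integer as a fixed-point number."""
    return raw / (1 << frac_bits)
```

For example, `to_float(quantize_llr(100.0))` yields 15.875 and `to_float(quantize_llr(-100.0))` yields -16.0, while a punctured position stays at exactly 0.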
The output of the PxSCH Channel Decoder is given as Booleans. The mapping from Booleans to bits is performed such that a False equals 1 and True equals -1 (Refer to Equation 4).
Equation 4: BPSK Mapping
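The mapping described above (False equals 1, True equals -1) corresponds to the usual BPSK relation. It is written here with a hypothetical symbol b for the Boolean value (False = 0, True = 1) and x for the mapped value. The notation of the original equation may differ.

```latex
x = 1 - 2b, \qquad b \in \{0, 1\}, \qquad
x = \begin{cases} +1 & \text{if } b = 0 \ (\text{False}) \\ -1 & \text{if } b = 1 \ (\text{True}) \end{cases}
```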
The number of half-iterations m to execute in the Turbo Decoder is configurable at runtime (in the literature it is more common to specify the number of iterations n = m/2). The fixed-point format of m is an unsigned FXP4.0. For best results regarding decoding performance and throughput, set the number of half-iterations in the range from m = 5 to m = 9. Setting m > 9 (n > 4.5) limits the decoder throughput to below 75 Mbit/s at a clock rate of 192 MHz.
5.3.3 Implementation Overview
The LTE PxSCH Channel Decoder consists of four main modules, as shown in Figure 41. Internally, multiple stages parallelize the execution on a code block basis. Each stage can contain a different configuration. The state machines of the modules handshake with each other at the end of the operations to check if computation can continue. If computation cannot continue, the stages will be stalled. The nomenclature for the data samples on the blue path is based on section 5.1.3 of . The yellow arrows indicate the control information.
Figure 41: PxSCH Channel Decoder block diagram
In the Configuration Stage, the Parameter Calculation module derives the internal parameters from the given parameter cluster (see section 5.3.2). This calculation is performed once for each transport block.
The internal configuration cluster is consumed by the Rate Matcher. Using a 4-Wire handshake, the weighted softbits e can now be transferred into the circular buffer. The Input Stage is complete when the rate matching output sequence length E is reached. This step is repeated for each code block in the transport block without taking a new configuration. Thus, the configuration cluster contains both values of E according to section 22.214.171.124.2 of .
In the Transfer Stage, a reduced configuration is given to the Turbo Decoder. This cluster comprises:
Number of code blocks (C)
Code block size (K)
Last code block flag
After configuration handover, the sequences d(0), d(1), d(2) are read from the circular buffer of the rate matcher and stored into the Turbo Decoder’s Softbit/LLR Input Buffer. Punctured softbits/LLRs are represented by zeros in the sequences. Therefore, no additional puncturing information is needed. This transfer is divided into two chunks. While the sequences d(0) and d(1) are transmitted in parallel in the first chunk, d(2) is transmitted in the second consecutive chunk. The transfer length of each chunk is based on the code block size K.
In the Decoding Stage, the Turbo Decoder estimates the encoded bit sequence b in multiple iterations. One full iteration consists of two half-iterations: one half-iteration is based on the input sequences d(0) and d(1), whereas the other uses an interleaved version of d(0) and the received sequence d(2) to estimate b. During the last half-iteration, the bit sequence b is written to the Reordering Buffer. The number of half-iterations m can be changed during runtime.
In the Output Stage, the decoded bits are passed to the CRC check module aligned with the configuration cluster. The CRC check module checks and removes the transport block CRC checksum as well as the code block CRC checksums. On the output, only the bits of sequence a are marked as valid. At the end of the transport block, the result of the CRC check is given as control information: a Boolean for the result of the transport block check and a cluster of 13 Booleans for the code block CRC checks, where each entry 1 .. C represents one code block.
5.3.4 Rate Matcher details
Rate matching in LTE consists of separate interleaving of the three bit streams from the encoder, followed by a circular buffer that stores all these bits (see Figure 42). The desired code rate is achieved by reading from the circular buffer the number of bits that fits the scheduled resources. In theory, this concept allows adjustment to any code rate between 0 and 1. If the code rate is higher than the encoder's code rate of 1/3, not all bits are read from the circular buffer, whereas for lower code rates, some bits are read more than once.
Figure 42: Rate matching for turbo coded transport channel  at transmitter
The rate matcher at the receiver must execute these operations in reverse order. At the beginning the circular buffer is filled with Zeros to easily implement puncturing. Additionally, writing to the circular buffer can be implemented easily as softbit combining by read, add and write back operations. After all received softbits are written to the circular buffer, the three streams are read. Some softbits can still be Zeros, indicating that those bits were not transmitted at all. After deinterleaving, the streams are handed over to the decoder separately.
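The interplay of puncturing, repetition, and softbit combining can be shown with a simplified circular buffer model. This is a sketch only: sub-block interleaving, filler bit handling, and the k0 calculation are left out, and the function names are hypothetical.

```python
def tx_read(buffer, k0, e):
    """Transmitter: read e bits from the circular buffer, starting at k0."""
    n = len(buffer)
    return [buffer[(k0 + j) % n] for j in range(e)]

def rx_combine(softbits, n, k0):
    """Receiver: accumulate the received softbits into a zero-initialized buffer."""
    buf = [0.0] * n                      # zero means "not transmitted"
    for j, llr in enumerate(softbits):
        addr = (k0 + j) % n
        buf[addr] = buf[addr] + llr      # read, add, write back
    return buf
```

If e is smaller than the buffer length, the untouched positions remain zero, which marks them as punctured. If e is larger, the wrap-around writes add up repeated softbits at the same address.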
The Rate Matcher implementation on the FPGA consists solely of the circular buffer. The buffer is written in linear order starting at k0 (see section 126.96.36.199.2 of ), which is adapted to reflect the omission of filler bits. A read, modify, and write-back mechanism enables softbit combining. The subsequent readout of the circular buffer uses a special address calculation to reverse the sub-block interleaving (described in section 188.8.131.52.1 of ) of the three bit streams d(0), d(1), and d(2) on the fly and provides them in linear order to the decoder.
Additionally, unlike the definition in section 5.1.4 of , the implementation of the circular buffer does not contain any filler bits. This must be taken into account in the parameter calculation (e.g. k0) and in the address calculation as well.
5.3.5 Description of Turbo Decoder implementation
The Turbo Decoder is based on the Max-Log-MAP algorithm, a log-domain approximation of the BCJR (MAP) algorithm, described in Chapter 4 of . The LabVIEW implementation is capable of handling code blocks with a length that is a multiple of eight bits (byte aligned). This condition is fulfilled by all the transport block sizes given in .
To achieve the throughput of 75 Mbit/s with up to n = 4.5 iterations (m = 9 half-iterations) using a single Turbo Decoder instance, the decoder is internally parallelized with P = 4. Each incoming code block is divided into P = 4 equal-length segments of length K / P. Thus, P = 4 identical Max-Log-MAP decoders estimate the extrinsic information for all code block segments in parallel. Furthermore, the Max-Log-MAP decoder implementation uses an additional windowing approximation, called the next iteration initiation technique, to reduce the amount of memory needed to store all internal states.
5.3.5.1 Operation Principle
Three softbit sequences are handed over from the Rate Matcher. These are systematic (S=d(0)) softbits, parity 1 (P1= d(1)) softbits originating from the first convolutional encoder and parity 2 (P2=d(2)) softbits from the second convolutional encoder using an interleaved version of the systematic bits. Internally, the decoder uses two different softbit sets as shown in Figure 43. The first set comprises the received systematic S and parity 1 P1 softbits. The second set consists of the interleaved systematic bits S’ (derived from ) and the parity 2 P2 bits.
The decoding is done iteratively. In each half-iteration the decoder is fed with one set of softbits. Internally the full Trellis diagram is evaluated to search the likeliest way through all states. The a-posteriori output represents the Log-likelihood ratio for each bit. There is also extrinsic information that represents the information gain for each bit from the half-iteration. For the next half-iteration this additional information (called a-priori on the input) is used along with the other set of softbits to refine the estimation. Between half-iterations the extrinsic information must be interleaved or de-interleaved to match the order of softbits (according to the original or QPP interleaved order in the encoder). In the last half-iteration a hard decision is done on the sign of the a-posteriori information to get the decoded bits.
Figure 43: Turbo Decoder principle
5.3.5.2 Mathematical Operations
Based on the AWGN channel model, the probability of an encoded bit x can be expressed as an exponential term. Because the decoder operates on log-likelihood ratios, the logarithm of a sum of such terms can be approximated by the maximum of the exponents:
Equation 5: Simplification in the operations
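The approximation referred to above is commonly written as follows. This is the textbook form of the max-log simplification, given as a sketch. The symbols a, b, and x_i are generic exponents and are not taken from the original equation.

```latex
\ln\!\left(e^{a} + e^{b}\right)
  = \max(a, b) + \ln\!\left(1 + e^{-|a - b|}\right)
  \approx \max(a, b),
\qquad
\ln \sum_{i} e^{x_i} \approx \max_{i} x_i
```

For two terms, the neglected correction is at most ln 2, which is reached when both exponents are equal.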
Figure 44: Summary of key operations in the MAP algorithm
For each half-iteration the Trellis diagram for the current code block is set up and evaluated where the key operations are visualized in Figure 44. The first step is the calculation of the state transition probability Gamma (Γ) for each bit (index k) of the code block from the input LLR(yk) (weighted softbits Lcykl) and the a-priori information L(uk) (see Equation 6). The index l enumerates the elements of the code word (encoded bits). There are two elements in LTE code words of one component encoder (systematic and parity). The previous state is denoted by s’ while the next state is s. The state numbering is based on the encoder’s internal registers. In LTE the encoder has three registers, which translates to eight states. The channel reliability factor Lc is already weighted by the LLR demapper. xkl are the encoded bits created by the encoder during this state transition.
Equation 6: Gamma Computation
Based on this state transition probability Gamma the forward recursive calculation of Alpha (A) can be performed. Alpha is a vector of probabilities for all eight states of the encoder’s internal registers. This relates to the search of the likeliest path in the Trellis diagram in forward direction. Since the component convolutional encoder in the LTE data channel processing is terminated the start state s is known to be the all zeros state. The start vector A0 therefore exhibits a much higher probability for the zero state than for all other states. Subsequent Alpha vectors Ak are calculated recursively using the Gamma values. The new vector is calculated element by element (Ak(s) for state s) as shown in Equation 7.
Equation 7: Alpha computation
The backward recursive calculation of Beta (B) starts at the end of the code block. This relates to the search for the likeliest path in the Trellis diagram in reverse order. The end state s is also known to be the all zeros state due to the terminated encoder. Equation 8 defines the recursive Beta calculation starting with BK with the highest probability value for the zero state.
Equation 8: Beta computation
Based on Alpha, Beta, and Gamma the A-Posteriori LLR L(uk|y) for bit index k can be calculated as defined in Equation 9. It uses the Alpha vector Ak-1 corresponding to the accumulated state transition probabilities from the start of the code block up to the previous bit index k-1, with the Beta vector Bk representing the accumulation of future state transition probabilities starting from bit index k up to the end of the code block, and the current Gamma transition probability vector from the received symbol with bit index k. The hard decision for the decoded bit can be derived from the sign of the A-Posteriori result.
Equation 9: A-Posteriori computation
As a last step, the Extrinsic information (probability gain from decoding in such a half-iteration) is calculated based on the A-Posteriori values as defined in Equation 10. By subtracting the A-Priori information and the influence of the transmitted softbit estimation, only the information gain is preserved.
Equation 10: Extrinsic value computation
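The sequence of Gamma, Alpha, Beta, A-Posteriori, and Extrinsic computations can be summarized in a compact floating-point model of one half-iteration. This is a sketch of the textbook Max-Log-MAP recursions, not the fixed-point FPGA implementation. It assumes the 8-state LTE constituent code, a mapping of bit 0 to +1 and bit 1 to -1, softbits that are already weighted by the channel reliability, and an unknown end state (all Beta start values zero) instead of the termination bit handling. The names `EDGES`, `encode`, and `max_log_map` are hypothetical.

```python
NEG = float("-inf")

def build_trellis():
    """List of (prev_state, next_state, u, parity) for all 16 state transitions."""
    edges = []
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for u in (0, 1):
            a = u ^ s2 ^ s3
            edges.append((s, (a << 2) | (s1 << 1) | s2, u, a ^ s1 ^ s3))
    return edges

EDGES = build_trellis()

def encode(bits):
    """Reference parity bits of one constituent encoder (no termination)."""
    state, parity = 0, []
    for u in bits:
        _, state, _, p = EDGES[2 * state + u]
        parity.append(p)
    return parity

def max_log_map(l_sys, l_par, l_apr):
    """One half-iteration; returns the (A-Posteriori, Extrinsic) LLR lists."""
    k = len(l_sys)
    # Gamma: branch metric of every transition for every bit index.
    gamma = [[0.5 * ((1 - 2 * u) * (l_sys[i] + l_apr[i]) + (1 - 2 * p) * l_par[i])
              for (_, _, u, p) in EDGES] for i in range(k)]
    # Alpha: forward recursion, the start state is known to be all zeros.
    alpha = [[0.0] + [NEG] * 7]
    for i in range(k):
        nxt = [NEG] * 8
        for e, (s, n, _, _) in enumerate(EDGES):
            nxt[n] = max(nxt[n], alpha[i][s] + gamma[i][e])
        alpha.append(nxt)
    # Beta: backward recursion, end state treated as unknown in this sketch.
    beta = [None] * k + [[0.0] * 8]
    for i in range(k - 1, -1, -1):
        cur = [NEG] * 8
        for e, (s, n, _, _) in enumerate(EDGES):
            cur[s] = max(cur[s], beta[i + 1][n] + gamma[i][e])
        beta[i] = cur
    app, ext = [], []
    for i in range(k):
        best = [NEG, NEG]                # best path metric for u = 0 and u = 1
        for e, (s, n, u, _) in enumerate(EDGES):
            best[u] = max(best[u], alpha[i][s] + gamma[i][e] + beta[i + 1][n])
        llr = best[0] - best[1]          # positive LLR favors bit 0
        app.append(llr)
        ext.append(llr - l_apr[i] - l_sys[i])
    return app, ext
```

A hard decision on the sign of the A-Posteriori values (negative means bit 1 under the assumed mapping) recovers the bits. In a full decoder, the Extrinsic output would be interleaved or de-interleaved and used as the A-Priori input of the next half-iteration.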
In order to execute multiple iterations to increase the performance of the Turbo Decoder and improve the overall system sensitivity, the decoding operation must be parallelized to meet the throughput requirements from section 5.3.1.
The Turbo Decoder uses code block segmentation and windowing to reduce the execution time. The split of one code block is shown in Figure 45. The size of any code block in LTE is a multiple of eight bits. Therefore, it is always possible to split the execution into P = 4 equal-length subsegments that are processed in parallel. P = 4 was chosen as a tradeoff between the achievable throughput (or the maximum number of half-iterations) and the required resource utilization. Each subsegment requires a separate BCJR Subsegment Decoder instance, according to 18.104.22.168.
Figure 45: Code block segmentation
The subsegments are further split into smaller windows of 32 bit indices. If the subsegment size is not a multiple of 32, the first window (window 0) can be smaller. The window count w is limited to 48 for the largest code block size. The purpose of this split is primarily to reduce the amount of memory needed to store all internal state information, and also to reduce the decoding latency from a maximum of K/4 to 32. Instead of completing the backward Beta and forward Alpha computation for all K/4 bits before starting the A-Posteriori computation, decoding is started at least every 32 bits.
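The resulting layout of subsegments and windows can be computed as follows. This is a minimal sketch with a hypothetical function name. It only assumes what is stated above: P = 4 subsegments, windows of 32 bit indices, and a possibly shorter window 0.

```python
def window_layout(k, p=4, win=32):
    """Return the subsegment length and the list of window lengths."""
    seg = k // p                         # subsegment length K / P
    first = seg % win or win             # window 0 takes the remainder, if any
    count = -(-seg // win)               # ceiling division: window count w
    return seg, [first] + [win] * (count - 1)
```

For the largest code block size, `window_layout(6144)` returns a subsegment length of 1536 with 48 windows of 32 bit indices each, which matches the limit of w = 48 given above.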
For each half-iteration, Gamma and Beta calculation are started on a subsegment basis. This is shown by orange arrows in Figure 46. As soon as the Beta vector for the last code bit of one window is available, Alpha computation is triggered, which is indicated by green arrows. Gamma and Beta values are preserved for each code bit in a LIFO to reverse their ordering and enable calculation of the A-Posteriori values in combination with the Alpha computation output.
Figure 46: Subsegment execution principle for w=3
Both segmentation and windowing split the underlying Trellis diagram into multiple parts. Because of the termination of the encoder, only the probabilities of the very first and the very last state of the code block Trellis are fixed prior to decoding. For all intermediate subsegments and window cutting edges, the state probabilities are unknown. All the state probabilities are equally set to zero to express this uncertainty.
For the next half-iteration on the same set of softbit inputs (the half-iteration after next), the probability vectors Alpha and Beta of all end states of predecessor subsegments and windows are used as improved starting values for the successor subsegments and windows. This reflects the actual continuity of the Trellis. Such transitions are illustrated in Figure 47 as dashed arrows.
These transitions do not work for consecutive half-iterations because even and odd half-iterations are based on different softbit input sets. The difference originates from the interleaving of the systematic bits for the second component encoder. Thus the order of bits is not the same, which leads to different Alpha / Beta state probabilities. The transition of probability vectors at the cutting edges leads to a completion of the Trellis diagram after a certain number of half-iterations.
Figure 47: Two exemplary state transitions
22.214.171.124 FPGA Implementation
The block diagram of the Turbo Decoder is shown in Figure 48. As described in previous sections, the Turbo Decoder covers multiple operation stages. The handshaking between the stages and the control signal for each half-iteration are generated in a state machine not shown in the block diagram.
Figure 48: Turbo Decoder block diagram
At the end of the Transfer Stage, the systematic bits and parity bits are available in the Softbit Input Buffer. This double buffer supports the independent operation between the Transfer and Decoding stage. The systematic bits are stored in linear (S) order as well as in interleaved order (S’) for even and odd half-iterations using two different memories. The encoder termination bits are separated from the incoming data streams by the Termination Bit Extraction module and stored in the Termination Bit Memory.
Upon startup of the Decoding Stage operation, the termination bits are read from the Termination Bit Memory into the Initial Beta Calculation module. They are used to determine the start values of the Beta probability vectors for the first and the second set of softbits. Both vectors are saved to the Stake Memory that handles the state transitions described in section 126.96.36.199 between subsegments and windows.
Upon completion of the initial Beta calculation, the first half-iteration is triggered. As indicated by the thickness of the arrows in Figure 48, four parallel streams are read from the A-Priori and Softbit Buffer to feed the four BCJR Subsegment Decoder instances within the BCJR decoder module. The start states of Alpha and Beta are provided in parallel by the Stake Memory. At the end of each subsegment window the probability vectors are written back to that memory. The memory uses double buffering to store two different sets of state vectors assigned to the different sets of softbits (S & P1 or S’ & P2).
The A-Posteriori and Extrinsic outputs of the BCJR decoder provide four elements per clock cycle from the four subsegment decoders. The QPP Reordering module assigns addresses to each element and reorders the four streams corresponding to the QPP interleaving in the encoding process. The operation mode toggles between interleaving and deinterleaving for even and odd half-iterations, respectively, to always enable linear read-out of the A-Priori buffer. Double buffering is used to allow read and write operations simultaneously.
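One property that makes reordering of four parallel streams practical is that a QPP interleaver is contention-free: if the code block is split into four equal subsegments, the four elements processed in the same cycle map to four different subsegments after interleaving. The sketch below illustrates this property. Whether the QPP Reordering module exploits it in exactly this way is an assumption. The function names are hypothetical, and f1 = 3 and f2 = 10 are assumed as the interleaver parameters for K = 40.

```python
def qpp(k, f1, f2):
    """Quadratic permutation polynomial interleaver addresses."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]

def parallel_targets(k, f1, f2, p=4):
    """For each cycle i, list the target subsegment of each of the p streams."""
    w = k // p                           # subsegment length
    pi = qpp(k, f1, f2)
    return [[pi[i + t * w] // w for t in range(p)] for i in range(w)]

def is_contention_free(k, f1, f2, p=4):
    """True if no two parallel streams ever address the same subsegment."""
    return all(len(set(row)) == p for row in parallel_targets(k, f1, f2, p))
```

With this property, each of the four elements written per clock cycle can go to a separate memory without access conflicts.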
During the last half-iteration, hard decision of the A-Posteriori values is done inside the QPP Reordering module, and the Boolean data is written to the Bit Reordering Buffer.
In the Output Stage, the decoded bits are read from the Bit Reordering Buffer using a 4-Wire handshake to throttle the output based on the downstream modules. Due to the implemented double buffering, the decoding of the next code block can already begin.
5.3.6 Timing of the PxSCH Channel Decoder
The timings of the different stages of the whole PxSCH Channel Decoder depend on their current configuration. Table 5 lists the processing time in clock cycles for each stage (compare to Figure 41). This list does not take into account the stage handshaking by the state machine.
||Processing Time [clock cycles]
||19+(K/4+24+min(32, K mod 128))*(number of half-iterations)
Table 5: PxSCH Channel Decoder processing time per stage
The timing of the PxSCH Channel Decoder is demonstrated in Figure 49. The rectangles indicate the valid samples in each stage. The grey rectangle in the upper right corner serves as a scaling reference. The colors indicate the samples of one code block. The configurations belonging to different transport blocks are given in Table 6.
Figure 49: PxSCH Channel Decoder timing for three contiguous transport blocks (here with m = 8 half iterations)
||Number of PRBs
|Red / Yellow
Table 6: Configurations for PxSCH Channel Decoder timing figure (Figure 49)
Prior to the first transport block (blue), all modules are empty. Therefore, the single code block is passed from one stage to the next immediately after the stage is finished. At the end of the output stage the transport block (TB) CRC is removed and the resulting bit sequence is available.
The single code block of the next transport block (green) has a lower code rate. Thus, the processing time of the Input Stage is much longer than the processing time of the Transfer Stage. In this case, the writing of the circular buffer of the Rate Matcher during the Input Stage determines the overall throughput. Upon completion of the Input Stage, the previous code block (blue) is already in the Decoding Stage. Thus, the Transfer Stage for the code block can begin immediately and fill the second page of the Turbo Decoder’s softbit buffer. This consecutive execution continues up to the output since the code block sizes of the first two transport blocks are equal.
The third transport block consists of two code blocks (red and yellow). Only one configuration is needed for both code blocks. The Rate Matcher ensures that the softbits on the input are taken code block by code block. The configuration is asserted close to the completion of the Input Stage of the code block of the second transport block (green), but it can be asserted anytime during the previous Input Stage(s).
Due to the high code rate of the last configuration, the Input Stage of the first code block (red) is much faster than the Transfer Stage of the previous code block (green). In this case the Input Stage is stalled until the previous code block enters the Decoding Stage. This occurs for the second code block (yellow) as well. The processing time of the Decoding Stage is a few clock cycles longer than the Transfer Stage for this configuration. Upon completion of the Transfer Stage for the second code block (yellow), the handover to the Decoding Stage is delayed as well. In both cases the Turbo Decoder limits the throughput of the whole subsystem.
After the last code block of a multi code block transport block has been processed by the Output Stage, the transport block CRC and all code block CRCs are available.
5.3.7 Throughput and Latency
The PxSCH Channel Decoder subsystem throughput is limited by the stage with the longest processing time. This depends on the configuration, especially the code rate. Based on the assumption that all modules are ready for input data, the throughput can be calculated with Equation 11, using the values from Table 5 as processing times (PT). The clock frequency is denoted fCLK. For the maximum of 100 PRBs and the highest MCS 28 with the number of half-iterations set to m = 9, the throughput reaches about 82 Mbit/s at a clock rate of 192 MHz. This value still exceeds the requirement from section 5.3.1.
Equation 11: Throughput calculation
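The following Python sketch shows one plausible reading of the throughput calculation: the payload bits delivered per code block, multiplied by fCLK and divided by the longest stage processing time. This form is an assumption and not a reproduction of Equation 11. The example values are also assumptions that do not appear in the text: a transport block size of 75376 bits for 100 PRBs and MCS 28 (3GPP TS 36.213), its segmentation into 13 code blocks of K = 5824 bits (3GPP TS 36.212), and a processing time of 13627 clock cycles, obtained from the Table 5 expression for K = 5824 and m = 9 and assumed to be the longest of all stages.

```python
def throughput_bps(payload_bits_per_block, stage_cycles, f_clk_hz):
    """Payload bits per second when the slowest stage limits the pipeline."""
    return payload_bits_per_block * f_clk_hz / max(stage_cycles)


TBS_BITS = 75376         # assumed transport block size (100 PRBs, MCS 28)
CODE_BLOCKS = 13         # assumed number of code blocks after segmentation
DECODING_CYCLES = 13627  # Table 5 expression for K = 5824, m = 9
F_CLK_HZ = 192e6

rate = throughput_bps(TBS_BITS / CODE_BLOCKS, [DECODING_CYCLES], F_CLK_HZ)
print(f"{rate / 1e6:.1f} Mbit/s")
```

Under these assumptions the result is close to 82 Mbit/s, which is in line with the value quoted above.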
The latency of the complete channel decoding process between the assertion of a valid configuration on the input and the availability of the CRC result(s) on the output can be calculated by Equation 12. Additional cycles are needed for the stage handshaking, but they are negligible for larger code blocks. For the configuration of 100 PRBs, MCS 28, and m = 8 half-iterations, the latency L is about 0.94 ms. This value is sufficient to connect the decoding core with a real-time MAC.
Equation 12: Latency calculation
The control channel, called PDCCH in LTE, is protected with a convolutional code against transmission errors. The corresponding receiver uses a Viterbi decoder implementing the Maximum Likelihood Sequence Estimation (MLSE) algorithm based on softbit input. Convolutional codes with a constraint length of 7 are used. Thus there is a 64-state Trellis. Other parameters of the convolutional code are summarized in Table 7.
||[133, 171, 165]
Table 7: Parameters of the convolutional encoder
5.4.1 Design considerations
The LTE PDCCH has a maximum code block length of 70 bits (for DCI format 2C using 20 MHz bandwidth as defined in section 5.3.3.1.5C of ). Currently the code block length in the LTE Application Framework is fixed to 48 bits including CRC. This code block is received once every TTI of 1 ms. The signal processing in the LTE Application Framework runs at a clock rate of 192 MHz.
5.4.2 Operation Principle
The Viterbi decoder consists of three modules: branch metric computation, path metric accumulation and survivor selection, and traceback handling for actual decoding, as shown in Figure 50.
Figure 50: Viterbi operation principle
In the branch metric computation, the received softbits are multiplied by the hypothesis to form the state transition metric. This branch metric is used to update the path metrics of all 64 states and to determine the surviving path. The corresponding Boolean bit value is stored in the traceback memory. After a certain number of iterations, the maximum path metric is determined, and starting from its state the traceback memory is evaluated backward to decode the preceding bits along the most likely path in the Trellis.
The metric computation runs in streaming mode and fills the traceback buffer continuously, but the actual decoding, which evaluates the traceback memory, is initiated only once every traceback length time instances. Thus, you must flush the metric computation with artificial softbits to enable traceback evaluation and decoding for the last bits of a code block as well.
The Viterbi core can handle one bit per clock cycle. Handshaking is implemented in the direction of upstream and downstream modules. All modules must be able to handle continuous data streaming. The input valid and output valid signals are used to indicate valid samples.
Aligned to the data is a data bit? flag. This Boolean is not used by the core but is delayed in parallel with the processing. It can be used to distinguish data bits from flushing bits, which are required to decode the last bits of the code block.
The incoming data is given as Log-likelihood Ratios, as defined in Equation 3. Based on the code rate, 2 or 3 code bit inputs must be used. The fixed-point format is FXP4.1. Based on the quotient a strong probability for a positive transmitted symbol uk is mapped to 7.5. The strong probability towards a negative transmitted symbol uk is mapped to -8. In case of puncturing, zero represents the maximum uncertainty.
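The softbit format can be illustrated with a small quantizer. The Python sketch below assumes that FXP4.1 denotes a signed value with four integer bits (including the sign) and one fractional bit, which matches the stated range of -8 to 7.5 in steps of 0.5. The rounding mode (round half up) and the function name are assumptions.

```python
import math


def quantize_llr(llr, int_bits=4, frac_bits=1):
    """Round an LLR to the fixed-point grid and saturate to its range."""
    step = 2.0 ** -frac_bits             # 0.5 for one fractional bit
    lowest = -(2.0 ** (int_bits - 1))    # -8.0
    highest = 2.0 ** (int_bits - 1) - step  # 7.5
    rounded = math.floor(llr / step + 0.5) * step  # round half up (assumed)
    return min(highest, max(lowest, rounded))


# Strong positive, strong negative, punctured, and an intermediate value.
print([quantize_llr(x) for x in (100.0, -100.0, 0.0, 1.3)])
```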
The output of decoded bits is given as a Boolean. The mapping from Booleans to symbols is performed such that a False equals 1 and True equals -1 (see Equation 4).
The operation mode and the traceback length must be constant. The traceback length defines the minimum number of Trellis steps that are processed beyond a state before that state is decoded. The valid range is 1 to 127.
The block diagram of the implementation is illustrated in Figure 51.
Figure 51: Viterbi block diagram
The branch metric computation is implemented with simple sign changes and additions. For LTE, three softbits build the input and are used to compute the 8 different branch metric values. This reflects the code rate of 1/3 of the encoder.
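As a minimal sketch of this step, the Python function below computes the 8 branch metrics for one softbit triple by correlating the softbits with each code bit hypothesis. It uses the mapping stated above (bit 0, False, corresponds to +1 and bit 1, True, to -1). The function name and the ordering of the hypothesis index are assumptions.

```python
def branch_metrics(softbits):
    """Return the 8 branch metrics for one triple of softbits.

    Index h encodes the hypothesized code bits as h = (c0 << 2) | (c1 << 1) | c2.
    Each code bit contributes +softbit (bit 0) or -softbit (bit 1), so only
    sign changes and additions are required.
    """
    s0, s1, s2 = softbits
    metrics = []
    for h in range(8):
        sign0 = 1 - 2 * ((h >> 2) & 1)
        sign1 = 1 - 2 * ((h >> 1) & 1)
        sign2 = 1 - 2 * (h & 1)
        metrics.append(sign0 * s0 + sign1 * s1 + sign2 * s2)
    return metrics


bm = branch_metrics((7.5, -8.0, 0.0))
print(bm)
```

Because the metric of a hypothesis is the negative of the metric of its bitwise complement, only four of the eight values need to be computed independently.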
Both Application Frameworks use only one implementation of the path metric computation, often named Add-Compare-Select in literature. For each of the 64 states, the path metric values of the two preceding states are updated with the corresponding branch metrics. The larger of the resulting values is stored as the new path metric for this state. At the same time, the result of the comparison is stored as a Boolean value to mark the more likely state transition of the surviving path. The outputs of the submodule are a new 64 element vector of path metrics and a 64 element Boolean vector of survivor paths for every bit vector input.
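The Add-Compare-Select step can be sketched in Python as follows. This is a behavioral illustration under stated assumptions and not the FPGA implementation: the state is assumed to hold the last six input bits with the newest bit at the LSB, the generator polynomials [133, 171, 165] (octal) are bit-reversed to match that ordering, and all function names are hypothetical.

```python
POLYS_OCTAL = (0o133, 0o171, 0o165)
# In octal notation the MSB taps the current input bit. Here the input bit
# sits at bit 0 of the 7-bit register, so each polynomial is bit-reversed.
TAPS = tuple(int(format(g, "07b")[::-1], 2) for g in POLYS_OCTAL)
N_STATES = 64


def code_bits(prev_state, bit):
    """Encoder output for input `bit` when leaving `prev_state`."""
    reg = (prev_state << 1) | bit
    return tuple(bin(reg & t).count("1") & 1 for t in TAPS)


def next_state(prev_state, bit):
    return ((prev_state << 1) | bit) & (N_STATES - 1)


def acs_step(path_metrics, softbits):
    """One Add-Compare-Select update over all 64 states."""
    new_metrics = [0.0] * N_STATES
    survivors = [False] * N_STATES
    for s in range(N_STATES):
        bit = s & 1  # the input bit that leads into state s
        candidates = []
        for oldest in (0, 1):  # the two possible predecessor states
            prev = (s >> 1) | (oldest << 5)
            bm = sum((1 - 2 * c) * x
                     for c, x in zip(code_bits(prev, bit), softbits))
            candidates.append(path_metrics[prev] + bm)
        survivors[s] = candidates[1] > candidates[0]
        new_metrics[s] = max(candidates)
    return new_metrics, survivors


# Noise-free example: equal start metrics, softbits of amplitude 4.
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
state, pm = 0, [0.0] * N_STATES
for b in bits:
    soft = [4.0 * (1 - 2 * c) for c in code_bits(state, b)]
    pm, surv = acs_step(pm, soft)
    state = next_state(state, b)
print(pm.index(max(pm)) == state)
```

In this noise-free example the state with the largest path metric coincides with the true encoder state even though all start metrics are equal.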
In LTE, tail-biting is used, and at the receiver no information about the start state is available. Hence the best path metric start vector has equal values for all states.
The path metric computation submodule does not have a reset. Thus, at the end of a code block, the path metric memory must be similar to the described start vector to allow continuity of code block handling. This is achieved by flushing appropriate softbits (see section 5.4.2). For the tail-biting convolutional coding in LTE, all path metric values should be the same at the start of a code block. This is achieved by flushing softbits with the value 0, representing complete uncertainty.
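A possible way to build the input stream for one code block is sketched below in Python. The helper name is hypothetical, and the number of flush triples (two times the traceback length) is taken from the timing description of the core. Each stream element pairs a softbit triple with the data bit? flag so that downstream logic can discard the flushing bits.

```python
def with_flush(softbit_triples, traceback_length):
    """Append zero-valued flush triples and tag every element with a data flag."""
    stream = [(tuple(t), True) for t in softbit_triples]
    flush = ((0.0, 0.0, 0.0), False)  # value 0 represents complete uncertainty
    stream.extend([flush] * (2 * traceback_length))
    return stream


# A 48-bit code block (placeholder softbits) with a traceback length of 32.
block = [(7.5, -8.0, 7.5)] * 48
stream = with_flush(block, traceback_length=32)
print(len(stream), sum(1 for _, is_data in stream if is_data))
```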
Normalization of the path metric values is used to avoid unbounded growth of the values and to restrict the bit width. Since only the difference between path metrics is of interest, but not their absolute value, normalization does not influence the decoding result. The process occurs over two clock cycles. In the first clock cycle, all path metric values are checked against a threshold before they are written to memory. In the second clock cycle, depending on the result of the threshold comparison, a constant value is subtracted from the branch metric prior to updating the path metrics.
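The effect of normalization can be illustrated with a short Python sketch. It assumes that the constant is subtracted only when every path metric has reached the threshold, and it applies the subtraction directly to the path metrics, which is equivalent to subtracting it from all branch metrics of the next update. The trigger condition, the numeric values, and the function name are assumptions.

```python
def normalize(path_metrics, threshold, offset):
    """Subtract `offset` from all metrics once all of them reach `threshold`."""
    if min(path_metrics) >= threshold:
        return [m - offset for m in path_metrics]
    return list(path_metrics)


before = [70.0, 64.5, 90.0, 66.0]
after = normalize(before, threshold=64.0, offset=64.0)
# Differences between metrics, and therefore the best state, are unchanged.
print(after, before.index(max(before)) == after.index(max(after)))
```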
The survivor path is written to two traceback memories. After traceback length samples one of the two traceback paths is triggered. The most probable state at this point in time is the one with the largest path metric value. Its index is provided by the Find Best State module. Starting from this state, the Traceback Calculation module recursively calculates the previous state based on the survivor path vectors from the traceback memory. The decoded bit is derived from the LSB of this survivor state.
Because the first decoded bits of the survivor path show lower reliability than later elements in the traceback, the first half of the bits is discarded. The order of the remaining decoded bits must be reversed because the traceback memory is evaluated backwards. Both operations are performed in the Bit Reordering module.
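The traceback and the subsequent bit reordering can be sketched in Python as shown below. The sketch assumes a state that holds the last six input bits with the newest bit at the LSB, and a survivor flag that stores the oldest bit of the selected predecessor state. Both conventions and all names are assumptions. The survivor memory in the example is constructed from a known bit sequence only to exercise the functions.

```python
STATE_BITS = 6


def traceback(survivor_memory, best_state):
    """Walk backward through the survivor vectors, newest decoded bit first."""
    state = best_state
    decoded = []
    for survivors in reversed(survivor_memory):
        decoded.append(state & 1)  # decoded bit = LSB of the current state
        oldest = int(survivors[state])
        state = (state >> 1) | (oldest << (STATE_BITS - 1))
    return decoded


def reorder(decoded_newest_first):
    """Discard the first (less reliable) half and restore chronological order."""
    half = len(decoded_newest_first) // 2
    return decoded_newest_first[half:][::-1]


# Build a survivor memory for a known path through the Trellis.
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
state = 0b101010
memory = []
for b in bits:
    nxt = ((state << 1) | b) & 0x3F
    survivors = [False] * 64
    survivors[nxt] = bool(state >> 5)  # oldest bit of the predecessor
    memory.append(survivors)
    state = nxt

raw = traceback(memory, state)
print(reorder(raw) == bits[:6])
```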
As a last step, the outputs of the two traceback chains are combined to a final decoded sequence that is available on the output.
The timing of the Viterbi decoder is demonstrated in Figure 52. Timing is independent of the chosen operation mode, but depends on the traceback length. As described in section 5.4.4 there are two traceback chains which are illustrated in different colors. The horizontal axis represents the time. For reference, a scale with multiples of traceback length clock cycles is visible on the top. The timing diagram assumes that there is a valid input in each clock cycle. The traceback memory is empty at the beginning.
Figure 52: Viterbi timing
All input data is processed in the branch and path metric calculation. This adds two cycles of latency before the data is stored in the traceback memory. The first traceback memory is read as soon as two times traceback length samples have been written. The second traceback chain starts after a delay of another traceback length samples. The Traceback Calculation module adds one cycle of latency. At the output of the Bit Reordering module, only the second half of the samples is declared valid after two times traceback length elements have been written. The outputs of both traceback chains are combined into a continuous output stream.
If the input is not valid in every clock cycle, the input pattern is preserved up to the traceback memory input. Afterwards, the traceback decoding and bit reordering are performed burst-wise. In this case, every wait cycle on the input increases the latency for the first code block input by one cycle. NI recommends that you flush the Viterbi core with a continuous stream to achieve the minimum latency for the end of the code block.
This concept results in a decoding latency of four times the traceback length (plus 13 clock cycles processing time): during two times the traceback length, the traceback buffer is written, and during another two traceback lengths, evaluation and decoding take place. Because evaluation is performed in chunks of two times the traceback length, the Viterbi decoder must be flushed with exactly two times the traceback length input softbit triples. The latency of each module is summarized in Figure 53.
Figure 53: Viterbi latency
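The latency stated above can be turned into a small calculation. The Python sketch below assumes one valid input per clock cycle. The function names are hypothetical, and the traceback length of 127 is used only because it is the upper end of the valid range.

```python
def viterbi_latency_cycles(traceback_length):
    """Decoding latency: four traceback lengths plus 13 processing cycles."""
    if not 1 <= traceback_length <= 127:
        raise ValueError("traceback length must be in the range 1 to 127")
    return 4 * traceback_length + 13


def viterbi_latency_seconds(traceback_length, f_clk_hz):
    return viterbi_latency_cycles(traceback_length) / f_clk_hz


cycles = viterbi_latency_cycles(127)
seconds = viterbi_latency_seconds(127, 192e6)
print(cycles, f"{seconds * 1e6:.2f} us")
```

Even for the longest traceback length, the resulting latency at 192 MHz is a few microseconds, which is small compared to the TTI of 1 ms.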
5.4.6 Resource Usage
The Viterbi implementation occupies the FPGA resources listed in Table 8.
|Block Ram (36k)
Table 8: Viterbi resource usage
The throughput in MS/s is equal to the clock rate in MHz since the core is capable of handling one sample each clock cycle. Synthesis of the core is successful up to a clock rate of 300 MHz.