03m-wp2-040.170.010-td-001-1-pulsar SP on Uniboard.pdf

03m-wp2-040.170.010-td-001-1-pulsar SP on Uniboard.pdf 03m-wp2-040.170.010-td-001-1-pulsar SP on Uniboard.pdf

skatelescope.org
from skatelescope.org More from this publisher
12.07.2015 Views

PULSAR SIGNAL PROCESSING ON UNIBOARDDocument number .................................................................. WP2‐ong>040.170.010ong>‐TD‐ong>001ong>Revision ........................................................................................................................... 1Author .......................................................................................................... A AhmedSaidDate ................................................................................................................. 2011‐04‐01Status ............................................................................................... Approved for releaseName Designation Affiliation Date SignatureAdditional AuthorsRob Ferdman, Ben StappersSubmitted by:A AhmedSaid UMAN 2011‐04‐01Approved by:W. Turner Signal ProcessingDomain Specialistong>SPong>DO 2011‐03‐26

PULSAR SIGNAL PROCESSING ON UNIBOARDDocument number .................................................................. WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> ........................................................................................................................... 1Author .......................................................................................................... A AhmedSaidDate ................................................................................................................. 2011‐04‐01Status ............................................................................................... Approved for releaseName Designati<strong>on</strong> Affiliati<strong>on</strong> Date SignatureAdditi<strong>on</strong>al AuthorsRob Ferdman, Ben StappersSubmitted by:A AhmedSaid UMAN 2011‐04‐01Approved by:W. Turner Signal ProcessingDomain Specialist<str<strong>on</strong>g>SP</str<strong>on</strong>g>DO 2011‐03‐26


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1DOCUMENT HISTORYRevisi<strong>on</strong> Date Of Issue Engineering ChangeNumberCommentsA 1 st April 2011 ‐ First draft release for internal review1 1 st April 2011DOCUMENT SOFTWAREPackage Versi<strong>on</strong> FilenameWordprocessor MsWord Word 2003 <str<strong>on</strong>g>03m</str<strong>on</strong>g>‐<str<strong>on</strong>g>wp2</str<strong>on</strong>g>‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐<str<strong>on</strong>g>td</str<strong>on</strong>g>‐<str<strong>on</strong>g>001</str<strong>on</strong>g>‐1‐<str<strong>on</strong>g>pulsar</str<strong>on</strong>g> <str<strong>on</strong>g>SP</str<strong>on</strong>g> <strong>on</strong> <strong>Uniboard</strong>‐2003Block diagramsOtherORGANISATION DETAILSNamePhysical/PostalAddressSKA Program Development OfficeJodrell Bank Centre for AstrophysicsAlan Turing BuildingThe University of ManchesterOxford RoadManchester, UKM13 9PLFax. +44 (0)161 275 4049Website www.skatelescope.org2011‐04‐01 Page 2 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1LIST OF FIGURESFigure 1 The design methodology ........................................................................................................... 9Figure 2 The FFT Dedispersi<strong>on</strong> method ................................................................................................ 10Figure 3 : The Polyphase filter bank dedispersi<strong>on</strong> method .................................................................. 11Figure 4 Pulsar data folding .................................................................................................................. 12Figure 5 Incoherent dedispersi<strong>on</strong> for 1 DM .......................................................................................... 14Figure 6 Incoherent dedispersi<strong>on</strong> architecture .............................................................................. 15Figure 7 4 inputs Taylor tree ................................................................................................................. 15Figure 8 Rearranged 4 inputs Taylor tree ............................................................................................. 16Figure 9 Radix 2 incoherent dedispersi<strong>on</strong> architecture ........................................................................ 16Figure 10 Radix 4 incoherent dedispersi<strong>on</strong> architecture ...................................................................... 17Figure 11 Parallel Pulsar Searching algorithm ...................................................................................... 17Figure 12 Computati<strong>on</strong> time Vs The total number of sub bands processed using the tree algorithm 22No table of figures entries found.LIST OF TABLES2011‐04‐01 Page 4 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1LIST OF ABBREVIATIONSDDR ............................... Double Data RateDM ................................. Dispersi<strong>on</strong> MeasureFFT ................................ Fast Fourier TransformFPGA ............................. Field Programmable Gate ArrayHDL ............................... High Definiti<strong>on</strong> LanguageIFFT ............................... Inverse Fast Fourier TransformISM ................................ Inter Stellar MediumRAM .............................. Random Access MemoryRTL ................................ Register Transfer LevelSKA ............................... Square Kilometre ArrayTDD ............................... Total DeDispersi<strong>on</strong> time in clock cyclesTFFT .............................. Total FFT computati<strong>on</strong> time in clock cycles2011‐04‐01 Page 5 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1cases, it has been possible to do so in real time, depending <strong>on</strong> the required FFT lengths. Otherwise,the baseband data is recorded to disk for offline processing.The real‐time observati<strong>on</strong> and coherent de‐dispersi<strong>on</strong> of <str<strong>on</strong>g>pulsar</str<strong>on</strong>g> data is becoming a necessity, asthere are <strong>on</strong>going efforts towards obtaining larger and larger bandwidths of data and in the case ofthe SKA there will be multiple <str<strong>on</strong>g>pulsar</str<strong>on</strong>g>s observed simultaneously, up to tens. The data rate requiredto obtain such volumes of raw data is incompatible with offline storage capabilities (and costs),whereas folded, de‐dispersed pulse profile data (to be discussed later) is, by comparis<strong>on</strong>, negligiblein size. However, the ability of computer processors to perform real‐time coherent de‐dispersi<strong>on</strong> insoftware am<strong>on</strong>gst nodes <strong>on</strong> a computer cluster, or by using graphics processing units (which hasrecently seen a surge in popularity), is also becoming limited, due in part to disk input/outputc<strong>on</strong>straints. This is why the c<strong>on</strong>cept of UNIBOARD is a great match for the requirements of <str<strong>on</strong>g>pulsar</str<strong>on</strong>g>observati<strong>on</strong>s.5 Design MethodologyFigure 1 The design methodologyThe adopted methodology c<strong>on</strong>sists of creating high level algorithm simulati<strong>on</strong>s, primarily within theMATLAB envir<strong>on</strong>ment. These algorithms are then c<strong>on</strong>verted into accurate system models, withfurther simulati<strong>on</strong>. Finally the models are c<strong>on</strong>verted into hardware implementati<strong>on</strong>s using anappropriate methodology.Figure 1 illustrates this methodology.5.1 Algorithmic Simulati<strong>on</strong>Using the Mathworks Matlab software, it is possible to simulate complex algorithms relativelyquickly and easily. This algorithmic simulati<strong>on</strong> allows the rapid evaluati<strong>on</strong> of the different D<str<strong>on</strong>g>SP</str<strong>on</strong>g>methods that can be used and the explorati<strong>on</strong> of the parameter space.5.2 Architectural ModellingThe next step after the algorithmic simulati<strong>on</strong> is the translati<strong>on</strong> of the high level algorithms intosystem accurate Simulink models. This modelling stage takes into c<strong>on</strong>siderati<strong>on</strong> the hardwarearchitecture, its capabilities and limitati<strong>on</strong>s. For example fixed point integer arithmetic is usedinstead of floating point arithmetic.5.3 Hardware Implementati<strong>on</strong>Finally, the models are c<strong>on</strong>verted into hardware using a combinati<strong>on</strong> of automatic netlist generati<strong>on</strong>tools and HDL coding to structural and RTL level.2011‐04‐01 Page 9 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1This method, illustrated in Figure 2 c<strong>on</strong>sists of taking the signal into the frequency domain using anFFT, multiplying it with the dedispersi<strong>on</strong> coefficients (phase shifts) then returning the signal into thetime domain using an inverse FFT.The FFT/IFFT lengths are relatively high and might range from 256M‐1G points. For this reas<strong>on</strong>, it isnecessary to develop an FPGA architecture for large FFTs and IFFTs. Such architecture would usemultiple ‘standard’ FFT cores to deal with these sizes and a sufficient amount of off chip memory. Itis estimated that 4 FFT cores will be needed per large FFT/IFFT pair and for every 150‐250MHz ofbandwidth (depending <strong>on</strong> the FGPA clock). The off chip memory requirement in bytes is at least 12times the size of the large FFT in order to store 2x3 blocks of data with 2 bytes per complex sample.The total bandwidth requirement is at least 12 times that of the FFT input data (1.8 GB/s‐3 GB/s)because 6 complex sample need to be read/written every clock cycle.6.3 The Polyphase Filter Bank MethodFigure 3 : The Polyphase filter bank dedispersi<strong>on</strong> methodIn this method, illustrated in Figure 3, the signal is split into a number of frequency channels using apolyphase filter bank, and then the FFT method is applied to de‐disperse each channel.The number of channels is likely to be in the range 64‐256 channels of 4MHz bandwidth. With 4‐8coefficients per channel, the filter bank prototype filter will have 256‐1024 coefficients and eachcoefficient will be represented by 8‐12 bits.In this method, the size of the FFTs in the dedispersi<strong>on</strong> part of the processing is smaller than in theprevious method. It is in fact divided by the number of channels. Therefore, the FFT length will be atmost 4M points.2011‐04‐01 Page 11 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1Let T be the time durati<strong>on</strong> of a buffer, BW the signal’s bandwidth and Ffpga the FPGA’s clockfrequency. There will be T x BW = N x M complex samples in a buffer and they have to be processedin no more than T x Ffpga cycles. Therefore, in order to be able to process data in real time:4 x T x BW x log 2 (T x BW / N)/(P x B) BW x log 2 (T x BW / N)


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1Figure 8 Rearranged 4 inputs Taylor treeThe Taylor tree algorithm is similar in its structure to the FFT algorithm. A comm<strong>on</strong> basic structure inthis algorithm is a Delay‐Add Butterfly as highlighted in Figure 8. It is very similar to the FFT’s‘Multiply‐Add Butterfly’.Inspired by FFT architectures, a possible incoherent dedispersi<strong>on</strong> architecture is presented in Figure9.Figure 9 Radix 2 incoherent dedispersi<strong>on</strong> architectureIn this architecture, data is stored in the off chip memory, it is then loaded into internal inputbuffers, passed through a Delay‐Add Butterfly and the result is stored in internal output buffer. Fromthere, output data is stored back in the off chip memory and the process is repeated until all theoperati<strong>on</strong>s are completed. The Delay‐Add Butterfly c<strong>on</strong>sists of two adders and a ‘d’ samples delay.The delay ‘d’ could be equal to 1 or whatever is deemed appropriate. The delay ‘d’ can also bevariable to allow the generalisati<strong>on</strong> of the Taylor tree algorithm which assumes unit delays and is<strong>on</strong>ly applicable to small bandwidths where the dispersi<strong>on</strong> curve is linear. By having a variable delay itmight be possible to apply the Taylor tree algorithm to large bandwidths if a good schedule of delayaddoperati<strong>on</strong>s can be found (using a computer program).2011‐04‐01 Page 16 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1Figure 12 Computati<strong>on</strong> time Vs The total number of sub bands processed using the tree algorithmThe best case scenario is when D = 1 and K =N. In this case, the de‐dispersi<strong>on</strong> computati<strong>on</strong> time isminimised and more importantly, there is no need to have extra storage for intermediatecomputati<strong>on</strong>s. The input data can be overwritten because it will not be needed anymore thereforethe memory size requirement drops to:TotalExtMemSize = Nt x M x WsAnd the computati<strong>on</strong> times becomes:TDD = M x (Nt x (1+2 x log2(Nt/B)) / B+ 2 x (B+1) x Nt/B) x (Ws x Ffpga) / BWmemTFFT = 2 x Nt/B x M x log2(M) x (Ws x Ffpga) / (BWmem x B)Now the memory size is divided by 3 and assuming the same values for the other parameters asabove, the memory requirement is 8.73 GB per FPGA which is more likely to be feasible andaffordable.2011‐04‐01 Page 22 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 18.6 Mapping a search system into <strong>Uniboard</strong>A <strong>Uniboard</strong> has 8 FPGAs and each <strong>on</strong>e of them has access to two DDR3 banks of 2GB each capable of800 M transfers/s of 64 bits.The parameters values that have been used in the previous secti<strong>on</strong> match specifically the <strong>Uniboard</strong>case. Therefore, for D x K =Nt with D < 64 and: Observati<strong>on</strong>Time = 600 sec; SamplingFrequency = 2GHz; Nt = 8192; M = 9.16E6; Ffpga = 200MHz; Ws =8 bits.The memory required per FPGA is 26.2 GB which is higher than the available 4GB. In the sec<strong>on</strong>d casewhere D = 1 and K =N, 8.73 GB per FPGA are required. If Ws is reduced to 4 bits then 4.4 GB perFPGA will be required and with a 10 % reducti<strong>on</strong> of the signal bandwidth 4 GB per FPGA will berequired. Thus, using <strong>on</strong>e <strong>Uniboard</strong> it is possible to implement a real time <str<strong>on</strong>g>pulsar</str<strong>on</strong>g> search engine for abandwidth of 900MHz and 1024 DMs.Another possibility is to use several UNIBOARDS, for example 4 boards, with Ws =4, would be able toprocess 1GHz and 8192 DMs.8.7 Summary and Looking AheadWe have shown that it is likely possible to coherently dedisperse a full 1 GHz of dual‐polarisati<strong>on</strong>bandwidth, appropriately divided into manageable subbands, with a single <strong>Uniboard</strong>. The number ofsubbands required results in a bandwidth per subband which is still large enough to achieve the timeresoluti<strong>on</strong> desired for the high precisi<strong>on</strong> timing experiments. Based <strong>on</strong> today’s prices a <strong>Uniboard</strong> inproducti<strong>on</strong> will cost about 10,000 euros. Given that in SKA phase <strong>on</strong>e we are looking at trying totime a few tens of <str<strong>on</strong>g>pulsar</str<strong>on</strong>g>s simultaneously this should be achievable for something <strong>on</strong> the order of250,000 euros. The expected power c<strong>on</strong>sumpti<strong>on</strong> of 350W per board would then result in a total ofabout 8kW. Each board would likely need a c<strong>on</strong>trol PC and the data would also need to be archived.In the case of the <str<strong>on</strong>g>pulsar</str<strong>on</strong>g> searching the current <strong>Uniboard</strong> could likely deal with the expectedbandwidth required for the Galactic plane search at a central frequency of about 800 MHz. Thebandwidth for that search is likely to be less than that assumed for the calculati<strong>on</strong>s above (basedmore <strong>on</strong> the 1‐2 GHz range) and by adjusting the number of bits required we could make it possibleto dedisperse and search about 500 MHz of bandwidth. A more detailed analysis of therequirements for the Aperture Array search would be required, but the total number of dispersi<strong>on</strong>trials is likely to be similar to the above, and the total data rate less, so it will probably also fit.Assuming 1GHz beams and a price per <strong>Uniboard</strong> of 10,000 euros per beam, and c<strong>on</strong>sidering thenumber of dispersi<strong>on</strong> trials we can estimate then it will cost about 5 euros per beam per DM. Thismetric is useful as <strong>on</strong>e of the limiting factors is the number of dispersi<strong>on</strong> measures which needs tobe calculated. As above assuming 350W power c<strong>on</strong>sumpti<strong>on</strong> per board then it will cost 0.17W perbeam per DM. However, these estimati<strong>on</strong>s are specific to <strong>Uniboard</strong> and should be c<strong>on</strong>sidered asupper limits because it is possible that the <strong>Uniboard</strong> resources and the FPGAs will not be fullyutilised.2011‐04‐01 Page 23 of 24


WP2‐<str<strong>on</strong>g>040.170.010</str<strong>on</strong>g>‐TD‐<str<strong>on</strong>g>001</str<strong>on</strong>g>Revisi<strong>on</strong> : 1It is estimated that something like 4000 beams will be needed in SKA Phase 1 and in this model <strong>on</strong>e<strong>Uniboard</strong> is needed for each beam. However it may be possible that the increased processingcapabilities, and hopefully increased amount of memory <strong>on</strong> <strong>Uniboard</strong> 2, or a subsequent board, willimprove the efficieny and reduce cost. As noted above it is possible that the capability of the fpgas in<strong>Uniboard</strong> will not be fully utilised and so a next step would be to c<strong>on</strong>sider how we might be able toimplement the computati<strong>on</strong>ally costly accelerati<strong>on</strong> search as part of this package.2011‐04‐01 Page 24 of 24

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!