4B-4s

Design and Test of a Scalable Security Processor

Chih-Pin Su, Chen-Hsing Wang, Kuo-Liang Cheng, Chih-Tsun Huang* and Cheng-Wen Wu
Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30013
*Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 30013

Abstract: This paper presents a security processor that accelerates cryptographic processing in modern security applications. The processor supports popular cryptographic functions such as RSA, AES, hashing and random number generation. With the proposed Crypto-DMA controller, data gathering and scattering become flexible for security processing, using a simple descriptor-based programming model. The architecture of the security processor, with its core-based platform, is scalable and configurable to trade off performance, cost and power consumption. Different numbers of data channels and crypto-engines can be used to meet the specifications. In addition, a DFT platform is implemented for design-test integration. The security processor has been fabricated in 0.18 µm CMOS technology. The core area is [dimensions illegible] (approximately 525K gates) and the operating clock rate is 83 MHz.

I. INTRODUCTION

Driven by the booming Internet and wireless communication applications, many kinds of network processors have been implemented to process the increasing network traffic and handle complex protocols in various network services. Network processor design faces many challenges, e.g., bandwidth limitation, the need for flexible packet processing under the different workloads of diverse network applications, and the complex computation of network security.
Currently, the demand for security processing in all types of communication is rising, especially as applications such as Virtual Private Networks (VPN) and IP Security (IPSec) become more and more important.

Generally, a secure communication system ensures its safety using a specific protocol that mixes public-key cryptography and secret-key cryptography. IPSec and SSL (Secure Sockets Layer) are two popular protocols for this purpose. Traditionally, security processing is done using algorithm-specific accelerators, which are designed for certain security functions. Dedicated hardware delivers high performance with less programmability. On the other hand, extended cryptographic instructions can be implemented to improve software performance on existing microprocessors. However, the software approach falls behind in performance for computation-intensive security applications. Therefore, an efficient security processor has become an urgent need for this kind of security workload. In this paper, we present an integrated architecture of a scalable security processor. The core-based platform integrates our previous cryptographic IPs [1, 2, 3]. A DMA (Direct Memory Access) controller, called Crypto-DMA, is proposed to manage heterogeneous security functions. Our descriptor-based instruction format provides a simple programming model. The host processor can therefore manipulate the security applications effortlessly at the system level.

II. DESIGN OVERVIEW

Figure 1 shows the generic system of the network processor unit using our Security Processor (SP), which can be a system-on-chip (SOC) design. The on-chip bus is the Advanced High-performance Bus (AHB) of the Advanced Microcontroller Bus Architecture (AMBA) system [4]. The host processor, e.g., an ARM processor, manages the work flow of the entire system and executes system applications. The packet processor communicates with the outside and manipulates the ingress and egress packets.
Meanwhile, our SP handles cryptography-intensive computation. For many networking and communication applications, power consumption is an important factor in the overall performance. Therefore the system can have a power management unit with dynamic power control by the host processor, manipulating various low-power techniques such as clock gating, dynamic frequency and/or dynamic voltage levels.

[Figure 1 block labels: RAM, Host Processor, Power Management, Security Processor, PLL, Dynamic Voltage Generator, Packet Processor.]
Fig. 1. The generic architecture of the network processor.

Figure 2 shows the system architecture from the descriptor-based perspective. When an application requires cryptographic functions, the host processor maintains the record of current secure sessions and their corresponding keys and contexts. Once the host processor determines that a security operation is required, it creates a descriptor to guide the SP through the security operation, with the SP acting as a bus master. The descriptor can be stored in the main memory, or written directly to the descriptor buffer of the crypto-channel in the SP. Once the security process is complete, an interrupt is asserted to inform the host processor.

0-7803-8736-8/05/$20.00 ©2005 IEEE. ASP-DAC 2005

The descriptor-based scheme reduces the control overhead of the host processor. During the cryptographic/authentication operations of network applications such as IPSec and SSL, the packet/data can be fragmented on the input and output. The gathering and scattering of the encrypted or decrypted data/packets can be easily achieved by generating the data pointers properly in the descriptor. As a result, our SP, which inherits the advantages of descriptor-based DMA devices, takes over the data transfer and processing workload, leaving the host processor free for the system application, the sophisticated flow control and the exception handling.

[Figure 2 labels: AMBA System Bus, Host Processor, Security Processor, System Memory (Descriptor #1, Descriptor #2, Session Keys, Input Data #1 to #3, Output Data #1 to #3), Descriptor Pointer. The AES ECB descriptor shown contains, in order: the AES ECB descriptor header; pointer and length of the key; pointer and length of Input Data #1; pointer and length of Output Data #1; pointer and length of Input Data #2; pointer and length of Output Data #2; pointer and length of Input Data #3; pointer and length of Output Data #3; pointer to the next descriptor.]
Fig. 2. The system architecture and data structure of a descriptor.

III. DESCRIPTOR-BASED INSTRUCTION SET ARCHITECTURE

In Fig. 2, an AES ECB descriptor is shown as a descriptor example. Figure 3(a) presents the generic descriptor format, which consists of sixteen 32-bit words (64 bytes in total). The first word is the descriptor header that defines the security operation. It is followed by seven pairs of data pointer and data length. The data pointer refers to the address of the data block, while the data length indicates its size. The data to be transferred can be interpreted as keys, initial vectors, cipher text, plain text, etc.
The last word of the descriptor is an address pointer to the next descriptor.

The descriptor header defines the cryptographic function and its parameters for the security processing. As depicted in Fig. 3(b), the upper halfword of the header, the Descriptor Control halfword, identifies the cryptographic function. The most significant three bits (bits 31–29), called the CE (Crypto-Engine) tag, define the category of the operation. The CE tag is used as an identity to reserve the corresponding crypto-engine. Currently four categories of descriptors are implemented in the prototype: AES, HMAC, RSA and RNG (Random Number Generation). Extension to other cryptographic operations is easy. Two other fields in the Descriptor Control halfword are DH (Descriptor Header)-Type and NP (Number of Pairs). The DH-Type field further divides the descriptors with the same CE tag into different types, using bits 27 and 26. The NP field (bits 18–16) records the total number of data blocks referred to in the descriptor. The lower halfword of the header, the Crypto-Engine Control halfword, defines the private instruction for the specific crypto-engine. The instruction set architecture inherits the original design of each crypto-engine.

In our design, two other AES descriptors, i.e., the AES CBC and AES Data descriptors, are used. Unlike the AES ECB descriptor (see Fig. 2), the AES CBC descriptor requires an additional pointer to the initial vector. When the key remains unchanged, the AES Data descriptor can be used for burst data encryption and decryption. Other descriptor categories are defined in a similar way. For example, the HMAC function consists of two descriptor types: one for initialization of HMAC-SHA-1 or HMAC-MD5, and the other for data transfer. In addition, the RSA descriptor is used to complete the modular exponentiation in the RSA function.
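
The descriptor layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the field positions (CE tag in bits 31–29, DH-Type in bits 27–26, NP in bits 18–16, engine control in the lower halfword) and the CE tag values follow the text and Fig. 3(b), while the little-endian word order, the zero-filling of unused pairs, the use of 0 as a null next-descriptor pointer, and all function names and addresses are hypothetical.

```python
import struct

# CE tag values as they appear in Fig. 3(b) of the paper.
CE_TAGS = {"AES": 0b001, "HMAC": 0b010, "RSA": 0b011, "RNG": 0b100}
MAX_PAIRS = 7  # seven (pointer, length) pairs per descriptor


def make_header(ce_tag, dh_type, num_pairs, engine_ctrl):
    """Build the 32-bit descriptor header from its four fields."""
    if not 0 <= ce_tag <= 0b111:
        raise ValueError("CE tag must fit in 3 bits")
    if not 0 <= dh_type <= 0b11:
        raise ValueError("DH-Type must fit in 2 bits")
    if not 0 <= num_pairs <= MAX_PAIRS:
        raise ValueError("NP must be between 0 and 7")
    if not 0 <= engine_ctrl <= 0xFFFF:
        raise ValueError("engine control must fit in 16 bits")
    return (ce_tag << 29) | (dh_type << 26) | (num_pairs << 16) | engine_ctrl


def parse_header(header):
    """Split a 32-bit header back into its fields."""
    return {
        "ce_tag": (header >> 29) & 0b111,
        "dh_type": (header >> 26) & 0b11,
        "num_pairs": (header >> 16) & 0b111,
        "engine_ctrl": header & 0xFFFF,
    }


def pack_descriptor(header, pairs, next_ptr=0):
    """Pack header, up to seven (pointer, length) pairs and the next pointer
    into 16 words (64 bytes). Little-endian order and zero-filled unused
    pairs are assumptions of this sketch."""
    if len(pairs) > MAX_PAIRS:
        raise ValueError("a descriptor holds at most seven pairs")
    words = [header]
    for pointer, length in pairs:
        words += [pointer, length]
    words += [0, 0] * (MAX_PAIRS - len(pairs))
    words.append(next_ptr)
    return struct.pack("<16I", *words)


# Hypothetical AES example: key, one input block, one output block.
pairs = [(0x1000, 16), (0x2000, 64), (0x3000, 64)]
header = make_header(CE_TAGS["AES"], dh_type=0, num_pairs=len(pairs), engine_ctrl=0)
blob = pack_descriptor(header, pairs)
```

Fragmented input or output is expressed by adding more pairs, up to the seven the format allows, and longer jobs by chaining descriptors through the last word.
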
The RNG descriptor is used to retrieve a random number of arbitrary length. With descriptor-based data processing, the host processor can leave security-intensive computation to the SP with little control overhead.

[Figure 3(a): the generic descriptor is a descriptor header, followed by the pointer and length of Data #1 through Data #7, followed by the pointer to the next descriptor. Figure 3(b): the header spans bits 31 to 0, with the Descriptor Control halfword in bits 31–16 and the Crypto-Engine Control halfword in bits 15–0. CE tag values: AES 001, HMAC 010, RSA 011, RNG 100. Other field labels in the figure: DH Type, Number of Pairs, E/D, Mode, Key Size, OpCode, Include final word address, Include output address, DES, Recompute Key, Recompute IV, Run, SRC1, SRC2, SHA1/MD5.]
Fig. 3. (a) General format of a descriptor and (b) the definition of the descriptor header.

IV. SCALABLE SP ARCHITECTURE

The scalable architecture of the proposed SP is shown in Fig. 4. It consists of heterogeneous crypto-engines and a DMA-like interface, called the Crypto-DMA controller. The Crypto-DMA controller interprets the descriptors and manipulates the crypto-engines to perform the proper cryptographic operations. Direct access to a specific crypto-engine is also allowed from the external AHB interface. Multiple crypto-engines of the same and/or different cryptographic functions can be used, all with an AHB slave interface. Therefore, the effort of integration and extension is minimized. Design parameters in the SP, such as the number of crypto-channels, the number of crypto-engines, and the number of internal AHB buses (multi-layer AHB), are all scalable and configurable to meet different performance/area trade-offs. This paper describes a prototype that realizes the architecture shown in Fig. 4.
One Crypto-DMA with four crypto-channels has been implemented to control six crypto-engines, including two AES engines [1], two HMAC engines [3], one RSA engine [2], and one RNG engine. In this prototype, both the external and internal AHB interfaces are 32 bits wide.
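
To illustrate how the CE tag might be used to reserve one of several engines of the same function, the following is a minimal sketch of an engine pool for the prototype configuration (two AES, two HMAC, one RSA, one RNG). The CE tag values are read from Fig. 3(b); the class name, the first-free-in-priority-order policy and the convention of returning `None` when every matching engine is busy are assumptions made for illustration, since the paper only states that engine status and priority are considered.

```python
# Prototype engine set, keyed by CE tag (values as read from Fig. 3(b)).
# Each list is in priority order; the order itself is an assumption.
PROTOTYPE_ENGINES = {
    0b001: ["AES#1", "AES#2"],
    0b010: ["HMAC#1", "HMAC#2"],
    0b011: ["RSA#1"],
    0b100: ["RNG#1"],
}


class ResourceManager:
    """Hypothetical engine pool: reserves the highest-priority idle engine."""

    def __init__(self, engines_by_tag):
        self._priority = {tag: list(names) for tag, names in engines_by_tag.items()}
        self._owner = {}  # engine name -> id of the crypto-channel holding it

    def reserve(self, ce_tag, channel):
        """Return an idle engine for this CE tag, or None if all are busy."""
        if ce_tag not in self._priority:
            raise KeyError(f"no engine implements CE tag {ce_tag:03b}")
        for engine in self._priority[ce_tag]:
            if engine not in self._owner:
                self._owner[engine] = channel
                return engine
        return None  # the requesting channel would have to wait and retry

    def release(self, engine):
        """Mark an engine idle again after its descriptor completes."""
        del self._owner[engine]


manager = ResourceManager(PROTOTYPE_ENGINES)
first = manager.reserve(0b001, channel=0)   # highest-priority AES engine
second = manager.reserve(0b001, channel=1)  # the other AES engine
third = manager.reserve(0b001, channel=2)   # no AES engine left
```

With four crypto-channels and only two AES engines, a third concurrent AES descriptor has to wait until an engine is released, which is one reason the number of engines per function is a useful scaling parameter.
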

[Figure 4 block labels: External AHB; External AHB Slave Interface; External AHB Master with Transfer Engine; Crypto-DMA with Crypto-Channels #0 to #3, each with a Descriptor Buffer (16x32 bit), Input/Output Data FIFO (16x32 bit) and Register Files (3x32 bit); Main Controller (Instruction Decoder, Resource Manager, Microprogram Sequencer); Internal AHB Master with Transfer Engine; Internal AHB; Crypto-Engines AES#1, AES#2, HMAC#1, HMAC#2, RSA#1, RNG#1.]
Fig. 4. Scalable architecture of Security Processor.

A. Crypto-DMA

The Crypto-DMA serves as a bridge between the external system memory and the internal crypto-engines for data transfer. The control flow of the internal crypto-engines is also managed by the Crypto-DMA. Similar to typical descriptor-based DMA devices, the Crypto-DMA is activated by writing a descriptor pointer to one of the crypto-channels. The Crypto-DMA then fetches the descriptor and interprets its content based on the descriptor header. The basic procedure to process a descriptor is:

1) Assigning a specific crypto-engine according to the CE tag in the descriptor header. When there are multiple crypto-engines of the same function, their status and priority are also considered.
2) Initializing the targeted crypto-engine.
3) Transferring the data from the system memory to the input buffer or local memory of the targeted crypto-engine.
4) Activating the cryptographic processing.
5) Transferring the data from the output buffer or local memory of the targeted crypto-engine back to the external memory after receiving the completion interrupt from the crypto-engine.
6) Releasing the targeted crypto-engine.
7) Automatically fetching the successive descriptors from the system memory.

Although the programming sequences differ among crypto-engines, all of them can be partitioned into three kinds of subsequences: 1) the data input/output state, which transfers data into or out of the crypto-engine; 2) the control state, which configures the operation mode of the crypto-engine; and 3) the wait state, which waits for the data processing of the crypto-engine. Proper interleaving among the sequences of different crypto-channels is realized by the Main Controller in our Crypto-DMA. The interrupt mechanism between the Crypto-DMA and the crypto-engines is also used to provide better flow control.

B. Crypto-Engines

AES Engine: The AES engine is implemented based on the work in [1]. It supports standard AES encryption and decryption with 128-, 192- and 256-bit keys, in both ECB and CBC modes. In addition to the previous approach using cost-effective composite field arithmetic, this improved version reduces the critical path by moving the basis conversion hardware between GF(2^8) and GF((2^4)^2) outside the round iteration.

RSA Engine: The RSA engine [2] performs modular multiplication based on an enhanced word-based Montgomery algorithm, supporting scalable key lengths of up to 1024 bits.

HMAC Engine: The HMAC engine is based on the work in [3], which proposed an area-efficient integrated SHA-1/MD5 core.

RNG Engine: The RNG engine is a pseudo-random number generator that consists of five linear feedback shift registers (LFSRs) with different structures and data scrambling at the output. The fully synthesizable implementation makes it easy to integrate into the SP. Our RNG also provides a single-bit noise input. By introducing the 1-bit noise source into the RNG, the output becomes unpredictable. Our RNG has been validated by the random number tests in the FIPS 140-2 standard [5].

V. DFT PLATFORM

Following the core-based design methodology, the SP is partitioned into the Crypto-DMA and the crypto-engine cores. Each crypto-engine has been designed and verified individually. Therefore, the major verification effort focuses on the Crypto-DMA and the interconnection. In the design flow, test planning is taken into account from the beginning. Figure 5 shows our design and test integration flow to realize the SP design. Here we use the STEAC (SOC TEst Aid Console) [6] framework to help the test integration. All the crypto-engines come as synthesized scan-ready netlists. The DFT information of the crypto-engines, generated by a commercial ATPG tool, is submitted to STEAC to generate the first pass of test scheduling. The first-pass test scheduling provides a reference result of the overall test time with respect to the width of the TAM (Test Access Mechanism) bus and the number of scan chains for each IP. The test access architecture can be further adjusted based on this reference, resulting in an optimized test schedule and test access architecture in the second pass. Based on this result, the TAM circuitry, the test controller and the test wrappers are generated accordingly. These DFT circuits are then integrated into the SP with the typical synthesis flow. The test patterns of each core can also be translated to the system level by STEAC easily. The overall test architecture of the SP is shown in Fig. 6.

VI. IMPLEMENTATION RESULTS

Our chip has been fabricated using the UMC 0.18 µm CMOS standard library. Table I summarizes the area statistics of the SP. Test circuitry occupies less than 1% of the total gate count with negligible performance penalty. Our DFT wrappers reside in the bus interface, avoiding extra timing overhead on the critical path. A layout view of the whole chip and its floorplanning is depicted in Fig. 7. The total chip area is [dimensions illegible], of which the core area is [dimensions illegible]. The equivalent gate count is about 525K gates. The operating frequency is about 83 MHz under post-layout simulation. Total power consumption is about 383.5 mW.

Table II shows the peak performance considering the maximum operating frequency of the individual crypto-engine. In
