
Figure 2: TPU v1 design schema (source [3])

TPU v1 is manufactured on a 28 nm process with a die size of ≤ 331 mm², a clock speed of 700 MHz, 28 MB of on-chip memory, 4 MB of 32-bit accumulators, and a 256×256 systolic array of 8-bit multipliers. Since each of the 65,536 multipliers performs a multiply and an accumulate every cycle, we get 700 MHz × 65,536 × 2 ≈ 92 tera-operations per second. This is remarkable performance for matrix multiplication: Figure 3 shows the TPU circuit board and the flow of data for the systolic matrix multiplication performed by the MMU. In addition, TPU v1 has 8 GB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth. The external memory is standard, and it is used to store and fetch the weights used during inference. Notice also that TPU v1 has a thermal design power of 28-40 watts, which is certainly low consumption compared to GPUs and CPUs. Moreover, TPU v1 is normally mounted in a PCI slot used for SATA disks, so it does not require any modification to the host server [3]. Up to 4 cards can be mounted in each server. Figure 3 shows the TPU v1 card and the process of systolic computation:

Figure 3: On the left you can see a TPU v1 board, and on the right an example of how the data is processed during the systolic computation
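To make the peak-throughput arithmetic and the systolic multiply-accumulate idea concrete, here is a minimal Python/NumPy sketch. It is not TPU code: the function name systolic_matmul and the per-step outer-product loop are illustrative assumptions modeling the weight-stationary accumulation; only the hardware constants (256×256 MACs, 700 MHz clock, 8-bit inputs, 32-bit accumulators) come from the text above.

import numpy as np

# Peak throughput: each of the 256 x 256 = 65,536 MAC units performs one
# multiply and one add per 700 MHz clock cycle, i.e., 2 operations per cycle.
MACS = 256 * 256
CLOCK_HZ = 700e6
peak_ops = MACS * CLOCK_HZ * 2
print(f"Peak: {peak_ops / 1e12:.1f} tera-operations/sec")  # ~91.8, quoted as 92

# Toy model of the MMU's weight-stationary matrix multiply: weights stay
# resident in the array while activations stream through, and partial sums
# build up in 32-bit accumulators, one rank-1 update per contraction step.
def systolic_matmul(activations, weights):
    # activations: (batch, k) int8, weights: (k, n) int8 -> (batch, n) int32
    acc = np.zeros((activations.shape[0], weights.shape[1]), dtype=np.int32)
    for k in range(weights.shape[0]):
        acc += np.outer(activations[:, k].astype(np.int32),
                        weights[k, :].astype(np.int32))
    return acc

a = np.random.randint(-128, 127, size=(4, 256), dtype=np.int8)
w = np.random.randint(-128, 127, size=(256, 8), dtype=np.int8)
assert np.array_equal(systolic_matmul(a, w),
                      a.astype(np.int32) @ w.astype(np.int32))

The assertion simply checks that the step-by-step accumulation gives the same result as an ordinary matrix multiplication; on the real chip, all 65,536 multiply-accumulates of a step happen in parallel.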

If you want to compare TPU performance against GPUs and CPUs, you can refer to Jouppi et al. [3] and see (in a log-log scale graph) that the performance is two orders of magnitude higher than that of a Tesla K80 GPU.

The graph shows a "roofline": performance grows until it reaches a peak, and from that point on it stays constant. The higher the roof, the better the performance.

Figure 4: TPU v1 peak performance can be up to 3x higher than a Tesla K80

Second-generation TPU

The second-generation TPUs (TPU2) were announced in 2017. In this case, the memory bandwidth is increased to 600 GB/s and performance reaches 45 TFLOPS. Four TPU2s are arranged in a module delivering 180 TFLOPS, and 64 modules are then grouped into a pod with 11.5 PFLOPS of performance. TPU2s adopt floating-point arithmetic, and therefore they are suitable for both training and inference.
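As a back-of-the-envelope check of these numbers, the short Python sketch below applies the standard roofline formula, attainable = min(peak, bandwidth × operational intensity), to the TPU v1 figures quoted earlier, and verifies the TPU2 module and pod arithmetic. The operational-intensity values are arbitrary examples, not measurements from the paper.

# Roofline model: attainable performance is bounded either by the compute
# peak or by memory bandwidth times operational intensity (ops per byte).
def roofline(peak_ops_per_s, bandwidth_bytes_per_s, operational_intensity):
    return min(peak_ops_per_s, bandwidth_bytes_per_s * operational_intensity)

# TPU v1 figures from the text: 92 TOPS peak, 34 GB/s of DDR3 bandwidth.
for oi in (10, 100, 1000, 10000):   # ops per byte (example values only)
    tops = roofline(92e12, 34e9, oi) / 1e12
    print(f"intensity {oi:>5} ops/byte -> {tops:.2f} TOPS")

# TPU2 scaling reported above: 45 TFLOPS per chip, 4 chips per module,
# 64 modules per pod.
module_tflops = 4 * 45                   # 180 TFLOPS per module
pod_pflops = 64 * module_tflops / 1000   # 11.52 PFLOPS per pod
print(module_tflops, "TFLOPS per module,", pod_pflops, "PFLOPS per pod")

Until the intensity is high enough, attainable performance is capped by the 34 GB/s memory interface; beyond that point the 92 TOPS "roof" takes over, which is exactly the flat region described above.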

