Figure 2: TPU v1 design schema (source [3])

TPU v1 is manufactured on a 28 nm process with a die size ≤ 331 mm², a clock speed of 700 MHz, 28 MB of on-chip memory, 4 MB of 32-bit accumulators, and a 256×256 systolic array of 8-bit multipliers. Since each of the 65,536 multiplier-accumulators performs a multiply and an add every cycle, the peak throughput is 700 MHz × 65,536 × 2 ≈ 92 tera-operations per second. This is remarkable performance for matrix multiplication: Figure 3 shows the TPU circuit board and the flow of data for the systolic matrix multiplication performed by the MMU. In addition, TPU v1 has 8 GB of dual-channel 2133 MHz DDR3 SDRAM offering 34 GB/s of bandwidth. This external memory is standard and is used to store and fetch the weights used during inference. Notice also that TPU v1 has a thermal design power of 28-40 watts, which is certainly low consumption compared to GPUs and CPUs. Moreover, TPU v1 cards are normally mounted in a slot used for SATA disks and connect over PCIe, so they do not require any modification to the host server [3]. Up to 4 cards can be mounted in each server. Figure 3 shows a TPU v1 card and the process of systolic computation:

Figure 3: On the left you can see a TPU v1 board, and on the right an example of how the data is processed during the systolic computation
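To make the systolic data flow shown in Figure 3 concrete, the following is a minimal NumPy sketch of an output-stationary systolic array (an illustration only, not the TPU's actual microarchitecture): operands are skewed so that at cycle t the processing element at row i, column j receives A[i, k] and B[k, j] with k = t - i - j, and adds their product to a local 32-bit accumulator. The function name systolic_matmul and the small test shapes are chosen here purely for illustration.

import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary systolic array."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.int32)       # one 32-bit accumulator per processing element
    last_cycle = (M - 1) + (N - 1) + (K - 1)   # cycle at which the bottom-right PE gets its final operands
    for t in range(last_cycle + 1):
        for i in range(M):
            for j in range(N):
                k = t - i - j                  # operand pair reaching PE(i, j) at cycle t
                if 0 <= k < K:
                    C[i, j] += np.int32(A[i, k]) * np.int32(B[k, j])   # 8-bit multiply, 32-bit accumulate
    return C

# Quick check against a plain matrix multiplication with small 8-bit operands.
A = np.random.randint(-128, 128, size=(4, 6)).astype(np.int8)
B = np.random.randint(-128, 128, size=(6, 5)).astype(np.int8)
assert np.array_equal(systolic_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))

# Peak throughput of the real 256×256 array: 65,536 MACs × 2 operations × 700 MHz ≈ 92 TOPS.
print(256 * 256 * 2 * 700e6)   # ~9.2e13 operations per second

In real hardware all the multiply-accumulates for a given cycle happen in parallel across the grid; the triple loop above only serializes that schedule so the data movement is easy to follow.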
If you want to compare TPU performance with GPUs and CPUs, you can refer to Jouppi et al. [3] and see (in a log-log scale graph) that the performance is two orders of magnitude higher than that of a Tesla K80 GPU.

The graph shows a "roofline": performance grows until it reaches the peak and then stays constant. The higher the roof, the better the performance.

Figure 4: TPU v1 peak performance can be up to 3x higher than a Tesla K80

Second-generation TPU

The second-generation TPUs (TPU2) were announced in 2017. In this case, the memory bandwidth is increased to 600 GB/s and per-chip performance reaches 45 TFLOPS. Four TPU2s are arranged in a module delivering 180 TFLOPS, and 64 modules are grouped into a pod with 11.5 PFLOPS of performance. TPU2s adopt floating-point arithmetic and are therefore suitable for both training and inference.
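As a quick sanity check of these scaling figures, here is the arithmetic in Python (a sketch only; the variable names are illustrative and the pod figure in the text is rounded):

chip_tflops = 45.0
module_tflops = 4 * chip_tflops              # 4 chips per module -> 180 TFLOPS
pod_pflops = 64 * module_tflops / 1000.0     # 64 modules per pod -> 11.52 PFLOPS, quoted as 11.5
print(module_tflops, pod_pflops)             # 180.0 11.52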