
Tensor Processing Unit

The sequential implementation of matrix multiplication is time consuming for large matrices. A brute-force computation has time complexity O(n³) for n×n matrices, so it is not feasible for large-scale computations.
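For reference, here is a minimal NumPy sketch of the brute-force algorithm; the three nested loops make the O(n³) cost explicit (the function name is mine, for illustration only):

```python
import numpy as np

def brute_force_matmul(X, W):
    """Naive matrix multiplication: O(n^3) for square n x n inputs."""
    n, m = X.shape
    m2, p = W.shape
    assert m == m2, "inner dimensions must match"
    Y = np.zeros((n, p))
    for i in range(n):           # rows of X
        for j in range(p):       # columns of W
            for k in range(m):   # inner (dot-product) dimension
                Y[i, j] += X[i, k] * W[k, j]
    return Y
```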

First-generation TPU

The first-generation TPU (TPU v1) was announced in May 2016 at Google I/O. TPU v1 [1] supports matrix multiplication using 8-bit arithmetic. It is specialized for deep learning inference and does not support training; training requires floating-point operations, as discussed in the following paragraphs.
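To give a feel for 8-bit inference, here is a minimal NumPy sketch of symmetric int8 quantization; the function name and scaling scheme are illustrative assumptions, not TPU v1's exact quantizer. The multiply-accumulate runs entirely in integers, with a single floating-point rescale at the end:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric quantization to int8 (illustrative, not TPU v1's exact scheme)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

X = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 3).astype(np.float32)
Xq, sx = quantize_int8(X)
Wq, sw = quantize_int8(W)

# Integer multiply-accumulate, then one floating-point rescale.
Y_q = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)
print(np.max(np.abs(Y_q - X @ W)))  # small error: fine for inference, not for gradients
```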

A key function of the TPU is "systolic" matrix multiplication. Let's see what this means. Remember that the core of deep learning is a matrix product Y = X*W where, for instance, the basic operation to compute Y[i,0] is:

Y[i,0] = X[i,0]*W[0,0] + X[i,1]*W[1,0] + … + X[i,n]*W[n,0]
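A quick NumPy check of this expansion against the full matrix product (a throwaway illustration):

```python
import numpy as np

X = np.random.randn(5, 4)
W = np.random.randn(4, 3)
Y = X @ W

# Y[i, 0] is the dot product of row i of X with column 0 of W.
i = 2
y_i0 = sum(X[i, k] * W[k, 0] for k in range(X.shape[1]))
assert np.isclose(Y[i, 0], y_i0)
```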

"Systolic" matrix multiplication allows multiple Y[i, j] values to be computed in

parallel. Data flows in a coordinated manner and, indeed, in medicine the term

"systolic" refers to heart contractions and how blood flows rhythmically in our

veins. Here systolic refers to the data flow that pulses inside the TPU. It can be

proven that a systolic multiplication algorithm is less expensive than the brute force

one [2]. TPU v1 has a Matrix Multiply Unit (MMU) running systolic multiplications

on 256×256 cores so that 64l,000 multiplications can be computed in parallel in one

single shot. In addition, TPU v1 sits in a rack and it is not directly accessible. Instead,

a CPU acts as the host controlling data transfer and sending commands to the TPU

for performing tensor multiplications, for computing convolutions, and for applying

activation functions.
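To make the data-pulse idea concrete, here is a minimal sketch that simulates an output-stationary systolic array in NumPy. It is a toy model under simplifying assumptions (square matrices, no real hardware timing): at clock tick t, the product X[i,k]*W[k,j] reaches cell (i,j) when t = i+j+k, so entire anti-diagonal wavefronts of cells update in parallel on each tick.

```python
import numpy as np

def systolic_matmul(X, W):
    """Toy simulation of an output-stationary systolic array computing Y = X @ W."""
    n = X.shape[0]                 # assume square n x n matrices
    Y = np.zeros((n, n))
    for t in range(3 * n - 2):     # the wavefront takes 3n - 2 ticks to drain
        for i in range(n):         # in hardware, all (i, j) cells fire at once
            for j in range(n):
                k = t - i - j      # X[i,k] * W[k,j] arrives at cell (i,j) at tick i+j+k
                if 0 <= k < n:
                    Y[i, j] += X[i, k] * W[k, j]
    return Y

X = np.random.randn(6, 6)
W = np.random.randn(6, 6)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

In real hardware the two inner loops collapse into a single parallel tick, which is how a 256×256 array performs 65,536 multiply-accumulates per cycle.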

The communication CPU ↔ TPU v1 happens via a standard PCIe 3.0 bus. From this perspective, TPU v1 is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU. However, TPU v1 has the ability to run whole inference models, reducing its dependence on the host CPU. Figure 2 represents TPU v1, as shown in [3]. As you can see in the figure, the processing unit is connected via a PCIe port, and it fetches weights from a standard DDR4 DRAM chip. Multiplication happens within the MMU with systolic processing, and activation functions are then applied to the results. The MMU and the unified buffer for activations take up a large amount of space, and there is also a dedicated area where the activation functions are computed.
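As a rough mental model of this datapath (a sketch with hypothetical names, not Google's implementation): the host streams in activations over PCIe, the TPU fetches weights from its DRAM, the MMU multiplies, and the activation unit post-processes.

```python
import numpy as np

def tpu_v1_step(x, weights, activation=lambda z: np.maximum(z, 0)):
    """Toy model of one TPU v1 inference step (hypothetical structure)."""
    w = weights           # stand-in for the weight fetch from DDR4 DRAM
    y = x @ w             # stand-in for the 256x256 systolic MMU
    return activation(y)  # stand-in for the dedicated activation unit (ReLU here)
```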

