ELI5: how (specifically) do GPU and TPU optimisations affect determinism in LLMs? Or is this all a myth?
Can you suggest a good reference for understanding which algorithms map well onto the regular grid systolic arrays used by TPUs? The fine article says dense matmul and convolution are good, but is there anything else? Eigendecomposition? SVD? Matrix exponential? Solving Ax = b or AX = B? Cholesky?
SVD/eigendecomposition will often boil down to many matmuls (e.g. when using Krylov-based methods such as Arnoldi, Krylov-Schur, etc.), so I would expect TPUs to work well there. GMRES, one method to solve Ax = b, is also based on the Arnoldi decomposition.
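To make that concrete, here is a minimal Arnoldi sketch in JAX (my own toy code, not from the article): the matrix A only ever appears inside dense matvecs/matmuls, which is exactly the kind of work the MXU is built for.

```python
import jax.numpy as jnp

def arnoldi(A, v0, m):
    """m-step Arnoldi decomposition: A @ Q[:, :m] ~= Q @ H."""
    n = A.shape[0]
    Q = jnp.zeros((n, m + 1), A.dtype).at[:, 0].set(v0 / jnp.linalg.norm(v0))
    H = jnp.zeros((m + 1, m), A.dtype)
    for j in range(m):
        w = A @ Q[:, j]                       # the hot operation: a dense matvec
        h = Q[:, :j + 1].T @ w                # orthogonalisation is also a matmul
        w = w - Q[:, :j + 1] @ h
        beta = jnp.linalg.norm(w)
        H = H.at[:j + 1, j].set(h).at[j + 1, j].set(beta)
        Q = Q.at[:, j + 1].set(w / beta)
    return Q, H
```

Blocked or batched Krylov variants turn those matvecs into matmuls, which suits the hardware even better.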
I think https://jax-ml.github.io/scaling-book/ is one of the best references to go through. It details how single-device and distributed computations map to TPU hardware features. The emphasis is on mapping transformer computations, both forwards and backwards, so it requires some familiarity with how transformer networks are structured.
Anything that you can express as 128x128 (but ideally much larger) dense matrix multiplication and nothing else
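As a rough illustration of why shape matters (my own numbers, assuming the 128x128 MXU tile mentioned above): as far as I understand, XLA pads awkward shapes up to the tile size anyway, so dimensions that are already multiples of 128 waste no MXU cycles. `pad_to_multiple` below is a hypothetical helper just to show the effect.

```python
import jax.numpy as jnp

def pad_to_multiple(x, tile=128):
    """Zero-pad every dimension of x up to the next multiple of `tile`."""
    return jnp.pad(x, [(0, (-d) % tile) for d in x.shape])

a = jnp.ones((300, 500))                     # padded to (384, 512)
b = jnp.ones((500, 200))                     # padded to (512, 256)
c = pad_to_multiple(a) @ pad_to_multiple(b)  # roughly 40% of the padded FLOPs land on zeros
# c[:300, :200] equals a @ b; the padding only burns MXU cycles.
```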
How can someone have this level of knowledge about TPUs without working at Google?
Everything that's in the blog post is basically well known already. Google publishes papers and gives talks about its TPUs. Many details are lacking, though, and require some assumptions/best guesses. JAX and XLA are (partially) open source and give clues about how TPUs work under the hood as well.
https://arxiv.org/abs/2304.01433
https://jax-ml.github.io/scaling-book/
From the acknowledgment at the end, I guess the author has access to TPUs through https://sites.research.google/trc/about/
This is not the only way, though. TPUs are available to companies operating on GCP as an alternative to GPUs, with a different price/performance point, so that is another way to get hands-on experience with them.
does that cooling channel have a NEMA stepper on it as a pump or metering valve?[0]
If so, wild. That seems like overkill.
[0]: https://henryhmko.github.io/posts/tpu/images/tpu_tray.png
definitely closed-loop, might even be a servo
Cool article!
> In essence, caches allow hardware to be flexible and adapt to a wide range of applications. This is a large reason why GPUs are very flexible hardware (note: compared to TPUs).
This is correct but mis-stated - it's not the caches themselves that cost energy but the MMUs that automatically load/fetch/store to the cache on "page faults". TPUs don't have MMUs, and furthermore they are a push architecture (as opposed to pull).
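A toy way to picture the difference (plain Python, no real GPU/TPU APIs, just my own sketch): in a pull model the hardware reacts to misses at run time, while in a push model every transfer is scheduled ahead of its use, e.g. via double buffering, so there is nothing to react to.

```python
DATA = list(range(8))  # stand-in for tiles sitting in HBM

def pull_style():
    cache = {}
    for i in range(len(DATA)):
        if i not in cache:          # run-time miss: cache/MMU logic decides to fetch
            cache[i] = DATA[i]
        yield cache[i]              # compute waits on the fill

def push_style():
    buf = [DATA[0], None]           # double buffer in on-chip scratch memory
    for i in range(len(DATA)):
        if i + 1 < len(DATA):
            buf[(i + 1) % 2] = DATA[i + 1]  # next tile pushed while this one is used
        yield buf[i % 2]

assert list(pull_style()) == list(push_style())
```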
I thought that it would be about 3D printer filament.