2080 RTX performance on Tensorflow with CUDA 10 (pugetsystems.com)
87 points by olavgg on Oct 5, 2018 | 35 comments


https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks...

These numbers match up with the performance we measured in our own tests, posted last week. The Titan V is simply too expensive for Deep Learning. The 2080 Ti is far and away the best GPU from a price/performance perspective.

As mentioned in the article, the only possible reason you might want a Titan V is if you care about FP64 performance, which rules out just about anyone training neural networks.


The author recommends the Titan V without justifying its $3k price. The 1080 series costs less than half that with comparable benchmarks. Am I missing something?


The article says:

"I am doing experimental work where I really need to have double precision, i.e. FP64. The Titan V offers the same stellar FP64 performance as the server-oriented Tesla V100."
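Not from the article, but a toy NumPy sketch of the kind of precision cliff FP64 avoids: a small update that FP32 simply cannot represent near 1.0.

    import numpy as np

    # Toy illustration: an update of 1e-8 is below FP32 resolution near 1.0,
    # so it silently disappears, while FP64 still resolves it.
    eps = 1e-8

    x32 = np.float32(1.0) + np.float32(eps)
    x64 = np.float64(1.0) + np.float64(eps)

    print(x32 - np.float32(1.0))   # 0.0    -> the update vanished in FP32
    print(x64 - np.float64(1.0))   # ~1e-08 -> FP64 keeps it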


The benchmarks where the 1080 doesn't even compete are the FP16/Tensor Core ones.


Many state-of-the-art models won't train well on FP16. But for inferencing it's extraordinarily good. 2x1080Ti is the sweet spot for FP32 training on a "budget" at the moment.


Got any sources? Was thinking about buying one just for the tensor cores, but if this is the case I probably won't.


You can even see it in the author's comments on the original article:

"When I first looked at fp16 Inception3 was the largest model I could train. Inception4 blew up until I went back to fp32. Mixed precision needs extra care, scaling of gradients and such. Still I think it is a good thing. What I really want to test is model size reduction for inference with TensorRT targeted to tensorcores. I think that is probably the best use case. Non-linear optimization is just too susceptible to precision loss."

There was also an NVIDIA video presentation recommending mixed FP32/FP16 training instead of pure FP16.


Mixed precision training can give you Tensor Core speedups. Paper: https://arxiv.org/abs/1710.03740 A toolkit that implements it on top of TensorFlow: https://github.com/NVIDIA/OpenSeq2Seq
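For anyone who wants to try this today, a minimal sketch using the current tf.keras mixed-precision API (TF 2.4+, so not what the 2018 benchmarks ran; the toy model and layer sizes are placeholders):

    import tensorflow as tf
    from tensorflow.keras import layers, mixed_precision

    # Compute in float16 on the Tensor Cores, keep master weights in float32.
    mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        layers.Dense(4096, activation="relu", input_shape=(2048,)),
        layers.Dense(4096, activation="relu"),
        # Keep the final softmax in float32 for numerical stability.
        layers.Dense(1000, activation="softmax", dtype="float32"),
    ])

    # LossScaleOptimizer performs the dynamic gradient scaling mentioned above.
    opt = mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")
    # model.fit(train_dataset, epochs=...)  # placeholder dataset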


"Inference" or "inferring". "Inferencing" isn't a word any more than "defencing" is.


"Inference" is a term of art in machine learning jargon so those replacements you propose have much broader meanings than the original and are not suitable replacements.


This is outrageous. Consider math terminology as a so-called "term of art." An attractor, in dynamical systems, is something which attracts. A repellor is something which repels. Should it be called a "repellence"? Wouldn't that mean something undesirable? Should the opposite of "attractor" be a "repugnor," because repugnance is the opposite of attraction? What is the corresponding correct form for something that repels? Repellor (or repeller, for those who prefer American suffixes).

Forms matter. Colloquial meanings also matter, but not as much, particularly when they're an egregious violation of English and decency.


"I do like the RTX 2080Ti but I just love the Titan V! The Titan V is a great card and even though it seems expensive from a "consumer" point of view. I consider it an incredible bargain." .. quite hard to read as a phd student..


As a student, you get more offers of free access to university resources and the like than someone who just "works".


Compared to other equipment in research environments it's incredibly cheap.


Usually you don't buy such equipment yourself anyway.


Considering the V100 runs around $10-11k and the Titan V provides similar performance for around $3k, the author isn't wrong.


32GB vs 12GB. Enables a lot more. If you don't care about memory and FP64, 2080Ti would be a much better deal than Titan V. V100 vs Quadro RTX 8000 would be more interesting.

Still, I think 2x1080Ti is a better deal than 1x2080Ti and costs the same.


Frankly it probably is, if only because you can run batch jobs on one card while still doing more experimental work on the other.
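One low-tech way to do that split (a sketch; which physical card is "0" vs "1" is an assumption, and the variable must be set before CUDA is initialized):

    import os

    # Pin this process to the second card; a long-running training job can own GPU 0.
    # Must be set before TensorFlow (or any other CUDA-using library) is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import tensorflow as tf
    print(tf.test.gpu_device_name())  # the selected card now shows up as /device:GPU:0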


Another thing that's not clear from the benchmarks: the Titan V has both more Tensor Cores and much higher memory bandwidth thanks to HBM2. I'd be curious how much each of those affected the results.

Also, the 2080 Ti can do lower-precision math (INT8/INT4) in its Tensor Cores, while the Titan V cannot.


Although the RTX 2080 Ti performs significantly better than the 1080 Ti, I'm still drawn towards the 1080 Ti: I can buy two second-hand 1080 Tis for the price of one new 2080 Ti, giving me twice the memory, and the combined FP32 performance of two 1080 Tis is much better than one 2080 Ti's.

I'm using my GPUs to train large sequence-to-sequence models (with long sequences) that need FP32 for training and can use FP16 only for inference, so I can't even use the FP16 Tensor Core performance for mixed-precision training.

The only disadvantage is that the energy costs are higher using two 1080 Ti's compared to one 2080 Ti.


Does anyone know why they are using Xeon processors instead of AMD Threadrippers? Is it the support for ECC memory? If so, why is that so important?

Example: https://www.pugetsystems.com/nav/peak/tower_single/customize...


The Xeon W-2175 has AVX-512. Threadrippers can't compete on number crunching relative to price point on well-optimized code.

sgemm on 5000x5000 matrices takes about 600ms on a Threadripper 1950X, but only around 150ms on the comparably priced i9-7900X. Vector libraries for special functions, e.g. Intel VML or SLEEF, show a similar performance advantage there.
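Easy to sanity-check with NumPy linked against MKL or OpenBLAS (a rough sketch; absolute numbers will depend on the BLAS build and thread count):

    import time
    import numpy as np

    n = 5000
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    a @ b  # warm-up: spin up BLAS threads, touch the memory

    t0 = time.perf_counter()
    c = a @ b  # sgemm via whatever BLAS NumPy was built against
    dt = time.perf_counter() - t0

    # A dense matmul costs ~2*n^3 floating point operations.
    print(f"{dt * 1000:.0f} ms, ~{2 * n**3 / dt / 1e9:.0f} GFLOP/s")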

If you're mostly crunching numbers, and you either compile the code you run with AVX-512 enabled (e.g., -mprefer-vector-width=512 on gcc; otherwise it's effectively disabled) or use explicitly vectorized libraries, you will see dramatically better performance from AVX-512, regardless of any thermal throttling. Number crunching is what it's made for.

Granted, you should be offloading most of those computations to the GPU, which will be many times faster. But if you're in the business of ML or statistics, I'd still weigh that more heavily than the difference in how long it takes them to compile code.


> Granted, you should be offloading most of those computations to the GPU, which will be many times faster. But if you're in the business of ML or statistics, I'd still weigh that more heavily than the difference in how long it takes them to compile code.

I don't follow the logic. It sounds like you're saying that if you care about that specific type of highly vectorized computation being fast what you really want is a GPU rather than any particular CPU. So how should that have a major influence on which CPU you choose? Particularly when the CPU which is slower at that is faster at many other things that aren't suitable for a GPU.


I'm saying there is a reason to favor a CPU with AVX-512. That reason may not apply to you or your workflow.

If your number crunching is just neural networks on your GPU, then the CPU doesn't matter.

But there's probably a lot of overlap between the folks who train neural networks and those who do linear algebra, MCMC, or traditional stats that are much better suited to the CPU. That is, conditioning on person A being someone who trains NNs, there is a higher probability that they're interested in CPU-intensive tasks that benefit from vectorization. If that isn't you, don't factor it into your decision.

I do most of my number crunching on the CPU, so my choice is clear. The reviews of AVX-512 are generally poor ("disable it so you don't get thermal throttling!"), while the Threadrippers receive a lot of praise. But within its own niche (linear algebra, many iterative algorithms), the widest vectors are king.


Isn't linear algebra one of the other things GPUs are good at?

I think you're also looking at the release prices for the CPUs rather than the current ones. Using today's prices from Newegg, the Threadripper 1950X is $699 and the (newer/faster) 2950X is $859, while the i9-7900X is $1275, up from its $989 release price, presumably due to Intel's current manufacturing issues. And the AMD processors have 60% more cores/threads with, AVX notwithstanding, generally equivalent per-thread performance.

I expect you're right that there are niche workloads where avx512 is a real advantage, but it's starting from a pretty deep hole on the price/performance front in general.


AMD usually and historically supports ECC memory. In fact, in some ways it supports it more than Intel: Intel disables ECC support on non-Xeon processors (for no real reason other than market segmentation and being able to charge more that way), while AMD keeps it enabled on most models, even desktop-oriented ones.


The Xeon they are using has 14 cores in a single NUMA node, so maybe that's it, since a 16-core Threadripper is two separate NUMA nodes (dies) in one package. I'm pretty sure Threadripper supports ECC: "With the most memory channels you can get on desktop, the Ryzen™ Threadripper™ processor can support Workstation Standard DDR4 ECC (Error Correcting Mode) Memory to keep you tight, tuned and perfectly in sync." from https://www.amd.com/en/products/ryzen-threadripper


I can definitely confirm the ECC support on TR cpus. I've got one running right now and have had it report errors and corrections when trying to overclock things (I don't know what I'm doing with that so it's nice to have it warn me that I'm at the limit).
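For reference, on Linux those corrections typically surface through the kernel's EDAC subsystem; a minimal sketch for reading the counters (assumes the EDAC driver for your memory controller is loaded):

    from pathlib import Path

    # Corrected (ce) and uncorrected (ue) ECC error counts per memory controller,
    # exported by the kernel's EDAC subsystem in sysfs.
    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
        ce = (mc / "ce_count").read_text().strip()
        ue = (mc / "ue_count").read_text().strip()
        print(f"{mc.name}: corrected={ce} uncorrected={ue}")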


How does that warning surface? From the underlying operating system?


Because it's a single-root PCIe complex. See here: https://www.servethehome.com/how-intel-skylake-sp-changes-im...


Could you please explain your comment? The link you provided explains that with the new Intel Xeon Scalable generation it is difficult to implement a single-root PCIe complex on typically available motherboards, while according to [1] "the new Intel® Xeon® W processors are based on the Intel® Xeon® Scalable processor microarchitecture". Therefore, Intel Xeon W should have the same problems supporting a single-root PCIe complex as the Xeon Scalable parts mentioned in the link you provided.

[1] https://www.intel.com/content/www/us/en/processors/xeon/xeon...


Are there benefits to using FP32 vs FP16? I've been dabbling with deep learning but I'm not really sure how much effect higher precision is having. Though more precision is better, I suppose.


Traditionally, deep learning frameworks all used FP32.

With FP16 you can theoretically get 2x the speed and 2x larger models with the same VRAM capacity. For inference with INT8/INT4 it can be even better (good for embedded stuff). The downside is that sometimes more complex/deep models don't converge (or converge less often than with FP32). Sometimes there are also framework issues with the more advanced FP16 features.
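For the embedded INT8 case, post-training quantization in TensorFlow Lite is one route (a sketch of the TF 2.x converter API, not something from the thread; `model` and `calibration_batches` are placeholders):

    import tensorflow as tf

    def representative_dataset():
        # A few hundred real input batches so the converter can calibrate ranges.
        for batch in calibration_batches:   # placeholder iterable of float32 arrays
            yield [batch]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)  # trained FP32 model
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())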


From experience I know that models using RNNs have trouble training with FP16 precision. The common solution is to train in FP32 and run inference in FP16. To make this happen you often have to implement custom code (e.g. using TensorFlow or Keras as a meta-framework).
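A minimal Keras sketch of that pattern (assumes a `build_model` factory and an FP32 checkpoint, both placeholders): build the graph once in FP32 to load the trained weights, then rebuild it in FP16 and load the cast weights for inference.

    import numpy as np
    import tensorflow as tf

    # Train-time model: default float32 weights, loaded from the checkpoint.
    fp32_model = build_model()                    # placeholder model factory
    fp32_model.load_weights("seq2seq_fp32.h5")    # placeholder FP32 checkpoint
    fp32_weights = fp32_model.get_weights()       # list of float32 numpy arrays

    # Inference-time model: same architecture, float16 variables and activations.
    tf.keras.backend.set_floatx("float16")
    fp16_model = build_model()

    # Cast the trained weights down and load them into the FP16 graph.
    fp16_model.set_weights([w.astype(np.float16) for w in fp32_weights])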


Granted, the benchmark only covers training, but for a chip that spends significant die space on dedicated AI circuitry the performance gain over the previous generation is disappointing.



