This is an interesting topic and I'd really like to know the answer. But many SIMD and vector machines have had gather/scatter, or even general instruction sets that include indirection. The article kind of says this, but tries to distinguish between limited table lookups and general indirection. Maybe this is really only relevant to SSE/AVX models?
I guess it's kind of useless to try to reverse-engineer definitions from marketing-derived terms. But if anyone sees a bright line between a GPU and a bunch of Intel hardware threads with vector units, please share.
The difference really is more in the architecture than in the programming model or the instruction set. SIMT programs, such as OpenCL kernels, can indeed be accelerated with SSE/AVX, but it is not that efficient in silicon utilization. Intel's cores are fat cores optimized for single-threaded programs. But if you don't care about single-threaded performance and optimize for throughput in SIMT programs, the fat core is not a great idea. It is more efficient to hide latency to DRAM with computation rather than through caches, superscalar and out-of-order execution, register renaming, and so on. Expand Intel's SMT (hyperthreading) from 2 threads to 1024 threads, get rid of all the machinery that's there to hide latency, use the saved silicon area to add more cores, and you're looking at something that resembles a GPU.
There isn't really a bright dividing line between SIMD and "SIMT". Note that this article was written in 2011, in particular before the introduction of AVX2 and its VPGATHERDD instruction (which was slow in Haswell, but has seen improvements in Skylake), so you can't even use scatter/gather as a place to draw the line anymore.
I'm sure others will be able to chime in with examples of SIMD instruction sets for various architectures predating AVX2 that included true scatter/gather operations.
"SIMT" really is just a programming model that maps down to SIMD execution; even back in 2011 NVIDIA GPUs were SIMD machines. Scatter, gather, and predication features in your SIMD ISA make the SIMT -> SIMD mapping fast for the general case, to the point that no one really bothers using SIMT to target an ISA lacking them. But you could.
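The SIMT -> SIMD mapping for a divergent branch can be sketched in plain C. This is an illustrative lowering of the hypothetical SIMT source `if (x > 0) y = x * 2; else y = -x;` for a 4-wide "warp": the backend evaluates both sides in every lane, then selects per lane with a mask (predication), roughly what a blend/select instruction does:

```c
#include <stdint.h>

/* Predicated lowering of a divergent SIMT branch: run BOTH paths across
 * all lanes, then pick the right result per lane via a bitwise blend
 * (the moral equivalent of VPBLENDVB or an LLVM vector select). */
void predicated_kernel(int32_t *y, const int32_t *x, int lanes) {
    for (int lane = 0; lane < lanes; ++lane) {
        int32_t mask = -(x[lane] > 0);      /* all-ones if taken, else 0 */
        int32_t then_val = x[lane] * 2;     /* "if" path   */
        int32_t else_val = -x[lane];        /* "else" path */
        y[lane] = (then_val & mask) | (else_val & ~mask);
    }
}
```

Both paths always execute, which is why divergent warps lose throughput on real GPUs too: the hardware does essentially this, just with the masking built in.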
For me the power of SIMT over SIMD is in its programming model. It becomes especially interesting for the more complex cases where you have data-dependent branching. No, it's not going to use the full power of your GPU anymore, but at least you can branch without restructuring your whole code as you'd have to in SIMD (and potentially making it less efficient by adding memory accesses to avoid branching). Likewise, modelling the performance of a GPU kernel is much easier than on a CPU. I found that the roofline model, with some modifications, gives quite predictable results. This is mainly because the memory model and the computational cores are all quite simple in comparison.
Perhaps a good dividing line would be the presence of mask operands in most instructions, allowing branch predication per SIMD lane. This should include all modern GPUs as well as AVX-512, but exclude AVX2.
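What a per-instruction mask operand buys you can be sketched in plain C. This emulates AVX-512-style masking on an 8-lane add (illustrative semantics, not real intrinsics): with merge-masking, lanes whose mask bit is clear keep the destination's previous value; with zero-masking they are cleared:

```c
#include <stdint.h>

/* Emulated AVX-512-style masked add: one mask bit per lane, applied by
 * the instruction itself rather than by a separate blend afterwards. */
void masked_add(int32_t dst[8], const int32_t a[8], const int32_t b[8],
                uint8_t mask, int zero_masking) {
    for (int lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];  /* lane enabled */
        else if (zero_masking)
            dst[lane] = 0;                  /* zero-masking */
        /* else: merge-masking leaves dst[lane] untouched */
    }
}
```

Because the mask is part of the instruction, a divergent SIMT branch needs no extra blend step, which is what makes the SIMT mapping onto AVX-512 tighter than onto AVX2.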