This is really neat, but it seems to neglect an alternative (and just as simple) method. Instead of traversing each pixel of the source buffer, traverse each pixel of the destination buffer, sampling the correct pixel from the source. Different sampling methods trade off output quality against CPU time and memory-access patterns.
This method is just as simple to code, doesn't suffer from the missing-pixel aliasing problem of the article's naive forward-mapping method, and is also capable of higher-quality results than the shearing method.
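For concreteness, here's a rough sketch of the destination-traversal idea in Python with nearest-neighbour sampling. The function name and list-of-rows buffer representation are mine, not from the article:

```python
import math

def rotate_inverse(src, angle):
    """Rotate a 2-D pixel buffer about its centre by walking the
    DESTINATION buffer and sampling the source (inverse mapping).
    Nearest-neighbour sampling; destination pixels whose sample
    falls outside the source are left as 0."""
    h, w = len(src), len(src[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(angle), math.sin(angle)
    dst = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Apply the INVERSE rotation to the destination coordinate
            # to find where this pixel came from in the source.
            sx = c * (x - cx) + s * (y - cy) + cx
            sy = -s * (x - cx) + c * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                dst[y][x] = src[iy][ix]
    return dst
```

Every destination pixel gets written exactly once, so no holes can appear; swapping `round` for bilinear interpolation of the four neighbouring source pixels gives the higher-quality (and slower) variant.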
You have to traverse sqrt(2) times more pixels than in the original method unless you somehow know where the borders of the destination square are (additional computation).
This is true only if the source image is smaller than the destination image. In the article, the source image is shown as the same size as the destination, and includes a large white border that can be clipped without the inner image being affected.
In the event that the destination buffer is much larger than the source, that additional computation is trivial (it's the same calculation already being done for each and every pixel). Since it only needs to be done for the four corners, not per pixel, the additional time spent should be quite minimal.
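The corner calculation could look something like this (a hypothetical helper, not code from the article): forward-rotate the four source corners once, take their bounding box, and restrict the per-pixel destination loop to that box:

```python
import math

def rotated_bbox(w, h, angle):
    """Forward-rotate the four corners of a w-by-h source image about
    its centre and return the integer bounding box (xmin, ymin,
    xmax, ymax) the destination loop needs to cover."""
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    c, s = math.cos(angle), math.sin(angle)
    xs, ys = [], []
    for x, y in ((0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)):
        xs.append(c * (x - cx) - s * (y - cy) + cx)
        ys.append(s * (x - cx) + c * (y - cy) + cy)
    eps = 1e-9  # guard against floating-point noise at exact angles
    return (math.floor(min(xs) + eps), math.floor(min(ys) + eps),
            math.ceil(max(xs) - eps), math.ceil(max(ys) - eps))
```

Four corner rotations total, regardless of image size, which is why the extra cost is negligible next to the per-pixel work.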
The shearing method in the article is genuinely clever and totally cool, but I just can't shake the feeling that even on 1980s hardware, this method would be better. On modern hardware there's no question: it's still used to this day. Nowhere near as cool, though.