wc seems like a bad example because it’s basically I/O bound. The title should read, “Go’s built-in bufio reader is faster than raw reads on the file descriptor,” which it should be, because that’s the point of bufio.
No, it's not faster. They are comparing a Go version that does not decode characters (which is necessary to count words correctly in the presence of non-ASCII whitespace) to a C version that does decode characters (and matches them against a larger character set with iswspace).
This is easy to show by counting words and lines with wc separately. Word counting decodes characters (to find non-ASCII whitespace that may separate words), whereas line counting just looks for ASCII line separators:
$ time wc -w wiki-large.txt
17794000 wiki-large.txt
wc -w wiki-large.txt 0.48s user 0.02s system 99% cpu 0.496 total
$ time wc -l wiki-large.txt
854100 wiki-large.txt
wc -l wiki-large.txt 0.02s user 0.01s system 99% cpu 0.034 total
So, without character decoding, looking at every byte is ~15 times faster in wall-clock time (0.496s vs. 0.034s total). If you compiled wc without multibyte character support (which would be a fair comparison), it would probably beat Go without any parallelization.
Author here. I have addressed this in the article. The bufio-based implementation was the first one, and it was actually slower.
In the second section, I was able to surpass the performance of the C implementation - by using a read call with a buffer. As I mentioned in the article, the C implementation does the same, and in the interest of a fair benchmark, I set equal buffer sizes for both.
I don't know Go, so I am not sure what file.Read does exactly, but wc on my system, which uses the same wc.c as Penner's article, does blocking reads in a naive way, which I argue makes it somewhat of a paper target.
Maybe the Linux wc version you have is better. I don't know. Better versions do exist, and Penner's article gave a link to one. I am not sure, as you didn't link to the source of your wc.
(Edit: I just noticed you did use the OSX wc, you're just running it on Fedora. Sorry about that.)
But in any case, using just wall-clock time can be deceptive. Here is what I mean. On my machine, a somewhat old Mac Mini with a spinning disk, user CPU is only ~1/5 of the real time. The rest is spent in the OS or waiting for the disk.
% for ((i=0;i<250;i++)); do cat /usr/share/dict/words >> a; done
% /usr/bin/time wc -l a
58971500 a
1.24 real 0.28 user 0.64 sys