More

peterderivaz · 2026-01-10T12:38:30 1768048710

The Raspberry Pi contains a Videocore processor (I wrote the original instruction set coding and assembler and simulator for this processor).

This is a general purpose processor which includes 16 way SIMD instructions that can access data in a 64 by 64 byte register file as either rows or columns (and as either 8 or 16 or 32 bit data).

It also has superscalar instructions which access a separate set of 32-bit registers, but is tightly integrated with the SIMD instructions (like in ARM Neon cores or x86 AVX instructions).

This is what boots up originally.

Videocore was designed to be good at the actions needed for video codecs (e.g. motion estimation and DCTs).

I did write a 3d library that could render textured triangles using the SIMD instructions on this processor. This was enough to render simple graphics and I wrote a demo that rendered Tomb Raider levels, but only for a small frame resolution.

The main application was video codecs, so for the original Apple Video iPod I wrote the MPEG4 and h264 decoding software using the Videocore processor, which could run at around QVGA resolution.

However, in later versions of the chip we wanted more video and graphics performance. I designed the hardware to accelerate video, while another team (including Eben) wrote the hardware to accelerate 3d graphics.

So in Raspberry Pis, there is both a Videocore processor (which boots up and handles some tasks), and a separate GPU (which handles 3d graphics, but not booting up).

It is possible to write code that runs on the Videocore processor - on older Pis I accelerated some video decode sofware codecs by using both the GPU and the Videocore to offload bits of transform and deblocking and motion compensation, but on later Pis there is dedicated video decode hardware to do this instead.

Note that the ARMs on the later Pis are much faster and more capable than before, while the Videocore processor has not been developed, so there is not really much use for the Videocore anymore. However, the separate GPU has been developed more and is quite capable.

jacquesm · 2026-01-10T14:09:20 1768054160

You have the most interesting job!

Thank you, I've used your work quite a number of times now.

peterderivaz · on May 16, 2023

Project Euler challenge involving Fractran: https://projecteuler.net/problem=308

peterderivaz · on Dec 30, 2016

In practice I would expect Knuth's DLX algorithm to be much faster, both for generating the first solution, and for counting all solutions (for any reasonable sized N).

There are some timing results for a (well-optimized) backtracking solver here: http://www.jsomers.com/nqueen_demo/nqueens.html

N=21 takes about 600,000 seconds on an 800MHz computer, i.e. around 500.10^12 computations.

In comparison, the algorithm in the paper has a complexity higher than 8^N, which is around 9,000,000.10^12 for N=21.

The number of solutions is growing by about an order of magnitude (sequence https://oeis.org/A000170) for each increment of N.

taeric · on Dec 30, 2016

I like that page. Well written and amusing to see how long some of these solutions took back then. Most I ever read on this problem was Knuth's Dancing Links. My small attempt at it is here: http://taeric.github.io/DancingLinks.html

I posted your page as a submission hoping to get a discussion on it. Apologies for not asking first. Please let me know if you want me to try and remove the submission.

taeric · on Dec 30, 2016

What do you make of this paper's claim that this method is superior to backtracking methods?

peterderivaz · on Oct 19, 2016

It was my first job out of University to design this instruction set which may explain some of the quirkiness...

Initially the instructions did all set the status flags but it caused a tight feedback loop in the processor. The choice was between a higher clock frequency for all instructions or better 64-bit arithmetic.

None of the initial video applications needed 64-bit support so it lost out, although I did get to put in the divide instruction just so my Doom port could run faster :)

hermanhermitage · on Oct 19, 2016

Love the design of the instruction set. It was a lot of fun reverse engineering it and mulling over the design space. Thanks.

david-given · on Oct 19, 2016

That's awesome! Also, Doom uses divisions?

Are you allowed to tell us what the C compiler used internally was based on? I know there are some very easy to port proprietary compilers which commonly don't see the outside world, and I'm wondering whether it was one of those, or whether some poor sucker had to port gcc.

peterderivaz · on Oct 22, 2016

We paid a company called Metaware to make a compiler. I believe this compiler is still in use.

As it happens, while we were waiting for this compiler to be made for us, I ported GCC to the architecture for my own use. I don't remember it being all that painful, just a few pages of machine description and everything seemed to work fine.

This only supported the scalar instruction set. However, when we needed an MP3 decoder I found that it really needed 32bit precision to meet the audio accuracy, so I also made a different port of gcc that targeted the vector processor. I changed the float data type so any mention of float actually represented 16 lanes of a 16.16 fixed point data type implemented on the vector processor. From what I recall, mp3 decode required 2MHz of the processor for stereo 44.1kHz.

pcwalton · on Oct 19, 2016

> Also, Doom uses divisions?

Well, it's 3D, so you pretty much need perspective divide at the very least…

peterderivaz · on Dec 19, 2015

Not the original poster, but one thing that confused me was that I can clearly draw a graph that cannot be two coloured with a single pen stroke if I am allowed to reverse direction and keep drawing back over the same line again (indeed it is of course possible to draw any connected graph this way). I wasn't sure which part of the conditions forbid this.

Perhaps it is no longer a graph if I have a bidirectional edge? Or perhaps it is not considered planar if two edges coincide?

ColinWright · on Dec 19, 2015

Ah, it's not allowed to go back exactly over the same line, so you can't reverse direction. You can go back and revisit the same node, but it will require a separate edge, and the two parts of the stroke will then encompass another area.

I'll add that - thanks!

Edit: now added - it will go live when the page updates.

peterderivaz · on Aug 14, 2015

I've found using nolearn and lasagne on top of Theano made Theano much easier to use while still being in Python with access to familiar graphics routines.

lasagne gives you ways of constructing neural network layers (implemented as Theano functions).

nolearn sits on top of lasagne and gives a Scikit learn style interface that makes it trivial to set up a standard deep network to predict values from given input data.

Using nolearn was a very similar experience for me to using the Torch7 framework.

peterderivaz · on May 27, 2015

I think in this formula the altitude should be in feet, while the result is in miles, so 30,000 feet actually gives 213 miles.

peterderivaz · on April 6, 2015

I ran some experiments to investigate the discrepancy.

The largest differences were caused by:

1. A variable quantiser being used for HEVC, but not for VP9 (as you described)

2. Keyframes being forced every 30 frames for VP9 in the first paper

HEVC also had I frames added every 30 frames, but these were not IDR frames, meaning that B frames were allowed to use the information from the I frame in HEVC.

However, in VP9, true keyframes were forced every 30 frames. The way VP9 works this meant that every 30 frames it encoded both a new Intra frame, and a new golden frame.

Making both codecs use a true fixed quantizer and removing the forced intra frames made the results more like Google's own paper.

I guess the moral is to not force frequent keyframes when encoding with VP9.

peterderivaz · on Feb 18, 2015

I don't know Colin's logic, but my personal reasoning would be:

1. If this were true then it gives a simple, efficient, deterministic method to test for primes. (You can compute P_n modulo n efficiently using matrix exponentiation.)

2. If there was such a method I am sure I would have heard of it before - and people wouldn't bother using Miller-Rabin or the complicated deterministic primality method

3. Therefore there must be a flaw...

ColinWright · on Feb 18, 2015

Actually, there is a polynomial algorithm for testing primality, the AKS primality test.

http://en.wikipedia.org/wiki/AKS_primality_test

In essence, my reasoning was simply that primes don't behave like this. Any connection with addition-type stuff is spurious and doesn't go on forever.

Mind you, when I'd searched up to 10^5 and still hadn't found a counter-example I was starting to doubt my intuition. The first counter-example is when the sequence predicts that 521^2 is prime.

In fact, if p is prime then p divides k(p). I find that non-obvious, but do follow and believe the proof. Even so, I don't know it well enough to feel enlightened by it.

The work continues.

peterderivaz · on Feb 18, 2015

The list is here: http://oeis.org/A013998

The first is P_(271441) with 33,150 digits

ColinWright · on Feb 18, 2015

From that page:

    EXTENSIONS

    ... Further terms beyond those shown here have been
    computed by cdw10(AT)cix.compulink.co.uk (C Wright)

I should contact Neil and get him to change that email address.