Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
What is the longest known sequence that repeats in Pi? (homelab) (sponaugle.com)
68 points by sponaugle on Aug 28, 2024 | hide | past | favorite | 84 comments


On a side note (it's mentioned in the article), it has been a while since I last visited OEIS (Online Encyclopedia of Integer Sequences), which has been of use over the years.

Very much appreciate the amazing effort that is OEIS!

I thought I should mention it to raise awareness: https://oeis.org/

The sequence in the article: https://oeis.org/A197123


There's a Neil Sloane (OEIS creator/curator) playlist from Numberphile which, unsurprisingly, often feature OEIS sequences.

https://www.youtube.com/playlist?list=PLt5AfwLFPxWJXQqPe_llz...


OEIS is amazing and in the same "spirit" I also love factordb. Another "Web 1.0" look which also exists since immemorial times.

It contains, well, factorization of many numbers.

For example here's RSA-250:

https://factordb.com/index.php?query=21403246502407449612644...

I hope sites like these continue to exists for a very long time.

I'd link to one or two other useful oldschool sites that I do still use but I'm not sure they could handle the /. (well, HN) effect.


OEIS is amazing to say the least. It is such an interest collection of sequences as well as a great collection of editors that are willing to keep it all in order! Certainly worth a mention and a contribution.


I wonder if you could do something with Bloom filters similar to:

  filters = {}
  candidates = []
  for i in range(len(pi)):
      block = pi[i:i+10]
      hash_prefix = block[:4]
      try:
          bloom = filters[hash_prefix]
      except KeyError:
          filters[hash_prefix] = BloomFilter().add(block)
          candidates.append((i, block))
          continue
      if block in bloom:
          candidates.append((i, block))
      else:
          bloom.add(block)
Basically, make one pass through to build a list of statistically likely candidates to evaluate later. Then use full string matching on those candidates to find the real matching sequences.

That might be helpful here because Bloom filters can say "we might have seen this string before" or "we definitely have not seen it before". That may greatly reduce the amount of storage you need.


Exactly this. Use a Bloom filter to find candidate solutions, write those out and review them later. Since you can write the entire candidate and its location, the review consists of a simple sort.

Digging in to estimate the speed, if your Bloom filter has a 0.1% chance of a collision then a trillion digits of pi will result in a billion candidates. This requires ~15 bits per entry or about 2TB of memory for the full Bloom filter. Using multiple passes on subsets of the data allows you to trade speed for space.

Note also that a 128 bit integer has room for 38 digits. This means that you can do all of the shifting using modulo, multiplication and addition operations. I would be very surprised if the cost of the disk I/O to read the raw digits is as fast as the sieving of candidates. As such, the speed of a single pass should be about the cost of reading about 500GB of data which means that multi-threading will be of limited use. This should be less than an hour on most machines. That means that one of my ancient Intel NUCs with 32GB should be able to scan the entire range in about a day no matter what size sequence we are looking for.


This is very nice, but I can see a way to optimize it even further.

Consider pi[i:i+10] and pi[i+1:i+11] they are highly similar. Intuitively it seems that storing them both seems wasteful. But how can we avoid storing them both? This is where we have to get clever.

We are looking for a length 18 match. If position a and position b are a length 18 match, then it holds that (a,b) is a length 10 match. (a+1,b+1) is also a length 10 match, and so is (a+2,b+2). Wouldn't it be nice if we could store only one of a,a+1,a+2 etc?

The naive attempt is to do the loop with a stride of say 3, but this will not work as it may happen that e.g. a=1 mod 3 and b=2 mod 3.

However there is another approach that does work. We need to use a position-independent way of deciding between a,a+1 etc. We can do it thusly. Loop over all the numbers. When we encounter position a we score the length 10 sequences starting at a, a+1.. a+8. Any scoring function will do as long as it doesn't have ties. For simplicity we can score=number. The highest scoring number is stored. As an implementation detail we can use a "sliding window maximum" algorithm to avoid recomputing and recomparing the scores.

Note that in this scheme we may actually end up storing more than one of a, ..a+8. But it will cut down the storage quite a bit at least.


That would be O(n^2) though, where n is the max length of the strings to compare, right?


No, it's possible to do much better https://www.geeksforgeeks.org/sliding-window-maximum-maximum...

Edit: It should also be mentioned that due to the conjectured normality of pi, comparing two sequences will on average terminate very quickly, just looking at one or two digits will be enough most of the time, so a comparison is expected O(1).


Indeed, the comparison length is 90% one hit, 99% two hits, etc.


My current method is O( (n/h)^2 ) where h is the hash size, which is really O(n^2) for n>>h. I'll have to think about this to see if it would be better.


I need to noodle on this. Hmm.


Oh that is interesting!.. and indeed I did consider doing a 'hash compression' where I would generate offsets to numbers that might be similar, then do direct comparison on those ones that are similar.

I'll take a crack at doing this - It could certainly be an improvement and would just be fun to try out.


Please let us know!


I read the original post and thought: "someone knows a more efficient way to do this."

Thanks for being that someone.


Full disclosure: that’s a wild guess. It may be helpful, or it might turn out to be a terrible idea for reasons I hadn’t considered. I’m definitely not claiming it is a better way.

That said, I’m curious now. I also refuse to allow myself to dig further into it. I’ve got enough spinning plates at the moment. :-)


Yep! I suspect there are lots of different ways to optimize this search, especially with some heuristic approaches. One of the great things about HN - There is always someone with a better way!


OK cool. But...

Finding random numbers repeating is a simple brute force problem. It is cool, but slightly boring.

Related to this, what I cannot wrap my head across, is the Infinite monkey theorem. Is it not possible that we keep expanding the numbers and never reach a complex enough set of values?


It is, as far as being a math problem, somewhat boring. The numbers themselves are equivalent to a random number. There is certainly no significance to the numbers themselves, especially as this is all in the decimal representation, which is very 'us' specific.

It does represent something 'interesting' that is easy to observe and wonder about, and for the non-math person it leads to slightly better understanding of 'normal' numbers, of the difference in representation ( these answers are base specific), and a very small amount of simple computer science.

The notion of world record is of course tongue in cheek. No one really needs to know these numbers, nor will they rememeber them! But at least there is an entry in OEIS!


WRT significance, do people ever find the position of socially significant numbers in pi? For example I recall quite a while ago, there was a leak of the key that was used to encrypt blue rays.

Hypothetically that could be found in pi, right? Of course, it ended up being spread widely enough that they gave up on trying to police it. But maybe there are some other not-allowed numbers that could be distributed as a position in pi? I’m not sure what the point would be, mostly just to make a point I guess.


https://www.angio.net/pi/ lets you search for sequences of up to 120 digit in the first 200M digits of pi. I'm sure there's others knocking around that might go further.


in a pure infinite monkey scenario, works of Shakespeare would be generated at some point


You can't guarantee that monkeys are typing perfectly randomly. We may never see better than "it was the best of times it was the blurst of times"


It doesn't need to be perfectly random; you just need every sequence to have a nonzero probability of being generated.

"to be or not to be" may be more likely for a monkey on a keyboard to generate than a random string, actually. The keys to generate it are highly repetitive and relatively close together. That's true for any NL string: the lower entropy of NL strings are reflected in the keyboard layout. If the monkeys switch to Dvorak, they could probably generate Shakespeare even faster.


  What, you egg?
    [He stabs him.]
(from Macbeth)

Are you certain the real thing isn't "the blurst of times" already?


Your monkeys are pretty bad at Shakespeare if the best they come up with "it was the best of times it was the blurst of times".



> Your monkeys are pretty bad at Shakespeare

Worse, they can’t even tell their Shakespeare from their Dickens (It’s the opening line of A Tale of Two Cities).


you misunderstand infinity


So if I randomly type letters from the top row of my keyboard forever, I'll eventually type the entire works of Shakespeare?

Of course not. There are lots of ways you can bias random that remain random, but prevent every possible output from being generated. That's all GP was saying. You can't just say "infinite time", you need to rule out biases that would prevent the desired result.


You don't need "perfect randomness". The claim was about "perfect randomness". You just need a non-zero probability for the outcomes in question. And you need independence between subsequent random events.


how so?


If you have infinite monkeys, then (at least) one of the monkeys is Shakespeare.


If you have finite monkeys, let's say they number M, but infinite time, then for small values of M it would be faster to breed the monkeys until one of them evolves into Shakespeare.


The binary representation of Pi contains both the MIT license and the GPL. I'm confused.


I think I saw the Blu-ray rip of “The Big Levowski” flit by (but it was dubbed in Finnish).


Extending that to all known digits would be a fun technical challenge. I once wrote a code to calculate positions where k subsequent decimal digits equal to the position index itself [1], but I had no storage to spare so I instead used a stream of digits from the public GCP dataset [2]. You would eventually want to try this approach if you are like me (i.e. don't want to pay for two-digit-TB storage).

[1] https://github.com/lifthrasiir/remote-pi-reader/

[2] https://storage.googleapis.com/pi100t/index.html (used to be `pi50t` back then)


Nice... I did that same problem as it was a numberphile episode. Very cool! Fortunately I have over a PB of local storage at home which helps out.



This can easily be optimized to use less memory, since pi is thankfully random.

Do N passes over the data, and at each pass only put in your hashmap the values that mod N equal the current iteration. If you take N~100, the runtime would probably be in the same ballpark, since the only thing that increase is streaming all the data N times. With a fast SSD, that's not that much.


Pedantic point here.

PI isn't random. It is exactly the ratio of a radius and the circumference of the circle.

Incommutability is not the same thing as 'random'. PI does not follow any sort of probability or statistical algorithm to be generated. Nothing inside PI is random...

Now... if you take a random number (introducing randomness in the first place) representing the position and another random number to represent the length. From PI those two random numbers will generated a random sequence. But the 'randomness' is not IN PI, the random elements are only in the 2 chosen numbers. I understand this is colloquially what people mean when they say PI is 'random'...

Randomness can only come about from choice, and the digits of PI have no choice as to what they will be. You can not read out a deterministic sequence of numbers in order and expect randomness. The randomness must come from some choice interacting with that defined sequence of numbers.


Not taking issue with you, slightly different question:

Do the digits of Pi pass statistical tests for randomness? Uniform distribution, etc?



Statistical tests? Yes. Mathematically proven to be normal? No.


This reminds me of Chaitin's work on the nature of randomness and his halting probability constant: https://en.wikipedia.org/wiki/Chaitin's_constant


Isn't this functionally the same thing that the author already did, just with N=10 and filtering on the first digit instead of the last?


Sure, I did read it too quickly, but why didn't he simply extend the generalized implementation instead of saying at the end that he was limited by the amount of RAM he has?

Those runtimes seem quite a lot slower than what I would have expected, and I'm pretty sure they could be optimized by a lot, especially since he sounds like he has enough RAM to hold the entire file.

Not trying to be dismissive, but this doesn't sound like that great of an implementation.


Indeed no worries about pointing that out! - there are lots of things that could optimize the RAM requirements, and especially reduce the runtime. Since the overall runtime is pretty quick to getting a single answer (3 hours) vs spending more time optimizing.. but absolutely there is a lot left on the table.

I think with some good optimization you could reduce the runtime significantly, especially on modern hardware.

As for RAM limitations - You are correct the only limitation is just that with the RAM I have I would need to do more iterations. It would be possible to do this in much less RAM with more iterations, or the reverse of course.

I was fond of solving the problem in RAM just as a way to limit the scope of the problem... but SSDs are indeed pretty fast at streaming data like this.


How odd. You have to read and copy the data from the web to your hard disk, to the SSD. Why not just run the program on the web copy, streaming. Tiny amount of ram for all the hash tables, and you do not need to move the data around. Multi threading would improve the runtime, but reading the table is the bottle neck.

He makes a great optimization, by using hash tables, but I was wondering about optimizing about the evaluation of the Chi-Square Test for Equal Proportions.

If the Chi-Square test diverges from the Chi-Square Test for e, then there is little likely hood for pi+e to ever converge on rationality.

My hypothesis is that because Pi is cyclical, and e is exponentially transcendental, that their sum and product are not rational, and nether are their respective powers ( Pi^e and e^Pi ), but those will take a much larger homelab than I have access to.


I have a local copy of 10 trillion digits of Pi on the math cluster, so no web access needed.


Is there a research paper is this?

Seriously, I find the most interesting works self-published in people’s personal blogs and I often wonder if there’s a paper in there and why it didn’t make it into a paper.

Maybe there’s a paper in finding the largest repeating sequence in Pi. There have certainly been more niche papers than that.


There isn't much incentive to publish academic papers outside of academia. It's a huge hassle and the juice generally isn't worth the squeeze.


Assuming that pi has an infinite number of digits, then the answer to "What is the longest known sequence that repeats in Pi?" is "The longest known sequence of Pi".


This isn't true as you can build an infinite sequence that never repeats. An example sequence in binary is (the number of 0s between each 1 increases by 1 every time)

01001000100001...


Exactly. A number with the property that every sequence occurs is called a rich or disjunctive number - a number can be rich in s specific bases or rich for all bases we don't know whether pi is any if that. A number where every sequence occurs equally often (scaled to the length of the sequence) is called a normal number, which is an even stronger property.


While Pi is not proven to be a disjunctive number, nor the stronger condition of being normal, it is generally believed to be normal. That being said we don't have a proof of Pi being normal, nor disjunctive.

I am not familiar with how a proof of that would be constructed, as clearly numerical or computational measurements could never be conclusive.


Starting from the fourth digit,

    01000100
But maybe I don’t understand your example


You didn't understand the original claim which is that because Pi has an infinite decimal representation, every subsequence of it has a repeat


Ahh, you’re saying something like every sequence containing at least 2 ones, does not repeat.

There may be some long repeats, but not all sequences repeat. Thanks!


Your example is governed by an arbitrary rule of your choosing. Pi is not.


That assumption is empirically supported, but is, as of now, conjectural and unproven. In fact, our ability to prove facts about the distribution of digits in pi is so weak that we can't even rule out statements such as "after a certain point, all digits of pi represented in base 10 are either 0 or 1."


The longest sequence of digits that eventually repeat (as defined in the article) is unboundedly large for any infinite sequence of digits, not just pi!

Using the pigeonhole principle, there must be at least one length N repeating string in the first N(N!+1) characters of any string.


Ha! Yea, that is true.. now if we only had a printout of it!


Post the IP of your printer, and I'll send you one.


1f:ca7:5ee5:b12d::ca7:eat5:b12d


Breaking news: As of this writing, OP is 23 hours old. OEIS has already been updated with the new terms OP discovered.


Can you do something fancy with FFTs and convolution here?

I _think_ you can but my brain doesn't have the bandwidth to think past that.


Isn't the entire known sequence of pi also known to repeat due to its nature?


You mean because the entire known sequence is finite, and every finite sequence occurs in pi infinitely often?

This is suspected but currently not yet proven.

In fact, we don't even have a proof yet that the digit 7 occurs infinitely often in pi. (Same for any other digit.) Right now it's conceivable, but very unlikely, that there is some final 7 in pi.


I suspect not. Pi is known to be irrational, and repeating sequences are known to be rational. So pi cannot be truely repeating.


The right algorithm for this problem starts with a linear time suffix sort.



Conjectured to be normal.


Yep - This really is just a random number problem.. it is only interesting in the sense that Pi is something that has been captivating in public math perception. Any random number would have the same characteristics..

It is however something fun to do in a homelab outside of the normal learning and playing around, and that has merit for me. Hobbies are hobbies in part because of the interest, joy and appreciation they drive.


In one sense you're right. The set of normal numbers has measure 1 so if you choose a real number randomly it will be normal and have the same properties.

In another sense you're very wrong. The set of uncomputable numbers also has measure 1. So while there are lots of normal numbers Pi belongs to a pretty exclusive club of numbers we can actually look at.

It's not a particularly deep result but the fact that the set of computable numbers is countable will never not make me existential. Every number that we can actually express, that we will ever know, is no bigger than the whole numbers.


How do you generate a random irrational number?


Generate a random Cauchy sequence. One trivial way to do this is to output a random string of digits. It's not perfect, but it works. Approximately 0% of random Cauchy sequences will be computable, so long as your random number generator doesn't have self-correlations.


Would this not require generating an infinite number of random numbers which you can not do? Or can you actually generate a random Cauchy sequence from a finite set of random numbers?


"which you can not do" is an interesting assertion. In some sense it's trivially true, of course. Generating infinite random numbers would take infinite time. In fact, I'm quite happy to say that approximately 0% of real numbers can exist, for that reason.

But in contrast to that, I'm also happy to grant the existence of constructions with countably infinite length when discussing theories outside of computability. So in that context, is it possible to generate an infinite string of random numbers? Well, it is so long as you believe you can generate any amount random numbers at all. Just repeat the process, whatever it is.

Is that a reasonable thing to believe? That's probably a philosophical question at this point. I'm certainly not equipped to answer it. But assuming it is possible allows for some interesting math, which I support for its own sake.

This is all sort of reminiscent of a mathematician friend's stance on the axiom of choice. If you were to tell them a vehicle is only guaranteed to not explode by the axiom of choice, they wouldn't use it. On the other hand, assuming the axiom of choice is true leads to some useful math, even if it's sort of sketchy ontologically. It's a lot like programming. Sometimes you can't fix the bugs in the underlying system and you just code around them to get your stuff working.


We were talking about somewhat different things, you meant doing it mathematically, sure, no problem, pick an infinite random sequence, I understood the question to mean how to do it in practice.

Actually I had something even stronger in mind, being able to pass the generated number around. For pi I can have a program that I can pass around and that will give everyone access to all the digits of pi, in base 16 we can even have random access to all the digits.

For generating a random Cauchy sequence things are not that easy. Locally I can just use a true random number generator - in case such a thing exists - and generate all the digits I am interested in on demand, storing them in case I have to look at a previously generated digit again. But I can not easily pass that number around, only the digits I have already generated. We would need a shared database where everyone can share all the generated digits.

Or I could try to replace the true random number generator with a pseudo random number generator, then I could just pass the seed around. Everyone would have to either agree how the sequence of random numbers is mapped to the sequence of digits or we would have to use a seekable random number generator. But this raises the question whether using a [specific] pseudo random number generator would still yield a normal number.


I think the answer is yes and no.

Let's say we knew we had a normal number, say we could prove pi is normal. Then the set of sequences defined by the decimal expansion of pi starting at the nth place for all n in the natural numbers would be uniformly random over all decimal expansions. I'm going this route because it in theory allows us to define each sequence by virtue of a single number rather than an infinite sequence of randint(0, 9). This I think would be our best case if it were possible. But you would have to select a number uniformly in the range [0, inf) which... you can't.


We actual know some normal numbers [1] but they seem to all be synthetic. I would guess they have a lot of local correlation because of the way they are constructed and the comment I responded to explicitly mentioned no self-correlations - not sure what exactly that means - but it probably indicates that being a normal number is not a strong enough condition for using its digits as a sequence of random digits.

[1] https://en.wikipedia.org/wiki/Normal_number#Properties_and_e...


> Pi is normal

ok... click

> Pi may be a normal number

huh...


Yeah, it looks really normal. If "it looks about right" is convincing then I guess there a bunch of professions you need to avoid, and mathematics is one of them.

We do not have, and perhaps will never get, proof that Pi is normal.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: