> Regulators have never dictated where auditable logs must live.
That’s true. They specify that logs cannot be lost, must be available for x years, must not be modifiable, must be accessible only in y ways, and cannot cross various boundaries/borders (depending on the government in question). Or bad things will happen to you (your company).
In practice, this means that the durability of that audit record “a thing happened” cannot simply be “I persisted it to disk on one machine”. You need to know that the record has been made durable (across whatever your durability mechanism is, for example a DB with HA + DR) before progressing to the next step. Depending on the stringency, RPO needs to be zero for audit, which is why I say it is a special case.
I don’t know anything about Linux audit; I doubt it has any relevance to regulatory compliance.
> In practice, this means that durability of that audit record “a thing happened” cannot be simply “I persisted it to disk on one machine”
As long as the record can be located when it is sought, it does not matter how many copies there are. The regulator will not ask so long as your system is a reasonable one.
Consider that technologies like RAID did not exist once upon a time, and backup copies were slow to make and expensive. Yet we still considered the storage (which was often just a hard copy on paper) to be sufficient to meet the applicable regulations. If a fire then happened and burned the place down, and all the records were lost, the business would not be sanctioned so long as they took reasonable precautions.
Here, I’m not suggesting that “the record is on a single disk, that ought to be enough.” I am assuming that in the ordinary course of business, there is a working path to getting additional redundant copies made, but those additional copies are temporarily delayed due to overload. No reasonable regulator is going to tell you this is unacceptable.
> Depending on the stringency, RPO needs to be zero for audit
And it is! The record is either in local storage or in central storage.
> And it is! The record is either in local storage or in central storage.
But it isn’t! Because there are many hardware failure modes that mean you aren’t getting your log back.
For the same reason that you need acks=all in Kafka for zero data loss, or synchronous_commit = on with a synchronous standby (i.e., waiting for the remote WAL flush) in PostgreSQL, you need to commit your audit log to more than the local disk!
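To make that concrete, the producer side is roughly this (a minimal sketch assuming the confluent-kafka Python client; the broker address, topic name, and replication settings are placeholders, not anything from this thread):

```python
# Minimal sketch: the audit record is only treated as "recorded" once every
# in-sync replica has acknowledged it (acks=all). Broker address and topic
# name are placeholders; the topic is assumed to have replication.factor >= 3
# and min.insync.replicas >= 2.
from confluent_kafka import KafkaException, Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # avoid duplicates when retrying
})

def write_audit_record(payload: bytes) -> None:
    """Block until the record is replicated, or raise."""
    errors = []

    def on_delivery(err, msg):
        if err is not None:
            errors.append(err)

    producer.produce("audit-events", value=payload, on_delivery=on_delivery)
    producer.flush()  # drives delivery callbacks; returns once acks arrive
    if errors:
        raise KafkaException(errors[0])
    # Only now is it safe to move on to the next step of the business action.
```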
If your hardware and software can’t guarantee that writes are committed when they say they are, all bets are off. I am assuming a scenario in which your hardware and/or cloud provider doesn’t lie to you.
In the world you describe, you don’t have any durability when the network is impaired. As a purchaser I would not accept such an outcome.
> In the world you describe, you don’t have any durability when the network is impaired.
Yes, the real world. If you want durability, a single physical machine is never enough.
This is standard distributed computing, and we’ve had all (most) of the literature and understanding of this since the ’70s. It’s complicated, and painful to get right, which is why people normally default to a DB (or cloud managed service).
The reason this matters for this logging scenario is that I normally don’t care if I lose a bit of logging in a catastrophic failure case. It’s not ideal, but I’m trading RPO for performance. However, when regs say “thou shalt not lose thy data”, I move the other way. Which is why the streams are separate. It does impose an architectural design constraint because audit can’t be treated as a subset of logs.
> If you want durability, a single physical machine is never enough.
It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, EBS io2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like to design for similar loss probabilities.
Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all.
> when regs say “thou shalt not lose thy data”, I move the other way. Which is why the streams are separate. It does impose an architectural design constraint because audit can’t be treated as a subset of logs.
There's no conflict between treating audit logs as logs -- which they are -- and having separate delivery streams and treatment for different retention and durability policies. Regardless of how you manage them, it doesn't change their fundamental nature. Don't confuse the nature of logs with the level of durability you want to achieve with them. They're orthogonal matters.
> It absolutely can be. Perhaps you are unfamiliar with modern cloud block storage, or RAID backed by NVRAM? Both have durability far above and beyond a single physical disk. On AWS, for example, EBS io2 Block Express offers 99.999% durability. Alternatively, you can, of course, build your own RAID 1 volumes atop ordinary gp3 volumes if you like to design for similar loss probabilities.
Certainly you can solve for zero data loss (RPO=0) at the infrastructure level. It involves synchronously replicating that data to a separate physical location. If your threat model includes “fire in the dc”, reliable storage isn’t enough. To survive a site catastrophe with no data loss you must maintain a second, live copy (synchronous replication before ack) in another fault domain.
In practice, in my experience, this is done at the application level rather than trying to do so with infrastructure.
> There's no conflict between treating audit logs as logs -- which they are -- with having separate delivery streams and treatment for different retention and durability policies
It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.
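Concretely, with PostgreSQL you can pay the remote-ack cost only on the audit path by raising synchronous_commit per transaction. A rough sketch (assuming psycopg2, a primary that already has synchronous_standby_names pointing at a standby in another fault domain, and made-up DSN/table names):

```python
# Sketch only: just the regulated audit write waits for the synchronous
# standby; everything else keeps the cheaper server default. Assumes
# synchronous_standby_names is already configured on the primary.
# The DSN and table name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app host=primary.example.internal")

def write_audit_record(event: str) -> None:
    """Commit returns only after the standby has flushed the WAL."""
    with conn:  # commits on successful exit
        with conn.cursor() as cur:
            # Applies to this transaction only; normal log traffic is unaffected.
            cur.execute("SET LOCAL synchronous_commit TO 'on'")
            cur.execute(
                "INSERT INTO audit_events (event) VALUES (%s)", (event,)
            )
    # The commit above blocked on the remote ack, so the caller may now
    # proceed to the next step.

write_audit_record("customer consent recorded")
```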
> Again, auditors do not care -- a fact you admitted yourself! They care about whether you took reasonable steps to ensure correctness and availability when needed. That is all.
I care about matching the solution to the regulation, which varies considerably by country and use-case. However, there are multiple cases I have been involved with where the stipulation was “you must prove you cannot lose this data, even in the case of a site-wide catastrophe”. That’s what RPO zero means. It’s DR, i.e., after a disaster. For nearly everything an RPO of 15 minutes is good, if not great. Not always.
> It matters to me, because I don’t want to be dependent on a sync ack between two fault domains for 99.999% of my logs. I only care about this when the regulator says I must.
If you want synchronous replication across fault domains for a specific subset of logs, that’s your choice. My point is that treating them this way doesn’t make them not logs. They’re still logs.
I feel like we’re largely in violent agreement, other than whether you actually need to do this. I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details. As far as I know, no one - no one - has ever been held to account by a regulator for losing covered data due to a catastrophe outside their control, as long as they took reasonable measures to maintain compliance. RPO=0, in my experience, has never been a requirement with strict liability regardless of disaster scenario.
> I suspect you’re overengineering to meet an overly stringent interpretation of a requirement. Which regimes, specifically, dictated that you must have synchronous replication across fault domains, and for which set of data? As an attorney as well as a reliability engineer, I would love to see the details.
I can’t go into details about current cases with my current employer, unfortunately. Ultimately, the requirements go through legal and are subject to back and forth with representatives of the government(s) in question. As I said, the problem isn’t passing an audit, it’s getting the initial approval to implement the solution by demonstrating how the requirement will be satisfied. Also, cloud companies are in the same boat, and aren’t certified for use as a result.
This is the extreme end of when you need to be able to say “x definitely happened” or “y definitely didn’t happen”. It’s still a “log” from the application’s perspective, but really more of a transactional record that has legal weight. And because you can’t lose it, you can’t send it out the “logging” pipe (which for performance is going to sit in a memory buffer for a bit, a local disk buffer for longer, and then get replicated somewhere central); instead, you send it out a transactional pipe and wait for the ack.
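In code terms the split looks something like this (purely illustrative; log_pipe and audit_store are invented stand-ins for “buffered logging pipeline” and “synchronously replicated store”):

```python
# Illustration of the two pipes described above; every name here is made up.
import queue

log_pipe = queue.Queue()  # drained asynchronously; lossy on a crash

def log(line: str) -> None:
    # Fire and forget: buffers in memory / on local disk before shipping centrally.
    log_pipe.put_nowait(line)

def audit(audit_store, event: str) -> None:
    # Transactional pipe: block until the store acknowledges that the record
    # is durable across fault domains. If this raises, the business action
    # must not proceed.
    audit_store.write_and_wait_for_ack(event)  # hypothetical client method

def transfer_funds(audit_store, src: str, dst: str, amount: int) -> None:
    log(f"transfer requested {src}->{dst} amount={amount}")        # best effort
    audit(audit_store, f"transfer approved {src}->{dst} amount={amount}")
    # Only after the audit ack do we actually move the money.
```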
Having a gov tell us “this audit log must survive a dc fire” is a bit unusual, but dealing with the general requirement “we need this data to survive a dc fire” is just another Tuesday. An audit log is nothing special if you are thinking of it as “data”.
You’re a reliability engineer: have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?
> have you never been asked to ensure data cannot be lost in the event of a catastrophe? Do you agree that this requires synchronous external replication?
I have been asked this, yes. But when I tell them what the cost would be to implement synchronous replication in terms of resources, performance, and availability, they usually change their minds and decide not to go that route.