My experience with exporting Postgres RDS partitioned tables to S3

metadata · on May 15, 2022

Scripting data movement is easy only for small and easy jobs. With many thousands of tables and more than a few TB, all kinds of issues start popping up. I read somewhere that 85% of large data migration projects fail. Data warehouses really need an optimal Parquet file sizes to work efficiently, and for Snowflake it's roughly 100-200MB per file. The good way to copy that is relying on DB statistics to determine the optimal number of records per chunk. Then, to have the job finish in a reasonable time, one needs to read a certain number of data chunks in parallel and stream that data into Parquet at S3 (or Azure Blobs). Once the data is up, Snowflake can ingest it.

Shameless plug: my company created a commercial solution which does exactly that (https://www.spectralcore.com/omniloader). Happy to answer any questions.

metadat · on May 16, 2022

I knew it, it was you.

singlow · on May 15, 2022

Might consider using the Athena Federated Query. Essentially it is a lambda that knows how to get data from a database instance and write it to s3. It is very well optimized for partitioned tables.

https://github.com/awslabs/aws-athena-query-federation/tree/...

Once that is configured, execute a Create-Table-As (CTAS) or insert query using a select from the configured lambda-backed catalog.

I have only tryed the mysql version, but they docs indicate they are very similar.

gcbirzan · on May 15, 2022

RDS export also has a bug where at some point it will not allow you to make new jobs (let alone list the old ones) because something internally times out when trying to serialise the list of tables from the job. I'm surprised it worked for you, I think we have around 100k tables too and it fails.

higeorge13 · on May 16, 2022

Ah yeah, i forgot to write about this issue. We also never made it to work when providing all the list of partitioned tables, the api call was failing. We were fortunate to just provide the list of 7 tables as input and made it work (somehow).

Marazan · on May 15, 2022

Ive had issues in the other direction.

Importing data using the copy from s3 command has a nasty failure state if the client loses connection to the db AND the import command has a problem.

The query hangs with a row exclusive lock and the query _cannot_ be killed. Only option was to reboot the DB.

higeorge13 · on May 15, 2022

You mean this function: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...? I don't get that Postgres RDS only allows you to import from csv, while the exports are strictly parquet. I know that Postgres implements the COPY command using csv or binary, but i would expect some end-to-end lifecycle implementation by AWS and parquet would at least enforce some schema and reduce some errors including those which caused yours.

iknownothow · on May 15, 2022

As far I can tell, exporting query results from Postgres RDS to S3 uses Postgres's Copy and only supports CSV and Text [1]. Can you provide a link for parquet exports?

[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postg...

vanpythonista · on May 15, 2022

You can use a RDS Snapshot to export to S3 in Parquet format: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...

darkhorse222 · on May 16, 2022

But why? You export in parquet to retain the schema and data but ingesting that data back in requires CSV which loses the schema. How do you restore your table schema if someone deletes it? Why not just use the same ingest from s3 function to write out to s3?

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/postg...

higeorge13 · on May 17, 2022

We don’t ingest back to postgres, but snowflake. And we also have a strict schema definition for our source models in snowflake via dbt.

akdor1154 · on May 15, 2022

It sounds like you don't have AWS support? In case you do, give it a shot.. it takes a while but eventually you can get a knowledgeable answer. Though im in 50% on it being 'yes this is by design, please watch the RDS change log for future improvements' in this case.

gcbirzan · on May 15, 2022

Tell that to the ticket I opened in November that got closed without an answer or even a message a month ago, related exactly to RDS export being crap.

whoknew1122 · on May 16, 2022

There is no mechanism in AWS Support by which tickets are automatically closed without a response. Could someone have gone in and resolved your case without a response? In theory, it's possible. But there's no incentive to do so.

I'd suggest responding to the case, which will reopen it and get it reassigned. A open case that's missed SLA by 6 months will raise eyebrows.

That's assuming what you're saying is correct. I'm not sure what your incentive would be to lie, but I've dealt with more than one support case wherein the customer swears up-and-down that they never changed anything. And whoopsie, Cloudtrail shows they did.

gcbirzan · on May 16, 2022

I don't know how to check, maybe someone was cleaning tickets older than 3-4 months. But, the status is 'resolved' (again, not sure who can set this), but there has been no communication from AWS since. I just checked now, the issue either was fixed by us dumping less data or by AWS.

However, the SLAs that I can see are 'initial response', which for better or worse, sometimes happen. However, the most egregious example of how broken that model is, we had a business critical system down ticket down that got an answer of 'looking into it' in 19 minutes, then ~1:45 an answer of "oh, yeah, I checked and this and this DB you talked about didn't reboot", which was true, the person that made the ticket pointed out the wrong instance. 3 hours after I point the correct DB we get told, yes, you are right, it rebooted. And then on the 1st of April I ask for a timeline of how the writer/reader nodes changed during the reboot, I get an answer on the 20th, even though we pushed in the ticket and through our TAM for an answer.

Generally, I'm not really sure how people are so happy with AWS support, every issue that is even the least bit complicated gets wrong answers (took a month for us to convince people, though meetings with our TAM, that it's not a configuration or capacity issue that's causing some issues) or just sits there for ages. Once we got an engineer on the Aurora team assigned to the case I used as an example, we started making progress, we even got a patched version of PostgreSQL RDS deployed, but once the bug was identified, we had to work around it, since a fixed version was not an option in less than a few months.

All in all, for the hundreds of thousands of dollars we're spending a month on support, I have a feeling that when we need it, we're kind of alone.

whoknew1122 · on May 16, 2022

I'm not exactly sure how your example is particularly egregious.

AWS Support doesn't know your workload. If you, who know your workload, can't figure out what's wrong, how can AWS Support figure it out in 15 minutes?

> "that got an answer of 'looking into it' in 19 minutes, then ~1:45 an answer of 'oh, yeah, I checked and this and this DB you talked about didn't reboot', which was true, the person that made the ticket pointed out the wrong instance."

Which wasn't AWS Support's fault. They looked into the resource on the ticket. Again, AWS Support doesn't know your workload and isn't going to go on a fishing expedition to guess which DB might be the issue.

> "3 hours after I point the correct DB we get told, yes, you are right, it rebooted. And then on the 1st of April I ask for a timeline of how the writer/reader nodes changed during the reboot, I get an answer on the 20th, even though we pushed in the ticket and through our TAM for an answer."

Timelines and RCAs have to go through the service team. Up to, and including, the service GM at times. It can be frustrating to wait for an RCA, but sometimes the things just take time.

> "All in all, for the hundreds of thousands of dollars we're spending a month on support, I have a feeling that when we need it, we're kind of alone."

I'd encourage you to open a chat or phone for a business critical issue. That way Support doesn't spend an hour and a half trying to troubleshoot the wrong resource. Those errors can quickly be found and remediated on a call.

gcbirzan · on May 17, 2022

> Which wasn't AWS Support's fault. They looked into the resource on the ticket. Again, AWS Support doesn't know your workload and isn't going to go on a fishing expedition to guess which DB might be the issue.

No, they won't. I mean, definitely, two hours is not enough time to do anything except say "but it didn't reboot".

> Timelines and RCAs have to go through the service team. Up to, and including, the service GM at times. It can be frustrating to wait for an RCA, but sometimes the things just take time.

If essential information isn't present in the event log, or in any way accessible to the customer, and especially given how trivial the info was (we got back literally four bullet points two timestamps), almost 3 weeks is a disgrace.

> I'd encourage you to open a chat or phone for a business critical issue. That way Support doesn't spend an hour and a half trying to troubleshoot the wrong resource. Those errors can quickly be found and remediated on a call.

The colleague who made the ticket put a phone number to get a callback in the ticket. They didn't call. We were actively trying to fix the issues we had, caused by the weird design by the Aurora failover system, not sit in a queue for an hour (yes, we had that too, business critical system down tickets where we spent >45 minutes waiting for a support rep).

On another note, this is why I hate AWS support. It's literally a game of who can blame the customer or avoid fixing issues better. RDS export doesn't handle large databases? Don't make large exports. The UI breaks when you have large exports? Don't use the UI. Errors on an RDS instance when you spin up a second read replica? You ran out of capacity! And that's not saying anything of being ping ponged between teams for technical guidance tickets or having to fight support staff when trying to explain my problem. More than 1 hour spent in the loop of "but it works for me" "I don't care, it doesn't work for me", or outright false statements (even from engineers, not only support staff) that they have to carefully walk back when we actually try things out.

I've not seen this 'customer obsession', ever. And I'd understand that if we were on developer plan or something, but we're not.

fdr · on May 15, 2022

....that seems like an unusually enormous number of partitions.

tomatowurst · on May 15, 2022

what about using sqlite diff stored in dynamodb that gets synced when lambda is called? thought i saw a repo for that.

i think that has merit too reading this im reminded.

higeorge13 · on May 15, 2022

It would be great if you could share some specifics or share the repo.