> If your benchmark can include stats for uploads that normally take more than 15 min, that'd be much more eye-catching at least to me.
I don’t have benchmarks for anything like that runtime. 100 files of 100K is the biggest. For raptor the biggest issue is picking `--rate-mbps`, which controls how fast packets are sent. It defaults to a pretty low value, so if you have a Gbps connection it will need to be specified to go fast.
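For example (the file path and the 800 are just placeholder values for a fast link):

$ ./raptor path/to/file.bin $RAPTOR_ENDPOINT --server-key $RAPTOR_SERVER_KEY --client-key $RAPTOR_CLIENT_KEY --rate-mbps 800 --silent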
> Do you have benchmarks for the total round trip time as well, from sending to full reconstruction?
One use case I can talk a bit about involved transforming a large (100s of millions of records) hash database to individual S3 objects. Each was around 2kb. It took eons, even with Lambda executions sharing the work at scale.
The full end-to-end time can be observed using `--confirm FILE`, which will make raptor wait until the multipart upload is completed (and the confirmation notification is received). The baseline number is around 0.5s, but it varies. Larger blocks with repair packets in use take the longest.
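For example (the path is a placeholder, and FILE stays a placeholder for whatever gets passed to `--confirm`):

$ time ./raptor path/to/file.bin $RAPTOR_ENDPOINT --server-key $RAPTOR_SERVER_KEY --client-key $RAPTOR_CLIENT_KEY --confirm FILE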
> People can tar + zstd infinitely many files and upload that archive normally and avoid the small-file problem altogether.
Sure, if that works for the use case. But sometimes small files in S3 are the desired outcome. For example, to serve as web pages or a static JSON API.
> Your pipeline involves quite a lot of moving parts.
It seems simple to me. It’s just a couple of queues and a couple of Lambda functions (one very simple). The FIFO queue is important and does double duty: ensuring only one completion event gets through, and keeping them in order. I didn’t add much error recovery; resending files is often the fastest path. There are some ways to get there, though. If you use the `--no-prefix` flag, only incomplete blocks will be processed on repeated uploads. Not a full solution, but something that could be built on.
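To illustrate the FIFO part, the dedup and ordering are just standard SQS FIFO mechanics, something like this (names made up, not the actual code):

$ aws sqs send-message --queue-url "$COMPLETION_QUEUE_URL" --message-body '{"object":"example-key","status":"complete"}' --message-group-id example-key --message-deduplication-id example-key-complete

The deduplication ID collapses repeated completion events for the same object into one, and the group ID keeps events for that object in order.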
> Pardon my dumb question. The final multi-part upload in S3 still uses the same number of HTTPs connections as if the file were sent from clients, right?
Yes, you have this right. Since we don’t control the S3 API all we can do is show how much better it could be if S3 had an API like this. It could save an insane amount of electricity.
The screenshot on the repo only shows a comparison to the AWS CLI. Here is a comparison for uploading a single 1KB file with the AWS CLI, a minimal AWS API client and raptor:
~ $ time aws s3 cp data/1kb/file_1.bin s3://${BUCKET_NAME}/test-upload/ --quiet
real 0m0.720s
user 0m0.593s
sys 0m0.070s
$ time ./s3upload data/1kb/file_1.bin s3://${BUCKET_NAME}/test-upload/ --quiet
Uploaded data/1kb/file_1.bin
real 0m0.327s
user 0m0.136s
sys 0m0.062s
~ $ time ./raptor data/1kb/file_1.bin $RAPTOR_ENDPOINT --server-key $RAPTOR_SERVER_KEY --client-key $RAPTOR_CLIENT_KEY --silent
real 0m0.303s
user 0m0.013s
sys 0m0.014s
Whether it's the AWS S3 CLI or the raw API, it's the transport and protocol overhead that burn the extra CPU time (power). At least, that's my conclusion.
The raptor tool has a couple of other nice advantages not possible with the AWS API:
Optional acknowledgement. The `--confirm` option is optional; leaving it off skips waiting for ACKs, which takes the run time down to just what it takes to put the outbound packets on the network. That makes uploads very fast, and the reliability is tunable with the `--overhead` option. On networks with significant packet loss this can make urgent uploads much, much faster.
The `--rate-mbps` option controls the transmission rate. Sometimes slower is better, and having this option means not needing to set up bandwidth control with `cgroup` or similar.
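A sketch of the fire-and-forget mode (the path, overhead, and rate values are arbitrary placeholders, not recommendations):

$ ./raptor path/to/file.bin $RAPTOR_ENDPOINT --server-key $RAPTOR_SERVER_KEY --client-key $RAPTOR_CLIENT_KEY --overhead 10 --rate-mbps 50 --silent

With no `--confirm` it doesn't wait for the ACK; the extra repair packets carry the reliability.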
We didn't want to get too wordy with a post like this. But to be honest, if you are knowledgeable enough to spot that kind of gap, you might be a perfect fit for what we are working on ;)
Having lived in the PNW all my life, and worked closely with our friend Doug (the fir trees), this article brings up old mental images of otherwise healthy needles with browned (dead) tips in the crowns.
> Visually, the corona discharges generated on the leaves were either small purple-blue point discharges or elongated purple-blue discharges, and usually formed on the tips of the leaf closest to the source of the electric field (Figure 1). Sometimes the corona discharges were steady and constant, but other times they would dim and brighten in an unsteady pulse. When the corona was turned off, the tips of the leaf where the discharges occurred were often burned and browned, even for the weakest electric fields applied to the leaves.
Human eyes can be sensitive down to 380nm, and the UV range extends up to 400nm. Birds and insects can see this. We can see it too, using UV filters such as the ones shown in the article. I get that it's fun to be a pedant sometimes, but come on.
My people. My first paid programming was hand translating a BASIC app to C. I did it on the same paper the original was printed on (green/white continuous feed). When I thought I had it right I went to my mom’s work in the middle of the night to type it in and check it. Over the course of a summer I made it work.
I took what I learned from BYTE and wrote a CP/M terminate-and-stay-resident 'driver' that got some interesting hardware working well enough to get me the contract, as a teenager, to write the DOS driver for the thing as well.
That led to a rocket-ride career through decades of systems programming, and I just can't thank the BYTE folks enough for those mind-expanding days ..
The two characterizations of people in the introduction are timeless?
> A person with a primary interest in software will oftentimes be the person who purchases a kit computer because the kit minimizes the amount of hardware knowledge the person is required to have.
That’s how I came into my first computer - built from a kit in 7th grade.
And, yeah, I understand more about hardware than I did back then, but it’s all about the software to me still… okay, maybe some electronics and mechanics, too.
I’ll have to take a look at this as a way to move off my homegrown serverless email on AWS. Doesn’t look like it has parity with being able to send email from many subsystems safely (with delay and veto)[1], but is pretty close on the receive side automation[2].
We have a moderately complex set of services we deploy with some separation of application code and infrastructure. No application code that runs on VMs is deployed as part of the infrastructure IaC - that’s all loaded once the “empty” infra is in place. The grey area is around non-VM compute like Lambda and Step Functions, which can be a part of the infra templates.
The way these services work requires an initial set of code to create the resources, and while it would be possible to send a “no-op” payload for the infrastructure deployment and then update it with real application code later, that seems pedantic (to us).
Maybe someday that changes, but for now it isn’t at all burdensome and we’ve been very successful with this approach.
Yeah, we've solidified that the day-0 path of deploying an empty image or a no-op code deploy when first provisioning a service is the way to go, and then letting CI/CD pick up the actual deployments long term. I can see the "this seems pedantic" POV, but this is what we've found works across a number of cloud-native services, and it accomplishes the end goal of managing infra with IaC while deploying with whatever tool we want for the application layer.
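As a rough sketch of the pattern (the names, runtime, and files are placeholders, and in practice this lives in the IaC templates and the pipeline rather than raw CLI calls):

$ aws lambda create-function --function-name my-service --runtime python3.12 --handler noop.handler --role "$LAMBDA_ROLE_ARN" --zip-file fileb://noop.zip
# later, from the application pipeline:
$ aws lambda update-function-code --function-name my-service --zip-file fileb://app.zip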
We have a similar array of deployment targets, and the method is context dependent. The Kubernetes declarative manifests and reconciliation loop for applications is winning out, for our devs and the industry at large. The cloud funcs / Lambda are an annoying corner case; we do that with a late step in CI/CD currently, with a move to a dedicated Argo setup just for CD (Argo Workflows, not Argo CD, because the latter only does Helm well).