As someone who spent ~8 months working on a hobby-level, Rust-based Kafka alternative that used Raft for metadata coordination: nice work!
It wasn't immediately clear to me whether data-plane replication also happens through Raft or is something home-rolled. Getting consistency and reliability right with something home-rolled is challenging.
Notes:
- Would love to see it in an S3-backed mode, either entirely diskless like WarpStream or as tiered storage.
- Love the simplified API. If possible, adding a Kafka compatible API interface is probably worth it to connect to the broader ecosystem.
Hi, the creator here. I think an S3-backed storage mode is a good idea. It's kinda tricky to do for the 'active' block we're currently writing to, but totally doable for historical data.
Also, about the Kafka API: I tried to implement that earlier with a sort of `translation` layer, but it gets pretty complicated to maintain because Kafka is offset-based while walrus is message-based.
TBH I don't think anyone can utilise S3 for the active segment. I didn't dig into WarpStream too much, but I vaguely recall they only offloaded to S3 once the segment was rolled.
The Developer Voices interview where Kris Jenkins talks to Ryan Worl is one of the best, and goes into a surprising amount of detail: https://www.youtube.com/watch?v=xgzmxe6cj6A
tl;dr: they write to S3 once every 250ms to save costs. IIRC, they contend that when you keep things organized by writing a separate file per topic, it's really the Linux disk cache being clever that turns the tangle of on-disk block arrangement into a clean per-file view. They wrote their own version of that, so they can cheaply checkpoint heavily interleaved chunks of data while their in-memory cache provides a clean per-topic view. I think they may clean up later asynchronously, but my memory fails me.
I don't know how BufStream works.
The thing that really stuck with me from that interview is the 10x cost reduction you can get if you're willing and able to tolerate higher latency and increased complexity and use S3. Apparently they implemented that inside Datadog ("Labrador" I think?), and then did it again with WarpStream.
I highly recommend the whole episode (and the whole podcast, really).
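To make the interleaved-checkpoint idea above concrete, here's a toy sketch, entirely my own invention and not WarpStream's actual design: writes from many topics get buffered and flushed as one interleaved segment (one hypothetical S3 PUT), while an in-memory index keeps the clean per-topic view. All names are made up.

```python
import io
from collections import defaultdict

class InterleavedSegmentWriter:
    """Toy sketch: buffer messages from many topics, flush them all as one
    interleaved segment, and keep an in-memory index that still presents a
    clean per-topic view of the data."""

    def __init__(self, store):
        self.store = store              # segment_id -> bytes (stand-in for S3)
        self.buffer = []                # (topic, payload) pairs in arrival order
        self.index = defaultdict(list)  # topic -> [(segment_id, start, end)]
        self.next_segment_id = 0

    def append(self, topic, payload: bytes):
        self.buffer.append((topic, payload))

    def flush(self):
        """Write everything buffered, across all topics, as a single segment
        (this is the one 'PUT' per flush interval)."""
        if not self.buffer:
            return None
        seg_id = self.next_segment_id
        self.next_segment_id += 1
        out = io.BytesIO()
        for topic, payload in self.buffer:
            start = out.tell()
            out.write(payload)
            self.index[topic].append((seg_id, start, out.tell()))
        self.store[seg_id] = out.getvalue()
        self.buffer.clear()
        return seg_id

    def read_topic(self, topic):
        """Reassemble one topic's messages from the interleaved segments."""
        return [self.store[seg][start:end]
                for seg, start, end in self.index[topic]]
```

A real system would flush on a timer (e.g. every 250ms), replicate the index for durability, and compact/reorganize segments asynchronously; this just shows the interleave-on-write, demultiplex-on-read shape.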
S3 charges per 1,000 PUT requests; not sure how it's sustainable to do that every 250ms, TBH, especially in multi-tenant mode where you can have thousands of 'active' blocks being written to.
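A quick back-of-the-envelope on that concern, assuming roughly $0.005 per 1,000 PUT requests for S3 Standard (pricing varies by region and over time, so check current numbers): one flush every 250ms is cheap if everything is batched into a single object per flush, but it multiplies badly if each active block flushes on its own.

```python
# One flush (= one PUT) every 250ms, at an assumed $0.005 per 1,000 PUTs.
puts_per_second = 1 / 0.250
puts_per_day = puts_per_second * 86_400
cost_per_day = puts_per_day / 1_000 * 0.005

print(f"{puts_per_day:.0f} PUTs/day -> ${cost_per_day:.2f}/day")
# 345600 PUTs/day -> $1.73/day

# ...but if each of 1,000 active blocks did its own PUT every 250ms:
cost_per_day_1000_blocks = cost_per_day * 1_000
print(f"x1000 active blocks -> ${cost_per_day_1000_blocks:.2f}/day")
# x1000 active blocks -> $1728.00/day
```

Which suggests the batching/interleaving described upthread is what makes the per-250ms cadence affordable: it's one PUT per interval for the whole writer, not one per topic or block.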
> In practice, many teams will use a pull request-based workflow because the tests could be too complicated to run locally or need to facilitate code reviews. For trunk-based development to fulfill its promise, it’s imperative that pull requests are kept small and that the team is committed to reviewing them quickly.
Then it is not trunk-based; it is still PR-based. I also wonder how bigger features would get developed without littering them across multiple pull requests and feature flags to make sure the code still compiles.
Would you consider adopting the STARS framework[1]? It was designed to answer questions like the ones you mentioned (“Are our key customers adopting the new feature we launched?”), but at the same time it also incorporates qualitative feedback.
We kind of ended up building something that overlaps with our feature report, but we just stop at the STAR: we haven't really looked into adding qualitative information like satisfaction to our product, but it could make sense!
Lots of great additions. I will just highlight two:
Column selection:
When you have tons of columns, these become useful. ClickHouse takes it to the next level and supports APPLY and COLUMNS in addition to the EXCEPT and REPLACE that DuckDB supports:
- APPLY: apply a function to a set of columns
- COLUMNS: select columns by matching a regular expression (!)
I can't count how many times I've run into a problem with a trailing comma. There's a whole convention developed to overcome this: the prefix comma convention where you'd write:
SELECT
first_column
,second_column
,third_column
which lets you easily comment out a line without worrying about trailing comma errors. That's no longer necessary in DuckDB. Allowing for trailing commas should get included in the SQL spec.
Allowing references to columns defined earlier in the same query would make DuckDB competitive for data analytics. Without that, one has to chain WITH statements for even the tiniest operations.
(nothing to do with DuckDB but..) SQL is complex enough, and allowing this (and acyclically as mentioned below) would do my $%^& nut implementing it.
But I know a user requirement when I hear one, so can you give me a large, real example of where allowing this would make things easier? That would be mega helpful, ta.
No, it is not. I mean, it is, but not in the parts where it could be seen as useful and/or convenient. [A]cyclic graph traversal etc. is one of the basic tests in a modern interview at any CRUD studio. How come it could do $%^& to any part of yours?
Because just implementing the standard stuff nearly did my $^&% nut. Also, I know about graphs & posets, and it's potentially a little more complex than it seems. The simple case
select x * x as y, 1 as x
is meh, but what about
select
  (select tbl.z from tbl where tbl.y = y) as subq,
  x * yy as y,
  xx + 1 as x,
  subq + yy as zzz
from (
  select xx, yy
  from ... )
Note you can already reference select list items in GROUP, HAVING, and ORDER BY so it's not that big of an extension.
I've implemented the ability to reference select-list aliases before; it's not that hard to do if implemented basically like a macro expansion. The main problem is user confusion due to ambiguous references, e.g.
select 2 as x, x as `which x?`
from (select 1 as x) t;
We ended up adding a warning for the case where a select-list alias shadowed a table column, suggesting a fully-qualified table name if they actually wanted the table column (t.x in the above example).
IMO only allowing references to previous select list items is a perfectly reasonable restriction; loosening it isn't worth the implementation headache or user confusion. Though we did allow using aliases in the WHERE clause.
> Note you can already reference select list items in GROUP, HAVING, and ORDER BY so it's not that big of an extension.
There you're just looking up symbols in the symbol table; I think it's a big difference!
> IMO only allowing references to previous select list items is a perfectly reasonable...
agreed, see my other post where I say the same.
> Though we did allow using aliases in the WHERE clause
And the SQL standards people didn't go for this, and I'm sure they were very far from stupid. And nobody's asking why they didn't allow this, which really bothers me.
Oh, was your objection specifically to allowing references to following (not just preceding) select list items? Then we're in violent agreement. That would be complicated to implement and confuse users. Definitely not worth it.
> suggesting using a fully-qualified table name if they actually wanted the table column (t.x in the above example).
I just realised why this was bothering me. That means 'x' and 't.x' are actually different variables. In standard SQL it's always the case (right?) that an unqualified variable ('x') is just a convenient shorthand for the fully qualified variable ('t.x', or more fully I suppose, '<db>.<schema>.t.x'), and you just broke that.
That’s no different than the first snippet, if you aren’t parsing it with regexps, of course. The resulting AST identifiers would simply refer to not only column names, but also to other defined expressions. This is the case for both snippets. It’s either cyclic or not, and when not, it is easy to substitute/cse/etc as usual. The complexity of these expressions is irrelevant.
@wruza, @wenc: These are both very good answers, and you are of course both right. Check the symbol table, anything you can't find should be defined in the same context (in the select list, as a new expr).
In which case, match each symbol use (e.g. x in x * x as y) to its definition (e.g. 1 as x) to establish a set of dependencies, then do a topological sort over that partial order, then spit out the results.
I can do that, I just don't fancy it, and more to the point nobody is giving me an example of where it would be particularly helpful. So if anyone can, I'm interested.
(also, consider human factors: although an acyclic definition could be extracted from an unordered expression set, a consistent left-to-right order (in the Western world anyway, matching textual layout), with dependencies introduced on the right depending only on what came before on the left, might actually be better for us meatsacks)
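For what it's worth, the bind-then-topologically-sort approach described above fits in a few lines of Python. This is a toy over pre-split (alias, expression) string pairs; real SQL needs a proper tokenizer that understands string literals and quoted identifiers, and all names here are made up:

```python
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def expand_aliases(select_list):
    """select_list: ordered mapping of alias -> expression string.
    Returns each expression with references to other select-list aliases
    substituted away, macro-expansion style. Cyclic definitions raise
    graphlib.CycleError."""
    # 1. Bind each symbol use to a definition: which aliases does each
    #    expression mention? (crude \w+ tokenization for illustration)
    deps = {
        alias: {tok for tok in re.findall(r"\b\w+\b", expr)
                if tok in select_list and tok != alias}
        for alias, expr in select_list.items()
    }
    # 2. Topological sort over the dependency partial order,
    # 3. then expand in that order so substitutions are already resolved.
    expanded = {}
    for alias in TopologicalSorter(deps).static_order():
        expr = select_list[alias]
        for dep in deps[alias]:
            expr = re.sub(rf"\b{dep}\b", f"({expanded[dep]})", expr)
        expanded[alias] = expr
    return expanded
```

Note the sort means this handles even the "unordered expression set" case (aliases used before they're textually defined), though as discussed, restricting references to previously-defined items is kinder to readers.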
My examples are from boring enterprise, not from what we love to create at home. I've read and patched literally meters-long queries in analytics, which could have been reduced dramatically by being self-referential and by the other approaches discussed itt. Of course these could be refactored into something "create view/temp/cte", but that requires full control of DDL, special access rights and code ownership. Most of the space was taken by similar case-when-then constructs and permutations of the values these produced. The original code was under official support, so we couldn't just rewrite it, because migrating to the next update would then cost a week instead of an hour.
I could reach out and post a lengthy example, but it's nothing but boring reshuffles really, spiced with 3-level joins of "modelling a db in the db to allow user columns".
I agree on the LTR idea, because reading a symbol not yet defined may lead to confusion.
It’s not trivial but as someone who has implemented something similar (for an equation based modeling language) it’s not super complicated if you use the right abstractions. It’s basically traversing the AST and doing substitutions.
The thing that makes SQL simple for me is that I can think in set operations, devoid of proceduralness. The more sequential we make things, the more it feels like programming rather than a formula.
I've seen quite a few production queries that use indexes in GROUP BY and ORDER BY; it's quite common. Probably partially because linters/code review/etc are lightweight to nonexistent amongst the analysts/data science types that I tend to work with.
Many dialects already support using aliases in GROUP BY and HAVING too, btw.
IMO it's most useful (though somewhat more difficult to implement) to be able to use the aliases with window functions or large case/when statements, something like
SELECT
  page,
  SUM(clicks) AS total_clicks,
  100. * total_clicks / (SUM(total_clicks) OVER ()) AS click_pct,
  100. * SUM(total_clicks) OVER (ORDER BY total_clicks DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    / (SUM(total_clicks) OVER ()) AS cumulative_click_pct
FROM weblog
GROUP BY page;
JSON is the other one where it annoys me, but luckily I rarely hand-write any JSON anymore (and there are semi-solutions for this like json5).
In code I always add trailing commas to anything comma-separated. It makes editing simpler (you can shuffle lines without thinking about commas), and in a diff or blame, appending an item doesn't show up as a change to the line above just for adding a comma.
SQL is the one spot where this doesn't work, and it's a constant foot-gun as I often don't remember until I run and get a syntax error.
JSONC allows comments and trailing commas, but adoption seems to be low.
VSCode uses it for configuration, but when I wanted to use it from Python (to add context to source-controlled Elasticsearch schemas) there were only a couple of old, barely-maintained parsing libraries.
You can do the same thing with your WHERE clause and ANDs by always starting it with WHERE 1=1 as well.
>> Allowing for trailing commas should get included in the SQL spec.
So there is no "SQL spec" per se; there's an ANSI specification with decades of convention and provider-specific customizations piled on top. Per-dialect support for trailing commas is the best you're going to get.
Thank you for the feedback! I will check those ClickHouse features out. I totally agree on the trailing commas, and I use commas-first syntax for that same reason! But maybe not anymore... :-)
> Allowing for trailing commas should get included in the SQL spec
Not just SQL, trailing commas are stupidly useful and convenient, so as far as I'm concerned every language should have them. To be fair, a decent amount of them have implemented them (I was pleasantly surprised by GCC C), but there are still notable holdouts (JSON!).
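Python is one of the languages that got this right, and its stdlib JSON parser conveniently demonstrates the holdout in the same breath:

```python
import json

# Python itself accepts trailing commas in literals and call sites:
xs = [1, 2, 3,]
d = {"a": 1, "b": 2,}
assert xs == [1, 2, 3] and d == {"a": 1, "b": 2}

# ...but JSON (per RFC 8259) does not, so this raises JSONDecodeError:
try:
    json.loads('[1, 2, 3,]')
except json.JSONDecodeError as e:
    print("strict JSON rejects it:", e.msg)
```

The practical upshot mentioned above applies here too: with trailing commas allowed, reordering or appending lines never touches a neighbouring line's comma.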
Are leading commas allowed? Because otherwise, you've just traded out the inability to comment out the last element for the inability to comment out the first. I never understood this convention.
I agree that it's ugly and don't use it myself, but I find that I modify the last item in a list far more frequently than the first. Probably because the grouping columns tend to go first by convention, and these change less.
What would a date type look like? An ISO 8601 string is probably as good as it'll get; otherwise you end up with an object containing loads of additional information like timezones, offsets, etc. There's also https://json-schema.org/understanding-json-schema/reference/... for a formalized date format in JSON.
The looseness seems to be aimed at human authors who want to write JSON as if it were JS, mainly package.json. But JSON is not ideal for configuration, for multiple reasons.
If you wanted JSON to parse into actual date objects, you could always do:
{
  thisIsADate: new Date('1995-12-17T03:24:00Z')
}
It would be simple enough for parsers in other languages to parse this into their own date objects. Much simpler than using regexes to find ISO strings.
I think allowing a whitelisted set of constructors (so it's extensible, but with some standard ones) as types in a future JSON-like could be good. e.g.
const serialised = NuJSON.stringify(data)
const result = NuJSON.parse(serialised, allowedConstructors, fallbackConstructor)
And if allowedConstructors wasn't present, it'd default to including all the standard JS ones (Map, Set, Date, Symbol, ArrayBuffer, etc.). The fallback constructor would be used if there were data types in the data that didn't have matching allowed constructors. By default it would throw an error.
Of course, my ideal NuJSON would also include bigints, multiline strings, comments, bare keys (i.e. without having to wrap them in quotes).
Stretch NuJSON would also include the ability to serialise circular graph structures, perhaps through the use of a Reference([up, up, "aSymbol"]) data type.
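The allowed-constructors idea above can be sketched on top of any plain JSON library. Here's a Python analogue (the proposal is JS, but the mechanism is language-agnostic): non-JSON types get written as tagged objects, and parse only revives types from an allow-list, with an optional fallback. The `$type`/`$value` tags, and everything else here, are my own made-up names, not any standard.

```python
import json
from datetime import datetime, timezone

# Hypothetical "NuJSON"-style round-tripping: each non-plain-JSON value is
# encoded as {"$type": tag, "$value": encoded}, and parse() only revives
# tags present in the caller-supplied allow-list.

ENCODERS = {
    set: ("Set", list),
    datetime: ("Date", lambda d: d.isoformat()),
}

def _default(obj):
    for typ, (tag, enc) in ENCODERS.items():
        if isinstance(obj, typ):
            return {"$type": tag, "$value": enc(obj)}
    raise TypeError(f"unserialisable: {type(obj).__name__}")

def stringify(data):
    return json.dumps(data, default=_default)

def parse(text, allowed, fallback=None):
    def hook(obj):
        if "$type" in obj:
            ctor = allowed.get(obj["$type"], fallback)
            if ctor is None:
                raise ValueError(f"no constructor allowed for {obj['$type']}")
            return ctor(obj["$value"])
        return obj
    return json.loads(text, object_hook=hook)
```

One caveat this glosses over: a plain object that legitimately contains a "$type" key would need escaping, which is the usual price of in-band type tags.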
I know this is how JavaScript is done nowadays, but using constructors for maps and arrays just seems like unnecessary syntax to me. And if you're going to initialize an array buffer too, you might as well skip JSON entirely and send plain JavaScript code over the wire.
I think all JSON needs in this regard is type hinting. Let everything else be handled by the application:
{
  birthdayParty: Date 1632817436514,
  sounds: String? { // zomgwtfbbq comments too
    cow: "moo",
    fox: null
  },
  possibilities: ["she loves me", "she loves me not", 42],
  aSymbol: Symbol "tag",
  aBuffer: Uint8 [104, 101, 108, 108, 111]
}
I think the main difference between my suggestion and yours is that mine has more brackets. I think type hinting and 'constructors' are basically the same thing, but I like the types to be extensible, so even in your version, I'd make the type hints things that could be provided into the NuJSON.parse function.
Obviously, in your version my approach would be
sounds: Map {
  cow: "moo",
  fox: null
},
possibilities: Set ["she loves me", "she loves me not", 42]
I suppose that the idea could be that type hints are entirely ignoreable, and if you strip them the file just becomes normal JSON?
I suppose my point was that you don't have to specify types for maps and sets at all (which I didn't do in my example); the existing syntax already covers that: [] is already a set, {} is already a map. Type hints would just be used for the values.
Sending type definitions seems like a good idea. Although at that point it might as well not even be called any kind of JSON.
I don't agree that you don't have to specify types for maps and sets. For example
{true: "yes", false: "no"}
Is not valid JSON, but
Map [[true, "yes"], [false, "no"]]
should be valid 'NuJSON'. Maps and Sets are semantically different from Objects and Arrays, and they are JavaScript standard objects that should ideally be supported by the default serialisation. The other alternative, of course, is to force the deserialisation code to fix them up afterwards, but that is painful.
I want to be able to stringify records containing Sets, Maps, Arrays and Objects and have that work, even when they have behaviour not supported by objects and arrays, and I don't think the user should have to fix up the output after parsing JSON that contains standard JavaScript objects.
Useful when you need to input a value that's already double-quoted.
But it's still a matter of taste, and nothing we couldn't do before. Such changes will lead to meetings where programmers discuss JSON5 quoting choices (yes, it will be a thing).
It brings it closer to JavaScript behavior I guess.
I guess it's nice to have the ability to copy a random JS object declaration into actual JSON without editing.
You wouldn't say that mammals write C all the time, would you? Front-end development would be so much better if you could just spew JSON at people in a table.
> You wouldn't say that mammals write C all the time, would you?
Yes. Mammals do write C all the time, because humans are mammals. And C is meant to be human readable and writable.
> Front end development would be so much better if you could just spew json at people in a table.
That's literally what a JSON API is, and there are tons of them. And for each of them, a human had to be able to read and comprehend JSON responses and write JSON requests in code. There's a reason JSON is pretty-printed in the debug consoles of browsers.
To say nothing of all of the package definitions or framework configs that are written (by hand) in JSON.
The fact that most JSON is machine-generated at present has nothing to do with its original design intent. Most JavaScript, CSS and HTML is machine-generated nowadays as well, but they were meant to be written by a person using a text editor.
I've spent 4+ years working on technology that was entirely based on humans hand-tweaking, editing, and updating JSON configs. My text editors validate it quickly and know how to fold, sort, and prettify it. Based on my experience (editing it all day long), it's a very human-readable and editable configuration format.
I was delighted to see Opbeat on there (under Application Performance Monitoring).
Opbeat was acquired by Elastic, Inc and turned into their APM product. You can try the Elastic APM product for free as well: https://www.elastic.co/apm
(I was an Opbeat co-founder and part of the team that built Elastic APM)
The other problem here is that politicians are either oblivious to, or simply don't care about, the cost associated with complexity in laws. On top of that, there's the unfortunate dynamic that politicians all work to get their own little special case into new bills so they can point to it when constituents ask how they influenced the new law.
Taken together, the complexity of the law just accumulates. Tax law in Denmark being a particularly gross example.
I don't doubt that writing new laws is complicated, but if I'm writing code and realise that I constantly need to tack on more code to handle special cases, I will normally revisit my design. It is not my impression that politicians do the same with laws.
One major issue, at least with Danish lawmakers, is that they want to target specific groups, but that would be discriminatory, so instead they attempt to target the behaviour of those groups. This of course affects a number of people who were not in the original target group, so they add exceptions and details to narrow down the law. Also, there never seems to be any clean-up of existing laws.
If your law/rules have exploded from a few hundred pages to 30,000, you should really revisit the thinking behind that law.
It's crazy out here in DK. The most embarrassing example of such crazy targeted laws is the "handshake with the mayor" requirement for new Danish citizens, clearly targeted at a certain ethnicity that would hesitate to do just that. The intent is to make things difficult for that ethnicity; the result is just a veiled law dressed up to "appear" just secular and non-discriminatory enough.
There's a lot I love about life in Denmark. This is just to say that no place is perfect and DK also has stuff that all Danes should be rightly embarrassed about.
This was exactly my question when discussing it with a few Danish friends, who brushed it off saying, "yeah, it's wrong, but it's not that bad... while in Denmark they should be willing to live like a Dane". My counter was whether they would be willing to wear a burka, or give up driving as a woman, without complaining in Saudi Arabia; not that such rules are good anywhere. The point is that what is wrong should just be called out as wrong, not politically sugarcoated.
Didn't I cover that in my comment? I feel that that is bad. But I also feel that the whataboutism argument is overused when it comes to Israel. We should minimize our incidental support of repressive regimes wherever they are found. Especially when we are explicitly called on to do so (see BDS) by those oppressed by such repressive regimes.