
> don't really think anyone expects a c abi to have multiple implementation defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings alltogether

Not so simple.

32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

Zero length strings are easy, but what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (i.e. a missing value) differently from an empty string.
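For illustration, a minimal sketch of that design choice, assuming a hypothetical non-opaque `string_t` (all names made up):

```c
#include <stddef.h>

/* Hypothetical pointer + length string. If callers can create this
   struct directly, the only way to encode "no string at all" is a
   NULL data pointer -- which every consumer must then remember to
   check, and which is distinct from { "", 0 }, an empty string. */
typedef struct {
    char  *data;
    size_t len;
} string_t;

int is_missing(string_t s) { return s.data == NULL; }
int is_empty(string_t s)   { return s.data != NULL && s.len == 0; }
```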

How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.
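A sketch of what that mandate looks like in practice (hypothetical names, error handling omitted):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;
    size_t len;
} string_t;

/* Hypothetical allocator: two allocations per string, so two frees. */
string_t *string_new(const char *src) {
    string_t *s = malloc(sizeof *s);       /* error checks omitted */
    s->len  = strlen(src);
    s->data = malloc(s->len + 1);
    memcpy(s->data, src, s->len + 1);
    return s;
}

/* The mandated free function: callers must not free() the fields
   themselves, or forget to free data before the struct. */
void string_free(string_t *s) {
    if (s == NULL) return;
    free(s->data);   /* free the character storage first... */
    free(s);         /* ...then the struct itself */
}
```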

Composite data types are a lot more work and are more error prone in C.



We're very much in agreement.

The whole 'null pointer style strings' phrase makes no sense, I think they mean 'nul terminated'. But fine.

Your examples are excellent, let me add a few more:

Big endian? Little endian? Do we count characters or bytes? Who owns the bloody thing? Can they be modified in place? Are they in ROM or RAM? Automatic? Static? Can they be transmitted over a network 'as is' or do they need to be sent via some serialization mechanism? What about storing them on disk? And can they then be retrieved on different architectures?

The problem is that C more or less requires you to really know what you're doing with your data, and that's impossible in a networked world: your toy library ends up integrated into something else, that something else gets connected to the internet, and suddenly all those negative test cases that you never thought of are potential security issues. So any simplistic view of string handling will end up with a broken implementation, regardless of how well it worked in its initial target environment.

C's solution is simple: take the simplest possible representation and use that, pass responsibility back to the programmer for dealing with all of the edge cases. The problem is that nobody does and even those that try tend to get it subtly wrong several times across a codebase of any magnitude.

It's a nasty little problem and it will result in security issues for decades to come. There are plenty of managed languages, and I had some hope (as a seasoned C programmer) that instead of this Cambrian explosion of programming languages we'd see some kind of convergence, so that it becomes easier, not harder, to pick a winner and establish some best practices. But cooperation seems rare; much more common is the mode where a defect in one language or ecosystem results in a completely new language that solves that one problem in some way (sometimes quite convoluted) at the expense of introducing a whole raft of new problems. Not to mention the fragmentation of mindshare.


It's not a hypothesis, the thing was already implemented many times in C, C++ and other languages and used for ages especially for networked code, because C "there's no length" approach is a guaranteed vulnerability.


It's not a guaranteed vulnerability, it's a potential vulnerability.

Guaranteed doesn't mean "this will probably happen", it means "this will definitely happen".

The "no length approach" can probably result in a vulnerability. It won't definitely result in a vulnerability.

I mean, come on, if it were a guaranteed vulnerability, almost nothing on the internet would work, because everything has, somewhere down the line, a dependency on a nul-terminated string.

I mean, do you think that nginx (https://github.com/nginx/nginx/blob/master/src/core/ngx_stri...) is getting exploited millions of times per hour because they have a few uses for nul-terminated strings?


nginx whacks one mole at a time https://cve.circl.lu/cve/CVE-2013-2028


That CVE has absolutely nothing to do with length up front vs nul terminated strings. It's also two years old. The only thing it does is reference nginx but that's disingenuous, unless the point you're trying to make is that nginx has the occasional security issue, which I think we're all very much aware of. But it doesn't answer the GPs point in any relevant way.


The problem there is in opportunistic bounds checking due to loose association of an array with its length, a string being an example of an array. This vulnerability is a direct consequence of C's "there's no length" approach and shows why this approach is unsuitable for networked code.


In C a string is not an example of an array. If we can't agree on terminology for a discussion that requires extreme precision it becomes difficult to keep going.

Networked code does not as a rule use C style nul terminated strings though, in the case of fixed length buffers they will usually be accompanied either by a length field or by zeroing out the end of the string or even the whole buffer (the latter is much better and ensures you don't accidentally leak data from one session to another).
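A sketch of the zero-the-whole-buffer practice described above (hypothetical helper name):

```c
#include <string.h>

/* Sketch of the "zero the whole buffer" practice: wiping the buffer
   before reuse means no stale bytes from a previous session can leak,
   and the stored string is always terminated. */
void reset_session_buf(char *buf, size_t size, const char *msg) {
    memset(buf, 0, size);          /* wipe the previous session's data */
    strncpy(buf, msg, size - 1);   /* copy, leaving the final NUL byte */
}
```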

Networked code doesn't have to be written in C to begin with. Regardless of implementation there usually is a protocol spec and you adhere to that spec and if you don't then you'll find out the hard way why it matters.

This particular vulnerability has nothing at all to do with C strings but in fact has everything to do with a broken implementation of length based strings, which could result in the length being negative, which is at least one problem that C style strings do not have (small comfort, they have plenty of other problems, but that one they don't).

This is the fix for that particular CVE:

https://github.com/nginx/nginx/commit/4997de8005630664ab35f2...

Which stems from integer overflow after doing arithmetic on the lengths.
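To illustrate the general class of bug, a simplified, hypothetical sketch (not the actual nginx code):

```c
#include <stddef.h>

/* Hypothetical sketch of the bug class: a bounds check on a signed
   length. A negative length compares "less than" any buffer size, so
   the check passes; a later cast to size_t would then turn it into a
   huge unsigned count. Illustration only, not the nginx code. */
int flawed_bounds_check(long len, size_t buf_size) {
    return len <= (long)buf_size;
}
```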

It looks to me as though you just pulled the first nginx CVE that you found and posted it without looking at what the CVE was all about, without realizing that the ancestor comment was referring to the string implementation inside nginx which lives in the referenced file, whereas you are pointing to a CVE related to the parsing of HTTP chunked data requests, which resides in an entirely different file and has nothing to do with string handling to begin with.


And what do you propose? To let only the 1.5 good C programmers in the world write code like in the 70s?


> And what do you propose?

That you get your terminology right, back up your claims with links that actually make sense and try to understand that the software world is complex and that incremental approaches make more sense than demanding unrealistic / uneconomical changes because they are not going to happen.

> To let only 1.5 good C programmers in the world write code like in 70s?

No, I did not propose that, you just did, and clearly that's nonsense, aka a strawman, even if you didn't bother knocking it down.

C is here. It will be here decades from now. Rewriting everything is not going to happen, at least not in the short term. C will likely still be here (and new C code will likely still be written) in 2100, and possibly long after that. This isn't ideal, and it doesn't help that we cannot make a clean break with the past even though we are trying.

The solution will come in many small pieces rather than as one silver bullet to cure it all, and TFA announces two such small pieces, and as such is a small step in a very, very long game. The adoption of Rust and other safer languages (not inherently safe, but safer; there are still plenty of footguns left) may well in the longer run give us a chance to do away with the last of the heritage from the C era. But there is a fair chance that it won't happen and that Rust's rate of adoption will be too low to solve this problem in a timely fashion.

The same goes for every other managed language; they are partial solutions at best. This isn't good news and it isn't optimal, but it is the reality as far as I can determine. If you're going to do a new greenfield development I hope that you will find yourself on a platform where you won't have to use C and that you have skills and resources at your disposal that will allow you to side-step those problems entirely. But that won't do anything for the untold LOC already out there in production, and that utterly dwarfs any concern I have about future development: it's the mess we made in the past that we have to deal with, while trying hard to avoid making new messes.

Think of it as fixing a large toxic waste spill.


It's not a hypothesis, the change happened several times and is used in networking code: in putty and s2n in C, and in grpc in C++, and I guess in all C++ code that uses string_view and span; it's easier to do in C++ due to more language features.

>Rewriting everything is not going to happen, at least, not in the short term.

If you can't do a big task in one go, split it into smaller tasks and do them in sequence.


I'm sorry, I apparently lack the vocabulary or clarity of expression to get my points across to you so I'm bowing out here.


Which C compilers are those then?

Also, you keep writing 'null pointer' and 'null', there is a pretty big difference between 'null' and 'nul' and in the context of talking about language implementation details such little things matter a lot. You say a lot of stuff with great authority that simply doesn't match my experience (as a C programmer of many decades) and while I'm all open to being convinced otherwise you will have to show some references and examples.


What doesn't match your experience?


My experience as a programmer of some 40 years in C has yet to expose me to a C compiler that has length based rather than nul terminated strings as the base string type. Please point me to one in somewhat widespread use rather than an experimental implementation that uses this concept and make sure not to confuse libraries with the implementation of the language.


Since no C/C++ compiler supports it natively, the implementation lives in a library.


So that means they are not part of C/C++. Which was the point. You can write software in C/C++ but that's hardly news and you can use that to create new data types that are not in the language, which also is hardly news.


People suggesting it are concerned about security, they don't intend it to be a novel invention. Bound checking predates C.


Yes it does. But that doesn't mean that you get to state a lot of stuff with certainty that upon inspection turns out to simply not be true. C programmers are - in spite of what you appear to think - also concerned about security. And whether bounds checking predates C or not has nothing to do with how this is implemented, in a library or in the compiler itself (or even in the hardware).

If you reference C you are talking about the compiler, that, and only that is the language implementation. In C that specification is so tiny that a lot of the functionality that you might expect to be present in the language is actually library stuff. K&R does a poor job for novices to split out what is the language proper and what is the library, but a good hint is that anything that requires an include file isn't part of the language itself.

The original comment to which you responded talked about the ABI, the layer between the applications and the operating system, presumably the UNIX/POSIX ABI, which is more or less cast in concrete by now and unlikely to be replaced, because doing so introduces a breaking change: all compiled applications using that ABI will no longer work. Some versions of UNIX will occasionally do this, and it is widely regarded as a great way to limit your adoption.

So the problem, in a nutshell, is: how do we repair the security situation that has emerged as the result of many years of bad practices, in such a way that our systems continue to work, without having to re-invest the untold trillions of $ that have been spent on software that we use every day? This is a hard problem. TFA is a small, incremental step in trying to solve it.

Others are more pessimistic, believe that we should just take our lumps and get on with that rewrite, usually in whatever is their favorite managed (or unmanaged, in some cases) language. Yet others pursue compiler based or hardware based solutions which all introduce different degrees of incompatibility.

I'm somewhat bearish on seeing this problem resolved in my lifetime. At the same time I applaud every little step in the right direction. And I personally do not believe that replacing C's 'string type' (which it really doesn't have other than nul terminated string literals) is the way to go due to the reasons outlined above. But an incremental approach allows for fixing some known issues and allows us to back away from historical mistakes in a way that we can afford the cost and to do so without incurring the penalty of a complete rewrite (which usually comes with a whole raft of new bugs as well). So small improvements that do not address each and every grievance should be welcomed. Even if they no doubt introduce new problems at least the scope is such that you can - hopefully - deal with those without introducing new security issues.


Putty and s2n are examples of how this problem is solved: they work on POSIX, e.g. Linux, just compile them with gcc and they work.


>32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

32 bits should be enough for everyone; it's easier to type as int, and you have fewer problems with variable sized integers across targets. A signed length makes sense because length is a number, and numbers are signed; also, in conjunction with arrays, a -1 sentinel value is often used.

>If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed, and for business logic an empty string means absence of value. In fact, in languages with nullable strings a null string and an empty string are routinely treated as synonymous, and you often use a method like IsNullOrEmpty to check for absence of value. Anyway, you need the concept of absence for other types too, like int, so string isn't special here.
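In C terms, the same convention might look like this hypothetical helper, in the spirit of IsNullOrEmpty:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C helper in the spirit of .NET's IsNullOrEmpty:
   deliberately treat a NULL pointer and an empty string as the same
   "absence of value". */
bool is_null_or_empty(const char *s) {
    return s == NULL || s[0] == '\0';
}
```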

>You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

pointer+length struct is a value type, see https://en.cppreference.com/w/cpp/container/span
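A C analogue of such a non-owning value type might be sketched as follows (names made up, not from any standard or library):

```c
#include <string.h>

/* Hypothetical C analogue of std::string_view / std::span: a
   non-owning value type. Callers copy it freely and never free it;
   the underlying storage is owned elsewhere. */
typedef struct {
    const char *data;
    int         len;   /* signed, per the argument above */
} strview;

strview sv_from_cstr(const char *s) {
    strview v = { s, (int)strlen(s) };
    return v;
}
```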


> C++ can't do it either with std::string and sky doesn't fall, because such distinction is rarely needed and for business logic empty string means absence of value,

Incorrect. I'm literally, today, working on a project where the business logic is different depending on whether an empty string is stored in the database, or no string.

"User didn't get to fill in a preference" is very different from "user didn't indicate a preference".

In more practical terms, a missing value could mean that we use the default while an empty value could mean that we don't use it at all.


For the user, an empty text field means absence of value. Indeed, a situation that calls for optional values rarely arises, but it's not specific to strings; other types like int may need it too.


The end user representation of a programming construct versus the implementation details surrounding such constructs give rise to what is called a 'leaky abstraction', in this case that 'absence of value' is something entirely different than 'empty string'.

We have a way of representing absence of value for some data types but not for others, again because of implementation details. This sort of leaky abstraction often gives options for creativity, but it can also lead to trouble and bugs. Some languages offer such 'optional' behavior for more datatypes and make it part of function calling conventions, either by supplying a default or by leaving the optional parameters set to the equivalent of 'empty' or even 'undefined' if that is possible.



