Strlcpy and Strlcat – Consistent, Safe, String Copy and Concatenation (1999) [pdf] (openbsd.org)
43 points by todsacerdoti on April 25, 2023 | 52 comments


A person smarter than me once told me: “Just use snprintf()”.

The tiny performance hit is usually worth the safety and simplicity. Only one error condition to check, covering both concatenation and copies, sensible 'size' argument behaviour (size includes null terminating byte, null terminating byte is always written)...

Just use snprintf().
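
For illustration, a minimal sketch of that pattern (dir and file are placeholder inputs); a single check covers both encoding errors and truncation:

    const char *dir = "/tmp", *file = "log.txt";  /* placeholders */
    char buf[128];
    int n = snprintf(buf, sizeof buf, "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= sizeof buf) {
        /* encoding error or truncated output: the one error path */
    }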


I came here to post exactly that; I think using snprintf() is the proper solution to the "I need to build a string out of many parts in C" problem pretty much every time.

Especially since these functions' return values make it hard to use them to step forward in the buffer, so subsequent calls need to walk over the already-inserted characters again and again. That kind of redundant work makes me grit my teeth, which I guess is a sign that I'm a C programmer at heart. :)

Compare:

    char buf[1024];
    snprintf(buf, sizeof buf, "%s%s", "foo", "bar");
and

    char buf[1024];
    strcpy(buf, "foo");
    strlcat(buf, "bar", sizeof buf);
The latter is given a pointer to the base buffer, so it has to walk past the existing characters in order to find the end where the concatenation should start.
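
By contrast, snprintf()'s return value lets later calls start where the previous one stopped; a rough sketch, checking only for truncation:

    char buf[1024];
    int off = snprintf(buf, sizeof buf, "%s", "foo");
    if (off >= 0 && (size_t)off < sizeof buf)
        snprintf(buf + off, sizeof buf - off, "%s", "bar");  /* no rescan of "foo" */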

I also find the "size" arguments confusing and weird. I think it's idiomatic in C to (tightly) associate a size with a buffer pointer, and speaking in terms of "there are n bytes available at this location" makes sense to me. So the prototype really should have been

    size_t strlcat_unwind(char *buf, size_t buf_size, const char *src);


Even better, if you don't mind using the heap, there's also asprintf() which figures out the right length and allocates a buffer itself, then returns it to you. Downside is you have to free the pointer returned of course. And you might want to be careful passing it user input without taking a look at the length first. But you have to do that anyway if you use the stack.
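
A minimal usage sketch (the format string and arguments are placeholders); note that on failure the pointer's contents are undefined, so check the return value only:

    char *s = NULL;
    int n = asprintf(&s, "%s-%d", "id", 42);  /* GNU/BSD extension; glibc wants _GNU_SOURCE */
    if (n >= 0) {
        /* ... use s ... */
        free(s);
    }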


Here's something I've found to be a useful upgrade to asprintf: it frees the passed-in buffer after expanding the format string. You can pass the same char ** repeatedly (and even pass it as one of the string arguments corresponding to a %s!) and it'll update the pointer appropriately each time. This makes many kinds of string manipulation very simple.

    int xasprintf(char **p, const char *fmt, ...) {
        int n = 0;
        char *p2 = NULL;
        if (fmt) {
            va_list v;
            va_start(v, fmt);            /* va_start needs the last named parameter */
            n = vasprintf(&p2, fmt, v);  /* vasprintf(), not asprintf(), takes a va_list */
            va_end(v);
        }
        if (n >= 0) {
            free(*p);                    /* old string is freed only after formatting */
            *p = p2;
        }
        return n;
    }
If you thought asprintf was inefficient, this is worse. But it's very convenient!
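
Usage then looks something like this; passing msg as a %s argument is safe here because vasprintf() formats into a fresh buffer before the old one is freed:

    char *msg = NULL;                        /* free(NULL) is a no-op, so starting NULL is fine */
    xasprintf(&msg, "step %d", 1);
    xasprintf(&msg, "%s, step %d", msg, 2);  /* msg becomes "step 1, step 2" */
    free(msg);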


By the way, I realized that sprintf can be used to concatenate by using the buffer within the formatting itself, while snprintf prevents it. Is it a safety issue too?

  strcpy(buf, "foo");
  sprintf(buf, "%s%s", buf, "bar"); //buf contains "foobar"

  strcpy(buf, "foo");
  snprintf(buf, sizeof buf, "%s%s", buf, "bar"); //buf contains "bar"


Keep in mind that you can't do that; here's a quote that explains it well:

Some programs imprudently rely on code such as the following

    sprintf(buf, "%s some further text", buf);
to append text to buf. However, the standards explicitly note that the results are undefined if source and destination buffers overlap when calling sprintf(), snprintf(), vsprintf(), and vsnprintf(). Depending on the version of gcc(1) used, and the compiler options employed, calls such as the above will not produce the expected results.

The glibc implementation of the functions snprintf() and vsnprintf() conforms to the C99 standard, that is, behaves as described above, since glibc 2.1. Until glibc 2.0.6, they would return -1 when the output was truncated.

- printf(3), Linux man-pages 6.04, 2023-04-01


Except snprintf() does not terminate the string if there's no room for the final \0. So technically this should be snprintf(buf, sizeof buf - 1, "%s%s", "foo", "bar"); buf[sizeof buf - 1] = '\0';


I don't think that is correct; my understanding is that it always terminates when possible. FreeBSD's man page [1] has a pretty clear description:

The snprintf() and vsnprintf() functions will write at most size-1 of the characters printed into the output string (the size'th character then gets the terminating `\0'); if the return value is greater than or equal to the size argument, the string was too short and some of the printed characters were discarded. The output is always null-terminated, unless size is 0.

On a more anecdotal front, I did a quick test on Ideone [2] that seems to behave exactly how I would expect it to (and contradicts your description).

In short: always call snprintf() with the exact size of the buffer. No Obi-Wans.

[1]: https://man.freebsd.org/cgi/man.cgi?query=snprintf

[2]: https://ideone.com/CdLfNS


Check the return value! Unless you really want truncated values, but I would assume most people don't.


At least in glibc, the printf family is much slower than strcpy. It is a huge performance penalty if string copy is frequent (e.g. when you incrementally write to a large string buffer).


As a bonus, if you pass NULL for the first argument (and 0 for the size), its return value tells you how much memory it would have used, so you can allocate a buffer that's the right size. Bigger performance hit, but in practice, I found the impact negligible in a codebase that doesn't do a lot of stringy stuff but needs to emit reasonable messages to logfiles etc.
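
A sketch of that measure-then-allocate pattern (key and val are placeholder arguments):

    int n = snprintf(NULL, 0, "%s=%d", key, val);  /* measure only; C99 allows NULL with size 0 */
    if (n >= 0) {
        char *buf = malloc((size_t)n + 1);
        if (buf)
            snprintf(buf, (size_t)n + 1, "%s=%d", key, val);
    }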


> The tiny performance hit is usually worth the safety and simplicity

It need not even be a performance hit. The compiler could recognize the idiom of code such as

  snprintf(buffer, bufLen, "%s%s%s", s, t, u);
and optimize it into (example possibly buggy; my C is rusty)

    b = buffer;
    e = buffer + bufLen - 1;             /* leave room for the terminator */
    while ((b < e) && *s) *b++ = *s++;
    while ((b < e) && *t) *b++ = *t++;
    while ((b < e) && *u) *b++ = *u++;
    *b = 0;


There have been various attempts to improve on strcpy() etc. The current thinking is that strscpy() is the best general interface to this functionality. See the "improving ... strcpy" section at

https://www.pixelbeat.org/programming/gcc/string_buffers.htm...
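
For reference, strscpy() is a Linux kernel interface (not in libc); its shape is roughly:

    /* returns the number of characters copied, or -E2BIG if the result was truncated */
    ssize_t strscpy(char *dest, const char *src, size_t count);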


The sad part is really that there are several implementations of safe(r) string-manipulation functions; they all do mostly the same thing but have different names and parameter orders, preventing interoperability between major compilers. They really need to get their shit together and finally agree on a single variant to be standardized.


As mentioned in the article linked above https://saagarjha.com/blog/2020/04/12/designing-a-better-str... , memccpy() comes relatively close and is slated to be part of C23.


The code in that blog article has a bug in it that I reported to the author about six months ago, and it hasn't been fixed.

it should be

  char *strxcpy(char *restrict dst, const char *restrict src, size_t len)
  {
    char *end = memccpy(dst, src, '\0', len);
    if (!end && len > 0) { dst[len - 1] = '\0';}
    return end;
  }
Not the end of the world but just another subtly bugged implementation..

This illustrates the issue: with len == 0, dst[len - 1] wraps around to dst[SIZE_MAX].

Notice in the code below how it wipes out the destination string at char 0 when we pass &buf[1] (the wild write wraps back to buf[0]).

Had we passed buf itself instead, the zero would land at buf[SIZE_MAX], far outside the buffer.

  #include <stdio.h>
  #include <string.h>

  char *strxcpy(char *restrict dst, const char *restrict src, size_t len) {
      char *end = memccpy(dst, src, '\0', len);
      if (!end) { dst[len - 1] = '\0'; }  /* bug: no len > 0 check, underflows when len == 0 */
      return end;
  }

  int main(void)
  {
      char buf[3] = "??";
      printf("Hello World%s\n", buf);
      strxcpy(&buf[1], "test", 0);        /* len == 0 triggers the wild write */
      printf("Hello World%s\n", buf);
      return 0;
  }


Even without this, I don't think the claim that "memccpy() comes relatively close" to safe string manipulation can be taken seriously if it doesn't even null terminate the destination. That's a pretty core requirement.

(I mean, of course it doesn't null terminate given its intended usage, but that's what it would need for that claim.)


Goodness me, that is a terrible name. Lower case o and c are very hard to distinguish in the middle of a word, so that looks like memcopy (yes I know the actual existing function is memcpy but memmove is more common in application-ish code, and it includes the middle "o", so I still read memccpy wrong).

What next? merncopy?


Admittedly the name is poor, but memccpy has been in POSIX since forever (the opengroup.org manual mentions it's from "Issue 1" which I believe refers to the X/Open Issue 1 from 1985), so I guess the C committee thought that an existing poor name was better than coming up with a new one.


Nothing that requires separate pointer and length will be safe, no matter how many times they keep going at it, unless it is backed by hardware memory tagging.


That is undoubtedly true, but while we're waiting for the widespread acceptance of a C string library that defines a separate string type, or for the world to be rewritten in Rust or other safer languages, it's still possible to make incremental progress.


There is a reason why hardware memory tagging is now trending across all new hardware platforms.


One drawback of these functions is that the return value is the length of the source string. This is seldom useful: it's mostly a micro-optimization to avoid an extra string scan when using static buffers for small strings but dynamic buffers for large ones. However, it prevents these functions from being used with input that might not be null-terminated, and it makes performance bounded by the size of the source (which usually isn't controlled by the caller) instead of the destination (which usually is).


> it prevents these functions from being used with input that might not be null-terminated

Those are not valid C strings! The string functions have undefined behavior for non-string inputs!


That's true, but in practice you sometimes encounter such strings and it's nice if you can use the regular string functions with them, instead of having to special-code something.


I've never understood the circumstances where it is ok for a copy/concatenate operation to also include a bonus "truncate the result" operation at the end. I'm a believer that you should use strcpy and strcat, and also make sure your buffers are large enough. If you can't make them large enough, then you can't do the operation you are trying to do (and I would say that doing it incorrectly via truncation isn't obviously better since this seems like silent corruption to me)


Truncating strings is a practical alternative that provides a useful signal with minimal extra error-handling code. Yes, in an ideal world there would be no string truncation, but in a lot of cases, when there's no alternative, truncation is better than nothing.

Which of those two error messages would you rather see?

    sprintf: output buffer is too small
or

    fopen: cannot open file /usr/bin/superlong-dist-path/208277409874874/fi


You will be surprised to learn that there were programmers who did not make the buffers large enough.


No matter how big you make it. Someone will find a way to overflow it.


Someone will likely be able to answer more specifically, but I seem to recall that a common old operation was filling fixed-width text fields; truncating a message would make sense there (not ideal, but probably familiar).


Except to properly implement that you'll need to deal with grapheme clusters.


I expect the parent is thinking of why strncpy is the way it is. The people who built it were glad if their terminal had both uppercase and lowercase letters; they didn't have Unicode or grapheme clusters. Their concern wasn't "will this string fit in this many pixels on my 24" flat panel?" but "will this string fit in the 10 bytes where the filename needs to go in the directory data structure of the filesystem?".


Before using any function of the strcpy family, try to think about why you are using them.

The first question is: do I need a copy? Obviously, often you do, but maybe you can avoid it. Copying data can be costly in terms of performance, and of course, without copying there is no need to allocate memory or deal with buffer sizes.

Now that you are sure you indeed need a copy, maybe you should consider using memcpy() instead. If you already know the sizes of your strings, then it is a no-brainer. If you don't, doing strlen()/strnlen() first on your input data is a good idea: if the input is too big, handle the error appropriately; if it is not, do a memcpy() now that you have the size.

The default behavior of "safe" variants of strcpy() is to truncate the string to the maximum size. That is certainly better than a buffer overflow, but it may not be what you want; just ask your ass...istant.
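
A minimal sketch of that measure-validate-copy pattern, assuming dst_size is the total size of dst:

    size_t n = strnlen(src, dst_size);  /* never scan further than could possibly fit */
    if (n >= dst_size) {
        /* input too long: handle the error explicitly */
    } else {
        memcpy(dst, src, n + 1);        /* n + 1 includes the terminator */
    }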


The standard C functions are enough of a mess mentally for me that I try to use them only for simple things.

If I'm dealing with lots of string manipulations I usually create a struct with a count and an alloc size and a char* pointer and write init, cleanup, extend, and append functions specific for it and use that as a string type. It's less of a mental load for me to deal with reallocating an array and keeping count and alloc updated correctly once than it is knowing that character arrays are being recounted all the time and worrying about exactly where the '\0' is going to end up and if my buffer is large enough for every string I'm dealing with.
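
For illustration, a bare-bones version of that idea (the names and growth policy are arbitrary):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;   /* heap buffer, kept NUL-terminated */
        size_t count;  /* bytes used, excluding the NUL */
        size_t alloc;  /* bytes allocated */
    } Str;

    int str_append(Str *s, const char *src, size_t n) {
        if (s->count + n + 1 > s->alloc) {
            size_t want = (s->count + n + 1) * 2;  /* grow geometrically */
            char *p = realloc(s->data, want);
            if (!p) return -1;                     /* caller handles OOM */
            s->data = p;
            s->alloc = want;
        }
        memcpy(s->data + s->count, src, n);
        s->count += n;
        s->data[s->count] = '\0';
        return 0;
    }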


Null-terminated strings have to be up there with null-pointers as design ideas that seemed great at the time, but turned out to be extremely costly because of the insane number of bugs they produce.

I wonder what an alternative world would look like where C had decided to use pascal-strings instead (prefixing the string with its length). We would undoubtedly have argued about the length of the size prefix, but a mistake there is trivial to spot compared to wrong application of str[nls]*cpy(_s)?.


I used to program in Pascal a lot, and its counted strings were one of the worst features.

First, you cannot really "argue" about the length of the size prefix: it is part of the ABI/calling convention, and unless you have a template-like mechanism and overloading conventions, you have to use whatever the stdlib was compiled with.

And Borland Pascal's stdlib was compiled with 1 byte prefix, making max string length of 255 bytes. This made sense when total RAM was measured in hundreds of kilobytes, but generated so many bugs when a normally short string had to be just a bit longer.

I was somewhat envious of my C-using friends who could put an entire help message into a single string. But not envious enough to switch to the much slower C toolchain :)

There were other reasons, too, like using fewer registers in assembly-level loops (old machines had far fewer registers than modern ones), or not having to worry about validating strings when reading binary data from disk.


One can imagine that in a world that had taken a turn towards Pascal strings, we would have ended up in a world where string lengths were varint encoded -- 7 bits of string length in the first byte, plus a continuation bit. I'd imagine that the initial (pure software) implementation of this would have no specific length limit, but we'd be back in the < 1 MB system memory days; so, continuing the imagination, early 32-bit CISC processors would reasonably invest in a varint-to-int decoding probably limited to a 4-byte memory read (28 bit string length), which would be a significant accelerator and cover all sane use cases. In the mid 32 bit era the limit of 256 MB strings would become occasionally relevant, but assuming varint-to-int set an overflow flag if the continuation bit on the last byte was set, this could be mitigated sanely; and the late 32 bit era would give a 64-bit read variant (56 bit string length plus overflow flag), which would probably be cleaned up in the 64 bit era to drop the overflow flag due to lack of use once memory address sizes < 56 bit became standard.
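
A sketch of decoding such a hypothetical prefix (little-endian base-128, continuation flag in the high bit):

    #include <stddef.h>

    /* returns the number of prefix bytes consumed; *out receives the length */
    size_t varint_decode(const unsigned char *p, size_t *out) {
        size_t val = 0, shift = 0, i = 0;
        do {
            val |= (size_t)(p[i] & 0x7F) << shift;
            shift += 7;
        } while (p[i++] & 0x80);
        *out = val;
        return i;
    }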

The above is, of course, speculation on an alternate timeline -- but it's certainly consistent and believable. What's interesting is what impact such a path would have elsewhere. I'd expect two other consequences: First, our existing uses of varints (e.g. protobufs) would become faster and more efficient, and would probably appear earlier in the timeline. Second, UTF8 would likely become varint-representation-compatible, giving up the (valuable) ability to tell which byte of a codepoint is being decoded in isolation, in exchange for hardware acceleration of encoded<->codepoint conversion; and likely giving us 2^28 codepoints based on a 4-byte maximum length, rather than arbitrarily choosing a maximum length of three bytes to give 2^21 codepoints.

What else would be different? Hard to say… but I do think that size prefixes could have been an alternate route.


I cannot imagine varints working well for in-memory structures, especially in C. Iterating over a string with x[i] is (was?) a very common operation, and with a varint length there is an extra memory fetch, extra math, and a possible jump. On later CPUs, there are also unaligned reads and a possible pipeline-stalling data dependency. Smart optimizers can help, but when C was designed, compilers were pretty simple. And there are other problems: for example, the simple act of appending a character to a string can cause a shift-by-1, invalidating existing pointers.

Varints can be useful on a disk/network stream, but I've never seen them used for actively modified in-memory storage.


Specifically in the case of string length encodings, these objections don't seem to apply. The memory fetch is from a cache line that is already needed for the (beginning of the) string data, so only adds ops in the case of pure substring access (in which case C patterns of holding offset/length out of band apply). Similarly you're going to be accessing a string -- unaligned reads are the least of your problem if you're doing byte-at-a-time ops; and you can always align the string to a word boundary to get the varint to start at a word boundary.

I totally agree that as a general tool, varints have failures. But for (a) encoding pascal-style string lengths; (b) encoding often-small numbers on the wire; and (c) encoding codepoints they seem to apply fine.


That is why we used arrays of chars for that case, and naked Assembly blocks.


That also would have made for a nice performance benefit when passing strings to zero-copy-style calls, and it would have reduced complexity for new languages written in C implementing their own string types. Tcl was irritating when it was first invented, as it directly used C strings and couldn't process binary data at all.


Especially given the alternatives to C in systems programming outside Bell Labs, since at least 1958, starting with JOVIAL.


Pascal strings are terrible; the correct solution is to always pass pointer+length, and even better, to make that a built-in type for all element types.


As others have said, this is no longer considered to be "true".

Yes, they make it harder to get memory corruption. But that's a very narrow definition of "safe".

If you tell the computer to copy a string, or append to a string, it's wrong, and not safe, to only do a partial copy or concatenation.

It's a huge footgun. Smaller than strcpy(), sure, but "safe" it is not.

I don't really have a better answer for C.


The better answer would be to add data types like SDS[0] to the standard library, and use them as ADTs (Abstract Data Types) [1].

Unfortunately, WG14 has proven in 30 years of existence that it isn't something they care to fix, and while 3rd-party solutions exist, without vocabulary types in the standard library adoption will never take off.

[0] - https://github.com/antirez/sds

[1] - https://en.wikipedia.org/wiki/Abstract_data_type


I don't know that this should be fixed. It seems so fundamental to the core language that it's simply the cost of doing business.

The set of projects in 2023 that should be done in C is not that big. Even the Linux kernel is getting Rust.

C is still good for embedded stuff, but there tends to be much less string manipulation there than in most other code.


It is not as well known, but I just saw a recording of a guy named David Weston, giving a talk called "Default Security" at BlueHat IL 2023, and he talks about how Windows is porting core components over to Rust, and there's in fact already a syscall implemented in it. It's not just Linux!


Indeed, great talk, but as he reckons, it is going to take years to fix C and derived languages flaws.

We are decades away from OSes using memory safe languages across all layers.

In the meantime, the security story of C, C++, and Objective-C also needs continuous improvement, of which bare-bones C is the worst.


Unfortunately C is here to stay for generations, everywhere there is a POSIX kernel.

Embedded has enough memory-buffer manipulation to be a concern as a possible CVE root cause, especially in the IoT age, where the S stands for security.


I don't even see it as unfortunate. It's just reality. It's a tool to be used when appropriate, like any other.

My point is that adding a higher level abstraction for e.g. strings in C runs a strong risk of not being fit for purpose for the people who reach for C.

People code in C because they have very specific requirements. If they could just pick up an opinionated library then it's likely that they don't need to use C in the first place.


People also have to code in C when no other alternative is possible, or allowed, and exactly because of that, C security model must be improved.


> People also have to code in C when no other alternative is possible, or allowed

That's exactly what I said.

> exactly because of that, C security model must be improved.

I'm all ears. For 75 years we've been trying to reconcile the need for control with higher-level safe abstractions.

Most coders nowadays will gladly sacrifice this control, because the code will run in an operating system, on a fast CPU. And rightly so.

A lot can be gained (without losing control) just by allowing some features of C++. But still, it's not like std::string is a simple fix to this problem. Some C environments cannot allocate memory, or if they can then they need to have an upper bound in how much memory they will allocate.

It's possible to have a mystring type that satisfies the environments like the ones I (and then you) mentioned, but by their nature (because they chose C) they have specific requirements that it may not be possible to solve the general case for.

That said, is strlcpy() a safety improvement over strcpy() and strncpy()? Well... it's not worse. But I'm not sure "if (s > strlcpy(dst, src, s)) {...}" is any safer than "if (strlen(src) >= s) {...} else strcpy(dst, src);". Both are (unless I made a typo) correct, and both are very easy to make a typo with.

And you need to do this at every call site, or you have a bug. Yes, the strcpy() version is much more likely to be exploitable, but the strlcpy() version at the very least produces incorrect output.

And I would not consider something that produces incorrect output "safe". "So just call it correctly" also applies to strcpy(), so same thing.

But yeah, strncpy() was clearly misdesigned, and probably most people would incorrectly guess that its semantics are that of strlcpy(), the latter semantics making sense.



