
> > I guess I'm a bit confused but don't want to read AI fuzz further

> Personally, I'd learn to get used to it.

Getting used to it doesn't mean I have to start liking it!

ai;dr is a valid enough reaction IMO, much the same as “can't read without turning off my stalker blocker, fair enough, I'll go elsewhere”. You might think that I'm going to miss out on stuff, I might think that I'm perfectly fine missing out on that stuff and doing something else instead.


Depending on utilisation and good use of low-power or sleep (or full off) states when things aren't actively processing, it can still be a _lot_ cheaper to run things at home than on a rented service. Power costs have increased a lot in recent years, but so have compute-per-watt ratios, and you are not paying that indirect compute price when the processors are asleep or off, whereas with subscription access to LLMs you are paying at least the base subscription each month even if you don't use it at all in that period. Much the same as the choice between self-hosting an open-source project or paying for a hosted instance - and in both cases people don't tend to consider the admin cost (for some of us the admin is “play time”!) so the self-hosted option does practically feel free.
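(As a rough sketch of that break-even arithmetic, with every figure below being a made-up assumption rather than a real price:)

    # All numbers are hypothetical; substitute your own.
    idle_watts = 5        # box asleep/suspended
    active_watts = 350    # box under load
    active_hours = 40     # actual use per month
    price_per_kwh = 0.30  # assumed electricity price

    idle_hours = 30 * 24 - active_hours
    kwh = (idle_hours * idle_watts + active_hours * active_watts) / 1000
    power_cost = kwh * price_per_kwh
    print(f"~{power_cost:.2f}/month in power")  # ~5.22 with these numbers

    subscription = 20.0   # assumed flat fee, paid even in idle months
    # (Ignores hardware purchase and admin time, as noted above.)
    print("self-hosting wins" if power_cost < subscription else "renting wins")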

The problem with any of these tells is that an individual instance is often taken as proof on its own rather than as an indicator. People do often use “it isn't X, it is Y”-like constructs¹, and many, myself included sometimes, overuse “quotes”², or use m-dashes³, or are overly concerned about avoiding repeating words⁶, and so forth.

LLMs do these things because they are in the training data, which means that people do these things too.

It is sometimes difficult to not sound like an LLM-written or LLM-reworded comment… I've been called a bot a few times despite never using LLMs for writing English⁴.

--------

[1] particularly vapid space-filler articles/comments or those using whataboutism style redirection, which might be a significant chunk of model training data because of how many of them are out there.

[2] I overuse footnotes as well, which is apparently a smell in the output of some generative tools.

[3] A lot of pre-LLM style-checking tools would recommend this in place of hyphens, and some automated reformatters would make the change without asking, so there are going to be many examples in training data.

[4] I think there is one at work in VS, which I use in DayJob, when it suggests code completion options to save typing (literally Glorified Predictive Text) and I sometimes accept its suggestion, and some of the tools I use to check my Spanish⁵ may be LLM-based, so I can't claim that I don't use them at all.

[5] I'm just learning, so automatic translators are useful to check that what I've written isn't gibberish. For anyone else doing the same: make sure you research any suggested changes, preferably using pre-2023 sources, because the output of these tools can be quite wrong, as you can see when translating into a language you are fluent in.

[6] Another common “LLM tell” because they often have weighting functions especially designed to avoid token repetition, largely to avoid getting stuck in loops, but many pre-LLM grammar checking tools will pick people up on repeated word use too, and people tend to fix the direct symptom with a thesaurus rather than improving the sentence structure overall.
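(A minimal sketch of that kind of repetition weighting, in the common divide-the-logit style; the function and numbers are illustrative rather than any particular model's implementation:)

    import numpy as np

    def penalise_repeats(logits, seen_token_ids, penalty=1.2):
        # Down-weight tokens already emitted so the sampler is less
        # likely to pick them again (and less likely to get stuck in a
        # loop): divide positive logits, multiply negative ones.
        out = np.asarray(logits, dtype=float).copy()
        for t in set(seen_token_ids):
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
        return out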


Much like the epidemic we're going through of "everything is fake", every comment that even has a whiff of these tells will automatically be dismissed as "AI".

I've found myself doing it, a time or two.


The other common “entitlement” is getting miffed when their suggested enhancement isn't something that you intend to do, or will/might get done but is very low priority so it won't be soon. Common responses are to suggest that you should reconsider “for the community”⁰, or start a moaning campaign on social media to try to get others to chip in and nag you. Or “threaten” to use something else instead, which always amused me¹ [way back] when I had some f/oss stuff out there.

Expecting quick responses to security issues is one thing, and perfectly acceptable IMO, but new features/enhancements or major changes (that might break other workflows, most importantly mine!) is quite another.

---------

[0] My response years ago when I had f/oss code out there was sometimes “why don't you do it for the community, and submit a patch?” which usually got an indignant response. Though these days if I ever publish code again it'll be on more of an “open source not open contribution” basis, so I'd not be accepting patches like that and my response would be more along the lines of “feel free to fork and DIY”.

[1] So, if I do the thing I don't want to do right now, you'll stay and probably keep making demands, and if I don't do the thing that I don't want to do right now, you'll go away and bother someone else? Let me think about that…


The 'I'll use something else' threat is my favorite. In my space the threat is usually 'I'll just use the AWS native tools.' And my honest answer is: if that works for you, do it. Most of them come back three months later after the first Config rules bill lands.

> and the waitresses were up to their names, literally standing and waiting by our table

I hate that. My old boss would book us into 5* places when we travelled for work and his wife was also there. People standing over me just felt “ick”. Like when the security guard decides to follow you around the supermarket! (the latter has only happened a couple of times that I've noticed; when it did I made a point of spending much longer than I otherwise would meandering back & forth, and gave them a grin on my way out after paying…)


It's usually pretty discreet - they'll stay out of your line of sight and, if they're doing it right, you should barely notice them stepping in to top up your wine, etc.

Not sure what star rating system you're going by (Michelin only goes up to 3), but I'd expect that level of service even at 1* restaurants.


I'm meaning hotel (and therefore the bars/other within them) ratings rather than restaurant stars.

Maybe years living in a somewhat ropey town and having to be careful in alley-ways and tree covered areas has tuned me to be extra sensitive to people trying to stay out of line of sight…


> nailing down Unicode and text encodings was still considered rocket science. Now this is a solved problem

I wish…

Detecting text encoding is only easy if all you need to contend with is UTF16-with-BOM, UTF8-with-BOM, UTF8-without-BOM, and plain ASCII (which is effectively also UTF8). As soon as you might see UTF16 or UCS without a BOM, or 8-bit codepages other than plain ASCII (many apps/libs assume that these are always CP1252, a superset of the printable characters of ISO-8859-1, which may not be the case), things are not fully deterministic.
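(A minimal sketch of the deterministic part of that detection, BOM sniffing plus a strict UTF8 decode, in Python; real detection libraries do considerably more guessing than this:)

    def sniff_encoding(raw: bytes) -> str:
        # The easy cases: explicit BOMs (UTF-32 BOMs omitted for brevity).
        if raw.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        if raw.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if raw.startswith(b"\xfe\xff"):
            return "utf-16-be"
        # No BOM: UTF-8 is sparse, so a clean strict decode is a strong
        # (but not certain) hint; pure ASCII also passes this test.
        try:
            raw.decode("utf-8", errors="strict")
            return "utf-8"
        except UnicodeDecodeError:
            # Could be CP1252, another 8-bit codepage, or BOM-less
            # UTF-16: not deterministically detectable from here.
            return "unknown"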

Thankfully UTF8 has largely won out over the many 8-bit encodings, but that leaves the interesting case of UTF8-with-BOM. The standard recommends against using it, saying plain UTF8 is the way to go, but to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM (otherwise it assumes CP 1252 and characters above 127 are corrupted). But… some apps/libs are completely unaware that UTF8-with-BOM is a thing at all, so they load such files with the first column header corrupted.
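(In Python, for example, the stock codecs handle both sides of this; a small sketch, noting that Excel's exact behaviour varies by version and locale:)

    import csv

    # Write a CSV that Excel will open with accents intact: the
    # "utf-8-sig" codec prepends the BOM bytes (0xEF 0xBB 0xBF).
    with open("out.csv", "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows([["name", "ciudad"], ["Ana", "Logroño"]])

    # Read it back: "utf-8-sig" strips the BOM if present, so the first
    # column header doesn't come back corrupted as "\ufeffname".
    with open("out.csv", encoding="utf-8-sig", newline="") as f:
        print(next(csv.reader(f)))  # ['name', 'ciudad']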

Source: we have clients pushing & pulling (or having us push/pull) data back & forth in various CSV formats, and we see some oddities in what we receive and what we are expected to send more regularly than you might think. The real fun comes when something at the client's end processes text badly (multiple steps with more than one of them incorrectly reading UTF8 as CP1252, for example) before we get hold of it, and we have to convince them that what they have sent is non-deterministically corrupt and we can't reliably fix it on the receiving end…


> to get Excel to correctly load a UTF8 encoded CSV or similar you must include the BOM

Ah so that’s the trick! I’ve run into this problem a bunch of times in the wild, where some script emits CSV which works on the developer’s machine but fails strangely with real-world data.

Good to know there’s a simple solution. I hope I remember your comment next time I see this!


Excel CSV is broken anyway, since in some (EU, ...) countries it needs ; as separator.

That's not an Excel issue. That's a locale issue.

Due to (parts of?) the EU using the comma as the decimal separator, you have to use another symbol to separate your values.


Comma for decimal separator, and point (or sometimes apostrophe) for thousands separator if there is one, is very common. IIRC more European countries use that than don't, officially, and a bunch of countries outside Europe do too.

It wouldn't normally necessitate not using comma as the field separator in CSV files though; wrapping those values in quotes is how that would usually be handled in my experience.
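(For example, Python's csv module does that quoting automatically; a quick sketch:)

    import csv, io

    buf = io.StringIO()
    # A decimal-comma price: the writer wraps the field in quotes so the
    # comma inside it isn't taken as a field separator.
    csv.writer(buf).writerow(["Brot", "3,49"])
    print(buf.getvalue())  # Brot,"3,49"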

Though many people end up switching to “our way”, despite their normal locale preferences, because of compatibility issues they encounter otherwise with US/UK software written naively.


Locales should have died long ago. Use plain data; stop parsing it depending on where you live. Plan9/9front got this right long ago. Just use Unicode everywhere, and use context-free units for money.

Locales are fine for display, but yes, they should not affect what goes into files for transfer. There have always been appropriate control characters in the common character sets: in ASCII and most 8-bit codepages there are non-printing control characters with suitable meanings to be used in place of commas and EOL, so those could appear unescaped in data fields. Numbers could be plain, perhaps with the dot still as a standard decimal point, or we could store non-integers as a pair of ints (value and scale), dates in an unambiguous format (something like one of the options from ISO8601), etc.
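(ASCII already reserves 0x1F and 0x1E, the unit and record separators, for exactly this; a tiny illustrative sketch:)

    US, RS = "\x1f", "\x1e"  # ASCII unit / record separators

    rows = [["name", "price"], ["Brot", "3,49"]]
    # No quoting or escaping needed: these characters never appear in
    # ordinary text, unlike commas and newlines.
    blob = RS.join(US.join(fields) for fields in rows)

    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows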

Unfortunately people like CSV to be at least part way human-readable, which means readable delimiters, end-of-record markers being EOLs that a text editor would understand, and the decimal/thousand/currency symbols & date formatting that they are used to.


A lot of the time when people say CSV they mean “character separated values” rather than specifically “comma separated values”.

In the text files we get from clients we sometimes see tab used instead of comma, or pipe. I don't think we've seen semicolon yet, though our standard file interpreter would quietly cope¹ as long as there is nothing really odd in the header row.

--------

[1] it uses the heuristic “the most common non-alpha-numeric non-space non-quote character found in the header row” to detect the separator used if it isn't explicitly told what to expect
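(Roughly like this hypothetical sketch; not the actual interpreter code:)

    from collections import Counter

    def guess_separator(header_line: str, default: str = ",") -> str:
        # Most common character in the header row that is not
        # alphanumeric, a space, or a quote character.
        counts = Counter(
            ch for ch in header_line
            if not ch.isalnum() and ch not in " \"'"
        )
        return counts.most_common(1)[0][0] if counts else default

    guess_separator("name|address|phone")        # '|'
    guess_separator('"name";"address";"phone"')  # ';'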


The very fact that UTF-8 itself discourages using the BOM is just so alien to me. I understand they want it to be the last encoding and therefore not in need of an explicit indicator, but as it currently IS NOT the only encoding in use, it makes it just so difficult to tell whether I'm reading one of the weird ASCII derivatives or actual Unicode.

It's maddening and it's frustrating. The US doesn't have any of these issues, but in Europe it's a complete mess!


> The US doesn't have any of these issues

I think you mean “the US chooses to completely ignore these issues and gets away with it because they defined the basic standard that is used, ASCII, way-back-when, and didn't foresee it becoming an international thing so didn't think about anyone else” :)


> because they defined the basic standard that is used, ASCII

I thought it was EBCDIC /s


From Wikipedia...

    UTF-8 always has the same byte order,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8...
    Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance many programming languages permit non-ASCII bytes in string literals but not at the start of the file. ...
    A BOM is unnecessary for detecting UTF-8 encoding. UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text.
That last one is a weaker point, but it is true that with CSV a BOM is more likely to do harm than good.

> The very fact that UTF-8 itself discouraged from using the BOM is just so alien to me.

One of the key advantages of UTF8 is that all ASCII content is effectively UTF-8. Having the BOM present reduces that convenience a bit, and a file starting with the three bytes 0xEF,0xBB,0xBF may be mistaken by some tools for a binary file rather than readable text.


Did you read past the first sentence I wrote?

ASCII does not work for any country other than the US, making it a shit encoding.


> The very fact that UTF-8 itself discouraged from using the BOM is just so alien to me.

Adding a BOM breaks the compatibility with ASCII that is one of the benefits of using UTF-8.


Another one who fails to read past my first sentence...

I read past your first sentence, but ASCII is used by non-English-speaking countries for many things. Source code, for one.

Indeed, I've been using the BOM in all my text files for maybe decades now; those who wrote the recommendation are clearly from an English country

> are clearly from an English country

One particular English-speaking country… The UK has issues with ASCII too, as our currency symbol (£) is not included. Not nearly as much trouble as non-English languages have, what with ASCII lacking the accents & such that they need, but we are still affected.


> Most users just absolutely do not know about, care about, or worry about security, privacy, maintainability, robustness, or a host of other things.

That is a problem that needs to be fixed in those users, not something we should take advantage of as an excuse for releasing shoddy work.

> For some reason this is continually terrifying and shocking to many.

For many reasons.

It means that a good product can be outcompeted by a substandard one because it releases faster, despite the fact it will cause problems later, so good products are going to become much more rare at the same time as slop becomes much more abundant.

It means that those of us trying to produce good output will be squeezed more and more to the point where we can't do that without burning out.

It means that we can trust any given product or service even less than we were able to in the past.

It means that because we are all on the same network, any flaw could potentially affect us all not just the people who don't care.

The people who don't care, when caring means things release with lower cadence, are often the same people who will cry loudest and longest about how much everyone else should have cared when a serious bug bites their face off.

and so on… … …

Are you suggesting we should just sit back and let the entire software industry go the way of AAA games or worse?


> > Most users just absolutely do not know about, care about, or worry about security, privacy, maintainability, robustness, or a host of other things.

> That is a problem that needs to be fixed in those users, not something we should take advantage of as an excuse for releasing shoddy work.

Ok. Tech folks have been trying to educate users and get them to make better decisions (in the viewpoint of those tech folks) for a long time. And the current state points to how successful that's been: not very. This isn't exclusive to software... many industries have consumers who make unsound long-term choices (in the viewpoint of experts).

Taking advantage? Besides cases where folks are actually breaking the law and committing fraud, this isn't some kind of illicit activity; it's just building what the users choose to buy/use.

> It means ... It means ... It means ... It means ...

Perhaps, perhaps, perhaps, and perhaps.

> Are you suggesting we should just sit back and let the entire software industry go the way of AAA games or worse?

I'm not sure what "the way of AAA games" means. I'm just laying out how I view the last 30 years of the software industry.

I don't see any reason to expect significant change.


> I'm not sure what "the way of AAA games" means.

The rush to get things out NowNowNowNowNOWNOWNOW has resulted in massive crunches at the end (or even from the very start) of many big projects, and an apparent “sod it, it'll do, we can patch it later” attitude. Over the last decade or more this problem has become worse, with only a few exceptions to the rule.

With “vibe coding” and “vibe designing” taking more load, I expect that “sod it, it'll do, we can patch it later” will become more common everywhere¹, and that is among those that do have an understanding of the potential security and stability issues that things going out without sufficient review can cause.

--------

[1] Once management are convinced LLM tools will increase throughput by, say, 50% in ideal cases, they'll expect output to increase by 50+% in all cases, and when problems in LLM output cause delays or production issues, the gaming industry's “if you can't put the hours in, someone else will” attitude is likely to become a key driver, more so than it might be already.


> This isn't supposed to replace Windows,

OP wasn't suggesting it was, just that the lack of quality in one significant area of the company's output leads to a lack of confidence in other products that they release.


Given anything the size of Microsoft, it's not a good assumption. MS has large research teams that produce really interesting things. Their output is unrelated to released products.

Companies want us to trust their things based on positive experiences with their other things, and that works both ways.

> people publishing articles that contain these kinds of LLM-ass LLMisms don't mind and don't notice them

That certainly seems to be the case, as demonstrated by the fact that they post them. It is also safe to assume that those who fairly directly use LLM output themselves are not going to be overly bothered by the style being present in posts by others.

> but there are also always clearly real people in the comments who just don't realize that they're responding to a bot

Or perhaps many think they might be responding to someone who has just used an LLM to reword the post. Or translate it from their first language if that is not the common language of the forum in question.

TBH I don't bother (if I don't care enough to make the effort of writing something myself, then I don't care enough to have it written at all) but I try to have a little understanding for those who have problems writing (particularly those not writing in a language they are fluent in).


> Or translate it from their first language if that is not the common language of the forum in question.

While LLM-based translations might have their own specific and recognizable style (I'm not sure), it's distinct from the typical output you get when you just have an LLM write text from scratch. I'm often using LLM translations, and I've never seen it introduce patterns like "it's not x, it's y" when that wasn't in the source.


That is true, but the “negative em-dash positive” pattern is far from the only simple smell that people use to identify LLM output. For instance, certain phrases common in US politics have quickly become common in UK press releases due to LLM-based tools being used to edit/summarise/translate content.

> so that you can hibernate

The “paging space needs to be X*RAM” and “paging space needs to be RAM+Y” rules predate hibernate being a common thing (or even a thing at all), with hibernate being an extra use for that paging space, not the reason it is there in the first place. Some OSs have hibernate space allocated separately from paging/swap space.

