> very small percentage of people to use the torrent over the direct download
The BitTorrent protocol is IMO better for downloading large files. When I want to download something that exceeds a couple of GB and I see two links, direct download and BitTorrent, I always click the torrent.
On paper, HTTP supports range requests to resume partial downloads. IME, modern web browsers have neglected to implement them properly: they won’t resume after the browser is reopened or the computer is restarted. Command-line HTTP clients like wget are more reliable, but many web servers these days require session cookies or one-time query-string tokens, and it’s hard to pass that stuff from the browser to the command line.
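For reference, the resume itself is simple from native code; here’s a minimal sketch with libcurl (the URL and file name are placeholders) which asks the server for a range starting at the size of the partial file:

```cpp
// Minimal sketch: resume a partial HTTP download with libcurl.
// The URL and file name below are placeholders.
#include <cstdio>
#include <curl/curl.h>

int main()
{
    const char* url = "https://example.com/big-file.iso";
    const char* path = "big-file.iso.part";

    // Open for append; the current size of the partial file is the resume offset.
    FILE* f = fopen( path, "ab" );
    if( !f ) return 1;
    fseek( f, 0, SEEK_END );
    const curl_off_t resumeFrom = ftell( f );

    curl_global_init( CURL_GLOBAL_DEFAULT );
    CURL* curl = curl_easy_init();
    curl_easy_setopt( curl, CURLOPT_URL, url );
    curl_easy_setopt( curl, CURLOPT_WRITEDATA, f );                  // default callback appends to the FILE*
    curl_easy_setopt( curl, CURLOPT_RESUME_FROM_LARGE, resumeFrom ); // sends "Range: bytes=N-"
    curl_easy_setopt( curl, CURLOPT_FOLLOWLOCATION, 1L );

    const CURLcode res = curl_easy_perform( curl );
    curl_easy_cleanup( curl );
    curl_global_cleanup();
    fclose( f );
    return res == CURLE_OK ? 0 : 1;
}
```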
I live in Montenegro, and CDN connectivity is not great here. Only a few of them, like Steam and GOG, saturate my 300 megabit/sec download link. Others are much slower, e.g. Windows updates download at about 100 megabit/sec. The BitTorrent protocol almost always delivers the full 300 megabit/sec.
The FMA acronym is not fast multiply-add, it’s fused multiply-add. Fused means the instruction computes the entire a * b + c expression using twice as many mantissa bits for the intermediate product, and only then rounds the result to the precision of the arguments.
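Here’s a small demo of what that single rounding means, with values picked purely to make it visible (build with FP contraction disabled, e.g. -ffp-contract=off, so the compiler doesn’t fuse the plain expression on its own):

```cpp
// With x = 1 + 2^-30, the exact x*x = 1 + 2^-29 + 2^-60 doesn't fit in a
// double, so a plain multiply drops the 2^-60 term before the subtraction,
// while fma keeps it because it only rounds once, at the very end.
#include <cmath>
#include <cstdio>

int main()
{
    const double x = 1.0 + std::ldexp( 1.0, -30 );  // 1 + 2^-30

    const double separate = x * x - 1.0;            // rounded after the multiply
    const double fused    = std::fma( x, x, -1.0 ); // rounded once, at the end

    std::printf( "separate: %.17e\n", separate );   // 2^-29
    std::printf( "fused   : %.17e\n", fused );      // 2^-29 + 2^-60
    return 0;
}
```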
It might be that the Prism emulator failed to translate FMA instructions into pairs of FMLA instructions (the equally fused ARM64 equivalent), and instead emulated that fused behaviour in software, which is what degraded the performance of the AVX2 emulation.
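For context, the fused ARM64 equivalent is exposed as NEON intrinsics; one 256-bit AVX2 FMA maps naturally onto two 128-bit FMLAs. A sketch of that mapping, not necessarily what Prism actually emits:

```cpp
#include <arm_neon.h>

// ARM64 NEON has an equally fused multiply-add, FMLA, exposed as vfmaq_f32,
// which computes acc + a * b with a single rounding. One 256-bit AVX2 FMA
// over 8 floats corresponds to two 128-bit FMLAs over 4 floats each.
struct Vec8 { float32x4_t lo, hi; };

inline Vec8 fusedMulAdd( Vec8 a, Vec8 b, Vec8 acc )
{
    return Vec8 {
        vfmaq_f32( acc.lo, a.lo, b.lo ),  // acc.lo + a.lo * b.lo, fused
        vfmaq_f32( acc.hi, a.hi, b.hi ),  // acc.hi + a.hi * b.hi, fused
    };
}
```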
Author here - thanks - my bad. Fixed 'fast' -> 'fused' :)
I don't have insight into how Prism works, but I have wondered whether the right debugger could show the ARM code and let us see exactly what was going on.
You’re welcome. Sadly, I don’t know how to observe ARM assembly produced by Prism.
And one more thing.
If you test on an AMD processor, you will probably see much less benefit from FMA. Not because it’s slower, but because the SSE4 version will run much faster.
On Intel processors like your Tiger Lake, all three operations (addition, multiplication and FMA) compete for the same execution units. On AMD processors, multiplication and FMA do as well, but addition is independent: e.g. on Zen 4, multiplication and FMA run on execution units FP0 or FP1, while addition runs on FP2 or FP3. This way, replacing a multiply/add combo with FMA on AMD doesn’t substantially improve throughput in FLOPs. The only win is less pressure on the L1i cache and the instruction decoder.
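For reference, the substitution in question looks like this; both helpers compute a * b + c, they just distribute differently across the execution ports described above (a minimal sketch, not the article’s actual kernel):

```cpp
#include <immintrin.h>

// Multiply-add written as two instructions: vmulps + vaddps.
// Two roundings, but on Zen 4 the add can issue on FP2/FP3
// while the multiply occupies FP0/FP1.
inline __m256 mulAddSeparate( __m256 a, __m256 b, __m256 c )
{
    return _mm256_add_ps( _mm256_mul_ps( a, b ), c );
}

// The same expression as a single fused instruction.
// One rounding and one instruction, but it competes for the same
// ports as the multiply (FP0/FP1 on Zen 4, the FMA ports on Intel).
inline __m256 mulAddFused( __m256 a, __m256 b, __m256 c )
{
    return _mm256_fmadd_ps( a, b, c );
}
```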
It’s impossible to replace JS with WebAssembly because all state-mutating functionality (DOM tree manipulation and events, WebGL rendering, all other IO) is unavailable to WebAssembly. They expect people to do all of that through JavaScript glue code.
Pretty sure that if WebAssembly had been designed to replace JS instead of merely supplementing it, we would have little JS left on the web.
> What if you have two different project with different requirements at the same time?
Install multiple versions of the Windows SDK. They co-exist just fine; new versions don’t replace old ones. When I was an independent contractor, I had 4 versions of Visual Studio and 10 versions of the Windows SDK installed at once; different projects used different ones.
> run games through a Proton-like shim even on Windows
Already happening, to an extent. Specifically, modern Intel GPUs do not support DirectX 9 in hardware, yet legacy apps run fine. The readme.txt they ship with the drivers contains a paragraph which starts with the following text: “SOFTWARE: dxvk The zlib/libpng License”. DXVK is a library which implements Direct3D on top of Vulkan, and it is an important component of SteamOS.
> It was never the right choice for API payloads and config files
Partially agree about API payloads; when I design my APIs I typically use binary formats.
However, IME XML is actually great for config files.
Comments are crucial for config files. Once the complexity of the config grows, a hierarchy of nested nodes becomes handy; the two fixed levels of hierarchy found in old Windows INI files and modern Linux config files are less than ideal, you end up with too many sections. Attributes make documents easier to work with due to better use of horizontal screen space: auto-formatted JSON only has a single key/value per line, while XML with attributes has several per line, which reduces vertical scrolling.
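To make the attributes point concrete, here’s a made-up config fragment; the auto-formatted JSON equivalent would spend one line per key and has no standard comment syntax:

```xml
<!-- Hypothetical service config: comments, nesting, and attributes in one place -->
<config>
  <!-- Attributes keep related key=value pairs on a single line -->
  <listener host="0.0.0.0" port="8443" tls="true" />
  <logging level="warning" file="service.log" rotateMB="64" />
  <cache>
    <!-- Deeper nesting when a flat [section] isn't enough -->
    <memory limitMB="512" />
    <disk path="/var/cache/service" limitGB="20" />
  </cache>
</config>
```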
Software developers might be the majority here in HN comments, but they are definitely a small minority of the general population. A lot of people are negatively affected by AI: computer hardware is expensive because AI companies bought up all the memory, Windows 11 is crap because Microsoft reworked their operating system into an AI-driven trojan horse, many people lost jobs because AI companies convinced their employers’ top management that people will be replaced with computers any day now, etc.
I have a hypothesis about why issues like that are so widespread. That AI infrastructure is mostly developed by large companies whose business model is selling software as a service at scale; hence containers, micro-services, and TCP/IP in between. That approach is reasonable for data centres, because those are made of multiple servers, i.e. they need networking anyway, and they have private virtual networks just to connect the servers, so the security consequences aren’t too bad.
If they were designing these infrastructure pieces primarily for consumer use, they would have used named pipes, Unix domain sockets, or some other local-only IPC method instead of TCP/IP.
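The switch is small at the API level. A minimal sketch of a local-only listener, with a made-up socket path:

```cpp
// Minimal sketch of a local-only listener over a Unix domain socket.
// Unlike a TCP listener bound to 0.0.0.0, this endpoint is only reachable
// by processes on the same machine, subject to file permissions.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int listenLocal()
{
    const char* path = "/tmp/myservice.sock";  // placeholder path

    int fd = socket( AF_UNIX, SOCK_STREAM, 0 );
    if( fd < 0 ) return -1;

    sockaddr_un addr = {};
    addr.sun_family = AF_UNIX;
    std::strncpy( addr.sun_path, path, sizeof( addr.sun_path ) - 1 );

    unlink( path );  // remove a stale socket file from a previous run
    if( bind( fd, reinterpret_cast<const sockaddr*>( &addr ), sizeof( addr ) ) < 0 ||
        listen( fd, 16 ) < 0 )
    {
        close( fd );
        return -1;
    }
    return fd;  // accept() clients as with any other listening socket
}
```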
> you can get a performance improvement by calculating A, B, and C in parallel, then adding together whichever two finish first
Technically possible, but I think unlikely to happen in practice.
On the higher level, these large models are sequential and there’s nothing to parallelize. The inference is a continuous chain of data dependencies between temporary tensors which makes it impossible to compute different steps in parallel.
On the lower level, each step is a computationally expensive operation on a large tensor/matrix. These tensors often contain millions of numbers, the problem is very parallelizable, and the tactics for doing that efficiently are well researched because matrix linear algebra has been in wide use for decades. However, it’s both complicated and slow to implement fine-grained parallelism like “adding together whichever two finish first” on modern GPUs: that’s just too much synchronization when the total count of active threads is many thousands. Instead, operations like matrix multiplication typically assign one thread per output element (or per fixed count of output elements), and reductions like softmax or vector dot product use a series of exponentially decreasing reduction steps, i.e. the order is deterministic.
However, that order may change with even a minor update to any part of the software, including opaque low-level pieces like GPU drivers and firmware. The GPU kernels, drivers, firmware and OS kernels collectively implement the scheduler which assigns work to cores, and updating any of them may affect the order of these arithmetic operations.
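To illustrate that fixed order, here’s a simplified CPU model of the reduction pattern (assuming a power-of-two element count); real GPU kernels do the same halving across threads in shared memory, but the pairing is equally deterministic:

```cpp
#include <cstddef>
#include <vector>

// Simplified model of a GPU-style tree reduction: on every pass the active
// element count halves, and element i is always combined with element
// i + stride. The order of floating-point additions is fully determined by
// the algorithm, which is why results stay bit-for-bit reproducible until a
// software update changes how the work is scheduled.
// Assumes buf.size() is a power of two.
float reduceSum( std::vector<float> buf )
{
    for( std::size_t stride = buf.size() / 2; stride > 0; stride /= 2 )
        for( std::size_t i = 0; i < stride; i++ )
            buf[ i ] += buf[ i + stride ];
    return buf.empty() ? 0.0f : buf[ 0 ];
}
```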
Flash drives are less than ideal for backups. I think when they are stored cold, i.e. unpowered, flash memory only retains data for a couple of years. Spinning hard drives are way more reliable for this use case.
That's true. But if they are stored unpowered for a couple of years, then you clearly aren't doing regular backups. OTOH, it doesn't seem unlikely that the average person would leave a disk gathering dust, so advising people to use a regular HDD is probably the best approach.
> if they are stored unpowered for a couple of years, then you clearly aren't doing regular backups
I am doing regular backups, yet I have a few backup disks that have been unpowered for years. They are older, progressively smaller backup HDDs I keep for extra redundancy.
Every 2-4 years I get a larger backup drive and clone my previous backup drive to the new one. This way, when a backup drive fails (which happened around 2013, because I was unfortunate enough to get a notoriously unreliable 3TB Seagate), I don’t lose much data, if any, because most of the new stuff is still on the computers and the old stuff is left on these older backup drives.
I do basically the same, but instead of keeping everything around I just keep the last two drives in rotation at the same time: one kept at home and one kept at work. One of them failed recently, while I was performing a backup, so I just got a new (and larger) drive and synced it with the other backup drive before continuing as usual.