
ChatGPT 4

> Use reservoir sampling to get 100 random lines from a dictionary that doesn’t fit into memory. Give me an implementation in Python.

> Reservoir sampling is a great choice for randomly selecting a fixed number of items from a stream of unknown size, especially when the entire dataset cannot fit into memory. Here's a Python implementation for selecting 100 random lines from a large file, like a dictionary, using reservoir sampling:

  ```python
  import random

  def reservoir_sampling(file_name, k):
      reservoir = []
      with open(file_name, 'r') as file:
          for i, line in enumerate(file):
              if i < k:
                  # Fill the reservoir array
                  reservoir.append(line.strip())
              else:
                  # Randomly replace elements in the reservoir
                  # with a decreasing probability.
                  # Choose an integer between 0 and i (inclusive)
                  j = random.randint(0, i)
                  if j < k:
                      reservoir[j] = line.strip()
      return reservoir

  # Example usage
  file_path = 'path_to_your_large_file.txt'  # Replace with your file path
  sampled_lines = reservoir_sampling(file_path, 100)
  print(sampled_lines)
  ```
> This script opens a file and reads through it line by line. For the first 100 lines, it adds them directly to the reservoir. For each subsequent line, it randomly decides whether to replace an existing line in the reservoir with the new line. The probability of replacing decreases as it progresses through the file, ensuring each line has an equal chance of being included in the final sample.
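To sanity-check the uniformity claim, here's a quick simulation (an in-memory variant of the same algorithm, not part of the quoted answer): sample 2 items from a 10-item stream many times and check that every item is picked about 2/10 of the time.

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    # Same algorithm as above, but over any iterable.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 2 of 10 items, 20000 times, counting selections.
counts = Counter()
for _ in range(20000):
    counts.update(reservoir_sample(range(10), 2))

# Each item should be picked ~2/10 of the time, i.e. ~4000 counts.
for item in range(10):
    assert 3500 < counts[item] < 4500, (item, counts[item])
print("uniform within tolerance")
```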


With Lisp-2 designs, as discussed in the article, this is not an issue, as variables and functions are in different namespaces:

  CL-USER> (defun foo (list) (list list))
  FOO
  CL-USER> (foo 42)
  (42)
In this case the function attached to the symbol LIST is applied to the argument with the same name, but that isn't a problem.

To further illustrate, in the above example the LIST symbol is imported from the package COMMON-LISP and has a function, a property list (plist), etc. attached to it:

  CL-USER> (symbol-package 'list)
  #<PACKAGE "COMMON-LISP">
  CL-USER> (symbol-function 'list)
  #<FUNCTION LIST>
  CL-USER> (symbol-plist 'list)
  NIL


Yeah, I’ve been writing Common Lisp for six years now: Lisp-2s, IMO, make the right trade-offs.


Macro expansion is done in the host environment, not the target environment. You can compile the macro definition to a function taking a form and an environment. This would be something you arrange for within your compiler.

CLTL2 specifies a function `parse-macro` which is available in most implementations. Portably you can call it using the `trivial-cltl2` package:

https://github.com/Zulu-Inuoe/trivial-cltl2

Robert Strandh has written both papers and code for handling environments:

http://metamodular.com/SICL/environments.pdf

http://metamodular.com/clostrum.pdf


Either you are describing something that doesn’t work for cross-compilation or you are missing the point. Common Lisp is a hard language to compile because of macros. Consider the following code you might see in bootstrapping:

  (defmacro defun ...)
  (defun ...)
To macroexpand the second line you must have an environment with the defun macro definition from the previous line, which means that the code to modify the environment with that definition must have been evaluated too. This must be evaluated in the (emulated) target environment because (1) the host environment would otherwise clash with it, and (2) you need to write down the target environment as a build artefact. Evaluating these environment-modifying forms requires an interpreter you can run on the host. If you want a good programming experience for your standard library then this needs to be a reasonably capable interpreter.
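Here's a toy sketch of the environment threading (in Python, with forms as nested lists; all names here are illustrative, not from any real Lisp compiler): macro definitions are evaluated into an explicit target environment, and later forms can only be expanded because earlier forms already mutated that same environment.

```python
# Toy model: macro definitions are evaluated into an explicit
# target environment; expansion looks definitions up there rather
# than in the host's own macro tables.

def make_env():
    return {"macros": {}}

def eval_toplevel(form, env):
    """Process one top-level form, mutating the target environment."""
    head = form[0]
    if head == "defmacro":
        # (defmacro name expander-fn): record the expander in the
        # *target* environment so later forms can be expanded.
        _, name, expander = form
        env["macros"][name] = expander
        return None
    return macroexpand(form, env)

def macroexpand(form, env):
    """Expand macro calls using the target environment only."""
    while isinstance(form, list) and form and form[0] in env["macros"]:
        form = env["macros"][form[0]](form, env)
    return form

env = make_env()
# First form: define a macro (the expander is a plain function here).
eval_toplevel(["defmacro", "when2",
               lambda f, e: ["if", f[1], f[2], None]], env)
# Second form: expandable only because the first form was already
# evaluated into the same environment.
print(eval_toplevel(["when2", "cond", "body"], env))
# → ['if', 'cond', 'body', None]
```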

But you also need an interpreter to evaluate code during macroexpansion. Three reasons:

1. You must emulate the target environment for correctness, eg JavaScript only has one type of float.

2. You cannot sufficiently interact with the host environment to use its own macroexpander (if you want to be portable). This is because compiling a macrolet requires extending the lexical environment with the macro definition so that calls to macroexpand (from, e.g., the expander for setf) can use the definition. The only portable thing on the host that can evaluate a macrolet is the compiler/interpreter.

3. You probably can’t use native types for things like symbols so the built in macro expansion is not sufficient.


Common Lisp actually doesn't specify a mechanism for this. There are implementations that can compile to e.g. a DLL:

http://www.lispworks.com/documentation/lw71/DV/html/delivery... https://franz.com/support/documentation/current/doc/dll.htm

Unfortunately none of the FOSS implementations have this ability (to my knowledge). There is nothing inherently in Common Lisp that mandates the "core dump" delivery model.


One correction to my post above - Corman Lisp is a Free implementation that does have the ability to produce DLLs, but it is limited to Windows:

https://github.com/sharplispers/cormanlisp/blob/master/docum...


How about ECL? That should do that, too.


A characteristic of intrusive data structures is generally good cache friendliness. In this context "intrusive" means that the data structure is embedded with the data that's being stored.

In C terms, as in this case, that means the pointers each node requires aren't allocated in separate structs but are instead embedded in the one struct that also holds the payload.

This means that the cache behavior is improved, as a given node is stored in a single location. Once you access a node the associated data is already in the cache, instead of having to be fetched via a separate pointer dereference.
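A rough Python sketch of the structural difference (the cache argument is really about C memory layout; this only shows where the link field lives):

```python
# Non-intrusive: the list allocates separate node objects that
# point at the payload (two objects per element).
class Node:
    def __init__(self, payload):
        self.payload = payload
        self.next = None

# Intrusive: the link field lives inside the payload object itself
# (one object per element) -- analogous to embedding the `next`
# pointer in the payload struct in C.
class Task:
    def __init__(self, name):
        self.name = name
        self.next = None  # intrusive link, part of the payload

def push(head, task):
    task.next = head
    return task

head = None
for name in ["c", "b", "a"]:
    head = push(head, Task(name))

# Walking the list touches only the payload objects themselves;
# there is no second dereference through a wrapper node.
names = []
node = head
while node is not None:
    names.append(node.name)
    node = node.next
print(names)  # → ['a', 'b', 'c']
```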


Note that it says less than one character. A character in UTF-8 can be composed of multiple bytes.

The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:

  We define 7 byte types:
  T0 0xxxxxxx      7 free bits
  Tx 10xxxxxx      6 free bits
  T1 110xxxxx      5 free bits
  T2 1110xxxx      4 free bits
  T3 11110xxx      3 free bits
  T4 111110xx      2 free bits
  T5 111111xx      2 free bits

  Encoding is as follows.
  From hex  Thru hex      Sequence             Bits
  00000000  0000007f      T0                   7
  00000080  000007FF      T1 Tx                11
  00000800  0000FFFF      T2 Tx Tx             16
  00010000  001FFFFF      T3 Tx Tx Tx          21
  00200000  03FFFFFF      T4 Tx Tx Tx Tx       26
  04000000  FFFFFFFF      T5 Tx Tx Tx Tx Tx    32
[...]

  4. All of the sequences synchronize on any byte that is not a Tx byte.
If you are starting mid-run, skip initial Tx bytes. That will always be less than one character.
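This resynchronization is easy to sketch in Python: continuation (Tx) bytes all match 10xxxxxx, so from an arbitrary offset you skip at most the tail of one character:

```python
def resync(data: bytes, offset: int) -> int:
    """Given an arbitrary byte offset into UTF-8 data, advance past
    any continuation bytes (10xxxxxx, i.e. top bits 0b10) to the
    start of the next character."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

data = "héllo, wörld".encode("utf-8")
# Offset 2 lands inside the 2-byte encoding of 'é' (bytes 1-2).
start = resync(data, 2)
print(data[start:].decode("utf-8"))  # → 'llo, wörld'
```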


Note that UTF-8 has since been restricted to at most 4 bytes (i.e. the longest sequence is `T3 Tx Tx Tx`).


So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billion characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 to be between 1 to 6 bytes.

March 2001: A bunch of CJK characters were added to Unicode 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3-byte UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reasons (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 defines UTF-8 to be between 1 to 4 bytes.

Arguably, if the placemat was smaller and the guys stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still happen.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!


This is also a very simple application of the idea of a "prefix-free code" from information theory and coding (the set of codes {0, 10, 110, 1110, 11110, 111110, 111111} is prefix-free).

I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.


The Itanium C++ ABI mentions something similar:

> A two-phase exception-handling model is not strictly necessary to implement C++ language semantics, but it does provide some benefits. For example, the first phase allows an exception-handling mechanism to dismiss an exception before stack unwinding begins, which allows resumptive exception handling (correcting the exceptional condition and resuming execution at the point where it was raised). While C++ does not support resumptive exception handling, other languages do, and the two-phase model allows C++ to coexist with those languages on the stack.

http://itanium-cxx-abi.github.io/cxx-abi/abi-eh.html
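Here's a toy sketch of the resumptive idea in Python (all names are made up; real two-phase unwinding operates on machine stack frames via personality routines): phase 1 searches for a handler without unwinding, and a handler may repair the condition so execution resumes at the raise point.

```python
# Toy two-phase model: phase 1 searches installed handlers without
# unwinding; a handler can either repair the condition and resume
# at the raise point, or decline, in which case phase 2 unwinds.

handlers = []  # innermost handler last

class Unwind(Exception):
    pass

def signal(condition):
    """Phase 1: ask handlers, innermost first, whether they can
    repair the condition. If one returns a value, execution resumes
    at the raise point with that value (no unwinding happened).
    Otherwise phase 2 actually unwinds the stack."""
    for handler in reversed(handlers):
        result = handler(condition)
        if result is not None:
            return result          # resumptive: stack still intact
    raise Unwind(condition)        # phase 2: unwind for real

def parse_number(text):
    try:
        return int(text)
    except ValueError:
        # Give handlers a chance to supply a replacement value.
        return signal(("bad-number", text))

# Install a handler that repairs bad numbers with 0.
handlers.append(lambda cond: 0 if cond[0] == "bad-number" else None)
print(parse_number("12"))    # → 12
print(parse_number("oops"))  # → 0 (handler resumed with a repair)
```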


I am aware that there have historically been other languages besides Lisp that allow resumptive exception handling (PL/I is a historical example), but I'm unaware of any modern language besides Lisp that does it.

What would C++ be coexisting with on non-mainframe hardware?


When I was at Qt I spent a decent amount of time porting the native rendering engine from Qt4 to Qt5. It should be available and work relatively well. I have retired from the Qt project, but it looks like the code is still there in the repo. [1] You might need to tell the configure script to build it. To use it, set the QT_XCB_NATIVE_PAINTING environment variable.

[1] https://code.qt.io/cgit/qt/qtbase.git/tree/src/plugins/platf...


This is exactly how Arabic works today. It's a huge impediment, not a benefit.


Can you please elaborate on why it’s an impediment?


Because this way no one has a written form of their mother tongue. They have to learn a second language and its written form. As others are pointing out, the situation is pretty much equivalent to Europe 1000 years ago, when Latin had evolved into separate languages but all writing was still done in Latin. What would you prefer: being able to speak and write in French, or speaking in French but having to learn Latin to be able to write?

