It's a fairly straightforward modification of BitNet, so I assume this quote from the BitNet paper applies:
To train our 1-bit model, we employ the straight-through estimator (STE) [BLC13] to approximate the gradient during backpropagation. This method bypasses the non-differentiable functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions, during the backward pass. STE allows gradients to flow through the network without being affected by these non-differentiable functions, making it possible to train our quantized model.
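For concreteness, here's a minimal PyTorch sketch of the STE idea (my own illustration, not code from the paper): the forward pass applies the non-differentiable sign function, and the backward pass pretends it was the identity, so gradients flow through unchanged.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign quantization with a straight-through estimator backward."""

    @staticmethod
    def forward(ctx, x):
        # Non-differentiable quantization in the forward pass.
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the gradient through as if forward were the identity.
        return grad_output

x = torch.randn(4, requires_grad=True)
y = SignSTE.apply(x)
y.sum().backward()
print(x.grad)  # all ones: the gradient skipped right past sign()
```

You'll also see the same trick written as a one-liner, `x + (torch.sign(x) - x).detach()`, which evaluates to `sign(x)` in the forward pass while letting the gradient flow through `x` untouched.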