Hi,
I believe the bias is not removed in the quantize() function. Wouldn't removing it be necessary for a symmetric Q8_0 quantization of the activations, or is that step not needed here?
void quantize(QuantizedTensor *qx, float* x, int n) {
    int num_groups = n / GS;
    float Q_MAX = 127.0f;
    for (int group = 0; group < num_groups; group++) {
        // find the max absolute value in the current group
        float wmax = 0.0;
        for (int i = 0; i < GS; i++) {
            float val = fabs(x[group * GS + i]);
            if (val > wmax) {
                wmax = val;
            }
        }
        // calculate and write the scaling factor
        float scale = wmax / Q_MAX;
        qx->s[group] = scale;
        // calculate and write the quantized values
        for (int i = 0; i < GS; i++) {
            float quant_value = x[group * GS + i] / scale; // scale
            int8_t quantized = (int8_t) round(quant_value); // round and clamp
            qx->q[group * GS + i] = quantized;
        }
    }
}