Hello!
I’m using SOAP descriptors to build a similarity kernel and am wondering whether I should explicitly normalize each SOAP vector before computing the kernel. Some references suggest dividing by the vector norm (or normalizing the kernel), but it’s not obvious from the DScribe documentation whether this is done automatically. Your example does the following:
a_features = desc.create(a)
b_features = desc.create(b)
re = AverageKernel(metric="linear")
re_kernel = re.create([a_features, b_features])
Would it be better to normalize the SOAP vectors beforehand? Something like:
import numpy as np

# Normalize every per-atom SOAP vector (each row) to unit length.
soap_list_norm = []
for feat in (a_features, b_features):
    # feat has shape (n_atoms, n_features); take the norm of each row.
    norm = np.linalg.norm(feat, axis=1, keepdims=True)
    soap_list_norm.append(feat / norm)
I’m asking because I frequently get the error "ValueError: Input X contains NaN" when using the REMatchKernel. I suspect that either the descriptor generation already produces NaNs or a division by zero happens somewhere in the kernel computation. Do you have any guidelines on best practices for avoiding NaNs in DScribe?
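For context, this is roughly the check I have in mind before passing the features to a kernel. It is only a sketch under my own assumptions: it uses NumPy alone, random arrays stand in for the output of desc.create, and find_bad_rows and normalize_rows are hypothetical helper names of mine, not DScribe functions.

```python
import numpy as np


def find_bad_rows(features):
    """Return indices of rows that contain NaN/inf or are entirely zero."""
    features = np.asarray(features, dtype=float)
    non_finite = ~np.isfinite(features).all(axis=1)
    all_zero = ~features.any(axis=1)
    return np.flatnonzero(non_finite | all_zero)


def normalize_rows(features, eps=1e-12):
    """Scale each row to unit length.

    Rows whose norm is not above eps are divided by 1 instead, so an
    all-zero row stays zero rather than turning into NaN.
    """
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    safe_norms = np.where(norms > eps, norms, 1.0)
    return features / safe_norms


# Stand-in data (assumption): 4 atoms, 6 SOAP components each.
rng = np.random.default_rng(0)
a_features = rng.random((4, 6))
a_features[2] = 0.0  # simulate an atom whose SOAP vector is all zeros

bad = find_bad_rows(a_features)
a_norm = normalize_rows(a_features)
print("rows to inspect:", bad)
print("any NaN after normalization:", np.isnan(a_norm).any())
```

With the naive feat / norm division the all-zero row would become NaN, which is the kind of input I suspect triggers the ValueError.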
Thanks for your help!
Best,
Alejandro