The impact of loss function on training result #970

Open

rzhli opened this issue Apr 3, 2025 · 3 comments
Labels: question (Further information is requested)

Comments


rzhli commented Apr 3, 2025

  • In the Weather forecasting example, you chose sum(abs2) as the loss function, but in Sebastian Callh's personal blog he uses Flux.mse, and the resulting loss values differ by orders of magnitude. The forecasting result is also not as good as the original one. Is this because of the different loss functions? (A small sketch of how the two loss values relate follows these questions.)

  • The callback function simply returns false. Can we set a different criterion for each feature, so that training terminates once its loss is small enough?

  • In the original example, all raw data was pre-processed as a whole, while in this example you split it into train and test sets first and then standardized each separately. This produces slightly different training data even though the underlying data set is the same. How much impact does this have on training and on the final test outcome?
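
Regarding the first question, a minimal sketch of how the two loss values relate (ŷ and y here are hypothetical stand-ins for the network output and the targets, not the example's variables):

using Statistics

ŷ = randn(4, 100)           # hypothetical predictions: 4 features × 100 time steps
y = randn(4, 100)           # hypothetical targets

sse = sum(abs2, ŷ .- y)     # the loss used in the Weather forecasting example
mse = mean(abs2, ŷ .- y)    # what Flux.mse computes by default
sse ≈ mse * length(y)       # true: the values differ by the element count, so they sit orders of magnitude apart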


rzhli added the question label on Apr 3, 2025
@shreyashkumar01

Hi @rzhli ,
This is an interesting issue regarding the loss function and data pre-processing.

On the loss function, it makes sense that the choice between a summed loss like sum(abs2) and a mean loss like Flux.mse could significantly affect training, especially if the scale of the errors differs greatly between features. Perhaps we could start by looking at the distribution of the errors on the original data.

Regarding the callback function, feature-specific early stopping sounds like a useful idea. I wonder if there are existing libraries or techniques that facilitate this.

The difference in pre-processing order (splitting before or after standardization) is also worth considering: standardizing before the split lets information from the test set leak into the training statistics.
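
A rough sketch of leakage-free standardization, assuming hypothetical train_data and test_data matrices laid out as features × samples (these names are not from the example):

using Statistics

train_data = randn(4, 80)                      # hypothetical training split
test_data  = randn(4, 20)                      # hypothetical test split

μ = mean(train_data, dims=2)                   # statistics from the training split only
σ = std(train_data, dims=2)

standardize(x, μ, σ) = (x .- μ) ./ σ

train_std = standardize(train_data, μ, σ)
test_std  = standardize(test_data, μ, σ)       # the test split reuses the train statistics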

I'm keen to investigate these points further. Do you have any initial thoughts or suggestions on where I should start?


rzhli commented Apr 5, 2025

julia> y_mean
4×1 Matrix{Float64}:
   25.27671416254674
   61.30422906824019
    7.194851184048889
 1007.321897943855
julia> y_scale
4×1 Matrix{Float64}:
  7.484413247165987
 15.784237081943546
  1.8812446147951691
  7.75348347013801
  • You can see here that even after standardizing the data, the values of the different features still vary widely. In fact, standardization squeezes every data point toward the mean (μ) by the scale (σ), and both of these depend on the values in the data set, on outliers, on the number of data points, etc. So there is no consistency or uniformity across features.

  • Instead of this standardization, I think the following dimensionless (min-max) scaling may be more appropriate:

# Min-max scaling: map every value of x into [0, 1].
function dimensionless(x)
    x_dimless = (x .- minimum(x)) ./ (maximum(x) - minimum(x))
    return x_dimless
end

This maps each data point into [0, 1] without distortion, and it is affected only by the maximum and minimum values of the feature.
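
A hedged usage note: on a features × samples matrix like the one above, this scaling would need to be applied per feature (per row), for example with mapslices; a single global minimum/maximum would mix the features again. A small sketch with a hypothetical toy matrix:

X = [25.0 27.0 23.0; 1000.0 1010.0 1005.0]       # hypothetical 2-feature toy data (features × samples)
X_dimless = mapslices(dimensionless, X, dims=2)  # scale each feature (row) to [0, 1] independently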

  • Since the features are then on a consistent scale, the choice of loss function matters less: sum(abs2), mse(), or whatever other loss function you use should have little impact on the training result. For the callback, we could then use a single criterion to terminate training, like return loss < 1.0e-3 # terminate if the loss is small, with no need to specify one per feature (see the sketch below). But these are all guesses; I haven't tested them. I also noticed that none of the other examples use this method; they all return false.
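
A minimal sketch of such a callback, assuming it receives the current loss as its second argument (as in the examples that simply return false); returning true stops the optimization:

callback = function (state, loss)
    return loss < 1.0e-3    # stop once the overall loss is small enough
end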

  • As for the order of pre-processing and splitting the data set, it shouldn't matter as long as the data set is uniform after processing. I have also heard of n-fold cross-validation, which divides the data set into n equal (or nearly equal) parts, called folds. The model is trained and evaluated n times, each time using a different fold as the test set and the remaining n-1 folds as the training set, so every data point is used for both training and testing exactly once. But I'm not an expert in this area; I only have the concept in mind, without a way to implement it (see the sketch below).
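
A rough sketch of how the folds could be constructed (kfold_indices is my own hypothetical helper, not from any of the examples):

using Random

function kfold_indices(n_samples, k; rng = Random.default_rng())
    idx = shuffle(rng, 1:n_samples)                  # random permutation of sample indices
    folds = [idx[i:k:end] for i in 1:k]              # k nearly equal folds
    return [(vcat(folds[1:i-1]..., folds[i+1:end]...), folds[i]) for i in 1:k]
end

for (train_idx, test_idx) in kfold_indices(120, 5)
    # train on data[:, train_idx] and evaluate on data[:, test_idx] (hypothetical `data` matrix)
end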

  • Those are my thoughts on these issues; I hope they are helpful.

@shreyashkumar01

Hi @rzhli ,
Thanks for the detailed follow-up. To summarize the key points from your latest comment:
  • Consistent scaling (like the proposed [0, 1] method) might reduce the impact of the specific loss function used.
  • With consistent scaling, a single global early-stopping criterion based on the overall loss might be sufficient.
  • The order of pre-processing and splitting might not be critical if the data is processed uniformly.
  • n-fold cross-validation could also be worth exploring.
It seems like the focus is shifting towards the importance of consistent feature scaling. What are your thoughts on these suggestions? Would you like me to investigate the [0, 1] scaling method and its implications further?
