data leakage in predict_next_purchases

Hey,

In the notebook, when using:

    clf.fit(X, y)
    top_features = utils.feature_importances(clf, features_encoded, n=20)

we introduce data leakage, since the features are selected on the whole data set, including the samples that are later used for validation.
With scikit-learn's pipelines, it's possible to select the 20 best features (or rather a fraction of the features) separately for each fold with [`SelectFromModel`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel).
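
A rough sketch of what I mean, assuming a random forest as both the selector and the final classifier, and using synthetic data from `make_classification` as a stand-in for the notebook's feature matrix (both are my assumptions, not what the notebook does):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the notebook's data (assumption).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

# Keep the 20 most important features; threshold=-np.inf makes
# max_features the only selection criterion.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=20,
    threshold=-np.inf,
)

# Because selection is a pipeline step, it is refit on the training
# part of every fold and never sees that fold's validation samples.
pipe = Pipeline([
    ("select", selector),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The selected features can differ from fold to fold, which is expected: the score then reflects the whole procedure, selection included.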

With the current setup, you are probably overestimating the AUC.
Besides, `cross_val_score` assumes IID samples. However, this will clearly not be the case here, since one entity typically has several occurrences. I think something like [`TimeSeriesSplit`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit), or rather an adaptation of it (since we don't have time series in the classical sense but rather time slices), should be the correct thing to use here.
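
To make the idea concrete, here is a minimal sketch of two options. `time_slice_splits`, `entity_ids` and `cutoff_times` are hypothetical names I made up for illustration, and the tiny arrays are toy data, not the notebook's:

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def time_slice_splits(cutoff_times, n_splits=3):
    """Yield (train_idx, test_idx): train on earlier slices, test on the next one."""
    cutoff_times = np.asarray(cutoff_times)
    slices = np.unique(cutoff_times)  # sorted unique time slices
    if n_splits >= len(slices):
        raise ValueError("n_splits must be smaller than the number of time slices")
    for t in slices[-n_splits:]:
        train_idx = np.flatnonzero(cutoff_times < t)
        test_idx = np.flatnonzero(cutoff_times == t)
        yield train_idx, test_idx


# Hypothetical layout: 4 entities, each observed in 4 time slices.
entity_ids = np.repeat(np.arange(4), 4)
cutoff_times = np.tile(np.arange(4), 4)

# Option 1: always validate on a slice later than everything in training.
time_splits = list(time_slice_splits(cutoff_times, n_splits=3))

# Option 2: keep all rows of an entity in the same fold.
X_dummy = np.zeros((len(entity_ids), 1))
group_splits = list(GroupKFold(n_splits=4).split(X_dummy, groups=entity_ids))
```

Either list of index pairs can be passed as the `cv` argument of `cross_val_score`. Option 1 mimics predicting the future from the past; option 2 only removes the overlap between entities, so which one fits depends on how the model will be used.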

Comments on those issues?

I'm currently facing the same issues at work, so I really appreciate the library you have developed. I haven't seen anything similar so far, so thumbs up in any case.
