MultiIndex with NaNs reported as non sorted #17931

Open
@louridas

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: dates_range = pd.date_range('2016/10/1', periods=3).repeat(4)
In [4]: statuses = np.tile(['A', 'B', None, 'C'], 3)
In [5]: values = np.random.randn(12)
In [6]: df = pd.DataFrame({'date': dates_range, 'status': statuses, 'value': values})
In [7]: df
Out[7]:
         date status     value
0  2016-10-01      A -1.946876
1  2016-10-01      B  1.080243
2  2016-10-01   None  0.165715
3  2016-10-01      C -0.615913
4  2016-10-02      A  0.662645
5  2016-10-02      B  1.448593
6  2016-10-02   None -1.392233
7  2016-10-02      C  1.534083
8  2016-10-03      A  0.801988
9  2016-10-03      B -0.689987
10 2016-10-03   None -0.150036
11 2016-10-03      C -0.197410

In [8]: df.set_index(['date', 'status']).sort_index()
Out[8]:
                      value
date       status
2016-10-01 A      -1.946876
           B       1.080243
           C      -0.615913
           NaN     0.165715
2016-10-02 A       0.662645
           B       1.448593
           C       1.534083
           NaN    -1.392233
2016-10-03 A       0.801988
           B      -0.689987
           C      -0.197410
           NaN    -0.150036

In [9]: df.set_index(['date', 'status']).sort_index().index.is_lexsorted()
Out[9]: False

In [10]: df.set_index(['date', 'status']).sort_index(level=['date', 'status']).index.is_lexsorted()
Out[10]: True

Problem description

When I call sort_index() without a level argument, I expect the DataFrame to be sorted on all levels, and the output above shows that it is. However, is_lexsorted() on the resulting index then returns False. The problem goes away if I pass the level argument explicitly, listing every level.
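One possible reading of the False result, not confirmed anywhere in this report, is that the check runs on the integer codes behind each level rather than on the displayed values. The sketch below uses plain Python and assumes that a missing value is encoded as -1 while A, B, C get the codes 0, 1, 2. The function name `is_lexsorted_codes` and both code lists are hypothetical and only illustrate the idea.

```python
def is_lexsorted_codes(code_rows):
    """Return True if rows of integer codes are in non-decreasing lexicographic order."""
    # Tuples compare element by element, which is exactly a lexicographic comparison.
    return all(a <= b for a, b in zip(code_rows, code_rows[1:]))


# Hypothetical codes for a single date (code 0) with statuses A, B, C, NaN.
# Assumption: A=0, B=1, C=2 and the missing value is encoded as -1.
nan_last = [(0, 0), (0, 1), (0, 2), (0, -1)]   # NaN displayed last
nan_first = [(0, -1), (0, 0), (0, 1), (0, 2)]  # NaN placed first

print(is_lexsorted_codes(nan_last))   # False
print(is_lexsorted_codes(nan_first))  # True
```

Under that assumption, an ordering that looks sorted on screen (NaN last) would still fail a codes-based check, because -1 follows 2.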

This behavior is particularly problematic when the MultiIndex has many levels, say n, and the NaNs appear at some level m < n. In that case, slicing the MultiIndex on levels up to m works, while slicing on levels from m onward fails.

Note that earlier versions did not behave this way. Unfortunately I am not sure which ones; the code worked in spring 2017, and I came upon the issue when re-running code from that time.

Expected Output

The expected behavior is what I get when I pass the level argument with all index levels, as in the transcript above: is_lexsorted() returns True.
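Until the default call behaves this way, the explicit-level call from the transcript can be wrapped in a small helper. This is a minimal sketch: the helper name `sort_all_levels` is hypothetical, and it uses level positions rather than names so that unnamed levels also work. The report only establishes that the explicit form gave the expected result on pandas 0.20.3; other versions are not verified here.

```python
import numpy as np
import pandas as pd


def sort_all_levels(frame):
    """Sort a frame by its index, naming every index level explicitly."""
    # Equivalent in spirit to sort_index(level=['date', 'status']) from the
    # transcript, but built from level positions instead of hard-coded names.
    all_levels = list(range(frame.index.nlevels))
    return frame.sort_index(level=all_levels)


# Same shape of data as in the report, with deterministic values.
dates = pd.date_range("2016-10-01", periods=3).repeat(4)
statuses = np.tile(np.array(["A", "B", None, "C"], dtype=object), 3)
df = pd.DataFrame({"date": dates, "status": statuses, "value": np.arange(12.0)})

indexed = sort_all_levels(df.set_index(["date", "status"]))
print(indexed)
```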

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 17.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.14
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), MultiIndex
