Skip to content

masked arrays #311

@BastienLacave

Description

@BastienLacave

Hi,

i was having an issue with the ctapipe-optimize-event-selection tool, see cta-observatory/ctapipe#2861. It seems that in cuts.py, the 'cut' columns was not initialized (filled with ones).
After iterating with chat GPT, I understand it is linke to the fact that np.nanpercentile does not support masked arrays.
Below is a function that works for me :

def calculate_percentile_cut(
    values,
    bin_values,
    bins,
    fill_value,
    percentile=68,
    min_value=None,
    max_value=None,
    smoothing=None,
    min_events=10,
):
    """
    Calculate cuts as the percentile of a given quantity in bins of another
    quantity.

    Parameters
    ----------
    values: ``~numpy.ndarray`` or ``~astropy.units.Quantity``
        The values for which the cut should be calculated
    bin_values: ``~numpy.ndarray`` or ``~astropy.units.Quantity``
        The values used to sort the ``values`` into bins
    edges: ``~numpy.ndarray`` or ``~astropy.units.Quantity``
        Bin edges
    fill_value: float or quantity
        Value for bins with less than ``min_events``,
        must have same unit as values
    percentile: float
        The percentile to calculate in each bin as a percentage,
        i.e. 0 <= percentile <= 100.
    min_value: float or quantity or None
        If given, cuts smaller than this value are replaced with ``min_value``
    max_value: float or quantity or None
        If given, cuts larger than this value are replaced with ``max_value``
    smoothing: float or None
        If given, apply a gaussian filter of width ``sigma`` in terms
        of bins.
    min_events: int
        Bins with less events than this number are replaced with ``fill_value``
    """
    # create a table to make use of groupby operations
    # we use a normal table here to avoid astropy/astropy#13840
    table = Table({"values": values}, copy=COPY_IF_NEEDED)
    unit = table["values"].unit

    # make sure units match
    if unit is not None:
        fill_value = u.Quantity(fill_value).to(unit)

        if min_value is not None:
            min_value = u.Quantity(min_value).to_value(unit)

        if max_value is not None:
            max_value = u.Quantity(max_value).to_value(unit)

    bin_index, valid = calculate_bin_indices(bin_values, bins)
    by_bin = table[valid].group_by(bin_index[valid])

    n_bins = len(bins) - 1
    cut_table = QTable()
    cut_table["low"] = bins[:-1]
    cut_table["high"] = bins[1:]
    cut_table["center"] = bin_center(bins)
    cut_table["n_events"] = 0

    unit = None
    if hasattr(fill_value, 'unit'):
        unit = fill_value.unit
        fill_value = fill_value.value
    
    # Extract scalar from masked arrays if needed
    if hasattr(fill_value, 'unmasked'):
        fill_value = float(fill_value.unmasked)
    elif hasattr(fill_value, 'item'):
        fill_value = float(fill_value.item())
    else:
        fill_value = float(fill_value)

    percentile = np.asanyarray(percentile)
    
    # Initialize cut column with proper shape based on percentile
    if percentile.shape == ():
        cut_table["cut"] = np.full(n_bins, fill_value, dtype=float)
    else:
        cut_table["cut"] = np.full((n_bins, len(percentile)), fill_value, dtype=float)

    if unit is not None:
        cut_table["cut"].unit = unit

    for bin_idx, group in zip(by_bin.groups.keys, by_bin.groups):
        # replace bins with too few events with fill_value
        n_events = len(group)
        cut_table["n_events"][bin_idx] = n_events

        if n_events < min_events:
            cut_table["cut"].value[bin_idx] = fill_value
        else:
            # Extract the actual data from masked arrays
            group_values = group["values"]
            
            # Handle different types of masked arrays
            if hasattr(group_values, 'unmasked'):
                # Astropy MaskedNDArray - get unmasked data and handle mask separately
                data = np.asarray(group_values.unmasked)
                if hasattr(group_values, 'mask') and group_values.mask is not None:
                    mask = np.asarray(group_values.mask)
                    data = np.where(mask, np.nan, data)
            elif hasattr(group_values, 'filled'):
                # numpy.ma.MaskedArray - fill masked values with nan
                data = group_values.filled(np.nan)
            else:
                # Regular array
                data = np.asarray(group_values, dtype=float)
            
            value = np.nanpercentile(data, percentile)
            if min_value is not None or max_value is not None:
                value = np.clip(value, min_value, max_value)

            cut_table["cut"].value[bin_idx] = value

    if smoothing is not None:
        cut_table["cut"].value[:] = gaussian_filter1d(
            cut_table["cut"].value,
            smoothing,
            mode="nearest",
        )

    return cut_table

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions