
BUG: DataFrame.sample weights not required to sum to less than 1 #61516


Open
dougj892 opened this issue May 29, 2025 · 14 comments
Labels: Algos, Bug
@dougj892

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = {'w': [100, 1, 1]}
df = pd.DataFrame(data)

df.sample(n=2, weights=df.w, replace=False)

Issue Description

In order for PPS sampling without replacement to be feasible, the selection probabilities must be less than 1, i.e.

$\frac{n \cdot w_i}{\sum_j w_j} < 1$

where w is the weight and n is the total number of units to be sampled. This is often not the case if you are selecting a decent proportion of all units and there is wide variance in unit size. For example, suppose you want to select 2 units with PPS without replacement from a sampling frame of 3 units with sizes 100, 1, and 1. There is no way to make the probability of selection of the first unit 100x the probability of selection of the other two units (since the max prob for the first unit is 1 and at least one of the other units must have prob >= .5).
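To make the infeasibility concrete, here is a quick check of the implied inclusion probabilities for that example (a small illustrative snippet, not part of the original report):

import numpy as np

n = 2
w = np.array([100, 1, 1])

# Implied PPS inclusion probabilities: n * w_i / sum(w)
print(n * w / w.sum())  # ~[1.96, 0.02, 0.02] -- the first exceeds 1, so PPS without replacement is infeasible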

Unfortunately, pandas' DataFrame.sample doesn't throw an error in this case.

Expected Behavior

The code above should throw some sort of error like "Some unit probabilities are larger than 1 and thus PPS sampling without replacement cannot be performed"

Installed Versions

Replace this line with the output of pd.show_versions()

@dougj892 dougj892 added the Bug and Needs Triage labels May 29, 2025
@rhshadrach
Member

Thanks for the report. As named and documented, these are weights and not probabilities. pandas will normalize the weights so they sum to 1 and are used as probabilities. E.g. in your example, the corresponding probabilities are:

100 / (100 + 1 + 1), 1 / (100 + 1 + 1), 1 / (100 + 1 + 1)
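In code, this is roughly (illustrative only, not the pandas source):

import pandas as pd

weights = pd.Series([100, 1, 1])
print(weights / weights.sum())  # 0.980392, 0.009804, 0.009804 -- used as selection probabilities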

Marking as a closing candidate for now, let me know if you think there is still behavior that needs adjusting.

@rhshadrach rhshadrach added the Algos and Closing Candidate labels and removed the Needs Triage label May 30, 2025
@rhshadrach rhshadrach changed the title from "BUG:" to "BUG: DataFrame.sample weights not required to sum to less than 1" May 30, 2025
@Tanush-Jain

Hi @dougj892 ,

I’ve reviewed this issue and confirmed the bug:

When replace=False in DataFrame.sample(), the current implementation doesn’t validate whether the normalized weights (n * w_i / sum(w)) exceed 1, which violates PPS sampling constraints. The example with weights [100, 1, 1] and n=2 silently produces biased samples instead of raising an error.

Proposed Fix
I’ll address this by:

  1. Adding Validation (a rough sketch of this check follows the list):

Compute max_normalized_weight = (n * weights.max()) / weights.sum() when replace=False.

Raise a ValueError if max_normalized_weight > 1, with a clear message like:

"PPS sampling without replacement requires (nw_i/sum(w)) ≤ 1 for all weights. Found max normalized weight = {max_weight:.2f}. Use replace=True, adjust weights, or reduce n."*

  2. Edge Cases:

Skip validation for replace=True or weights=None.

Handle zero-sum weights gracefully.

  3. Documentation:

Update the sample() docstring to clarify PPS constraints.

  4. Tests:

Add test cases for invalid/valid weights (e.g., [100, 1, 1] with n=2 vs. n=1).

Verify edge cases (zero weights, empty DataFrames).
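
For reference, a minimal sketch of the check described in item 1, written as a standalone helper (the name and placement are hypothetical, not pandas internals):

import pandas as pd

def validate_pps_weights(weights: pd.Series, n: int, replace: bool) -> None:
    # Hypothetical helper illustrating the proposed validation; not part of pandas.
    if replace or weights is None:
        return  # the constraint only applies to weighted sampling without replacement
    total = weights.sum()
    if total == 0:
        raise ValueError("Invalid weights: weights sum to zero")
    max_normalized = n * weights.max() / total
    if max_normalized > 1:
        raise ValueError(
            "PPS sampling without replacement requires (n * w_i / sum(w)) <= 1 for all "
            f"weights. Found max normalized weight = {max_normalized:.2f}. "
            "Use replace=True, adjust weights, or reduce n."
        )

# e.g. validate_pps_weights(pd.Series([100, 1, 1]), n=2, replace=False) raises ValueError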

Next Steps
I’ll submit a PR shortly with these changes. Let me know if you’d prefer any adjustments to the error messaging or implementation approach.

Thanks for the review!

@dougj892
Author

Hi @Tanush-Jain - Thanks a lot for the rapid response! That makes a lot of sense to me. For reference, the error message you get when you do this in Stata is "X cases has w_i*n/sum(w)>1" so I think that error message makes sense.

@rhshadrach
Member

I am negative on restricting the weights here. Allowing weights to be greater than 1 is quite common, e.g. random.choices.

@Tanush-Jain

@rhshadrach @dougj892 Thanks for clarifying! I now understand that the current weights behavior intentionally supports broader use cases beyond PPS sampling.

Since this isn’t actually a bug (just a documentation gap for PPS use cases), here are a couple of suggestions:

  1. Propose a docs PR to:

    • Add a Note in sample()'s docstring warning that strict PPS sampling requires (n * weights / weights.sum()).max() <= 1 when replace=False.
    • Include an example like:
      # Manual PPS validation  
      if not replace and weights is not None:  
          assert (n * weights / weights.sum()).max() <= 1, "PPS sampling violation"  
  2. Let the maintainers decide whether to:

    • Merge the docs update.
    • Close this issue (since it’s working as designed).

Happy to implement whichever path you prefer!

@dougj892
Author

dougj892 commented May 30, 2025

@Tanush-Jain and @rhshadrach

To clarify a bit, I agree that weights greater than 1 are totally fine. The issue with the example given is that the probabilities are not in fact proportional to the weights – because they can’t be. @rhshadrach You mentioned that the probabilities for the three units are 100/102, 1/102, and 1/102. In the example code, the probabilities for the last two units are actually ~50%. To see this, note that if we sample 2 units without replacement, we are almost always going to select the first unit, which means that we have a 50/50 chance of selecting either of the last two units.
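
A rough simulation (illustrative only) bears this out:

import pandas as pd

df = pd.DataFrame({'w': [100, 1, 1]})
counts = pd.Series(0, index=df.index)

# Empirical inclusion frequency of each row over repeated draws of n=2 without replacement
for seed in range(10_000):
    counts[df.sample(n=2, weights=df.w, replace=False, random_state=seed).index] += 1

print(counts / 10_000)  # roughly 1.0, 0.5, 0.5 -- not proportional to the 100:1:1 weights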

This may seem like a small issue but it prevents folks like me from relying on pandas for sampling use cases. For example, I often use PPS without-replacement sampling to select primary stage units (typically villages) for household surveys. If PPS without-replacement sampling is not possible because the variance in weights is too large, I need the function to return an error so that I can fix things (typically by splitting up larger PSUs or sampling some with certainty). Without an error, I will end up miscalculating the final probability of selection of the households in my survey.

And, again, thanks so much for taking a look at this both of you!

@rhshadrach
Member

rhshadrach commented May 30, 2025

@dougj892 - you get the same behavior using

df.sample(n=2, weights=[100/102, 1/102, 1/102], replace=False)

no?

Edit: Ah, I see, you're multiplying by n in your condition in the OP.

@rhshadrach
Member

I'm positive on this change, but it would need to go through the deprecation process of PDEP-17.

@rhshadrach rhshadrach added the Deprecate and Needs Discussion labels and removed the Closing Candidate label May 30, 2025
@microslaw

Can I start working on this issue? I'm looking for a first contribution and this seems simple enough.

@rhshadrach
Member

cc @mroeschke - can I get a second pair of eyes here to make sure we're good to move forward?

@mroeschke
Member

Would it make sense to just call this a "bug fix" for the case when weights are passed with replace=False?

@rhshadrach
Member

Looking at this again, I think there is a good argument that pandas is returning an incorrect result, even if it is long-standing behavior (I haven't checked this, but assume it is). I'm good with a bugfix.

@rhshadrach rhshadrach removed the Deprecate and Needs Discussion labels Jun 4, 2025
@rhshadrach
Member

@microslaw - PR is welcome!

@microslaw

take
