v0.22.0 (December 29, 2017)

This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users upgrade to this version after carefully reading the release note (singular!).

Backwards incompatible API changes

Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The summary is that

  • The sum of an empty or all-NA Series is now 0

  • The product of an empty or all-NA Series is now 1

  • We’ve added a min_count parameter to .sum() and .prod() controlling the minimum number of valid values for the result to be valid. If fewer than min_count non-NA values are present, the result is NA. The default is 0. To return NaN, the 0.21 behavior, use min_count=1.

Some background: In pandas 0.21, we fixed a long-standing inconsistency in the return value of all-NA series depending on whether or not bottleneck was installed. See Sum/prod of all-NaN or empty Series/DataFrames is now consistently NaN. At the same time, we changed the sum and prod of an empty Series to also be NaN.

Based on feedback, we’ve partially reverted those changes.

Arithmetic operations

The default sum for empty or all-NA Series is now 0.

pandas 0.21.x

In [1]: pd.Series([]).sum()
Out[1]: nan

In [2]: pd.Series([np.nan]).sum()
Out[2]: nan

pandas 0.22.0

In [1]: pd.Series([]).sum()
Out[1]: 0.0

In [2]: pd.Series([np.nan]).sum()
Out[2]: 0.0

The default behavior is the same as pandas 0.20.3 with bottleneck installed. It also matches the behavior of NumPy’s np.nansum on empty and all-NA arrays.

To have the sum of an empty series return NaN (the default behavior of pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the min_count keyword.

In [3]: pd.Series([]).sum(min_count=1)
Out[3]: nan

Thanks to the skipna parameter, the .sum on an all-NA series is conceptually the same as the .sum of an empty one with skipna=True (the default).

In [4]: pd.Series([np.nan]).sum(min_count=1)  # skipna=True by default
Out[4]: nan

The min_count parameter refers to the minimum number of non-null values required for a non-NA sum or product.

Series.prod() has been updated to behave the same as Series.sum(), returning 1 instead.

In [5]: pd.Series([]).prod()
Out[5]: 1.0

In [6]: pd.Series([np.nan]).prod()
Out[6]: 1.0

In [7]: pd.Series([]).prod(min_count=1)
Out[7]: nan

These changes affect DataFrame.sum() and DataFrame.prod() as well. Finally, a few less obvious places in pandas are affected by this change.

Grouping by a categorical

Grouping by a Categorical and summing now returns 0 instead of NaN for categories with no observations. The product now returns 1 instead of NaN.

pandas 0.21.x

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a    3.0
b    NaN
dtype: float64

pandas 0.22

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]: 
a    3
b    0
Length: 2, dtype: int64

To restore the 0.21 behavior of returning NaN for unobserved groups, use min_count>=1.

In [10]: pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
Out[10]: 
a    3.0
b    NaN
Length: 2, dtype: float64

Resample

The sum and product of all-NA bins has changed from NaN to 0 for sum and 1 for product.

pandas 0.21.x

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
   ....:               index=pd.date_range('2017', periods=4))
   ....: s
Out[11]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    NaN
2017-01-04    NaN
Freq: D, dtype: float64

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01    2.0
2017-01-03    NaN
Freq: 2D, dtype: float64

pandas 0.22.0

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
   ....:               index=pd.date_range('2017', periods=4))
   ....: 

In [12]: s.resample('2d').sum()
Out[12]: 
2017-01-01    2.0
2017-01-03    0.0
Freq: 2D, Length: 2, dtype: float64

To restore the 0.21 behavior of returning NaN, use min_count>=1.

In [13]: s.resample('2d').sum(min_count=1)
Out[13]: 
2017-01-01    2.0
2017-01-03    NaN
Freq: 2D, Length: 2, dtype: float64

In particular, upsampling and taking the sum or product is affected, as upsampling introduces missing values even if the original series was entirely valid.

pandas 0.21.x

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00    1.0
2017-01-01 12:00:00    NaN
2017-01-02 00:00:00    2.0
Freq: 12H, dtype: float64

pandas 0.22.0

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample("12H").sum()
Out[15]: 
2017-01-01 00:00:00    1
2017-01-01 12:00:00    0
2017-01-02 00:00:00    2
Freq: 12H, Length: 3, dtype: int64

Once again, the min_count keyword is available to restore the 0.21 behavior.

In [16]: pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
Out[16]: 
2017-01-01 00:00:00    1.0
2017-01-01 12:00:00    NaN
2017-01-02 00:00:00    2.0
Freq: 12H, Length: 3, dtype: float64

Rolling and expanding

Rolling and expanding already have a min_periods keyword that behaves similar to min_count. The only case that changes is when doing a rolling or expanding sum with min_periods=0. Previously this returned NaN, when fewer than min_periods non-NA values were in the window. Now it returns 0.

pandas 0.21.1

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0   NaN
1   NaN
dtype: float64

pandas 0.22.0

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]: 
0    0.0
1    0.0
Length: 2, dtype: float64

The default behavior of min_periods=None, implying that min_periods equals the window size, is unchanged.

Compatibility

If you maintain a library that should work across pandas versions, it may be easiest to exclude pandas 0.21 from your requirements. Otherwise, all your sum() calls would need to check if the Series is empty before summing.

With setuptools, in your setup.py use:

install_requires=['pandas!=0.21.*', ...]

With conda, use

requirements:
  run:
    - pandas !=0.21.0,!=0.21.1

Note that the inconsistency in the return value for all-NA series is still there for pandas 0.20.3 and earlier. Avoiding pandas 0.21 will only help with the empty case.

Contributors

A total of 1 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Tom Augspurger