This is a major release from 0.21.1 and includes a single, API-breaking change. We recommend that all users upgrade to this version after carefully reading the release note (singular!).
Pandas 0.22.0 changes the handling of empty and all-NA sums and products. The summary is that:

- The sum of an empty or all-NA Series is now 0.
- The product of an empty or all-NA Series is now 1.
- We’ve added a min_count parameter to .sum() and .prod() controlling the minimum number of valid values for the result to be valid. If fewer than min_count non-NA values are present, the result is NA. The default is 0. To return NaN (the 0.21 behavior), use min_count=1; a short sketch follows this list.
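As a quick, hedged illustration of the new keyword (assuming pandas 0.22.0 or later is installed):

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# Defaults (min_count=0): an all-NA sum is 0 and an all-NA product is 1.
s.sum()              # 0.0
s.prod()             # 1.0

# Requiring at least one valid value restores the 0.21 result of NaN.
s.sum(min_count=1)   # nan
s.prod(min_count=1)  # nan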
Some background: In pandas 0.21, we fixed a long-standing inconsistency in the return value of all-NA series depending on whether or not bottleneck was installed. See Sum/prod of all-NaN or empty Series/DataFrames is now consistently NaN. At the same time, we changed the sum and prod of an empty Series to also be NaN.
Based on feedback, we’ve partially reverted those changes.
The default sum for empty or all-NA Series is now 0.
pandas 0.21.x
In [1]: pd.Series([]).sum()
Out[1]: nan

In [2]: pd.Series([np.nan]).sum()
Out[2]: nan
pandas 0.22.0
In [1]: pd.Series([]).sum()
Out[1]: 0.0

In [2]: pd.Series([np.nan]).sum()
Out[2]: 0.0
The default behavior is the same as pandas 0.20.3 with bottleneck installed. It also matches the behavior of NumPy’s np.nansum on empty and all-NA arrays.
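For comparison, a small sketch of the corresponding NumPy behavior (np.nanprod is included here as an extra check; it is not mentioned above):

import numpy as np

np.nansum(np.array([]))         # 0.0
np.nansum(np.array([np.nan]))   # 0.0
np.nanprod(np.array([np.nan]))  # 1.0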
To have the sum of an empty series return NaN (the default behavior of pandas 0.20.3 without bottleneck, or pandas 0.21.x), use the min_count keyword.
In [3]: pd.Series([]).sum(min_count=1)
Out[3]: nan
Thanks to the skipna parameter, the .sum on an all-NA series is conceptually the same as the .sum of an empty one with skipna=True (the default).
In [4]: pd.Series([np.nan]).sum(min_count=1)  # skipna=True by default
Out[4]: nan
The min_count parameter refers to the minimum number of non-null values required for a non-NA sum or product.
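For example, with min_count=2 a series containing only one valid value sums to NaN, while min_count=1 is satisfied; a minimal sketch:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

s.sum(min_count=1)  # 1.0 -- one non-NA value present, requirement met
s.sum(min_count=2)  # nan -- fewer than two non-NA values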
Series.prod() has been updated to behave the same as Series.sum(), returning 1 instead of NaN.
In [5]: pd.Series([]).prod()
Out[5]: 1.0

In [6]: pd.Series([np.nan]).prod()
Out[6]: 1.0

In [7]: pd.Series([]).prod(min_count=1)
Out[7]: nan
These changes affect DataFrame.sum() and DataFrame.prod() as well. Finally, a few less obvious places in pandas are affected by this change.
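Before turning to those, a brief sketch of the DataFrame case, where the same rule is applied column by column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'all_na': [np.nan, np.nan], 'partial': [1.0, np.nan]})

df.sum()             # all_na: 0.0, partial: 1.0
df.prod()            # all_na: 1.0, partial: 1.0
df.sum(min_count=1)  # all_na: NaN, partial: 1.0
df.sum(min_count=2)  # all_na: NaN, partial: NaN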
Grouping by a Categorical and summing now returns 0 instead of NaN for categories with no observations. The product now returns 1 instead of NaN.
pandas 0.21.x

In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a    3.0
b    NaN
dtype: float64
pandas 0.22
In [8]: grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

In [9]: pd.Series([1, 2]).groupby(grouper).sum()
Out[9]:
a    3
b    0
Length: 2, dtype: int64
To restore the 0.21 behavior of returning NaN for unobserved groups, use min_count>=1.
In [10]: pd.Series([1, 2]).groupby(grouper).sum(min_count=1)
Out[10]:
a    3.0
b    NaN
Length: 2, dtype: float64
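The product behaves analogously for the unobserved category; a sketch, assuming min_count is accepted by the grouped prod just as it is by sum:

import pandas as pd

grouper = pd.Categorical(['a', 'a'], categories=['a', 'b'])

pd.Series([1, 2]).groupby(grouper).prod()             # a: 2, b: 1
pd.Series([1, 2]).groupby(grouper).prod(min_count=1)  # a: 2.0, b: NaN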
When resampling, the sum and product of all-NA bins have changed from NaN to 0 for sum and 1 for product.
pandas 0.21.x

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
   ....:               index=pd.date_range('2017', periods=4))
   ....: s
   ....:
Out[11]:
2017-01-01    1.0
2017-01-02    1.0
2017-01-03    NaN
2017-01-04    NaN
Freq: D, dtype: float64

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01    2.0
2017-01-03    NaN
Freq: 2D, dtype: float64
pandas 0.22.0

In [11]: s = pd.Series([1, 1, np.nan, np.nan],
   ....:               index=pd.date_range('2017', periods=4))
   ....:

In [12]: s.resample('2d').sum()
Out[12]:
2017-01-01    2.0
2017-01-03    0.0
Freq: 2D, Length: 2, dtype: float64
To restore the 0.21 behavior of returning NaN, use min_count>=1.
In [13]: s.resample('2d').sum(min_count=1)
Out[13]:
2017-01-01    2.0
2017-01-03    NaN
Freq: 2D, Length: 2, dtype: float64
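The all-NA bin's product follows the same pattern; a sketch, assuming the resampler's prod also accepts min_count as sum does:

import numpy as np
import pandas as pd

s = pd.Series([1, 1, np.nan, np.nan],
              index=pd.date_range('2017', periods=4))

s.resample('2d').prod()             # 2017-01-01: 1.0, 2017-01-03: 1.0
s.resample('2d').prod(min_count=1)  # 2017-01-01: 1.0, 2017-01-03: NaN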
In particular, upsampling and taking the sum or product is affected, as upsampling introduces missing values even if the original series was entirely valid.
pandas 0.21.x

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample('12H').sum()
Out[15]:
2017-01-01 00:00:00    1.0
2017-01-01 12:00:00    NaN
2017-01-02 00:00:00    2.0
Freq: 12H, dtype: float64
pandas 0.22.0

In [14]: idx = pd.DatetimeIndex(['2017-01-01', '2017-01-02'])

In [15]: pd.Series([1, 2], index=idx).resample("12H").sum()
Out[15]:
2017-01-01 00:00:00    1
2017-01-01 12:00:00    0
2017-01-02 00:00:00    2
Freq: 12H, Length: 3, dtype: int64
Once again, the min_count keyword is available to restore the 0.21 behavior.
In [16]: pd.Series([1, 2], index=idx).resample("12H").sum(min_count=1)
Out[16]:
2017-01-01 00:00:00    1.0
2017-01-01 12:00:00    NaN
2017-01-02 00:00:00    2.0
Freq: 12H, Length: 3, dtype: float64
Rolling and expanding already have a min_periods keyword that behaves similarly to min_count. The only case that changes is a rolling or expanding sum with min_periods=0. Previously this returned NaN when the window contained no non-NA values; now it returns 0.
pandas 0.21.1
In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0   NaN
1   NaN
dtype: float64
pandas 0.22.0

In [17]: s = pd.Series([np.nan, np.nan])

In [18]: s.rolling(2, min_periods=0).sum()
Out[18]:
0    0.0
1    0.0
Length: 2, dtype: float64
The default behavior of min_periods=None, implying that min_periods equals the window size, is unchanged.
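A minimal sketch of that unchanged default, where the requirement is the full window of two observations:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# min_periods defaults to the window size, so both windows still produce NaN.
s.rolling(2).sum()  # 0: NaN, 1: NaN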
If you maintain a library that should work across pandas versions, it may be easiest to exclude pandas 0.21 from your requirements. Otherwise, all your sum() calls would need to check if the Series is empty before summing.
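For illustration only, such a check might look like the following hypothetical helper (nan_safe_sum is not part of pandas; it simply pins the empty and all-NA cases to NaN regardless of the installed version):

import numpy as np
import pandas as pd

def nan_safe_sum(s):
    """Return NaN for an empty or all-NA Series on any pandas version."""
    # isnull() is used instead of isna() so this also runs on pandas < 0.21.
    if len(s) == 0 or s.isnull().all():
        return np.nan
    return s.sum()

nan_safe_sum(pd.Series([]))          # nan on every supported version
nan_safe_sum(pd.Series([1.0, 2.0]))  # 3.0

Pinning the requirement, as shown next, is usually the simpler option.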
With setuptools, in your setup.py use:
install_requires=['pandas!=0.21.*', ...]
With conda, use
requirements:
  run:
    - pandas !=0.21.0,!=0.21.1
Note that the inconsistency in the return value for all-NA series is still there for pandas 0.20.3 and earlier. Avoiding pandas 0.21 will only help with the empty case.
A total of 1 person contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
Tom Augspurger