Python Data Analysis (Second Edition)

Dealing with dates

Dates are complicated. Just think of the Y2K bug, the pending Year 2038 problem, and the confusion caused by time zones. It's a mess. We encounter dates naturally when dealing with time-series data. Pandas can create date ranges, resample time-series data, and perform date arithmetic operations.

Create a range of dates starting from January 1, 1900 and lasting 42 days, as follows:

import pandas as pd
print("Date range", pd.date_range('1/1/1900', periods=42, freq='D'))

January has fewer than 42 days, so the end date falls in February, as you can check for yourself:

Date range <class 'pandas.tseries.index.DatetimeIndex'>
[1900-01-01, ..., 1900-02-11]
Length: 42, Freq: D, Timezone: None
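The end date can also be checked with date arithmetic, and the same range can serve as the index of a series to be resampled. The following is only a minimal sketch: the series of ones and the names `ones` and `weekly` are illustrative assumptions, not part of the book's code bundle.

```python
import pandas as pd

# The same 42-day range as in the text.
dates = pd.date_range('1/1/1900', periods=42, freq='D')

# Date arithmetic: the last date is the start date plus 41 days.
end = dates[0] + pd.Timedelta(days=41)
print("Computed end", end)
print("Matches last entry", end == dates[-1])

# Resampling: summing a series of ones per week counts the days in each week.
ones = pd.Series(1, index=dates)
weekly = ones.resample('W').sum()
print(weekly)
```

Because January 1, 1900 was a Monday, the 42 days split into six complete weeks.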

The offset aliases table in the Pandas official documentation (refer to http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) describes the frequencies used in Pandas.
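As a small illustration, three of the frequency aliases can be compared on the same start date. This sketch covers only 'D' (calendar day), 'B' (business day), and 'W' (weekly, anchored on Sundays by default); the dictionary name `ranges` is an assumption made for the example.

```python
import pandas as pd

# Build a six-period range for each alias, starting from the same date.
ranges = {
    alias: pd.date_range('1/1/1900', periods=6, freq=alias)
    for alias in ['D', 'B', 'W']
}

for alias, index in ranges.items():
    # Show the first and last date produced by each frequency.
    print(alias, index[0].date(), index[-1].date())
```

The business-day range skips the weekend of January 6 and 7, and the weekly range starts on the first Sunday on or after the start date.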

Date ranges have their limits in Pandas. Timestamps in Pandas (based on the NumPy datetime64 data type) are represented by a 64-bit integer with nanosecond resolution (a billionth of a second). This limits legal timestamps to the range between approximately the years 1677 and 2262 (not all dates in these two years are valid). The exact midpoint of this range is January 1, 1970. For example, January 1, 1677 cannot be defined with a Pandas timestamp, while September 30, 1677 can, as demonstrated in the following code snippet:

import pandas as pd

try:
    print("Date range", pd.date_range('1/1/1677', periods=4, freq='D'))
except Exception as exc:
    # January 1, 1677 lies before the earliest nanosecond timestamp
    print("Error encountered", type(exc), exc)

The code snippet prints the following error message:

Date range Error encountered <class 'pandas.tslib.OutOfBoundsDatetime'> Out of bounds nanosecond timestamp: 1677-01-01 00:00:00
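The limits themselves can be inspected directly. The sketch below assumes a Pandas version that exposes `pd.Timestamp.min` and `pd.Timestamp.max`, and it builds the range from September 30, 1677 that the text describes as valid; the name `valid` is illustrative.

```python
import pandas as pd

# Earliest and latest timestamps representable with nanosecond resolution.
print("Earliest timestamp", pd.Timestamp.min)
print("Latest timestamp", pd.Timestamp.max)

# September 30, 1677 lies after the lower bound, so this range can be created.
valid = pd.date_range('9/30/1677', periods=4, freq='D')
print("Date range", valid)
```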

Given all the previous information, calculate the allowed date range with Pandas DateOffset. A signed 64-bit integer can count up to 2 ** 63 nanoseconds on either side of the midpoint, so the offset is that number converted to whole seconds:

import pandas as pd
from pandas.tseries.offsets import DateOffset

offset = DateOffset(seconds=2 ** 63 // 10 ** 9)
mid = pd.to_datetime('1/1/1970')
print("Start valid range", mid - offset)
print("End valid range", mid + offset)

We get the following range values:

Start valid range 1677-09-21 00:12:44
End valid range 2262-04-11 23:47:16
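The bounds implied by a signed 64-bit count of nanoseconds can be cross-checked without Pandas. This is a rough sketch using only the standard library; it truncates to whole seconds, and the names `max_seconds`, `start`, and `end` are assumptions made for the example.

```python
from datetime import datetime, timedelta

NANOSECONDS_PER_SECOND = 10 ** 9

# Largest signed 64-bit value, read as nanoseconds and truncated to whole seconds.
max_seconds = (2 ** 63 - 1) // NANOSECONDS_PER_SECOND

# The midpoint of the valid range is the Unix epoch.
epoch = datetime(1970, 1, 1)
start = epoch - timedelta(seconds=max_seconds)
end = epoch + timedelta(seconds=max_seconds)

print("Start valid range", start)
print("End valid range", end)

# Approximate width of the whole range in years of 365.25 days.
print("Span in years", 2 * max_seconds / (365.25 * 24 * 3600))
```

The result is a window of roughly 292 years on each side of January 1, 1970, which is where the approximate years 1677 and 2262 come from.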

We can convert a list of strings to dates with Pandas. Of course, not all strings can be converted. If Pandas is unable to convert a string, an error is often reported. Sometimes, ambiguities can arise due to differences in the way dates are defined in different locales. In this case, use a format string, as follows:

print("With format", pd.to_datetime(['19021112', '19031230'], format='%Y%m%d'))

The strings should be converted without an error occurring:

With format [datetime.datetime(1902, 11, 12, 0, 0) datetime.datetime(1903, 12, 30, 0, 0)]
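To see why a format string matters, consider a date string that reads differently in different locales. In this sketch the string '01/02/1903' and the variable names are made up for illustration; the same text is parsed once month-first and once day-first.

```python
import pandas as pd

# The same string is a valid date under both conventions.
ambiguous = '01/02/1903'

# An explicit format removes the ambiguity.
month_first = pd.to_datetime(ambiguous, format='%m/%d/%Y')
day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print("Month first", month_first)
print("Day first", day_first)
```

The two results are about a month apart (January 2 versus February 1), so stating the format explicitly is the safer choice whenever the source of the data is known.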

If we try to convert a string that is clearly not a date, the Pandas version used for this example leaves the string unconverted by default (recent Pandas versions raise an error here instead):

print("Illegal date", pd.to_datetime(['1902-11-12', 'not a date']))

The second string in the list should not be converted:

Illegal date ['1902-11-12' 'not a date']

To force conversion, set the errors parameter to 'coerce':

print("Illegal date coerced", pd.to_datetime(['1902-11-12', 'not a date'], errors='coerce'))

Obviously, the second string still cannot be converted to a date, so the only valid value we can give it is NaT ('not a time'):

Illegal date coerced <class 'pandas.tseries.index.DatetimeIndex'>
[1902-11-12, NaT]
Length: 2, Freq: None, Timezone: None
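After coercion, it is often useful to find out which inputs became NaT. One possible approach, sketched here with illustrative names (`raw`, `parsed`, `mask`, `rejected`), is to build a boolean mask with `notna()` and use it to separate valid dates from rejected strings.

```python
import pandas as pd

raw = ['1902-11-12', 'not a date']

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors='coerce')

# True where a real date was parsed, False where the value is NaT.
mask = parsed.notna()

# Keep the valid dates and list the original strings that were rejected.
valid_dates = parsed[mask]
rejected = [text for text, ok in zip(raw, mask) if not ok]

print("Valid dates", valid_dates)
print("Rejected strings", rejected)
```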

The code for this example is in ch-03.ipynb of this book's code bundle.