In [1]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
In [2]:
from IPython.core.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; }</style>"))

import logging
logging.basicConfig(level='INFO')
import sys
sys.path.append('/opt/webrender')

Welcome to Timeseria!

Timeseria is a time series processing library which aims at making it easy to handle time series data and to build statistical and machine learning models on top of it.

It comes with a built-in set of common operations (resampling, slotting, differencing etc.) as well as models (reconstruction, forecasting and anomaly detection), and both custom operations and models can be easily plugged in.

Timeseria also tries to address by design all those annoying things which are often left as an implementation detail but that actually cause massive amounts of wasted time - such as handling data losses, non-uniform sampling rates, differences between time-slotted data and punctual observations, variable time units, timezones, DST changes and so on.

This is a relatively structured welcome tutorial. If you are looking to get started immediately, have a look at the quickstart notebook. The reference documentation might also be useful.

You can also run this welcome tutorial interactively in Binder.




Data structures

Points


Points are the most basic data structure. They can be initialized in any n-dimensional space you want:

In [3]:
from timeseria.datastructures import Point
point = Point(5,7)

..that's it, just a point:

In [4]:
point
Out[4]:
Point @ (5, 7)

TimePoints are Points in the "time" dimension, which is expressed as epoch (the number of seconds since midnight of 1st January 1970, UTC). Milliseconds, microseconds etc. can be expressed using the decimal places.

In [5]:
from timeseria.datastructures import TimePoint
TimePoint(60.857754)
Out[5]:
TimePoint @ 60.857754 (1970-01-01 00:01:00.857754+00:00)

They put some more focus on time management: for added clarity and ease of use, their time coordinate can be set and accessed both as "t" (epoch seconds) and as "dt" (Python datetime objects):

In [6]:
timepoint = TimePoint(t=60)
timepoint.t
Out[6]:
60
In [7]:
import datetime
timepoint = TimePoint(dt=datetime.datetime(1970, 1, 1, 0, 1))
timepoint.dt
Out[7]:
datetime.datetime(1970, 1, 1, 0, 1, tzinfo=<UTC>)

..and you can of course set the "t" and access the "dt", and vice versa: they get automatically computed. Also note that when setting the dt here above, the UTC timezone has been automatically set, as in Timeseria no timestamps without timezones are allowed.

In [8]:
timepoint.t
Out[8]:
60.0
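..and the other way around, setting the "t" and reading the "dt" (a quick sketch using only what we have just seen):

TimePoint(t=60).dt # expected: datetime.datetime(1970, 1, 1, 0, 1, tzinfo=<UTC>)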

DataTimePoints are TimePoints with some data attached. This data can be whatever you want: a string, an image, a single value, a list of values, a dict of values. However, most of the functionalities of Timeseria require it to be in the form of a list of numerical values or a dict of numerical values, which is the format that will be used in the following.

In [9]:
from timeseria.datastructures import DataTimePoint
DataTimePoint(t=60.857754, data=[56.8])
Out[9]:
DataTimePoint @ 60.857754 (1970-01-01 00:01:00.857754+00:00) with data "[56.8]"

This concept of encapsulating data on another level with respect to the time coordinate is one of the main building pillars of Timeseria. We will therefore not handle matrices of x,y data where x is the time and y is the value; instead, we will have a key for the time and the data as its value.
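For example, a multivariate reading can be attached as a dict of named values (a quick sketch, following the dict format mentioned above):

DataTimePoint(t=60, data={'temperature': 21.3, 'humidity': 54.2})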

DataTimePoints can also have a specific attribute for handling missing data, the data_loss index (a floating point number between 0 and 1), which will be widely used in the following:

In [10]:
DataTimePoint(t=60.857754, data=[56.8], data_loss=0.76)
Out[10]:
DataTimePoint @ 60.857754 (1970-01-01 00:01:00.857754+00:00) with data "[56.8]" and data_loss="0.76"


Slots


While points are meant to represent punctual observations (e.g. a sensor reading or an event occurred at a specific time), slots are meant to represent data over a time period (e.g. from a meter or aggregated over time). Before introducing them, let's take a step back to explain why it is worth adding a data structure other than the point.

In time series, samples are naturally identified with a specific point in time, in particular when the sampling rate is high (e.g. a data point every minute). However, when it comes to lower sampling rates (e.g. a data point every hour) or to data aggregated over a time period, the punctual definition of the sample starts falling apart.

When we talk about hourly data at 15:00, are we talking about a punctual observation for 15:00 (and therefore somewhat representative of what happened between 14:30 and 15:30), or are we talking about the aggregated data between 15:00 and 16:00?

Things get even worse when we talk about days, months and years: while minutes and hours always have the same length in seconds (60 and 3600, respectively), a day can last 23, 24 or 25 hours due to daylight saving time. So if you identify a day by its midnight, you can't just add 24 hours to get to the next day. The same goes for months and years, due to their variable number of days.
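For instance, the 23-hour day can be verified with plain Python datetimes (a minimal sketch, requiring Python 3.9+ for the standard zoneinfo module):

import datetime
from zoneinfo import ZoneInfo

rome = ZoneInfo('Europe/Rome')
start = datetime.datetime(2019, 3, 31, tzinfo=rome) # midnight of a DST change day in Rome
end = datetime.datetime(2019, 4, 1, tzinfo=rome)    # the next midnight
(end - start).total_seconds()                       # 82800.0, i.e. 23 hours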

To properly address these issues, Timeseria formalizes and introduces a new concept alongside the Point: the Slot.

A Slot, unlike a Point, is not identified by a single coordinate, but by a starting and an ending point:

In [11]:
from timeseria.datastructures import Slot
Slot(start=Point(1), end=Point(3))
Out[11]:
Slot @ [1,3]

In two dimensions this is a rectangle, while in one dimension (like the time dimension) it is basically just an interval.

As for Points, there are also TimeSlots, DataSlots and, most importantly, the DataTimeSlot:

In [12]:
from timeseria.datastructures import DataTimeSlot
DataTimeSlot(start=TimePoint(t=0), end=TimePoint(t=3600), data=[24.5])
Out[12]:
DataTimeSlot @ [0,3600] ([1970-01-01 00:00:00+00:00,1970-01-01 01:00:00+00:00]) with data=[24.5]

Slots are naturally associated with a unit, which represents the portion of the space occupied by the slot itself:

In [13]:
slot = Slot(start=Point(1), end=Point(3))
slot.unit
Out[13]:
2

You can create slots by just setting the starting point and the unit:

In [14]:
from timeseria.units import Unit
DataTimeSlot(start=TimePoint(t=0), unit=3600, data=[24.5])
Out[14]:
DataTimeSlot @ [0,3600.0] ([1970-01-01 00:00:00+00:00,1970-01-01 01:00:00+00:00]) with data=[24.5]

Units can be handled either as floating point values or using the dedicated Unit object:

In [15]:
from timeseria.units import Unit
DataTimeSlot(start=TimePoint(t=0), unit=Unit(3600), data=[24.5])
Out[15]:
DataTimeSlot @ [0,3600.0] ([1970-01-01 00:00:00+00:00,1970-01-01 01:00:00+00:00]) with data=[24.5]

The reason for a dedicated Unit object is to support variable-length units, which might look like a counter-intuitive concept but is, on the contrary, the only way to handle calendar time units such as days, weeks, and years.

Indeed, while in the time dimension a slot with a length of 3600 is just an hour, an 86400-length slot is not a day! This is because days are actually a variable length unit, since due to Daylight Saving Time they can last 86400, 82800 or 90000 seconds. The same applies to months (which can have a variable number of days) and, in turn, to years (because of leap years).

A dedicated Unit object makes it possible to implement a specific object for modeling variable-length time units: the TimeUnit. TimeUnits can handle both "physical" and "calendar" units, where "physical" units are always fixed in length (such as seconds, minutes and hours), and "calendar" units have a variable length depending on a calendar (such as days, weeks, months, and years).

TimeUnits are defined either by manually setting all their components (as for Python datetime objects) or by using their string representation: 1s for one second, 1m for one minute, 1h for one hour, 1D for one day, 1W for one week, 1M for one month and 1Y for one year. The 1 can be replaced by any integer of choice.

For example, to create a one-day slot using a TimeUnit object:

In [16]:
from timeseria.units import TimeUnit
DataTimeSlot(start=TimePoint(t=0), unit=TimeUnit('1D'), data=[24.5])
Out[16]:
DataTimeSlot @ [0,86400.0] ([1970-01-01 00:00:00+00:00,1970-01-02 00:00:00+00:00]) with data=[24.5]

TimeSlots defined by a "calendar" TimeUnit, like "1D" for one day or "1M" for one month, can therefore have different lengths even if they share the same time unit.

The TimeUnit object supports time math, so that you can sum one day after another or one month after another, and it can also be used for generic time manipulation with classic Python datetime objects. However, be aware that some operations are not always well-defined: for example, if you sum a TimeUnit of one month to the 31st of January, you would end up on the 31st of February, which does not exist (and will lead to an error).
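For example (a hedged sketch, assuming that sums between datetime objects and TimeUnits are supported as described above):

from timeseria.units import TimeUnit
from timeseria.time import dt

# Summing a "calendar" day across a DST change lands on the next midnight
dt(2019, 3, 31, tz='Europe/Rome') + TimeUnit('1D') # expected: 2019-04-01 00:00:00+02:00

# Summing one month to the 31st of January is instead not well-defined
dt(2019, 1, 31) + TimeUnit('1M')                   # expected: an error is raised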

As for the DataTimePoints, DataTimeSlots can also have a data_loss index, which represents how well the slot is "covered" by the underlying data points. If you for example have hourly slots computed from 10-minute data points (six per slot), and you miss four of them for a given slot, this slot will have a data_loss of about 0.66. Data losses are usually computed automatically by resampling or slotting transformations, but can also be set manually:

In [17]:
DataTimeSlot(start=TimePoint(t=0), end=TimePoint(t=3600), data=[5], data_loss=0.66)
Out[17]:
DataTimeSlot @ [0,3600] ([1970-01-01 00:00:00+00:00,1970-01-01 01:00:00+00:00]) with data=[5] and data_loss=0.66


Time series


Ok, let's now move to the time series. The time series we all know (data points over time) is implemented by the DataTimePointSeries object, which is a series of DataTimePoints. All the other "components", like the Series, the PointSeries, the TimePointSeries and the DataPointSeries, are also defined and useful in some specific contexts.

Creating a time series using the DataTimePointSeries object is similar to creating a Python list:

In [18]:
from timeseria.datastructures import DataTimePointSeries

timeseries = DataTimePointSeries(DataTimePoint(t=60, data=[5.8]),
                                 DataTimePoint(t=120, data=[6.1]),
                                 DataTimePoint(t=180, data=[5.7]),
                                 DataTimePoint(t=240, data=[4.8]))

As for Python lists, you can also append new items:

In [19]:
timeseries.append(DataTimePoint(t=300, data=[5.1]))

From now on you only need to care about how to manipulate the DataTimePointSeries object: you don't really need to care about the internals and data handling when performing common tasks.

For example, to plot it:

In [20]:
timeseries.plot()

This is not the classic Jupyter plot, it is an advanced plot powered by Dygraphs, which is designed exclusively for time series data. It smoothly handles thousands of data points, and you can even plot millions of them, since Timeseria will perform an aggregation before plotting to prevent crashing your browser, as we will see in a moment. Try zooming in by highlighting a region, move the slider, and then double-click to reset the zoom.

Time series also define the resolution attribute, which for a point time series represents the sampling interval, or the string 'variable' if the samples are not equally spaced (e.g. in case of data losses).

In [21]:
timeseries.resolution
Out[21]:
1m

You can also create a slot timeseries:

In [22]:
from timeseria.datastructures import DataTimeSlotSeries
from timeseria.time import dt
unit = TimeUnit('1M')
timeseries = DataTimeSlotSeries(DataTimeSlot(start=TimePoint(dt=dt(2016,1,1)), unit=unit, data={'kWh':5.8}),
                                DataTimeSlot(start=TimePoint(dt=dt(2016,2,1)), unit=unit, data={'kWh':6.1}),
                                DataTimeSlot(start=TimePoint(dt=dt(2016,3,1)), unit=unit, data={'kWh':5.7}),
                                DataTimeSlot(start=TimePoint(dt=dt(2016,4,1)), unit=unit, data={'kWh':4.8}))
In [23]:
timeseries
Out[23]:
Time series of #4 slots of 1M, from slot starting @ 1451606400.0 (2016-01-01 00:00:00+00:00) to slot starting @ 1459468800.0 (2016-04-01 00:00:00+00:00)
In [24]:
timeseries.plot()

As you can see, the representation changes: instead of points connected by linear interpolation we have a step plot, somewhat closer to a histogram-style plot, in order to graphically render the concept of the slot (the missing line on the last slot is a plotting limitation that will hopefully be fixed in the future). The example is indeed plotting an aggregated value for each month: the total energy consumption in kWh.

Also note that the resolution attribute now provides a string representation rather than a floating point value, because the resolution of a slot series is always the string representation of a TimeUnit object (e.g. "1D", "1M", "6h", "300s"):

In [25]:
timeseries.resolution
Out[25]:
1M

To load and store data, Timeseria uses Storage objects. A storage can be anything capable of storing and loading data: a CSV file, a SQLite database, a Postgres database, or even a distributed NoSQL system. The idea is that you can wrap whatever load and store logic you want in a Storage object, thus decoupling the data storage from the data processing.

At the moment only CSV file storages are implemented in Timeseria, using the CSVFileStorage class, and we will stick with this one here.

Let's load an indoor temperature time series from a CSV file then:

In [26]:
from timeseria import storages
DATASET_PATH = '/'.join(storages.__file__.split('/')[0:-1]) + '/tests/test_data/csv/'

storage = storages.CSVFileStorage(DATASET_PATH + 'temperature_winter.csv')
temperature_timeseries = storage.get()
In [27]:
temperature_timeseries
Out[27]:
Time series of #14403 points at variable resolution (~600s), from point @ 1546477200.0 (2019-01-03 01:00:00+00:00) to point @ 1555550400.0 (2019-04-18 01:20:00+00:00)

As in the example above, we can plot this time series as well. But in this case, since there is too much data to provide a smooth plotting experience, it will be aggregated before getting plotted, and the plot itself will look different from the previous one:

In [28]:
temperature_timeseries.plot()
INFO:timeseria.plots:Aggregating by "10" for improved plotting

Try to zoom in, and you will see that this is not exactly a line chart - instead, it is a line chart plus an area chart. This is because even if Timeseria aggregated the data points before plotting (by a factor of 10 in this case), it also reported the minimum and maximum values for each aggregation, which are rendered by the area chart.

This is how Timeseria lets you easily visualize millions of data points without crashing the plotting engine, while at the same time not removing much information and, most importantly, not hiding peaks and spikes.

If you want to see all the data, you can disable the aggregation, provided your browser doesn't crash (it should not with this time series):

In [29]:
temperature_timeseries.plot(aggregate=False)

Pandas DataFrames


Pandas data structures are undeniably the most common way to manipulate numerical data. Timeseria does not use them internally for a number of reasons, but it provides conversion functionalities.

You can always access a time series as a Pandas DataFrame using the df property (Slots are dumped using the starting point timestamp):

In [30]:
timeseries.df.head()
Out[30]:
kWh
Timestamp
2016-01-01 00:00:00+00:00 5.8
2016-02-01 00:00:00+00:00 6.1
2016-03-01 00:00:00+00:00 5.7
2016-04-01 00:00:00+00:00 4.8

...and you can straightforwardly create a Timeseria time series out of a Pandas DataFrame, which you might have obtained by other means (e.g. as a result of some computation, or using the pandas.read_csv function):

In [31]:
DataTimePointSeries(timeseries.df)
Out[31]:
Time series of #4 points at variable resolution (~2678400s), from point @ 1451606400.0 (2016-01-01 00:00:00+00:00) to point @ 1459468800.0 (2016-04-01 00:00:00+00:00)
In [32]:
from timeseria.datastructures import DataTimeSlotSeries
DataTimeSlotSeries(timeseries.df)
INFO:timeseria.datastructures:Assuming a slot time unit of "1M"
Out[32]:
Time series of #4 slots of 1M, from slot starting @ 1451606400.0 (2016-01-01 00:00:00+00:00) to slot starting @ 1459468800.0 (2016-04-01 00:00:00+00:00)


Transformations

The Resampler


To change the sampling rate of a point time series you can use the Resampler. A Resampler is initialized by setting the new sampling interval, either as a floating point number of seconds, or as a time unit or its string representation (e.g. '10s', '1m', '1h'). It does not support calendar time units such as years, months, weeks and days.

In [33]:
from timeseria.transformations import Resampler

resampler = Resampler(3600) # or, '1h'
resampled_temperature_timeseries = resampler.process(temperature_timeseries)
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Resampled 14403 DataTimePoints in 2521 DataTimePoints

You can also access the resampler using the resample() shortcut directly from a time series:

In [34]:
resampled_temperature_timeseries = temperature_timeseries.resample(3600)
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Resampled 14403 DataTimePoints in 2521 DataTimePoints

Let's now plot the resampled time series:

In [35]:
resampled_temperature_timeseries.plot()

The most noticeable thing here is the red highlight, which marks data losses. A data loss occurs when there is not enough information to create the new sample, either because there are not enough data points (data loss > 0 and < 1) or because there are no data points at all (data loss = 1). By default, the resampler reconstructs completely missing data (data loss = 1) by linear interpolation.

You can of course also use a resampler with the same sampling rate as the original data: in this case it will just make the data uniform and mark the data losses. Let's get back to the original temperature time series and have a look at its resolution attribute:

In [36]:
temperature_timeseries.resolution
Out[36]:
'variable'

The sampling rate is reported as variable because even if the temperature time series is supposed to be sampled at 600-second intervals, there are missing data points which make the sampling, indeed, variable. Let's now resample it at the same 600 seconds:

In [37]:
uniformed_temperature_timeseries = temperature_timeseries.resample(600)
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Resampled 14403 DataTimePoints in 15123 DataTimePoints

...and let's have a look again at the resolution:

In [38]:
uniformed_temperature_timeseries.resolution
Out[38]:
10m

Which is not variable anymore, but fixed at 600 seconds (and thus, by the way, ready to be processed by a model).

The Slotter


To slot a point time series you can use the Slotter. A Slotter is initialized by setting the target slot unit, either as a floating point number of seconds, or as a time unit or its string representation (e.g. '15m', '1D', '1M'), with full support for calendar time units such as years, months, weeks and days.

Let's see an example of how to use the Slotter to slot our temperature time series into hourly slots:

In [39]:
from timeseria.transformations import Slotter
hourly_temperature_timeseries = Slotter('1h').process(temperature_timeseries)
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14403 DataTimePoints in 2520 DataTimeSlots

You can also access the slotter using the slot() shortcut directly from a time series:

In [40]:
hourly_temperature_timeseries = temperature_timeseries.slot('1h')
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14403 DataTimePoints in 2520 DataTimeSlots

As you can see, the Slotter automatically detected the sampling interval of the points in the input time series (10 minutes in this case, as we have seen before). While this auto-detection is usually fast and robust, you can also set the sampling interval manually, and there is (limited) support for variable sampling rates defined on a per-point basis.
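A hedged sketch of setting it manually follows; the sampling_interval parameter name is hypothetical, for illustration only (check the reference documentation for the actual one):

hourly_temperature_timeseries = Slotter('1h').process(temperature_timeseries,
                                                      sampling_interval=600) # hypothetical parameter name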

After detecting the sampling interval, the Slotter created the hourly slotted time series that we asked for, using the default slotting operation, which is the average. You can change it (e.g. you might want to sum the data instead of averaging it), and there is also support for extra operations such as the min, max, etc., as well as for custom operations (see the operations section for more details).

Let's have a look at the slotted time series then:

In [41]:
hourly_temperature_timeseries
Out[41]:
Time series of #2520 slots of 1h, from slot starting @ 1546477200.0 (2019-01-03 01:00:00+00:00) to slot starting @ 1555545600.0 (2019-04-18 00:00:00+00:00)
In [42]:
hourly_temperature_timeseries.resolution
Out[42]:
1h

...and now let's plot it:

In [43]:
hourly_temperature_timeseries.plot()


Another example, slotting into one-day slots:

In [44]:
daily_temperature_timeseries = temperature_timeseries.slot('1D')
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14397 DataTimePoints in 104 DataTimeSlots
In [45]:
daily_temperature_timeseries
Out[45]:
Time series of #104 slots of 1D, from slot starting @ 1546560000.0 (2019-01-04 00:00:00+00:00) to slot starting @ 1555459200.0 (2019-04-17 00:00:00+00:00)
In [46]:
daily_temperature_timeseries.plot()

As a final comment, please note that if you already have aggregated data, such as hourly website visitors or daily energy consumption, then you need to directly instantiate a DataTimeSlotSeries, without going through a DataTimePointSeries. The CSVFileStorage has support for that: just set item_type='slots' when calling the get() function, as done later in this tutorial.

Being aggregates, these data usually don't have gaps, but if they do, instantiating a DataTimeSlotSeries won't work, as it would break the requirement of dense coverage of the time dimension. The CSVFileStorage works around this by creating the missing data slots on the fly, but if you are creating the DataTimeSlotSeries from another source, then you either reconstruct the missing data slots manually, or you pretend that the aggregates are data points and slot them using a Slotter with the same time unit.



Operations


There are a number of operations available for manipulating time series, accessible both from the operations package and directly from the time series objects. Some of them return scalars:

In [48]:
hourly_temperature_timeseries.avg()
Out[48]:
22.494559292328024
In [49]:
hourly_temperature_timeseries.min()
Out[49]:
18.694285714285716

While others, such as diff, mavg and csum, return other time series:

In [50]:
hourly_temperature_timeseries.diff().plot()
In [51]:
hourly_temperature_timeseries.mavg(24).plot() # 24 hours moving average


Scalar operations can be used in the Slotter, both as extra operations and to override the default one:

In [52]:
from timeseria.operations import min, max
min_max_slotter = Slotter('1D', extra_operations=[min, max])
min_max_slotter.process(temperature_timeseries).plot()
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14397 DataTimePoints in 104 DataTimeSlots
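Scalar operations can also replace the default slotting operation (the average), e.g. to sum the data for additive quantities like energy. A hedged sketch follows; both the default_operation parameter name and the sum operation are assumptions here, to be checked against the reference documentation:

from timeseria.operations import sum as sum_operation
energy_slotter = Slotter('1D', default_operation=sum_operation) # 'default_operation' is an assumed name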


You can also define custom operations as simple Python functions, and use them within the slotting process as well:

In [53]:
def myop(slot_timeseries, **kwargs):
    # When used in the Slotter, this function is called for each slot, and its
    # slot_timeseries argument is the portion of the time series belonging to
    # the slot being processed.
    # Here we compute the differences, average them, and multiply by 1000
    # just to get a more readable "variation" index
    return slot_timeseries.diff().avg()*1000

myop_slotter = Slotter('1D', extra_operations=[myop])
temperature_timeseries_myop = myop_slotter.process(temperature_timeseries)
temperature_timeseries_myop.plot()
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14397 DataTimePoints in 104 DataTimeSlots


There are also time series manipulation operations, such as merge and filter:

In [54]:
hourly_temperature_timeseries.merge(hourly_temperature_timeseries.diff()).plot()
In [55]:
hourly_temperature_timeseries.filter(from_dt=dt(2019, 2, 1), to_dt=dt(2019,3,1)).plot()

In the filter operation you can also filter by a data key, and use partial bounds:

In [56]:
temperature_timeseries_myop.filter(data_key='temperature_myop', from_dt=dt(2019, 3, 1))
Out[56]:
Time series of #48 slots of 1D, from slot starting @ 1551398400.0 (2019-03-01 00:00:00+00:00) to slot starting @ 1555459200.0 (2019-04-17 00:00:00+00:00)

...and if you just need to filter by a data key, you can also use the square bracket notation directly on the time series:

In [57]:
temperature_timeseries_myop['temperature_myop'].plot()


Models

There are two main kinds of models in Timeseria: the Models and the ParametricModels. The first are stateless (you can just use them as they are), while the second can be fitted, saved and loaded.
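For example, a fitted parametric model could be persisted and restored along these lines (a hedged sketch: the save() and load() method names and signatures are assumptions here, not verified against the API):

fitted_model.save('/tmp/my_model')      # persist a fitted parametric model (assumed API)
model = SomeModel.load('/tmp/my_model') # ..and restore it later (assumed API)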

Reconstructors


Reconstructors are models that reconstruct missing parts of a time series in the past. By "in the past" we mean that they fill a gap, in more or less sophisticated ways, knowing where the data needs to get to once the gap ends: it is not a forecast.

Let's see how we can use one of the simplest data reconstruction models, the PeriodicAverageReconstructor, which just detects whether there is a periodicity in the time series and then computes an average for each slot of the period:

In [58]:
from timeseria.models import PeriodicAverageReconstructor
reconstructor = PeriodicAverageReconstructor()
reconstructor.fit(hourly_temperature_timeseries)
INFO:timeseria.models.reconstructors:Detected periodicity: 24x 1h

The fit executed above detected a periodicity of 24 hours for the temperature time series, as one would expect.

Let's now apply the model on our time series to get one where missing data is reconstructed, and let's also have a look at what this reconstructed data looks like:

In [59]:
hourly_temperature_timeseries_reconstructed = reconstructor.apply(hourly_temperature_timeseries)
hourly_temperature_timeseries_reconstructed.plot()

Having a data reconstruction algorithm in place does not mean that using its data is always what you want to do: for each slot you always have the data_loss index, so that you can choose whatever logic you prefer when using slots for your statistics or for training your models. For example, you can decide not to include any slot with a data_loss above a given threshold.
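For example, a minimal sketch of such logic in plain Python, using only the attributes we have already seen:

# Keep only the slots whose data_loss is below a chosen threshold
reliable_slots = [slot for slot in hourly_temperature_timeseries_reconstructed
                  if not slot.data_loss or slot.data_loss < 0.5]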

You can also use the extra data_loss_threshold parameter when applying a data reconstruction model to set when it should kick in (or, in other words, up to which point you consider the original data still representative of the slot, even if some of it was missing). This parameter defaults to one, meaning that only slots where there was no data at all get reconstructed.
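For example (a quick sketch using the parameter just described):

# Reconstruct every slot missing more than half of its data
reconstructor.apply(hourly_temperature_timeseries, data_loss_threshold=0.5)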

Reconstructors, and all the other models in the following, can make use of far more advanced time series modeling techniques, based on ARIMA, Facebook's Prophet, LSTM neural networks, and more.

Forecasters


Timeseria uses Forecaster models to perform forecasts. One of them is the PeriodicAverageForecaster which, like the PeriodicAverageReconstructor, detects whether there is a periodicity in the time series and then computes an average for each time slot of the period. This is then used to forecast one or more data points in the future, based on a window of past data points which is by default set equal to the periodicity.

In [60]:
from timeseria.models import PeriodicAverageForecaster
forecaster = PeriodicAverageForecaster()
forecaster.fit(hourly_temperature_timeseries)
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"

Let's now forecast three days:

In [61]:
hourly_temperature_timeseries_forecast = forecaster.apply(hourly_temperature_timeseries, n=24*3)
hourly_temperature_timeseries_forecast.plot()

While this forecast seems quite reasonable, please keep in mind that this model is really simple and that this data (although real data) is particularly well suited for this use-case.

Anomaly Detectors


Anomaly detectors detect if something is not following a definition of "normal". For example, the PeriodicAverageAnomalyDetector works by fitting a PeriodicAverageForecaster model and then evaluating, for each point, the error between the forecast and the real value: if it is greater than a given threshold, the point is considered anomalous. The threshold is set by computing the error distribution in the fit phase and then using its standard deviation as the evaluation metric. The default setting for marking something as anomalous is 3 standard deviations, but in the following example we will use 5, since the model is not particularly good and would otherwise lead to many false positives.

In [62]:
from timeseria.models import PeriodicAverageAnomalyDetector
anomaly_detector = PeriodicAverageAnomalyDetector()
anomaly_detector.fit(hourly_temperature_timeseries, stdevs=5)
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"
INFO:timeseria.models.anomaly_detectors:Using 5 standard deviations as anomaly threshold: 1.833136585648122
In [63]:
hourly_temperature_timeseries_anomalies = anomaly_detector.apply(hourly_temperature_timeseries)
In [64]:
hourly_temperature_timeseries_anomalies.plot()

As you can see, the biggest anomaly is detected in the afternoon of 23 January 2019, when instead of the usual rising peak the temperature was steadily falling.

Custom models


To implement your own custom models, you just need to extend one of the three main base classes (Reconstructor, Forecaster, AnomalyDetector) and implement the _fit() and _predict() methods. The documentation for the fit and predict interfaces is still a work in progress.

Here is an example of a forecaster using an "averaged hold" strategy, which computes the prediction by averaging the last value of the time series with its average over the time series itself:

In [65]:
from timeseria.models import Forecaster

class MyForecaster(Forecaster):

    def _fit(self, timeseries, **kwargs):

        # Compute the average value on the time series for each key, skipping items
        # with full data loss to prevent fitting the model on data which is not
        # reliable. The included items are counted, so that the skipped ones do
        # not bias the average.
        sums = {key: 0 for key in timeseries.data_keys()}
        count = 0
        for item in timeseries:
            if item.data_loss == 1:
                continue
            count += 1
            for key in timeseries.data_keys():
                sums[key] += item.data[key]
        self.averages = {key: sums[key]/count for key in timeseries.data_keys()}

        # Mark the model as fitted
        self.fitted = True

        
    def _predict(self, timeseries, n=1, **kwargs):

        # Disable the one-go forecast for n>1. Note that Timeseria will still be able
        # to run multi-step forecasts, as it will just append each prediction and use
        # it as if it was a real value.
        if n>1:
            raise NotImplementedError('Cannot forecast in one-go for n>1')

        # Compute the prediction for each data key by averaging the last value with
        # the average over the whole time series, as computed in the _fit().
        prediction = {}
        for key in timeseries.data_keys():
            prediction[key] = (timeseries[-1].data[key] + self.averages[key]) / 2

        return prediction

The model can now be used exactly as the other ones, with all the common functionalities coming for free:

In [66]:
my_forecaster = MyForecaster()
my_forecaster.fit(daily_temperature_timeseries)
daily_temperature_timeseries_forecast = my_forecaster.apply(daily_temperature_timeseries, n=3)
daily_temperature_timeseries_forecast.plot()


Model Evaluation


For the sake of exposition we have omitted a lot of details about model accuracy, error scores, reliability, generalization and so on, but one of the main features of Timeseria is actually to make this kind of task easy to perform.

Let's get back to our forecaster and evaluate it on the same time series used for the fit, which will give you an idea about how good the model is at capturing the dynamics (but will tell you nothing about how it will perform on data it has never seen):

In [67]:
forecaster.evaluate(hourly_temperature_timeseries)
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
Out[67]:
{'RMSE': 0.6152303097126565, 'MAE': 0.46101261543442196}

By default, the evaluation is done on the first 1000 samples, but this choice is just to speed up the default evaluation behaviour. If you want to evaluate on all the samples:

In [68]:
forecaster.evaluate(hourly_temperature_timeseries, limit=None)
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
Out[68]:
{'RMSE': 0.6152303097126565, 'MAE': 0.46101261543442196}

To get a better sense of how the model would actually work on data that it has never seen, you have to train and evaluate on specific subsets of the same time series, also known as training and test data.

You can prepare the training and test data by slicing a time series (either by index or by datetime), but you can also just pass the boundaries to the fit() and evaluate() methods, which avoids duplicating data to create the training and test sets.
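For example, a quick sketch of the slicing approach using the filter() operation seen earlier:

train_timeseries = hourly_temperature_timeseries.filter(to_dt=dt(2019,3,1))
test_timeseries = hourly_temperature_timeseries.filter(from_dt=dt(2019,3,1))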

Let's fit a forecaster until the first of March then:

In [69]:
from timeseria.time import dt
forecaster = PeriodicAverageForecaster()
forecaster.fit(hourly_temperature_timeseries, to_dt=dt(2019,3,1))
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"

..and let's evaluate it from the first of March until the end:

In [70]:
forecaster.evaluate(hourly_temperature_timeseries, from_dt=dt(2019,3,1), limit=None)
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
Out[70]:
{'RMSE': 0.5525355248094277, 'MAE': 0.4175969435867638}

As you might have noticed, the evaluation is done for one step ahead and 24 steps ahead, 24 being the periodicity of this temperature signal. But you can evaluate on specific steps if you care more about them, for example the evolution of the temperature over the next three hours:

In [71]:
forecaster.evaluate(hourly_temperature_timeseries, from_dt=dt(2019,3,1), steps=[1,2,3])
INFO:timeseria.models.forecasters:Will evaluate model for [1, 2, 3] steps with metrics ['RMSE', 'MAE']
Out[71]:
{'RMSE': 0.4863670523208427, 'MAE': 0.3640246159576184}

This is more reliable than a 24-step-ahead prediction, as one would expect. You can also include the details and set which error metrics to use (at the moment only RMSE, MAE and MAPE are supported):

In [72]:
forecaster.evaluate(hourly_temperature_timeseries, from_dt=dt(2019,3,1), details=True, metrics=['MAE', 'MAPE'])
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['MAE', 'MAPE']
Out[72]:
{'MAE_1_steps': 0.35700142587030215,
 'MAPE_1_steps': 0.0154316875231193,
 'MAE_24_steps': 0.4781924613032254,
 'MAPE_24_steps': 0.020684587302167653,
 'MAE': 0.4175969435867638,
 'MAPE': 0.018058137412643477}

You can also run cross validation:

In [73]:
forecaster.cross_validate(hourly_temperature_timeseries, rounds=3)
INFO:timeseria.models.base:Cross validation round #1 of 3: validate from 1546477200.0 (2019-01-03 01:00:00+00:00) to 1549501200.0 (2019-02-07 01:00:00+00:00), fit on the rest.
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
INFO:timeseria.models.base:Cross validation round #2 of 3: validate from 1549501200.0 (2019-02-07 01:00:00+00:00) to 1552525200.0 (2019-03-14 01:00:00+00:00), fit on the rest.
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
INFO:timeseria.models.base:Cross validation round #3 of 3: validate from 1552525200.0 (2019-03-14 01:00:00+00:00) to 1555545600.0 (2019-04-18 00:00:00+00:00), fit on the rest.
INFO:timeseria.models.forecasters:Detected periodicity: 24x 1h
INFO:timeseria.models.forecasters:Using a window of "24"
INFO:timeseria.models.forecasters:Will evaluate model for [1, 24] steps with metrics ['RMSE', 'MAE']
Out[73]:
{'RMSE_avg': 0.6200012726041833,
 'RMSE_stdev': 0.13558102733034552,
 'MAE_avg': 0.47403945567689904,
 'MAE_stdev': 0.10348347371176882}


Timezones and their implications


Until now we did not mention anything about time zones: Timeseria silently assumed everything to be on UTC, since in Timeseria there are no timestamps without a timezone (and epoch seconds are on UTC by definition).

However, in time series data analysis, time zones often step in, in particular due to Daylight Saving Time (DST), also known as summer time.

Timeseria tries to provide extensive support for timezones, so that, for example, the slotter will correctly compute the slots. Let's change the timezone of the temperature time series and re-slot it into days:

In [74]:
tzaware_temperature_timeseries = temperature_timeseries.duplicate()
tzaware_temperature_timeseries.change_timezone('Europe/Rome')
tzaware_temperature_timeseries
Out[74]:
Time series of #14403 points at variable resolution (~600s), from point @ 1546477200.0 (2019-01-03 02:00:00+01:00) to point @ 1555550400.0 (2019-04-18 03:20:00+02:00)

So this time series is now on the Europe/Rome timezone, and the offset of its data with respect to UTC is +1 hour at the beginning and +2 hours at the end, because it includes a DST change.

Let's now slot the time series into one-day slots:

In [75]:
tzaware_daily_temperature_timeseries = tzaware_temperature_timeseries.slot('1D')
INFO:timeseria.transformations:Using auto-detected sampling interval: 600.0s
INFO:timeseria.transformations:Slotted 14385 DataTimePoints in 104 DataTimeSlots
In [76]:
tzaware_daily_temperature_timeseries.plot()

The plot is reporting the Europe/Rome timezone but, most importantly, the slots have been computed on the data actually belonging to the days of the Europe/Rome timezone, and the slot values are averaged over the correct duration of the day, which around the DST change is not 86400 seconds:

In [77]:
for slot in tzaware_daily_temperature_timeseries.filter(from_dt = dt(2019,3,30, tz='Europe/Rome'),
                                                        to_dt   = dt(2019,4,2,  tz='Europe/Rome')):
    print('Slot starting at {} --> length {}'.format(slot.start.dt, slot.length))
Slot starting at 2019-03-30 00:00:00+01:00 --> length 86400.0
Slot starting at 2019-03-31 00:00:00+01:00 --> length 82800.0
Slot starting at 2019-04-01 00:00:00+02:00 --> length 86400.0

To get a better sense of how time zones and DST can impact time series modeling, let's consider the following example, which uses traffic data from a highway north of London, provided by Highways England:

In [78]:
traffic_timeseries1 = storages.CSVFileStorage(DATASET_PATH + 'traffic1.csv').get(item_type='slots')
traffic_timeseries1.plot()
INFO:timeseria.storages:Assuming 15m time unit and creating Slots.

Here you can clearly see the peaks of the morning commute at 7 AM.

Let's now load another portion of the data:

In [79]:
traffic_timeseries2 = storages.CSVFileStorage(DATASET_PATH + 'traffic2_missing.csv').get(item_type='slots')
traffic_timeseries2.plot()
INFO:timeseria.storages:Assuming 15m time unit and creating Slots.

Here, apart from the missing data, you can see that the peak is not at 7 AM anymore: it is at 6 AM instead. This is because the DST changed between the first and the second time series, thus shifting everything by one hour. If you were trying to compute, for example, hourly averages of highway traffic on these two time series, you would get a misleading answer.

Assuming everything on UTC and using epoch indeed sounds clean and simple at first, but it clashes with real world data, which, if not purely "physical", gets heavily affected by DST. At the same time, using local times instead of UTC and epoch makes time math much harder and completely breaks it twice a year (when DST goes off you go back in time by one hour, and when it kicks in you jump forward by one hour, causing the 23-hour and 25-hour days already mentioned with regard to the Slots and the variable time units).

To address these issues, Timeseria tries to keep both at the same time, meaning that you always have epoch (so you can do whatever time math you want) and local time (provided that you decorated the time series with a timezone), as shown above when slotting the timezone-aware temperature time series.

Let's decorate our time series with the London timezone then:

In [80]:
traffic_timeseries1.change_timezone('Europe/London')
traffic_timeseries1.plot()
In [81]:
traffic_timeseries2.change_timezone('Europe/London')
traffic_timeseries2.plot()

..so now you can clearly see that all the peaks are at 7 AM London time. If we now train a reconstructor without telling it that the data is affected by DST, this is what we get:

In [82]:
from timeseria.models import PeriodicAverageReconstructor
reconstructor = PeriodicAverageReconstructor()
reconstructor.fit(traffic_timeseries1, offset_method='extremes')
r_traffic_timeseries2 = reconstructor.apply(traffic_timeseries2)
r_traffic_timeseries2.plot()
INFO:timeseria.models.reconstructors:Detected periodicity: 96x 15m

..that is, a wrong reconstruction: the peak is now at 8 AM London time, due to the lack of DST-awareness. But if we tell the reconstructor to be DST-aware (which requires a timezone-aware time series), then the reconstruction works correctly:

In [83]:
from timeseria.models import PeriodicAverageReconstructor
reconstructor = PeriodicAverageReconstructor()
reconstructor.fit(traffic_timeseries1, offset_method='extremes', dst_affected=True)
r_traffic_timeseries2 = reconstructor.apply(traffic_timeseries2)
r_traffic_timeseries2.plot()
INFO:timeseria.models.reconstructors:Detected periodicity: 96x 15m

This is a trivial example, but there are phenomena which are affected by both natural and DST cycles, and Timeseria provides all the framework needed to correctly handle timezones so that everything keeps working as expected.


Where to go from here


You can have a look at the example repository (Timeseria-notebooks), which is also ready to be played with in Binder.


Or, you can give it a try in your own projects:

pip install git+https://github.com/sarusso/Timeseria.git

..or you can contribute! :)