A lot of my work heavily involves time series analysis. One of the great but lesser-known algorithms that I use is change point detection.
Change point detection (or CPD) detects abrupt shifts in time series trends (i.e. shifts in a time series’ instantaneous velocity), that can be easily identified via the human eye, but are harder to pinpoint using traditional statistical approaches. CPD is applicable across an array of industries, including finance, manufacturing quality control, energy, medical diagnostics, and human activity analysis.
CPD is great for the following use cases:
I find CPD particularly useful when automating the process of identifying and removing anomalous sequences from a time series, as shown below:
It’s also great if I’m attempting to identify a rate change in a system, allowing me to focus on average rates across similar sequences:
This article provides a brief, easy-to-understand background on change point detection, with packages for practical implementation in Python (example code included!).
There are two different categories of CPD–offline and online. In this section, I provide a brief overview of both.
Offline Change Point Detection
Change point detection approaches are “offline” when they don’t use live streaming data, and require the complete time series for statistical analysis. Because offline approaches analyze the whole time series, they are generally more accurate. A few characteristics of offline change point detection are as follows (1):
Online Change Point Detection
In contrast with offline change point detection, online change point detection is used on live-streaming time series, usually to for the purpose of constant monitoring or immediate anomaly detection (1). Online CPD processes individual data points as they become available, with the intent of detecting state changes as soon as they occur (2). There are a few characteristics of online change point detection:
R has an excellent package for change point detection, called changepoint. This package allows users to use multiple search methods to perform change point analysis on a time series. Unfortunately, there isn’t a direct Python equivalent of R’s changepoint. However, there are a couple of other packages that offer change point detection, available via Python:
Out of the three options, I find options #1 and #3 the simplest for implementation as they don’t require downloading and configuring R and rpy2 in a Python environment. The R changepoint package’s functionality is by far the most robust, but configuring it is time-consuming. Consequently, it isn’t focused on in this post.
If you are interested in a in-depth background on calling the R changepoint package via Python using rpy2, check out this tutorial by Steven Reitsma.
Let’s explore options #1 and #3 further.
Charles Truong adapted the ruptures package from the R changepoint package. It specifically focuses on offline changepoint detection, where the whole sequence is analyzed. Out of all of the Python changepoint options, it is the best documented. We can install it using the basic pip install command:
pip install ruptures
The package offers a variety of search methods (binary segmentation, Pelt, window-based change detection, dynamic programming, etc.), as well as multiple cost functions to play around with. In this tutorial, we focus specifically on search methods.
Search Method Background
This section provides a brief background on some of the search methods available in the ruptures package, including binary segmentation, Pelt, window-based change detection, and dynamic programming.
Pruned Exact Linear Time (PELT) search method: The PELT method is an exact method, and generally produces quick and consistent results. It detects change points through the minimization of costs (4). The algorithm has a computational cost of O(n), where n is the number of data points (4). For more info on the PELT method, check out this paper.
Dynamic programming search method: This is an exact method, which has a considerable computational cost of O(Qn^2 ), where Q is the max number of change points and n is the number of data points (4). For more info on the dynamic programming search method, check out this paper.
Binary segmentation search method: This method is arguably the most established in literature (4). Binary segmentation is an approximate method with an efficient computational cost of O (n log n), where n is the number of data points (4). The algorithm works by iteratively applying a single change point method to the entire sequence to determine if a split exists. If a split is detected, then the sequence splits into two sub-sequences (5). The same process is then applied to both sub-sequences, and so on (5). For more info on binary segmentation, check out this paper.
Window-based search method: This is a relatively simple approximate search method. The window-based search method “computes the discrepancy between two adjacent windows that move along with signal y” (6). When the two windows are highly dissimilar, a high discrepancy between the two values occurs, which is indicative of a change point (6). Upon generating a discrepancy curve, the algorithm locates optimal change point indices in the sequence (6). For more info on the window-based search method, check out this paper.
Code Example
In the below code, we perform change point detection using the search methods described above. We use the time series for daily WTI oil prices, from 2014 to now, pulled via the Energy Information Administration’s (EIA) API (see this tutorial for more info on using the EIA API to pull data):
def retrieve_time_series(api, series_ID): """ Return the time series dataframe, based on API and unique Series ID api: API that we're connected to series_ID: string. Name of the series that we want to pull from the EIA API """ #Retrieve Data By Series ID series_search = api.data_by_series(series=series_ID) ##Create a pandas dataframe from the retrieved time series df = pd.DataFrame(series_search) return df """ Execution in main block """ #Create EIA API using your specific API key api_key = 'YOUR API KEY HERE' api = eia.API(api_key) #Pull the oil WTI price data series_ID='PET.RWTC.D' price_df=retrieve_time_series(api, series_ID) price_df.reset_index(level=0, inplace=True) #Rename the columns for easier analysis price_df.rename(columns=, inplace=True) #Format the 'Date' column price_df['Date']=price_df['Date'].astype(str).str[:-3] #Convert the Date column into a date object price_df['Date']=pd.to_datetime(price_df['Date'], format='%Y %m%d') #Subset to only include data going back to 2014 price_df=price_df[(price_df['Date']>='2014-01-01')] #Convert the time series values to a numpy 1D array points=np.array(price_df['WTI_Price']) #RUPTURES PACKAGE #Changepoint detection with the Pelt search method model="rbf" algo = rpt.Pelt(model=model).fit(points) result = algo.predict(pen=10) rpt.display(points, result, figsize=(10, 6)) plt.title('Change Point Detection: Pelt Search Method') plt.show() #Changepoint detection with the Binary Segmentation search method model = "l2" algo = rpt.Binseg(model=model).fit(points) my_bkps = algo.predict(n_bkps=10) # show results rpt.show.display(points, my_bkps, figsize=(10, 6)) plt.title('Change Point Detection: Binary Segmentation Search Method') plt.show() #Changepoint detection with window-based search method model = "l2" algo = rpt.Window(width=40, model=model).fit(points) my_bkps = algo.predict(n_bkps=10) rpt.show.display(points, my_bkps, figsize=(10, 6)) plt.title('Change Point Detection: Window-Based Search Method') plt.show() #Changepoint detection with dynamic programming search method model = "l1" algo = rpt.Dynp(model=model, min_size=3, jump=5).fit(points) my_bkps = algo.predict(n_bkps=10) rpt.show.display(points, my_bkps, figsize=(10, 6)) plt.title('Change Point Detection: Dynamic Programming Search Method') plt.show()
As you can see in the graphics above, the detected change points in the sequence differ based on the search method used. The optimal search method depends on what you value most when subsetting the time series. The PELT and dynamic programming methods are both exact (as opposed to approximate) methods, so they are generally more accurate.
The changefinder package is specifically for online change point detection. To perform change point detection, the package uses SDAR modelling, or sequentially discounting autoregression time series modelling. SDAR is exactly what it sounds like–it’s an extension of autoregressive (AR) modelling, where older data points in the sequence are ‘discounted’, i.e. are less important than more recent values in the sequence. Because recent data is weighed more heavily in an SDAR model, SDAR is well-suited for online change point detection, which focuses on detecting the most recent changes in a sequence.
Autoregressive modeling (AR) is one of the most popular forms of time series modeling, where the current value is predicted based on previous values in the sequence (3). For more information on SDAR models (as well as multivariate SDVAR models), check out this paper.
Code Example
Now that we have some initial background on the changefinder package, let’s use it to perform online change point detection.
For this example, we’re going to autogenerate data using the random() and numpy() packages:
#Create a synthetic data set to test against points=np.concatenate([np.random.rand(100)+5, np.random.rand(100)+10, np.random.rand(100)+5])
After we’ve generated some synthetic data, we run the data through the ChangeFinder function, and generate an anomaly score, based on SDAR, for each data point:
#CHANGEFINDER PACKAGE f, (ax1, ax2) = plt.subplots(2, 1) f.subplots_adjust(hspace=0.4) ax1.plot(points) ax1.set_title("data point") #Initiate changefinder function cf = changefinder.ChangeFinder() scores = [cf.update(p) for p in points] ax2.plot(scores) ax2.set_title("anomaly score") plt.show()
In the above visual, the anomaly score peaks at time 100 and time 200, which corresponds to points where massive shifts in the time series occur. Setting a minimum threshold for anomaly scores, where anything above a certain threshold corresponds to a change point in the sequence, is the best way to identify individual change points in the series.
This concludes my brief introduction to change point detection. For access to the code that I use in this tutorial, check out my Github repo.
As always, thanks for reading!
Check out some of my other data science articles and tutorials: