Business Analytics Series: Time Series Prediction with Prophet and H2O

After reading some of the quality work in blogs such as business-science.io and Practical Business Python, I have been waiting to write some posts that focus on business analytics. So…here goes.

Most companies, especially companies with large work forces, will employ teams of statisticians, data scientists and financial analysts to perform tasks such as projections of revenue, costs and other performance measures. While I have been part of companies that can employ these specialists, I have also worked for companies that are not afforded this luxury. Over the years, I have found myself with the task of forming these projections for values such as website traffic, conversions and revenue. I would characterize my earlier attempts to predict as somewhere between educated guesses and darts thrown at a board. As I grew a bit wiser and a bit more educated, I found that I could use statistical models in tools like R to remove some of the guesswork from my projections. After chatting with a colleague last week, I was reminded of some of the work that I had done with projections and started wondering how some of the newer statistical packages stack up against each other in this regard. In this post, I’ll compare H2O’s Automl function with the widely praised Facebook Prophet package.

Automl is part of H2O’s machine learning framework. Anyone who has dipped a toe into the machine learning world would attest that, after data preparation, the most time-consuming part of the machine learning workflow is optimizing models. Automl allows users to automate much of the machine learning workflow, mainly the model selection and optimization processes. Automl will train multiple Deep Learning, GBM and Random Forest models and ensemble the best models together to find the most accurate model. Users simply specify a few basic parameters, such as the amount of resources they would like to use (time in seconds, the number of models to train and the amount of compute power available), and the function returns the best model it found. The more resources you have, the better your model will theoretically be. My example is based on the Bike Sharing Dataset from the UCI Machine Learning Repository and is heavily influenced by the Time Series and Feature Engineering blog post on business-science.io.
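To make the idea of automated model selection a bit more concrete, here is a minimal base R sketch of the general pattern: fit several candidate models, score each one on held-out data and keep the winner. This is not how H2O implements Automl; the made-up data, the polynomial candidates and the `leaderboard` name are all assumptions chosen purely for illustration.

```r
# Toy illustration of automated model selection (NOT H2O's algorithm):
# fit several candidate models, score each on held-out data, keep the best.
set.seed(1234)

# Made-up data: a noisy quadratic trend
n  <- 200
df <- data.frame(x = seq(0, 10, length.out = n))
df$y <- 2 + 0.5 * df$x^2 + rnorm(n, sd = 2)

# Hold out the last 25% of observations as a validation set
is_train <- seq_len(n) <= 0.75 * n
train_df <- df[is_train, ]
valid_df <- df[!is_train, ]

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Candidate models: polynomial regressions of degree 1 to 4
degrees <- 1:4
scores <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d, raw = TRUE), data = train_df)
  rmse(valid_df$y, predict(fit, newdata = valid_df))
})

# Rank the candidates by validation RMSE, lowest first
leaderboard <- data.frame(degree = degrees, valid_rmse = scores)
leaderboard <- leaderboard[order(leaderboard$valid_rmse), ]
print(leaderboard)

best_degree <- leaderboard$degree[1]
```

Automl does the same kind of bookkeeping across far more model families and hyperparameters, which is why the time and compute budget matters so much.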

Scripts

Even though I have added comments in my script, here is a brief synopsis of the steps used to prep the data.

  1. Load libraries
  2. Load data
  3. Create new date features using the timetk library
  4. Create partitions for train and test data
  5. Visualize data
  6. Model data with the Automl function in h2o
  7. Create diagnostic metrics and visualize projection
  8. Model data with Prophet
  9. Create diagnostic metrics and visualize projection
  10. Visualize different projections side by side
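
#load packages
library(data.table)
library(dplyr)      # %>%, select(), rename(), mutate(), summarise()
library(recipes)    # recipe(), prep(), bake()
library(timetk)     # step_timeseries_signature()
library(tidyquant)  # palette_light(), theme_tq(), geom_ma(); also attaches lubridate for ymd()
library(h2o)
library(prophet)
library(forecast)
library(ggplot2)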

# load data ---------------------------------------------------------------
# forward slashes avoid the invalid "\D" escape in the original path
bike_data <- fread("~/Downloads/Bike-Sharing-Dataset/day.csv")
bike_data <- bike_data %>%
  select(dteday, cnt) %>%
  rename(date = dteday, value = cnt) %>%
  mutate(date = as.Date(date))


# date features -----------------------------------
recipe <- recipe(value ~., data = bike_data, strings_as_factors = F) %>% step_timeseries_signature(date)
data <- bake(prep(recipe), new_data = bike_data) %>% setDT()

#h2o doesn't accept ordered factors
data <- data[,date_month.lbl := NULL]
data <- data[,date_wday.lbl := NULL]

#remove constant columns
data <- data[,!sapply(data, function(.col){ all( is.na(.col) | .col[1L] == .col ) } ), with=F]

# create partitions ----
train <- data[date < "2012-07-01"]
test <- data[date >= "2012-07-01"]

# visualize data ----------------------------------------------------------
# Plot Bike Rentals with train and test sets shown
data %>%
  ggplot(aes(date, value)) +
  # Train Region
  annotate("text", x = ymd("2012-01-01"), y = 7000,
           color = palette_light()[[1]], label = "Train Region") +
  # Test Region
  geom_rect(xmin = as.numeric(ymd("2012-07-01")), 
            xmax = as.numeric(ymd("2013-01-01")),
            ymin = 0, ymax = Inf, alpha = 0.02,
            fill = palette_light()[[4]]) +
  annotate("text", x = ymd("2012-07-01"), y = 7000,
           color = palette_light()[[1]], label = "Test\nRegion") +
  # Data
  geom_line(col = palette_light()[1]) +
  geom_point(col = palette_light()[1]) +
  geom_ma(ma_fun = SMA, n = 12, size = 1) +
  # Aesthetics
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(title = "Bike Rentals",
       subtitle = "Train and Test Sets Shown")
Bike Rental Data in ggplot2

Automl

# model with h2o ----

#automl
h2o.init()

train_h2o <- as.h2o(train)
test_h2o  <- as.h2o(test)

y <- "value"
x <- setdiff(names(train_h2o),y)


automl_models_h2o <- h2o.automl(
  x = x, 
  y = y, 
  training_frame = train_h2o, 
  leaderboard_frame = test_h2o, 
  max_runtime_secs = 60, 
  stopping_metric = "RMSE",
  nfolds = 5,
  seed = 1234)

automl_models_h2o@leader

pred_h2o <- h2o.predict(automl_models_h2o@leader, newdata = test_h2o)

h2o.performance(automl_models_h2o@leader, newdata = test_h2o)

error_tbl <- bike_data %>% 
  filter(date >= "2012-07-01") %>%
  mutate(pred = pred_h2o %>% as_tibble() %>% pull(predict)) %>%
  rename(actual = value) %>%
  mutate(
    error     = actual - pred,
    error_pct = error / actual
  ) 
error_tbl

error_tbl %>%
  summarise(
    me   = mean(error),
    rmse = mean(error^2)^0.5,
    mae  = mean(abs(error)),
    mape = mean(abs(error_pct)),
    mpe  = mean(error_pct)
  ) %>%
  glimpse()

After running, we can type automl_models_h2o@leader to learn about the model selected as best.

Automl selected a Deep Learning model after training
Performance Metrics for the Automl Model

Here, we can see that we have computed 5 different metrics, but let’s focus on Root Mean Squared Error or “rmse”. If you’d like to learn more about these metrics, take a look at the Evaluating Forecast Accuracy chapter in Rob Hyndman’s Forecasting: Principles and Practice. It’s an excellent book. Next we plot our new projection.
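As a quick aside before the plot, here is how those five metrics behave on a tiny example. The actual and predicted values are made up for illustration; the formulas are the same ones used in the summarise() call above.

```r
# Made-up actuals and predictions, purely for illustration
actual <- c(100, 200, 400)
pred   <- c(110, 180, 400)

error     <- actual - pred    # signed error per observation
error_pct <- error / actual   # signed error as a share of the actual value

metrics <- c(
  me   = mean(error),          # mean error (signed, so errors can cancel)
  rmse = sqrt(mean(error^2)),  # root mean squared error (penalizes big misses)
  mae  = mean(abs(error)),     # mean absolute error
  mape = mean(abs(error_pct)), # mean absolute percentage error
  mpe  = mean(error_pct)       # mean percentage error (signed)
)
print(round(metrics, 3))
```

Notice that the signed metrics (me and mpe) can look small even when individual errors are not, because over- and under-predictions cancel each other out. That is one reason to lean on rmse or mae when comparing models.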

# plot prediction ------------------

data %>%
  ggplot(aes(date, value)) +
  # Test Region
  geom_rect(xmin = as.numeric(ymd("2012-07-01")), 
            xmax = as.numeric(ymd("2013-01-01")),
            ymin = 0, ymax = Inf, alpha = 0.02,
            fill = palette_light()[[4]]) +
  annotate("text", x = ymd("2012-12-01"), y = 8000,
           color = palette_light()[[1]], label = "Test\nRegion") +
  # Data
  geom_line(col = palette_light()[1]) +
  geom_point(col = palette_light()[1]) +
  geom_ma(ma_fun = SMA, n = 12, size = 1) +
  
  #prediction
  geom_point(aes(y = pred), color = "gray", alpha = 1, shape = 21, fill = palette_light()[2], data = error_tbl) +
  geom_line(aes(y = pred), color = palette_light()[2], size = 0.5, data = error_tbl) +
  # Aesthetics
  theme_tq() +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(title = "Bike Rentals",
       subtitle = "Train and Test Sets Shown")
Data with Automl projection in red

Prophet

Prophet is a time series focused tool developed by the data science team at Facebook. Prophet was created to be easily tuned, robust to effects such as seasonality and useful for data scientists and analysts. Unlike our Automl example (which could be run with the default settings), there are a few parameters that I set explicitly for Prophet to, well…prophesize.

daily.seasonality = T – this tells Prophet that there is seasonality in the data at the daily level.

yearly.seasonality = T – this tells Prophet that there is seasonality in the data at the yearly level.

The make_future_dataframe() function simply creates a dataframe with dates prepopulated for insertion of projection data once the model is fit on the data.

periods = 184 – our test data spans 184 days (use however many periods your own test data covers).

freq = “day” – our data is daily data.
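If you would rather not count days by hand, the periods value can be checked with a couple of lines of base R. The sketch below also builds a rough base R stand-in for the kind of date frame that make_future_dataframe() produces (history followed by future dates). The training start date of 2011-01-01 is an assumption based on the Bike Sharing Dataset, and `future_df` is just an illustrative name.

```r
# Test window used in this post: 2012-07-01 through 2012-12-31
test_dates <- seq(as.Date("2012-07-01"), as.Date("2012-12-31"), by = "day")
n_periods  <- length(test_dates)
print(n_periods)

# Rough base R stand-in for the future data frame:
# the training dates (assumed to start on 2011-01-01) followed by n_periods future days
train_dates  <- seq(as.Date("2011-01-01"), as.Date("2012-06-30"), by = "day")
future_dates <- seq(max(train_dates) + 1, by = "day", length.out = n_periods)
future_df    <- data.frame(ds = c(train_dates, future_dates))

print(range(future_df$ds))
```

The last n_periods rows of such a frame line up with the test set, which is why the comparison code later takes the tail of the forecast.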

# prophet --------------------------
prophet_model <- prophet(df = train %>% rename(ds = date, y = value), daily.seasonality = T, yearly.seasonality = T)
prophet_future <- make_future_dataframe(prophet_model, periods = 184, freq = "day")
prophet_forecast <- predict(prophet_model, prophet_future)

#visualize

plot(prophet_model, prophet_forecast)
prophet_plot_components(prophet_model, prophet_forecast)
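
# prophet_cv was not defined in the original script; the cross-validation
# settings below (initial, horizon) are assumed values, adjust to taste
prophet_cv <- cross_validation(prophet_model, initial = 365, horizon = 30, units = "days")
performance_metrics(prophet_cv)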

# compare results -------------------
error_tbl <- error_tbl %>%
  mutate(prophet = pull(tibble::enframe(tail(prophet_forecast$yhat,184)))) %>%
  mutate(prophet_error = actual - prophet,
         prophet_error_pct = prophet_error/actual)

error_tbl %>%
  summarise(
    me   = mean(error),
    rmse = mean(error^2)^0.5,
    mae  = mean(abs(error)),
    mape = mean(abs(error_pct)),
    mpe  = mean(error_pct),
    prophet_me   = mean(prophet_error),
    prophet_rmse = mean(prophet_error^2)^0.5,
    prophet_mae  = mean(abs(prophet_error)),
    prophet_mape = mean(abs(prophet_error_pct)),
    prophet_mpe  = mean(prophet_error_pct)
  ) %>%
  glimpse()

#pinpoint error at different times
ggplot(reshape2::melt(error_tbl, id.vars = "date", measure.vars = c("error_pct","prophet_error_pct")), 
       aes(date, abs(value), color = variable)) + geom_smooth()
Automl and Prophet Performance Metrics

As you can see, Prophet beats Automl with an RMSE of 1407.157 vs 1476.311. With that said, the one caveat I must mention is that Automl performs based on the resources available to the user (time in seconds, maximum number of models trained or compute power). Increasing any of these resources should, in theory, increase accuracy. Therefore, under different circumstances, Automl might produce better results. Also, Prophet includes some other tunable parameters which might affect results, such as the holidays parameter and the changepoints parameter. In closing, I’ll add the proverbial YRMV (Your Results May Vary) in this case.
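To put that gap in perspective, here is the arithmetic on the two scores reported above, assuming both are the rmse values from the metrics table:

```r
# Scores reported in this post (assumed to be RMSE for each model)
rmse_automl  <- 1476.311
rmse_prophet <- 1407.157

abs_gap <- rmse_automl - rmse_prophet   # difference in rentals per day
rel_gap <- abs_gap / rmse_automl        # difference relative to Automl's score

print(round(c(abs_gap = abs_gap, rel_gap_pct = 100 * rel_gap), 2))
```

A gap of this size is small enough that a longer Automl run or a different seed could plausibly change the ranking.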
