Machine Learning with Pycaret and the Home Price Dataset

I’ve felt a little hypocritical over the last month or two, as though I’ve been cheating. I’ve spent the last few weeks checking out Pycaret, the Python machine learning library inspired by the R machine learning library Caret. Now, time for a second confession from this self-proclaimed R “stan”. I like it. I like it a lot!

Why I like Pycaret

  1. Pycaret is created with speed and automation in mind – Pycaret bills itself as an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes (and when they say minutes, they mean minutes). I was able to train my first model in a few minutes. It is also built on top of the scikit-learn API.
  2. Pycaret is not very verbose – What Pycaret lacks in the overall number of functions, it makes up for in simplicity. The setup() function does all of the data preparation, from encoding categoricals to imputing missing values to removing outliers. This is all done in a single function instead of spreading the work across a number of different, hard-to-remember function calls (yes, I’m looking at you, tidymodels recipes). The compare_models() function runs a benchmark against all of the applicable algorithms and returns performance data.
  3. Simple evaluation and deployment – Functions such as evaluate_model() and interpret_model() return easy-to-use interfaces for developing a deeper understanding of your model. The deploy_model() function gives the user a process for deploying models in AWS, Google Cloud or Azure. Users can also create a scikit-learn compatible object for deployment in other environments.

Machine Learning with Pycaret and the Ames Housing Dataset

Pycaret is written specifically for users who work in notebook-style Python IDEs, so Jupyter is probably the best option here.

The Ames Housing Dataset can be found on the Kaggle Ames Housing Regression Competition. I like using this dataset because it includes both numeric and categorical features, and since it is part of a “getting started” competition, there are plenty of notebooks and comparisons a user can draw on to learn more about machine learning. The dataset is also a good tool for learning some statistical conventions such as dealing with outliers and constant features. I’m going to forgo discussing most of those conventions as they could require their own post.

Let’s start by installing and importing the necessary libraries and reading in the data:

#install libraries (notebook magic; from a terminal, use plain "pip install")
%pip install pandas
%pip install pycaret

#import libraries
import pandas as pd
from pycaret.regression import *

#read data
train = pd.read_csv("C:\\Documents\\ameshomeprice\\train.csv")
test = pd.read_csv("C:\\Documents\\ameshomeprice\\test.csv")

Here we can see that the training data includes both numeric and categorical features. We can let the setup() function deal with encoding automatically (one-hot encoding only at the time of this post). We can also specify whether any of our categorical features are ordinal instead of nominal.
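
To make the nominal versus ordinal distinction concrete, here is a minimal pure-Python sketch of the two encodings. It is not Pycaret's implementation: the helper names one_hot and ordinal_encode are made up for illustration, and the sample values are only in the style of the Ames street and quality columns.

```python
def one_hot(values):
    """One-hot encode a nominal column: one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def ordinal_encode(values, order):
    """Encode an ordinal column as integer ranks that follow a known order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Nominal feature: the categories have no natural order
street = ["Pave", "Grvl", "Pave"]
street_encoded = one_hot(street)
print(street_encoded)

# Ordinal feature: Po < Fa < TA < Gd < Ex (illustrative quality scale)
quality = ["TA", "Gd", "Ex", "TA"]
quality_encoded = ordinal_encode(quality, order=["Po", "Fa", "TA", "Gd", "Ex"])
print(quality_encoded)
```

In Pycaret itself, the ordered levels are handed to setup() through its ordinal_features argument, a dictionary that maps a column name to its list of levels from lowest to highest.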

Let’s drop the “Id” column as it has no predictive power:

train = train.drop('Id', axis=1)

Next we’ll run our setup function with no extra parameters:

pyc_setup = setup(train, target = 'SalePrice')
Setup Output

As you can see, setup() will output a list of all of the parameters in the setup function and the actual values used. This is very helpful, as the user knows all of the changes that the function made to the underlying data. The image doesn’t show the full list as the list of parameters is rather long, but other parameters of note are imputation methods, normalization and outlier removal.

Now, let’s run the compare_models() function to get a benchmark of model performance:

models_all = compare_models(sort = 'rmse')

The function is pretty simple in comparison to other frameworks I’ve used (MLR, Caret, Scikit-Learn, etc.).

Score Grid

The function will run each algorithm and return cross-validated scores (the default). As we can see, catboost, lightgbm and xgboost performed the best with our out-of-the-box setup based on RMSE (Root Mean Squared Error).
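
For readers new to the metric, here is a rough sketch of what a cross-validated RMSE means. It uses only the standard library, the prices are made-up numbers, and the "model" simply predicts the training mean. The helper names rmse and kfold_indices are hypothetical, and Pycaret's own fold strategy may differ from this simple interleaved split.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        valid = indices[fold::n_folds]  # every n_folds-th row, offset by fold
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid


# Toy sale prices (made up) and a baseline that predicts the training mean
prices = [100.0, 150.0, 200.0, 250.0, 300.0, 350.0]
fold_scores = []
for train_idx, valid_idx in kfold_indices(len(prices), n_folds=3):
    mean_price = sum(prices[i] for i in train_idx) / len(train_idx)
    actual = [prices[i] for i in valid_idx]
    fold_scores.append(rmse(actual, [mean_price] * len(actual)))

# The score grid reports the average across folds
cv_rmse = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_rmse)
```

Each row in the score grid is the same idea applied to a real algorithm: fit on the training folds, score on the held-out fold, then average.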

So, we’ve created our score grid, but we want to see if we can make some changes to the preprocessing function to possibly improve our score. Let’s add a few new parameters to setup().

setup_norm_remove = setup(train, target = 'SalePrice', normalize = True, remove_outliers = True)
models_all_norm_remove = compare_models(sort = 'rmse')
Performance Grid

As we can see, pretty much every algorithm performed better after the preprocessing was tweaked. What makes Pycaret stand out here is just how easy the preprocessing was.
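
As a rough illustration of what those two switches do to a numeric column, the sketch below rescales values to z-scores and drops extreme rows. Two assumptions to flag: z-score scaling is, as far as I know, Pycaret's default normalization method, but the outlier filter here is a simple z-score threshold swapped in for illustration and is not the detector Pycaret uses. The helper names and the living-area numbers are made up.

```python
import statistics


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1 (values must vary)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def drop_outliers(values, threshold=2.0):
    """Keep values whose absolute z-score is at most `threshold`."""
    return [v for v, z in zip(values, zscore(values)) if abs(z) <= threshold]


# Made-up living areas in square feet; the last one is an obvious outlier
living_area = [1200, 1500, 1100, 1400, 1300, 1600, 1250, 1450, 9000]

cleaned = drop_outliers(living_area)  # the 9000 sq ft row is removed
scaled = zscore(cleaned)              # remaining rows on a common scale
print(cleaned)
print(scaled)
```

Scaling tends to matter most for linear and distance-based models, which may be why so many of the algorithms in the grid moved after this change.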

We can see that the Huber Regressor performed best here. Let’s see if we can improve that model through hyperparameter tuning.

huber = create_model('huber')
huber_tuned = tune_model(huber, n_iter = 500)

The code above creates the default model and then tunes it. There are a number of other hyperparameter search options as well, including the Scikit-learn and Optuna libraries, which provide search algorithms such as Bayesian optimization. In my limited testing of the search libraries, the default random search seemed to outperform the others, but only when I raised the number of iterations above the default of 10 (the example above uses 500). Users with more compute power and more time will probably see greater improvement with the other libraries.
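
To show why the iteration count matters for random search, here is a small standard-library sketch. It does not call Pycaret or fit a Huber regressor: toy_validation_error is a made-up stand-in for a cross-validated RMSE, and its optimum (alpha=0.01, epsilon=1.35) is arbitrary. Only the parameter names alpha and epsilon are borrowed from scikit-learn's HuberRegressor.

```python
import random


def toy_validation_error(alpha, epsilon):
    """Stand-in for a cross-validated RMSE; lowest at alpha=0.01, epsilon=1.35."""
    return (alpha - 0.01) ** 2 + (epsilon - 1.35) ** 2


def random_search(n_iter, seed=42):
    """Sample n_iter random settings and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(n_iter):
        params = {
            "alpha": rng.uniform(0.0, 1.0),
            "epsilon": rng.uniform(1.0, 2.0),
        }
        error = toy_validation_error(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Same seed, so the 500-draw search starts with the same 10 candidates
params_10, error_10 = random_search(n_iter=10)
params_500, error_500 = random_search(n_iter=500)
print(error_10, error_500)
```

Because both runs share a seed, the longer search can only match or beat the shorter one. In real tuning each draw costs a full cross-validation, which is the trade-off behind n_iter.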

We were able to get a rather small improvement in RMSE, but the standard deviation increased a bit. Again, because this is the default random search, a higher iteration count might further improve performance.

Evaluation and Prediction

Another area where Pycaret shines is evaluation and interpretation. Other machine learning frameworks include evaluation tools, but (other than H2O’s Flow) I’ve never seen them quite this intuitive.

Evaluate Plots

Pycaret returns a number of plots that can be viewed to evaluate the model. This is great, as most other frameworks I’ve used force you to call a separate function for each plot you need. This is point and click.
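
The calls behind these plots are not shown in this post, so the following is a sketch rather than the exact code used. It assumes a setup() session is already active and wraps Pycaret's evaluate_model() and plot_model() in a hypothetical helper, show_evaluation, which is not part of the library.

```python
def show_evaluation(model):
    """Open Pycaret's evaluation widget, then draw the feature importance plot."""
    # Imported here so the helper can be defined before Pycaret is loaded
    from pycaret.regression import evaluate_model, plot_model

    evaluate_model(model)              # interactive plot picker (notebook only)
    plot_model(model, plot="feature")  # feature importance as a single plot


# In the notebook, after tune_model() has run:
# show_evaluation(huber_tuned)
```

The widget from evaluate_model() is the point-and-click interface, while plot_model() is the one-plot-at-a-time alternative for when a specific chart is needed.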

Feature Importance Plot

Of course, the main reason we use these models is to predict future values, so let’s do that.

pred = predict_model(huber_tuned, data = test)
Prediction Returned with the Column Name “Label”

To conclude, I really enjoy using this framework as it is intuitive yet powerful. With that said, there are always going to be some quirks. For instance, some features that I have seen in other frameworks are not included. One of those would be AutoML. AutoML in H2O is a topic I covered in my Time Series Prediction article. Pycaret has an automl() function, but it only returns the best model from the latest run of the create_model() function. In my opinion, that is misleading, because AutoML is commonly defined this way: “AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model”. Also, in my opinion, AutoML should at the least apply artificial intelligence, grid search, random search or some other search algorithm to find the highest-performing set of hyperparameters and/or model ensembles. Pycaret’s automl() function does not do this.

Also, Pycaret works well in Jupyter Notebooks, but trying it in another IDE such as Spyder or PyCharm yields strange results. With that said, I’ve only scratched the surface of the features available, such as ensemble learning, blending and the other modules for tasks such as classification, clustering and natural language processing. I think I’ll be due to discuss these in future posts.
