Tutorial: Tabular training with Dask

How to use bigtabular for training on large tabular datasets

To illustrate the tabular application, we will use the example of the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from some general demographic data.

This is a small dataset that can easily be processed in memory by pandas. In practice, fastai's TabularPandas should be used whenever the data can be handled by pandas; this tutorial only illustrates the functionality of bigtabular and its similarity to the fastai.tabular API. The guidance from the Dask documentation applies:

Dask DataFrames are often used either when …

  1. Your data is too big
  2. Your computation is too slow and other techniques don’t work

You should probably stick to just using pandas if …

  1. Your data is small
  2. Your computation is fast (subsecond)
  3. There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.
from fastai.tabular.all import *
from bigtabular.core import *
from bigtabular.data import *
from bigtabular.learner import *
import dask.dataframe as dd

We can download a sample of this dataset with the usual untar_data command:

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/stefan/.fastai/data/adult_sample/export.pkl'),Path('/home/stefan/.fastai/data/adult_sample/adult.csv'),Path('/home/stefan/.fastai/data/adult_sample/models')]

Then we can load the data into a Dask dataframe and have a look at how it is structured:

df = pd.read_csv(path/'adult.csv')
ddf = dd.from_pandas(df)
ddf.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
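
As an aside, this sample comfortably fits in memory, so converting from pandas is fine here. For genuinely large data you would check the in-memory footprint and, if pandas can't hold it, let Dask read the file directly with explicit partitioning. A quick sketch (the variable names are just illustrative):

# Rough guide to the pandas-vs-Dask decision: how big is the data in RAM?
print(f"{df.memory_usage(deep=True).sum() / 1e9:.3f} GB in memory")

# For data too large for pandas, skip the pandas step entirely: dd.read_csv
# reads the file lazily in blocks; npartitions controls the conversion split.
big_ddf = dd.read_csv(path/'adult.csv')
part_ddf = dd.from_pandas(df, npartitions=4)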

Some of the columns are continuous (like age) and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education) and we will convert them to unique indices to feed into embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:

dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
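
If you are unsure which columns to treat as categorical versus continuous, fastai ships a helper, cont_cat_split, that suggests a split based on cardinality. It is a plain fastai function, so run it on the pandas frame (max_card=20 is its default threshold):

# Columns with more than max_card distinct values are proposed as continuous,
# the rest as categorical; the dependent variable is excluded.
cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='salary')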

The last part is the list of pre-processors we apply to our data: these are Dask-compatible versions of Categorify, FillMissing and Normalize from fastai.tabular.
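
To make the roles of the three procs concrete, here is a rough, hand-written sketch of what they do to a Dask dataframe. It is purely illustrative: the real procs also fit their statistics on the training split only and record them so they can be re-applied to validation and test data.

def sketch_procs(ddf, cat_names, cont_names):
    # DaskCategorify (roughly): map each category string to an integer code
    ddf = ddf.categorize(columns=cat_names)
    for c in cat_names:
        ddf[c] = ddf[c].cat.codes
    for c in cont_names:
        # DaskFillMissing (roughly): flag missing values, then fill with the
        # median (computed eagerly here for simplicity)
        ddf[f'{c}_na'] = ddf[c].isna()
        ddf[c] = ddf[c].fillna(ddf[c].quantile(0.5).compute())
        # DaskNormalize (roughly): zero mean, unit variance
        ddf[c] = (ddf[c] - ddf[c].mean()) / ddf[c].std()
    return ddf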

To further expose what's going on below the surface, let's rewrite this using the TabularDask class. We need to make one adjustment: defining how we want to split our data. By default the factory method above used a random 80/20 split, so we will do the same:

split_func = RandomTrainMask()
to = TabularDask(ddf, procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
                 cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                 cont_names=['age', 'fnlwgt', 'education-num'],
                 y_names='salary',
                 train_mask_func=split_func)
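
You can also supply your own splitting logic. The sketch below is purely illustrative and assumes the mask callable receives a dataframe (or a partition of it) and returns a boolean Series marking the training rows; check the bigtabular docs for the exact contract:

# Hypothetical deterministic ~80/20 split keyed on a stable column, so the
# same rows land in the training set on every run:
def hash_train_mask(df):
    return (df['fnlwgt'] % 5) != 0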

By comparison, to show the similarity between the APIs, this is the TabularPandas equivalent on which TabularDask is based:

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to_ = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                    cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                    cont_names=['age', 'fnlwgt', 'education-num'],
                    y_names='salary',
                    splits=splits)
/home/stefan/anaconda3/envs/bigtabular/lib/python3.10/site-packages/fastai/tabular/core.py:312: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)

Once we build our TabularDask object, our data is completely preprocessed as seen below:

to.xs.head()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 5 8 3 0 6 5 1 0.763014 -0.838956 0.752354
1 5 13 1 5 2 5 1 0.396491 0.444738 1.534279
2 5 12 1 0 5 3 2 -0.043336 -0.887631 -0.029572
3 6 15 3 11 1 2 1 -0.043336 -0.729692 1.925242
4 7 6 3 9 6 3 2 0.249882 -1.019274 -0.029572
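
The .head() call above only computes the first rows: since the data lives in a Dask dataframe, the preprocessing stays lazy. When you really need everything materialized, compute it explicitly:

# Pull the fully processed data into a regular pandas DataFrame (only do
# this when it actually fits in memory):
processed = to.xs.compute()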

Now we can build our DataLoaders again:

dls = to.dataloaders(bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

Later we will explore why using TabularDask to preprocess will be valuable.

The show_batch method works the same as in fastai.tabular:

dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101320.001104 12.0 >=50k
1 Private Masters Divorced Exec-managerial Not-in-family White False 44.0 236746.000557 14.0 >=50k
2 Private HS-grad Divorced #na# Unmarried Black True 38.0 96185.000211 10.0 <50k
3 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.0 112846.997183 15.0 >=50k
4 Self-emp-not-inc 7th-8th Married-civ-spouse Other-service Wife Black True 42.0 82296.999111 10.0 <50k
5 Private Some-college Divorced #na# Other-relative White False 49.0 44434.004438 10.0 <50k
6 Private 11th Married-civ-spouse #na# Husband White False 37.0 138940.000986 7.0 <50k
7 Self-emp-inc HS-grad Married-civ-spouse #na# Husband White True 36.0 216710.998684 10.0 >=50k
8 Private Bachelors Never-married #na# Own-child Black False 23.0 529222.998504 13.0 <50k
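
To see what a batch actually contains, you can grab one directly. This assumes batches follow the fastai.tabular layout of (categorical codes, continuous values, targets):

# One mini-batch as tensors: integer category codes, normalized continuous
# columns, and the encoded dependent variable
x_cat, x_cont, y = dls.one_batch()
x_cat.shape, x_cont.shape, y.shape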

We can define a model using the dask_learner method. The DaskLearner class inherits from the TabularLearner class. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your y’s may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = DaskCategoryBlock in your constructor so fastai won’t presume you are doing regression.
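
For instance, something along these lines (the ddf_enc variable is hypothetical, with salary already encoded as 0/1; this assumes TabularDask forwards y_block the same way TabularPandas does):

# Be explicit that an integer-encoded target is a classification problem:
to_enc = TabularDask(ddf_enc, procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
                     cat_names=['workclass', 'education'],
                     cont_names=['age', 'fnlwgt'],
                     y_names='salary', y_block=DaskCategoryBlock(),
                     train_mask_func=RandomTrainMask())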

learn = dask_learner(dls, metrics=accuracy)
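
Presumably the usual tabular_learner options pass straight through as well, for example a custom layer configuration (hypothetical variant, assuming dask_learner mirrors tabular_learner's signature):

# A wider/deeper MLP than the default architecture
learn_wide = dask_learner(dls, layers=[200, 100], metrics=accuracy)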

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.367721 0.358477 0.838588 01:02
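
Since DaskLearner is a regular fastai Learner underneath, the standard training utilities should also be available, for example picking a learning rate before training (the lr_max value below is hypothetical, read off the resulting plot):

# Sweep learning rates and plot loss vs. lr
learn.lr_find()
# Then train with a value suggested by the plot
learn.fit_one_cycle(1, lr_max=1e-3)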

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5 12 5 7 4 5 1 -1.362819 -1.200198 -0.420534 0 0
1 5 12 3 4 1 5 1 0.543100 1.311777 -0.420534 1 0
2 8 13 1 0 2 5 1 1.276146 0.798919 1.534279 0 0
3 5 10 3 11 1 5 2 0.469796 0.740680 -0.029572 1 1
4 5 7 1 13 2 5 2 0.543100 -0.684592 -0.029572 0 0
5 5 12 4 0 4 3 2 -0.703078 10.223143 -0.029572 0 0
6 5 12 3 4 1 5 1 0.763014 0.544722 -0.420534 1 0
7 6 13 3 5 1 5 1 1.202842 0.310791 1.534279 0 1
8 7 16 1 0 5 5 2 0.616405 0.226713 -0.029572 0 0

Or use the predict method on a row:

row, clas, probs = learn.predict(ddf.head().iloc[0])
row
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101320.001104 12.0
clas, probs
(tensor(1), tensor([0.3358, 0.6642]))

To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable in its columns.

test_ddf = ddf.copy()
test_ddf = test_ddf.drop(['salary'], axis=1)
dl = learn.dls.test_dl(test_ddf)

Then Learner.get_preds will give you the predictions:

learn.get_preds(dl=dl)
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)
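
The first tensor holds the per-class probabilities; take the argmax for class indices and map them back to labels (assuming the DataLoaders expose a vocab, as fastai's tabular DataLoaders do):

# Hard class predictions and their string labels
preds, _ = learn.get_preds(dl=dl)
pred_idx = preds.argmax(dim=1)
pred_labels = [dls.vocab[int(i)] for i in pred_idx]
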
Note

Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If your test data contains missing values or categories that differ from the training data, you should address this before training.

Export the learner: this saves its content, without the items and the optimizer state, for inference:

learn.export("model.pk")

We can then load the exported learner for inference:

trained_learner = load_learner("model.pk")
dl = trained_learner.dls.test_dl(test_ddf)
trained_learner.get_preds(dl=dl)
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)

Cleanup: delete the exported learner:

if os.path.exists("model.pk"): os.remove("model.pk")