Tutorial: Tabular training with Dask

How to use bigtabular for training on large tabular datasets

To illustrate the tabular application, we will use the example of the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from some general demographic data.

This is a small dataset that can easily be processed in memory by pandas. In practice, fastai's TabularPandas should be used whenever the data can be handled by pandas; this tutorial only illustrates the functionality of bigtabular and its similarity to the fastai.tabular API. The guidance from the Dask documentation applies:

Dask DataFrames are often used either when …

  1. Your data is too big
  2. Your computation is too slow and other techniques don’t work

You should probably stick to just using pandas if …

  1. Your data is small
  2. Your computation is fast (subsecond)
  3. There are simpler ways to accelerate your computation, like avoiding .apply or Python for loops and using a built-in pandas method instead.
from fastai.tabular.all import *
from bigtabular.core import *
from bigtabular.data import *
from bigtabular.learner import *
import dask.dataframe as dd

We can download a sample of this dataset with the usual untar_data command:

path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/stefan/.fastai/data/adult_sample/export.pkl'),Path('/home/stefan/.fastai/data/adult_sample/adult.csv'),Path('/home/stefan/.fastai/data/adult_sample/models')]

Then we can load the data into a Dask dataframe and have a look at how it is structured:

df = pd.read_csv(path/'adult.csv')
ddf = dd.from_pandas(df)
ddf.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
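
As an aside, this sample comfortably fits in memory, so converting from pandas is fine here. For genuinely large data you would check the in-memory footprint and, if pandas can't hold it, let Dask read the file directly with explicit partitioning. A quick sketch (the variable names are just illustrative):

# Rough guide to the pandas-vs-Dask decision: how big is the data in RAM?
print(f"{df.memory_usage(deep=True).sum() / 1e9:.3f} GB in memory")

# For data too large for pandas, skip the pandas step entirely: dd.read_csv
# reads the file lazily in blocks; npartitions controls the conversion split.
big_ddf = dd.read_csv(path/'adult.csv')
part_ddf = dd.from_pandas(df, npartitions=4)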

Some of the columns are continuous (like age) and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education) and we will convert them to unique indices to feed into embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the DaskDataLoaders factory methods:

dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
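
If you are unsure which columns to treat as categorical versus continuous, fastai ships a helper, cont_cat_split, that suggests a split based on cardinality. It is a plain fastai function, so run it on the pandas frame (max_card=20 is its default threshold):

# Columns with more than max_card distinct values are proposed as continuous,
# the rest as categorical; the dependent variable is excluded.
cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='salary')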

The last part is the list of pre-processors we apply to our data: these are Dask-compatible versions of Categorify, FillMissing and Normalize from fastai.tabular.
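
To make the roles of the three procs concrete, here is a rough, hand-written sketch of what they do to a Dask dataframe. It is purely illustrative: the real procs also fit their statistics on the training split only and record them so they can be re-applied to validation and test data.

def sketch_procs(ddf, cat_names, cont_names):
    # DaskCategorify (roughly): map each category string to an integer code
    ddf = ddf.categorize(columns=cat_names)
    for c in cat_names:
        ddf[c] = ddf[c].cat.codes
    for c in cont_names:
        # DaskFillMissing (roughly): flag missing values, then fill with the
        # median (computed eagerly here for simplicity)
        ddf[f'{c}_na'] = ddf[c].isna()
        ddf[c] = ddf[c].fillna(ddf[c].quantile(0.5).compute())
        # DaskNormalize (roughly): zero mean, unit variance
        ddf[c] = (ddf[c] - ddf[c].mean()) / ddf[c].std()
    return ddf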

To further expose what's going on below the surface, let's rewrite this using the TabularDask class. We need to make one adjustment: defining how we want to split our data. By default the factory method above used a random 80/20 split, so we will do the same:

split_func = RandomTrainMask()
to = TabularDask(ddf, procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
                 cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                 cont_names=['age', 'fnlwgt', 'education-num'],
                 y_names='salary',
                 train_mask_func=split_func)
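
You can also supply your own splitting logic. The sketch below is purely illustrative and assumes the mask callable receives a dataframe (or a partition of it) and returns a boolean Series marking the training rows; check the bigtabular docs for the exact contract:

# Hypothetical deterministic ~80/20 split keyed on a stable column, so the
# same rows land in the training set on every run:
def hash_train_mask(df):
    return (df['fnlwgt'] % 5) != 0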

By comparison, to show the similarity between the APIs, this is the TabularPandas equivalent on which TabularDask is based:

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to_ = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                    cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                    cont_names=['age', 'fnlwgt', 'education-num'],
                    y_names='salary',
                    splits=splits)
/home/stefan/anaconda3/envs/bigtabular/lib/python3.10/site-packages/fastai/tabular/core.py:312: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)

Once we build our TabularDask object, our data is completely preprocessed as seen below:

to.xs.head()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 5 8 3 0 6 5 1 0.763014 -0.838956 0.752354
1 5 13 1 5 2 5 1 0.396491 0.444738 1.534279
2 5 12 1 0 5 3 2 -0.043336 -0.887631 -0.029572
3 6 15 3 11 1 2 1 -0.043336 -0.729692 1.925242
4 7 6 3 9 6 3 2 0.249882 -1.019274 -0.029572
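
The .head() call above only computes the first rows: since the data lives in a Dask dataframe, the preprocessing stays lazy. When you really need everything materialized, compute it explicitly:

# Pull the fully processed data into a regular pandas DataFrame (only do
# this when it actually fits in memory):
processed = to.xs.compute()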

Now we can build our DataLoaders again:

dls = to.dataloaders(bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

Later we will explore why using TabularDask to preprocess will be valuable.

The show_batch method works the same as in fastai.tabular:

dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101320.001104 12.0 >=50k
1 Private Masters Divorced Exec-managerial Not-in-family White False 44.0 236746.000557 14.0 >=50k
2 Private HS-grad Divorced #na# Unmarried Black True 38.0 96185.000211 10.0 <50k
3 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.0 112846.997183 15.0 >=50k
4 Self-emp-not-inc 7th-8th Married-civ-spouse Other-service Wife Black True 42.0 82296.999111 10.0 <50k
5 Private Some-college Divorced #na# Other-relative White False 49.0 44434.004438 10.0 <50k
6 Private 11th Married-civ-spouse #na# Husband White False 37.0 138940.000986 7.0 <50k
7 Self-emp-inc HS-grad Married-civ-spouse #na# Husband White True 36.0 216710.998684 10.0 >=50k
8 Private Bachelors Never-married #na# Own-child Black False 23.0 529222.998504 13.0 <50k
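
To see what a batch actually contains, you can grab one directly. This assumes batches follow the fastai.tabular layout of (categorical codes, continuous values, targets):

# One mini-batch as tensors: integer category codes, normalized continuous
# columns, and the encoded dependent variable
x_cat, x_cont, y = dls.one_batch()
x_cat.shape, x_cont.shape, y.shape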

We can define a model using the dask_learner method. The DaskLearner class inherits from the TabularLearner class. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your y’s may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = DaskCategoryBlock in your constructor so fastai won’t presume you are doing regression.
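
For instance, something along these lines (the ddf_enc variable is hypothetical, with salary already encoded as 0/1; this assumes TabularDask forwards y_block the same way TabularPandas does):

# Be explicit that an integer-encoded target is a classification problem:
to_enc = TabularDask(ddf_enc, procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
                     cat_names=['workclass', 'education'],
                     cont_names=['age', 'fnlwgt'],
                     y_names='salary', y_block=DaskCategoryBlock(),
                     train_mask_func=RandomTrainMask())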

learn = dask_learner(dls, metrics=accuracy)
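
Presumably the usual tabular_learner options pass straight through as well, for example a custom layer configuration (hypothetical variant, assuming dask_learner mirrors tabular_learner's signature):

# A wider/deeper MLP than the default architecture
learn_wide = dask_learner(dls, layers=[200, 100], metrics=accuracy)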

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

learn.fit_one_cycle(1)
epoch train_loss valid_loss accuracy time
0 0.367721 0.358477 0.838588 01:02
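
Since DaskLearner is a regular fastai Learner underneath, the standard training utilities should also be available, for example picking a learning rate before training (the lr_max value below is hypothetical, read off the resulting plot):

# Sweep learning rates and plot loss vs. lr
learn.lr_find()
# Then train with a value suggested by the plot
learn.fit_one_cycle(1, lr_max=1e-3)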

We can then have a look at some predictions:

learn.show_results()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary salary_pred
0 5 12 5 7 4 5 1 -1.362819 -1.200198 -0.420534 0 0
1 5 12 3 4 1 5 1 0.543100 1.311777 -0.420534 1 0
2 8 13 1 0 2 5 1 1.276146 0.798919 1.534279 0 0
3 5 10 3 11 1 5 2 0.469796 0.740680 -0.029572 1 1
4 5 7 1 13 2 5 2 0.543100 -0.684592 -0.029572 0 0
5 5 12 4 0 4 3 2 -0.703078 10.223143 -0.029572 0 0
6 5 12 3 4 1 5 1 0.763014 0.544722 -0.420534 1 0
7 6 13 3 5 1 5 1 1.202842 0.310791 1.534279 0 1
8 7 16 1 0 5 5 2 0.616405 0.226713 -0.029572 0 0

Or use the predict method on a row:

row, clas, probs = learn.predict(ddf.head().iloc[0])
row
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101320.001104 12.0
clas, probs
(tensor(1), tensor([0.3358, 0.6642]))

To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable in its columns.

test_ddf = ddf.copy()
test_ddf = test_ddf.drop(['salary'], axis=1)
dl = learn.dls.test_dl(test_ddf)

Then Learner.get_preds will give you the predictions:

learn.get_preds(dl=dl)
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)
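
The first tensor holds the per-class probabilities; take the argmax for class indices and map them back to labels (assuming the DataLoaders expose a vocab, as fastai's tabular DataLoaders do):

# Hard class predictions and their string labels
preds, _ = learn.get_preds(dl=dl)
pred_idx = preds.argmax(dim=1)
pred_labels = [dls.vocab[int(i)] for i in pred_idx]
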
Note

Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If your test data contains missing values or categories that differ from the training data, you should address this before training.

Export the learner: this saves its content, without the items and the optimizer state, for inference:

learn.export("model.pk")

We can then load the exported learner for inference:

trained_learner = load_learner("model.pk")
dl = trained_learner.dls.test_dl(test_ddf)
trained_learner.get_preds(dl=dl)
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)

Cleanup: delete the exported learner:

if os.path.exists("model.pk"): os.remove("model.pk")