# Tutorial: Tabular training with Dask

```python
from fastai.tabular.all import *
from bigtabular.core import *
from bigtabular.data import *
from bigtabular.learner import *
import dask.dataframe as dd
```
To illustrate the tabular application, we will use the example of the Adult dataset where we have to predict if a person is earning more or less than $50k per year using some general data.
This is a small dataset that can easily be processed in memory by Pandas. In practice, fast.ai's `TabularPandas` should be used when the data can be handled with Pandas. This tutorial only illustrates the functionality of bigtabular and shows the similarity to the fastai.tabular API. The guidance from the Dask documentation applies:
> Dask DataFrames are often used either when …
>
> - Your data is too big
> - Your computation is too slow and other techniques don’t work
>
> You should probably stick to just using pandas if …
>
> - Your data is small
> - Your computation is fast (subsecond)
> - There are simpler ways to accelerate your computation, like avoiding `.apply` or Python for loops and using a built-in pandas method instead.
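When the data really is too big for Pandas, you would typically read it into Dask directly rather than going through a Pandas dataframe first. A minimal sketch (the file pattern and `blocksize` value are illustrative, not part of this tutorial's dataset):

```python
import dask.dataframe as dd

# Read a set of large CSVs straight into a partitioned Dask DataFrame.
# blocksize controls the size of each partition; 64 MB is illustrative.
big_ddf = dd.read_csv('data/large-*.csv', blocksize='64MB')
big_ddf.npartitions  # number of lazy partitions Dask will process
```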
We can download a sample of this dataset with the usual untar_data command:
```python
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
```

```
(#3) [Path('/home/stefan/.fastai/data/adult_sample/export.pkl'),Path('/home/stefan/.fastai/data/adult_sample/adult.csv'),Path('/home/stefan/.fastai/data/adult_sample/models')]
```
Then we can load the data into a Dask dataframe and have a look at how it is structured:
```python
df = pd.read_csv(path/'adult.csv')
ddf = dd.from_pandas(df)
ddf.head()
```

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
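Dask splits the dataframe into partitions and evaluates operations lazily. A quick way to see this, using standard Dask attributes (nothing bigtabular-specific):

```python
# The dataframe is a lazy, partitioned collection: no data is materialized
# until an operation like .head() or .compute() forces it.
ddf.npartitions, ddf.dtypes
```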
Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the `DaskDataLoaders` factory methods:
```python
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [DaskCategorify, DaskFillMissing, DaskNormalize])
```

```
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
```
The last part is the list of pre-processors we apply to our data:
- `DaskCategorify` is going to take every categorical variable and make a map from integer to unique categories, then replace the values by the corresponding index.
- `DaskFillMissing` will fill the missing values in the continuous variables by the median of existing values (you can choose a specific value if you prefer).
- `DaskNormalize` will normalize the continuous variables (subtract the mean and divide by the std).
These processors are Dask-compatible versions of `Categorify`, `FillMissing` and `Normalize` in `fastai.tabular`.
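Conceptually, each processor boils down to ordinary Dask operations. A rough sketch of the idea (not bigtabular's actual implementation):

```python
# Normalize: subtract the mean and divide by the std of a continuous column.
mean, std = ddf['age'].mean().compute(), ddf['age'].std().compute()
normalized = (ddf['age'] - mean) / std

# Categorify: map each unique category to an integer index
# (index 0 left free for missing values, as in fastai's Categorify).
classes = sorted(ddf['workclass'].dropna().unique().compute())
codes = {c: i + 1 for i, c in enumerate(classes)}
encoded = ddf['workclass'].map(codes, meta=('workclass', 'int64'))
```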
To further expose what’s going on below the surface, let’s rewrite this using the `TabularDask` class. We will need to make one adjustment, which is defining how we want to split our data. By default, the factory method above used a random 80/20 split, so we will do the same:
```python
split_func = RandomTrainMask()
to = TabularDask(ddf, procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    train_mask_func=split_func)
```

By comparison, to show the similarity between the APIs, this is the `TabularPandas` equivalent on which `TabularDask` is based:
```python
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to_ = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)
```

```
/home/stefan/anaconda3/envs/bigtabular/lib/python3.10/site-packages/fastai/tabular/core.py:312: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
  to[n].fillna(self.na_dict[n], inplace=True)
```
Once we build our `TabularDask` object, our data is completely preprocessed, as seen below:
```python
to.xs.head()
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 8 | 3 | 0 | 6 | 5 | 1 | 0.763014 | -0.838956 | 0.752354 |
| 1 | 5 | 13 | 1 | 5 | 2 | 5 | 1 | 0.396491 | 0.444738 | 1.534279 |
| 2 | 5 | 12 | 1 | 0 | 5 | 3 | 2 | -0.043336 | -0.887631 | -0.029572 |
| 3 | 6 | 15 | 3 | 11 | 1 | 2 | 1 | -0.043336 | -0.729692 | 1.925242 |
| 4 | 7 | 6 | 3 | 9 | 6 | 3 | 2 | 0.249882 | -1.019274 | -0.029572 |
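The categorical columns are now integer codes, which is exactly what an embedding layer consumes. A conceptual PyTorch sketch (the sizes are illustrative; this is not the model fastai builds):

```python
import torch, torch.nn as nn

# 'workclass' has a small vocabulary of codes (0 reserved for missing),
# so each code is looked up in a trainable embedding table.
emb = nn.Embedding(num_embeddings=10, embedding_dim=6)  # sizes illustrative
codes = torch.tensor([5, 5, 5, 6, 7])  # encoded 'workclass' values from above
vectors = emb(codes)                   # shape: (5, 6)
```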
Now we can build our DataLoaders again:
```python
dls = to.dataloaders(bs=64)
```

```
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
```
Later we will explore why using `TabularDask` to preprocess will be valuable.
The `show_batch` method works the same as in `fastai.tabular`:
```python
dls.show_batch()
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.001104 | 12.0 | >=50k |
| 1 | Private | Masters | Divorced | Exec-managerial | Not-in-family | White | False | 44.0 | 236746.000557 | 14.0 | >=50k |
| 2 | Private | HS-grad | Divorced | #na# | Unmarried | Black | True | 38.0 | 96185.000211 | 10.0 | <50k |
| 3 | Self-emp-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | False | 38.0 | 112846.997183 | 15.0 | >=50k |
| 4 | Self-emp-not-inc | 7th-8th | Married-civ-spouse | Other-service | Wife | Black | True | 42.0 | 82296.999111 | 10.0 | <50k |
| 5 | Private | Some-college | Divorced | #na# | Other-relative | White | False | 49.0 | 44434.004438 | 10.0 | <50k |
| 6 | Private | 11th | Married-civ-spouse | #na# | Husband | White | False | 37.0 | 138940.000986 | 7.0 | <50k |
| 7 | Self-emp-inc | HS-grad | Married-civ-spouse | #na# | Husband | White | True | 36.0 | 216710.998684 | 10.0 | >=50k |
| 8 | Private | Bachelors | Never-married | #na# | Own-child | Black | False | 23.0 | 529222.998504 | 13.0 | <50k |
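`show_batch` decodes the data back to human-readable values; the tensors the model actually receives can be inspected with `one_batch`. We assume here that the Dask loaders yield fastai's usual tabular batch structure (categorical codes, continuous values, targets):

```python
# One preprocessed batch of tensors, assuming fastai's tabular layout:
# integer category codes, normalized continuous values, and targets.
x_cat, x_cont, y = dls.one_batch()
x_cat.shape, x_cont.shape, y.shape
```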
We can define a model using the `dask_learner` method. The `DaskLearner` class inherits from the `TabularLearner` class. When we define our model, fastai will try to infer the loss function based on the `y_names` we specified earlier.
Note: Sometimes with tabular data, your y’s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = DaskCategoryBlock` in your constructor so fastai won’t presume you are doing regression.
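For example, if the salary column already held 0/1 codes, the factory call might look like this (a hedged sketch; the dataset used here has string labels, so this variant is purely illustrative):

```python
# Hypothetical variant: target already encoded as 0/1, so we state the
# block type explicitly instead of letting fastai infer regression.
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[DaskCategorify, DaskFillMissing, DaskNormalize],
    y_block=DaskCategoryBlock())
```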
```python
learn = dask_learner(dls, metrics=accuracy)
```

And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won’t be useful here since we don’t have a pretrained model).
```python
learn.fit_one_cycle(1)
```

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.367721 | 0.358477 | 0.838588 | 01:02 |
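Since `DaskLearner` inherits from fastai's `Learner`, the usual training utilities should carry over. For instance, running the learning-rate finder before a longer training run (an assumption on our part; this step is not shown in the original tutorial):

```python
# Assumes fastai's standard Learner API is available on DaskLearner.
suggested = learn.lr_find()
learn.fit_one_cycle(3, suggested.valley)
```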
We can then have a look at some predictions:
```python
learn.show_results()
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 12 | 5 | 7 | 4 | 5 | 1 | -1.362819 | -1.200198 | -0.420534 | 0 | 0 |
| 1 | 5 | 12 | 3 | 4 | 1 | 5 | 1 | 0.543100 | 1.311777 | -0.420534 | 1 | 0 |
| 2 | 8 | 13 | 1 | 0 | 2 | 5 | 1 | 1.276146 | 0.798919 | 1.534279 | 0 | 0 |
| 3 | 5 | 10 | 3 | 11 | 1 | 5 | 2 | 0.469796 | 0.740680 | -0.029572 | 1 | 1 |
| 4 | 5 | 7 | 1 | 13 | 2 | 5 | 2 | 0.543100 | -0.684592 | -0.029572 | 0 | 0 |
| 5 | 5 | 12 | 4 | 0 | 4 | 3 | 2 | -0.703078 | 10.223143 | -0.029572 | 0 | 0 |
| 6 | 5 | 12 | 3 | 4 | 1 | 5 | 1 | 0.763014 | 0.544722 | -0.420534 | 1 | 0 |
| 7 | 6 | 13 | 3 | 5 | 1 | 5 | 1 | 1.202842 | 0.310791 | 1.534279 | 0 | 1 |
| 8 | 7 | 16 | 1 | 0 | 5 | 5 | 2 | 0.616405 | 0.226713 | -0.029572 | 0 | 0 |
Or use the `predict` method on a row:
```python
row, clas, probs = learn.predict(ddf.head().iloc[0])
row
```

|   | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.001104 | 12.0 |
```python
clas, probs
```

```
(tensor(1), tensor([0.3358, 0.6642]))
```
To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to have the dependent variable in its columns.
```python
test_ddf = ddf.copy()
test_ddf = test_ddf.drop(['salary'], axis=1)
dl = learn.dls.test_dl(test_ddf)
```

Then `Learner.get_preds` will give you the predictions:
```python
learn.get_preds(dl=dl)
```

```
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)
```
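`get_preds` returns raw probabilities. To turn them into class labels, you can look the argmax of each row up in the vocabulary learned during training (assuming the fastai-style `dls.vocab` is available on the Dask loaders):

```python
preds, _ = learn.get_preds(dl=dl)
# Map each row's highest-probability index back to its label,
# assuming the loaders expose fastai's usual `vocab`.
labels = [learn.dls.vocab[i] for i in preds.argmax(dim=1)]
```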
Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.
To export the learner for inference, we save its content without the items and the optimizer state:
```python
learn.export("model.pk")
```

We can then load the exported learner for inference:
```python
trained_learner = load_learner("model.pk")
dl = trained_learner.dls.test_dl(test_ddf)
trained_learner.get_preds(dl=dl)
```

```
(tensor([[0.3358, 0.6642],
         [0.4716, 0.5284],
         [0.9259, 0.0741],
         ...,
         [0.6240, 0.3760],
         [0.6896, 0.3104],
         [0.7852, 0.2148]]),
 None)
```
Cleanup: delete the exported learner:
```python
if os.path.exists("model.pk"): os.remove("model.pk")
```