# Extension of fastai.tabular for larger-than-memory datasets with Dask
This library replicates much of the functionality of the tabular data application in the fastai library, adapted to work with larger-than-memory datasets. Pandas, which fastai.tabular uses for data transformations, is replaced with Dask DataFrames.

Most of the Dask implementations were written as they were needed for a personal project and later refactored to match the fastai API more closely. The Jupyter notebooks follow those from fastai.tabular closely, and most of the examples and tests have been replicated.
## When not to use BigTabular
Don’t use this library when you don’t need Dask. The Dask website gives the following guidance:

> Dask DataFrames are often used either when:
>
> - Your data is too big
> - Your computation is too slow and other techniques don’t work
>
> You should probably stick to just using pandas if:
>
> - Your data is small
> - Your computation is fast (subsecond)
> - There are simpler ways to accelerate your computation, like avoiding `.apply` or Python for loops and using a built-in pandas method instead.
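The last point can make a large difference on its own. A small illustration of replacing a row-wise `.apply` with a built-in vectorised pandas operation (the data here is invented):

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 80.0], "qty": [2, 1, 5]})

# Slow: a Python-level function called once per row
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: a single vectorised column operation
fast = df["price"] * df["qty"]

assert slow.equals(fast)  # same result: [200.0, 250.0, 400.0]
```

If a rewrite like this brings the computation back to interactive speeds, there is no need for Dask at all.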
## Install

```sh
pip install bigtabular
```
## How to use
Refer to the tutorial for a more detailed usage example.
Create the dataloaders. Some of the columns are continuous (like age), and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert each of them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the `DaskDataLoaders` factory methods:
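As a rough sketch of what this encoding means, here is the idea in plain pandas (this is not the library's own transform, just an illustration; the column names follow the adult dataset used in the fastai examples):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# Continuous columns stay numeric and are fed to the model directly
df["age"] = df["age"].astype("float32")

# Categorical columns become unique integer indices for embedding layers;
# adding 1 mirrors fastai's convention of reserving 0 for missing values
df["workclass"] = df["workclass"].astype("category").cat.codes + 1
print(df["workclass"].tolist())  # → [3, 2, 1]
```

The library's own transforms handle this mapping (and its inverse for decoding) across all partitions of a Dask DataFrame.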