BigTabular core

Basic functions to preprocess larger-than-memory tabular data with Dask before assembling it in DataLoaders.

Initial preprocessing

Define Dask versions of the make_date, add_datepart, and add_elapsed_times functions defined in tabular.core. The dask_make_date function uses Dask’s to_datetime function rather than the Pandas version. The dask_add_datepart and dask_add_elapsed_times functions wrap add_datepart and add_elapsed_times, respectively, in the Dask map_partitions function.
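As a rough, pandas-only illustration of the partition-wise idea behind map_partitions: a function that only looks at the values in each row gives the same result whether it runs on the whole frame or chunk by chunk. The helpers add_year and apply_by_chunk are hypothetical names made up for this sketch; it emulates the concept and is not how Dask or BigTabular implement it.

```python
import pandas as pd

def add_year(df, field_name):
    """Row-wise feature: the result depends only on the value in each row."""
    out = df.copy()
    out['Year'] = pd.to_datetime(out[field_name]).dt.year
    return out

def apply_by_chunk(df, func, chunk_size, **kwargs):
    """Apply `func` to consecutive row chunks and stitch the results together."""
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
    return pd.concat([func(chunk, **kwargs) for chunk in chunks])

df = pd.DataFrame({'date': ['2019-12-04', '2018-11-29', '2019-11-15', '2017-10-24']})

whole = add_year(df, 'date')                                    # one pass over all rows
chunked = apply_by_chunk(df, add_year, chunk_size=2, field_name='date')  # two chunks of two rows

same = whole.equals(chunked)
```

Because each chunk is processed independently, the full dataset never has to be held in memory at once, which is the property the Dask wrappers rely on.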

dask_make_date

 dask_make_date (ddf, date_field)

Convert ddf[date_field] to date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
dask_make_date(ddf, 'date')
test_eq(ddf['date'].dtype, np.dtype('datetime64[ns]'))

dask_add_datepart

 dask_add_datepart (ddf, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of ddf

For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'date')
ddf.head()

	Year	Month	Week	Day	Dayofweek	Dayofyear	Is_month_end	Is_month_start	Is_quarter_end	Is_quarter_start	Is_year_end	Is_year_start	Elapsed
0	2019.0	12.0	49.0	4.0	2.0	338.0	False	False	False	False	False	False	1.575418e+09
1	NaN	NaN	NaN	NaN	NaN	NaN	False	False	False	False	False	False	NaN
2	2019.0	11.0	46.0	15.0	4.0	319.0	False	False	False	False	False	False	1.573776e+09
3	2019.0	10.0	43.0	24.0	3.0	297.0	False	False	False	False	False	False	1.571875e+09
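To make the generated columns less opaque, here is a minimal pandas sketch of how a few of them can be derived from a datetime column. The function sketch_datepart is a hypothetical name for this illustration, it covers only a subset of the columns, and it is not the add_datepart implementation.

```python
import pandas as pd

def sketch_datepart(dates):
    """Derive a few calendar features from an iterable of date strings."""
    s = pd.to_datetime(pd.Series(dates))
    return pd.DataFrame({
        'Year': s.dt.year,
        'Month': s.dt.month,
        'Week': s.dt.isocalendar().week,     # ISO week number
        'Day': s.dt.day,
        'Dayofweek': s.dt.dayofweek,         # Monday=0 ... Sunday=6
        'Dayofyear': s.dt.dayofyear,
        'Is_month_start': s.dt.is_month_start,
        # Whole seconds since the Unix epoch, independent of datetime resolution.
        'Elapsed': (s - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s'),
    })

features = sketch_datepart(['2019-12-04', '2019-11-15', '2019-11-01'])
```

For the first date, this gives the same Year, Month, Week, Day, Dayofweek, and Dayofyear values as row 0 of the table above.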

dask_add_elapsed_times

 dask_add_elapsed_times (ddf, field_names, date_field, base_field)

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
ddf = dd.from_pandas(df)
ddf = dask_add_elapsed_times(ddf, ['event'], 'date', 'base')
ddf.head()

	date	event	base	Afterevent	Beforeevent	event_bw	event_fw
0	2019-12-04	False	1	5	0	1.0	0.0
1	2019-11-29	True	1	0	0	1.0	1.0
2	2019-11-15	False	2	22	0	1.0	0.0
3	2019-10-24	True	2	0	0	1.0	1.0

dask_cont_cat_split

 dask_cont_cat_split (df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

We also define a Dask version of the cont_cat_split function. The only difference from the original function is that it calls compute on the Dask dataframe to determine the cardinality of the columns. The function decides whether a column is continuous or categorical based on its dtype and the cardinality of its values: a float column, or an integer column with more than max_card unique values, is added to cont_names; every other column is added to cat_names. An example is below:

# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
ddf = dd.from_pandas(df)
cont_names, cat_names = dask_cont_cat_split(ddf)

cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']

# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'd1_date', drop=False)

ddf['cat1'] = ddf['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)

cont_names, cat_names = dask_cont_cat_split(ddf, max_card=0)

cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
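The splitting rule can be sketched in plain pandas as follows. This is an illustration of the dtype and cardinality check as described above, under the assumption that only integer columns are subject to the max_card test; sketch_cont_cat_split is a hypothetical name, and the real Dask version additionally has to call compute to count unique values.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def sketch_cont_cat_split(df, max_card=20, dep_var=None):
    """Return (cont_names, cat_names) using a dtype + cardinality rule."""
    # Accept a single target name or a list of names.
    skip = {dep_var} if isinstance(dep_var, str) else set(dep_var or [])
    cont_names, cat_names = [], []
    for col in df.columns:
        if col in skip:
            continue
        dtype = df[col].dtype
        high_card_int = is_integer_dtype(dtype) and df[col].nunique() > max_card
        if is_float_dtype(dtype) or high_card_int:
            cont_names.append(col)
        else:
            cat_names.append(col)
    return cont_names, cat_names

df = pd.DataFrame({
    'code': [1, 2, 3, 4],             # low-cardinality integer
    'price': [1.5, 2.0, 3.25, 2.0],   # float
    'size': ['a', 'b', 'b', 'a'],     # string
    'y': [1, 0, 1, 0],                # target, excluded via dep_var
})

cont_default, cat_default = sketch_cont_cat_split(df, dep_var='y')
cont_zero, cat_zero = sketch_cont_cat_split(df, max_card=0, dep_var='y')
```

With the default max_card the integer column is treated as categorical; with max_card=0 it becomes continuous, while the string column stays categorical either way.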

get_random_train_mask

 get_random_train_mask (df, train_frac=0.8, seed=None)

RandomTrainMask

 RandomTrainMask (train_frac=0.8, seed=None)

A class to create a random train/validation set mask over the Dask dataframe.
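A random train/validation mask can be sketched like this in pandas and NumPy. The function below mirrors the get_random_train_mask signature shown above, but its body is an assumption made for illustration (the name sketch_random_train_mask is hypothetical), not the library's implementation.

```python
import numpy as np
import pandas as pd

def sketch_random_train_mask(df, train_frac=0.8, seed=None):
    """Return a boolean Series: True for training rows, False for validation rows."""
    rng = np.random.default_rng(seed)
    # Each row is assigned to the training set independently with probability train_frac.
    return pd.Series(rng.random(len(df)) < train_frac, index=df.index)

df = pd.DataFrame({'x': range(10_000)})
mask = sketch_random_train_mask(df, train_frac=0.8, seed=42)
train_df, valid_df = df[mask], df[~mask]
```

Drawing one random number per row needs no knowledge of the other rows, so a mask of this kind can be computed partition by partition. The resulting split is only approximately train_frac, not exact.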

TabularDask

 TabularDask (ddf, procs=None, cat_names=None, cont_names=None,
              y_names=None, y_block=None, train_mask_func=None,
              do_setup=True, device=None, reset_index=True)

A Dask DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __iter__. The aim is to replicate the TabularPandas API as closely as possible.

ddf: A Dask DataFrame of your data
cat_names: Your categorical x variables
cont_names: Your continuous x variables
y_names: Your dependent y variables
- Note: Mixing y types, such as regression and classification, is not currently supported; multiple regression outputs or multiple classification outputs are.
y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
train_mask_func: A function that creates a train/validation mask over a DataFrame. See get_random_train_mask for an example.
do_setup: Whether TabularDask runs the data through the procs upon initialization
device: cuda or cpu

Transforms

These transforms inherit from TabularProc and are applied as soon as the data is available, rather than when data is requested from the DataLoader.

DaskCategoryMap

 DaskCategoryMap (col, sort=True, add_na=False, strict=False)

Dask implementation of CategoryMap. Collection of categories with the reverse mapping in o2i

DaskCategorify

 DaskCategorify (cat_vocabs:"'dict|None'"=None)

Transform the categorical variables to something similar to pd.Categorical

The Categorify class from fastai.tabular.core is modified to:
- be compatible with Dask
- accept existing vocabs through the cat_vocabs input

While you will not see a visible change in the DataFrame, the classes are stored in to.procs.dask_categorify, as we can see below on a dummy DataFrame:

ddf = dd.from_pandas(pd.DataFrame({'a':[0,1,2,0,2]}))
to = TabularDask(ddf, DaskCategorify, 'a')
to.show()

	a
0	0
1	1
2	2
3	0
4	2

Each column’s unique values are stored in a dictionary of column:[values]:

cat = to.procs.dask_categorify
cat.classes

{'a': ['#na#', 0, 1, 2]}

We can provide an existing vocab if one is available, for example if pretrained weights will be used for a categorical variable:

ddf = dd.from_pandas(pd.DataFrame({'a':['Cat','Dog','Lion','Leopard','Honey badger']}))

With default vocab:

to = TabularDask(ddf, DaskCategorify, 'a')
cat = to.procs.dask_categorify
cat.classes

{'a': ['#na#', 'Cat', 'Dog', 'Honey badger', 'Leopard', 'Lion']}

With predefined vocab:

vocab = {'a': ['Honey badger', 'Dog', 'Cat', 'Lion','Leopard']}
to = TabularDask(ddf, DaskCategorify(cat_vocabs=vocab), 'a')
cat = to.procs.dask_categorify
cat.classes

{'a': ['Honey badger', 'Dog', 'Cat', 'Lion', 'Leopard']}
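The effect of the vocab order on the integer codes can be sketched in plain pandas. The helpers sketch_vocab and sketch_encode are hypothetical names for this illustration; they reproduce the class lists shown above but are not the DaskCategorify implementation.

```python
import pandas as pd

NA_TOKEN = '#na#'

def sketch_vocab(series, add_na=True):
    """Build a sorted class list, optionally reserving index 0 for missing values."""
    classes = sorted(series.dropna().unique().tolist())
    return ([NA_TOKEN] if add_na else []) + classes

def sketch_encode(series, vocab):
    """Map each value to its position in vocab (the reverse mapping, o2i)."""
    o2i = {v: i for i, v in enumerate(vocab)}
    return series.map(o2i)

animals = pd.Series(['Cat', 'Dog', 'Lion', 'Leopard', 'Honey badger'])

default_vocab = sketch_vocab(animals)
default_codes = sketch_encode(animals, default_vocab).tolist()

custom_vocab = ['Honey badger', 'Dog', 'Cat', 'Lion', 'Leopard']
custom_codes = sketch_encode(animals, custom_vocab).tolist()
```

The same values receive different codes under the two vocabs, which is why a predefined vocab matters when embedding weights trained against a specific ordering are reused.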

DaskNormalize

 DaskNormalize (cols=None)

Base class to write a non-lazy tabular processor for dataframes
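As a hedged sketch of what normalizing continuous columns involves: statistics are computed once on the training rows and then reused for any other data. The function names below are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption of this sketch rather than a statement about DaskNormalize.

```python
import numpy as np
import pandas as pd

def sketch_fit_normalize(train_df, cols):
    """Compute per-column mean and standard deviation on the training rows."""
    return train_df[cols].mean(), train_df[cols].std(ddof=0)

def sketch_normalize(df, means, stds):
    """Standardize the fitted columns using previously computed statistics."""
    out = df.copy()
    cols = list(means.index)
    out[cols] = (out[cols] - means) / stds
    return out

def sketch_denormalize(df, means, stds):
    """Invert sketch_normalize, e.g. for displaying decoded values."""
    out = df.copy()
    cols = list(means.index)
    out[cols] = out[cols] * stds + means
    return out

train = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
means, stds = sketch_fit_normalize(train, ['a', 'b'])

normed = sketch_normalize(train, means, stds)
restored = sketch_denormalize(normed, means, stds)

# New data is scaled with the training statistics, not its own.
new_normed = sketch_normalize(pd.DataFrame({'a': [2.0], 'b': [50.0]}), means, stds)
```

Decoding involves floating-point arithmetic, so round-tripped values can differ from the originals in the last few digits.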

DaskCategorize

 DaskCategorize (vocab=None, sort=True, add_na=False)

A transform with a __repr__ that shows its attrs

DaskFillStrategy

 DaskFillStrategy ()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

DaskFillMissing

 DaskFillMissing (fill_strategy=<function median>, add_col=True,
                  fill_vals=None)

Fill the missing values in continuous columns.
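A minimal pandas sketch of median filling with an indicator column, loosely mirroring the fill_strategy and add_col parameters above. The function sketch_fill_median is a hypothetical name, and the details (for example, adding a <col>_na column for every requested column) are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def sketch_fill_median(train_df, df, cols, add_col=True):
    """Fill NaNs in `cols` with medians computed on the training rows."""
    fill_vals = {c: train_df[c].median() for c in cols}  # median ignores NaN
    out = df.copy()
    for c, v in fill_vals.items():
        if add_col:
            # Record which rows were originally missing before overwriting them.
            out[f'{c}_na'] = out[c].isna()
        out[c] = out[c].fillna(v)
    return out, fill_vals

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 10.0]})
filled, fill_vals = sketch_fill_median(df, df, ['x'])
```

Keeping the indicator column lets a model distinguish a value that was observed from one that was imputed.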

DaskRegressionSetup

 DaskRegressionSetup (c=None)

A Dask-compatible transform that floatifies targets

We define basic TransformBlocks that are compatible with the Dask transforms:

DaskCategoryBlock

 DaskCategoryBlock (vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                    sort:bool=True, add_na:bool=False)

A Dask-compatible TransformBlock for single-label categorical targets

	Type	Default	Details
vocab	MutableSequence | pd.Series	None	List of unique class names
sort	bool	True	Sort the classes alphabetically
add_na	bool	False	Add `#na#` to `vocab`

DaskRegressionBlock

 DaskRegressionBlock (n_out:int=None)

A Dask-compatible TransformBlock for float targets

	Type	Default	Details
n_out	int	None	Number of output values

DaskDataLoader

 DaskDataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False,
                 timeout=0, batch_size=None, shuffle=False,
                 drop_last=False, indexed=None, n=None, device=None,
                 persistent_workers=False, pin_memory_device='', wif=None,
                 before_iter=None, after_item=None, before_batch=None,
                 after_batch=None, after_iter=None, create_batches=None,
                 create_item=None, create_batch=None, retain=None,
                 get_idxs=None, sample=None, shuffle_fn=None,
                 do_batch=None)

Iterable dataloader for tabular learning with Dask

Integration example

For a more in-depth explanation, see the BigTabular tutorial.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
ddf_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	<NA>	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	<NA>	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]

to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names="salary", train_mask_func=RandomTrainMask()
)

dls = to.dataloaders()
dls.valid.show_batch()

/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Private	Masters	Divorced	Exec-managerial	Not-in-family	White	False	44.0	236745.998760	14.0	>=50k
1	Self-emp-inc	Prof-school	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	False	38.0	112847.001163	15.0	>=50k
2	Private	HS-grad	Never-married	Handlers-cleaners	Own-child	White	False	20.0	63210.002966	9.0	<50k
3	Private	Bachelors	Never-married	#na#	Own-child	Black	False	23.0	529222.995495	13.0	<50k
4	Private	Assoc-voc	Married-civ-spouse	Sales	Husband	White	True	43.0	84660.997258	10.0	<50k
5	Private	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	49.0	247294.000118	9.0	>=50k
6	Private	11th	Married-civ-spouse	#na#	Husband	White	False	42.0	70055.004990	7.0	<50k
7	Private	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	45.0	242390.999669	13.0	>=50k
8	Private	Some-college	Never-married	Sales	Not-in-family	Black	True	41.0	140589.998619	10.0	<50k

to.show()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Private	Assoc-acdm	Married-civ-spouse	#na#	Wife	White	False	49.0	101320.0	12.0	>=50k
1	Private	Masters	Divorced	Exec-managerial	Not-in-family	White	False	44.0	236746.0	14.0	>=50k
2	Private	HS-grad	Divorced	#na#	Unmarried	Black	True	38.0	96185.0	10.0	<50k
3	Self-emp-inc	Prof-school	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	False	38.0	112847.0	15.0	>=50k
4	Self-emp-not-inc	7th-8th	Married-civ-spouse	Other-service	Wife	Black	True	42.0	82297.0	10.0	<50k
5	Private	HS-grad	Never-married	Handlers-cleaners	Own-child	White	False	20.0	63210.0	9.0	<50k
6	Private	Some-college	Divorced	#na#	Other-relative	White	False	49.0	44434.0	10.0	<50k
7	Private	11th	Married-civ-spouse	#na#	Husband	White	False	37.0	138940.0	7.0	<50k
8	Private	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	46.0	328216.0	9.0	>=50k
9	Self-emp-inc	HS-grad	Married-civ-spouse	#na#	Husband	White	True	36.0	216711.0	10.0	>=50k

We can decode any set of transformed data by calling to.decode_row with our raw data:

row = to.items.head().iloc[0]
to.decode_row(row)

age                                49.0
workclass                       Private
fnlwgt                    101319.997963
education                    Assoc-acdm
education-num                      12.0
marital-status       Married-civ-spouse
occupation                         #na#
relationship                       Wife
race                              White
sex                              Female
capital-gain                          0
capital-loss                       1902
hours-per-week                       40
native-country            United-States
salary                              NaN
_int_train_mask                    True
education-num_na                  False
Name: 0, dtype: object

We can make new test datasets based on the training data with to.new().

Note

Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If your test data has missing values in columns where the training data had none, you should address this before training.
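One way to check for this ahead of time is to compare the categories in the test data against those seen in training. The helper below is a hypothetical pandas sketch, not part of BigTabular; for data that does not fit in memory, the unique values would first have to be computed with Dask.

```python
import pandas as pd

def find_unseen_categories(train_df, test_df, cat_names):
    """Return {column: [values present in test but absent from train]}."""
    unseen = {}
    for col in cat_names:
        known = set(train_df[col].dropna().unique())
        new = set(test_df[col].dropna().unique()) - known
        if new:
            unseen[col] = sorted(new)
    return unseen

train = pd.DataFrame({'race': ['White', 'Black', 'White'],
                      'sex': ['Male', 'Female', 'Male']})
test = pd.DataFrame({'race': ['White', 'Other'],
                     'sex': ['Female', None]})

unseen = find_unseen_categories(train, test, ['race', 'sex'])
```

Here only race is reported, because 'Other' never appears in the training rows; missing values are ignored by this particular check and need to be inspected separately.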

to_tst = to.new(ddf_test)
to_tst.process()
to_tst.items.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	education-num_na
0	0.467001	5	1.320081	10	1.174229	3	2	1	2	Male	0	0	40	Philippines	1
1	-0.923288	5	1.234092	12	-0.424305	3	15	1	4	Male	0	0	40	United-States	1
2	1.052386	5	0.144505	2	-1.223573	1	9	2	5	Female	0	0	37	United-States	1
3	0.540174	5	-0.283457	12	-0.424305	7	2	5	5	Female	0	0	43	United-States	1
4	0.759693	6	1.421398	9	0.374962	3	5	1	5	Male	0	0	60	United-States	1

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
0	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	Asian-Pac-Islander	False	45.000000	338104.994720	13.0
1	Private	HS-grad	Married-civ-spouse	Transport-moving	Husband	Other	False	26.000000	328663.004174	9.0
2	Private	11th	Divorced	Other-service	Not-in-family	White	False	53.000000	209021.999309	7.0
3	Private	HS-grad	Widowed	Adm-clerical	Unmarried	White	False	46.000000	162030.000870	9.0
4	Self-emp-inc	Assoc-voc	Married-civ-spouse	Exec-managerial	Husband	White	False	49.000000	349230.001375	11.0
5	Local-gov	Some-college	Married-civ-spouse	Exec-managerial	Husband	White	False	34.000000	124826.998224	10.0
6	Self-emp-inc	Some-college	Married-civ-spouse	Sales	Husband	White	False	53.000000	290640.003056	10.0
7	Private	Some-college	Never-married	Sales	Own-child	White	False	19.000000	106273.000912	10.0
8	Private	Some-college	Married-civ-spouse	Protective-serv	Husband	Black	False	72.000001	53684.002983	10.0

Other target types

Multi-label categories

one-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)

ddf_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	male	white
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	<NA>	Wife	White	Female	0	1902	40	United-States	True	False	True
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	True	True	True
2	38	Private	96185	HS-grad	NaN	Divorced	<NA>	Unmarried	Black	Female	0	0	32	United-States	False	False	False
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	True	True	False
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	False	False	False

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
y_names=["salary", "male", "white"]

to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names),
    train_mask_func=get_random_train_mask
)

CPU times: user 966 ms, sys: 0 ns, total: 966 ms
Wall time: 1 s

dls = to.dataloaders()
dls.valid.show_batch()

/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary	male	white
0	Private	Assoc-acdm	Married-civ-spouse	#na#	Wife	White	False	49.0	1.013200e+05	12.0	True	False	True
1	Private	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	46.0	3.282160e+05	9.0	True	True	True
2	State-gov	Masters	Divorced	#na#	Not-in-family	White	False	56.0	2.741110e+05	14.0	False	True	True
3	Private	Some-college	Married-civ-spouse	#na#	Wife	Black	True	40.0	1.889420e+05	10.0	False	False	False
4	Private	HS-grad	Married-spouse-absent	#na#	Own-child	Black	True	29.0	1.268339e+06	10.0	False	True	False
5	Self-emp-not-inc	Some-college	Divorced	#na#	Unmarried	White	True	47.0	2.137450e+05	10.0	False	False	True
6	Private	11th	Married-civ-spouse	#na#	Husband	White	False	42.0	7.005500e+04	7.0	False	True	True
7	Local-gov	HS-grad	Divorced	Adm-clerical	Unmarried	White	True	44.0	1.501710e+05	10.0	False	False	True
8	Private	Masters	Never-married	Exec-managerial	Not-in-family	White	True	29.0	1.572620e+05	10.0	False	False	True

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]

to = TabularDask(ddf_main, procs, cat_names, cont_names, y_names='age', train_mask_func=get_random_train_mask)

CPU times: user 860 ms, sys: 7.57 ms, total: 868 ms
Wall time: 871 ms

to.procs[-1].means

fnlwgt           192459.673567
education-num        10.071133
dtype: float64

dls = to.dataloaders()
dls.valid.show_batch()

/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

	workclass	education	marital-status	occupation	relationship	race	education-num_na	fnlwgt	education-num	age
0	Private	Bachelors	Married-civ-spouse	#na#	Husband	White	True	55291.003489	10.0	30.0
1	Private	HS-grad	Never-married	Handlers-cleaners	Own-child	Black	False	746431.990793	9.0	26.0
2	Private	HS-grad	Never-married	Sales	Other-relative	White	False	91524.997303	9.0	18.0
3	Private	Masters	Never-married	Exec-managerial	Not-in-family	White	True	157262.001416	10.0	29.0
4	Private	HS-grad	Married-civ-spouse	#na#	Husband	Amer-Indian-Eskimo	True	216811.000529	10.0	30.0
5	Private	HS-grad	Never-married	Sales	Own-child	White	True	156084.000249	10.0	36.0
6	Self-emp-not-inc	Doctorate	Married-civ-spouse	#na#	Husband	White	False	65278.004714	16.0	32.0
7	Local-gov	HS-grad	Never-married	Adm-clerical	Own-child	White	False	129232.000243	9.0	23.0
8	Private	HS-grad	Married-civ-spouse	Transport-moving	Wife	White	False	123397.002997	9.0	31.0