BigTabular core

Basic functions to preprocess larger-than-memory tabular data with Dask before assembling it in DataLoaders.

Initial preprocessing

Define Dask versions of the make_date, add_datepart, and add_elapsed_times functions defined in tabular.core. The dask_make_date function uses Dask’s to_datetime function rather than the pandas version. The dask_add_datepart and dask_add_elapsed_times functions simply wrap add_datepart and add_elapsed_times, respectively, in Dask’s map_partitions function.
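To make the wrapping pattern concrete, here is a minimal pandas-only sketch of what map_partitions does conceptually: a function written for a pandas DataFrame is applied to each partition independently, and the per-partition results are combined. The helpers add_year and map_partitions_sketch are hypothetical names introduced for illustration. The real Dask call is lazy and also infers output metadata, which this sketch ignores.

```python
import pandas as pd

def add_year(df, field_name):
    """A pandas-level function, standing in for something like add_datepart."""
    df = df.copy()
    df["Year"] = pd.to_datetime(df[field_name]).dt.year
    return df

def map_partitions_sketch(partitions, func, *args, **kwargs):
    """Hypothetical stand-in for Dask's map_partitions: apply func to every partition."""
    return [func(part, *args, **kwargs) for part in partitions]

df = pd.DataFrame({"date": ["2019-12-04", "2019-11-29", "2019-11-15", "2019-10-24"]})

# Emulate a Dask DataFrame with two partitions by slicing the pandas DataFrame.
partitions = [df.iloc[:2], df.iloc[2:]]

# Each partition is processed on its own, then the pieces are concatenated.
result = pd.concat(map_partitions_sketch(partitions, add_year, "date"))
print(result)
```

With an actual Dask DataFrame, the corresponding call would be along the lines of ddf.map_partitions(add_year, 'date').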



dask_make_date

 dask_make_date (ddf, date_field)

Convert ddf[date_field] to date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
dask_make_date(ddf, 'date')
test_eq(ddf['date'].dtype, np.dtype('datetime64[ns]'))


dask_add_datepart

 dask_add_datepart (ddf, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of ddf

For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'date')
ddf.head()
Year Month Week Day Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed
0 2019.0 12.0 49.0 4.0 2.0 338.0 False False False False False False 1.575418e+09
1 NaN NaN NaN NaN NaN NaN False False False False False False NaN
2 2019.0 11.0 46.0 15.0 4.0 319.0 False False False False False False 1.573776e+09
3 2019.0 10.0 43.0 24.0 3.0 297.0 False False False False False False 1.571875e+09
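The kind of per-row features shown above can be sketched in plain pandas with the .dt accessor. This is only an illustration of the idea, not the library's implementation: datepart_sketch is a hypothetical helper, it produces a subset of the columns, and it assumes Elapsed is measured in seconds since the Unix epoch (which matches the values in the output above).

```python
import pandas as pd

def datepart_sketch(df, field_name, prefix=""):
    """Hypothetical helper: derive a few date features from df[field_name]."""
    dates = pd.to_datetime(df[field_name])      # None becomes NaT
    out = df.drop(columns=[field_name])
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek   # Monday is 0
    out[prefix + "Dayofyear"] = dates.dt.dayofyear
    out[prefix + "Is_month_start"] = dates.dt.is_month_start
    # Seconds since 1970-01-01; missing dates stay NaN.
    out[prefix + "Elapsed"] = (dates - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
    return out

df = pd.DataFrame({"date": ["2019-12-04", None, "2019-11-15", "2019-10-24"]})
features = datepart_sketch(df, "date")
print(features)
```

Because the function only looks at one row at a time, it is the sort of transformation that can be applied partition by partition.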


dask_add_elapsed_times

 dask_add_elapsed_times (ddf, field_names, date_field, base_field)
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
ddf = dd.from_pandas(df)
ddf = dask_add_elapsed_times(ddf, ['event'], 'date', 'base')
ddf.head()
date event base Afterevent Beforeevent event_bw event_fw
0 2019-12-04 False 1 5 0 1.0 0.0
1 2019-11-29 True 1 0 0 1.0 1.0
2 2019-11-15 False 2 22 0 1.0 0.0
3 2019-10-24 True 2 0 0 1.0 1.0
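As a rough illustration of where a column like Afterevent could come from, the sketch below computes the number of days since the most recent event within each base group, using plain pandas. The helper days_since_last_event is hypothetical, and treating "no earlier event" as 0 is an assumption made for this sketch. It is not the library's implementation and covers only the "After" column.

```python
import pandas as pd

def days_since_last_event(df, event_col, date_col, group_col):
    """Hypothetical helper: days elapsed since the latest event in the same group."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    ordered = out.sort_values([group_col, date_col])
    # Keep the date only on rows where the event happened, NaT elsewhere.
    event_dates = ordered[date_col].where(ordered[event_col])
    # Carry the latest event date forward within each group.
    last_event = event_dates.groupby(ordered[group_col]).ffill()
    days = (ordered[date_col] - last_event).dt.days
    # Assignment aligns on the index, so the original row order is preserved.
    out["After" + event_col] = days.fillna(0).astype(int)
    return out

df = pd.DataFrame({"date": ["2019-12-04", "2019-11-29", "2019-11-15", "2019-10-24"],
                   "event": [False, True, False, True], "base": [1, 1, 2, 2]})
result = days_since_last_event(df, "event", "date", "base")
print(result)
```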


dask_cont_cat_split

 dask_cont_cat_split (df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

We also define a Dask version of the cont_cat_split function. The only difference from the original function is that it calls compute on the Dask dataframe to determine the cardinality of each column. A column is classified as continuous or categorical based on its dtype and cardinality: float columns, and integer columns with more than max_card unique values, are added to cont_names; all other columns are added to cat_names. An example is below:

# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
ddf = dd.from_pandas(df)
cont_names, cat_names = dask_cont_cat_split(ddf)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'd1_date', drop=False)

ddf['cat1'] = ddf['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)

cont_names, cat_names = dask_cont_cat_split(ddf, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
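The splitting rule described above can be sketched in plain pandas as follows. The name cont_cat_split_sketch is hypothetical. A Dask version would additionally need to call compute when counting unique values, since a lazy collection does not know its cardinality up front.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    """Hypothetical pandas-only sketch of the cont/cat splitting rule."""
    # Accept a single column name or a list of names for the dependent variable(s).
    skip = set([dep_var] if isinstance(dep_var, str) else (dep_var or []))
    cont_names, cat_names = [], []
    for name in df.columns:
        if name in skip:
            continue
        col = df[name]
        is_cont = is_float_dtype(col.dtype) or (
            is_integer_dtype(col.dtype) and col.nunique() > max_card
        )
        (cont_names if is_cont else cat_names).append(name)
    return cont_names, cat_names

df = pd.DataFrame({"cat1": [1, 2, 3, 4], "cont1": [1., 2., 3., 2.],
                   "cat2": ["a", "b", "b", "a"], "y1": [1, 0, 1, 0]})

cont_names, cat_names = cont_cat_split_sketch(df, dep_var="y1")
print(cont_names, cat_names)

# With max_card=0 every integer column counts as continuous.
cont0, cat0 = cont_cat_split_sketch(df, max_card=0, dep_var="y1")
print(cont0, cat0)
```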


get_random_train_mask

 get_random_train_mask (df, train_frac=0.8, seed=None)


RandomTrainMask

 RandomTrainMask (train_frac=0.8, seed=None)


A class to create a random train/validation set mask over the Dask dataframe.
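A random train/validation mask of this kind can be sketched as a function that returns one boolean per row, True for training rows. The function below is a hypothetical pandas/NumPy illustration that mirrors the signature of get_random_train_mask. The actual implementation over Dask partitions may differ.

```python
import numpy as np
import pandas as pd

def random_train_mask_sketch(df, train_frac=0.8, seed=None):
    """Hypothetical sketch: each row lands in the training set with probability train_frac."""
    rng = np.random.default_rng(seed)
    return pd.Series(rng.random(len(df)) < train_frac, index=df.index)

df = pd.DataFrame({"x": range(1000)})
mask = random_train_mask_sketch(df, train_frac=0.8, seed=0)

train_df, valid_df = df[mask], df[~mask]
print(len(train_df), len(valid_df))
```

Note that the split is only approximately 80/20, because every row is drawn independently. Passing a seed makes the mask reproducible.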



TabularDask

 TabularDask (ddf, procs=None, cat_names=None, cont_names=None,
              y_names=None, y_block=None, train_mask_func=None,
              do_setup=True, device=None, reset_index=True)

A Dask DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __iter__. The aim is to replicate the TabularPandas API as closely as possible.

  • ddf: A Dask DataFrame of your data
  • cat_names: Your categorical x variables
  • cont_names: Your continuous x variables
  • y_names: Your dependent y variables
    • Note: Mixed y’s, such as regression and classification together, are not currently supported; multiple regression or multiple classification outputs are
  • y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
  • train_mask_func: A function that creates a train/validation mask over a DataFrame. See get_random_train_mask for an example.
  • do_setup: Whether TabularDask runs the data through the procs upon initialization
  • device: cuda or cpu

Transforms

These transforms inherit from TabularProc and are applied as soon as the data is available, rather than as the data is called from the DataLoader.



DaskCategoryMap

 DaskCategoryMap (col, sort=True, add_na=False, strict=False)

Dask implementation of CategoryMap. Collection of categories with the reverse mapping in o2i



DaskCategorify

 DaskCategorify (cat_vocabs:dict|None=None)

Transform the categorical variables to something similar to pd.Categorical

The Categorify class from fastai.tabular.core is modified to:

  • be compatible with Dask
  • accept existing vocabs through the cat_vocabs input

While visually in the DataFrame you will not see a change, the classes are stored in to.procs.dask_categorify, as we can see below on a dummy DataFrame:

ddf = dd.from_pandas(pd.DataFrame({'a':[0,1,2,0,2]}))
to = TabularDask(ddf, DaskCategorify, 'a')
to.show()
a
0 0
1 1
2 2
3 0
4 2

Each column’s unique values are stored in a dictionary of column:[values]:

cat = to.procs.dask_categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}

We can provide an existing vocab if one is available, for example if pretrained weights will be used for a categorical variable:

ddf = dd.from_pandas(pd.DataFrame({'a':['Cat','Dog','Lion','Leopard','Honey badger']}))

With default vocab:

to = TabularDask(ddf, DaskCategorify, 'a')
cat = to.procs.dask_categorify
cat.classes
{'a': ['#na#', 'Cat', 'Dog', 'Honey badger', 'Leopard', 'Lion']}

With predefined vocab:

vocab = {'a': ['Honey badger', 'Dog', 'Cat', 'Lion','Leopard']}
to = TabularDask(ddf, DaskCategorify(cat_vocabs=vocab), 'a')
cat = to.procs.dask_categorify
cat.classes
{'a': ['Honey badger', 'Dog', 'Cat', 'Lion', 'Leopard']}
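The two behaviours shown above (a sorted default vocab with #na# prepended, versus a predefined vocab used as given) can be sketched in plain Python. The helper build_vocab and its add_na default are hypothetical and chosen to reproduce the outputs above. The dictionary o2i plays the role of the reverse mapping mentioned for DaskCategoryMap.

```python
def build_vocab(values, vocab=None, add_na=True):
    """Hypothetical sketch: return (classes, o2i) for one categorical column."""
    if vocab is not None:
        # A predefined vocab is kept exactly as provided, including its order.
        classes = list(vocab)
    else:
        classes = sorted({v for v in values if v is not None})
        if add_na:
            classes = ["#na#"] + classes
    # Reverse mapping from category to integer index.
    o2i = {c: i for i, c in enumerate(classes)}
    return classes, o2i

animals = ["Cat", "Dog", "Lion", "Leopard", "Honey badger"]

classes, o2i = build_vocab(animals)
print(classes)

fixed_classes, fixed_o2i = build_vocab(
    animals, vocab=["Honey badger", "Dog", "Cat", "Lion", "Leopard"]
)
print(fixed_classes)
```

Keeping the predefined order matters when pretrained embedding weights are reused, since each row of the embedding matrix is tied to a specific index.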


DaskNormalize

 DaskNormalize (cols=None)

Base class to write a non-lazy tabular processor for dataframes



DaskCategorize

 DaskCategorize (vocab=None, sort=True, add_na=False)

A transform with a __repr__ that shows its attrs



DaskFillStrategy

 DaskFillStrategy ()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.



DaskFillMissing

 DaskFillMissing (fill_strategy=<function median>, add_col=True,
                  fill_vals=None)

Fill the missing values in continuous columns.
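A median fill with an indicator column can be sketched in pandas as below. The helper fill_missing_sketch is hypothetical. It assumes, in line with the education-num_na column that appears in the integration example further down, that an indicator column is only added for columns that actually contain missing values.

```python
import pandas as pd

def fill_missing_sketch(df, cols, add_col=True, fill_vals=None):
    """Hypothetical sketch: fill NaNs with the median (or a given constant)."""
    out = df.copy()
    fill_vals = dict(fill_vals or {})
    for col in cols:
        missing = out[col].isna()
        if not missing.any():
            continue  # nothing to fill, so no indicator column either
        value = fill_vals.get(col, out[col].median())
        if add_col:
            # Remember which rows were originally missing.
            out[col + "_na"] = missing
        out[col] = out[col].fillna(value)
    return out

df = pd.DataFrame({"age": [49, 44, 38], "education-num": [12.0, None, 14.0]})
filled = fill_missing_sketch(df, ["age", "education-num"])
print(filled)
```

On a Dask DataFrame the fill value would have to be computed once up front, since a median over all partitions is a global statistic.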



DaskRegressionSetup

 DaskRegressionSetup (c=None)

A Dask-compatible transform that floatifies targets

We define basic TransformBlocks that are compatible with the Dask transforms:



DaskCategoryBlock

 DaskCategoryBlock (vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                    sort:bool=True, add_na:bool=False)

A Dask-compatible TransformBlock for single-label categorical targets

Name    Type                         Default  Details
vocab   MutableSequence | pd.Series  None     List of unique class names
sort    bool                         True     Sort the classes alphabetically
add_na  bool                         False    Add #na# to vocab


DaskRegressionBlock

 DaskRegressionBlock (n_out:int=None)

A Dask-compatible TransformBlock for float targets

Name   Type  Default  Details
n_out  int   None     Number of output values


DaskDataLoader

 DaskDataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False,
                 timeout=0, batch_size=None, shuffle=False,
                 drop_last=False, indexed=None, n=None, device=None,
                 persistent_workers=False, pin_memory_device='', wif=None,
                 before_iter=None, after_item=None, before_batch=None,
                 after_batch=None, after_iter=None, create_batches=None,
                 create_item=None, create_batch=None, retain=None,
                 get_idxs=None, sample=None, shuffle_fn=None,
                 do_batch=None)

Iterable dataloader for tabular learning with Dask

Integration example

For a more in-depth explanation, see the BigTabular tutorial.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
ddf_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names="salary", train_mask_func=RandomTrainMask()
)
dls = to.dataloaders()
dls.valid.show_batch()
/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Masters Divorced Exec-managerial Not-in-family White False 44.0 236745.998760 14.0 >=50k
1 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.0 112847.001163 15.0 >=50k
2 Private HS-grad Never-married Handlers-cleaners Own-child White False 20.0 63210.002966 9.0 <50k
3 Private Bachelors Never-married #na# Own-child Black False 23.0 529222.995495 13.0 <50k
4 Private Assoc-voc Married-civ-spouse Sales Husband White True 43.0 84660.997258 10.0 <50k
5 Private HS-grad Married-civ-spouse Craft-repair Husband White False 49.0 247294.000118 9.0 >=50k
6 Private 11th Married-civ-spouse #na# Husband White False 42.0 70055.004990 7.0 <50k
7 Private Bachelors Married-civ-spouse Exec-managerial Husband White False 45.0 242390.999669 13.0 >=50k
8 Private Some-college Never-married Sales Not-in-family Black True 41.0 140589.998619 10.0 <50k
to.show()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 101320.0 12.0 >=50k
1 Private Masters Divorced Exec-managerial Not-in-family White False 44.0 236746.0 14.0 >=50k
2 Private HS-grad Divorced #na# Unmarried Black True 38.0 96185.0 10.0 <50k
3 Self-emp-inc Prof-school Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander False 38.0 112847.0 15.0 >=50k
4 Self-emp-not-inc 7th-8th Married-civ-spouse Other-service Wife Black True 42.0 82297.0 10.0 <50k
5 Private HS-grad Never-married Handlers-cleaners Own-child White False 20.0 63210.0 9.0 <50k
6 Private Some-college Divorced #na# Other-relative White False 49.0 44434.0 10.0 <50k
7 Private 11th Married-civ-spouse #na# Husband White False 37.0 138940.0 7.0 <50k
8 Private HS-grad Married-civ-spouse Craft-repair Husband White False 46.0 328216.0 9.0 >=50k
9 Self-emp-inc HS-grad Married-civ-spouse #na# Husband White True 36.0 216711.0 10.0 >=50k

We can decode a row of transformed data by passing it to to.decode_row:

row = to.items.head().iloc[0]
to.decode_row(row)
age                                49.0
workclass                       Private
fnlwgt                    101319.997963
education                    Assoc-acdm
education-num                      12.0
marital-status       Married-civ-spouse
occupation                         #na#
relationship                       Wife
race                              White
sex                              Female
capital-gain                          0
capital-loss                       1902
hours-per-week                       40
native-country            United-States
salary                              NaN
_int_train_mask                    True
education-num_na                  False
Name: 0, dtype: object
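In the decoded row above, fnlwgt comes back as 101319.997963 rather than exactly 101320. One plausible explanation, stated here as an assumption rather than a confirmed detail of the library, is round-off from normalizing and de-normalizing in reduced precision. The sketch below illustrates the general effect with hypothetical normalize and denormalize helpers and a float32 intermediate.

```python
import numpy as np

def normalize(x, mean, std):
    """Standardize values to zero mean and unit variance."""
    return (x - mean) / std

def denormalize(z, mean, std):
    """Invert normalize."""
    return z * std + mean

values = np.array([101320.0, 236746.0, 96185.0, 112847.0, 82297.0])
mean, std = values.mean(), values.std()

# Assumption for this sketch: the normalized values are stored as float32.
encoded = normalize(values, mean, std).astype(np.float32)
decoded = denormalize(encoded.astype(np.float64), mean, std)

print(decoded)           # close to the original values
print(decoded - values)  # tiny round-off differences
```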

We can make new test datasets based on the training data with to.new():

Note

Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If your test data contains missing values in columns that had none in the training data, you should address this before training.

to_tst = to.new(ddf_test)
to_tst.process()
to_tst.items.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country education-num_na
0 0.467001 5 1.320081 10 1.174229 3 2 1 2 Male 0 0 40 Philippines 1
1 -0.923288 5 1.234092 12 -0.424305 3 15 1 4 Male 0 0 40 United-States 1
2 1.052386 5 0.144505 2 -1.223573 1 9 2 5 Female 0 0 37 United-States 1
3 0.540174 5 -0.283457 12 -0.424305 7 2 5 5 Female 0 0 43 United-States 1
4 0.759693 6 1.421398 9 0.374962 3 5 1 5 Male 0 0 60 United-States 1
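The note above about unseen categories can be illustrated with a small sketch. It assumes that values absent from the training vocab are mapped to index 0 (the #na# slot), as fastai's Categorify does when #na# is part of the vocab. Whether DaskCategorify behaves identically in every case is not confirmed by this page. The names train_vocab and test_values are made up for the example.

```python
# Vocab learned on the training data; index 0 is reserved for '#na#'.
train_vocab = ["#na#", "Cat", "Dog", "Lion"]
o2i = {c: i for i, c in enumerate(train_vocab)}

# Test data containing an unseen category ('Tiger') and a missing value.
test_values = ["Dog", "Tiger", None, "Cat"]

# Anything not in the training vocab falls back to index 0.
encoded = [o2i.get(v, 0) for v in test_values]
decoded = [train_vocab[i] for i in encoded]

print(encoded)
print(decoded)
```

After encoding, an unseen category is indistinguishable from a missing value, which is why such differences are best resolved before training.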

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num
0 Private Bachelors Married-civ-spouse Adm-clerical Husband Asian-Pac-Islander False 45.000000 338104.994720 13.0
1 Private HS-grad Married-civ-spouse Transport-moving Husband Other False 26.000000 328663.004174 9.0
2 Private 11th Divorced Other-service Not-in-family White False 53.000000 209021.999309 7.0
3 Private HS-grad Widowed Adm-clerical Unmarried White False 46.000000 162030.000870 9.0
4 Self-emp-inc Assoc-voc Married-civ-spouse Exec-managerial Husband White False 49.000000 349230.001375 11.0
5 Local-gov Some-college Married-civ-spouse Exec-managerial Husband White False 34.000000 124826.998224 10.0
6 Self-emp-inc Some-college Married-civ-spouse Sales Husband White False 53.000000 290640.003056 10.0
7 Private Some-college Never-married Sales Own-child White False 19.000000 106273.000912 10.0
8 Private Some-college Married-civ-spouse Protective-serv Husband Black False 72.000001 53684.002983 10.0

Other target types

Multi-label categories

One-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
ddf_main.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary male white
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse <NA> Wife White Female 0 1902 40 United-States True False True
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States True True True
2 38 Private 96185 HS-grad NaN Divorced <NA> Unmarried Black Female 0 0 32 United-States False False False
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States True True False
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States False False False
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
y_names=["salary", "male", "white"]
to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names),
    train_mask_func=get_random_train_mask
)
CPU times: user 966 ms, sys: 0 ns, total: 966 ms
Wall time: 1 s
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary male white
0 Private Assoc-acdm Married-civ-spouse #na# Wife White False 49.0 1.013200e+05 12.0 True False True
1 Private HS-grad Married-civ-spouse Craft-repair Husband White False 46.0 3.282160e+05 9.0 True True True
2 State-gov Masters Divorced #na# Not-in-family White False 56.0 2.741110e+05 14.0 False True True
3 Private Some-college Married-civ-spouse #na# Wife Black True 40.0 1.889420e+05 10.0 False False False
4 Private HS-grad Married-spouse-absent #na# Own-child Black True 29.0 1.268339e+06 10.0 False True False
5 Self-emp-not-inc Some-college Divorced #na# Unmarried White True 47.0 2.137450e+05 10.0 False False True
6 Private 11th Married-civ-spouse #na# Husband White False 42.0 7.005500e+04 7.0 False True True
7 Local-gov HS-grad Divorced Adm-clerical Unmarried White True 44.0 1.501710e+05 10.0 False False True
8 Private Masters Never-married Exec-managerial Not-in-family White True 29.0 1.572620e+05 10.0 False False True

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
to = TabularDask(ddf_main, procs, cat_names, cont_names, y_names='age', train_mask_func=get_random_train_mask)
CPU times: user 860 ms, sys: 7.57 ms, total: 868 ms
Wall time: 871 ms
to.procs[-1].means
fnlwgt           192459.673567
education-num        10.071133
dtype: float64
dls = to.dataloaders()
dls.valid.show_batch()
workclass education marital-status occupation relationship race education-num_na fnlwgt education-num age
0 Private Bachelors Married-civ-spouse #na# Husband White True 55291.003489 10.0 30.0
1 Private HS-grad Never-married Handlers-cleaners Own-child Black False 746431.990793 9.0 26.0
2 Private HS-grad Never-married Sales Other-relative White False 91524.997303 9.0 18.0
3 Private Masters Never-married Exec-managerial Not-in-family White True 157262.001416 10.0 29.0
4 Private HS-grad Married-civ-spouse #na# Husband Amer-Indian-Eskimo True 216811.000529 10.0 30.0
5 Private HS-grad Never-married Sales Own-child White True 156084.000249 10.0 36.0
6 Self-emp-not-inc Doctorate Married-civ-spouse #na# Husband White False 65278.004714 16.0 32.0
7 Local-gov HS-grad Never-married Adm-clerical Own-child White False 129232.000243 9.0 23.0
8 Private HS-grad Married-civ-spouse Transport-moving Wife White False 123397.002997 9.0 31.0