BigTabular data

Helper functions to load larger-than-memory Dask dataframes into DataLoaders for the tabular application, and the higher-level class DaskDataLoaders

The main class to get your data ready for model training is DaskDataLoaders and its factory methods. Check out the BigTabular tutorial for usage examples.


DaskDataLoaders

 DaskDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)

Basic wrapper around DaskDataLoader with factory methods for large tabular datasets with Dask

This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

  • cat_names: the names of the categorical variables
  • cont_names: the names of the continuous variables
  • y_names: the names of the dependent variables
  • y_block: the TransformBlock to use for the target
  • train_mask_func: a function that creates a train/validation mask over a DataFrame
  • bs: the batch size
  • val_bs: the batch size for the validation DataLoader (defaults to bs)
  • shuffle_train: whether to shuffle the training DataLoader (deprecated, use shuffle)
  • n: overrides the number of elements in the dataset
  • device: the PyTorch device to use (defaults to default_device())

DaskDataLoaders.from_ddf

 DaskDataLoaders.from_ddf (ddf:dd.DataFrame, path:str|Path='.',
                           procs:list=None, cat_names:list=None,
                           cont_names:list=None, y_names:list=None,
                           y_block:TransformBlock=None,
                           train_mask_func:callable=None, bs:int=64,
                           shuffle_train:bool=None, shuffle:bool=True,
                           val_shuffle:bool=False, n:int=None,
                           device:torch.device=None, drop_last:bool=None,
                           val_bs:int=None)

Create DaskDataLoaders from ddf in path using procs

Parameter        Type            Default  Details
ddf              dd.DataFrame             A Dask dataframe
path             str | Path      .        Location of df, defaults to current working directory
procs            list            None     List of TabularProcs
cat_names        list            None     Column names pertaining to categorical variables
cont_names       list            None     Column names pertaining to continuous variables
y_names          list            None     Names of the dependent variables
y_block          TransformBlock  None     TransformBlock to use for the target(s)
train_mask_func  callable        None     A function that creates a train/validation mask over a DataFrame
bs               int             64       Batch size
shuffle_train    bool            None     (Deprecated, use shuffle) Shuffle training DataLoader
shuffle          bool            True     Shuffle is currently ignored in DaskDataLoader
val_shuffle      bool            False    Shuffle validation DataLoader
n                int             None     Size of Datasets used to create DataLoader
device           torch.device    None     Device to put DataLoaders
drop_last        bool            None     Drop last incomplete batch, defaults to shuffle. Currently ignored in DaskDataLoader
val_bs           int             None     Validation batch size, defaults to bs

Let’s look at an example with the adult dataset:

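import dask.dataframe as dd
import pandas as pd
from fastai.data.external import untar_data, URLs

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
ddf = dd.from_pandas(df, npartitions=1)  # one partition, as the split example below expects
ddf.head()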
   age  workclass         fnlwgt  education    education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
0  49   Private           101320  Assoc-acdm   12.0           Married-civ-spouse  <NA>             Wife           White               Female  0             1902          40              United-States   >=50k
1  44   Private           236746  Masters      14.0           Divorced            Exec-managerial  Not-in-family  White               Male    10520         0             45              United-States   >=50k
2  38   Private           96185   HS-grad      NaN            Divorced            <NA>             Unmarried      Black               Female  0             0             32              United-States   <50k
3  38   Self-emp-inc      112847  Prof-school  15.0           Married-civ-spouse  Prof-specialty   Husband        Asian-Pac-Islander  Male    0             0             40              United-States   >=50k
4  42   Self-emp-not-inc  82297   7th-8th      NaN            Married-civ-spouse  Other-service    Wife           Black               Female  0             0             50              United-States   <50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]

The following function gives the same result as valid_idx=list(range(800,1000)) in TabularDataLoaders. This holds only for a Dask dataframe with a single partition.

def split_func(df):
    # True marks a training row; positions 800-999 form the validation set
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])
dls = DaskDataLoaders.from_ddf(ddf, path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                 y_names="salary", train_mask_func=split_func, bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
dls.show_batch()
   workclass         education     marital-status      occupation         relationship    race                education-num_na  age   fnlwgt         education-num  salary
0  Private           Assoc-acdm    Married-civ-spouse  #na#               Wife            White               False             49.0  101320.001686  12.0           >=50k
1  Private           Masters       Divorced            Exec-managerial    Not-in-family   White               False             44.0  236745.998860  14.0           >=50k
2  Private           HS-grad       Divorced            #na#               Unmarried       Black               True              38.0  96185.001882   10.0           <50k
3  Self-emp-inc      Prof-school   Married-civ-spouse  Prof-specialty     Husband         Asian-Pac-Islander  False             38.0  112847.002752  15.0           >=50k
4  Self-emp-not-inc  7th-8th       Married-civ-spouse  Other-service      Wife            Black               True              42.0  82297.004480   10.0           <50k
5  Private           HS-grad       Never-married       Handlers-cleaners  Own-child       White               False             20.0  63209.995727   9.0            <50k
6  Private           Some-college  Divorced            #na#               Other-relative  White               False             49.0  44434.004384   10.0           <50k
7  Private           11th          Married-civ-spouse  #na#               Husband         White               False             37.0  138940.000568  7.0            <50k
8  Private           HS-grad       Married-civ-spouse  Craft-repair       Husband         White               False             46.0  328216.004421  9.0            >=50k
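
The single-partition caveat noted for split_func can be illustrated without Dask. The sketch below assumes, as that caveat suggests but this page does not confirm, that train_mask_func is called on each partition separately, so that row positions restart at 0 in every partition. The helpers positional_mask and count_validation_rows are hypothetical, and the partition sizes are made up for illustration.

```python
def positional_mask(n_rows):
    """Mirror split_func: False (validation) for positions 800..999, True otherwise."""
    return [not (800 <= i < 1000) for i in range(n_rows)]


def count_validation_rows(partition_sizes):
    """Apply the mask to each partition on its own and count validation rows.

    Assumption: the mask function sees one partition at a time.
    """
    return sum(positional_mask(size).count(False) for size in partition_sizes)


# The same 2000 rows, split three different ways.
one_partition = count_validation_rows([2000])
two_partitions = count_validation_rows([1000, 1000])
four_partitions = count_validation_rows([500, 500, 500, 500])
```

Under this assumption the validation set changes with the partitioning: 200 rows for one partition, 400 for two partitions of 1000 rows, and none for four partitions of 500 rows. A mask that depends on row position is therefore only comparable to valid_idx when the Dask dataframe has a single partition.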

DaskDataLoaders.from_csv

 DaskDataLoaders.from_csv (csv:str|Path|io.BufferedReader, *args,
                           skipinitialspace=True, header='infer',
                           dtype_backend=None, storage_options=None,
                           path:str|Path='.', procs:list=None,
                           cat_names:list=None, cont_names:list=None,
                           y_names:list=None, y_block:TransformBlock=None,
                           train_mask_func:callable=None, bs:int=64,
                           shuffle_train:bool=None, shuffle:bool=True,
                           val_shuffle:bool=False, n:int=None,
                           device:torch.device=None, drop_last:bool=None,
                           val_bs:int=None)

Create DaskDataLoaders from csv file in path using procs

Parameter         Type                            Default  Details
csv               str | Path | io.BufferedReader           A csv of training data
args
skipinitialspace  bool                            True
header            str                             infer
dtype_backend     NoneType                        None
storage_options   NoneType                        None
path              str | Path                      .        Location of df, defaults to current working directory
procs             list                            None     List of TabularProcs
cat_names         list                            None     Column names pertaining to categorical variables
cont_names        list                            None     Column names pertaining to continuous variables
y_names           list                            None     Names of the dependent variables
y_block           TransformBlock                  None     TransformBlock to use for the target(s)
train_mask_func   callable                        None     A function that creates a train/validation mask over a DataFrame
bs                int                             64       Batch size
shuffle_train     bool                            None     (Deprecated, use shuffle) Shuffle training DataLoader
shuffle           bool                            True     Shuffle is currently ignored in DaskDataLoader
val_shuffle       bool                            False    Shuffle validation DataLoader
n                 int                             None     Size of Datasets used to create DataLoader
device            torch.device                    None     Device to put DataLoaders
drop_last         bool                            None     Drop last incomplete batch, defaults to shuffle. Currently ignored in DaskDataLoader
val_bs            int                             None     Validation batch size, defaults to bs

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                  y_names="salary", train_mask_func=split_func, bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')

DaskDataLoaders.test_dl

 DaskDataLoaders.test_dl (test_items, rm_type_tfms=None,
                          process:bool=True, inplace:bool=False, **kwargs)

Create test DaskDataLoader from test_items using validation procs

Parameter     Type      Default  Details
test_items                       Items to create new test TabDataLoader formatted the same as the training data
rm_type_tfms  NoneType  None     Number of Transforms to be removed from procs
process       bool      True     Apply validation TabularProcs to test_items immediately
inplace       bool      False    Keep separate copy of original test_items in memory if False
kwargs

External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, which DaskDataLoaders.from_csv() exposes. Without it, a category label used later for inference, such as workclass:Private, would be wrongly categorized as 0 or "#na#" if the training label was read as " Private". Let’s test this feature.

test_data = {
    'age': [49], 
    'workclass': ['Private'], 
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'], 
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'], 
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
# Named test_ddf rather than input to avoid shadowing the Python builtin
test_ddf = dd.from_pandas(pd.DataFrame(test_data), npartitions=1)
tdl = dls.test_dl(test_ddf)

test_ne(0, tdl.dataset.items.compute().iloc[0]['workclass'])
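
The effect of skipinitialspace described above can be reproduced in isolation. The sketch below uses Python's standard csv module, which has a parameter of the same name, rather than pandas or Dask. The vocab dictionary and the encode helper are hypothetical stand-ins for the category mapping learned during training, with index 0 playing the role of "#na#".

```python
import csv
import io

# One data row in the style of adult.csv: note the space after the first comma.
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

# Hypothetical vocabulary learned at training time; index 0 stands for "#na#".
vocab = {"#na#": 0, "Private": 1}


def read_first_row(text, skipinitialspace):
    """Parse CSV text and return the first data row as a dict."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return next(reader)


def encode(label):
    """Map a label to its vocabulary index, falling back to 0 for unknown labels."""
    return vocab.get(label, 0)


untrimmed = read_first_row(raw, skipinitialspace=False)
trimmed = read_first_row(raw, skipinitialspace=True)

untrimmed_code = encode(untrimmed["workclass"])  # label is " Private"
trimmed_code = encode(trimmed["workclass"])      # label is "Private"
```

Without trimming, the label keeps its leading space, misses the vocabulary, and falls back to index 0. With skipinitialspace=True it matches the trained category.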