The main class to get your data ready for model training is TabularDataLoaders
and its factory methods. Check out the BigTabular tutorial for examples of use.
DaskDataLoaders
DaskDataLoaders (*loaders, path:str|pathlib.Path='.', device=None)
Basic wrapper around DaskDataLoader with factory methods for large tabular datasets with Dask
This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:
cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables
y_block: the TransformBlock to use for the target
valid_idx: the indices to use for the validation set (defaults to a random split otherwise)
bs: the batch size
val_bs: the batch size for the validation DataLoader (defaults to bs)
shuffle_train: whether to shuffle the training DataLoader
n: overrides the number of elements in the dataset
device: the PyTorch device to use (defaults to default_device())
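Before calling a factory method, it can help to confirm that the column names passed as cat_names, cont_names and y_names exist and do not overlap. The sketch below is a hypothetical helper written in plain pandas; check_column_roles is not part of BigTabular, and the tiny DataFrame is made up for illustration.

```python
import pandas as pd

def check_column_roles(df, cat_names, cont_names, y_names):
    """Hypothetical helper (not a BigTabular API): report missing or duplicated column roles."""
    y_names = [y_names] if isinstance(y_names, str) else list(y_names)
    roles = {"cat_names": list(cat_names), "cont_names": list(cont_names), "y_names": y_names}
    problems, seen = [], {}
    for role, names in roles.items():
        for name in names:
            if name not in df.columns:
                problems.append(f"{role}: column {name!r} not found")
            if name in seen:
                problems.append(f"column {name!r} listed in both {seen[name]} and {role}")
            seen.setdefault(name, role)
    return problems

# Illustrative data only; the real adult dataset has more columns.
df = pd.DataFrame({"age": [49, 44], "workclass": ["Private", "Private"], "salary": [">=50k", "<50k"]})
print(check_column_roles(df, ["workclass"], ["age"], "salary"))
print(check_column_roles(df, ["workclass"], ["age", "fnlwgt"], "salary"))
```

An empty list means every name was found exactly once; any other result lists the names to fix before building the DataLoaders.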
DaskDataLoaders.from_ddf
DaskDataLoaders.from_ddf (ddf:dd.DataFrame, path:str|Path='.',
procs:list=None, cat_names:list=None,
cont_names:list=None, y_names:list=None,
y_block:TransformBlock=None,
train_mask_func:callable=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None, drop_last:bool=None,
val_bs:int=None)
Create TabularDataLoaders from df in path using procs
| Parameter | Type | Default | Details |
|---|---|---|---|
| ddf | dd.DataFrame | | A Dask dataframe |
| path | str \| Path | . | Location of df, defaults to current working directory |
| procs | list | None | List of TabularProcs |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | TransformBlock to use for the target(s) |
| train_mask_func | callable | None | A function that creates a train/validation mask over a DataFrame |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use shuffle) Shuffle training DataLoader |
| shuffle | bool | True | Shuffle is currently ignored in DaskDataLoader |
| val_shuffle | bool | False | Shuffle validation DataLoader |
| n | int | None | Size of Datasets used to create DataLoader |
| device | torch.device | None | Device to put DataLoaders |
| drop_last | bool | None | Drop last incomplete batch, defaults to shuffle. Currently ignored in DaskDataLoader |
| val_bs | int | None | Validation batch size, defaults to bs |
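As a rough illustration of what a train/validation mask over a DataFrame can look like, the sketch below builds a random boolean mask with pandas and NumPy. It assumes the convention used by the split function on this page, where True marks a training row and False a validation row. The name random_train_mask and its valid_pct and seed parameters are hypothetical, and whether such a function can be passed unchanged as train_mask_func should be checked against BigTabular itself.

```python
import numpy as np
import pandas as pd

def random_train_mask(df, valid_pct=0.2, seed=42):
    """Hypothetical mask function: True = training row, False = validation row."""
    rng = np.random.default_rng(seed)
    # Draw one uniform number per row; rows below valid_pct go to validation.
    return pd.Series(rng.random(len(df)) >= valid_pct, index=df.index)

df = pd.DataFrame({"age": range(1000)})
mask = random_train_mask(df)
print(mask.mean())  # fraction of rows assigned to training, roughly 0.8
```

Fixing the seed makes the split reproducible between runs, which is usually what you want when comparing models.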
Let’s look at an example with the adult dataset:
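import pandas as pd
import dask.dataframe as dd
from fastai.data.external import untar_data, URLs
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
ddf = dd.from_pandas(df, npartitions=1)  # one partition, which the split function below relies on
ddf.head()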
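| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |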
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
The following function gives the same result as valid_idx=list(range(800,1000))
in TabularDataLoaders. This is only the case for a Dask dataframe with one partition.
def split_func(df): return pd.Series([False if i >= 800 and i < 1000 else True for i in range(len(df))])
dls = DaskDataLoaders.from_ddf(ddf, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                               y_names="salary", train_mask_func=split_func, bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
warnings.warn('`shuffle` and `drop_last` are currently ignored.')
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.001686 | 12.0 | >=50k |
| 1 | Private | Masters | Divorced | Exec-managerial | Not-in-family | White | False | 44.0 | 236745.998860 | 14.0 | >=50k |
| 2 | Private | HS-grad | Divorced | #na# | Unmarried | Black | True | 38.0 | 96185.001882 | 10.0 | <50k |
| 3 | Self-emp-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | False | 38.0 | 112847.002752 | 15.0 | >=50k |
| 4 | Self-emp-not-inc | 7th-8th | Married-civ-spouse | Other-service | Wife | Black | True | 42.0 | 82297.004480 | 10.0 | <50k |
| 5 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 63209.995727 | 9.0 | <50k |
| 6 | Private | Some-college | Divorced | #na# | Other-relative | White | False | 49.0 | 44434.004384 | 10.0 | <50k |
| 7 | Private | 11th | Married-civ-spouse | #na# | Husband | White | False | 37.0 | 138940.000568 | 7.0 | <50k |
| 8 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 328216.004421 | 9.0 | >=50k |
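The one-partition caveat for split_func can be illustrated with plain pandas. The sketch below assumes the mask function is evaluated separately on each partition, which is an assumption about how BigTabular applies it and is not confirmed by this page. Two DataFrame slices stand in for Dask partitions, and the 2000-row frame is made up for illustration.

```python
import pandas as pd

def split_func(df):
    # Same positional rule as in the text: rows 800..999 are validation (False).
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])

df = pd.DataFrame({"x": range(2000)})

# One partition: exactly rows 800..999 are marked as validation.
one = split_func(df)
valid_one = [i for i, is_train in enumerate(one) if not is_train]

# Two partitions of 1000 rows each, with the mask computed per partition (assumption).
parts = [df.iloc[:1000], df.iloc[1000:]]
valid_two, offset = [], 0
for part in parts:
    mask = split_func(part)
    # Positions restart at 0 inside each partition, so add the partition offset back.
    valid_two += [offset + i for i, is_train in enumerate(mask) if not is_train]
    offset += len(part)

print(len(valid_one), len(valid_two))  # 200 400
```

Under that assumption the positional rule fires once per partition, so rows 1800..1999 also land in the validation set. A mask that should survive repartitioning would need to depend on the row index or on column values rather than on the position within a partition.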
DaskDataLoaders.from_csv
DaskDataLoaders.from_csv (csv:str|Path|io.BufferedReader, *args,
skipinitialspace=True, header='infer',
dtype_backend=None, storage_options=None,
path:str|Path='.', procs:list=None,
cat_names:list=None, cont_names:list=None,
y_names:list=None, y_block:TransformBlock=None,
train_mask_func:callable=None, bs:int=64,
shuffle_train:bool=None, shuffle:bool=True,
val_shuffle:bool=False, n:int=None,
device:torch.device=None, drop_last:bool=None,
val_bs:int=None)
Create TabularDataLoaders from csv file in path using procs
| Parameter | Type | Default | Details |
|---|---|---|---|
| csv | str \| Path \| io.BufferedReader | | A csv of training data |
| args | | | |
| skipinitialspace | bool | True | |
| header | str | infer | |
| dtype_backend | NoneType | None | |
| storage_options | NoneType | None | |
| path | str \| Path | . | Location of df, defaults to current working directory |
| procs | list | None | List of TabularProcs |
| cat_names | list | None | Column names pertaining to categorical variables |
| cont_names | list | None | Column names pertaining to continuous variables |
| y_names | list | None | Names of the dependent variables |
| y_block | TransformBlock | None | TransformBlock to use for the target(s) |
| train_mask_func | callable | None | A function that creates a train/validation mask over a DataFrame |
| bs | int | 64 | Batch size |
| shuffle_train | bool | None | (Deprecated, use shuffle) Shuffle training DataLoader |
| shuffle | bool | True | Shuffle is currently ignored in DaskDataLoader |
| val_shuffle | bool | False | Shuffle validation DataLoader |
| n | int | None | Size of Datasets used to create DataLoader |
| device | torch.device | None | Device to put DataLoaders |
| drop_last | bool | None | Drop last incomplete batch, defaults to shuffle. Currently ignored in DaskDataLoader |
| val_bs | int | None | Validation batch size, defaults to bs |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
dls = DaskDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                               y_names="salary", train_mask_func=split_func, bs=64)
/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
warnings.warn('`shuffle` and `drop_last` are currently ignored.')
DaskDataLoaders.test_dl
DaskDataLoaders.test_dl (test_items, rm_type_tfms=None,
process:bool=True, inplace:bool=False, **kwargs)
Create test DaskDataLoader from test_items using validation procs
| Parameter | Type | Default | Details |
|---|---|---|---|
| test_items | | | Items to create new test TabDataLoader formatted the same as the training data |
| rm_type_tfms | NoneType | None | Number of Transforms to be removed from procs |
| process | bool | True | Apply validation TabularProcs to test_items immediately |
| inplace | bool | False | Keep separate copy of original test_items in memory if False |
| kwargs | | | |
External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, that is exposed by TabularDataLoaders.from_csv(). Without it, a category label used later for inference, such as workclass: Private, would be wrongly categorized as 0 or "#na#" if the training label was read as " Private". Let’s test this feature.
test_data = {
    'age': [49],
    'workclass': ['Private'],
    'fnlwgt': [101320],
    'education': ['Assoc-acdm'],
    'education-num': [12.0],
    'marital-status': ['Married-civ-spouse'],
    'occupation': [''],
    'relationship': ['Wife'],
    'race': ['White'],
}
# Named test_ddf rather than input to avoid shadowing the Python builtin.
test_ddf = dd.from_pandas(pd.DataFrame(test_data), npartitions=1)
tdl = dls.test_dl(test_ddf)
# The encoded workclass must not be 0, i.e. 'Private' was found in the training vocabulary.
test_ne(0, tdl.dataset.items.compute().iloc[0]['workclass'])
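The effect of skipinitialspace can also be seen in isolation with plain pandas. The sketch below reads the same small CSV text twice and then looks up the label "Private" in a vocabulary built from each result. The encode function is a hypothetical stand-in for a categorical encoder that reserves index 0 for unknown labels; it is not BigTabular's implementation, and the two-row CSV is made up from the row quoted above.

```python
import io
import pandas as pd

raw = "age,workclass,fnlwgt\n49, Private,101320\n44, Private,236746\n"

untrimmed = pd.read_csv(io.StringIO(raw))                       # keeps " Private"
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)  # yields "Private"

def encode(label, train_labels):
    """Hypothetical encoder: index 0 is reserved for unknown labels ("#na#")."""
    vocab = ["#na#"] + sorted(set(train_labels))
    return vocab.index(label) if label in vocab else 0

print(encode("Private", untrimmed["workclass"]))  # 0: the label is unknown, so it falls back to "#na#"
print(encode("Private", trimmed["workclass"]))    # 1: the label matches the training vocabulary
```

The leading space makes " Private" and "Private" two different strings, so a clean label supplied at inference time is only found when the training data was trimmed as well.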