def split_func(df):
    # Rows 800-999 go to the validation set; everything else is training
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])

BigTabular learner
Learner ready to train for tabular data with Dask
The main function you probably want to use in this module is dask_learner. It will automatically create a TabularModel suitable for your data and infer the right loss function. See the BigTabular tutorial for an example of use in context.
Main functions
DaskLearner
Learner for tabular data in Dask
DaskLearner inherits from fastai's TabularLearner. It works exactly like a normal Learner; the only difference is that it implements a predict method that works on a single row of data.
dask_learner
dask_learner (dls:DataLoaders, layers:list=None, emb_szs:list=None, config:dict=None, n_out:int=None, y_range:Tuple=None, loss_func:callable|None=None, opt_func:Optimizer|OptimWrapper=<function Adam>, lr:float|slice=0.001, splitter:callable=<function trainable_params>, cbs:Callback|MutableSequence|None=None, metrics:callable|MutableSequence|None=None, path:str|Path|None=None, model_dir:str|Path='models', wd:float|int|None=None, wd_bn_bias:bool=False, train_bn:bool=True, moms:tuple=(0.95, 0.85, 0.95), default_cbs:bool=True)
Get a Learner using dls, with metrics, including a TabularModel created using the remaining params.
| | Type | Default | Details |
|---|---|---|---|
| dls | DataLoaders | | DataLoaders containing fastai or PyTorch DataLoaders |
| layers | list | None | Size of the layers generated by LinBnDrop |
| emb_szs | list | None | Tuples of n_unique, embedding_size for all categorical features |
| config | dict | None | Config params for TabularModel from tabular_config |
| n_out | int | None | Final output size of the model |
| y_range | Tuple | None | Low and high for the final sigmoid function |
| loss_func | callable \| None | None | Loss function. Defaults to dls loss |
| opt_func | Optimizer \| OptimWrapper | Adam | Optimization function for training |
| lr | float \| slice | 0.001 | Default learning rate |
| splitter | callable | trainable_params | Split model into parameter groups. Defaults to one parameter group |
| cbs | Callback \| MutableSequence \| None | None | Callbacks to add to Learner |
| metrics | callable \| MutableSequence \| None | None | Metrics to calculate on validation set |
| path | str \| Path \| None | None | Parent directory to save, load, and export models. Defaults to dls path |
| model_dir | str \| Path | models | Subdirectory to save and load models |
| wd | float \| int \| None | None | Default weight decay |
| wd_bn_bias | bool | False | Apply weight decay to normalization and bias parameters |
| train_bn | bool | True | Train frozen normalization layers |
| moms | tuple | (0.95, 0.85, 0.95) | Default momentum for schedulers |
| default_cbs | bool | True | Include default Callbacks |
If your data was built with fastai, you probably won't need to pass anything to emb_szs unless you want to change the library defaults (produced by get_emb_sz); the same goes for n_out, which should be inferred automatically. layers defaults to [200,100] and is passed to TabularModel along with the config.
Use tabular_config to create a config and customize the model used. y_range is exposed directly only because this argument is used so often.
All the other arguments are passed to Learner.
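For context on those embedding defaults: fastai's get_emb_sz derives each embedding size from the cardinality of the categorical feature using a fixed heuristic. A pure-Python sketch of that rule follows (the formula is fastai's published default; the function name here is just for illustration):

```python
def emb_sz_rule(n_cat: int) -> int:
    # fastai's default embedding-size heuristic:
    # grow sub-linearly with cardinality, capped at 600
    return min(600, round(1.6 * n_cat ** 0.56))

# A binary feature gets a 2-dim embedding; very high-cardinality
# features are capped at 600 dimensions.
sizes = {n: emb_sz_rule(n) for n in (2, 10, 100, 1_000_000)}
```

Passing emb_szs explicitly overrides this rule for the listed features only.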
The split_func function defined above gives the same result as valid_idx=list(range(800,1000)) in TabularDataLoaders. This is only the case for a Dask dataframe with a single partition.
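To see the equivalence, and why it breaks with more than one partition, here is a small pandas-only check. split_func is reproduced from above; the two-chunk loop merely simulates the mask function being applied to each partition separately, which is an assumption about how train_mask_func is evaluated:

```python
import pandas as pd

def split_func(df):
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])

df = pd.DataFrame({'x': range(1200)})

# One partition: the False positions match valid_idx=list(range(800, 1000))
mask = split_func(df)
valid_idx = [i for i, keep in enumerate(mask) if not keep]

# Two partitions: split_func sees local positions 0-599 in each chunk,
# so no position falls in [800, 1000) and every row is marked as training
chunks = [df.iloc[:600], df.iloc[600:]]
two_part_mask = pd.concat([split_func(c) for c in chunks], ignore_index=True)
```

With several partitions you would instead need a mask function that reasons about global row positions, or a random split per partition.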
path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'), npartitions=1)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
dls = DaskDataLoaders.from_ddf(ddf, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names="salary", train_mask_func=split_func, bs=64)
learn = dask_learner(dls)

/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
DaskLearner.predict
DaskLearner.predict (row:pandas.core.series.Series)
Predict on a single sample
| | Type | Details |
|---|---|---|
| row | pd.Series | Features to be predicted |
We can pass an individual row of data to our DaskLearner's predict method. Its output differs slightly from the other predict methods, as this one always returns the input as well:
row, clas, probs = learn.predict(ddf.head().iloc[0])

row

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.001686 | 12.0 |

clas, probs

(tensor(0), tensor([0.5152, 0.4848]))
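Here probs holds the per-class probabilities and clas the index of the most likely class; the readable category comes from decoding that index with the target vocab. A small PyTorch sketch of that relationship (the vocab order shown is an assumption about this dataset's salary categories):

```python
import torch

probs = torch.tensor([0.5152, 0.4848])  # per-class probabilities from predict
clas = probs.argmax()                   # index of the most likely class
vocab = ['<50k', '>=50k']               # assumed order of the salary categories
predicted_label = vocab[clas]
```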