def split_func(df):
    # Rows 800-999 go to the validation set; everything else is training
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])

BigTabular learner
Learner ready to train for tabular data with Dask
The main function you probably want to use in this module is dask_learner. It will automatically create a TabularModel suitable for your data and infer the right loss function. See the BigTabular tutorial for an example of use in context.
Main functions
DaskLearner
Learner for tabular data in Dask
DaskLearner inherits from fastai's TabularLearner. It works exactly like a normal Learner; the only difference is that it implements a predict method that works on a single row of data.
dask_learner
dask_learner (dls:DataLoaders, layers:list=None, emb_szs:list=None, config:dict=None, n_out:int=None, y_range:Tuple=None, loss_func:callable|None=None, opt_func:Optimizer|OptimWrapper=<function Adam>, lr:float|slice=0.001, splitter:callable=<function trainable_params>, cbs:Callback|MutableSequence|None=None, metrics:callable|MutableSequence|None=None, path:str|Path|None=None, model_dir:str|Path='models', wd:float|int|None=None, wd_bn_bias:bool=False, train_bn:bool=True, moms:tuple=(0.95, 0.85, 0.95), default_cbs:bool=True)
Get a Learner using dls, with metrics, including a TabularModel created using the remaining params.
| | Type | Default | Details |
|---|---|---|---|
| dls | DataLoaders | | DataLoaders containing fastai or PyTorch DataLoaders |
| layers | list | None | Size of the layers generated by LinBnDrop |
| emb_szs | list | None | Tuples of n_unique, embedding_size for all categorical features |
| config | dict | None | Config params for TabularModel from tabular_config |
| n_out | int | None | Final output size of the model |
| y_range | Tuple | None | Low and high for the final sigmoid function |
| loss_func | callable \| None | None | Loss function. Defaults to dls loss |
| opt_func | Optimizer \| OptimWrapper | Adam | Optimization function for training |
| lr | float \| slice | 0.001 | Default learning rate |
| splitter | callable | trainable_params | Split model into parameter groups. Defaults to one parameter group |
| cbs | Callback \| MutableSequence \| None | None | Callbacks to add to Learner |
| metrics | callable \| MutableSequence \| None | None | Metrics to calculate on validation set |
| path | str \| Path \| None | None | Parent directory to save, load, and export models. Defaults to dls path |
| model_dir | str \| Path | models | Subdirectory to save and load models |
| wd | float \| int \| None | None | Default weight decay |
| wd_bn_bias | bool | False | Apply weight decay to normalization and bias parameters |
| train_bn | bool | True | Train frozen normalization layers |
| moms | tuple | (0.95, 0.85, 0.95) | Default momentum for schedulers |
| default_cbs | bool | True | Include default Callbacks |
If your data was built with fastai, you probably won't need to pass anything to emb_szs unless you want to change the library defaults (produced by get_emb_sz); the same goes for n_out, which should be inferred automatically. layers defaults to [200,100] and is passed to TabularModel along with the config.
Use tabular_config to create a config and customize the model used. y_range is exposed directly only because this argument is used so often.
All the other arguments are passed to Learner.
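For context on those embedding defaults: fastai's get_emb_sz derives each embedding size from the cardinality of the categorical feature using a fixed heuristic. A pure-Python sketch of that rule follows (the formula is fastai's published default; the function name here is just for illustration):

```python
def emb_sz_rule(n_cat: int) -> int:
    # fastai's default embedding-size heuristic:
    # grow sub-linearly with cardinality, capped at 600
    return min(600, round(1.6 * n_cat ** 0.56))

# A binary feature gets a 2-dim embedding; very high-cardinality
# features are capped at 600 dimensions.
sizes = {n: emb_sz_rule(n) for n in (2, 10, 100, 1_000_000)}
```

Passing emb_szs explicitly overrides this rule for the listed features only.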
The split_func function defined above gives the same result as valid_idx=list(range(800,1000)) in TabularDataLoaders. This is only the case for a Dask dataframe with a single partition.
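To see the equivalence, and why it breaks with more than one partition, here is a small pandas-only check. split_func is reproduced from above; the two-chunk loop merely simulates the mask function being applied to each partition separately, which is an assumption about how train_mask_func is evaluated:

```python
import pandas as pd

def split_func(df):
    return pd.Series([not (800 <= i < 1000) for i in range(len(df))])

df = pd.DataFrame({'x': range(1200)})

# One partition: the False positions match valid_idx=list(range(800, 1000))
mask = split_func(df)
valid_idx = [i for i, keep in enumerate(mask) if not keep]

# Two partitions: split_func sees local positions 0-599 in each chunk,
# so no position falls in [800, 1000) and every row is marked as training
chunks = [df.iloc[:600], df.iloc[600:]]
two_part_mask = pd.concat([split_func(c) for c in chunks], ignore_index=True)
```

With several partitions you would instead need a mask function that reasons about global row positions, or a random split per partition.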
path = untar_data(URLs.ADULT_SAMPLE)
ddf = dd.from_pandas(pd.read_csv(path/'adult.csv'), npartitions=1)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
dls = DaskDataLoaders.from_ddf(ddf, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
y_names="salary", train_mask_func=split_func, bs=64)
learn = dask_learner(dls)

/home/stefan/Insync/OneDrive_personal/python_workspace/dev/tmp/bigtabular/bigtabular/core.py:203: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
DaskLearner.predict
DaskLearner.predict (row:pandas.core.series.Series)
Predict on a single sample
| | Type | Details |
|---|---|---|
| row | pd.Series | Features to be predicted |
We can pass an individual row of data to our DaskLearner's predict method. Its output differs slightly from the other predict methods, as this one always returns the input as well:
row, clas, probs = learn.predict(ddf.head().iloc[0])

row

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.001686 | 12.0 |

clas, probs

(tensor(0), tensor([0.5152, 0.4848]))
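Here probs holds the per-class probabilities and clas the index of the most likely class; the readable category comes from decoding that index with the target vocab. A small PyTorch sketch of that relationship (the vocab order shown is an assumption about this dataset's salary categories):

```python
import torch

probs = torch.tensor([0.5152, 0.4848])  # per-class probabilities from predict
clas = probs.argmax()                   # index of the most likely class
vocab = ['<50k', '>=50k']               # assumed order of the salary categories
predicted_label = vocab[clas]
```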