# Extension of fastai.tabular for larger-than-memory datasets with Dask
This library replicates much of the functionality of the tabular data application in the fastai library, adapted to work with larger-than-memory datasets. Pandas, which fastai.tabular uses for data transformations, is replaced with Dask DataFrames.

Most of the Dask implementations were written as they were needed for a personal project and later refactored to match the fastai API more closely. The Jupyter notebooks follow those from fastai.tabular closely, and most of the examples and tests have been replicated.
## When not to use BigTabular
Don’t use this library when you don’t need Dask. The Dask website gives the following guidance:

> Dask DataFrames are often used either when:
>
> - Your data is too big
> - Your computation is too slow and other techniques don’t work
>
> You should probably stick to just using pandas if:
>
> - Your data is small
> - Your computation is fast (subsecond)
> - There are simpler ways to accelerate your computation, like avoiding `.apply` or Python for loops and using a built-in pandas method instead.
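The last point can make a large difference on its own. A small illustration of replacing a row-wise `.apply` with a built-in vectorised pandas operation (the data here is invented):

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 80.0], "qty": [2, 1, 5]})

# Slow: a Python-level function called once per row
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: a single vectorised column operation
fast = df["price"] * df["qty"]

assert slow.equals(fast)  # same result: [200.0, 250.0, 400.0]
```

If a rewrite like this brings the computation back to interactive speeds, there is no need for Dask at all.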
## Install

```sh
pip install bigtabular
```
## How to use
Refer to the tutorial for a more detailed usage example.
Create the dataloaders. Some of the columns are continuous (like age), and we will treat them as floats that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert each of them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the `DaskDataLoaders` factory methods:
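As a rough sketch of what this encoding means, here is the idea in plain pandas (this is not the library's own transform, just an illustration; the column names follow the adult dataset used in the fastai examples):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# Continuous columns stay numeric and are fed to the model directly
df["age"] = df["age"].astype("float32")

# Categorical columns become unique integer indices for embedding layers;
# adding 1 mirrors fastai's convention of reserving 0 for missing values
df["workclass"] = df["workclass"].astype("category").cat.codes + 1
print(df["workclass"].tolist())  # → [3, 2, 1]
```

The library's own transforms handle this mapping (and its inverse for decoding) across all partitions of a Dask DataFrame.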