# BigTabular core

Basic functions to preprocess Dask `DataFrame`s before assembling them in `DataLoaders`.
## Initial preprocessing
Define Dask versions of the `make_date`, `add_datepart`, and `add_elapsed_times` functions from `tabular.core`. The `dask_make_date` function uses Dask's `to_datetime` function rather than the Pandas version. The `dask_add_datepart` and `dask_add_elapsed_times` functions simply wrap `add_datepart` and `add_elapsed_times` in Dask's `map_partitions`.
### dask_make_date

    dask_make_date (ddf, date_field)

Convert `ddf[date_field]` to date type.
For example:

```python
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
ddf = dask_make_date(ddf, 'date')
test_eq(ddf['date'].dtype, np.dtype('datetime64[ns]'))
```
### dask_add_datepart

    dask_add_datepart (ddf, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column `field_name` of `ddf`.
For example, if we have a series of dates we can then generate features such as `Year`, `Month`, `Day`, `Dayofweek`, `Is_month_start`, etc. as shown below:
```python
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'date')
ddf.head()
```
| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
### dask_add_elapsed_times

    dask_add_elapsed_times (ddf, field_names, date_field, base_field)

Add, for each event in `field_names`, the elapsed time according to `date_field`, grouped by `base_field`.
```python
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
ddf = dd.from_pandas(df)
ddf = dask_add_elapsed_times(ddf, ['event'], 'date', 'base')
ddf.head()
```
| | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
|---|---|---|---|---|---|---|---|
0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |
### dask_cont_cat_split

    dask_cont_cat_split (df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given `df`.
We also define a Dask version of the `cont_cat_split` function. The only difference from the original function is that `compute` is called on the Dask dataframe to determine the cardinality of the columns. This function decides whether a column is continuous or categorical based on the cardinality of its values: if the cardinality is above the `max_card` parameter (or the column has a `float` datatype) it is added to `cont_names`, otherwise to `cat_names`. An example is below:
```python
# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
ddf = dd.from_pandas(df)
cont_names, cat_names = dask_cont_cat_split(ddf)
```

```
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
```
```python
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                   'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                   'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                   'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                  })
ddf = dd.from_pandas(df)
ddf = dask_add_datepart(ddf, 'd1_date', drop=False)
ddf['cat1'] = ddf['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = dask_cont_cat_split(ddf, max_card=0)
```

```
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
```
### get_random_train_mask

    get_random_train_mask (df, train_frac=0.8, seed=None)

Return a boolean mask that marks a random `train_frac` fraction of the rows of `df` as training rows.
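The documented signature suggests behaviour along the following lines. This is a hypothetical sketch, not the library's implementation:

```python
import numpy as np
import pandas as pd

def my_random_train_mask(df, train_frac=0.8, seed=None):
    # Mark a random train_frac fraction of rows as training rows;
    # the remaining rows form the validation set.
    rng = np.random.default_rng(seed)
    return pd.Series(rng.random(len(df)) < train_frac, index=df.index)
```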
### RandomTrainMask

    RandomTrainMask (train_frac=0.8, seed=None)
A class to create a random train/validation set mask over the Dask dataframe.
### TabularDask

    TabularDask (ddf, procs=None, cat_names=None, cont_names=None, y_names=None,
                 y_block=None, train_mask_func=None, do_setup=True, device=None,
                 reset_index=True)

A Dask `DataFrame` wrapper that knows which cols are cont/cat/y, and returns rows in `__iter__`. The aim is to replicate the `TabularPandas` API as closely as possible.
- `ddf`: A Dask `DataFrame` of your data
- `cat_names`: Your categorical `x` variables
- `cont_names`: Your continuous `x` variables
- `y_names`: Your dependent `y` variables
  - Note: Mixed `y`s such as regression and classification are not currently supported; however, multiple regression or classification outputs are
- `y_block`: How to sub-categorize the type of `y_names` (`CategoryBlock` or `RegressionBlock`)
- `train_mask_func`: A function that creates a train/validation mask over a `DataFrame`. See `get_random_train_mask` for an example.
- `do_setup`: A parameter for whether `Tabular` will run the data through the `procs` upon initialization
- `device`: `cuda` or `cpu`
## Transforms

These transforms inherit from `TabularProc` and are applied as soon as the data is available, rather than as data is called from the `DataLoader`.
### DaskCategoryMap

    DaskCategoryMap (col, sort=True, add_na=False, strict=False)

Dask implementation of `CategoryMap`: a collection of categories with the reverse mapping in `o2i`.
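For reference, the behaviour mirrors fastai's pandas-level `CategoryMap`, where `o2i` holds the object-to-index reverse mapping:

```python
from fastai.data.transforms import CategoryMap

cmap = CategoryMap(['dog', 'cat', 'dog'], add_na=True)
cmap.items  # ['#na#', 'cat', 'dog']
cmap.o2i    # {'#na#': 0, 'cat': 1, 'dog': 2}
```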
### DaskCategorify

    DaskCategorify (cat_vocabs:'dict|None'=None)

Transform the categorical variables to something similar to `pd.Categorical`.

The `Categorify` class from `fastai.tabular.core` is modified to:

- be compatible with Dask
- accept existing vocabs through the `cat_vocabs` input
While you will not see a visual change in the `DataFrame`, the classes are stored in `to.procs.dask_categorify`, as we can see below on a dummy `DataFrame`:
```python
ddf = dd.from_pandas(pd.DataFrame({'a':[0,1,2,0,2]}))
to = TabularDask(ddf, DaskCategorify, 'a')
to.show()
```
| | a |
|---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 0 |
4 | 2 |
Each column's unique values are stored in a dictionary of `column:[values]`:

```python
cat = to.procs.dask_categorify
cat.classes
```

```
{'a': ['#na#', 0, 1, 2]}
```
We can provide an existing vocab, for example if pretrained weights will be used for a categorical variable:
```python
ddf = dd.from_pandas(pd.DataFrame({'a':['Cat','Dog','Lion','Leopard','Honey badger']}))
```
With default vocab:
```python
to = TabularDask(ddf, DaskCategorify, 'a')
cat = to.procs.dask_categorify
cat.classes
```

```
{'a': ['#na#', 'Cat', 'Dog', 'Honey badger', 'Leopard', 'Lion']}
```
With predefined vocab:
```python
vocab = {'a': ['Honey badger', 'Dog', 'Cat', 'Lion', 'Leopard']}
to = TabularDask(ddf, DaskCategorify(cat_vocabs=vocab), 'a')
cat = to.procs.dask_categorify
cat.classes
```

```
{'a': ['Honey badger', 'Dog', 'Cat', 'Lion', 'Leopard']}
```
### DaskNormalize

    DaskNormalize (cols=None)

Base class to write a non-lazy tabular processor for dataframes.
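After setup, the fitted statistics can be inspected on the proc instance. The snippet below is illustrative, reusing `to` from the integration example later on this page; `means` is shown there, while `stds` is an assumption by analogy with fastai's `Normalize`:

```python
norm = to.procs[-1]   # the DaskNormalize proc
norm.means            # per-column means over the training rows
norm.stds             # per-column standard deviations (assumed attribute)
```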
### DaskCategorize

    DaskCategorize (vocab=None, sort=True, add_na=False)

A transform with a `__repr__` that shows its attrs.
### DaskFillStrategy

    DaskFillStrategy ()

Namespace containing the various filling strategies. Currently, filling with the `median`, a `constant`, or the `mode` is supported.
### DaskFillMissing

    DaskFillMissing (fill_strategy=<function median>, add_col=True, fill_vals=None)

Fill the missing values in continuous columns.
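As a sketch of selecting a non-default strategy, assuming `DaskFillStrategy` exposes `median`, `constant`, and `mode` in the same way as fastai's `FillStrategy`, and that `fill_vals` maps column names to the constants to use:

```python
# Fill missing values in 'education-num' with a constant instead of the median
fill = DaskFillMissing(fill_strategy=DaskFillStrategy.constant,
                       fill_vals={'education-num': 0})
procs = [DaskCategorify, fill, DaskNormalize]
```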
### DaskRegressionSetup

    DaskRegressionSetup (c=None)

A Dask-compatible transform that floatifies targets.
We define basic `TransformBlock`s that are compatible with the Dask transforms:
### DaskCategoryBlock

    DaskCategoryBlock (vocab:collections.abc.MutableSequence|pandas.core.series.Series=None,
                       sort:bool=True, add_na:bool=False)

A Dask-compatible `TransformBlock` for single-label categorical targets.

| | Type | Default | Details |
|---|---|---|---|
| vocab | MutableSequence \| pd.Series | None | List of unique class names |
| sort | bool | True | Sort the classes alphabetically |
| add_na | bool | False | Add `#na#` to vocab |
### DaskRegressionBlock

    DaskRegressionBlock (n_out:int=None)

A Dask-compatible `TransformBlock` for float targets.

| | Type | Default | Details |
|---|---|---|---|
| n_out | int | None | Number of output values |
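These blocks can be passed explicitly as `y_block` when the inferred target type is not what you want. A hypothetical sketch, reusing names from the integration example below:

```python
# Force a single-output regression target instead of letting it be inferred
to = TabularDask(ddf_main, procs, cat_names, cont_names,
                 y_names='age', y_block=DaskRegressionBlock(n_out=1),
                 train_mask_func=get_random_train_mask)
```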
### DaskDataLoader

    DaskDataLoader (dataset=None, bs=None, num_workers=0, pin_memory=False,
                    timeout=0, batch_size=None, shuffle=False, drop_last=False,
                    indexed=None, n=None, device=None, persistent_workers=False,
                    pin_memory_device='', wif=None, before_iter=None,
                    after_item=None, before_batch=None, after_batch=None,
                    after_iter=None, create_batches=None, create_item=None,
                    create_batch=None, retain=None, get_idxs=None, sample=None,
                    shuffle_fn=None, do_batch=None)

Iterable dataloader for tabular learning with Dask.
## Integration example

For a more in-depth explanation, see the BigTabular tutorial.
```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
ddf_main.head()
```
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
```
```python
to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names="salary", train_mask_func=RandomTrainMask()
)
```
```python
dls = to.dataloaders()
dls.valid.show_batch()
```

```
/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Masters | Divorced | Exec-managerial | Not-in-family | White | False | 44.0 | 236745.998760 | 14.0 | >=50k |
1 | Self-emp-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | False | 38.0 | 112847.001163 | 15.0 | >=50k |
2 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 63210.002966 | 9.0 | <50k |
3 | Private | Bachelors | Never-married | #na# | Own-child | Black | False | 23.0 | 529222.995495 | 13.0 | <50k |
4 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | True | 43.0 | 84660.997258 | 10.0 | <50k |
5 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 49.0 | 247294.000118 | 9.0 | >=50k |
6 | Private | 11th | Married-civ-spouse | #na# | Husband | White | False | 42.0 | 70055.004990 | 7.0 | <50k |
7 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 45.0 | 242390.999669 | 13.0 | >=50k |
8 | Private | Some-college | Never-married | Sales | Not-in-family | Black | True | 41.0 | 140589.998619 | 10.0 | <50k |
```python
to.show()
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101320.0 | 12.0 | >=50k |
1 | Private | Masters | Divorced | Exec-managerial | Not-in-family | White | False | 44.0 | 236746.0 | 14.0 | >=50k |
2 | Private | HS-grad | Divorced | #na# | Unmarried | Black | True | 38.0 | 96185.0 | 10.0 | <50k |
3 | Self-emp-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | False | 38.0 | 112847.0 | 15.0 | >=50k |
4 | Self-emp-not-inc | 7th-8th | Married-civ-spouse | Other-service | Wife | Black | True | 42.0 | 82297.0 | 10.0 | <50k |
5 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | White | False | 20.0 | 63210.0 | 9.0 | <50k |
6 | Private | Some-college | Divorced | #na# | Other-relative | White | False | 49.0 | 44434.0 | 10.0 | <50k |
7 | Private | 11th | Married-civ-spouse | #na# | Husband | White | False | 37.0 | 138940.0 | 7.0 | <50k |
8 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 328216.0 | 9.0 | >=50k |
9 | Self-emp-inc | HS-grad | Married-civ-spouse | #na# | Husband | White | True | 36.0 | 216711.0 | 10.0 | >=50k |
We can decode any set of transformed data by calling `to.decode_row` with our raw data:

```python
row = to.items.head().iloc[0]
to.decode_row(row)
```
```
age                                49.0
workclass                       Private
fnlwgt                    101319.997963
education                    Assoc-acdm
education-num                      12.0
marital-status       Married-civ-spouse
occupation                         #na#
relationship                       Wife
race                              White
sex                              Female
capital-gain                          0
capital-loss                       1902
hours-per-week                       40
native-country            United-States
salary                              NaN
_int_train_mask                    True
education-num_na                  False
Name: 0, dtype: object
```
We can make new test datasets based on the training data with `to.new()`. Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data you should address this before training.
```python
to_tst = to.new(ddf_test)
to_tst.process()
to_tst.items.head()
```
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.467001 | 5 | 1.320081 | 10 | 1.174229 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
1 | -0.923288 | 5 | 1.234092 | 12 | -0.424305 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
2 | 1.052386 | 5 | 0.144505 | 2 | -1.223573 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
3 | 0.540174 | 5 | -0.283457 | 12 | -0.424305 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
4 | 0.759693 | 6 | 1.421398 | 9 | 0.374962 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |
We can then convert it to a `DataLoader`:

```python
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338104.994720 | 13.0 |
1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.004174 | 9.0 |
2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 53.000000 | 209021.999309 | 7.0 |
3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162030.000870 | 9.0 |
4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349230.001375 | 11.0 |
5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124826.998224 | 10.0 |
6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 53.000000 | 290640.003056 | 10.0 |
7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106273.000912 | 10.0 |
8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 72.000001 | 53684.002983 | 10.0 |
## Other target types

### Multi-label categories

#### one-hot encoded label
```python
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
```
```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
```

```python
ddf_main.head()
```
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | <NA> | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | <NA> | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
y_names = ["salary", "male", "white"]
```
```python
to = TabularDask(
    ddf_main, procs, cat_names, cont_names, y_names=y_names,
    y_block=MultiCategoryBlock(encoded=True, vocab=y_names),
    train_mask_func=get_random_train_mask
)
```

```
CPU times: user 966 ms, sys: 0 ns, total: 966 ms
Wall time: 1 s
```
```python
dls = to.dataloaders()
dls.valid.show_batch()
```

```
/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 1.013200e+05 | 12.0 | True | False | True |
1 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 3.282160e+05 | 9.0 | True | True | True |
2 | State-gov | Masters | Divorced | #na# | Not-in-family | White | False | 56.0 | 2.741110e+05 | 14.0 | False | True | True |
3 | Private | Some-college | Married-civ-spouse | #na# | Wife | Black | True | 40.0 | 1.889420e+05 | 10.0 | False | False | False |
4 | Private | HS-grad | Married-spouse-absent | #na# | Own-child | Black | True | 29.0 | 1.268339e+06 | 10.0 | False | True | False |
5 | Self-emp-not-inc | Some-college | Divorced | #na# | Unmarried | White | True | 47.0 | 2.137450e+05 | 10.0 | False | False | True |
6 | Private | 11th | Married-civ-spouse | #na# | Husband | White | False | 42.0 | 7.005500e+04 | 7.0 | False | True | True |
7 | Local-gov | HS-grad | Divorced | Adm-clerical | Unmarried | White | True | 44.0 | 1.501710e+05 | 10.0 | False | False | True |
8 | Private | Masters | Never-married | Exec-managerial | Not-in-family | White | True | 29.0 | 1.572620e+05 | 10.0 | False | False | True |
### Regression
```python
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main, df_test = df.iloc[:10000].copy(), df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
ddf_main, ddf_test = dd.from_pandas(df_main), dd.from_pandas(df_test)
```
```python
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [DaskCategorify, DaskFillMissing, DaskNormalize]
```
```python
to = TabularDask(ddf_main, procs, cat_names, cont_names, y_names='age',
                 train_mask_func=get_random_train_mask)
```

```
CPU times: user 860 ms, sys: 7.57 ms, total: 868 ms
Wall time: 871 ms
```
```python
to.procs[-1].means
```

```
fnlwgt           192459.673567
education-num        10.071133
dtype: float64
```
```python
dls = to.dataloaders()
dls.valid.show_batch()
```

```
/tmp/ipykernel_502869/2346102605.py:128: UserWarning: `shuffle` and `drop_last` are currently ignored.
  warnings.warn('`shuffle` and `drop_last` are currently ignored.')
```
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 55291.003489 | 10.0 | 30.0 |
1 | Private | HS-grad | Never-married | Handlers-cleaners | Own-child | Black | False | 746431.990793 | 9.0 | 26.0 |
2 | Private | HS-grad | Never-married | Sales | Other-relative | White | False | 91524.997303 | 9.0 | 18.0 |
3 | Private | Masters | Never-married | Exec-managerial | Not-in-family | White | True | 157262.001416 | 10.0 | 29.0 |
4 | Private | HS-grad | Married-civ-spouse | #na# | Husband | Amer-Indian-Eskimo | True | 216811.000529 | 10.0 | 30.0 |
5 | Private | HS-grad | Never-married | Sales | Own-child | White | True | 156084.000249 | 10.0 | 36.0 |
6 | Self-emp-not-inc | Doctorate | Married-civ-spouse | #na# | Husband | White | False | 65278.004714 | 16.0 | 32.0 |
7 | Local-gov | HS-grad | Never-married | Adm-clerical | Own-child | White | False | 129232.000243 | 9.0 | 23.0 |
8 | Private | HS-grad | Married-civ-spouse | Transport-moving | Wife | White | False | 123397.002997 | 9.0 | 31.0 |