Dataframe wrangling#

Wrangling dataframes is a common need. For us this mostly involves transforming manifest files - partitioning, merging, creating train-val-test splits, filtering rows and/or columns, …

We implemented a CLI tool for this. It exposes to the CLI all the transforms defined in the dataframe transforms module, but importantly it also allows the user to chain them in a single CLI call.

Calling a single transform#

$ serotiny dataframe \
      filter_columns my_input.csv --columns='["cellid","crop_seg","crop_raw"]' \
      --output_path my_output.csv

Chaining multiple transforms#

$ serotiny dataframe \
      filter_rows my_input.csv cellid '[1,2,3,4]' --exclude - \
      filter_columns ... --columns='["cellid","crop_seg","crop_raw"]' - \
      split_dataframe ... --train_frac=0.7 --return_splits=false - \
      --output_path my_output.csv

Note the - at the end of the steps. That signifies we’re done providing arguments to the step that precedes it. Note also the ... as the first argument to the second and third steps. That is a placeholder, letting the CLI know which of the input arguments to be provided with the result of the previous step

Testing dataframe wrangling pipelines#

To test and experiment with dataframe wrangling pipelines, we provide a helper transform which generates a random dataframe, which you can use as a starting point:

$ serotiny dataframe \
      make_random_df --columns='["a", "b", "c"]' --n_rows=50 -\
      # you can chain the steps you want to test here!
      --output_path my_output.csv