pd.DataFrame.to_parquet
pandas.DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None)
Supported Arguments
- `path` is a required argument and must be a string. When writing distributed DataFrames, the path refers to a directory of parquet files.
- `engine` only supports `"auto"` and `"pyarrow"`. Default: `"auto"`, which uses the pyarrow engine.
- `compression` must be one of `"snappy"`, `"gzip"`, `"brotli"`, or `None`. Default: `"snappy"`.
- `index` must be a constant bool or `None`. Default: `None`.
- `partition_cols` is supported in most cases, except when the columns in the DataFrame cannot be determined at compile time. This must be a list of column names or `None`. Default: `None`.
- `storage_options` supports only the default value `None`.
- `row_group_size` specifies the maximum size of the row-groups in the generated parquet files; the actual size of the written row-groups may be smaller than this value. This must be an integer. If not specified, Bodo writes row-groups with 1M rows. See the sketch after this list for an example of passing these arguments.
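A minimal sketch of passing these arguments from a Bodo-jitted function; the file name, column names, and data values below are illustrative, not part of the API.

```python
import pandas as pd
import bodo

@bodo.jit
def write_parquet(df):
    # Arguments as documented above: compression, partition_cols (a column
    # known at compile time), and row_group_size (max rows per row-group).
    df.to_parquet(
        "out_data.pq",
        compression="gzip",
        partition_cols=["year"],
        row_group_size=500_000,
    )

df = pd.DataFrame({"year": [2019, 2020, 2020], "value": [1.0, 2.5, 3.2]})
write_parquet(df)
```

With `partition_cols` set, the output path is a directory partitioned by the given column, and when run on multiple cores Bodo writes one file per core in parallel.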
Note
Bodo writes multiple files in parallel (one per core), and the total number of row-groups across all files is roughly max(num_cores, total_rows / row_group_size).
The size of the row-groups can affect read performance significantly. In general, the dataset should have at least as many row-groups as the number of cores used for reading, and ideally many more.
At the same time, row-groups should not be too small, since that adds overhead at read time.
For more details, refer to the Parquet file format documentation.
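As a rough illustration of this sizing guidance, one can work backward from an expected reader core count; the dataset size, core count, and target of a few row-groups per core below are assumptions for the example, not recommendations from the library.

```python
# Hypothetical sizing exercise for row_group_size.
total_rows = 100_000_000   # rows in the dataset (assumed)
reader_cores = 64          # cores expected to read the data (assumed)
groups_per_core = 4        # aim for a few row-groups per reading core

row_group_size = total_rows // (reader_cores * groups_per_core)
print(row_group_size)  # 390625 rows per row-group

# Total row-groups across all files is then roughly
# max(num_cores, total_rows / row_group_size) = max(num_cores, 256).
```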