pd.DataFrame.to_parquet
pandas.DataFrame.to_parquet(path, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None)
Supported Arguments
- path is a required argument and must be a string. When writing distributed DataFrames, the path refers to a directory of parquet files.
- engine argument only supports "auto" and "pyarrow". Default: "auto", which uses the pyarrow engine.
- compression argument must be one of "snappy", "gzip", "brotli", or None. Default: "snappy".
- index argument must be a constant bool or None. Default: None.
- partition_cols argument is supported in most cases, except when the columns of the DataFrame cannot be determined at compile time. It must be a list of column names or None. Default: None.
- storage_options argument supports only the default value None.
- row_group_size argument specifies the maximum size of the row-groups in the generated parquet files; the actual size of the written row-groups may be smaller than this value. It must be an integer. If not specified, Bodo writes row-groups with 1M rows. See the usage sketch after this list.
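Below is a minimal usage sketch of calling to_parquet inside a Bodo JIT function with the arguments described above. The output path, column names, and data are hypothetical placeholders, not values required by Bodo.

```python
import pandas as pd
import bodo

@bodo.jit
def write_example(df):
    # When run on multiple cores, "out_data.pq" is written as a
    # directory of parquet files, one file per core.
    df.to_parquet(
        "out_data.pq",
        compression="gzip",
        index=False,
        row_group_size=100_000,
    )

df = pd.DataFrame({"year": [2020, 2021, 2021], "val": [1.0, 2.5, 3.3]})
write_example(df)
```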
Note
Bodo writes multiple files in parallel (one per core), and the total number of row-groups across all files is roughly max(num_cores, total_rows / row_group_size).
The size of the row-groups can affect read performance significantly. In general, the dataset should have at least as many row-groups as the number of cores used for reading, and ideally many more.
At the same time, the row-groups should not be too small, since very small row-groups add overhead at read time.
For more details, refer to the Parquet file format documentation.
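As a rough illustration of this guidance, the helper below picks a row_group_size so that a dataset ends up with several row-groups per reading core while keeping each group above a minimum size. The target of 8 groups per core and the 10,000-row floor are assumptions chosen for this sketch, not Bodo defaults or recommendations.

```python
def pick_row_group_size(total_rows, read_cores, groups_per_core=8,
                        min_rows_per_group=10_000):
    # Aim for noticeably more row-groups than reading cores, but avoid
    # tiny row-groups that add per-group overhead at read time.
    target_groups = read_cores * groups_per_core
    return max(total_rows // target_groups, min_rows_per_group)

# e.g. 50M rows read on 64 cores -> roughly 97k rows per row-group
print(pick_row_group_size(50_000_000, 64))
```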