pandas.read_parquet

  • See the Bodo documentation for example usage and more system-specific instructions.
  • The arguments path and columns are supported. columns should be a constant list of strings if provided. path can be a string or a list. If it is a string, it must be a path to a file or a directory, or a glob string. If it is a list, it must contain paths to Parquet files (not directories) or glob strings.
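
    For example (a minimal sketch assuming a hypothetical glob data/part*.parquet and columns named A and B):

    import pandas as pd
    import bodo

    @bodo.jit
    def impl():
        # Read only columns A and B from all files matching the glob
        df = pd.read_parquet("data/part*.parquet", columns=["A", "B"])
        return df
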
  • The anon argument of storage_options is supported for S3 file paths.
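
    For example (a sketch assuming a hypothetical public S3 bucket; anon=True reads without AWS credentials):

    import pandas as pd
    import bodo

    @bodo.jit
    def impl():
        # Read from a public S3 bucket anonymously
        df = pd.read_parquet("s3://my-bucket/data.parquet",
                             storage_options={"anon": True})
        return df
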
  • If path can be inferred as a constant (e.g. it is a function argument), Bodo determines the schema from the file at compilation time. Otherwise, the schema should be provided using the Numba syntax.

    For example:

    import pandas as pd
    import bodo

    # Schema of df is declared since the path is not a compile-time constant
    @bodo.jit(locals={'df':{'A': bodo.float64[:],
                            'B': bodo.string_array_type}})
    def impl(f):
        df = pd.read_parquet(f)
        return df
    

  • _bodo_input_file_name_col is a Bodo-specific argument. When specified, a column with this name is added to the dataframe, containing the name of the file each row was read from. This is similar to SparkSQL's input_file_name function.

    For example:

    import pandas as pd
    import bodo

    @bodo.jit
    def impl(f):
        # Adds a column "fname" containing the source file name of each row
        df = pd.read_parquet(f, _bodo_input_file_name_col="fname")
        return df
    

  • _bodo_read_as_dict is a Bodo-specific argument that forces the specified string columns to be read with dictionary encoding. Bodo automatically loads string columns using dictionary encoding when a heuristic determines it would be beneficial. Dictionary encoding stores data in memory efficiently and is most effective when the column has many repeated values. Read more about the dictionary-encoded layout in the Apache Arrow columnar format documentation.

    For example:

    import pandas as pd
    import bodo

    @bodo.jit
    def impl(f):
        # Force columns A, B, and C to be read as dictionary-encoded strings
        df = pd.read_parquet(f, _bodo_read_as_dict=["A", "B", "C"])
        return df