Skip to content

Input/Output

See more in File IO, such as S3 and HDFS configuration requirements.

pd.read_csv

  • pandas.read_csv

    • example usage and more system specific instructions
    • filepath_or_buffer should be a string and is required. It could be pointing to a single CSV file, or a directory containing multiple partitioned CSV files (must have csv file extension inside directory).
    • Arguments sep, delimiter, header, names, index_col, usecols, dtype, nrows, skiprows, chunksize, parse_dates, and low_memory are supported.
    • Either names and dtype arguments should be provided to enable type inference, or filepath_or_buffer should be inferrable as a constant string. This is required so bodo can infer the types at compile time, see compile time constants
    • names, usecols, parse_dates should be constant lists.
    • dtype should be a constant dictionary of strings and types.
    • skiprows must be an integer or list of integers and if it is not a constant, names must be provided to enable type inference.
    • chunksize is supported for uncompressed files only.
    • low_memory internally process file in chunks while parsing. In Bodo this is set to False by default.
    • When set to True, Bodo parses file in chunks but like Pandas the entire file is read into a single DataFrame regardless.
    • If you want to load data in chunks, use the chunksize argument.
    • When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.

pd.read_excel

  • pandas.read_excel

    • output dataframe cannot be parallelized automatically yet.
    • only arguments io, sheet_name, header, names, comment, dtype, skiprows, parse_dates are supported.
    • io should be a string and is required.
    • Either names and dtype arguments should be provided to enable type inference, or io should be inferrable as a constant string. This is required so bodo can infer the types at compile time, see compile time constants
    • sheet_name, header, comment, and skiprows should be constant if provided.
    • names and parse_dates should be constant lists if provided.
    • dtype should be a constant dictionary of strings and types if provided.

pd.read_sql

  • pandas.read_sql

    • example usage and more system specific instructions
    • Argument sql is supported but only as a string form. SQLalchemy Selectable is not supported. There is no restriction on the form of the sql request.
    • Argument con is supported but only as a string form. SQLalchemy connectable is not supported.
    • Argument index_col is supported.
    • Arguments chunksize, column, coerce_float, params are not supported.

pd.read_parquet

  • pandas.read_parquet

    • example usage and more system specific instructions
    • Arguments path and columns are supported. columns should be a constant list of strings if provided.
    • Argument anon of storage_options is supported for S3 filepaths.
    • If path can be inferred as a constant (e.g. it is a function argument), Bodo finds the schema from file at compilation time. Otherwise, schema should be provided using the numba syntax.

      For example:

      @bodo.jit(locals={'df':{'A': bodo.float64[:],
                              'B': bodo.string_array_type}})
      def impl(f):
        df = pd.read_parquet(f)
        return df
      

pd.read_json

  • pandas.read_json

    • Example usage and more system specific instructions
    • Only supports reading JSON Lines text file format (pd.read_json(filepath_or_buffer, orient='records', lines=True)) and regular multi-line JSON file(pd.read_json(filepath_or_buffer, orient='records', lines=False)).
    • Argument filepath_or_buffer is supported: it can point to a single JSON file, or a directory containing multiple partitioned JSON files. When reading a directory, the JSON files inside the directory must be JSON Lines text file format with json file extension.
    • Argument orient = 'records' is used as default, instead of Pandas' default 'columns' for dataframes. 'records' is the only supported value for orient.
    • Argument typ is supported. 'frame' is the only supported value for typ.
    • filepath_or_buffer must be inferrable as a constant string. This is required so bodo can infer the types at compile time, see compile time constants.
    • Arguments convert_dates, precise_float, lines are supported.
Back to top