Input/Output¶
See File IO for more details, such as S3 and HDFS configuration requirements.
pd.read_csv¶
`pandas.read_csv` - example usage and more system specific instructions
- `filepath_or_buffer` should be a string and is required. It can point to a single CSV file, or to a directory containing multiple partitioned CSV files (the files inside the directory must have the `csv` file extension).
- Arguments `sep`, `delimiter`, `header`, `names`, `index_col`, `usecols`, `dtype`, `nrows`, `skiprows`, `chunksize`, `parse_dates`, and `low_memory` are supported.
- Argument `anon` of `storage_options` is supported for S3 filepaths.
- Either the `names` and `dtype` arguments should be provided to enable type inference, or `filepath_or_buffer` should be inferrable as a constant string. This is required so Bodo can infer the types at compile time; see compile time constants.
- `names`, `usecols`, and `parse_dates` should be constant lists.
- `dtype` should be a constant dictionary of strings and types.
- `skiprows` must be an integer or a list of integers, and if it is not a constant, `names` must be provided to enable type inference.
- `chunksize` is supported for uncompressed files only.
- `low_memory` internally processes the file in chunks while parsing. In Bodo this is set to `False` by default.
    - When set to `True`, Bodo parses the file in chunks, but as in Pandas the entire file is still read into a single DataFrame regardless.
    - If you want to load data in chunks, use the `chunksize` argument.
- When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.
- `_bodo_read_as_dict` is a Bodo-specific argument which forces the specified string columns to be read with dictionary encoding. Dictionary encoding stores data in memory in an efficient manner and is most effective when the column has many repeated values. Read more about the dictionary-encoded layout here.
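For example, here is a minimal sketch of forcing dictionary encoding on specific string columns; the file name and column names are hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def read_csv_dict_encoded():
    # "example.csv" and columns "A" and "B" are hypothetical; the
    # listed string columns are read with dictionary encoding.
    df = pd.read_csv("example.csv", _bodo_read_as_dict=["A", "B"])
    return df
```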
pd.read_excel¶
`pandas.read_excel` - output dataframe cannot be parallelized automatically yet.
- Only arguments `io`, `sheet_name`, `header`, `names`, `comment`, `dtype`, `skiprows`, and `parse_dates` are supported.
- `io` should be a string and is required.
- Either the `names` and `dtype` arguments should be provided to enable type inference, or `io` should be inferrable as a constant string. This is required so Bodo can infer the types at compile time; see compile time constants.
- `sheet_name`, `header`, `comment`, and `skiprows` should be constant if provided.
- `names` and `parse_dates` should be constant lists if provided.
- `dtype` should be a constant dictionary of strings and types if provided.
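As a sketch of the type-inference rule above, providing constant `names` and `dtype` lets Bodo type the output even when the path is not a compile-time constant; the file, column names, and types here are hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def read_excel_example(path):
    # Column names and types below are hypothetical; constant names
    # and dtype let Bodo type the output without opening the file.
    df = pd.read_excel(
        path,
        names=["A", "B"],
        dtype={"A": int, "B": str},
    )
    return df
```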
pd.read_sql¶
`pandas.read_sql` - example usage and more system specific instructions
- Argument `sql` is supported, but only in string form. SQLAlchemy `Selectable` is not supported. There is no restriction on the form of the SQL request.
- Argument `con` is supported, but only in string form. SQLAlchemy `connectable` is not supported.
- Argument `index_col` is supported.
- Arguments `chunksize`, `column`, `coerce_float`, and `params` are not supported.
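A minimal sketch of passing both `sql` and `con` as plain strings; the query, credentials, and database below are hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def read_sql_example():
    # The query, credentials, and database are hypothetical; both
    # sql and con are passed as plain strings, not SQLAlchemy objects.
    df = pd.read_sql(
        "SELECT * FROM employees",
        "mysql+pymysql://user:password@host:3306/dbname",
    )
    return df
```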
pd.read_sql_table¶
`pandas.read_sql_table` - This API only supports reading Iceberg tables at the moment.
- See the Iceberg section for example usage and more system specific instructions.
- Argument `table_name` is supported and must be the name of an Iceberg table.
- Argument `con` is supported, but only as a string in URL form. SQLAlchemy `connectable` is not supported. It should be the absolute path to an Iceberg warehouse. If using a Hadoop-based directory catalog, it should start with the URL scheme `iceberg://`. If using a Thrift Hive catalog, it should start with the URL scheme `iceberg+thrift://`.
- Argument `schema` is supported and currently required for Iceberg tables. It must be the name of the database schema. For Iceberg tables, this is the directory name in the warehouse (specified by `con`) where your table exists.
- Arguments `index_col`, `coerce_float`, `parse_dates`, `columns`, and `chunksize` are not supported.
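A minimal sketch of reading an Iceberg table from a Hadoop-based directory catalog; the warehouse path, schema, and table name are all hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def read_iceberg_example():
    # All names are hypothetical: "my_table" is the Iceberg table,
    # con points at the warehouse, and "my_schema" is the directory
    # in the warehouse that contains the table.
    df = pd.read_sql_table(
        table_name="my_table",
        con="iceberg://path/to/warehouse",
        schema="my_schema",
    )
    return df
```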
pd.read_parquet¶
`pandas.read_parquet` - example usage and more system specific instructions

- Arguments `path` and `columns` are supported. `columns` should be a constant list of strings if provided. `path` can be a string or a list. If a string, it must be a path to a file or a directory, or a glob string. If a list, it must contain paths to parquet files (not directories) or glob strings.
- Argument `anon` of `storage_options` is supported for S3 filepaths.
- If `path` can be inferred as a constant (e.g. it is a function argument), Bodo finds the schema from the file at compilation time. Otherwise, the schema should be provided using the Numba syntax.
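For example, here is a sketch of supplying the schema through Numba's `locals` annotation; the column names and types are hypothetical, and it assumes Bodo's exported type names `bodo.float64` and `bodo.string_array_type`:

```python
import bodo
import pandas as pd

# Column names "A" and "B" and their types are hypothetical. The
# locals annotation supplies the schema of df, so Bodo does not
# need to inspect the file at compile time.
@bodo.jit(locals={"df": {"A": bodo.float64[:],
                         "B": bodo.string_array_type}})
def impl(f):
    df = pd.read_parquet(f)
    return df
```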
- `_bodo_input_file_name_col` is a Bodo-specific argument. When specified, a column with this name is added to the dataframe, containing the name of the file each row was read from. This is similar to SparkSQL's `input_file_name` function.
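For example, here is a minimal sketch; the column name `fname` is hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def impl(path):
    # Adds a column named "fname" (hypothetical) recording which
    # file each row was read from.
    df = pd.read_parquet(path, _bodo_input_file_name_col="fname")
    return df
```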
- `_bodo_read_as_dict` is a Bodo-specific argument which forces the specified string columns to be read with dictionary encoding. Bodo automatically loads string columns using dictionary encoding when it determines, based on a heuristic, that it would be beneficial. Dictionary encoding stores data in memory in an efficient manner and is most effective when the column has many repeated values. Read more about the dictionary-encoded layout here.
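For example, here is a minimal sketch; the column names are hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def impl(path):
    # Columns "A", "B", and "C" are hypothetical string columns
    # that will be read with dictionary encoding.
    df = pd.read_parquet(path, _bodo_read_as_dict=["A", "B", "C"])
    return df
```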
pd.read_json¶
`pandas.read_json` - example usage and more system specific instructions
- Only supports reading the JSON Lines text file format (`pd.read_json(filepath_or_buffer, orient='records', lines=True)`) and regular multi-line JSON files (`pd.read_json(filepath_or_buffer, orient='records', lines=False)`).
- Argument `filepath_or_buffer` is supported: it can point to a single JSON file, or to a directory containing multiple partitioned JSON files. When reading a directory, the files inside it must be in the JSON Lines text file format and have the `json` file extension.
- Argument `orient = 'records'` is used as the default, instead of Pandas' default `'columns'` for dataframes. `'records'` is the only supported value for `orient`.
- Argument `typ` is supported. `'frame'` is the only supported value for `typ`.
- `filepath_or_buffer` must be inferrable as a constant string. This is required so Bodo can infer the types at compile time; see compile time constants.
- Arguments `convert_dates`, `precise_float`, and `lines` are supported.
- Argument `anon` of `storage_options` is supported for S3 filepaths.
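A minimal sketch of reading a JSON Lines file; the file name is hypothetical:

```python
import bodo
import pandas as pd

@bodo.jit
def read_json_example():
    # "records.json" is a hypothetical JSON Lines file; the constant
    # path lets Bodo infer the output types at compile time.
    df = pd.read_json("records.json", orient="records", lines=True)
    return df
```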