Input/Output¶
See File IO for more details, such as S3 and HDFS configuration requirements.
pd.read_csv¶
`pandas.read_csv` - example usage and more system-specific instructions
- `filepath_or_buffer` should be a string and is required. It can point to a single CSV file, or to a directory containing multiple partitioned CSV files (the files inside the directory must have the `csv` file extension).
- Arguments `sep`, `delimiter`, `header`, `names`, `index_col`, `usecols`, `dtype`, `nrows`, `skiprows`, `chunksize`, `parse_dates`, and `low_memory` are supported.
- Either the `names` and `dtype` arguments should be provided to enable type inference, or `filepath_or_buffer` should be inferrable as a constant string. This is required so Bodo can infer the types at compile time, see compile time constants (a sketch of both cases follows this list).
- `names`, `usecols`, and `parse_dates` should be constant lists.
- `dtype` should be a constant dictionary of strings and types.
- `skiprows` must be an integer or list of integers and, if it is not constant, `names` must be provided to enable type inference.
- `chunksize` is supported for uncompressed files only.
- `low_memory` internally processes the file in chunks while parsing. In Bodo this is set to `False` by default.
    - When set to `True`, Bodo parses the file in chunks, but like Pandas the entire file is still read into a single DataFrame.
    - If you want to load data in chunks, use the `chunksize` argument.
- When a CSV file is read in parallel (distributed mode) and each process reads only a portion of the file, reading columns that contain line breaks is not supported.
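A minimal sketch of both typing paths is below; the file names, column names, and dtypes are placeholders, not part of the Bodo documentation.

```python
import bodo
import numpy as np
import pandas as pd

# Constant file name: Bodo reads the file at compile time and infers the
# column types itself (placeholder path and columns).
@bodo.jit
def read_constant():
    return pd.read_csv("example_data.csv", sep=",", parse_dates=["ts"])

# File name built at runtime: supply names and dtype so the output type is
# known at compile time.
@bodo.jit
def read_computed(folder):
    fname = folder + "/example_data.csv"
    return pd.read_csv(
        fname,
        names=["a", "b"],
        dtype={"a": np.int64, "b": np.float64},
    )
```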
pd.read_excel¶
`pandas.read_excel` - output dataframe cannot be parallelized automatically yet.
- Only arguments `io`, `sheet_name`, `header`, `names`, `comment`, `dtype`, `skiprows`, and `parse_dates` are supported.
- `io` should be a string and is required.
- Either the `names` and `dtype` arguments should be provided to enable type inference, or `io` should be inferrable as a constant string. This is required so Bodo can infer the types at compile time, see compile time constants (a sketch follows this list).
- `sheet_name`, `header`, `comment`, and `skiprows` should be constant if provided.
- `names` and `parse_dates` should be constant lists if provided.
- `dtype` should be a constant dictionary of strings and types if provided.
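A minimal sketch, assuming a placeholder file, sheet, and columns; recall that the output dataframe is not parallelized automatically yet.

```python
import bodo
import numpy as np
import pandas as pd

@bodo.jit
def read_sheet():
    # io is a constant string; sheet_name, header, and dtype are constants.
    # The file name, sheet name, and column names are placeholders.
    df = pd.read_excel(
        "example_data.xlsx",
        sheet_name="Sheet1",
        header=0,
        dtype={"a": np.int64, "b": np.float64},
    )
    return df
```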
pd.read_sql¶
`pandas.read_sql` - example usage and more system-specific instructions
- Argument `sql` is supported but only as a string; SQLAlchemy `Selectable` is not supported. There is no restriction on the form of the SQL request (see the sketch after this list).
- Argument `con` is supported but only as a string; SQLAlchemy `connectable` is not supported.
- Argument `index_col` is supported.
- Arguments `chunksize`, `columns`, `coerce_float`, and `params` are not supported.
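A minimal sketch with a hypothetical table and connection URL; both `sql` and `con` are passed as plain strings.

```python
import bodo
import pandas as pd

@bodo.jit
def load_orders():
    # The query and the connection are plain strings; SQLAlchemy objects are
    # not supported. The table, columns, and URL below are placeholders.
    return pd.read_sql(
        "SELECT order_id, amount FROM orders",
        "postgresql://user:password@localhost:5432/shop",
        index_col="order_id",
    )
```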
pd.read_parquet¶
`pandas.read_parquet` - example usage and more system-specific instructions
- Arguments `path` and `columns` are supported. `columns` should be a constant list of strings if provided.
- Argument `anon` of `storage_options` is supported for S3 filepaths.
- If `path` can be inferred as a constant (e.g. it is a function argument), Bodo finds the schema from the file at compilation time. Otherwise, the schema should be provided using the numba syntax. For example, see the sketch below.
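A minimal sketch of both cases; the file names and column types are placeholders, and the `locals` type annotation shown for the non-constant path is an assumption about the numba-style syntax, so check it against your Bodo version.

```python
import bodo
import pandas as pd

# path is a function argument, so Bodo can treat it as a constant and read
# the schema from the file at compilation time.
@bodo.jit
def read_inferred(path):
    return pd.read_parquet(path, columns=["A", "B"])

# path is computed at runtime, so the schema is declared up front
# (assumed numba-style type annotations; placeholder column types).
@bodo.jit(locals={"df": {"A": bodo.float64[:], "B": bodo.string_array_type}})
def read_declared(folder):
    df = pd.read_parquet(folder + "/part.pq")
    return df
```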
pd.read_json¶
`pandas.read_json` - example usage and more system-specific instructions
- Only supports reading the JSON Lines text file format (`pd.read_json(filepath_or_buffer, orient='records', lines=True)`) and regular multi-line JSON files (`pd.read_json(filepath_or_buffer, orient='records', lines=False)`).
- Argument `filepath_or_buffer` is supported: it can point to a single JSON file, or to a directory containing multiple partitioned JSON files. When reading a directory, the JSON files inside it must be in the JSON Lines text file format and have the `json` file extension.
- Argument `orient = 'records'` is used as the default, instead of Pandas' default `'columns'` for dataframes. `'records'` is the only supported value for `orient`.
- Argument `typ` is supported. `'frame'` is the only supported value for `typ`.
- `filepath_or_buffer` must be inferrable as a constant string. This is required so Bodo can infer the types at compile time, see compile time constants. A sketch follows this list.
- Arguments `convert_dates`, `precise_float`, and `lines` are supported.
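A minimal sketch reading a placeholder JSON Lines file; `orient='records'` with `lines=True` is the JSON Lines mode described above.

```python
import bodo
import pandas as pd

@bodo.jit
def read_events():
    # filepath_or_buffer is a constant string so Bodo can infer types at
    # compile time; the file name is a placeholder.
    return pd.read_json("example_events.json", orient="records", lines=True)
```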