Bodo 2021.11 Release (Date: 11/30/2021)¶
This release includes many new features, optimizations, bug fixes and usability improvements. Overall, 107 code patches were merged since the last release.
New Features and Improvements¶
-
Support for "wide" DataFrames with large number of columns:
-
Bodo compiler is transitioning to a new internal dataframe compilation format that substantially decreases compliation time for dataframes with thousands of columns.
All DataFrame APIs will transition to this new format over time.
-
read_csv,read_parquet,bodo.gathervand dataframe filtering are upgraded to support this new format in this release. - Connectors:
- Significantly improved performance when reading Parquet from S3 (up to 10x faster read depending on the dataset).
- General support for predicate pushdown when reading from Parquet (filtering rows at the storage level).
- Improvements to BodoSQL's filter pushdown, such as higher compiler accuracy in detecting possible filters.
- Faster
read_parquetcompilation time by validating the schema only at runtime. - Faster
pd.read_csv()execution time with large numbers of columns. - Bodo automatically maintains type information when passing DataFrames and Series between Bodo and regular Python. This avoids potential typing issues when parallel data chunks do not have enough non-null data for automatic type inference.
- Improved error messages and documentation.
- Pandas:
- Support for Array of dictionary outputs of
DataFrame.apply()andSeries.apply() - Support for Array of dictionary inputs to
pd.concat() - Support for
Series.astype(str)with Categorical type for non-string categories. - Support for callable arguments to
DataFrame.assign() - Support for passing a list as
skiprowsofpd.read_csv() - Support for
low_memoryargument inpd.read_csv() - Support for using a string label for indexing Series with string Index (for non-parallel Series)
- Support for initializing a Series with a constant dictionary
- Support for
subsetargument toDataFrame.drop_duplicates - Support for
DataFrame.plotwith argumentsx,y,kind,figsize,xlabel,ylabel,title,legend,fontsize,xticks,yticks, andax.DataFrame.plotbehaves the same as Bodo's Matplotlib support. - Support for
DataFrame.groupby.head - Numpy:
- Support for
np.select - ML:
- Support
predict_probaandpredict_log_probaforRandomForestClassifier,SGD ClassifierandLogisticRegression - Support
predict_probafor XGBoostClassifier - Support for
sklearn.metrics.confusion_matrix
-
BodoSQL 2021.11beta Release (Date: 11/30/2021)
This release includes SQL bug fixes and support for Bodo's filter pushdown from BodoSQL. Most of the improvements to BodoSQL are integrating enhancements made to Bodo. Overall, 10 code patches were merged since the last release.
New Features and Improvements¶
-
Support for a new filepath API
bodosql.TablePath. This API takes the path and file type and uses this to load/remove the data within the query.For example:
bc = bodosql.BodoSQLContext("table1": bodosql.TablePath("myfile.pq", "parquet")) return bc.query("Select A from table1")This is functionally equivalent to using the Pandas
read_functions inside a Bodo function, but it may have some additional performance optimizations.Currently only Parquet files are supported.
-
Support for Bodo's filter pushdown when using the
bodosql.TablePathAPI. -
Reduced compliation and execution time when using the
FIRST_VALUEfunction repeatedly on the same exact window. -
SQL Coverage
This release added the following additional SQL coverage to BodoSQL. Please refer to our documentation for more details regarding usage.
- Support for omitting the second argument from the
ROUNDfunction (defaults to 0). - Support for providing an integer as the second argument
DATE_ADDandDATE_SUB. If you pass an integer, it is assigneddaysas its unit.
- Support for omitting the second argument from the