9. Supported Machine Learning

9.1. Supported Scikit-learn

Below is the list of scikit-learn classes and functions that Bodo supports natively inside JIT functions. This list will expand regularly as we add support for more APIs. Optional arguments are not supported unless specified.

9.1.1. Linear Classifiers

sklearn.linear_model.LogisticRegression

This class provides a logistic regression classifier.
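A minimal sketch of training this classifier inside a JIT function, assuming `fit` and `predict` are among the supported methods. The function name `fit_and_predict`, the toy data, and the no-op `jit` fallback (used only when Bodo is not installed) are illustrative assumptions, not part of the Bodo API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption for this sketch: without Bodo installed, fall back to a
    # no-op decorator so the function runs as ordinary Python.
    def jit(func):
        return func


@jit
def fit_and_predict(X, y):
    # No optional constructor arguments are passed, since optional
    # arguments are not supported unless stated otherwise.
    clf = LogisticRegression()
    clf.fit(X, y)
    return clf.predict(X)


# Hypothetical toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50, dtype=np.int64), np.ones(50, dtype=np.int64)])

pred = fit_and_predict(X, y)
accuracy = float(np.mean(pred == y))
```

The model is created inside the JIT function so that Bodo, when present, can compile the whole training step.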

Methods:

sklearn.linear_model.SGDClassifier

This class provides linear classification models with SGD optimization, which allows distributed large-scale learning.

SGDClassifier(loss='hinge') is equivalent to a linear SVM classifier.

SGDClassifier(loss='log') is equivalent to a logistic regression classifier.

  • Supported loss functions are hinge and log.

  • early_stopping is not supported yet.
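The hinge-loss configuration above could be used as in the following sketch, assuming `fit` and `predict` are among the supported methods. The helper name `train_linear_svm`, the toy data, and the no-op `jit` fallback are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def train_linear_svm(X, y):
    # loss='hinge' gives a linear SVM-style classifier trained with SGD.
    # early_stopping is left at its default because it is not supported.
    clf = SGDClassifier(loss='hinge')
    clf.fit(X, y)
    return clf.predict(X)


# Hypothetical toy data: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50, dtype=np.int64), np.ones(50, dtype=np.int64)])

pred = train_linear_svm(X, y)
accuracy = float(np.mean(pred == y))
```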

Methods:

sklearn.svm.LinearSVC

This class provides Linear Support Vector Classification.

Methods:

9.1.2. Linear Regressors

sklearn.linear_model.LinearRegression

This class provides linear regression support. Note: Multilabel targets are not currently supported.

Methods:

sklearn.linear_model.Ridge

This class provides ridge regression support.

Methods:

sklearn.linear_model.SGDRegressor

This class provides linear regression models with SGD optimization, which allows distributed large-scale learning.

SGDRegressor(loss='squared_loss', penalty=None) is equivalent to linear regression.

SGDRegressor(loss='squared_loss', penalty='l2') is equivalent to Ridge regression.

SGDRegressor(loss='squared_loss', penalty='l1') is equivalent to Lasso regression.

  • The only supported loss function is squared_loss.

  • early_stopping is not supported yet.
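A minimal sketch of the Ridge-like configuration, assuming `fit` and `predict` are among the supported methods. The `loss` argument is left at its default, which is the squared loss, because the string naming that loss differs between scikit-learn releases. The helper name `train_sgd_ridge`, the synthetic data, and the no-op `jit` fallback are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def train_sgd_ridge(X, y):
    # penalty='l2' with the default squared loss behaves like Ridge regression.
    reg = SGDRegressor(penalty='l2')
    reg.fit(X, y)
    return reg.predict(X)


# Hypothetical data: standardized features and a nearly linear target.
rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, (200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.01, 200)

pred = train_sgd_ridge(X, y)

# Coefficient of determination, computed by hand for the check below.
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

SGD is sensitive to feature scale, so the sketch uses features that are already standardized.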

Methods:

sklearn.linear_model.Lasso

This class provides Lasso regression support.

Methods:

9.1.3. Clustering

sklearn.cluster.KMeans

This class provides K-Means clustering models, which allow distributed large-scale unsupervised learning.
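A minimal sketch of clustering inside a JIT function, assuming `fit` and `predict` are among the supported methods. The constructor is called without arguments, so scikit-learn's default of 8 clusters applies. The helper name `cluster_points`, the random data, and the no-op `jit` fallback are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def cluster_points(X):
    # Default constructor: no optional arguments are passed.
    km = KMeans()
    km.fit(X)
    return km.predict(X)


# Hypothetical data: 200 random points in the plane.
rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, (200, 2))

labels = cluster_points(X)
```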

Methods:

9.1.4. Ensemble Methods

sklearn.ensemble.RandomForestClassifier

This class provides a Random Forest classifier, an ensemble learning model, for distributed large-scale learning.

  • The random_state value is ignored when running on multiple nodes.

Methods:

sklearn.ensemble.RandomForestRegressor

This class provides a Random Forest regressor, an ensemble learning model, for distributed large-scale learning.

  • The random_state value is ignored when running on multiple nodes.

Methods:

9.1.5. Naive Bayes

sklearn.naive_bayes.MultinomialNB

This class provides a Naive Bayes classifier for multinomial models with distributed large-scale learning.

Methods:

9.1.8. Data Preprocessing

sklearn.preprocessing.StandardScaler

This class provides Standard Scaler support to center your data and to scale it to achieve unit variance.
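A minimal sketch of standardizing features inside a JIT function, assuming `fit` and `transform` are among the supported methods. The helper name `standardize`, the random data, and the no-op `jit` fallback are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def standardize(X):
    scaler = StandardScaler()
    scaler.fit(X)
    # Each column is centered and divided by its standard deviation.
    return scaler.transform(X)


# Hypothetical data with a non-zero mean and non-unit spread.
rng = np.random.default_rng(4)
X = rng.normal(5.0, 3.0, (100, 3))

scaled = standardize(X)
```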

Methods:

sklearn.preprocessing.MinMaxScaler

This class provides MinMax Scaler support to scale your data based on the range of its features.
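A minimal sketch of range scaling inside a JIT function, assuming `fit` and `transform` are among the supported methods. The helper name `rescale`, the random data, and the no-op `jit` fallback are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def rescale(X):
    scaler = MinMaxScaler()
    scaler.fit(X)
    # With default settings, each column is mapped onto the range [0, 1].
    return scaler.transform(X)


# Hypothetical data spread over an arbitrary range.
rng = np.random.default_rng(5)
X = rng.uniform(-10.0, 25.0, (100, 3))

scaled = rescale(X)
```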

Methods:

sklearn.preprocessing.LabelEncoder

This class provides LabelEncoder support to encode target labels (y) with values between 0 and n_classes-1.
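A minimal sketch of label encoding inside a JIT function, assuming `fit` and `transform` are among the supported methods. The helper name `encode_labels`, the sample labels, and the no-op `jit` fallback are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def encode_labels(y):
    le = LabelEncoder()
    le.fit(y)
    # Distinct labels are sorted and replaced by their index: 10 -> 0, 20 -> 1, 30 -> 2.
    return le.transform(y)


# Hypothetical target labels with three distinct classes.
y = np.array([30, 10, 20, 10, 30])

encoded = encode_labels(y)
```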

Methods:

9.1.9. Feature Extraction

sklearn.feature_extraction.text.HashingVectorizer

This class provides HashingVectorizer support to convert a collection of text documents to a matrix of token occurrences.

Methods:

9.1.10. Model Selection

  • sklearn.model_selection.train_test_split()

    • Currently, only two inputs are supported, each a NumPy array or a pandas DataFrame.

    • Arguments train_size and test_size accept only a float between 0.0 and 1.0, or None.

    • Arguments random_state and shuffle are supported.

    • Argument stratify is not supported yet.
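The constraints above can be sketched as follows: two NumPy inputs, a float `test_size`, a fixed `random_state`, and no `stratify`. The helper name `split_data`, the toy arrays, and the no-op `jit` fallback (used only when Bodo is not installed) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

try:
    import bodo
    jit = bodo.jit
except ImportError:
    # Assumption: fall back to plain Python when Bodo is not installed.
    def jit(func):
        return func


@jit
def split_data(X, y):
    # Exactly two inputs; test_size is a float between 0.0 and 1.0.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    return X_train, X_test, y_train, y_test


# Hypothetical data: y mirrors the first column of X so row pairing can be checked.
X = np.arange(200, dtype=np.float64).reshape(100, 2)
y = X[:, 0].copy()

X_train, X_test, y_train, y_test = split_data(X, y)
```

Features and targets are shuffled together, so each row of `X_train` still lines up with the matching entry of `y_train`.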

9.2. Supported XGBoost

Below is the list of XGBoost classes and functions (from its scikit-learn-like API) that Bodo supports natively inside JIT functions. This list will expand regularly as we add support for more APIs.

9.2.1. XGBClassifier

xgboost.XGBClassifier

This class provides an implementation of the scikit-learn API for XGBoost classification with distributed large-scale learning.

Methods:

9.2.2. XGBRegressor

xgboost.XGBRegressor

This class provides an implementation of the scikit-learn API for XGBoost regression with distributed large-scale learning.

Methods: