baseline_optimal.class_task
The module currently supports classification tasks on balanced datasets.
Usage
To prepare your data, make sure to:
Remove features that machine learning models can’t process, such as names and zip codes.
Split the data into training and test sets, and encode the target variable if necessary (a minimal sketch follows this list).
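As a concrete example, a minimal preparation sketch might look like the following. The file name, the columns name and zip_code, and the target churn are placeholders for your own data, not part of the package.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder data: 'name', 'zip_code', and 'churn' are hypothetical columns.
df = pd.read_csv('data.csv')

# Drop identifier-like features the models cannot use.
df = df.drop(columns=['name', 'zip_code'])

# Separate features and target; encode the target if it is not already numeric.
X = df.drop(columns=['churn'])
y = LabelEncoder().fit_transform(df['churn'])

# Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)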
To start with, declare a ClassTask object:
from baseline_optimal import ClassTask
task = ClassTask()
The optimize method searches the parameter space and optimizes for the accuracy score by default. Other supported classification metrics can be found in the sklearn documentation.
Set select=True when the dataset is large to enable feature selection prior to fitting models. This can speed up the optimization process by using a subset of the feature space, but it may lead to suboptimal results on smaller datasets.
cv defines the number of folds within each iteration; it can also take a CV splitter (an example follows the call below). The optimize function searches for the optimal machine learning pipeline, combining the best data processor and estimator with tuned hyperparameters, and then fits this pipeline to the training data.
task.optimize(X=X_train, y=y_train, metric='accuracy', select=False,
study_name='optimization', cv=5, n_trials=100)
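To illustrate the other parameters, the sketch below passes a custom CV splitter, turns on feature selection, and uses a different metric. It assumes that cv accepts an sklearn CV splitter (as noted above) and that metric accepts sklearn-style scoring names such as 'f1'.
from sklearn.model_selection import StratifiedKFold

# Custom splitter instead of a plain fold count (assumed to be accepted by cv).
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
task.optimize(X=X_train, y=y_train, metric='f1', select=True,
              study_name='optimization', cv=splitter, n_trials=100)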
Access the best CV score.
print(task.best_score)
After fitting, one option is to use the ClassTask object for probability estimates and evaluation on the test set. You may customize the classification threshold.
pred_prob = task.predict(X=X_test)
scores = task.evaluate(X=X_test, y_true=y_test, threshold=0.5)
Alternatively, you may obtain the optimal imblearn.pipeline.Pipeline object directly.
best_pipeline = task.best_pipeline
pred_prob = best_pipeline.predict_proba(X_test)
best_processor = best_pipeline.named_steps['processor']
best_estimator = best_pipeline.named_steps['estimator']
Obtaining the imblearn.pipeline.Pipeline object allows you to do much more. An imblearn.pipeline.Pipeline is returned so that future versions can add resampling for imbalanced data during optimization; it shares a similar API with sklearn.pipeline.Pipeline.
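For example, since the fitted pipeline behaves like any sklearn-style estimator, you can persist it with joblib and reuse it later. This sketch assumes joblib is installed; it is not a stated requirement of baseline_optimal.
import joblib

# Save the fitted pipeline, reload it, and predict as usual.
joblib.dump(best_pipeline, 'best_pipeline.joblib')
reloaded = joblib.load('best_pipeline.joblib')
pred_prob = reloaded.predict_proba(X_test)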
Documentation
- class baseline_optimal.class_task.ClassTask
Bases: object
A ClassTask object manages the optimization process for classification tasks and provides methods to evaluate the performance of the optimal machine learning pipeline on unseen data.
- property best_pipeline: Pipeline
Get the optimal machine learning pipeline obtained from optimization.
- Returns:
The optimal machine learning pipeline.
- Return type:
imblearn.pipeline.Pipeline
- property best_score: float
Get the best CV score achieved during optimization.
- Returns:
The best CV score.
- Return type:
float
- evaluate(X: DataFrame, y_true: array, threshold: float = 0.5) → DataFrame
Evaluate performance of the optimal pipeline on the test data using threshold-based and ranking-based metrics. The classification threshold defaults to 0.5.
- Parameters:
X (pd.DataFrame) – Test features.
y_true (np.array) – Test labels.
threshold (float) – Binary classification threshold.
- Returns:
A pd.DataFrame object containing evaluation results.
- Return type:
pd.DataFrame
- optimize(X: DataFrame, y: ndarray, metric: str = 'accuracy', select: bool = False, study_name: str = 'optimization', cv: int | Iterable = 5, n_trials: int = 100) → None
Optimize the machine learning pipeline by tuning its components and estimator hyperparameters over, by default, 100 trials of Bayesian optimization with the Optuna library, scored on the specified evaluation metric. Uses 5-fold cross-validation by default.
- Parameters:
X (pd.DataFrame) – Training features.
y (np.ndarray) – Training labels.
metric (str) – Classification metric.
select (bool) – Whether to perform feature selection prior to fitting models. Set True to save computing resources.
study_name (str) – Name of the optimization study.
cv (Union[int, Iterable]) – Cross-validation strategy.
n_trials (int) – Number of optimization trials.
- plot_confusion_matrix(X: DataFrame, y_true: array, threshold: float = 0.5) → None
Plot the confusion matrix.
- Parameters:
X (pd.DataFrame) – Features.
y_true (np.array) – Labels.
threshold (float) – Binary classification threshold.
- plot_feature_directionality(X: DataFrame) → None
Plot how features influence the predictions based on SHAP values.
- Parameters:
X (pd.DataFrame) – Features.
- plot_feature_importances(X: DataFrame) → None
Plot the feature importances based on SHAP values.
- Parameters:
X (pd.DataFrame) – Features.
- plot_optimization_history() → None
Plot the optimization history showing the performance over trials.
- plot_pr_curve(X: DataFrame, y_true: array) → None
Plot the Precision-Recall curve.
- Parameters:
X (pd.DataFrame) – Features.
y_true (np.array) – Labels.
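As a usage sketch, the plotting helpers above can be called on a fitted ClassTask after optimization, reusing X_test and y_test from the Usage section:
# Diagnostics on the optimization run and the fitted pipeline.
task.plot_optimization_history()
task.plot_confusion_matrix(X=X_test, y_true=y_test, threshold=0.5)
task.plot_pr_curve(X=X_test, y_true=y_test)
task.plot_feature_importances(X=X_test)
task.plot_feature_directionality(X=X_test)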