Split NASA Asteroids with DataSAIL

In this notebook, we will investigate the transferability of DataSAIL to non-biomedical datasets. This notebook will be very similar to the notebook where we show how to split the BACE dataset by weights. Here, we will use the NASA Asteroids dataset, which contains information about asteroids and whether they are hazardous or not. We will use DataSAIL to split the dataset into training and testing sets and compare the performance of a Random Forest classifier on the two splits.

As always, we first import the necessary libraries. The dataset is a CSV file in the same folder which can be downloaded from kaggle: https://www.kaggle.com/datasets/lovishbansal123/nasa-asteroids-classification.

[1]:
%%capture
import pandas as pd
import numpy as np
from datasail.sail import datasail
from sklearn.ensemble import RandomForestClassifier
/home/roman/miniconda3/envs/sail38/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

Load the NASA Asteroids dataset

Next, we load the dataset and clean it as usually. The dataset contains the many columns, but for demonstration purposes, we will focus on the two main diameters and the missing distance to earth in lunar metric. We will also rename the columns to make them easier to work with.

[2]:
df = pd.read_csv("nasa.csv")[['Est Dia in M(min)', 'Est Dia in M(max)', 'Miss Dist.(lunar)', 'Hazardous']].rename(columns={'Est Dia in M(min)': 'dia_min', 'Est Dia in M(max)': 'dia_max', 'Miss Dist.(lunar)': "dist"})
print(df.shape)
df.head()
(4687, 4)
[2]:
dia_min dia_max dist Hazardous
0 127.219879 284.472297 163.178711 True
1 146.067964 326.617897 148.992630 False
2 231.502122 517.654482 19.821890 True
3 8.801465 19.680675 110.990387 False
4 127.219879 284.472297 158.646713 True

Define the distance metric

We will define the distance between two asteroids in the dataset as the difference of their volume and the difference of their distance to earth. This is computed in ellipsoid_distance. A subfunction of this is ellipsoid_volume which computes the volume of an ellipsoid given its three axes. The formula for the volume of an ellipsoid is \(V = \frac{4}{3} \pi a b c\) where \(a\), \(b\), and \(c\) are the three axes of the ellipsoid. If the third axis is not provided, we will assume it to be the average of the first two axes.

[3]:
def ellipsoid_volume(a: float, b: float, c=None):
    if c is None:
        c = (a + b) / 2
    return 4/3 * np.pi * a * b * c


def ellipsoid_distance(a, b):
    return abs(ellipsoid_volume(*a[:2]) - ellipsoid_volume(*b[:2])) + abs(a[2] - b[2])

Compute the distance matrix

Now, similar to the BACE notebook, we compute the pairwise distances for the samples and normalize them to be between 0 and 1. We will use this distance matrix as the input to DataSAIL.

[4]:
data = df.values
dist = np.zeros((df.shape[0], df.shape[0]))
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        dist[i, j] = ellipsoid_distance(data[i, :-1], data[j, :-1])
        dist[j, i] = dist[i, j]
dist /= np.max(dist)
dist[:4, :4]
[4]:
array([[0.00000000e+00, 2.79635710e-07, 2.73645278e-06, 5.44323262e-07],
       [2.79635710e-07, 0.00000000e+00, 2.45681707e-06, 8.23958476e-07],
       [2.73645278e-06, 2.45681707e-06, 0.00000000e+00, 3.28077422e-06],
       [5.44323262e-07, 8.23958476e-07, 3.28077422e-06, 0.00000000e+00]])

Split the dataset

We will now split the dataset using DataSAIL. We will use the I1e technique to split the dataset by the diameters and the C1e technique to split the dataset by the distance to earth.

Given there exist files storing the data and distance as described in the documentation, the according call to DataSAIL in the commandline would be:

$ datasail -t I1e C1e -s 8 1 -n train test -r 1 --solver SCIP --e-type O --e-data <filepath> --e-dist <filepath>
[5]:
%%capture
e_splits, f_splits, inter_splits = datasail(
    techniques=["I1e", "C1e"],
    splits=[8, 2],
    names=["train", "test"],
    runs=1,
    epsilon=0.1,
    solver="SCIP",
    e_type="O",
    e_data={i: (row['dia_min'], row['dia_max']) for i, (_, row) in enumerate(df.iterrows())},
    e_dist=(list(range(len(df))), dist),
)
(CVXPY) Apr 03 04:18:30 PM: Your problem has 538 variables, 3 constraints, and 0 parameters.
(CVXPY) Apr 03 04:18:30 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Apr 03 04:18:30 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Apr 03 04:18:30 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Apr 03 04:18:30 PM: Compiling problem (target solver=SCIP).
(CVXPY) Apr 03 04:18:30 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Apr 03 04:18:30 PM: Applying reduction Dcp2Cone
(CVXPY) Apr 03 04:18:30 PM: Applying reduction CvxAttr2Constr
(CVXPY) Apr 03 04:18:30 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Apr 03 04:18:30 PM: Applying reduction SCIP
(CVXPY) Apr 03 04:18:30 PM: Finished problem compilation (took 1.264e-01 seconds).
(CVXPY) Apr 03 04:18:30 PM: Invoking solver SCIP  to obtain a solution.
(CVXPY) Apr 03 04:18:30 PM: Problem status: optimal
(CVXPY) Apr 03 04:18:30 PM: Optimal value: 1.000e+00
(CVXPY) Apr 03 04:18:30 PM: Compilation took 1.264e-01 seconds
(CVXPY) Apr 03 04:18:30 PM: Solver (including time spent in interface) took 2.751e-01 seconds
(CVXPY) Apr 03 04:18:31 PM: Your problem has 1325 variables, 1228 constraints, and 0 parameters.
(CVXPY) Apr 03 04:18:32 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Apr 03 04:18:32 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Apr 03 04:18:32 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Apr 03 04:18:33 PM: Compiling problem (target solver=SCIP).
(CVXPY) Apr 03 04:18:33 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Apr 03 04:18:33 PM: Applying reduction Dcp2Cone
(CVXPY) Apr 03 04:18:33 PM: Applying reduction CvxAttr2Constr
(CVXPY) Apr 03 04:18:33 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Apr 03 04:18:35 PM: Applying reduction SCIP
(CVXPY) Apr 03 04:18:36 PM: Finished problem compilation (took 3.702e+00 seconds).
(CVXPY) Apr 03 04:18:36 PM: Invoking solver SCIP  to obtain a solution.
(CVXPY) Apr 03 04:18:37 PM: Problem status: optimal
(CVXPY) Apr 03 04:18:37 PM: Optimal value: -5.266e+01
(CVXPY) Apr 03 04:18:37 PM: Compilation took 3.702e+00 seconds
(CVXPY) Apr 03 04:18:37 PM: Solver (including time spent in interface) took 4.942e-01 seconds

Investigate the splits

Finally, we inspect the e_split object as this holds all the assignments of the datapoints to the splits, for each run and each technique. First, the overall architecture is described, lastly we look at the first five assignments of the C1 run 0.

[6]:
print(type(e_splits))
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits[key][0].items())[:5]))
<class 'dict'>
I1e - Type: <class 'list'> - Length: 1
        Run 1 - Type: <class 'dict'> - 4687 assignments
C1e - Type: <class 'list'> - Length: 1
        Run 1 - Type: <class 'dict'> - 4687 assignments

ID: 0 - Split: train
ID: 4 - Split: train
ID: 59 - Split: train
ID: 151 - Split: train
ID: 393 - Split: train

Train and test a Random Forest classifier

Finally, we train and test a Random Forest classifier on the two splits. We will first extract the indices of the training and testing samples for the two splits and then train and test the classifier on the two splits.

[7]:
rand_train = [i for i in range(len(df)) if e_splits['I1e'][0][i] == "train"]
rand_test = [i for i in range(len(df)) if e_splits['I1e'][0][i] == "test"]
sail_train = [i for i in range(len(df)) if e_splits['C1e'][0][i] == "train"]
sail_test = [i for i in range(len(df)) if e_splits['C1e'][0][i] == "test"]
[8]:
X_rand_train = df.iloc[rand_train].drop(columns='Hazardous')
y_rand_train = df.iloc[rand_train]['Hazardous']
X_rand_test = df.iloc[rand_test].drop(columns='Hazardous')
y_rand_test = df.iloc[rand_test]['Hazardous']

rand_clf = RandomForestClassifier()
rand_clf.fit(X_rand_train, y_rand_train)
rand_score = rand_clf.score(X_rand_test, y_rand_test)
rand_score
[8]:
0.7608530083777608

As we can see, we reach an accuracy of 0.76 on a random split. Let’s see how well we can do on the DataSAIL split.

[9]:
X_sail_train = df.iloc[sail_train].drop(columns='Hazardous')
y_sail_train = df.iloc[sail_train]['Hazardous']
X_sail_test = df.iloc[sail_test].drop(columns='Hazardous')
y_sail_test = df.iloc[sail_test]['Hazardous']

sail_clf = RandomForestClassifier()
sail_clf.fit(X_sail_train, y_sail_train)
sail_score = sail_clf.score(X_sail_test, y_sail_test)
sail_score
[9]:
0.6046242774566474