Split an RNA dataset
In this example notebook, we will discuss how to use DataSAIL to split a dataset of RNAs. We downloaded the FASTA file from the nRC GitHub repository. As always, we first import all necessary tools.
[5]:
from datasail.sail import datasail
Run DataSAIL
As there is no dataset to be loaded, we directly use DataSAIL to split the data into an identity-based single cold split (I1e) and a cluster-based single cold split (C1e). We define
- the techniques as a list: I1e and C1e. The e at the end is important, as it tells DataSAIL to split the e-data.
- the splits as a list. The values will be normalized to ratios.
- the names as a list. Similarly to the list of split sizes, DataSAIL needs names to label the splits.
- the number of runs. This determines how many different splits to compute per technique.
- the solving algorithm for optimizing the final problem formulation.
- the type of the dataset in the first axis.
- the data, here given as the path to the FASTA file.
For an extensive description of the arguments, please refer to the corresponding pages of the documentation.
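To make the normalization of the split sizes concrete, the following sketch shows what the values 7, 2, 1 amount to for this dataset. It is a plain-Python illustration and not DataSAIL's internal implementation; the variable names are chosen for this example only.

```python
# Illustration only: plain Python, not DataSAIL's internal code.
splits = [7, 2, 1]                # same values as passed to DataSAIL
names = ["train", "val", "test"]

total = sum(splits)
# Each split size divided by the sum gives its ratio.
ratios = {name: size / total for name, size in zip(names, splits)}

# Target number of sequences per split for the 6320 RNAs in this dataset
# (integer arithmetic avoids floating-point rounding).
n_sequences = 6320
targets = {name: n_sequences * size // total for name, size in zip(names, splits)}

print(ratios)
print(targets)
```

These are only the sizes implied by the ratios. The split sizes DataSAIL actually produces may deviate from them, especially for cluster-based splits.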
Given that files storing the data as described in the documentation exist, the corresponding call to DataSAIL on the command line would be:
$ datasail -t I1e C1e -s 7 2 1 -n train val test -r 3 -i inter.tsv --solver SCIP --e-type G --e-data <path/dataset_Rfam_6320_13classes.fasta>
[6]:
%%capture
e_splits, f_splits, inter_splits = datasail(
    techniques=["I1e", "C1e"],
    splits=[7, 2, 1],
    names=["train", "val", "test"],
    runs=3,
    epsilon=0.2,
    solver="SCIP",
    e_type="G",
    e_data="dataset_Rfam_6320_13classes.fasta",
)
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), May 15 2023, 22:49:31
Command: cd-hit-est -i ../cdhitest.fasta -o clusters -d 0 -T 1
-c 0.9 -n 10 -l 9
Started: Tue Mar 26 16:30:55 2024
================================================================
Output
----------------------------------------------------------------
total seq: 6320
longest and shortest : 1136 and 38
Total letters: 1024845
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 2M
Buffer : 1 X 17M = 17M
Table : 1 X 16M = 16M
Miscellaneous : 4M
Total : 41M
Table limit with the given memory limit:
Max number of representatives: 2873410
Max number of word counting entries: 94871404
comparing sequences from 0 to 6320
......
6320 finished 6319 clusters
Approximated maximum memory consumption: 48M
writing new database
writing clustering information
program completed !
Total CPU time 2.49
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), May 15 2023, 22:49:31
Command: cd-hit-est -i ../cdhitest.fasta -o clusters -d 0 -T 1
-c 0.8 -n 5 -l 4
Started: Tue Mar 26 16:30:58 2024
================================================================
Output
----------------------------------------------------------------
total seq: 6320
longest and shortest : 1136 and 38
Total letters: 1024845
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 2M
Buffer : 1 X 17M = 17M
Table : 1 X 0M = 0M
Miscellaneous : 0M
Total : 20M
Table limit with the given memory limit:
Max number of representatives: 2952729
Max number of word counting entries: 97490284
comparing sequences from 0 to 6320
......
6320 finished 6309 clusters
Approximated maximum memory consumption: 26M
writing new database
writing clustering information
program completed !
Total CPU time 88.15
2024-03-26 16:32:26,501 cdhit_est cannot optimally cluster the data. The minimal number of clusters is 6309.
(CVXPY) Mar 26 04:32:39 PM: Your problem has 18960 variables, 4 constraints, and 0 parameters.
(CVXPY) Mar 26 04:32:39 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:32:39 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:32:39 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:32:39 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:32:39 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:32:39 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:32:39 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:32:39 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:32:39 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:32:39 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:32:39 PM: Finished problem compilation (took 2.108e-02 seconds).
(CVXPY) Mar 26 04:32:39 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:32:43 PM: Problem status: optimal
(CVXPY) Mar 26 04:32:43 PM: Optimal value: 1.000e+00
(CVXPY) Mar 26 04:32:43 PM: Compilation took 2.108e-02 seconds
(CVXPY) Mar 26 04:32:43 PM: Solver (including time spent in interface) took 4.125e+00 seconds
(CVXPY) Mar 26 04:32:44 PM: Your problem has 1375 variables, 1229 constraints, and 0 parameters.
(CVXPY) Mar 26 04:32:44 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:32:44 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:32:44 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:32:44 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:32:44 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:32:44 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:32:44 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:32:44 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:32:44 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:32:45 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:32:46 PM: Finished problem compilation (took 1.571e+00 seconds).
(CVXPY) Mar 26 04:32:46 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:33:05 PM: Problem status: optimal
(CVXPY) Mar 26 04:33:05 PM: Optimal value: 1.430e+02
(CVXPY) Mar 26 04:33:05 PM: Compilation took 1.571e+00 seconds
(CVXPY) Mar 26 04:33:05 PM: Solver (including time spent in interface) took 1.926e+01 seconds
(CVXPY) Mar 26 04:33:05 PM: Your problem has 18960 variables, 4 constraints, and 0 parameters.
(CVXPY) Mar 26 04:33:05 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:33:05 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:33:05 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:33:05 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:33:05 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:33:05 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:33:05 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:33:05 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:33:05 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:33:05 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:33:05 PM: Finished problem compilation (took 2.043e-02 seconds).
(CVXPY) Mar 26 04:33:05 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:33:09 PM: Problem status: optimal
(CVXPY) Mar 26 04:33:09 PM: Optimal value: 1.000e+00
(CVXPY) Mar 26 04:33:09 PM: Compilation took 2.043e-02 seconds
(CVXPY) Mar 26 04:33:09 PM: Solver (including time spent in interface) took 4.122e+00 seconds
(CVXPY) Mar 26 04:33:09 PM: Your problem has 1375 variables, 1229 constraints, and 0 parameters.
(CVXPY) Mar 26 04:33:10 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:33:10 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:33:10 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:33:10 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:33:10 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:33:10 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:33:10 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:33:10 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:33:10 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:33:11 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:33:11 PM: Finished problem compilation (took 1.505e+00 seconds).
(CVXPY) Mar 26 04:33:11 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:33:30 PM: Problem status: optimal
(CVXPY) Mar 26 04:33:30 PM: Optimal value: 1.430e+02
(CVXPY) Mar 26 04:33:30 PM: Compilation took 1.505e+00 seconds
(CVXPY) Mar 26 04:33:30 PM: Solver (including time spent in interface) took 1.905e+01 seconds
(CVXPY) Mar 26 04:33:30 PM: Your problem has 18960 variables, 4 constraints, and 0 parameters.
(CVXPY) Mar 26 04:33:30 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:33:30 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:33:30 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:33:30 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:33:30 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:33:30 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:33:30 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:33:30 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:33:30 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:33:30 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:33:30 PM: Finished problem compilation (took 6.218e-02 seconds).
(CVXPY) Mar 26 04:33:30 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:33:34 PM: Problem status: optimal
(CVXPY) Mar 26 04:33:34 PM: Optimal value: 1.000e+00
(CVXPY) Mar 26 04:33:34 PM: Compilation took 6.218e-02 seconds
(CVXPY) Mar 26 04:33:34 PM: Solver (including time spent in interface) took 4.027e+00 seconds
(CVXPY) Mar 26 04:33:35 PM: Your problem has 1375 variables, 1229 constraints, and 0 parameters.
(CVXPY) Mar 26 04:33:35 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Mar 26 04:33:35 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Mar 26 04:33:35 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Mar 26 04:33:35 PM: Your problem is compiled with the CPP canonicalization backend.
(CVXPY) Mar 26 04:33:35 PM: Compiling problem (target solver=SCIP).
(CVXPY) Mar 26 04:33:35 PM: Reduction chain: Dcp2Cone -> CvxAttr2Constr -> ConeMatrixStuffing -> SCIP
(CVXPY) Mar 26 04:33:35 PM: Applying reduction Dcp2Cone
(CVXPY) Mar 26 04:33:35 PM: Applying reduction CvxAttr2Constr
(CVXPY) Mar 26 04:33:35 PM: Applying reduction ConeMatrixStuffing
(CVXPY) Mar 26 04:33:36 PM: Applying reduction SCIP
(CVXPY) Mar 26 04:33:37 PM: Finished problem compilation (took 1.536e+00 seconds).
(CVXPY) Mar 26 04:33:37 PM: Invoking solver SCIP to obtain a solution.
(CVXPY) Mar 26 04:33:56 PM: Problem status: optimal
(CVXPY) Mar 26 04:33:56 PM: Optimal value: 1.430e+02
(CVXPY) Mar 26 04:33:56 PM: Compilation took 1.536e+00 seconds
(CVXPY) Mar 26 04:33:56 PM: Solver (including time spent in interface) took 1.937e+01 seconds
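The CD-HIT log above reports 6320 sequences with lengths between 38 and 1136. These numbers can be cross-checked by reading the FASTA file directly. The following is a minimal sketch of such a reader; `read_fasta` is a hypothetical helper written for this example, not part of DataSAIL, and taking the first whitespace-separated token of the header as ID is an assumption. The toy sequences are made up.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from an iterable of lines into {id: sequence}.

    Hypothetical helper for illustration; not part of DataSAIL.
    """
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first whitespace-separated token is the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may span several lines; concatenate them.
            records[current_id] += line
    return records

# Toy input with made-up sequences. For the real data, pass
# open("dataset_Rfam_6320_13classes.fasta") instead of the StringIO object.
toy = io.StringIO(">rna_a some description\nACGU\nGGCA\n>rna_b\nUUAGC\n")
records = read_fasta(toy)
lengths = {name: len(seq) for name, seq in records.items()}

print(len(records), min(lengths.values()), max(lengths.values()))
```

Applied to the real file, the three printed numbers would be the sequence count and the shortest and longest sequence length, which can be compared with the "total seq" and "longest and shortest" lines of the CD-HIT output.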
The output
Finally, we inspect the e_splits object, as it holds the assignments of all datapoints to the splits for each run and each technique. First, we print the overall structure; then we look at the first five assignments of the first C1e run.
[7]:
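print(type(e_splits))
# e_splits maps each technique to a list of runs; each run maps IDs to split names.
for key in e_splits.keys():
    print(f"{key} - Type: {type(e_splits[key])} - Length: {len(e_splits[key])}")
    for run in range(len(e_splits[key])):
        print(f"\tRun {run + 1} - Type: {type(e_splits[key][run])} - {len(e_splits[key][run])} assignments")
# After the loop, key holds the last technique (C1e); show the first five assignments of its first run.
print("\n" + "\n".join(f"ID: {idx} - Split: {split}" for idx, split in list(e_splits[key][0].items())[:5]))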
<class 'dict'>
I1e - Type: <class 'list'> - Length: 3
	Run 1 - Type: <class 'dict'> - 6320 assignments
	Run 2 - Type: <class 'dict'> - 6320 assignments
	Run 3 - Type: <class 'dict'> - 6320 assignments
C1e - Type: <class 'list'> - Length: 3
	Run 1 - Type: <class 'dict'> - 6320 assignments
	Run 2 - Type: <class 'dict'> - 6320 assignments
	Run 3 - Type: <class 'dict'> - 6320 assignments

ID: RF00001_AF095839_1_346-228_5S_rRNA - Split: train
ID: RF00001_AY245018_1_1-119_5S_rRNA - Split: train
ID: RF00001_X52048_1_2-120_5S_rRNA - Split: train
ID: RF00001_M28193_1_1-119_5S_rRNA - Split: train
ID: RF00001_X14816_1_860-978_5S_rRNA - Split: test
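Since each run is a dictionary from IDs to split names, the realized split sizes can be checked with a few lines of standard-library Python. The sketch below uses a small made-up stand-in with the same nesting as e_splits (technique, then runs, then ID to split), so it runs on its own; `split_sizes` and `ids_by_split` are hypothetical helpers for this example and not part of DataSAIL.

```python
from collections import Counter, defaultdict

def split_sizes(assignment):
    """Count how many IDs were assigned to each split."""
    return Counter(assignment.values())

def ids_by_split(assignment):
    """Invert {id: split} into {split: [ids]}, keeping insertion order."""
    grouped = defaultdict(list)
    for idx, split in assignment.items():
        grouped[split].append(idx)
    return dict(grouped)

# Toy stand-in with the same nesting as e_splits; the IDs are made up.
toy_splits = {
    "C1e": [
        {
            "rna_1": "train",
            "rna_2": "train",
            "rna_3": "val",
            "rna_4": "test",
            "rna_5": "train",
        },
    ]
}

for technique, runs in toy_splits.items():
    for run_idx, assignment in enumerate(runs):
        sizes = split_sizes(assignment)
        total = sum(sizes.values())
        for split, count in sizes.items():
            print(f"{technique} run {run_idx + 1}: {split} = {count} ({count / total:.0%})")

# Lists of IDs per split, e.g. to select the training sequences.
groups = ids_by_split(toy_splits["C1e"][0])
```

Replacing toy_splits with the e_splits object returned above would report, for every technique and run, how closely the realized sizes match the requested 7:2:1 ratio.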