Frequently Asked Questions
Many questions are already answered in the Workflow section of this documentation. Examples of how to use DataSAIL as a package or command-line tool are given in the Example section. Here, we collect and answer some frequently asked questions that are not covered in the other sections and that arose in conference discussions, GitHub issues, or on other occasions. If you don’t find help here, check the GitHub Issue Tracker and consider opening a new issue if your question is not covered.
Theoretical and Conceptual Questions
Does training on DataSAIL splits produce better generalizing models?
Yes, training on DataSAIL splits generally leads to better generalizing models. The DataSAIL splits are designed to reduce information leakage between splits. Therefore, when used for hyperparameter tuning, they help in selecting models (and their hyperparameters) that generalize better to unseen data.
What are the limitations of DataSAIL?
The most time- and memory-consuming step in DataSAIL is the clustering of the data. For most data types, this is done by third-party programs such as FoldSeek, DIAMOND, or MASH. In that case, DataSAIL has no influence on the runtime and memory consumption. The user may provide their own command-line arguments to these programs.
Practical Questions
How can I relax the split constraints if DataSAIL fails to find a solution?
Sometimes, DataSAIL is unable to solve the split problem and might output a message like:
GUROBI cannot solve the problem. Please consider relaxing split restrictions, e.g., less splits, or a higher tolerance level for exceeding cluster limits.
DataSAIL compiles your input into multiple variables and constraints that form a constrained optimization problem. If the solver cannot find a solution, there are some options to address this:
Check the DataSAIL version. In v1.2.0, we added handling for clusters that are too large. For example, 80% of your data is in cluster A, but you want 5 splits with 20% of the data each for a 5-fold cross-validation. This is impossible to solve. In v1.2.0, we introduced the overflow option to either break large clusters into smaller parts to fit the splits, or assign the whole large cluster to one split and allow that split to exceed its size limit.
If you are already on v1.2.0 or newer, you can set the epsilon value to higher numbers. The default is 0.05, but anything up to 0.2 or 0.3 is reasonable. If you use stratification, you also need to set delta to a higher value, as both values are connected in that scenario.
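To judge which of the two options applies, it can help to check your cluster sizes before running the solver. The following is a minimal sketch, not part of DataSAIL: the helper `oversized_clusters` is hypothetical, and it assumes that a split may grow to at most (1 + epsilon) times its target size, which may differ from how DataSAIL applies epsilon internally. It only tests a necessary condition, so an empty result does not guarantee that a solution exists.

```python
def oversized_clusters(cluster_sizes, split_fractions, epsilon=0.05):
    """Return the clusters that cannot fit into even the largest split.

    Assumption (not taken from DataSAIL): a split may hold at most
    (1 + epsilon) times its target share of the data.
    """
    total = sum(cluster_sizes.values())
    # Capacity of the largest split under the assumed tolerance
    capacity = max(split_fractions) * (1 + epsilon) * total
    return {name: size for name, size in cluster_sizes.items() if size > capacity}


five_fold = [0.2] * 5

# The example from the text: 80% of the data sits in cluster A.
dominant = {"A": 80, "B": 10, "C": 6, "D": 4}
# Raising epsilon does not help here; the large cluster needs special handling.
still_too_large = oversized_clusters(dominant, five_fold, epsilon=0.3)

# A milder case: cluster A is only slightly above the 20% target.
mild = {"A": 23, "B": 20, "C": 20, "D": 19, "E": 18}
blocked_at_default = oversized_clusters(mild, five_fold, epsilon=0.05)
fits_when_relaxed = oversized_clusters(mild, five_fold, epsilon=0.2)

print(still_too_large)     # cluster A remains oversized
print(blocked_at_default)  # cluster A exceeds the default tolerance
print(fits_when_relaxed)   # empty: a higher epsilon is enough
```

In the first case no reasonable epsilon makes cluster A fit, which is the situation the overflow option addresses. In the second case a moderately higher epsilon removes the obstacle.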