Many SIMON ML methods fail
Posted: Wed May 26, 2021 10:52 am
Dear SIMON team,
Thank you for your great work on designing and implementing SIMON.
We are exploring and benchmarking your platform with some internal data sets.
The first data set we used for testing had ca. 1K rows and ca. 800 columns (all with binary 0/1 values). The output attribute is also binary (0/1).
We ran all 184 available ML algorithms to check whether they could build models. Ca. 54 algorithms produced valid models with a training-set AUC > 0.80.
We selected 19 diverse algorithms from those and applied them to a larger pool of data sets (same type of binary data and binary output attribute).
The results of the SIMON modelling have puzzled us: the larger the data set, the fewer ML algorithms are capable of producing valid models (please see the heatmap plot below).
(Heatmap values correspond to AUC values for the test set.) Data set size corresponds to the number of rows; the number of columns is ca. 500-1000 for all data sets.
For many ML algorithms we get “N/A” values as the statistical parameters. I would expect lower values when an algorithm cannot build a valid model, but not “N/A”.
We have installed SIMON as a Docker image on an in-house Linux machine with the following specifications:
64 GB RAM, 16 CPU cores, sufficient available disk space
Yesterday we repeated the SIMON testing using the largest data set (DS21), with 33.5K rows, allowing all 184 available ML algorithms to run. After 5 h of calculations, 77 of the 184 algorithms had completed, but the overall modelling process stalled afterwards. Only 6 algorithms returned valid models; the remaining 71 had “N/A” as statistical parameters.
All of them also reported the error message “Error: reached elapsed time limit (cpu=300s, elapsed=300s)”.
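That wording matches the standard message emitted by R's base `setTimeLimit()` mechanism, so our working assumption (we have not checked SIMON's source) is that each model fit is capped at 300 s of wall-clock time and that a fit exceeding the cap is aborted and reported as “N/A”. A minimal Python sketch of that per-task timeout pattern, purely to illustrate the behaviour we think we are seeing (the function names are ours, not SIMON's):

```python
import multiprocessing
import time

def fit_model(seconds):
    # Stand-in for a single algorithm's training run.
    time.sleep(seconds)
    return "valid model"

def fit_with_timeout(seconds, limit):
    """Run fit_model in a worker process; report 'N/A' if it exceeds `limit`."""
    with multiprocessing.Pool(1) as pool:
        result = pool.apply_async(fit_model, (seconds,))
        try:
            return result.get(timeout=limit)
        except multiprocessing.TimeoutError:
            # Mirrors the "N/A" statistics we observe for slow algorithms.
            return "N/A"

if __name__ == "__main__":
    print(fit_with_timeout(0.1, limit=1.0))  # finishes in time
    print(fit_with_timeout(2.0, limit=1.0))  # exceeds the cap
```

If this assumption is right, the pattern in our heatmap would follow directly: as the row count grows, more algorithms exceed the fixed cap and drop out. Is the 300 s limit configurable?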
I am also sharing the SIMON log file (zipped); check for timestamp 2021-05-25 15:54:19 in there.
Please advise us how to use SIMON for data sets of this size. Can we model XL data sets, e.g. with 100K data points? What are the current SIMON limitations?
A few other practical questions:
- If I want to test only a few ML algorithms (e.g. 20 out of 184), I have to search for each of them and add them to the list one by one. This takes a lot of time and must be repeated from scratch for each data set. Is there a better way to select a user-defined list of ML algorithms and apply it to all data sets loaded into the system?
- Also, once an ML algorithm is added to the list, it is impossible to remove it. I have faced this situation a few times when a method was added by accident and I could not get rid of it. This should of course be changed.
- Is there a way to run SIMON in batch mode (or from the command line), without the GUI, specifying all parameters on the command line? That would be a great help for automation and benchmarking.
- How can we apply the generated SIMON models to external validation sets?
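To clarify the last question: by “external validation” we mean scoring a data set that was never seen during training and computing the same AUC statistic SIMON reports for its test set. A placeholder sketch in Python/scikit-learn of that workflow (the data and model here are ours, not SIMON artifacts; we are asking how to do the equivalent with models trained inside SIMON):

```python
# External validation illustrated with placeholder data and a placeholder
# model: hold out a set, fit on the rest, report AUC on the held-out set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our binary 0/1 data (ca. 1K rows, 800 columns).
X, y = make_classification(n_samples=1000, n_features=800,
                           n_informative=50, random_state=0)
X_train, X_ext, y_train, y_ext = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score the external set with the trained model and compute AUC.
auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"external AUC: {auc:.3f}")
```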
Looking forward to your comments!
Thank you!