Many SIMON ML methods fail

Found a bug in SIMON, or missing a major feature? Just ask and discuss.
ARC_user
Newbie
Posts: 3
Joined: Fri May 21, 2021 6:17 am

Wed May 26, 2021 10:52 am

Dear SIMON team,

Thank you for your great work on designing and implementing SIMON.
We are exploring and benchmarking your platform with some internal data sets.

The first data set we used for a test had ca. 1K rows and ca. 800 columns (all with binary values 0/1). The output attribute is also binary (0/1).
We used all 184 available ML algorithms to check whether they could build models. About 54 of them produced valid models with a training AUC > 0.80.
We selected 19 diverse algorithms from those and applied them to a larger pool of data sets (same type of binary data and binary output attribute).

The results of the SIMON modelling have puzzled us. The larger the data set, the fewer ML algorithms are capable of producing valid models (please see the heatmap plot below).
HEATMAP.png
(Heatmap values correspond to AUC values for the test set.) The data set size is the number of rows; the number of columns is ca. 500-1000 for all data sets.

For many ML algorithms we get “N/A” values as the statistical parameters. I would expect to see some lower values if an algorithm can’t build a valid model, but not “N/A”.

We have installed SIMON as a Docker image on an in-house Linux machine with the following specifications:
64 GB RAM, 16 CPU cores, sufficient available disk space.

Yesterday we repeated the SIMON testing using the largest data set (DS21), with 33.5K rows, and allowed all 184 available ML algorithms to be tried. After 5 h of calculations, 77 out of the 184 ML algorithms had completed, but the overall modelling process stalled afterwards. Only 6 algorithms returned valid models; the remaining 71 had “N/A” as their statistical parameters.
ERROR_message.png
All of them also reported the error message “Error: reached elapsed time limit (cpu=300s, elapsed=300s)”.
I am also sharing the SIMON log file (zipped); check for timestamp 2021-05-25 15:54:19 in there.

Please advise us on how to use SIMON for data sets of this size. Can we model XL data sets, e.g. with 100K data points? What are the current SIMON limitations?

A few other practical questions:

- If I want to test only a few ML algorithms (e.g. 20 out of 184), I need to search for each of them and add them one by one to the list. This takes a lot of time and has to be repeated from scratch for each data set. Is there a better way to select a user-defined list of ML algorithms and apply it to all data sets loaded into the system?

- Also, once an ML algorithm is added to the list, it is impossible to delete it any more. I have faced this situation a few times when an ML method was added to the list by accident and I couldn’t get rid of it. This should of course be changed.

- Is there a way to run SIMON in batch mode (or from the command line) without the GUI, specifying all parameters on the command line? That would be a great help for automation and benchmarking.

- How can we apply the generated SIMON models to external validation sets?

Looking forward to your comments!
Thank you!
Attachments
simon-cron.zip
ARC_user

Wed May 26, 2021 1:01 pm

Just a quick update. We have found that, by default, only 5 minutes are allowed for an ML method to complete. If the model is not finished by then, it is considered "not successful". We have changed this parameter to 1 hour and are testing a SIMON run again. This might be why many ML algorithms fail as the data set size increases: the larger the data set, the more time the methods need to converge. Please let us know if you agree with our conclusion. Thank you!
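The effect is easy to reproduce in miniature with the coreutils `timeout` command. This is only an analogy, since SIMON enforces its limit inside R, but the principle is the same: a wall-clock cap kills a still-running task no matter how close it is to finishing.

```shell
#!/bin/sh
# A 1-second limit on a 5-second task: `timeout` kills it and exits with 124,
# analogous to SIMON marking a slow method "not successful".
timeout 1 sleep 5
limit_hit=$?
echo "exit status: $limit_hit"   # 124: the time limit was reached

# The same kind of task under a generous limit completes normally.
timeout 10 sleep 2
ok=$?
echo "exit status: $ok"          # 0: finished within the limit
```

So a method that "fails" under a 300 s limit may simply be one that needed 301 s.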
LogIN
Admin
Posts: 16
Joined: Wed Feb 13, 2019 7:47 pm
Location: Palo Alto, CA
Contact:

Wed May 26, 2021 2:28 pm

Dear ARC_user,

First of all, thanks for such a nicely written post! I don't see that as often as I would like.

So yes, I wanted to write to you, but you already figured it out yourself. There is a default parameter of 300 s as a training limit for each model. This should definitely be configurable globally and per model, and it is on my roadmap.
I am preparing the new release "0.3.0" now, and that will be included as well.

Otherwise it is natural that many models fail, but that is not your issue here; the hard-coded time limit is.

Regarding your practical questions:
1 & 2 - I completely agree these small GUI improvements would enhance the user experience significantly, and we should implement them in the new release.
Can you please open a new feature request for each of them, with the desired specification, on GitHub: https://github.com/genular/simon-frontend/issues ?

3. I have an "API feature" in the development backlog, and on the current roadmap we could hopefully implement it within the next 2 months.
I was thinking about a REST API here: https://github.com/genular/simon-frontend/issues/61 - what do you think? If you have specification ideas, feel free to comment on that ticket as well.

4. Internally we have prepared the system for the use of a third, independent validation set, but this is currently still in development. Right now there are just training and testing sets, with some split of the original dataset, let's say 75%.
Hopefully we will enable this before v1. For now your only option is to download the RData file for a specific model and apply the validation dataset directly from a custom R script.
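A rough sketch of that R-script route (the file and object names below, model.RData, a fit called `model`, validation.csv, are placeholders, not SIMON's actual naming; the `print(ls())` is there to discover what the RData file really contains):

```shell
#!/bin/sh
# Write a small R script that loads a downloaded SIMON model and scores an
# external validation set. All names here are assumptions for illustration.
cat > apply_model.R <<'EOF'
load("model.RData")                       # restores whatever objects were saved
print(ls())                               # inspect what the file actually contains
validation <- read.csv("validation.csv")  # external set, same columns as training
# Assuming a caret-style fit saved under the name `model`:
preds <- predict(model, newdata = validation, type = "prob")
write.csv(preds, "validation_predictions.csv", row.names = FALSE)
EOF
echo "Run with: Rscript apply_model.R"
```

Adjust the object name after checking the `ls()` output, since the saved workspace layout may differ between SIMON versions.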

In any case, I am really curious about your findings; feel free to drop me your results by email when done! ;)

Happy modelling!
“We can not solve our problems with the same level of thinking that created them” A.E.
LogIN

Wed May 26, 2021 2:34 pm

Just one more thing regarding your second point:
"- Also, once an ML algorithm is added to the list, it is impossible to delete it any more. I have faced this situation a few times when an ML method was added to the list by accident and I couldn't get rid of it. This should of course be changed."

You can just drag & drop it back, right? Then it will not be selected anymore.
ARC_user

Thu May 27, 2021 5:06 pm

Hi again,

Thank you for your replies.

Yes, you are right: it is possible to drag an accidentally added ML algorithm from the selected list back to the full list of algorithms.

We installed SIMON as the genular container with Docker on an Ubuntu 20.04 computer with 16 CPU cores and 64 GB of RAM. We would like to report some further issues we found while using SIMON:

Issue 1
The first thing we noticed is that SIMON uses the hardware resources extensively: the computer's "load average", as reported by the "top" command, goes up to 150 instead of 16 (1 per core).
This behaviour does not seem to depend on the data set size but on the method used, for example "wsrf" or "RRF".

Issue 2
We tried to monitor the system to understand how it uses the resources, but we could not find much information; basically the only log file with relevant information is /var/log/simon-cron.log.
Even there the messages are few and not directly understandable, and there are very few timestamps that would help trace the activities.
For example:
- Warning in (function (e) : You have a leaked pooled object.
- System has not been booted with systemd as init system (PID 1). Can't operate.
- Failed to create bus connection: Host is down
Are those messages normal, or should we do something about them?
Are there more log files to check the outcome of a method, the status of the whole analysis, etc.?
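For now we triage the log with plain grep. A self-contained sketch (the sample lines below are stand-ins copied from this thread, and the timestamp format is an assumption; the real file is /var/log/simon-cron.log):

```shell
#!/bin/sh
# Log-triage sketch against a fabricated sample of simon-cron.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2021-05-25 15:54:19 INFO  starting model training: parRF
2021-05-25 15:59:20 Error: reached elapsed time limit (cpu=300s, elapsed=300s)
Warning in (function (e) : You have a leaked pooled object.
2021-05-25 16:02:01 INFO  starting model training: wsrf
EOF

# Show errors/warnings, then count them, then count timestamped lines:
grep -iE 'error|warning' "$log"
issues=$(grep -icE 'error|warning' "$log")
stamped=$(grep -cE '^[0-9]{4}-[0-9]{2}-[0-9]{2}' "$log")
echo "issues=$issues timestamped=$stamped"
```

The same two greps against the real log separate hard failures from the untimestamped Docker noise.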

Issue 3
We changed the time limit per method in the file /var/www/genular/simon-backend/cron/main.R, and in general it works, but sometimes we still get the message “reached CPU time limit [cpu=1800s, elapsed=1800s]” even though the method took much less time. Again, is there any information available about a method's outcome? To understand, for example, why a method failed.

Some illustrations:
We used a small data set and 15 random forest methods:
Picture1.png
During this test the time limit was set to 600 s. As you can see, 9 methods failed. All of them report the same error message (elapsed time limit of 600 s), but for a few of them the processing time is much less. Why do they get the same error message, e.g. parRF? 6 methods resulted in valid models (see green labels).

In the next round we repeated this exercise using only the 5 “successful” ML algorithms (excluding RRF) on the same data set. We included PLS as well and increased the time limit to 1800 s.
To our surprise, 2 out of 6 methods failed this time! See the screenshot below:
Picture2.png
Any idea why? Again we got the error “reached CPU time limit”, but the actual run time was only 2-3 minutes.

Another issue is that when all ML algorithms fail, it is impossible to explore the models (to see why they failed):
Picture3.png

For example, in this screenshot, clicking the green information button fires up an “orange” message.

We repeated the modelling again (using the same data set, same parameters, same ML methods).
This time only 3 out of 6 ML methods succeeded in building a model:

So there is a lack of consistency across the same ML methods, even when we use the same data set and all other parameters. This does not inspire much confidence in any new runs.

Issue 4
Sometimes we also find that methods hang for long periods and cannot be recovered. In the process list we see all the “/usr/local/R/3.6.3/lib/R/bin/exec/R --slave --no-restore --file=/var/www/genular/simon-backend/cron/main.R” processes, but they consume no resources and just sit there. The only way we found to continue is to kill all the R processes and delete the /tmp/eZF0b0sf/uptime_cron_analysis.pid file. Is there a better way to stop a method while it is running?
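That manual recovery can be scripted. The sketch below uses a stand-in `sleep` process instead of a real SIMON worker; for SIMON the match pattern would be the main.R invocation, and since the pid file's parent directory looks randomly named, the glob over /tmp is an assumption:

```shell
#!/bin/sh
# Kill every process whose command line matches a pattern, then remove the
# cron pid file. Demonstrated on a stand-in worker.
secs=300
sleep "$secs" &                   # stand-in for a hung R worker
pattern="sleep $secs"             # for SIMON, e.g. 'cron/main.R'

pkill -f "$pattern"               # -f matches the full command line
sleep 1                           # give the kernel a moment to reap them
remaining=$(pgrep -fc "$pattern" || true)
echo "remaining matching processes: ${remaining:-0}"

# The pid file sits under a randomly named /tmp directory (glob is a guess):
rm -f /tmp/*/uptime_cron_analysis.pid
```

Note that `pkill -f` is indiscriminate; on a shared host, pattern-match `cron/main.R` rather than bare `R` to avoid killing unrelated sessions.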

I hope you can suggest some ways to overcome these limitations so we can proceed with our SIMON testing more efficiently.
If that is not possible in the current version, please fix them in the next SIMON release.
Thank you.
LogIN

Wed Jun 02, 2021 9:22 am

Hi ARC_user,

Let me try to address your questions:

1. Yes, depending on the algorithm, there will be maximum CPU usage on the CPUs allocated to your SIMON instance. You will notice this more often, as you said, with RRF models, since the number of features in your dataset and the number of trees in the forest are the main factors in high memory/CPU usage.
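A load average of ~150 on 16 cores usually points to nested parallelism: one worker per core, each spawning its own threads via a multithreaded library. The standard mitigation is to cap per-library thread counts before the workers start; whether the SIMON container honours these variables is an assumption worth testing (they could be passed with `docker run --env`):

```shell
#!/bin/sh
# Cap per-process threads so outer parallelism (one worker per core) is the
# only source of concurrency. Assumes a Linux host with /proc.
export OMP_NUM_THREADS=1        # OpenMP-based libraries
export OPENBLAS_NUM_THREADS=1   # BLAS back-ends commonly linked into R
export MKL_NUM_THREADS=1

cores=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)
echo "cores=$cores load(1min)=$load"  # healthy: load stays near the core count
```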

2. Those messages are normal and related to Docker-specific issues. We will try to have them cleaned up in the future, but they are not a priority at this point.

3. It is possible that failed models show a "wrong" processing time (a lower number than the defined timeout, since CPU time and wall time can differ); it can also happen that the process gets stuck waiting for some child process or socket and the timeout function fails. A configurable timeout is something we have just integrated into our new version, which will be published on the 15th of June.

Successful models, on the other hand, should have the correct processing time.

It is hard to say why some models failed on your rerun without taking a look at the processing configuration and the other log files.
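The CPU-vs-elapsed distinction is easy to see directly: a process that is waiting (on a socket, a child process, a lock) accrues wall-clock time but almost no CPU time, which is one way the numbers in a timeout message can disagree with the runtime you observed. A small demonstration, assuming a Linux host with /proc accounting:

```shell
#!/bin/sh
# Waiting burns wall-clock time but almost no CPU time.
start=$(date +%s)
sleep 2                                    # waiting, not computing
elapsed=$(( $(date +%s) - start ))

# utime+stime of this shell, in clock ticks (usually 100 per second):
cpu_ticks=$(awk '{print $14 + $15}' "/proc/$$/stat")
echo "elapsed=${elapsed}s cpu_ticks=${cpu_ticks}"
```

Here `elapsed` is at least 2 s while `cpu_ticks` stays near zero, i.e. the two clocks measure genuinely different things.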

4. Currently there is no better way to stop a method while it is running; this is why we have the timeout configured. In the new version, old processes will be killed automatically on rerun, so this will happen automatically.

I suggest trying the "development" version of SIMON so you stay on top of the newest features and improvements:

docker run --rm --detach --name genular --tty --interactive --env IS_DOCKER='true' --env TZ=Europe/London --oom-kill-disable --volume genular_data_latest:/mnt/usrdata --publish 3010:3010 --publish 3011:3011 --publish 3012:3012 --publish 3013:3013 genular/simon:latest

Please be sure to delete all Docker containers and data volumes before running the above command; otherwise you may have the newest source code but an old database, etc.