The Art of Utilizing Connections In Your Data

Pulling Out All The Stops

September 29th, 2018 by jwubbel-admin

From the outside, it can seem that mining Big Data and applying machine learning to it is picking the low-hanging fruit. What about small data from complex processes or methods? Far less has been written about the harder problems of success or failure in learning and discovery when the data are small.

I allude to this notion of small data in my book JMP CONNECTIONS. If you choose to tackle such a scenario, you really must pull out all the stops. Assuming you have a very clean, large data set from a lengthy, complex process containing many sub-processes, here is how I have gone about putting the squeeze on my small data sets.

1. Know that you will likely need several series of meetings with your data science team.
2. Know that in those meetings you will have to bring in certain people in "Just-In-Time" fashion, such as statisticians, subject matter experts, scientists, and engineers, to supplement knowledge, confirm assumptions, and ask for evaluations.
3. Pick a realistic goal. Are we trying to find insight about the process that is not currently known? Do we want to provide an analytical tool to production folks? Maybe maintain a consistent level of yield even though the process has potential variability at times? Everyone likes it when yields are high. What is it we want to predict?

As it turns out, one might have thousands of parameters collected, so you have to go about the work of reducing your feature set. Initially you can do this in JMP with the Process Screening platform, because it gives you the opportunity to learn more about your data. Once a goal has been selected, the team might zero in on a particular sub-process, at which point the data set can all of a sudden become smaller. The subject matter experts can explain the sub-process and its associated data, but the scientist or engineer will want to identify several y Response variables on which to make predictions.
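JMP's Process Screening platform is point-and-click, but the first cut it gives you can be sketched in Python. This is only a rough analogue on made-up data, assuming pandas and NumPy are available: drop parameters whose variance is effectively zero, which are useless for modeling.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: many process parameters, relatively few runs.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 200)),
                  columns=[f"p{i}" for i in range(200)])
df["p0"] = 1.0              # a stuck sensor: zero variance
df["p1"] = df["p1"] * 1e-9  # a near-constant parameter

# Screen out parameters whose variance is effectively zero.
variances = df.var()
keep = variances[variances > 1e-8].index
screened = df[keep]

print(screened.shape)  # far fewer columns than we started with
```

A variance screen like this is only the crudest first pass; the real value of the JMP platform is in examining process behavior over time, which a one-line filter cannot replace.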

The smaller data set now holds potential factors that may or may not have anything to do with what you want to predict when it comes to building a model. You could import the data set into a Multivariate Analysis to get a graphical picture, or use the JMP PCA platform, at which point you may discover an outlier that needs to be explained or excluded from the data set. There also happens to be a Predictor Screening tool in JMP that can quickly show which factors likely have no impact on the y Response variable when building a model.
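The PCA outlier check can be sketched outside of JMP as well. Here is a toy version with scikit-learn (the data and the three-sigma cutoff are my own assumptions, not anything from JMP): project the rows onto the first two principal components and flag rows that sit far from the center of the scores.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X[7] = 25.0  # plant an obvious outlier row

# Project onto the first two principal components.
scores = PCA(n_components=2).fit_transform(X)

# Flag rows whose distance from the score center is extreme.
dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)
outliers = np.where(dist > dist.mean() + 3 * dist.std())[0]
print(outliers)
```

Whether a flagged row is excluded or explained is a team decision, as above; the code only points at the row.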

Now one is faced with a wide array of methods to choose from. Do not let that intimidate you. Pick one to start with, such as the JMP Neural Net (NN). You can use it with categorical or continuous numeric data to build your first model. Specify your y Response and the set of factors from your small data set, then go to the model launcher to configure it before running the NN algorithm. Given that the data set is small, select KFold as the first validation approach to avoid overfitting the data.
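The same idea, a small neural net validated with k-fold rather than a held-out set, can be sketched with scikit-learn. The data here are hypothetical (four screened factors, sixty runs, a simple response); the point is only that with k-fold every row serves in both training and validation, which matters when n is small.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Hypothetical small data set: 60 runs, 4 screened factors, one y Response.
rng = np.random.default_rng(2)
X = rng.uniform(size=(60, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.05, size=60)

# K-fold cross-validation in place of a single holdout: every row is
# used for both fitting and validation across the folds.
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean())
```

The JMP NN launcher exposes analogous choices (number of hidden nodes, validation method) through dialogs rather than code.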

It is important early on to start conducting demos with the appropriate team members and end users. Remember, they have never seen their data from this angle, nor how close the prediction comes to the actual values once JMP Pro has done the validation and training on the model. Know that model evaluation is a must before any decision to use the model in production.
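What "how close the prediction is to the actual" means can be made concrete with a couple of standard metrics. A minimal sketch on toy data, assuming scikit-learn (a plain linear model stands in here for whatever model is actually under evaluation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical evaluation: hold rows back, then compare prediction to actual.
rng = np.random.default_rng(3)
X = rng.uniform(size=(80, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=80)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"RMSE {rmse:.3f}, R^2 {r2_score(y_te, pred):.3f}")
```

Showing an actual-versus-predicted plot alongside these numbers is usually what lands with end users in a demo.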

When the team gets a chance to experiment with the Prediction Profiler, inevitably someone will remember something missing from the data, something upstream of the sub-process, or two additional variables that could also be predicted. This is the time to start making a list of all the possible models that could enhance and control the sub-process to optimize yields. It gives everyone a chance to look at it from different angles. Think of it as the analog of a music recording studio, where getting the settings on the mixers just right makes the perfect recording. It could be that a transfer-learning model set is needed, where a prediction from one model is an input to a second model. Or someone will speak up to say that finding the optimal settings for three predicted responses requires a simulator. Lo and behold, the Prediction Profiler in JMP has a Simulator function.
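Both ideas, chaining one model's prediction into a second model and simulating over factor settings, can be sketched in a few lines. Everything below is a made-up toy (two exact linear relationships, a crude Monte Carlo jitter), not the JMP Simulator itself, but the shape of the computation is the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# Chained models: the first model's prediction becomes an input
# factor to the second model.
X1 = rng.uniform(size=(50, 2))
y1 = X1 @ np.array([2.0, -1.0])
m1 = LinearRegression().fit(X1, y1)

X2 = np.column_stack([m1.predict(X1), rng.uniform(size=50)])
y2 = 4.0 * X2[:, 0] + 0.5 * X2[:, 1]
m2 = LinearRegression().fit(X2, y2)

# A crude Profiler-style simulation: jitter the upstream factor
# settings and watch how the downstream prediction spreads.
base = np.array([[0.5, 0.5]])
sims = base + rng.normal(scale=0.05, size=(1000, 2))
y_sim = m2.predict(np.column_stack([m1.predict(sims),
                                    np.full(1000, 0.5)]))
print(y_sim.mean(), y_sim.std())
```

The spread of `y_sim` is the quantity of interest: it shows how much yield variation to expect from realistic noise in the upstream settings, which is exactly the question the Simulator function answers interactively.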

Pulling out all the stops is the best way of putting this into perspective when the data sets around a complex process are small. Once a model, or several models, are deployed to production, evaluation continues while they are in use. Comparative study will help refine or fine-tune model configurations for future builds and training. The basis for all this work is, of course, a good business case with a positive, measurable financial component to justify your investment of time and talent.

Posted in Factors, Feature Reduction, Model Building, Neural Net, Prediction, Response Variable, Screening, Small Data
