Modeling (Alt-M)
[Generate PCA model][PCA Rank Validation][Generate PLS model][Validate PLS model][Generate CPCA model]
The Modeling menu contains the commands to perform Principal Component Analysis (PCA), to validate the rank of PCA models, to generate and validate Partial Least Squares (PLS) models, and to generate Consensus PCA (CPCA) models.
Modeling>>>Generate PCA model (Alt-C)
PCA is carried out on the whole X-matrix. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.
The number of PCs calculated for the model can be selected by the User.
Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.
The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the number of components processed.
After a while GOLPE will display in the main window the results of the PCA:
Principal Component Analysis (PCA)
24 objects    24 X-var

components    XVarExp     XAccum
    1         59.7849    59.7849
    2         39.9664    99.7513
    3          0.1676    99.9188
    4          0.0466    99.9654
    5          0.0250    99.9905
For each component the following is shown:
XVarExp    Percentage of the X-matrix variance explained by this component.
XAccum    Cumulative percentage of the X-matrix variance explained by the model.
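As a rough illustration of how figures such as XVarExp and XAccum can be obtained, the following minimal sketch extracts components one at a time with the NIPALS algorithm and reports the percentage of the total sum of squares each one explains. The use of NIPALS, the column centering and the function name `nipals_pca_variance` are assumptions made for this example; this page does not state how GOLPE computes the model.

```python
import math

def nipals_pca_variance(X, n_components, tol=1e-12, max_iter=500):
    """Return (XVarExp, XAccum) in percent for the first n_components PCs."""
    n, m = len(X), len(X[0])
    # Assumption: columns are mean-centred before the analysis.
    means = [sum(row[j] for row in X) / n for j in range(m)]
    E = [[row[j] - means[j] for j in range(m)] for row in X]
    total_ss = sum(v * v for row in E for v in row)

    var_exp = []
    for _ in range(n_components):
        # Start from the residual column with the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(row[j] ** 2 for row in E))
        t = [row[j0] for row in E]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loadings: project the residual matrix onto the current scores.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]  # unit-length loading vector
            # Scores: project the residual matrix onto the loadings.
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the variance explained by this component.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        var_exp.append(100.0 * sum(v * v for v in t) / total_ss)

    accum, running = [], 0.0
    for v in var_exp:
        running += v
        accum.append(running)
    return var_exp, accum

# Tiny made-up data set: 4 objects, 2 X-variables.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
var_exp, accum = nipals_pca_variance(X, 2)
for a, (v, c) in enumerate(zip(var_exp, accum), start=1):
    print(f"{a:5d} {v:10.4f} {c:10.4f}")
```

The sketch does not guard against a residual matrix that is already zero, so it should not be asked for more components than the data can support.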
Modeling>>>PCA Rank Validation
Quite often, when making a PCA, it is interesting to know how many PCs are actually significant. There is no simple answer to this question, and even the definition of "significant" is open to discussion. GOLPE incorporates a cross-validation technique for assessing the significance of successive dimensions in PCA models.
The method works by dividing the data set randomly into G groups. Each group consists of a set of values extracted regularly from the matrix, as in the table:
1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 |
5 | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 |
4 | 5 | 1 | 2 | 3 | 4 | 5 | 1 | 2 |
3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 | 1 |
2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 |
1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 |
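The pattern in the table above can be reproduced by numbering the matrix elements consecutively, row by row, and assigning element k to group (k mod G) + 1. The sketch below does this for 6 rows, 9 columns and G = 5. The rule is inferred from the table, not taken from GOLPE itself, and the function name `group_pattern` is hypothetical.

```python
def group_pattern(n_rows, n_cols, n_groups):
    """Assign every matrix element to a deletion group, numbering row by row."""
    return [
        [(i * n_cols + j) % n_groups + 1 for j in range(n_cols)]
        for i in range(n_rows)
    ]

pattern = group_pattern(6, 9, 5)
for row in pattern:
    print(" | ".join(str(g) for g in row))
```

With this numbering, a number of groups that divides the number of columns exactly would place every element of a column in the same group, so whole variables would be deleted at once.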
Then the first group (G1) is taken out and a reduced model is computed, which is used to "predict" the values for the objects in the deleted group. The prediction error is measured as the sum of squares of the prediction errors (PRESS) for this reduced model. The whole procedure is repeated removing G2, G3... and all the partial prediction errors are accumulated in a total PRESS. The value of this error is compared with the data sum of squares (Seps) as
R=PRESS/Seps
where Seps, for dimension a, is computed as the sum of squares of the X matrix after removing the variance explained by the previous (a-1) PC's.
The R value is calculated for every model dimensionality. When the value of R obtained is larger than 1.00, the incorporation of this PC does not improve the predictions and therefore this PC should not be included.
A detailed description of the method can be found in: S. Wold, Cross-Validatory Estimation of the Number of Components in Factor and Principal Component Models, Technometrics 20, 397-405 (1978).
This command can be accessed only after a PCA model has been generated. It opens a dialog like this:
Max. dimensionality
Maximum dimensionality of the PCA model that will be validated.
Validation Groups
Number of validation groups used in the procedure. It can be set between 4 and 7, avoiding numbers that are exact divisors of the number of variables.
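The rule for choosing the number of validation groups can be sketched as a simple filter. The function name `allowed_group_counts` is hypothetical, and treating the limits 4 and 7 as inclusive is an assumption.

```python
def allowed_group_counts(n_variables, low=4, high=7):
    """Group counts in [low, high] that do not divide n_variables exactly."""
    return [g for g in range(low, high + 1) if n_variables % g != 0]

# 24 X-variables: 4 and 6 are exact divisors and are therefore excluded.
print(allowed_group_counts(24))
```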
Press OK to start the validation or Cancel to abort it. The Defaults button will load the pre-set values. The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the percentage of the work completed.
After a while GOLPE will display in the main window the results of the PCA validation:
PCA Rank Validation - using 5 random groups

components      PRESS          Seps           R
    1        6.1742e+05    8.3696e+05    0.7377
    2        4.5163e+05    6.0483e+05    0.7467
    3        3.5871e+05    4.4333e+05    0.8091
    4        2.9929e+05    3.4783e+05    0.8604
    5        2.6812e+05    2.9408e+05    0.9117
PRESS    Sum of squares of the errors of the PCA predictions, computed as explained above.
Seps    Sum of squares of the data, computed as explained above.
R    Ratio PRESS/Seps. A component is considered significant when R<1.0.
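The R column can be checked directly from the PRESS and Seps values. The short sketch below recomputes R for the five components listed in the output above and flags those with R < 1.0. The function name `rank_validation` is hypothetical.

```python
# (component, PRESS, Seps) as listed in the validation output above.
rows = [
    (1, 6.1742e05, 8.3696e05),
    (2, 4.5163e05, 6.0483e05),
    (3, 3.5871e05, 4.4333e05),
    (4, 2.9929e05, 3.4783e05),
    (5, 2.6812e05, 2.9408e05),
]

def rank_validation(rows):
    """Return (component, R, significant) with R = PRESS / Seps."""
    return [(comp, press / seps, press / seps < 1.0) for comp, press, seps in rows]

results = rank_validation(rows)
for comp, r, significant in results:
    print(f"{comp:5d} {r:8.4f} {'significant' if significant else 'not significant'}")
```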
However, the validity of the method is relative. If the User is making the PCA mainly for visualizing the data, an obvious limit to the complexity of the model is 3 PCs, since more complex models would be difficult to represent. Moreover, data sets containing outliers and/or clusters of objects can produce misleading results. Therefore our advice is to apply common sense and take the results of this test only as a hint for selecting the right dimensionality of the model.
Modeling>>>Generate PLS model (Alt-G)
This command generates the PLS model in fitting, i.e. all available objects (molecules) are used to build the model. The item in the menu is insensitive when the data file does not contain Y-variables.
GOLPE will use the whole X-matrix and all the variables defined as Y's. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.
The number of components calculated for the model can be selected by the User.
Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.
The calculation will take from a few seconds to several minutes, depending on the number of X and Y-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog window that shows the number of components processed.
After a while GOLPE will display in the main window the results of the PLS:
Partial Least Squares (PLS)
15 objects    449 X-var    1 Y-var

Y1
components    XVarExp     XAccum      SDEC        r2
    0          0.0000     0.0000    1.0675    0.0000
    1         18.7309    18.7309    0.5703    0.7146
    2         12.7664    31.4973    0.4179    0.8468
    3         19.7530    51.2503    0.3586    0.8871
    4         10.4417    61.6920    0.3052    0.9183
    5         14.4762    76.1682    0.2760    0.9331
For each component the following is shown:
XVarExp    Percentage of the X-matrix variance explained by this component.
XAccum    Cumulative percentage of the X-matrix variance explained by the model.
SDEC    Standard Deviation of Error of Calculations.
r2    Squared correlation coefficient.
Y : Experimental value
Y' : Value calculated by the model
Ȳ : Average value
N : Number of objects
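The formulas for SDEC and r2 are not reproduced on this page. A minimal sketch, assuming the usual definitions SDEC = sqrt(sum((Y - Y')^2) / N) and r2 = 1 - sum((Y - Y')^2) / sum((Y - mean(Y))^2), is given below. These definitions agree with the output above: for one component, 1 - (0.5703/1.0675)^2 is approximately 0.7146. The data and the function name `sdec_and_r2` are made up for illustration.

```python
import math

def sdec_and_r2(y, y_calc):
    """SDEC and r2 from experimental (y) and calculated (y_calc) values."""
    n = len(y)
    y_mean = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_calc))  # sum (Y - Y')^2
    ss_tot = sum((a - y_mean) ** 2 for a in y)             # sum (Y - mean)^2
    return math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot

# Made-up experimental and calculated values for 4 objects.
y = [1.0, 2.0, 3.0, 4.0]
y_calc = [1.1, 1.9, 3.2, 3.8]
sdec, r2 = sdec_and_r2(y, y_calc)
print(f"SDEC = {sdec:.4f}  r2 = {r2:.4f}")

# A zero-component model calculates the average for every object, giving r2 = 0.
sdec0, r2_0 = sdec_and_r2(y, [sum(y) / len(y)] * len(y))
print(f"SDEC = {sdec0:.4f}  r2 = {r2_0:.4f}")
```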
Modeling>>>Validate PLS model (Alt-V)
The validation of PLS models is one of the most important features of GOLPE. This command can be accessed only after a PLS model in fitting has been generated.
Max. dimensionality
Selects the maximum dimensionality of the PLS model to validate. The optimal dimensionality of the model may be less than or equal to this maximum.
Validation mode
Selects the cross-validation method used to validate the model. It is possible to choose between:
Only with this last option, immediately after pressing the OK button, the User will be prompted to define the groups in a dialog window like this:
The User should proceed as follows:
When all the objects have been assigned to a group, press the OK button to proceed with the validation, or Cancel to abort it.
Num. of SDEP
This scale is sensitive only when the Random Groups option is selected. The number shown in the scale indicates how many times the whole validation procedure will be repeated, as stated above. The default is 20 times.
Number of groups
This control is sensitive only when the Random Groups or Specific Groups options are selected. It specifies the number of groups into which the objects in the data file will be split. We suggest using 5 groups when the number of objects is 20 or larger, and fewer groups when the number of objects is smaller.
Recalculate weights
Selecting yes will force GOLPE to recalculate the variable weights in each computation. The results are more reliable and stable although the computation is slightly slower.
When all the settings are correct, press the OK button to start the computation. Press the Cancel button to abort the validation, or the Defaults button to reset all the settings in this dialog window to their default values. Remember that, when the validation uses Specific Groups, a new dialog window will appear to define the groups.
The calculation can take several minutes, depending on the number of X and Y-variables, the number of objects and, mainly, the validation procedure chosen. Random Groups is the most time-consuming procedure, and its duration also depends on the Num. of SDEP defined. GOLPE reports the progress of the validation in a working dialog that shows the percentage of the calculation completed.
After a while GOLPE will display in the main window the results of the PLS validation:
PLS Model Validation - 5 Random Groups    20 SDEP-calc

Y1
components      SDEP    SDEV(sdep)         q2
    0         1.1599      0.0417      -0.1807
    1         0.9637      0.0592       0.1850
    2         0.9217      0.0888       0.2544
    3         0.8738      0.1087       0.3300
    4         0.8607      0.0933       0.3498
    5         0.8639      0.0732       0.3451
For each component it is shown:
SDEP    Standard Deviation of Error of Predictions.
SDEV(sdep)    Standard deviation of SDEP.
q2    Squared predictive correlation coefficient.
Y : Experimental value
Y' : Predicted value
Ȳ : Average value
N : Number of objects
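The Random Groups procedure can be sketched as follows, assuming SDEP = sqrt(PRESS / N) and q2 = 1 - PRESS / sum((Y - mean(Y))^2), with PRESS being the sum of squared prediction errors over all left-out objects. How GOLPE combines the repeated runs into a single q2 is not stated on this page; the sketch derives it from the mean SDEP, which is an assumption. The names `random_groups`, `sdep_random_groups` and `mean_model` are hypothetical, and the "model" is a stand-in that predicts the training average (a zero-component model), not PLS.

```python
import math
import random

def random_groups(n_objects, n_groups, rng):
    """Split object indices randomly into n_groups groups of similar size."""
    idx = list(range(n_objects))
    rng.shuffle(idx)
    return [idx[g::n_groups] for g in range(n_groups)]

def sdep_random_groups(y, fit_predict, n_groups=5, n_repeats=20, seed=0):
    """Return (mean SDEP, SDEV of SDEP, q2) over n_repeats random splits."""
    rng = random.Random(seed)
    n = len(y)
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    sdeps = []
    for _ in range(n_repeats):
        press = 0.0
        for test in random_groups(n, n_groups, rng):
            held_out = set(test)
            train = [i for i in range(n) if i not in held_out]
            preds = fit_predict(train, test)  # model built without the group
            press += sum((y[i] - p) ** 2 for i, p in zip(test, preds))
        sdeps.append(math.sqrt(press / n))
    mean_sdep = sum(sdeps) / n_repeats
    sdev = math.sqrt(sum((s - mean_sdep) ** 2 for s in sdeps) / (n_repeats - 1))
    q2 = 1.0 - n * mean_sdep ** 2 / ss_tot  # assumption: q2 from the mean SDEP
    return mean_sdep, sdev, q2

# Made-up Y values for 15 objects.
data_rng = random.Random(1)
y = [data_rng.gauss(0.0, 1.0) for _ in range(15)]

def mean_model(train, test):
    """Zero-component stand-in: predict the average of the training objects."""
    m = sum(y[i] for i in train) / len(train)
    return [m] * len(test)

sdep, sdev, q2 = sdep_random_groups(y, mean_model)
print(f"SDEP = {sdep:.4f}  SDEV(sdep) = {sdev:.4f}  q2 = {q2:.4f}")
```

A model that only predicts the training average cannot beat the overall average, so its q2 is never positive. This matches the negative q2 reported for 0 components in the output above.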
Modeling>>>Generate CPCA model
When the data file contains at least two X blocks, GOLPE can generate a Consensus Principal Component Analysis (CPCA) model. Please refer to the background section for information about the particular implementation of CPCA in GOLPE.
CPCA is carried out on the whole X-matrix. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.
The number of PCs calculated for the CPCA model can be selected by the User.
Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, in order to be safe, the number of components should always be smaller than one third of the number of objects.
The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the number of components processed.
After a while GOLPE will display in the main window the results of the CPCA:
Consensus Principal Component Analysis (CPCA)
14 objects    9118 X-var

block      var      act     %SS
  1      12144      260    16.7
  2      12144     1584    16.7
  3      12144     1784    16.7
  4      12144     1826    16.7
  5      12144     1967    16.7
  6      12144     1697    16.7
For each block, the number of variables (var), the number of active variables (act) and the percentage of the total sum of squares accounted for by this block (%SS) are shown. A summary of the analysis is then presented:
comp    XVarExp     XAccum    XAccum[1]    XAccum[2]    XAccum[3]    XAccum[4]    XAccum[5]    XAccum[6]
  1     27.5555    27.5555     10.2330      26.4529      33.1077      33.3900      34.0595      32.2891
  2     24.6665    52.2220     39.7205      53.6389      54.2596      54.9317      55.7070      57.0575
  3      5.7121    57.9341     49.3022      59.8720      58.6231      59.9664      60.2219      62.4670
  4      5.6438    63.5779     57.9962      65.0712      64.0167      65.0680      65.0620      65.7364
  5      4.8046    68.3825     62.1375      70.2911      68.5776      69.6260      70.8902      69.0693
For each principal component extracted, the information shown refers mainly to the "superblock level" of the CPCA model. It is similar to the information obtained for a regular PCA model.
XVarExp    Percentage of the X-matrix variance explained by this component.
XAccum    Cumulative percentage of the X-matrix variance explained by the model.
XAccum[i]    Cumulative percentage of the block i variance explained by the local model (using block scores and block loadings).
Then, for each block, some further information regarding the block level of the CPCA model is presented. This information is obtained using the block scores and block loadings, and the percentages refer to the block variance and not to the overall X variance. Refer to the background section for a discussion of the meaning of these figures.
Block [1], 12144 X-var    260 Active    16.7 %SS

comp    XVarExp[1]    XAccum[1]
  1       10.2330      10.2330
  2       29.4874      39.7205
  3        9.5817      49.3022
  4        8.6940      57.9962
  5        4.1413      62.1375
...