Help for package DEMOVA

Type:

Package

Title:

DEvelopment (of Multi-Linear QSPR/QSAR) MOdels VAlidated using Test Set

Version:

1.0

Date:

2016-03-15

Author:

Vinca Prana

Maintainer:

Vinca Prana <vinca.prana@free.fr>

Description:

Tool for the development of multi-linear QSPR/QSAR models (Quantitative structure-property/activity relationship). Theses models are used in chemistry, biology and pharmacy to find a relationship between the structure of a molecule and its property (such as activity, toxicology but also physical properties). The various functions of this package allows: selection of descriptors based of variances, intercorrelation and user expertise; selection of the best multi-linear regression in terms of correlation and robustness; methods of internal validation (Leave-One-Out, Leave-Many-Out, Y-scrambling) and external using test sets.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

Depends:

leaps

Suggests:

testthat

NeedsCompilation:

Packaged:

2016-03-15 16:07:15 UTC

Repository:

CRAN

Date/Publication:

2016-03-15 19:54:06

DEvelopment of (multi-linear QSPR/QSAR) MOdels VAlidated using test set.

Description

Details

Package:	DEMOVA
Type:	Package
Version:	1.0
Date:	2016-03-15
License:	GPL (>= 2)

Example of input files are avaible into the floder "tests".
# data<-read.csv("NameOfInputFile.csv",header = TRUE , sep=" ")
# mydesc<-data[,3:dim[2]]
Functions should be use in this order:
- preselection
- select_variables
- select_MLR
- fit
- LOO / LMO / Scramb (No specific order between these ones. Optional to do the rest)
- prediction
- graphe_3Sets

Author(s)

Vinca Prana
Maintainer: Vinca Prana <vinca.prana@free.fr>

References

1. Selassie, C. D. History of Quantitative Structure-Activity Relationship; Burger's Medicinal Chemistry and Drug Discovery Sixth Edition; John Wiley & Sons Inc., 2002; Vol. 1. (2)
2. Willett, P. Chemoinformatics: a History. Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 46-56.

Leave Many Out

Description

Calculate the robustness of the equation using the leave many out method.

Usage

LMO(mydata, cv, n)

Arguments

mydata

Dataframe containing names and values of response and descriptors

cv

Numbers of fold

n

Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

return Q2, the coefficient that measure the robstness

References

1. Gramatica, P. Principles of QSAR Models Validation: Internal and External. Qsar & Combinatorial Science 2007, 26, 694-701.
2. Golbraikh, A.; Tropsha, A. Beware of Q(2)! Journal of Molecular Graphics & Modelling 2002, 20, 269-276.

Examples

# First run Select_MLR to define n

#LMO(mydata,5,dim(MLR)[2])
#LMO(mydata,10,dim(MLR)[2])

Leave One Out

Description

Calculate the robustness of the equation using the leave one out method.

Usage

LOO(mydata, n)

Arguments

mydata

Dataframe containing names and values of response and descriptors

n

Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

return Q2, the coefficient that measure the robstness

References

Examples

# First run Select_MLR to define n

# LOO(mydata,dim(MLR)[2])

Performance of selected model

Description

Perform a multi linear regression between property and previously selected descriptors (using select_MLR function).
Calculate R2 coefficient and the predicted values from the MLR. Trace the graph experimental values vs predicted values.

Usage

fitting(mydata, n, property)

Arguments

mydata

Dataframe containing names and values of response and descriptors

n

Number of selected descriptors of the regression (determined using select_MLR function)

property

Name of the studied proterty

Value

prediction_TrainSet_Y.csv

File containing prediction obtained using the fitting

Y_TrainingSet.tiff

Image representing experimental values vs predicted values for the training set

fit

lm object return by the function

Examples

# First run select_MLR to define n
# y<-data[,2]
# mydata<-cbind(y,MLR)
# fit<-fitting(data,dim(MLR)[2],"Name of property")

Predictions for the external validation set and graph

Description

Calulate the predicted values for the external validation set and trace the graph experimental values vs predicted values for training, test and external validation sets.

Usage

graphe_3Sets(fit, mydata, mynewdata, mynewdata2, n)

Arguments

fit

Multi linear regression between property and selected descriptors (lm object)

mydata

Dataframe containing names and values of response and descriptors

mynewdata

Dataframe containing property and selected descriptors values for the test set

mynewdata2

Dataframe containing property and selected descriptors values for the external validation set

n

Numbers of selected descriptors of the regression (determined using select_MLR)

Value

Rext, Rext2

return a list containing the value of the determination coefficient of the test set and of the external validation set

Graphe_3sets.tiff

Image representing experimental values vs predicted values for the all three sets

Examples

# This function have to be run last!

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom<-'Test_set.csv'
# newdata<-read.csv(new_nom,header=TRUE , sep=" ")
# mynewdata=newdata[,2:dim[2]]


## "External_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

# new_nom2<-'External_set.csv'
# newdata2<-read.csv(new_nom2,header=TRUE , sep=" ")
# mynewdata2=newdata2[,2:dim[2]]

#graphe_3Sets(fit,mynewdata,mynewdata2,dim(MLR)[2])

Predictions for the test set and graph

Description

Calulate the predicted values for the test set and trace the graph experimental values vs predicted values for both training and test sets. This function also give the R2 test coefficent.

Usage

prediction(fit, mydata, mynewdata, n)

Arguments

fit

Multi linear regression between property and selected descriptors

mydata

Dataframe containing names and values of response and descriptors

mynewdata

Dataframe containing property and selected descriptors values for the test set

n

Numbers of selected descriptors of the regression (determined using Select_MLR)

Value

Exp.vs.Pred.tiff

Image representing experimental values vs predicted values for the both sets

Rext

return the value of the determination coefficient of the test set

Examples

# This function have to be run after choise of the model.

## "Test_set.csv" should be with the following form
## ID property SelectedDesc1 SelectedDesc2 ... 

#new_nom<-'Test_set.csv'
#newdata<-read.csv(new_nom,header=TRUE , sep=" ")
#mynewdata=newdata[,2:dim[2]]

#prediction(fit,mynewdata,dim(MLR)[2])

Suppression of missing or constant descriptors

Description

Remove descriptors with missing values and a variance lower than 0.001.

Usage

preselection(desc)

Arguments

desc

Dataframe containing the names of desciptors and their values

Value

return a dataframe without the removed variables

Examples

## The input file should be with the following form
## id_molecule propriete x1 x2 x3 ... # Header line
## molecule1 1 0.02 500 ...
## molecule2 5 0.06 600 ...

# nom<-"NameOfInputFile.csv"
# data<-read.csv(nom,header = TRUE , sep=" ") 
# dim<-dim(data)
# mydesc<-data[,3:dim[2]]
# id<-data[,1]
# y<-data[,2] 

# d<-preselection(mydesc)

scrambling

Description

Perform the y-scrambling method that consit to permute y values and try to develop new models. They have to be unperformants in order to validate the original one. The graph R2 vs r(y,yrandom) is created.

Usage

scramb(mydata, k, n, cercle = FALSE)

Arguments

mydata

Dataframe containing names and values of response and descriptors

k

Number of random run

n

Number of selected descriptors of the regression (determined using Select_MLR)

cercle

Value is TRUE or FALSE (by default) . If it TRUE it's draw a circle around the point representinf the original model

Value

Return a list of

mean

Mean of R^2 new model

sd

RStandard deviation of R^2 new model

And also

Scramb.tiff

Description of 'comp1'

Scramb.csv

Description of 'comp2'

References

Tropsha, A.; Gramatica, P.; Gombar, V. K. The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. Qsar \& Combinatorial Science 2003, 22, 69-77.
Rucker, C.; Rucker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345-2357.
Lindgren, F.; Hansen, B.; Karcher, W.; Sjostrom, M.; Eriksson, L. Model Validation by Permutation Tests: Applications to Variable Selection. Journal of Chemometrics 1996, 10, 521-532.

Examples

# First run Select_MLR to define n

# scramb(mydata,1000,nom,dim(MLR)[2])

Development of the model (multi linear regression)

Description

From a list of descriptors and responses values, this function choose the best compromise between correlation and robustness to select the best model.

Usage

select_MLR(y, desc, n, method = "forward")

Arguments

y

Vector with values of the property/response

desc

Dataframe containing the names of desciptors and their values

n

Maximal number (integer) of desciptors for the final equation

method

Determine the method used to build the regression. Can be: "backward", "forward" (by default) or "seqrep". For more info see leaps package.

Value

Return the list of selected variables for the choosen MLR.

Examples

# First run Select_variables to remove descriptors with missing or constant values.
# MLR<-select_MLR(y,desc,5)

Selection of descriptors

Description

This function allow the user to select wanted descriptors between both that are intercorrelated with a correlation coefficent higher that ThresholdInterCor. The selection can also be automatic based on the correlation with the property of each variables.

Usage

select_variables(id, y, d, ThresholdInterCor, auto = FALSE)

Arguments

id

List of the names of observations

y

List of the values of the property/response

d

Dataframe containing the names of desciptors and their values (without missing or constant values)

ThresholdInterCor

Threshold value (double) of the accepted intercorrelation between descriptors (should be between 0 and 1)

auto

Two possible values: TRUE or FALSE (by default). The selection of descriptors is done automatically based on the correlation between descriptor and property (auto=TRUE) or is done manually by user (auto=FALSE)

Value

return a dataframe containing only of non intercorrelated variables

Examples

# Run after Preselection : d<-Preselection(desc)

# desc<-select_variables(id,y,d,0.95)