User:Aditya06k

CoMFA

CoMFA (Comparative Molecular Field Analysis) is a 3D QSAR technique based on data from known active molecules. CoMFA can be applied, as it often is, when the 3D structure of the receptor is unknown. All that is needed are the activities and the 3D structures of the molecules. The activities have to be measured, but the 3D structures can be determined either by measurement (X-ray crystallography) or by calculation from the 2D structure diagram, optionally followed by geometry optimization.

1. Introduction
2. Set up the working environment
3. Create a molecular database and spreadsheet
4. Add a CoMFA column to the spreadsheet
5. Perform a cross-validated PLS analysis

Introduction

The aim of CoMFA is to derive a correlation between the biological activity of a set of molecules and their 3D shape, electrostatic and hydrogen bonding characteristics. This correlation is derived from a series of superimposed conformations, one for each molecule in the set. These conformations are presumed to be the biologically active structures, overlaid in their common binding mode. Each conformation is taken in turn, and the molecular fields around it are calculated. The fields, usually electrostatic and steric (van der Waals interactions), are measured at the lattice points of a regular Cartesian 3D grid; the lattice spacing is typically 2 Å. The "measured" interaction is between the molecule and a probe atom (an sp3-hybridized carbon with +1 charge).
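The field calculation can be sketched in a few lines of Python. This is only an illustration, not the Sybyl implementation. The toy two-atom "molecule", the probe radius and well depth, the 4 Å grid margin, and the constant dielectric are all assumptions. The 2 Å spacing and the +1 probe charge are taken from the text, as is the 30 kcal/mol truncation that this tutorial mentions as the default.

```python
import math

# Hypothetical toy molecule: x, y, z (Å), partial charge (e),
# van der Waals radius (Å), well depth (kcal/mol). Not real data.
ATOMS = [
    (0.0, 0.0, 0.0, -0.40, 1.7, 0.10),
    (1.5, 0.0, 0.0, 0.40, 1.7, 0.10),
]

PROBE_CHARGE = 1.0   # +1 probe charge, as stated in the text
PROBE_RADIUS = 1.7   # assumed radius for the sp3 carbon probe (Å)
PROBE_EPS = 0.10     # assumed well depth for the probe (kcal/mol)
CUTOFF = 30.0        # truncation value in kcal/mol
COULOMB = 332.06     # converts e^2/Å to kcal/mol (dielectric of 1 assumed)


def probe_energies(point, atoms):
    """Return (steric, electrostatic) probe energies at one grid point."""
    steric = 0.0
    electro = 0.0
    for x, y, z, charge, radius, eps in atoms:
        # Avoid division by zero when a grid point sits on an atom.
        r = max(math.dist(point, (x, y, z)), 1e-3)
        ratio = (radius + PROBE_RADIUS) / r
        well = math.sqrt(eps * PROBE_EPS)
        steric += well * (ratio ** 12 - 2.0 * ratio ** 6)  # Lennard-Jones 12-6
        electro += COULOMB * charge * PROBE_CHARGE / r     # Coulomb term
    # Truncate large energies so that they do not dominate the analysis.
    steric = min(steric, CUTOFF)
    electro = max(-CUTOFF, min(CUTOFF, electro))
    return steric, electro


def grid_axis(lo, hi, spacing):
    """Evenly spaced lattice coordinates covering the interval [lo, hi]."""
    n = int(math.ceil((hi - lo) / spacing)) + 1
    return [lo + i * spacing for i in range(n)]


def comfa_fields(atoms, spacing=2.0, margin=4.0):
    """Steric and electrostatic fields on a regular grid around the atoms."""
    axes = []
    for dim in range(3):
        coords = [atom[dim] for atom in atoms]
        axes.append(grid_axis(min(coords) - margin, max(coords) + margin, spacing))
    steric, electro = [], []
    for gx in axes[0]:
        for gy in axes[1]:
            for gz in axes[2]:
                s, e = probe_energies((gx, gy, gz), atoms)
                steric.append(s)
                electro.append(e)
    return steric, electro


steric_field, electro_field = comfa_fields(ATOMS)
print(len(steric_field), "grid points per field")
```

Each molecule yields one such pair of field vectors. Concatenated, they form that molecule's row of descriptors, which is why a CoMFA analysis has far more variables than molecules.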

How does CoMFA work?

Active molecules are placed in a three-dimensional grid (2 Å spacing) that encompasses all of the molecules. At each grid point, a probe atom (an sp3-hybridized carbon with +1 charge) is used to calculate the steric energy (Lennard-Jones potential) and the electrostatic energy for each molecule. To prevent very large steric and electrostatic energies from dominating the analysis, all energies that exceed a specified cutoff (default 30 kcal/mol) are set to the cutoff value. CoMFA then uses a partial least-squares (PLS) analysis to predict activity from the energy values at the grid points.

What is needed before doing CoMFA?

Molecules with activities spanning about three log units of Ki or IC50 values are required. Charges should be added to the molecules so that the electrostatic energy can be determined. A good alignment is the single most important part of a CoMFA analysis: the common substructure should have the same conformation in all molecules, and the other parts should be superimposed as closely as possible by adjusting internal torsion angles.

In this tutorial, you will create a CoMFA model and study its application.

Set up the working environment

If necessary, check the X-windows start-up page for detailed instructions on how to set up the X-windows environment and how to access cheminf.cmbi.ru.nl, the CMBI Unix machine needed to run Sybyl. Then, from the Unix shell (command prompt):

1. Change directory to data/bioinf4/comfa by typing cd data/bioinf4/comfa
2. Start Sybyl by typing sybyl

The Sybyl menu appears. Note two general tools on the vertical icon bar at the left: the reset tool, which resets all items (such as molecules) in the display area, and the stack tool, which brings the current Sybyl window to the front (if you lose track of it).

Create a molecular database and spreadsheet

A series of 31 compounds with moderate to high activity for the estrogen receptor (ER-α) (Endocrinology 1997, 138(9), 4022-4025) has been constructed and provided with charges according to the Gasteiger-Hückel model. This set, along with the measured activity data, will be used as the training set in a CoMFA analysis. The molecules have already been aligned.

First, create a molecular database (MDB) and fill it with molecules:

1. Select File >> Database >> New.... Type estrog_act for the database name.
2. Select File >> Read.... Type (or select) estrog_act.mol2 as the file name.
3. Select File >> Database >> Put Molecule...; click on All and then on OK.
4. Select File >> Database >> Close to close the database.

Take a look at the aligned collection of molecules; rotate them by using the mouse (keep the right mouse button pressed down). The majority of the molecules are steroids; do you recognize the steroid skeleton? Note the flat shape of the steroid A-ring region. The non-steroidal molecules have been aligned so as to fit their "A-ring" into the same flat region.

When you have finished viewing the molecules, you can clear them and create the molecular spreadsheet (MSS). The spreadsheet is filled from the database you just created:

1. Select Build/Edit >> Zap Molecule; click on All and then on OK.
2. Select Tools >> QSAR >> New spreadsheet.... Select database as the source.
3. Type (or select) estrog_act.mdb as the file name.

Now the activity data are imported from a text file into the spreadsheet. The text file is in a simple, space-delimited format:

1. Select (empty) column 1 in the spreadsheet.
2. Select (in the spreadsheet) File >> Import.... Type (or select) estrog_act.txt as the file name and click on OK.
3. Choose Delimited by Space as the input format.
4. Click on "File contains row names in field".
5. For rows, type *; for columns, select new; then click on OK.

A column of numbers should now appear next to the compound names. These data are relative binding affinities, expressed as (nanomolar!) concentration values. In QSAR, the logarithms of concentrations are used, much as pH is used for the acidity of a solution. So the values we will be using are -log(concentration × 10^-9). First, column 1 is given a new name ("RBA" for Relative Binding Affinity), and then Autofill/Functional data is used to create a column with converted values:

1. Select column 1 in the spreadsheet.
2. Select (in the spreadsheet) Edit >> Rename >> Column..., type RBA and click on OK.
3. Select (empty) column 2 in the spreadsheet and click on Autofill.
4. Select FUNCTIONAL_DATA as the new column type and click on OK; type LOGRBA, then type 9-LOG("RBA") for the functional specification, and click on OK.
5. Select column 2 in the spreadsheet.
6. Select (in the spreadsheet) Edit >> Rename >> Column..., type LOGRBA and click on OK.

Verify that the functional specification you typed corresponds to the conversion -log(RBA × 10^-9).
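The verification rests on the identity -log(c × 10^-9) = 9 - log(c). A short Python check makes this concrete. It assumes that LOG in the functional specification is a base-10 logarithm, as the conversion above implies. The RBA values are hypothetical and are not taken from estrog_act.txt.

```python
import math


def logrba_spreadsheet(rba_nanomolar):
    """Spreadsheet form of the conversion: 9 - log10(RBA), with RBA in nM."""
    return 9.0 - math.log10(rba_nanomolar)


def logrba_definition(rba_nanomolar):
    """Definition form: -log10 of the concentration converted to mol/L."""
    return -math.log10(rba_nanomolar * 1e-9)


# Hypothetical RBA values in nanomolar, chosen only for illustration.
for rba in (0.5, 1.0, 25.0, 1000.0):
    a = logrba_spreadsheet(rba)
    b = logrba_definition(rba)
    # Both forms must agree up to floating-point rounding.
    assert math.isclose(a, b, rel_tol=1e-9)
    print(f"RBA = {rba:8.1f} nM  ->  LOGRBA = {a:.3f}")
```

A lower concentration gives a higher LOGRBA value, so more potent binders end up with larger numbers, as with pKi or pIC50.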

Add a CoMFA column to the spreadsheet

Adding a CoMFA column to the molecular spreadsheet is quite straightforward:

Select (empty) column 3 in the spreadsheet and click on Autofill. Select COMFA as the new column type, and select the following values (the defaults are correct, but check them anyway):

Click on Add Column and then on OK for the column name.

A column of numbers should now appear next to the activity values. Because a CoMFA analysis produces a very large number of values, each number shown is actually a placeholder for an array of numbers; its only physical meaning is a very rough indication of the volume of the molecule. Your spreadsheet window should now look like this:

Perform a cross-validated PLS analysis

In a partial least-squares (PLS) analysis, two factors are important: the number of components used in the regression equation (which will be dealt with later) and the (usually squared) correlation coefficient. A non-cross-validated PLS analysis gives a squared correlation coefficient usually indicated by r². This number, which is also used in (multiple) linear regression, is between zero and one and expresses the quality of the PLS analysis. It indicates the proportion of the variation in the dependent variable (here the activity) that is explained by the regression equation, and its value should be as close to one as possible. However, r² expresses the quality of the data fit rather than the quality of prediction (which is what we are actually interested in).

To express the predictive power of the analysis, the cross-validated r², usually indicated by q², is used. In (leave-one-out) cross-validation, one compound is left out, a model is derived using the remaining data, and the model is used to predict the activity of the compound originally left out. This procedure is repeated for all compounds, yielding q². q² is normally (much) lower than r², and values greater than 0.5 already indicate significant predictive power.
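The difference between the fitted and the cross-validated coefficient can be illustrated with a small Python sketch. To stay short, it uses a one-descriptor least-squares line as a stand-in for PLS; that substitution, the toy data, and the use of the overall mean activity in the denominator are assumptions. The leave-one-out loop itself follows the procedure described above.

```python
# Hypothetical descriptor values and activities; not data from the tutorial.
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
YS = [6.1, 6.9, 8.2, 8.8, 10.1, 10.8]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(ys):
    """Variation of the activities around their mean."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """Fit quality: the model is built and evaluated on the same data."""
    slope, intercept = fit_line(xs, ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - sse / total_sum_of_squares(ys)


def q_squared(xs, ys):
    """Leave-one-out predictive quality: q2 = 1 - PRESS / SS."""
    press = 0.0
    for i in range(len(xs)):
        # Build the model without compound i, then predict compound i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / total_sum_of_squares(ys)


print(f"r2 = {r_squared(XS, YS):.3f}")
print(f"q2 = {q_squared(XS, YS):.3f}")
```

Because each left-out compound is predicted by a model that never saw it, the prediction errors are at least as large as the fit errors, which is why q² does not exceed r² here.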

After the cross-validated PLS analysis, you will determine the optimal number of components. Recall that the components are linear combinations of the variables (of which you have very many!), ordered in such a way that the first component will describe most of the variation in the activity, the second most of the remaining variation, etc. The cross-validated PLS analysis will be carried out for different numbers of components. As a rule of thumb, the number of components should not exceed one-third of the number of molecules. More components would lead to a model that is overtrained - it has a better fit to the training data but the predictive power is diminished.
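The rule of thumb is easy to check for this data set: with 31 molecules, one-third rounds down to 10 components. The sketch below also shows one common way of choosing the optimum, taking the fewest components that reach the highest q². That selection criterion and the q² values are assumptions made for illustration; they are not results of this tutorial.

```python
def max_components(n_molecules):
    """Rule of thumb: at most one-third of the number of molecules."""
    return n_molecules // 3


def optimal_components(q2_by_components):
    """Fewest components that reach the highest q2 (assumed criterion)."""
    best_q2 = max(q2_by_components.values())
    return min(n for n, q2 in q2_by_components.items() if q2 == best_q2)


# Training set size used in this tutorial.
print(max_components(31))

# Hypothetical q2 values per number of components, for illustration only.
example_q2 = {1: 0.31, 2: 0.52, 3: 0.58, 4: 0.58, 5: 0.55}
print(optimal_components(example_q2))
```

Preferring the smaller model when q² values tie keeps the regression equation as simple as the data allow, which guards against overtraining.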

1. Select columns 2 and 3 (LOGRBA and COMFA3).
2. Select (in the spreadsheet) QSAR >> Partial Least Squares....
3. Select the following values (note that the number of components should be 10):

Click on Do PLS. Examine the PLS results (in the text window!).