Predicting olfactory receptor neuron responses from odorant structure

Predicting olfactory receptor neuron responses from odorant structure Predicting olfactory receptor neuron responses from odorant structure

journal.chemistrycentral.com
from journal.chemistrycentral.com More from this publisher
10.07.2015 Views

Chemistry Central JournalResearch articlePredicting olfactory receptor neuron responses from odorantstructureMichael Schmuker* 1,4 , Marien de Bruyne 2,3 , Melanie Hähnel 2 andGisbert Schneider 1Open AccessAddress: 1 Johann Wolfgang Goethe Universität, Beilstein Endowed Chair for Cheminformatics, Institute of Organic Chemistry and ChemicalBiology, Siesmayerstr. 70, 60323 Frankfurt am Main, Germany, 2 Freie Universität Berlin, Institute of Biology-Neurobiology, Königin-Luise-Str. 28–30, 14195 Berlin, Germany, 3 Monash University, School of Biological Sciences, Wellington Road, Clayton VIC 3800, Australia and 4 (presentaddress) Freie Universität Berlin, Institute of Biology-Neurobiology, Königin-Luise-Str. 28–30, 14195 Berlin, GermanyEmail: Michael Schmuker* - schmuker@zedat.fu-berlin.de; Marien de Bruyne - Marien.DeBruijne@sci.monash.edu.au;Melanie Hähnel - haehnel@neurobiologie.fu-berlin.de; Gisbert Schneider - g.schneider@chemie.uni-frankfurt.de* Corresponding authorPublished: 4 May 2007Chemistry Central Journal 2007, 1:11 doi:10.1186/1752-153X-1-11Received: 15 February 2007Accepted: 4 May 2007This article is available from: http://journal.chemistrycentral.com/content/1/1/11© 2007 Schmuker et alThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.AbstractBackground: Olfactory receptors work at the interface between the chemical world of volatilemolecules and the perception of scent in the brain. Their main purpose is to translate chemicalspace into information that can be processed by neural circuits. Assuming that these receptors haveevolved to cope with this task, the analysis of their coding strategy promises to yield valuable insightin how to encode chemical information in an efficient way.Results: We mimicked olfactory coding by modeling responses of primary olfactory neurons tosmall molecules using a large set of physicochemical molecular descriptors and artificial neuralnetworks. We then tested these models by recording in vivo receptor neuron responses to a newset of odorants and successfully predicted the responses of five out of seven receptor neurons.Correlation coefficients ranged from 0.66 to 0.85, demonstrating the applicability of our approachfor the analysis of olfactory receptor activation data. The molecular descriptors that are best-suitedfor response prediction vary for different receptor neurons, implying that each receptor neurondetects a different aspect of chemical space. Finally, we demonstrate that receptor responsesthemselves can be used as descriptors in a predictive model of neuron activation.Conclusion: The chemical meaning of molecular descriptors helps understand structure-responserelationships for olfactory receptors and their "receptive fields". Moreover, it is possible to predictreceptor neuron activation from chemical structure using machine-learning techniques, althoughthis is still complicated by a lack of training data.Page 1 of 10(page number not for citation purposes)

Chemistry Central JournalResearch article<strong>Predicting</strong> <strong>olfactory</strong> <strong>receptor</strong> <strong>neuron</strong> <strong>responses</strong> <strong>from</strong> <strong>odorant</strong><strong>structure</strong>Michael Schmuker* 1,4 , Marien de Bruyne 2,3 , Melanie Hähnel 2 andGisbert Schneider 1Open AccessAddress: 1 Johann Wolfgang Goethe Universität, Beilstein Endowed Chair for Cheminformatics, Institute of Organic Chemistry and ChemicalBiology, Siesmayerstr. 70, 60323 Frankfurt am Main, Germany, 2 Freie Universität Berlin, Institute of Biology-Neurobiology, Königin-Luise-Str. 28–30, 14195 Berlin, Germany, 3 Monash University, School of Biological Sciences, Wellington Road, Clayton VIC 3800, Australia and 4 (presentaddress) Freie Universität Berlin, Institute of Biology-Neurobiology, Königin-Luise-Str. 28–30, 14195 Berlin, GermanyEmail: Michael Schmuker* - schmuker@zedat.fu-berlin.de; Marien de Bruyne - Marien.DeBruijne@sci.monash.edu.au;Melanie Hähnel - haehnel@neurobiologie.fu-berlin.de; Gisbert Schneider - g.schneider@chemie.uni-frankfurt.de* Corresponding authorPublished: 4 May 2007Chemistry Central Journal 2007, 1:11 doi:10.1186/1752-153X-1-11Received: 15 February 2007Accepted: 4 May 2007This article is available <strong>from</strong>: http://journal.chemistrycentral.com/content/1/1/11© 2007 Schmuker et alThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.AbstractBackground: Olfactory <strong>receptor</strong>s work at the interface between the chemical world of volatilemolecules and the perception of scent in the brain. Their main purpose is to translate chemicalspace into information that can be processed by neural circuits. Assuming that these <strong>receptor</strong>s haveevolved to cope with this task, the analysis of their coding strategy promises to yield valuable insightin how to encode chemical information in an efficient way.Results: We mimicked <strong>olfactory</strong> coding by modeling <strong>responses</strong> of primary <strong>olfactory</strong> <strong>neuron</strong>s tosmall molecules using a large set of physicochemical molecular descriptors and artificial neuralnetworks. We then tested these models by recording in vivo <strong>receptor</strong> <strong>neuron</strong> <strong>responses</strong> to a newset of <strong>odorant</strong>s and successfully predicted the <strong>responses</strong> of five out of seven <strong>receptor</strong> <strong>neuron</strong>s.Correlation coefficients ranged <strong>from</strong> 0.66 to 0.85, demonstrating the applicability of our approachfor the analysis of <strong>olfactory</strong> <strong>receptor</strong> activation data. The molecular descriptors that are best-suitedfor response prediction vary for different <strong>receptor</strong> <strong>neuron</strong>s, implying that each <strong>receptor</strong> <strong>neuron</strong>detects a different aspect of chemical space. Finally, we demonstrate that <strong>receptor</strong> <strong>responses</strong>themselves can be used as descriptors in a predictive model of <strong>neuron</strong> activation.Conclusion: The chemical meaning of molecular descriptors helps understand <strong>structure</strong>-responserelationships for <strong>olfactory</strong> <strong>receptor</strong>s and their "receptive fields". Moreover, it is possible to predict<strong>receptor</strong> <strong>neuron</strong> activation <strong>from</strong> chemical <strong>structure</strong> using machine-learning techniques, althoughthis is still complicated by a lack of training data.Page 1 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11upper and lower threshold were excluded <strong>from</strong> the analysisfor the respective ORN.We assessed prediction performance using the MatthewsCorrelation Coefficient for binary data (MCC, eq. 4).Table 1 shows the MCC for the training data and thetested data. We excluded ethyl-3-hydroxybutyrate at ab2Band butyl acetate at ab3A <strong>from</strong> the calculation of the MCCof the test set, since these molecules were used to select thebest models (see Experimental section for details). Thesecompounds have entered the modeling process prior totesting and hence are not valid "test" compounds forthose ORNs.Five out of seven models succeeded in correctly predictingthe training data. The predictions for the ab3B and 6A<strong>neuron</strong> show imperfect performance, but still correlatewith the activity in the training data. The prediction ofORN response to novel molecules shows a mixed picture:For the ab3A ORN, the model achieved an MCC of 0.85,providing reliable prediction. For the ab1D, 2A, 5B and6A ORNs, the MCCs range <strong>from</strong> 0.66 to 0.69, still indicatinggood performance. In contrast, the models showedonly weak performance predicting activity for the ab2B(MCC = 0.17) and 3B ORNs (MCC = 0.34).The discrepancy between performance on the trainingdata and the test data for some <strong>receptor</strong>s may have severalcauses. First, although we used cross-validated trainingand in some cases additional activity data for model selection,due to the large number of models we built, it is possiblethat some models perfectly predict all training data,albeit by chance. Second, descriptor selection was performedon the whole data set instead of a cross-validatedprocedure, possibly "over-optimizing" descriptor spacefor the training data. However, because of the data splittingnecessary for cross-validation, the number of datainstances in one part of the data would have been toosmall for the statistical test we used to select descriptors.In both cases, the performance on the independent test setreveals the actual quality of prediction. This set containedonly substances that did not enter the model creation atany point and is thus not affected by the above issues.Table 1: Matthew's Correlation Coefficient (MCC) for trainingand test. The upper row refers to the performance on thetraining data, the lower row to performance on the test dataORN ab1D ab2A ab2B ab3A ab3B ab5B ab6AMCC training 1.00 1.00 1.00 1.00 0.77 1.00 0.86test 0.69 0.69 0.17 0.85 0.34 0.68 0.66The supplement gives detailed insight into the compoundswe used for testing and the results of the screening,in comparison with the predictions [see Additionalfile 4]. It should be noted that one compound (cyclohexanone)was inactive at ab3A in the training data (3 spikes/s), but active in the test data (33 spikes/s). A similar observationwas made for 4-methylphenol at the ab1D <strong>neuron</strong>:its activity was uncertain in the training data (22 spikes/s),but it was inactive in test data (5 spikes/s). These differencesmay be a consequence of the effect that minimalvariations in concentration may suffice to elicit a response[18].A possible source of error in the predictions is that it is notalways certain that the compound actually arriving at the<strong>receptor</strong> <strong>neuron</strong> did not undergo degradation, or thattraces of other compounds contaminated the stimulus, forexample as by-products <strong>from</strong> synthesis or as remnantsafter purification. These effects cannot be addressed bythis study, but would require analysis of the air stream inparallel to the measurements, for example by gas chromatography[19,20].One point of discussion is the threshold setting for activityassignment, in that it followed no algorithmic procedure.However, these thresholds proved to be sensible choices,and appeared reasonable to us according to the data. Firstof all, the application of thresholds was necessary to simplifythe data. As in any modeling study, simplificationshave to be introduced in order to focus on the most relevantfeatures, especially when the amount of data is limited.In this case, we chose to discard the quantitativeactivity data in favor of a binary active/inactive prediction.Although our threshold settings may have enhanced theaforementioned difference in activity assignment, thesewere more likely due to changes in the experimentalsetup, or variance in the Drosophila stock between themeasurements of the training and test sets. Further, themodels do not take into account the different vapor pressuresof the compounds or effects of dose dependency ofthe <strong>responses</strong>, because the required data was not availablefor all compounds. We also did not address any possibleeffects of modifiers of OR activity such as Olfactory BindingProteins (OBPs). These proteins populate the aqueouslymph surrounding <strong>olfactory</strong> dendrites and have beenshown to be involved in olfaction. Drosophila mutantsdevoid of the LUSH OBP have defects in avoiding highalcohol concentrations [21] and lack response to a pheromone[22]. It has also been suggested that OBPs areinvolved in shuttling hydrophobic <strong>odorant</strong>s through thelymph [23]. The model, being trained on the activationdata of ORNs in their "native surround" (i.e. the lymph),implicitely treats everything between the <strong>odorant</strong> andORN activation as a "black box" and hence also containseffects of OBPs, if present.Page 3 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11Interpretation of descriptor selectionAs stated above, we selected subsets of descriptors that arebest suited for separating active <strong>from</strong> inactive compoundsprior to ANN training. In addition to reducing the "noise"introduced into the data by unsuitable descriptors, theranked list of descriptors can also give insight into the SARof the ORNs. Since each descriptor represents a molecularfeature, descriptors in the selected subset point to potentiallypreferred molecular features detected by an ORN.The sum of preferred features determines an ORN's"receptive field".The descriptor rankings were produces using the p-value<strong>from</strong> a Kolmogorov-Smirnov test (KS-test) for significantdifference between two data sets (inactive vs. active compounds),separately for each ORN. Descriptors with thelowest p-values were ranked highest. The ranked lists ofdescriptors including their associated p-values are given inthe supplement [see Additional file 5].We observed that the set of highest ranking descriptors isdifferent for each ORN. This may correspond to a differentSAR for each ORN, in that different chemotypes are recognizedby different <strong>receptor</strong>s. In the following, we describehow the descriptor rankings relate to the SARs of theORNs in this study. For the sake of brevity, we refer toindividual descriptors by their abbreviations. More elaborateexplanations of all descriptors that appear here and inthe ranked lists are provided in the supplement [see Additionalfile 6].ab1DFor ab1D, the highest ranked descriptor is std_dim3, a 3Dshape descriptor that describes the standard deviationalong the principal component axis of the atom coordinates.Typical activators of ab1D (methyl salicylate, acetophenone,phenylacetaldehyde) have disk-like shape incommon due to their aromatic ring systems (see Figure 1).Hence, they will have small values for this descriptor, discriminatingthem <strong>from</strong> the other molecules in the data set.This descriptor does not feature strongly in the rankings ofother ORNs that respond to aliphatic compounds. Furthermore,the high ranking of several descriptors forcharge distribution on the molecular surface (such asPEOE_VSA_FPNEG, Q_VSA_FNEG, FCASA-) reflect theexposed carbonyl groups in most activators of ab1D, creatinga focused negative partial charge distribution on themolecular surface (cf. Figure 1). Charge distributiondescriptors feature high on the list of several ORNs.ab2AA strong effect of partial charge can also be observed forthe activators of the ab2A ORN (ethyl acetate, 2,3-butanedione,propanone, ethyl propionate), which are all comparablysmall and bear a focused negative partial chargeon the molecular surface (cf. Figure 2). The focused chargeis again represented in the highest scoringPEOE_VSA_FPNEG descriptor. The high rank of a_ICMcan be related to the small molecule size. It describes themean atom information content, which reflects theentropy, used by its information-theoretical meaning, inatom composition. For two equal-sized molecules, theone which is composed of more different atom types willhave the higher entropy. Accordingly, for two moleculeswith the same number of different atom types, the smallerone will have higher entropy. Now the high scoring moleculesincorporate only two atom types, namely O and C,as well as the majority of the remaining molecules in thedata set. Thus, the smaller molecule size likely is the discriminatingfeature. Several connectivity descriptors(chi1v_C, chi1_C etc.) also reflect the importance of moleculesize.ab2B and ab3AThe AM1_HOMO descriptor, which is an index for "reactivity",yields a high rank for the ab3A <strong>neuron</strong>. Moreover,the MNDO_HF descriptor (heat of formation) correlateswell with ab3A spike rate change (Pearson correlationcoefficient: -0.55, p < 10 -4 ). Also, the ionization potential(reflected in the AM1_IP, PM3_IP and MNDO_IP descriptors)yields a high rank. All these descriptors relate to thereactivity of a molecule and are negatively correlated withactivity. This seems evident if one considers that most activatorsof ab3A are esters, which are less reactive than forexample aldehydes and primary alcohols, two groups towhich many of the non-activators belong.Similar observations can be made for ab2B, where four ofthe five activators of the ab2B ORN (ethyl butanoate, hexanol,γ-valerolactone, ethyl-2-methylbutanoate) have aslightly elevated ionization potential according to theAM1_IP descriptor, compared to non-activators (e.g. 3-methylthio-1-propanol, benzaldehyde or linalool), aswell as a high ranking of the AM1_HOMO descriptor.ab5BFor the ab5B ORN, the highest ranked descriptors arerelated to molecular shape, expressed by the descriptorsdeveloped by Kier & Hall descriptors [24] (KierA3, KierA1,KierA2, KierFlex, Kier2, Kier3). In combination with thehigh ranked b_1rotR descriptor (the relative number ofrotatable bonds in the molecule), this reflects ab5B's preferencefor larger, flexible ligands, such as pentyl acetate, 2-heptanone and 3-octanol.ab6AFinally, for the ab6A ORN the κ3 and κ2 descriptorsdescribed by Kier & Hall [24] rank highest (Kier3 andKier2). κ2 encodes information about the "spatial densityof atoms" in a molecular graph, while κ3 encodes thePage 4 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11overtraining. This may be a possible explanation for therather poor predictive performance of the ab6A model.The ab6A ORN shows a somewhat broader selectivitycharacteristic: activators are not as easy to discriminate<strong>from</strong> non-activators as for the other ORNs, and ourmethod of assigning binary activity values may not havebeen appropriate in this case. Here it is important to notethat ab6A is the only ORN in this study for which the<strong>receptor</strong> gene could not yet be identified [3].General remarksAll these interpretations should be treated with care. It isnot justified to interpret an individual descriptor as thesole discriminating feature. Rather, the KS-statistics demonstratethat many features are suitable for classification.Descriptor selection is the result of a statistical procedure,and depends on the composition of the data set. Theresults we present in this section should be read as anexample of how to extract knowledge <strong>from</strong> such an analysis.Moreover, the ANN models combine the informationobtained <strong>from</strong> the selected features to represent amore complex and nonlinear (except for Perceptron-typeANNs) relationship between molecular <strong>structure</strong> andactivity than is suggested by the inherently linear descriptorranking.Disk-like shapes of ab1D activatorsFigure 1Disk-like shapes of ab1D activators. Three activators ofab1D, methyl salicylate (187 spikes/s), phenylacetaldehyde(76 spikes/s) and acetophenone (157 spikes/s) show theirdisk-like shape in surface representation. Red areas indicatenegative partial charge, blue areas positive partial charge, andwhite indicates neutral (= no) charge."centrality of branching"; κ3 values are larger whenbranching is located at the extremities of the moleculargraph or when no branching happens in the molecule,and they are smaller when branching is located near thecenter of the molecule [25]. Interestingly, the single ANNmodel that was selected for prediction of ab6A activityonly used these two descriptors. Considering that thedescriptor values of activators all lie inside a very smallrange in which no non-activators are present (data notshown), and the fact that the selected ANN model has twohidden <strong>neuron</strong>s, the network simply "cut out" the valuerange in which the activators of ab6A lie, a typical effect ofWith these notes of caution, one might speculate thatbinding of odor molecules is achieved through different<strong>receptor</strong>-ligand interaction mechanisms at each OR. Forexample, our study suggests that ab2A is activated at leastin part by the polarity of small ligands, whereas ab5Bappears to require the flexibility of large ligands. While inthe past the classification of chemical stimuli was basedon selected functional groups or chemical class, the use ofphysicochemical descriptors provides a different view onthe molecular features that govern ORN activation.A systematic analysis of ORN selectivity was complicatedby the limited amount of ORN response data. Onlyrecently, more comprehensive data on Drosophila ORN<strong>responses</strong> became available [26]. Although the data wasacquired using a different methodology (heterologousexpression of OR genes in an "empty" ORN), it is possiblethat more data on these ORs will yield better results. Thismay be a fruitful task for a future study. It will be interestingto see if the abstract description of chemical entities aswe used here can aid to reveal a logical <strong>structure</strong> in theselectivity of ORNs.Using ORN <strong>responses</strong> to predict ORN <strong>responses</strong>If ORN <strong>responses</strong> really span some sort of chemical space,it should as well be possible to use the spike rates as adescriptor. To assess this hypothesis, we tried to predictactivity of one ORN using <strong>responses</strong> of the remainingORNs. We used the logarithm of the spike rates, becausePage 5 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11Table 2: Matthew's Correlation Coefficient (MCC) for trainingand screening (test) using ORN <strong>responses</strong> as descriptor. Theupper row refers to the performance on the training data, thelower row to performance on the test dataORN ab1D ab2A ab2B ab3A ab3B ab5B ab6AMCC training 0.55 0.0 0.76 1.00 0.77 0.80 0.72test 0.0 0.0 -0.10 0.47 0.66 0.54 0.54ligands. Finally, the ORN <strong>responses</strong> themselves can effectivelybe used as a descriptor to predict <strong>responses</strong> of otherORNs, providing evidence that ORNs indeed analyzechemical space in a way that can be exploited to predict<strong>receptor</strong>-ligand affinities.Charge distribution of ab2A activatorsFigure 2Charge distribution of ab2A activators. Conolly-surfacerepresentation for activators of the ab2A ORN. Colorscheme is identical to Figure 1.principle component analysis showed that this transformationresults in a more uniform distribution with lessoutliers (data not shown). ANN training and model selectionfollowed the same protocol as above, except that only150 pairs of test and training data were formed and, noadditional validation data was available to prune networksthat showed poor generalization. Since only sixdescriptors were available to train the ANNs, we did notapply KS-statistics for data reduction.The results are given as correlation coefficients in Table 2.ORNs ab3A, ab3B, ab5B, and ab6A show moderate correlation(MCC between 0.47 to 0.66) on the test set, but predictioncompletely failed for ab1D, ab2A (MCC = 0,respectively) and ab2B (MCC = -0.10). This indicates thatthis approach indeed works, at least for four out of seven<strong>receptor</strong>s. The failure at the remaining three likely results<strong>from</strong> the fact that for these <strong>receptor</strong>s there are too fewactives in the test set, namely one for each ab1D and ab2A(salicylaldehyde and propyl acetate resp.), and three forab2B (octanol, ethyl 3-hydroxybutanoate and 2-octanone).ConclusionWe have demonstrated that it is possible to predict DrosophilaORN <strong>responses</strong> <strong>from</strong> molecular <strong>structure</strong>. Theapproach performed well on the majority of <strong>receptor</strong>s,considering that only few data was available for training.The features that were selected as being suitable for modeltraining indicate that each ORN has different preferencesregarding the physicochemical properties of its potentialExperimentalWe prepared a database containing the molecular <strong>structure</strong>sof the <strong>odorant</strong>s previously screened [18] and theiractivity (in spikes/s) on the <strong>neuron</strong>s of the classes ab1D,ab2A, ab2B, ab3A, ab3B, ab5B and ab6A. We chose theseORNs because interpretation of the response spectrumwas not complicated by high <strong>responses</strong> to the solvent, andat least four molecules were active for these ORNs. Thisyielded a minimum ratio of active to inactive molecules ofroughly 1 to 10, and allowed splitting of the data into atraining and a validation set of the same size, and at leasttwo instances of active molecules in each set (cf. "NeuralNetwork Training").Definition of activity rangesWe transformed the continuous range of activity levels[see Additional file 2] into all-or-none data by setting alower and an upper threshold for each ORN. Moleculeswith activities below the lower threshold were consideredinactive, while those with activities above the upperthreshold were considered active. Odorants with an activityvalue between the two thresholds were excluded <strong>from</strong>the modeling process, because their activity cannot bedetermined with a high level of confidence. The doseresponsecurve of ORNs is a sigmoid, and small differencesin odor delivery can in result in changes in the concentrationsproducing inconsistencies between thepreviously published results [18] and the recordings inthis study, particularly for these "borderline" odors.To determine the two thresholds we used the followingprocedure: Starting <strong>from</strong> activity histograms for eachORN, we estimated a lower threshold below which a moleculeis considered inactive. Assuming that the activities ofinactive compounds would be distributed around zerospikes/s (but without knowing the true distribution), weestimated the lower threshold to be where the first "gap"in the activity histogram distribution was located. Similarly,we estimated the upper threshold above which wePage 6 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11considered molecules as being active. This procedure isillustrated in the supplement [see Additional file 7].In one case, additional data [3] indicated that ethyl acetate,considered inactive at ab5B according to the threshold,may actually be a weak activator for the ab5B <strong>neuron</strong>.In consequence, we marked its activity as "unknown". Thefinal data collection including the thresholds for activityassignment can be reviewed in the supplement [see Additionalfile 2].Descriptor calculation, selection and rankingWe calculated 203 molecular descriptors using MOE(Chemical Computing Group, Montreal) for each <strong>odorant</strong>molecule, including calculated physical properties,subdivided surface areas, atom and bond counts, Kier &Hall connectivity and shape indices, adjacency and distancematrices, pharmacophore features, partial chargeindices, potential energies, surface area, volume andshape indices and conformation dependent charge indices.Prior to descriptor calculation, we generated heuristic 3Dconformations with CORINA (Molecular Networks,Erlangen, Germany). At this stage, we used one conformationper molecule. Subsequently, those conformationswere refined by energy minimization using MOE'sMMFF94x force field, a modified version of the MMFF94sforce field [27]. Minimization was stopped at a gradient of10 -5.Pruning unsuitable descriptorsNine descriptors were discarded because they had zerovariance across odor molecules. Some descriptors (e.g. thedipole moment) depend on the three-dimensional conformationof the molecule, which could lead to inconsistentmodeling results for different conformations. Becausewe do not know which conformation of an <strong>odorant</strong> stimulatesthe ORN we sought to eliminate descriptors thatvary strongly with 3D conformation.To identify such strongly varying descriptors, we generatedmultiple conformers of all <strong>odorant</strong>s using MOE's stochasticconformer generation functionality, using anenergy cutoff of 5 kcal/mol. This resulted in a median nineconformers per molecule, with a maximum of 956 conformersfor nonanal. For each descriptor the variance overall conformers of an <strong>odorant</strong> was calculated and scaledusing the Fano Factor [28], F D = σ D, with σ D the varianceand μ D the mean of descriptor D over all conforma-μ Dtions, without prior normalization. We calculated themean F D of each descriptor over all molecules andranked the descriptors accordingly. Data <strong>from</strong> preliminaryexperiments (not shown) suggested a set of descriptorsthat particularly affected prediction quality throughconformational variation. We selected the one with thesmallest Fano Factor, which was the "dipole" descriptorwith F D = 0.03, and eliminated all 26 descriptors with amean F D ≥ 0.03.Descriptor selectionDescriptors were ranked by their ability to separate active<strong>from</strong> inactive molecules. This ability was assessed usingthe Kolmogorov-Smirnov (KS) test [29]. The KS-test comparesthe distribution of two series of data samples A andB by comparing, for each potential value x, the fraction ofvalues <strong>from</strong> A less than x with the fraction of B values lessthan x. The KS-value (k KS ) is the maximum difference overall x values. For each ORN, the descriptor values of allactive <strong>odorant</strong>s provided A, while B was provided by theinactive <strong>odorant</strong>s.The KS-test was performed using MATLAB R14 (The Math-Works, Natick, MA). For the ranking we used the p-valueof the KS-test, that is, the probability that A and B stem<strong>from</strong> the same distribution. High KS-values result in lowp-values. Descriptors with low p-values were ranked highest.Note that the ranking is specific and unique for eachORN. This is because for each ORN, different moleculesconstitute the active and inactive population, and indconsequence the descriptor values for active and inactivemolecules are differently distributed.Artificial neural network trainingWe trained multilayer feed-forward Artificial Neural Networks(ANNs) to predict the activity of <strong>odorant</strong> molecules.Such networks have been described in detailelsewhere [30,31]. Briefly, a network with k inputs, j <strong>neuron</strong>sin the hidden layer, and i output <strong>neuron</strong>s delivers theμoutput O i in response to a pattern μ according to equation(1):⎛μ⎛μ ⎞ ⎞Oi = g⎜bi + Wij⋅ g bj + wjk ⋅ξ⎜k⎟⎜⎟⎝ j ⎝ k ⎠⎟⎠∑ ∑ , (1)with g(x) the transfer function of the output and hiddenlayer <strong>neuron</strong>s respectively (see eq. (2)), b i , b j the bias of the<strong>neuron</strong>s, W ij the weight of the jth hidden <strong>neuron</strong> to the ithoutput <strong>neuron</strong>, w jk the weight of kth input <strong>neuron</strong> to thejth hidden <strong>neuron</strong>, and ξ μ k the kth element of input pat-Page 7 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11tern μ. We used a sigmoidal transfer function (equation(2)):1gx () = ,+−1e xwhere x is the net input of a <strong>neuron</strong>.(2)The MATLAB Neural Network Toolbox was used for ANNmodeling, employing backpropagation training with agradient descent algorithm as implemented in MATLAB'straingdx function [30].Descriptor values were scaled to zero mean and unitstandard deviation (autoscaling) prior to network training.We assigned a target value of 1 to active moleculesand 0 to inactive molecules. By random permutation andsubsequent splitting we formed 250 pairs of test and validationdata, keeping the fraction of active to inactive moleculesidentical in both sets.Network performance during training was assessed usingthe mean standard error (MSE, equation (3))S1 2MSE O expect O predict = ∑ O expect −OpredictSi=1( , ) ( ) ,(3)where O predict was the output of the network and O expectwas given by the target values.The MSE on the training data served as fitness functionduring training. ANN training was stopped when the MSEon the validation data did not decrease for 5,000 trainingepochs.Model performance evaluationTwo factors greatly influence the outcome of ANN training:The ANN architecture (how many <strong>neuron</strong>s to use inthe hidden layer) and the number of inputs (moleculardescriptors). More <strong>neuron</strong>s in the hidden layer or a highernumber of inputs to the ANN may allow for more complexdescription of the data, but the resulting model isalso susceptible to overfitting, that is, modeling finedetails without revealing the global data <strong>structure</strong>.Because these parameters are difficult to estimate inadvance, we trained many networks with different combinationsof parameters, varying the number of <strong>neuron</strong>s inthe hidden layer <strong>from</strong> one to four. In the special case ofone hidden <strong>neuron</strong>, the ANN was reduced to a single <strong>neuron</strong>,which essentially is a Perceptron architecture [30]. Tovary the number of descriptors, we cumulatively used thefirst 1, 2,...30 descriptors <strong>from</strong> the ranked list, meaning weused the first descriptor, then the first two and so on untilwe used all 30 highest-ranked descriptors.In total, we trained 30,000 ANN models per ORN (4architectures × 30 input dimensionalities × 250 repetitionswith different data splitting). We proceeded withselection of models with high predictive quality in crossvalidation.We used the Matthews Correlation CoefficientMCC [32] to assess prediction quality (eq. (4)):MCC =P⋅ N + O⋅U( N + U) ⋅ ( N+ O) ⋅ ( P + U) ⋅ ( P + O) , (4)where P is the number of "true positives", that is, datainstances that are active and have also been predictedactive. N ("true negatives") is the number of datainstances that are inactive and have been predicted inactive.O denotes the number of "overpredicted" instances,predicted active in spite of being inactive, and U is thenumber of "underpredicted" instances, that is, activeinstances predicted inactive. During each training run, werecorded the MCC on the training data as well as on thevalidation data for this run.Model selectionA well-trained, well-generalizing model will have a highMCC both on the training and validation data. Hence, weselected ANNs with a training MCC equal or greater thantheir validation MCC, differing by no more than 0.1.From all ANNs fulfilling these criteria, we selected thoseTable 3: Additional validation compounds. Additional <strong>odorant</strong> activity data we used for model selection. Sources: a: [33], b: [3]ORN Odorant Name Source Remarksab1D furfural a -ab2B cyclohexanol a -(R)-ethyl-3-hydroxybutyrate b unknown stereoisomer, not tested in previous study [18]ab3A butyl acetate b -ethyl acetate b unsure activity in [18]1-hexanol b unsure activity in [18]ab3B pentyl Acetate b unsure activity in [18]E2-hexenal b unsure activity in [18]ab5B ethyl acetate b considered inactive in [18]Page 8 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11with the maximum training MCC. If the selection resultedin more than one ANN, we used all selected ANNs andcombined their prediction values by averaging.For some ORNs, additional <strong>odorant</strong> activity data wasavailable <strong>from</strong> other sources [3,33], providing an additionalselection constraint on the models (see Table 3).Models failing to correctly predict the additional activitydata were discarded. Of the additional compounds, Ethyl-3-hydroxybutyrate, a strong activator for ab3A accordingto [3], was not tested in [18], making it suitable as anadditional validation point. Ethyl acetate was weaklyactive in [3] at the ab5B ORN but inactive in the originaldata. Assuming that it truly is an activator of ab5B, weexcluded it <strong>from</strong> network training and used it to validatethe ANN predictions. The remaining compounds in Table3 were originally excluded <strong>from</strong> training because theiractivity fell in between the upper and lower activitythreshold and thus could not be derived with certainty.Since the additional sources suggest they are active, weused them as validation compounds for model selection.ElectrophysiologyWe used the models to predict activity for a new set of<strong>odorant</strong>s and tested the predictions in a new set of measurements<strong>from</strong> Drosophila ORNs. Electrical activity wasrecorded extracellularly by inserting glass electrodes intoindividual sensilla on the antenna of Drosophila melanogastermales as previously described [18,34]. Each sensillumhouses several ORNs, either 4 (ab1 sensilla) or 2(ab2, 3, 4, 5 and 6 sensilla). Neuronal excitation wasmeasured as counts of spikes (action potentials) producedduring a 500 ms stimulation period. Spike rates for each<strong>odorant</strong> were averaged <strong>from</strong> at least 9 (ab1 and ab2 sensilla),7 (ab3 sensillum) or 3 individuals (ab5 and ab6sensilla). It has previously been shown that spikes producedby the <strong>neuron</strong>s in each of these sensilla can be reliablyseparated based on amplitude and shape differences[18,33,35]. The models were based on data generatedwith Tungsten electrodes but tested using saline filledglass electrodes. Both are standard methods that havebeen shown to produce similar results [34]. The fly waspermanently bathed in a 194 cm/s air-stream. Most <strong>odorant</strong>swere dissolved at 1% v/v in paraffin oil and air <strong>from</strong>a 5 ml syringe, containing 10 μl on a small piece of filterpaper, was injected with a ca. 9-fold dilution factor [18].Three <strong>odorant</strong>s were tested at a 100 times lower concentration[see Additional file 4] because they were extremelypotent activators for some ORNs.OdorantsOdorants were obtained <strong>from</strong> Sigma, Aldrich or Fluka, ofpurity > 99% or highest available, except for octanal(98%), salicylaldehyde (98%), ethyl 3-hydroxybutanoate(98%) and 2-octanone (98%). Except for (S)-(+)-carvone,all chiral <strong>odorant</strong>s were applied as racemic mixtures.Additional materialAdditional File 1trainingCompounds. Odorant molecules for which ORN <strong>responses</strong> wereobtained in [18]. Compound names are given in [Additional file 2].Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S1.pdf]Additional File 2trainingResponses. Activity values (in spikes/s) and per-ORN thresholds.Spike rates of "active" <strong>odorant</strong>s are set in bold in the respective column.Compounds in brackets have uncertain activity (i.e. spike rates betweenthe upper and lower threshold).Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S2.pdf]Additional File 3testCompounds. These <strong>odorant</strong>s were screened to check prediction quality.Compound names are given in [Additional file 4].Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S3.pdf]Additional File 4testResponses. Measured response in test (t, in spikes/s) and predictedactivation (p, 1 = predicted active, 0 = predicted inactive). The upper tencompounds were also part of the training set. Underlined ORN <strong>responses</strong>are considered active, those in plain font inactive according to the thresholdsin ORN response [see Additional file 2]. Comparisons between predictionsand measurements are marked up according to the following: --means true negative, ++ true positive, o false positive (overpredicted) andu false negative (underpredicted). Responses to ethyl-3-Hydroxybutyrateat ab2B and butyl acetate at ab3A have not been taken into account forthe MCC calculation.Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S4.pdf]Additional File 5pValues. The ranked list of descriptors for each ORN and the associatedp-values.Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S5.pdf]Additional File 6descriptorMeaning. The descriptors that we used in this study and howthey are derived <strong>from</strong> chemical <strong>structure</strong>.Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S6.pdf]Page 9 of 10(page number not for citation purposes)


Chemistry Central Journal 2007, 1:11http://journal.chemistrycentral.com/content/1/1/11Additional File 7threshold. Binarizing activity using thresholds, here with respect to theab3A ORN class. On the left, histogram representation of the activities,using 40 bins on the activity range. On the right, <strong>odorant</strong>s arranged byactivity (in spikes/s), with the highest activities most right. The two linesindicate the thresholds we determined. Odorants with activities below thelower threshold are considered "inactive" (blue circles), those above theupper threshold are considered "active" (red circles). The remaining <strong>odorant</strong>sare excluded <strong>from</strong> the analysis (black circles).Click here for file[http://www.biomedcentral.com/content/supplementary/1752-153X-1-11-S7.pdf]AcknowledgementsThis work has been supported by the Beilstein-Institut zur Förderung derChemischen Wissenschaften.References1. Buck L, Axel R: A novel multigene family may encode <strong>odorant</strong><strong>receptor</strong>s: a molecular basis for odor recognition. Cell 1991,65:175-187.2. Vosshall LB, Wong AM, Axel R: An <strong>olfactory</strong> sensory map in thefly brain. Cell 2000, 102(2):147-159.3. Hallem EA, Ho MG, Carlson JR: The molecular basis of odor codingin the Drosophila antenna. Cell 2004, 117(7):965-979.4. Vaidehi N, Floriano WB, Trabanino R, Hall SE, Freddolino P, Choi EJ,Zamanakos G, Goddard WA: Prediction of <strong>structure</strong> and functionof G protein-coupled <strong>receptor</strong>s. Proc Natl Acad Sci USA2002, 99(20):12622-12627.5. Floriano WB, Vaidehi N, Goddard WA: Making sense of olfactionthrough predictions of the 3-D <strong>structure</strong> and function of<strong>olfactory</strong> <strong>receptor</strong>s. Chem Senses 2004, 29(4):269-290.6. Hall SE, Floriano WB, Vaidehi N, Goddard WA: Predicted 3-D<strong>structure</strong>s for mouse I7 and rat I7 <strong>olfactory</strong> <strong>receptor</strong>s andcomparison of predicted odor recognition profiles withexperiment. Chem Senses 2004, 29(7):595-616.7. Becker OM, Shacham S, Marantz Y, Noiman S: Modeling the 3D<strong>structure</strong> of GPCRs: advances and application to drug discovery.Current Opinion in Drug Discovery and Development 2003,6:353-361.8. Kairys V, Fernandes MX, Gilson MK: Screening drug-like compoundsby docking to homology models: a systematic study.Journal of Chemical Information and Modeling 2006, 46:365-379.9. Araneda RC, Kini AD, Firestein S: The molecular receptive rangeof an <strong>odorant</strong> <strong>receptor</strong>. Nat Neurosci 2000, 3(12):1248-1255.10. Manallack DT, Ellis DD, Livingstone DJ: Analysis of linear and nonlinearQSAR data using neural networks. Journal of MedicinalChemistry 1994, 37:3758-3767.11. Schneider G, Wrede P: Artificial neural networks for computer-basedmolecular design. Prog Biophys Mol Biol 1998,70(3):175-222.12. Winkler DA, Burden FR: Application of neural networks tolarge dataset QSAR, virtual screening, and library design.Methods in Molecular Biology 2002, 201:325-367.13. Tsantili-Kakoulidou A, Kier LB: A quantitative <strong>structure</strong>-activityrelationship (QSAR) study of alkylpyrazine odor modalities.Pharm Res 1992, 9(10):1321-1323.14. de Mello Castanho Amboni RD, da Silva Junkes B, Yunes RA, HeinzenVE: Quantitative <strong>structure</strong>-odor relationships of aliphaticesters using topological indices. J Agric Food Chem 2000,48(8):3517-3521.15. Wailzer B, Klocker J, Buchbauer G, Ecker G, Wolschann P: Predictionof the aroma quality and the threshold values of somepyrazines using artificial neural networks. J Med Chem 2001,44(17):2805-2813.16. Lavine BK, Davidson CE, Breneman C, Katt W: Electronic van derWaals surface property descriptors and genetic algorithmsfor developing <strong>structure</strong>-activity correlations in <strong>olfactory</strong>databases. J Chem Inf Comput Sci 2003, 43(6):1890-1905.17. Sell CS: On the unpredictability of odor. Angew Chem Int Ed Engl2006, 45(38):6254-6261.18. de Bruyne M, Foster K, Carlson JR: Odor coding in the Drosophilaantenna. Neuron 2001, 30(2):537-552.19. Vetter RS, Sage AE, Justus KA, Cardé RT, Galizia CG: Temporalintegrity of an airborne odor stimulus is greatly affected byphysical aspects of the odor delivery system. Chem Senses2006, 31(4):359-369.20. Lin DY, Zhang S, Block E, Katz LC: Encoding social signals in themouse main <strong>olfactory</strong> bulb. Nature 2005, 434:470-477.21. Kim MS, Repp A, Smith DP: LUSH <strong>odorant</strong>-binding proteinmediates chemosensory <strong>responses</strong> to alcohols in Drosophilamelanogaster. Genetics 1998, 150(2):711-721.22. Xu P, Atkinson R, Jones DNM, Smith DP: Drosophila OBP LUSHis required for activity of pheromone-sensitive <strong>neuron</strong>s. Neuron2005, 45(2):193-200.23. Kaissling KE: Olfactory peri<strong>receptor</strong> and <strong>receptor</strong> events inmoths: a kinetic model. Chem Senses 2001, 26(2):125-150.24. Hall LH, Kier LB: Reviews in Computational Chemistry Volume 2. Indianapolis:Purdue University at Indianapolis, chap. The Molecular ConnectivityChi Indexes and Kappa Shape Indexes in Structure-PropertyModeling; 1991:367-422.25. Todeschini R, Consonni V: Handbook of molecular descriptors Weinheim:Wiley-VCH; 2000.26. Hallem EA, Carlson JR: Coding of odors by a <strong>receptor</strong> repertoire.Cell 2006, 125:143-160.27. Halgren TA: MMFF VI. MMFF94s option for energy minimizationstudies. Journal of Computational Chemistry 1999, 20:720-729.28. Fano U: Ionization yield of radiations. II. The fluctuations ofthe number of ions. Physical Reviews 1947, 72:26-29.29. Manoukian EB: Mathematical Nonparametric Statistics New York: Gordonand Breach Science Publishers; 1986.30. Hertz J, Palmer RG, Krogh AS: Introduction to the theory of neural computationBoulder, Colorado: Westview Press; 1991.31. Zupan J, Gasteiger J: Neural Networks in Chemistry and Drug DesignWeinheim: Wiley-VCH; 1999.32. Matthews BW: Comparison of the predicted and observed secondary<strong>structure</strong> of T4 phage lysozyme. Biochimica BiophysicaActa 1975, 405:442-451.33. Stensmyr MC, Giordano E, Balloi A, Angioy AM, Hansson BS: Novelnatural ligands for Drosophila <strong>olfactory</strong> <strong>receptor</strong> <strong>neuron</strong>es.J Exp Biol 2003, 206(Pt 4):715-724.34. Dobritsa AA, van der Goes van Naters W, Warr CG, Steinbrecht RA,Carlson JR: Integrating the molecular and cellular basis ofodor coding in the Drosophila antenna. Neuron 2003,37(5):827-841.35. Clyne P, Grant A, O'Connell R, Carlson JR: Odorant response ofindividual sensilla on the Drosophila antenna. Invert Neurosci1997, 3(2–3):127-135.Publish with ChemistryCentral and everyscientist can read your work free of chargeOpen access provides opportunities to ourcolleagues in other parts of the globe, by allowinganyone to view the content free of charge.W. Jeffery Hurst, The Hershey Company.available free of charge to the entire scientific communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Centralyours you keep the copyrightSubmit your manuscript here:http://www.chemistrycentral.com/manuscript/Page 10 of 10(page number not for citation purposes)

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!