Extracting SAR Rules from Compound Data

Extraction of SAR Rules from 

Compound Data 

Jürgen Bajorath 

Department of Life Science Informatics 

LIMES Program Chemical Biology and Medicinal Chemistry 

University of Bonn 

How to extract SAR information from 

compound data sets? 

Systematically 

With the aid of graphical representations

Basic SAR Concepts 

6 nM 6 nM 

SAR continuity 

distinct structures with 

similar potency 

SAR discontinuity 

similar structures with 

highly different potency 

(“activity cliff”) 

continuity 

discontinuity 

2.3 µM 

Concept of Activity Landscapes 

“Activity landscapes”: biological activity hypersurfaces 

within chemical space; 

visualized as a 2D projection of chemical space with 

compound potency as the third dimension

Idealized Activity Landscapes and SARs 

Continuous SAR: gradual changes in structure result in moderate changes in activity (“rolling hills”) 

Discontinuous SAR: small changes in structure have dramatic effects on activity (“activity cliffs”) 

Basic SAR Concepts 

Coexistence of continuous and discontinuous SAR components: SAR heterogeneity, corresponding to variable activity landscapes 

[Figure labels: 6 nM, 6 nM; continuity; discontinuity; 2.3 µM]


Variable Activity Landscapes 

Cathepsin S inhibitors 

potency 

- not idealized, but calculated - 

“Coordinate-free” chemical 

space (MACCS Tanimoto 

coefficient distances) 

2D projection through multidimensional 

scaling 

xy-plane: MACCS 

Tanimoto similarity-based 

projection 

z-axis: interpolation of 

potency values 

color code: surface 

elevation 
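The "coordinate-free" chemical space above is defined only by pairwise fingerprint distances. The following is a minimal sketch of that first step, assuming each fingerprint is given as a set of on-bit indices; the toy fingerprints are invented for illustration and are not MACCS keys. The multidimensional scaling projection and the potency interpolation are not shown.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def distance_matrix(fingerprints):
    """Symmetric matrix of Tanimoto distances (1 - Tanimoto coefficient)."""
    n = len(fingerprints)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
    return dist


# Hypothetical toy fingerprints (not real MACCS keys).
toy_fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8})]
dist = distance_matrix(toy_fps)
```

A distance matrix of this kind is the only input a multidimensional scaling routine needs to produce the xy-plane of the landscape.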

Activity Landscapes and SARs 



2D projection of an activity 

landscape 

points represent molecules 

color: potency 

(red: high, green: low) 

area shaded according to 

interpolated potency




2D vs. 3D landscape 

representation 


0.1 nM 

Activity cliff formed by highly 

and weakly potent molecules 

10 μM

Systematic SAR Analysis 

What would we like to learn? 

Activity cliffs 

SAR microenvironments 

subsets of compounds 

representing different local SARs 

Graph Representations 

Discontinuous SAR 

components 

lead optimization 

Continuous SAR 

components 

QSAR, lead hopping 

Basic data 

A list of compounds 

with potency values 

pairwise comparison

Graph Representations 

Common features 

edges determined by 2D structural similarity 

potency used as node annotation 

SAR Index Scoring - Annotation 

Numerical function to characterize SAR features 

GLOBAL SCORE 

all possible compound pairs 

continuity score 

emphasizes structurally diverse compounds having similar potency 

$$\mathrm{cont} = \underset{\{(i,j)\,\mid\,i>j\}}{\text{weighted mean}} \left( \frac{1}{1+\mathrm{sim}(i,j)} \right), \qquad \mathrm{weight}_{ij} = \frac{P_i \cdot P_j}{1 + |P_i - P_j|}$$

discontinuity score 

emphasizes similar compounds with large potency differences 

$$\mathrm{disc} = \underset{\{(i,j)\,\mid\,i>j,\ |P_i - P_j|>1,\ \mathrm{sim}(i,j)>0.65\}}{\text{mean}} \left( |P_i - P_j| \cdot \mathrm{sim}(i,j) \right)$$

(P, potency; sim, pairwise 2D similarity) 

SAR Index (SARI): balances the two parts 

$$\mathrm{SARI} = \frac{1}{2}\left( \mathrm{cont} + (1 - \mathrm{disc}) \right)$$
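The global scores can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' implementation: potencies are assumed to be logarithmic values (e.g. pIC50), `sim` is assumed to be a symmetric similarity matrix, and the `sari` helper assumes that both scores have already been scaled to [0, 1] (the scaling step is not described on the slide). All function names and the toy data are hypothetical.

```python
from itertools import combinations


def continuity_raw(potencies, sim):
    """Potency-weighted mean of 1 / (1 + sim(i, j)) over all compound pairs."""
    num = den = 0.0
    for i, j in combinations(range(len(potencies)), 2):
        weight = potencies[i] * potencies[j] / (1.0 + abs(potencies[i] - potencies[j]))
        num += weight / (1.0 + sim[i][j])
        den += weight
    return num / den if den else 0.0


def discontinuity_raw(potencies, sim, min_diff=1.0, min_sim=0.65):
    """Mean of |P_i - P_j| * sim(i, j) over similar pairs with a large potency gap."""
    terms = [
        abs(potencies[i] - potencies[j]) * sim[i][j]
        for i, j in combinations(range(len(potencies)), 2)
        if abs(potencies[i] - potencies[j]) > min_diff and sim[i][j] > min_sim
    ]
    return sum(terms) / len(terms) if terms else 0.0


def sari(cont_scaled, disc_scaled):
    """Combine two scores that are assumed to be pre-scaled to [0, 1]."""
    return 0.5 * (cont_scaled + (1.0 - disc_scaled))


# Hypothetical toy data: three compounds, pIC50-like potencies.
toy_potencies = [9.0, 6.0, 8.5]
toy_sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]
cont = continuity_raw(toy_potencies, toy_sim)
disc = discontinuity_raw(toy_potencies, toy_sim)
```

In the toy data only the first two compounds are both similar (0.8) and far apart in potency (3 log units), so they alone contribute to the discontinuity score.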

SAR Index Scoring 

Numerical function to characterize SAR features 

node size scaling 

high compound score 

low compound score 

reflects the potency 

deviation of a compound 

from its structurally 

similar neighbors 

LOCAL SCORE 

pairs formed by a given compound 

$$\mathrm{disc}(i) = \underset{\{j\,\mid\,\mathrm{sim}(i,j)>t,\ i \neq j\}}{\text{mean}} \left( |P_i - P_j| \cdot \mathrm{sim}(i,j) \right)$$

discontinuity score 

emphasizes similar 

compounds with large 

potency differences 
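A per-compound version of the score can be sketched as follows. The threshold value `t=0.65` is an assumption made for this example (the slide only names it t), and the function name and toy data are hypothetical.

```python
def compound_discontinuity(i, potencies, sim, t=0.65):
    """Mean of |P_i - P_j| * sim(i, j) over the neighbors j of compound i.

    Neighbors are all other compounds with sim(i, j) > t. A compound
    without neighbors gets a score of 0.0 (a choice made for this sketch).
    """
    terms = [
        abs(potencies[i] - potencies[j]) * sim[i][j]
        for j in range(len(potencies))
        if j != i and sim[i][j] > t
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical toy data: compound 0 is potent, its close analog 1 is weak.
toy_potencies = [9.0, 6.0, 8.5]
toy_sim = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]
local_scores = [
    compound_discontinuity(i, toy_potencies, toy_sim)
    for i in range(len(toy_potencies))
]
```

Compounds 0 and 1 receive the same nonzero score because they form the only similar pair; compound 2 has no neighbors above the threshold.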

Network-like Similarity Graph (NSG) 

Exemplary graphical 

SAR analysis method

Network-like Similarity Graph 


NSG for a set of 

squalene synthase 

inhibitors 

Annotated graph 

representation of 

similarity relationships 

in compound data sets 

Nodes: represent all 

compounds in the data set 

Edges: connect nodes with 

high pairwise similarity 

Clusters: Ward’s hierarchical 

clustering (gray background) 

Layout: Fruchterman-Reingold 
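The node and edge definitions can be illustrated with a small sketch that connects every pair of compounds whose similarity exceeds a threshold. The threshold of 0.65 and the toy similarity matrix are assumptions for this example; clustering and the Fruchterman-Reingold layout are not shown.

```python
from itertools import combinations


def similarity_edges(sim, threshold=0.65):
    """Return edges (i, j), i < j, for compound pairs with sim(i, j) > threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(sim)), 2)
        if sim[i][j] > threshold
    ]


def adjacency(edges, n_nodes):
    """Neighbor sets for every node; unconnected compounds keep an empty set."""
    adj = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


# Hypothetical pairwise similarities for four compounds.
toy_sim = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.7, 0.3, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
edges = similarity_edges(toy_sim)
graph = adjacency(edges, len(toy_sim))
```

Every compound stays in the graph as a node, including singletons such as compound 3 here, which matches the statement that nodes represent all compounds in the data set.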


Annotations: 

node size 

cluster scores 

global scores

NSG – Score Annotations 

node size: compound discontinuity score 

highlights compounds that introduce SAR discontinuity/activity cliffs in a data set 

cluster scores: SAR Index for compound clusters 

indicates the level of continuity/discontinuity in a group of similar compounds 

global scores: SAR Index for the entire compound set 

indicates the level of continuity/discontinuity in the data set 

NSG Interpretation - Local SAR Features

Activity Cliff Index 

NSGs provide interactive 

graphical access to prominent 

activity cliffs 

Cliff Index (CI) enables 

systematic mining and ranking 

of activity cliffs 

CI prioritizes pairs of similar 

compounds having large 

potency differences: 

[Equation: CI(i, j), computed from the potency difference P_i − P_j and the similarity sim(i, j)] 

CI = 15.2 

SAR Pathways from NSGs 

Pathways are annotated with compound 

discontinuity scores to emphasize compounds 

forming activity cliffs 

7 μM 

CI = 13.4 

activity cliff marker 

1 μM 

0.015 nM 

A sequence of pairwise 

similar compounds with 

balanced chemical and 

activity similarity 

potency 

increases from start to end 

node 

potency gradient 

smooth gradients are preferred

SAR Pathways 

Cytochrome P450 

2C19 PubChem 

screening data set 

Preferred pathways with: 

pairwise similar compounds 

scaffold hop 

Pathway SAR Model 

small increase in potency per compound 

large potency difference between start- and endpoint 

smooth potency gradient 

many compounds 

deviation from a linear potency increase 

number of compounds in the pathway 

potency difference between start- and endpoint 

Pathways are based on a predefined SAR model 
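The three quantities named in the pathway SAR model can be computed for a candidate pathway as sketched below. This only illustrates the quantities; how they are combined into a pathway score is not specified here, and the function name, the root-mean-square deviation measure, and the toy pathways are assumptions.

```python
import math


def pathway_summary(potencies):
    """Summarize a pathway from potencies ordered from start node to end node."""
    n = len(potencies)
    if n < 2:
        raise ValueError("a pathway needs at least two compounds")
    start, end = potencies[0], potencies[-1]
    step = (end - start) / (n - 1)
    # Squared deviations from the straight line joining start and end potency.
    squared = [(p - (start + k * step)) ** 2 for k, p in enumerate(potencies)]
    return {
        "n_compounds": n,
        "potency_difference": end - start,
        "rms_deviation": math.sqrt(sum(squared) / n),
        "monotonic": all(b >= a for a, b in zip(potencies, potencies[1:])),
    }


# Hypothetical pathways with the same endpoints (pIC50-like values).
smooth = pathway_summary([5.0, 6.0, 7.0, 8.0])
bumpy = pathway_summary([5.0, 7.5, 7.0, 8.0])
```

Both toy pathways span the same potency difference with the same number of compounds, but only the first follows a smooth gradient; the second deviates from the linear increase at its second compound.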

SAR Trees 

Cytochrome P450 

2C19 PubChem 

screening data set 

SAR Trees 

SAR Trees provide a 

structural context for 

individual pathways 

Activity cliff pathways 

can be monitored 

A set of pathways 

organized in a tree 

root 

all pathways begin (or lead to) 

the same compound 

branches 

identical pathway sections are 

fused into one branch 

leaves 

endpoints of potency gradients 

(most/least potent 

compounds) 

activity cliff
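The root, branch, and leaf description corresponds to a prefix tree over pathways. A minimal sketch, assuming each pathway is a list of compound identifiers that starts at the shared root compound (the identifiers and function names are hypothetical):

```python
def build_sar_tree(pathways):
    """Fuse pathways that share a start compound into a nested-dict tree."""
    if not pathways:
        raise ValueError("at least one pathway is required")
    root = pathways[0][0]
    children = {}
    for path in pathways:
        if path[0] != root:
            raise ValueError("all pathways must start at the same compound")
        node = children
        for compound in path[1:]:
            # Identical pathway sections reuse the same branch.
            node = node.setdefault(compound, {})
    return root, children


def leaves(children):
    """Collect the endpoints of all branches in insertion order."""
    found = []
    for compound, sub in children.items():
        if sub:
            found.extend(leaves(sub))
        else:
            found.append(compound)
    return found


# Hypothetical pathways sharing the root compound "A".
toy_pathways = [["A", "B", "C"], ["A", "B", "D"], ["A", "E"]]
root, tree = build_sar_tree(toy_pathways)
endpoints = leaves(tree)
```

The first two pathways share the section A to B, so they are fused into one branch that splits only at the last step.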

Advanced Application: Studying 

Multi-target SARs Using NSGs 

NSGs can also be utilized 

to compare SAR behavior 

for multiple targets 

Node color reflects 

compound selectivity 

instead of potency 

From Activity Cliffs to Selectivity Cliffs 

Multi-target SARs 

Target-pair selectivity: 

difference between 

logarithmic potency 

$$S_{A/B}(i) = -S_{B/A}(i) = P_A(i) - P_B(i)$$

Structure-selectivity 

relationships (SSRs) 

cathepsin L 

cathepsin B 

pIC50 = 9 pIC50 = 7 

S L/B = 2
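The cathepsin L / cathepsin B example can be reproduced directly from the definition of target-pair selectivity. In this short sketch the function names are hypothetical, and `pic50` assumes IC50 values given in mol/L.

```python
import math


def pic50(ic50_molar):
    """Logarithmic potency: negative decadic logarithm of the IC50 in mol/L."""
    return -math.log10(ic50_molar)


def selectivity(p_a, p_b):
    """Target-pair selectivity: difference between logarithmic potencies."""
    return p_a - p_b


# Values from the slide: pIC50 = 9 for cathepsin L, pIC50 = 7 for cathepsin B.
s_l_over_b = selectivity(9.0, 7.0)
s_b_over_l = selectivity(7.0, 9.0)
```

A selectivity of 2 corresponds to a 100-fold potency difference in favor of cathepsin L, and swapping the targets only flips the sign.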

SAR and SSR Network Analysis 

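Potency-based NSG 

[Figure: potency-based NSGs for cathepsin L and cathepsin B. Node color: potency (scale 10.4 to 3.0). Node size: compound discontinuity score (scale 1 to 0); activity cliff markers are indicated. Cluster discontinuity score: scale from 1 (“rough” SAR) to 0 (“smooth” SAR); values shown: 0.48 and 0.05 for cathepsin L, 0.10 and 0.27 for cathepsin B.] 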


Selectivity-based NSG 

[Figure: selectivity-based NSG for cathepsin L / cathepsin B. Node color: selectivity (scale 3.2 (L) to 3.2 (B)). Node size: compound discontinuity score (scale 1 to 0); selectivity cliff markers are indicated. Cluster discontinuity score: scale from 1 (“rough” SSR) to 0 (“smooth” SSR); values shown: 0.73 and 0.72, labeled discontinuous SSR.] 

Activity Cliffs vs. Selectivity Cliffs 

L B L/B 

discontinuous SAR continuous SAR 

L: 15 nM 

B: 3.5 μM 

L/B: 1.4 

discontinuous SSR 

activity cliff markers selectivity cliff markers 

Local SSR Environments 

cathepsin L / 

cathepsin B 

0.73 

0.72 

L: 10 μM 

B: 170 nM 

L/B: -1.8 

discontinuous SSR

Activity Cliffs vs. Selectivity Cliffs 

L B L/B 

continuous SAR continuous SAR 

L: 3.6 μM 

B: 102 μM 

L/B: 1.5 

Selectivity Determinants 

L/B 

discontinuous SSR 


L: 26 μM 

B: 5.3 μM 

L/B: -0.7 

Molecules with different selectivity are found in the neighborhood of 


Selectivity rules can be formulated 

sel: -0.7 

sel: 0.1 

halogens with increasing 

bulkiness and decreasing 

electronegativity shift 

selectivity toward cat L 

sel: 1.5 

sel: 2.0

Selectivity Determinants 

K/L 

bulkier substituents shift selectivity towards cat L 

sel: 2.3 

sel: -0.6 

sel: -0.8 

sel: -0.8 

Conclusions 

Numerical and graphical analysis tools are developed for 

mining of SAR information in compound data sets 

Annotated similarity-based compound networks play an 

important role for graphical SAR analysis 

NSGs enable a systematic comparison of global and local 

SAR features in compound data sets and the identification 

of activity cliffs 

SAR Trees are based on a predefined SAR model 

NSGs enable a comparative analysis of multi-target SARs
