bbc 2015

10th Benelux Bioinformatics Conference

December 7 - 8, 2015 

Antwerp, Belgium 

www.bbc2015.be 


PROCEEDINGS 

December 7 and 8, 2015 

Antwerp, Belgium 

Elzenveld, Lange Gasthuisstraat 45, 2000 Antwerp, Belgium 


Welcome to the 10th Benelux Bioinformatics Conference!

Dear attendee, 

It is our great pleasure to welcome you to the 10th Benelux Bioinformatics Conference in Antwerp (Belgium)! 

We are especially proud to host this conference, for the first time ever, in Antwerp, the diamond city. 

Ten years of BBC is worth celebrating. The meeting has always struck the right balance between strengthening the regional network and offering a scientifically strong program. From its inception 10 years ago, the BBC has been a prominent platform for the thriving regional bioinformatics community to present its latest research. Many young bioinformatics scientists gained their first experience presenting a poster or giving a talk at one of the BBC editions, and the conference has always attracted a healthy mix of presenters and attendees from all career stages and diverse backgrounds.

The program of this year's edition again demonstrates the wide range of life science disciplines in which bioinformatics now plays a key role. First, we are delighted to introduce two eminent keynote speakers: Cedric Notredame (Center for Genomic Regulation) and Lars Juhl Jensen (Novo Nordisk Foundation Center for Protein Research). Second, a program committee of 36 scientists has critically reviewed a large number of submissions and selected 24 authors to deliver an oral presentation. In addition, we have two special corporate talks. Furthermore, we again have a large number of poster presentations that promise a very interactive poster session, and our corporate sponsors will present their activities at their respective booths. Last but not least, our special guest Pierre Rouzé will give us a perspective on the history of bioinformatics and 10 years of Benelux Bioinformatics Conferences.

For this edition, we would like to congratulate the 10 (mostly master's) students who were selected from a large pool of submissions to receive a student fellowship. For many of them, it is their first chance to participate actively in a scientific conference, and we hope that it inspires them in their future bioinformatics careers.

The program also includes plenty of opportunities for social interaction and networking. The conference dinner, the coffee and lunch breaks, and the farewell drink are perfect occasions to strengthen the network even further.

We cannot close this foreword without a heartfelt word of thanks to the many people who made this event possible. Thanks to the sponsors for their crucial support, to the keynote speakers and all other presenters for presenting their work, to the program committee for reviewing many abstracts, and to the many volunteers and people in the administration of the University of Antwerp for lending a helping hand in many different ways.

Last but not least, thank you for being here and being part of yet another great BBC edition. We wish you an enjoyable and very illuminating meeting.

On behalf of the organizing committee, 

Kris Laukens & Pieter Meysman 

BBC2015 chairs 

University of Antwerp 


Special thanks to the BBC 2015 sponsors! 

Gold sponsors: 

Silver sponsors: 

Bronze sponsors: 

Affiliations: 


Organizing committee

• Kris Laukens, University of Antwerp, Belgium
• Pieter Meysman, University of Antwerp, Belgium
• Geert Vandeweyer, University of Antwerp, Belgium
• Yvan Saeys, Ghent University, Belgium
• Thomas Abeel, Delft University of Technology, The Netherlands

Programme committee

• Thomas Abeel, Delft University of Technology, The Netherlands
• Stein Aerts, University of Leuven, Belgium
• Francisco Azuaje, Luxembourg Institute of Health, Luxembourg
• Gianluca Bontempi, Université libre de Bruxelles, Belgium
• Tomasz Burzykowski, Hasselt University, Belgium
• Susan Coort, Maastricht University, The Netherlands
• Tim De Meyer, Ghent University, Belgium
• Jeroen De Ridder, Delft University of Technology, The Netherlands
• Dick De Ridder, Delft University of Technology, The Netherlands
• Peter De Rijk, University of Antwerp, Belgium
• Pierre Dupont, Université catholique de Louvain, Belgium
• Pierre Geurts, University of Liège, Belgium
• Peter Horvatovich, University of Groningen, The Netherlands
• Jan Ramon, University of Leuven, Belgium
• Rob Jelier, University of Leuven, Belgium
• Gunnar Klau, Centrum Wiskunde & Informatica, The Netherlands
• Andreas Kremer, ITTM S.A., Luxembourg
• Kris Laukens, University of Antwerp, Belgium
• Tom Lenaerts, Université libre de Bruxelles, Belgium
• Steven Maere, Ghent University / VIB, Belgium
• Lennart Martens, Ghent University / VIB, Belgium
• Pieter Meysman, University of Antwerp, Belgium
• Perry Moerland, University of Amsterdam, The Netherlands
• Pieter Monsieurs, SCK-CEN, Belgium
• Yves Moreau, University of Leuven, Belgium
• Yvan Saeys, Ghent University / VIB, Belgium
• Thomas Sauter, University of Luxembourg, Luxembourg
• Alexander Schoenhuth, Centrum Wiskunde & Informatica, The Netherlands
• Berend Snel, Utrecht University, The Netherlands
• Dirk Valkenborg, VITO, Belgium
• Raf Van de Plas, Delft University of Technology, The Netherlands
• Vera van Noort, University of Leuven, Belgium
• Natal van Riel, Eindhoven University of Technology, The Netherlands
• Klaas Vandepoele, Ghent University / VIB, Belgium
• Geert Vandeweyer, University of Antwerp, Belgium
• Wim Vrancken, Vrije Universiteit Brussel, Belgium


Local Organizing Committee

• Charlie Beirnaert, University of Antwerp
• Wout Bittremieux, University of Antwerp
• Bart Cuypers, University of Antwerp
• Nicolas De Neuter, University of Antwerp
• Aida Mrzic, University of Antwerp
• Stefan Naulaerts, University of Antwerp

The results published in this book of abstracts are under the full responsibility of the authors. The organizing committee cannot be held responsible for any errors in this publication or potential consequences thereof.


Conference agenda 1/2 

December 6, 2015: Satellite events 

12.30 – 19.00 Student-run satellite meeting at the Institute of Tropical Medicine, Antwerp. 

19.00 - … Guided sightseeing tour of Antwerp for early arrivals. 

December 7, 2015: Main Conference 

8.30 - 9.30 Registration and welcome coffee. 

9.30 - 9.50 Welcome and conference opening, with foreword by UAntwerpen Rector Prof. Alain Verschoren.

9.50 - 10.50 K1 Invited keynote: Lars Juhl Jensen. Medical data and text mining: Linking diseases, drugs, and adverse reactions.

10.50 - 11.10 Coffee break. 

Selected talks session 1 

11.10 - 11.25 O1 Mafalda Galhardo, Philipp Berninger, Thanh-Phuong Nguyen, Thomas Sauter and Lasse Sinkkonen. Cell type-selective disease association of genes under high regulatory load.

11.25 - 11.40 O2 Andrea M. Gazzo, Dorien Daneels, Maryse Bonduelle, Sonia Van Dooren, Guillaume Smits and Tom Lenaerts. Predicting oligogenic effects using digenic disease data.

11.40 - 11.55 O3 Wouter Saelens, Robrecht Cannoodt, Bart N. Lambrecht and Yvan Saeys. A comprehensive comparison of module detection methods for gene expression data.

11.55 - 12.10 O4 Joana P. Gonçalves and Sara C. Madeira. LateBiclustering: Efficient discovery of temporal local patterns with potential delays.

12.10 - 12.30 C1 Nicolas Goffard. Illumina software platforms to transform the path to knowledge and discovery. (Corporate presentation: Illumina)


12.30 - 15.00 Lunch break & poster session. 


15.00 - 15.15 O5 Robrecht Cannoodt, Katleen De Preter and Yvan Saeys. Inferring developmental chronologies from single cell RNA.

15.15 - 15.30 O6 Vân Anh Huynh-Thu and Guido Sanguinetti. Combining tree-based and dynamical systems for the inference of gene regulatory networks.

15.30 - 15.45 O7 Annika Jacobsen, Nika Heijmans, Renée van Amerongen, Martine Smit, Jaap Heringa and K. Anton Feenstra. Modeling the Regulation of β-Catenin Signalling by WNT stimulation and GSK3 inhibition.

15.45 - 16.00 O8 Thanh Le Van, Jimmy Van den Eynden, Dries De Maeyer, Ana Carolina Fierro, Lieven Verbeke, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal. Ranked tiling based approach to discovering patient subtypes.

16.00 - 16.15 O9 Martin Bizet, Jana Jeschke, Matthieu Defrance, François Fuks and Gianluca Bontempi. Development of a DNA methylation-based score reflecting Tumour Infiltrating Lymphocytes.

16.15 - 16.30 O10 Aliaksei Vasilevich, Shantanu Singh, Aurélie Carlier and Jan de Boer. Prediction of cell responses to surface topographies using machine learning techniques.

16.30 - 17.00 Coffee break. 


17.00 - 17.15 O11 Wout Bittremieux, Pieter Meysman, Lennart Martens, Bart Goethals, Dirk Valkenborg and Kris Laukens. Analysis of mass spectrometry quality control metrics.

17.15 - 17.30 O12 Şule Yılmaz, Masa Cernic, Friedel Drepper, Bettina Warscheid, Lennart Martens and Elien Vandermarliere. Xilmass: A cross-linked peptide identification algorithm.

17.30 - 17.45 O13 Nico Verbeeck, Jeffrey Spraggins, Yousef El Aalamat, Junhai Yang, Richard M. Caprioli, Bart De Moor, Etienne Waelkens and Raf Van de Plas. Automated anatomical interpretation of differences between imaging mass spectrometry experiments.

17.45 - 18.00 O14 Yousef El Aalamat, Xian Mao, Nico Verbeeck, Junhai Yang, Bart De Moor, Richard M. Caprioli, Etienne Waelkens and Raf Van de Plas. Enhancement of imaging mass spectrometry data through removal of sparse intensity variations.

18.10 - 18.30 Walk to the gala dinner leaving from conference venue. 

18.30 - 22.00 Gala dinner at Pelgrom – Pelgrimstraat 15, Antwerpen. 


Conference agenda 2/2 

December 8, 2015: Main Conference 

8.30 - 9.30 Welcome coffee. 

9.30 - 9.40 Opening and announcements. 


9.40 - 9.55 O15 Gipsi Lima Mendez, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin, Fabrizio Carcillo, Simon Roux, Gianluca Bontempi, Matthew B. Sullivan, Chris Bowler, Eric Karsenti, Colomban de Vargas and Jeroen Raes. Determinants of community structure in the plankton interactome.

9.55 - 10.10 O16 Mohamed Mysara, Yvan Saeys, Natalie Leys, Jeroen Raes and Pieter Monsieurs. Bioinformatics tools for accurate analysis of amplicon sequencing data for biodiversity analysis.

10.10 - 10.25 O17 Sjoerd M. H. Huisman, Else Eising, Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Arn van den Maagdenberg and Marcel Reinders. Gene co-expression analysis identifies brain regions and cell types involved in migraine pathophysiology: a GWAS-based study using the Allen Human Brain Atlas.

10.25 - 10.40 O18 Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Aldo Grefhorst, Isabel Mol, Hetty Sips, Jose van den Heuvel, Jenny Visser, Marcel Reinders and Onno Meijer. Spatial co-expression analysis of steroid receptors in the mouse brain identifies region-specific regulation mechanisms.

10.40 - 11.10 Coffee break. 


11.10 - 11.25 O19 Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Malgorzata Domagalska, Jean-Claude Dujardin and Kris Laukens. A systems biology compendium for Leishmania donovani.

11.25 - 11.40 O20 Volodimir Olexiouk, Elvis Ndah, Sandra Steyaert, Steven Verbruggen, Eline De Schutter, Alexander Koch, Daria Gawron, Wim Van Criekinge, Petra Van Damme and Gerben Menschaert. Multi-omics integration: Ribosome profiling applications.

11.40 - 11.55 O21 Qingzhen Hou, Kamil Krystian Belau, Marc Lensink, Jaap Heringa and K. Anton Feenstra. CLUB-MARTINI: Selecting favorable interactions amongst available candidates: A coarse-grained simulation approach to scoring docking decoys.

11.55 - 12.10 O22 Elien Vandermarliere, Davy Maddelein, Niels Hulstaert, Elisabeth Stes, Michela Di Michele, Kris Gevaert, Edgar Jacoby, Dirk Brehmer and Lennart Martens. Pepshell: Visualization of conformational proteomics data.


12.10 - 12.30 C2 Carine Poussin. The systems toxicology computational challenge: Identification of exposure response markers. (Corporate presentation: sbv IMPROVER)

12.30 - 13.30 Lunch break.

13.30 - 14.30 K2 Invited keynote: Cedric Notredame. Multiple survival strategies to deal with the multiplication of multiple sequence alignment methods.


14.30 - 14.45 O23 Thomas Moerman, Dries Decap and Toni Verbeiren. Interactive VCF comparison using Spark Notebook.

14.45 - 15.00 O24 Sepideh Babaei, Waseem Akhtar, Johann de Jong, Marcel Reinders and Jeroen de Ridder. 3D hotspots of recurrent retroviral insertions reveal long-range interactions with cancer genes.

15.00 - 15.30 Coffee break. 

15.30 - 16.00 K3 Invited keynote: Pierre Rouzé. Thirty years in Bioinformatics. 

16.00 - 16.30 Closing and awards. 

16.30 - 17.00 Closing reception. 


Gala dinner 

The gala event will take place at the Pelgrom, a medieval-style restaurant within walking distance of the Elzenveld conference location, on the evening of Monday December 7th, after the conference programme, from 18h30 until 22h00. Gala dinner participation is optional, although highly recommended!

The Pelgrom is one of Antwerp’s most historic eating and drinking places, situated in authentic 15th-century cellars that were used by merchants for temporary storage during the two big annual Antwerp fairs. Prepare to feast on a medieval buffet in the style of Antwerp’s Golden Century!

The Pelgrom is within walking distance of the Elzenveld conference location. For people using public transportation, the Antwerp-Central train station can easily be reached after the gala dinner by tram from the Groenplaats station (10 minutes) or on foot (20 minutes).

Where? Restaurant Pelgrom, Pelgrimstraat 15, 2000 Antwerp

When? Monday December 7th, 2015; 18h30 - 22h00 


List of abstracts 

Keynotes

K1 MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS
K2 MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS

Corporate presentations

C1 ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO KNOWLEDGE AND DISCOVERY
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS

Selected oral presentations

O1 CELL TYPE-SELECTIVE DISEASE ASSOCIATION OF GENES UNDER HIGH REGULATORY LOAD
O2 PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA
O3 A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA
O4 LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL PATTERNS WITH POTENTIAL DELAYS
O5 INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL RNA
O6 COMBINING TREE-BASED AND DYNAMICAL SYSTEMS FOR THE INFERENCE OF GENE REGULATORY NETWORKS
O7 MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION
O8 RANKED TILING BASED APPROACH TO DISCOVERING PATIENT SUBTYPES
O9 DEVELOPMENT OF A DNA METHYLATION-BASED SCORE REFLECTING TUMOUR INFILTRATING LYMPHOCYTES
O10 PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES USING MACHINE LEARNING TECHNIQUES
O11 ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS
O12 XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM
O13 AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS
O14 ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS
O15 DETERMINANTS OF COMMUNITY STRUCTURE IN THE PLANKTON INTERACTOME
O16 BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON SEQUENCING DATA FOR BIODIVERSITY ANALYSIS
O17 GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS


O18 SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION MECHANISMS
O19 A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI
O20 MULTI-OMICS INTEGRATION: RIBOSOME PROFILING APPLICATIONS
O21 CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS
O22 PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS DATA
O23 INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK
O24 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL LONG-RANGE INTERACTIONS WITH CANCER GENES

Poster presentations

P1 KNN-MDR APPROACH FOR DETECTING GENE-GENE INTERACTIONS
P2 CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC PATHWAYS IN FUNGI
P3 VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS USING POLIMERO AND POLIMERO-BIO
P4 DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND
P5 BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE
P6 ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING LONG-TERM PATIENT GUT COLONIZATION
P7 XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC
P8 IDENTIFICATION OF NUMTS THROUGH NGS DATA
P9 MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA
P10 FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN BIOLOGICAL INTERPRETATION OF GWAS RESULTS
P11 IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS IN SETS OF FUNCTIONALLY RELATED GENES
P12 PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS
P13 THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS
P14 NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM WHOLE GENOME NGS DATA
P15 ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR NANOMATERIAL SAFETY EVALUATION
P16 BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE
P17 TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS
P18 RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH ALTERNATIVE FUNCTIONALITY IN MUSHROOMS
P19 MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED QUANTITATIVE PROTEOMICS
P20 A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN CANCER
P21 GEVACT: GENOMIC VARIANT CLASSIFIER TOOL
P22 MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT EXPERIMENTS
P23 HIGHLANDER: VARIANT FILTERING MADE EASIER


P24 DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION DATA WITH MULTIPLE DOSES AND TIME POINTS
P25 IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS USING A “DUMMY” LIGAND APPROACH
P26 PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE
P27 DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA
P28 APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY
P29 INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION
P30 GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA
P31 KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS
P32 ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES
P33 THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE
P34 FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS
P35 HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA
P36 LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS
P37 ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA
P38 MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION
P39 ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS
P40 RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES
P41 RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT
P42 EARLY FOLDING AND LOCAL INTERACTIONS
P43 BINDING SITE SIMILARITY DRUG REPOSITIONING: A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY AND SIDE EFFECTS DETECTION
P44 ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH
P45 REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION
P46 ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION
P47 MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS
P48 NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND INTRINSIC DISORDER
P49 OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL MODELS
P50 EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES
P51 INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES
P52 SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS
P53 FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND COMPARE CYTOMETRY DATA IN THE BROWSER
P54 TOWARDS A BELGIAN REFERENCE SET
P55 MANAGING BIG IMAGING DATA FROM MICROSCOPY: A DEPARTMENTAL-WIDE APPROACH


P56 ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN CANCER GENOMES USING ENHANCER PREDICTION MODELS AND MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA
P57 I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN SEQUENCE VISUALIZATION
P58 SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS
P59 MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM
P60 COEXPNETVIZ: THE CONSTRUCTION AND VIZUALISATION OF CO-EXPRESSION NETWORKS
P61 THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES
P62 FLOREMI: SURVIVAL TIME PREDICTION BASED ON FLOW CYTOMETRY DATA
P63 STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS
P64 THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO THE SOURDOUGH FERMENTATION PROCESS
P65 ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST
P66 PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING
P67 IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING A NETWORK-BASED APPROACH
P68 DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING
P69 HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES USING MATRIX FACTORIZATION
P70 THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS DISTRIBUTION

Corporate poster presentations

C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS


K1. MEDICAL DATA AND TEXT MINING: 

LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS 

Lars Juhl Jensen 

Clinical data describing the phenotypes and treatment of patients is an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for revealing unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce the centralized Danish health registries and show how we use them for identification of temporal disease correlations and discovery of common diagnosis trajectories of patients. I will also describe how we perform text mining of the clinical narrative from electronic health records and use this for identification of new adverse reactions of drugs.
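As an illustration only, the following Python sketch shows one simple way to count directed diagnosis pairs (diagnosis A first recorded before diagnosis B in the same patient), a basic building block for studying temporal disease correlations. The record layout, the function name `directed_pair_counts` and the toy diagnosis codes are assumptions made for this example; they do not describe the Danish registries or the method used in the presentation.

```python
from collections import Counter
from itertools import combinations


def directed_pair_counts(records):
    """Count patients in whom diagnosis A is first recorded before diagnosis B.

    `records` maps a patient ID to a list of (ISO date string, diagnosis code)
    tuples. This layout is hypothetical and chosen only for the example.
    """
    counts = Counter()
    for diagnoses in records.values():
        first_seen = {}
        # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
        for date, code in sorted(diagnoses):
            first_seen.setdefault(code, date)
        for a, b in combinations(sorted(first_seen), 2):
            if first_seen[a] < first_seen[b]:
                counts[(a, b)] += 1
            elif first_seen[b] < first_seen[a]:
                counts[(b, a)] += 1
            # First diagnoses on the same day carry no direction and are skipped.
    return counts


# Toy data, invented for illustration.
toy_records = {
    "patient_1": [("2010-01-05", "E11"), ("2012-03-01", "I10")],
    "patient_2": [("2011-06-10", "E11"), ("2011-09-20", "I10"), ("2013-02-02", "N18")],
    "patient_3": [("2009-04-04", "I10"), ("2014-08-08", "E11")],
}
pair_counts = directed_pair_counts(toy_records)
```

Pairs whose counts are strongly asymmetric, relative to what the overall frequencies of the two diagnoses would predict, are candidates for a directed correlation. A real analysis would also need to account for factors such as age, sex and follow-up time.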

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015 

Abstract ID: K2 

Keynote 


K2. MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE 

MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS 

Cedric Notredame 

In this seminar I will introduce some of the latest developments in the field of multiple sequence alignment construction, including some of the work from my group. I will briefly review the main challenges and the latest work in the field, including ClustalO and phylogeny-aware aligners like SATe, and how these aligners relate to consistency-based methods like T-Coffee. I will also look at the complex relationship between multiple sequence alignment accuracy, structural modeling and phylogenetic tree reconstruction, and introduce the notion of a reliability index while reviewing some of the latest advances in this field, including the TCS (transitive consistency score). I will show how this index can be used to identify both structurally correct positions in an alignment and evolutionarily informative sites, thus suggesting more unity than initially thought between these two parameters. I will then introduce the structure-based clustering method we recently developed to further test these hypotheses. I will finish with some considerations on the main challenges that need to be confronted for the accurate modeling of relationships among biological sequences, with special attention to genomic and RNA sequences. All methods are available from www.tcoffee.org.

REFERENCES 

TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Chang JM, Di Tommaso P, Notredame C. Mol Biol Evol. 2014 Jun;31(6):1625-37. doi: 10.1093/molbev/msu117. Epub 2014 Apr 1.

Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Kemena C, Bussotti G, Capriotti E, Marti-Renom MA, Notredame C. Bioinformatics. 2013 May 1;29(9):1112-9. doi: 10.1093/bioinformatics/btt096. Epub 2013 Feb 28.

Alignathon: a competitive assessment of whole-genome alignment methods. Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V, Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Genome Res. 2014 Dec;24(12):2077-89. doi: 10.1101/gr.174920.114. Epub 2014 Oct 1.

Epistasis as the primary factor in molecular evolution. Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Nature. 2012 Oct 25;490(7421):535-8. doi: 10.1038/nature11510. Epub 2012 Oct 14.


Abstract ID: C1 

Corporate presentation 


C1. ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO 

KNOWLEDGE AND DISCOVERY 

Nicolas Goffard 

Illumina, Inc. ngoffard@illumina.com 

The next big bottleneck in the biological sample-to-answer workflow has undoubtedly moved beyond the generation of the raw data towards its initial processing and analysis, and even more so its biological and medical interpretation. There are two main reasons why this is particularly challenging for research organisations. Firstly, there is a need to easily and securely analyse, archive and share sequencing data, as well as to simplify and accelerate the data analysis with push-button tools using widely validated and scientifically accepted algorithms. Secondly, there is a requirement to normalize, standardize and curate not just their proprietary data from multiple studies, but to do it in a way that allows them to compare it in real time to data produced from public domain studies. Illumina provides two integrated software platforms to overcome these challenges, BaseSpace and NextBio. This presentation provides an overview of the capabilities of both platforms, which empower biologists and informaticians to interactively explore the data.


Abstract ID: C2 

Corporate presentation 


C2. THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: 

IDENTIFICATION OF EXPOSURE RESPONSE MARKERS 

Carine Poussin, Vincenzo Belcastro, Stéphanie Boué, Florian Martin, 

Alain Sewer, Bjoern Titz, Manuel C. Peitsch & Julia Hoeng. 

Philip Morris International Research and Development, Philip Morris Product SA, 

Quai Jeanrenaud 5, CH-2000 Neuchâtel, Switzerland 

INTRODUCTION 

Risk assessment in the context of 21st century toxicology relies on the identification of specific exposure response markers and the elucidation of mechanisms of toxicity, which can lead to adverse events. As a foundation for this future predictive risk assessment, diverse sets of chemicals or mixtures are tested in different biological systems, and datasets are generated using high-throughput technologies. However, the development of effective computational approaches for the analysis and integration of these datasets remains challenging.

METHODS 

The sbv IMPROVER (Industrial Methodology for Process Verification in Research; http://sbvimprover.com/) project aims to verify methods and concepts in systems biology research via challenges posed to the scientific community. In fall 2015, the 4th sbv IMPROVER computational challenge will be launched; it is aimed at evaluating algorithms for the identification of specific markers of chemical mixture exposure response in the blood of humans or rodents. Blood is an easily accessible matrix, but it remains a complex biofluid to analyze. This computational challenge will address questions related to the classification of samples based on transcriptomics profiles from well-defined sample cohorts. Moreover, it will address whether gene expression data derived from human or rodent whole blood are sufficiently informative to identify human-specific or species-independent blood gene signatures predictive of the exposure status of a subject to chemical mixtures (current/former/non-exposure).

RESULTS & DISCUSSION 

Participants will be provided with high quality datasets to develop predictive models/classifiers and the predictions will be scored by an independent scoring panel. The results and post-challenge analyses will be shared with the scientific community, and will open new avenues in the field of systems toxicology.
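The abstract does not specify how predictions will be scored. Purely as a hedged sketch, the Python snippet below computes per-class recall and its unweighted mean (often called balanced accuracy) for a three-class exposure status prediction. The function names, the label strings and the toy data are assumptions made for this example and are not part of the sbv IMPROVER scoring procedure.

```python
from collections import defaultdict


def per_class_recall(true_labels, predicted_labels):
    """Return, for each true class, the fraction of its samples predicted correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == prediction:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(true_labels, predicted_labels):
    """Unweighted mean of the per-class recalls, robust to unequal class sizes."""
    recalls = per_class_recall(true_labels, predicted_labels)
    return sum(recalls.values()) / len(recalls)


# Toy labels, invented for illustration (hypothetical exposure classes).
truth = ["current", "current", "former", "former", "non-exposed", "non-exposed"]
predicted = ["current", "former", "former", "former", "non-exposed", "current"]
score = balanced_accuracy(truth, predicted)
```

Averaging recall per class rather than over all samples keeps a large class from dominating the score, which matters when cohorts of current, former and non-exposed subjects differ in size.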

REFERENCES 

Meyer et al. Industrial methodology for process verification in research (IMPROVER): toward systems biology verification. Bioinformatics, 2012.

Meyer et al. Verification of systems biology research in the age of collaborative competition. Nat Biotechnol, 2011.

Tarca et al. Strengths and limitations of microarray-based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics, 2013.

Hartung, T. Lessons learned from alternative methods and their validation for a new toxicology in the 21st century. Journal of toxicology and environmental health, 2010.

Hoeng et al. A network-based approach to quantifying the impact of biologically active substances. Drug Discov Today, 2012.


Abstract ID: O1 

Oral presentation 


O1. CELL TYPE-SELECTIVE DISEASE ASSOCIATION 

OF GENES UNDER HIGH REGULATORY LOAD 

Mafalda Galhardo 1 , Philipp Berninger 2 , Thanh-Phuong Nguyen 1 , Thomas Sauter 1 & Lasse Sinkkonen 1*. 

Life Sciences Research Unit, University of Luxembourg, Luxembourg, Luxembourg 1 ; Biozentrum, University of Basel 

and Swiss Institute of Bioinformatics, Basel, Switzerland 2 . * lasse.sinkkonen@uni.lu 

Identification of biomarkers and drug targets is a key task of biomedical research. We previously showed that disease-linked

metabolic genes are often under combinatorial regulation (Galhardo et al. 2014). Here we extend this analysis to 

include almost 100 transcription factors (TFs) and key histone modifications from over 100 samples to show that genes 

under high regulatory load (HRL) are enriched for disease-association across cell types. Network and pathway analysis 

suggests the central role of HRL genes in biological networks, under heavy regulation both at transcriptional and post-transcriptional

level, as a possible explanation for the observed enrichment. Thus, epigenomic mapping of enhancers 

presents an unbiased approach for identification of novel disease-associated genes. 

INTRODUCTION 

Identification of disease-relevant genes and gene products 

as biomarkers and drug targets is one of the key tasks of

biomedical research. Still, a great majority of research is 

focused on a small minority of genes while many remain 

unstudied (Pandey et al. 2014). Unbiased prioritization 

within these ignored genes would be important to harvest 

the full potential of genomics in understanding diseases. 

Many databases to catalog disease-associated genes have 

been created, including DisGeNET that draws from 

multiple sources (Bauer-Mehren et al. 2010). In addition, 

large amounts of publicly available epigenomic data on 

the cell type-selective regulation of these genes have been

produced. The importance of epigenetic regulation for 

disease development is increasingly recognized, for 

example in analysis of GWAS studies where causal SNPs 

are mostly located within gene regulatory regions 

(Maurano et al. 2012). 

METHODS 

Public ChIP-seq data produced by the ENCODE project 

(Dunham et al. 2012), the BLUEPRINT Epigenome 

project (Martens et al. 2013) and the NIH Epigenomic 

Roadmap project (Kundaje et al. 2015) were downloaded 

in May 2014. The data were used to rank active protein

coding genes (based on NCBI Entrez and marked by 

H3K4me3) by their regulatory load based on the number 

of associated TFs or enhancer (H3K27ac) regions using 

the GREAT tool. The enrichment of disease genes from

DisGeNET among HRL genes was tested using either the

Matlab® hypergeometric cumulative distribution function,

adjusted for multiple testing with the Benjamini-Hochberg

procedure, or a normalized enrichment score.

Enriched diseases were clustered using the R package

“blockcluster”. Peak calling for super-enhancers was done 

using HOMER. A liver disease gene network was 

constructed from HPRD based on liver disease genes

from MeSH and genes from CTD and had 8278 

interactions. Statistical analysis of KEGG pathway 

enrichments and betweenness centrality was done using 

random sampling tests. miRNA target predictions were 

obtained from TargetScan6.2. Further details of the

methods used can be found in Galhardo et al. 2015.

RESULTS & DISCUSSION

Using ENCODE ChIP-Seq profiles for 93 transcription 

factors (TFs) in nine cell lines, we show that HRL genes 

are enriched for disease-association across cell types 

(Figure 1). TF load correlates with the enhancer load of 

the genes, allowing the identification of HRL genes by 

epigenomic mapping of active enhancers marked by 

H3K27ac modifications. Identification of the HRL genes 

across 139 samples from 96 different cell and tissue types 

reveals a consistent enrichment for disease-associated 

genes in a cell type-selective manner. 
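
The hypergeometric enrichment test and the Benjamini-Hochberg adjustment named in the Methods can be sketched in a few lines. The following is a minimal Python illustration rather than the authors' Matlab code; the function names, the gene counts and the disease labels are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k): chance of seeing at least k disease genes among `draws`
    HRL genes, given `successes` disease genes in a `population` of genes."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: (disease genes among HRL genes, disease genes overall).
ACTIVE_GENES, HRL_GENES = 10000, 500
counts = {"disease A": (30, 200), "disease B": (12, 220), "disease C": (4, 90)}
raw = [hypergeom_sf(k, ACTIVE_GENES, n, HRL_GENES) for k, n in counts.values()]
for name, q in zip(counts, benjamini_hochberg(raw)):
    print(f"{name}: adjusted p = {q:.3g}")
```

Exact integer binomials keep the sketch dependency-free; for genome-scale counts a statistics library would normally be used instead.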

The HRL genes are involved in more pathways than 

expected by chance, exhibit increased betweenness 

centrality in the interaction network of liver disease genes, 

and carry longer 3’UTRs with more microRNA binding 

sites than genes on average, suggesting a role as hubs 

within regulatory networks. 
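
A random sampling test of the kind mentioned in the Methods can be illustrated as follows. This is a sketch under assumptions: the per-gene values (for example betweenness centrality or the number of pathways) are taken as given, and the one-sided comparison of means, the number of samples and the function name are choices made here, not details from the abstract.

```python
import random


def random_sampling_pvalue(values, selected, n_samples=10000, seed=0):
    """Empirical one-sided p-value for the mean of `selected` genes being
    higher than the mean of random gene sets of the same size.

    `values` maps every gene in the background to a number."""
    rng = random.Random(seed)
    genes = list(values)
    size = len(selected)
    observed = sum(values[g] for g in selected) / size
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(genes, size)
        if sum(values[g] for g in sample) / size >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)


# Toy background of 100 genes whose value grows with the gene index.
toy_values = {f"g{i}": float(i) for i in range(100)}
toy_hrl = [f"g{i}" for i in range(95, 100)]
print(random_sampling_pvalue(toy_values, toy_hrl, n_samples=2000))
```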

Thus, epigenomic mapping of enhancers presents an 

unbiased approach for identification of novel disease-associated

genes (Galhardo et al. 2015). 

FIGURE 1. Workflow of the disease-gene enrichment analysis. Human ChIP-seq data (transcription factor binding sites for 93 TFs in 9 ENCODE cell lines: A549, GM12878, H1hESC, HCT116, HeLaS3, HepG2, HUVEC, K562 and MCF7; active enhancers marked by H3K27ac in 139 samples comprising 96 tissue or cell types) are used to rank genes by regulatory load (number of TFs or enhancers per gene). High regulatory load genes are enriched for disease association (disease genes with min score 0.08).

REFERENCES 

Pandey AK et al. PLoS One, 9:e88889 (2014). 

Bauer-Mehren A et al. Nucleic Acids Res., 33:D514-D517 (2010). 

Maurano et al. Science, 337:1190-1195 (2012). 

Galhardo et al. Nucleic Acids Res. 42:1474-1496 (2014).

Dunham et al. Nature, 489:57-74 (2012).

Martens et al. Haematologica, 98:1487-1489 (2013).

Kundaje et al. Nature, 518:317-330 (2015). 

Galhardo et al. Nucleic Acids Res. 10.1093/nar/gkv863 (2015). 






O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA 

Andrea M. Gazzo 1,2,3* , Dorien Daneels 1,3 , Maryse Bonduelle 3 , Sonia Van Dooren 1,3 , Guillaume Smits 1,4 & Tom 

Lenaerts 1,2,5 . 

Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium 1 ; MLG, Departement d'Informatique, 

Universite Libre de Bruxelles, Brussels, Belgium 2 ; Center for Medical Genetics, Reproduction and Genetics, 

Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussel, Belgium 3 ; Genetics, 

Hopital Universitaire des Enfants Reine Fabiola, Universite Libre de Bruxelles, Brussels, Belgium 4 ; 

Computerwetenschappen, Vrije Universiteit Brussel, Brussel, Belgium 5 . * Andrea.Gazzo@ulb.ac.be 

Recent research has shown that disorders may be better described by more complex inheritance mechanisms, advocating 

that some monogenic diseases may in fact be oligogenic. Understanding how the combined interplay and weight of

variants lead to disease may provide improved and novel insights into diseases classically considered monogenic.

Here we present a unique classification method that separates two types of digenic diseases, i.e. those that require

variants in both genes to induce the disease and those where one is causative and the second increases the severity. Our 

results show that a clear separation can be made between both classes using gene and variant-level features extracted 

from DIDA. 

INTRODUCTION 

DIDA is a novel database that provides for the first time 

detailed information on genes and associated genetic 

variants involved in digenic diseases, the simplest form of 

oligogenic inheritance [1]. The database is accessible via

http://dida.ibsquare.be and currently includes 213 digenic

combinations involved in 44 different digenic diseases [2].

These combinations are composed of 364 distinct variants, 

which are distributed over 136 distinct genes. Creating this 

new repository was essential, as current databases do not 

allow one to retrieve detailed records regarding digenic 

combinations. Genes, variants, diseases and digenic 

combinations in DIDA are annotated with manually 

curated information and information mined from other 

online resources. Each digenic combination was 

categorized into one of two effect classes: either "on/off",

in which variant combinations in both genes are required

to develop the disease, or "severity", where variants in

one gene are enough to develop the disease and carrying

variant combinations in two genes increases the severity of

the disease or affects its age of onset. In this work we present a predictor

capable of distinguishing between the digenic effect 

classes. We analyse the result of this predictor in relation 

to specific features collected for the different digenic 

combinations in DIDA, such as the

haploinsufficiency of the genes, their zygosity and the 

relationship between them, providing insight into the 

biological meaning of the result. 

METHODS 

We used a machine learning approach to determine the 

classes, i.e. "severity" or "on/off", of a digenic 

combination. Starting with feature selection, we chose the

most informative features to assign each digenic

combination to one of the two classes. For each of the two

genes involved in a digenic combination, we used as features

the zygosity (heterozygote, homozygote, etc.), the recessiveness

probability, the haploinsufficiency score, known recessive

information, and whether the gene is essential (based on

mouse knockout experimental data). At the variant level,

we used as features the pathogenicity predictions from the

SIFT and PolyPhen-2 tools. Finally, we also encoded the

relationship between the two genes, defining the relations

"Similar function", "Directly interacting" and "Pathway

membership". After several tests, we chose a

random forest algorithm, as this approach gave the best

results.

RESULTS & DISCUSSION

After 10-fold cross-validation we obtained promising

performance, with an MCC of 0.67 and an AUROC of 0.92.

Unfortunately, this performance is an overestimate: because

the gene-based features are the most important, many

examples with mutations mapped to the same gene pair

lead to the same oligogenic effect class. A stratification

that ensures that the same pair of genes is never in both

the training and the testing set was therefore required. We

manually created 5 subsets, where the instances with the

same gene pair belong to the same subset. After this

procedure we assessed the performance again, obtaining

an MCC of 0.36 and an AUROC of 0.78. To verify

the significance of this performance, we retrained the

random forest on a randomization of the data. This

randomization was obtained by shuffling all the features

for each instance while keeping the class unchanged. This

reshuffling resulted in an MCC close to zero and an

AUROC close to 0.5, as expected. This additional test

confirms the significance of the stratified results.
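
The gene-pair stratification and the MCC used above can be illustrated with a short Python sketch. The abstract states that the five subsets were created manually; the greedy fold assignment below is an assumed automated stand-in, and the function names and toy gene pairs are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def gene_pair_folds(instances, n_folds=5):
    """Assign instances to folds so that each gene pair lands in one fold only.

    `instances` is a list of (gene_a, gene_b) tuples, one per digenic
    combination; the order of the two genes is ignored."""
    groups = defaultdict(list)
    for idx, (gene_a, gene_b) in enumerate(instances):
        groups[frozenset((gene_a, gene_b))].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest groups first, each into the smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy gene pairs; each test fold would be held out in turn.
toy_pairs = [("A", "B"), ("B", "A"), ("C", "D"), ("A", "B"), ("E", "F"), ("C", "D")]
print(gene_pair_folds(toy_pairs, n_folds=3))
```

Grouping by an unordered gene pair is what prevents the leakage described above: every instance of a pair is either all in training or all in testing.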

As a next step, we are analysing the relationship between

the oligogenic effect and the features used, particularly in

terms of biological and molecular interpretation. As a

future perspective, the potential clinical benefit is

promising: one goal of medical genetics is to assign

predictive value to the genotype, so that it can assist in

diagnosis and disease management. If we can infer, based

on the genotype, what the digenic/oligogenic effect will be,

we can potentially anticipate the treatment.

REFERENCES 

[1] Gazzo, A. et al., DIDA: a curated and annotated digenic diseases 

database, under review for the NAR database issue (2016).

[2] Schäffer, A. A. (2013) Digenic inheritance in medical genetics. 

J. Med. Genet., 50, 641–652. 






O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS 

FOR GENE EXPRESSION DATA 

Wouter Saelens 1,2* , Robrecht Cannoodt 1,2,3 , Bart N. Lambrecht 1,2 & Yvan Saeys 1,2 . 

VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Center for Medical 

Genetics, Ghent University Hospital 3 . * wouter.saelens@ugent.be 

Module detection is central in every analysis of large scale gene expression data. While numerous methods have been 

developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known

gene regulatory networks to do an unbiased comparison of 41 module detection methods, spanning clustering, 

biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that 

decomposition methods outperform current clustering methods. Our work provides a first comprehensive evaluation to 

guide the biologist in their choice but also serves as a protocol for the evaluation of novel module detection methods. 

INTRODUCTION 

Module detection methods form a cornerstone in the 

analysis of genome wide gene expression compendia. 

Modules in this context are defined as groups of genes 

with a similar expression profile, and therefore frequently 

share certain functions, are co-regulated and cooperate to 

produce a certain phenotype. 

Over the last years, dozens of module detection methods 

have been developed, which can be classified in five 

different categories. The most popular method is 

undoubtedly clustering, which will group genes into 

modules based on global similarity in expression profiles. 

Within the transcriptomics community these methods have 

received a considerable amount of criticism. This is 

mainly due to three drawbacks: (i) clustering cannot detect 

so called local co-expression effects, (ii) most clustering 

methods are unable to detect overlapping modules and (iii) 

clustering methods do not model the underlying gene 

regulatory network. Alternative approaches have therefore 

been developed which either handle both overlap and local 

co-expression (biclustering and decomposition) or model 

the gene regulatory network (direct network inference and 

iterative network inference). 

Given this methodological diversity, it is important that 

existing and new approaches are evaluated on robust and 

objective benchmarks. However, evaluation studies in the 

past were limited in the number of methods, used synthetic

data or did not correctly assess the balance between false

positives and false negatives. In this study we therefore

provide a novel unbiased and comprehensive evaluation

strategy (Figure 1), and use it to evaluate 41 state-of-the-art

module detection methods. 

METHODS 

The key to our approach is that we use gold standard

regulatory networks to define sets of known modules. 

These can be used to directly assess the sensitivity and 

specificity of the different module detection methods. We 

used four different large-scale gene expression compendia,

two from E. coli and two from S. cerevisiae. For each of

these organisms a substantial part of the regulatory 

network is already known, either based on the integration 

of small-scale experiments or based on large, genome-wide

datasets. We use these networks to define groups of

known modules by looking at genes which either

share one regulator, share all regulators or are strongly

interconnected. We used four different metrics to compare 

a set of observed modules with known modules: recovery 

and recall control the type II errors, while the relevance 

and specificity control the type I errors. 
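
As an illustration of how known modules and two of these metrics could be computed, the sketch below derives modules from genes sharing one regulator and scores recovery and relevance as best-match Jaccard averages. This is one common definition and is an assumption here; the abstract does not give the exact formulas, and the function names are hypothetical.

```python
from collections import defaultdict


def modules_sharing_a_regulator(edges):
    """Group target genes by regulator; `edges` holds (regulator, target) pairs."""
    modules = defaultdict(set)
    for regulator, target in edges:
        modules[regulator].add(target)
    # Keep only groups with at least two genes.
    return [genes for genes in modules.values() if len(genes) > 1]


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_score(queries, targets):
    """Mean over `queries` of the best Jaccard overlap with any of `targets`."""
    if not queries or not targets:
        return 0.0
    return sum(max(jaccard(q, t) for t in targets) for q in queries) / len(queries)


def recovery(known, observed):
    """Low when known modules are missed (type II errors)."""
    return best_match_score(known, observed)


def relevance(known, observed):
    """Low when observed modules match nothing known (type I errors)."""
    return best_match_score(observed, known)


# Toy regulatory network and a toy set of detected modules.
toy_edges = [("r1", "a"), ("r1", "b"), ("r1", "c"), ("r2", "d"), ("r2", "e")]
toy_known = modules_sharing_a_regulator(toy_edges)
toy_observed = [{"a", "b"}, {"d", "e"}, {"x", "y"}]
print(recovery(toy_known, toy_observed), relevance(toy_known, toy_observed))
```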

Parameter tuning is a necessary but often overlooked 

challenge of module detection methods. As default 

parameters of a tool are usually optimized for some 

specific test cases by the authors, they do not necessarily 

reflect general good performance on other datasets. On the 

other hand, one should be careful of overfitting parameters 

on specific characteristics of the data, as such parameters 

will lead to suboptimal results when using the same 

parameter settings on other datasets. In this study we first 

optimized parameters using a grid-based approach. Next, 

to avoid overfitting we used the optimal parameters on one 

dataset to score the performance on another dataset, in an 

approach akin to cross-validation. 
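
The cross-dataset tuning scheme can be sketched as follows. Here `run_and_score` is a hypothetical callback that runs a module detection method on a dataset with given parameters and returns a score (higher is better); the grid and the dataset names are placeholders.

```python
from itertools import product


def cross_dataset_scores(datasets, grid, run_and_score):
    """Tune on one dataset, evaluate on every other dataset.

    Returns {(train, test): score of the train-optimal parameters on test}."""
    names = sorted(grid)
    settings = [
        dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))
    ]
    results = {}
    for train in datasets:
        # Grid search: best parameter setting on the training dataset.
        best = max(settings, key=lambda s: run_and_score(train, s))
        for test in datasets:
            if test != train:
                results[(train, test)] = run_and_score(test, best)
    return results


# Toy scoring function with a different optimum per dataset.
toy_optimum = {"ecoli_1": 4, "yeast_1": 6}


def toy_score(dataset, params):
    return -(params["k"] - toy_optimum[dataset]) ** 2


print(cross_dataset_scores(list(toy_optimum), {"k": [2, 4, 6]}, toy_score))
```

The score reported for a method is thus never obtained with parameters tuned on the same dataset, which is the point of the approach described above.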

RESULTS & DISCUSSION

We evaluated 41 different module detection methods 

covering all five approaches. Overall, our analysis showed 

that certain decomposition methods, those based on

independent component analysis, outperform current state-of-the-art

clustering methods. However, despite their

theoretical advantages, neither biclustering nor network 

inference methods are able to outperform clustering 

methods. Importantly, our results are stable across datasets, 

module definitions and scoring metrics, demonstrating the 

robustness of our evaluation methodology. 

FIGURE 1. Overview of our evaluation methodology. 

The applications of our work are twofold. First, if local co-expression

and overlap are of interest, we discourage the 

use of biclustering methods and suggest the use of 

decomposition instead. Secondly, we provide a new 

comprehensive evaluation methodology which can be used 

to compare novel methods with the current state-of-the-art. 






O4. LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL 

PATTERNS WITH POTENTIAL DELAYS 

Joana P. Gonçalves 1,2* & Sara C. Madeira 3,4 . 

Pattern Recognition and Bioinformatics Group, Department of Intelligent Systems, Delft University of Technology 1 ; 

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 2 ; Department of Computer Science and 

Engineering, Instituto Superior Técnico, Universidade de Lisboa 3 ; INESC-ID 4 . * research@joanagoncalves.org 

Temporal transcriptomes can provide valuable insight into the dynamics of transcriptional response and gene regulation. 

In particular, many studies seek to uncover functional biological units by identifying and grouping genes with common 

expression patterns. Nevertheless, most analytical tools available for this purpose fall short in their ability to consider 

biologically reasonable models and adequately incorporate the temporal dimension. Each biological task is likely to 

occur within a time period that does not necessarily span the whole time course of the experiment, and genes involved in 

such a task are expected to coordinate only while the task is ongoing. LateBiclustering is an efficient algorithm to 

identify this type of coordinated activity, while allowing genes to participate in distinct biological tasks with multiple 

partners over time. Additionally, LateBiclustering is able to capture temporal delays suggestive of transcriptional 

cascades: one of the hallmarks of gene expression and regulation. 

INTRODUCTION 

The discovery of patterns in temporal transcriptomes 

exposes gene expression dynamics and contributes to 

understanding the machinery involved in its modulation.

Various analytical tools are employed in this regard. 

Differential expression summarizes an entire time course 

into one feature, thus lacking detail. Clustering

respects the chronological order, but focuses on global 

similarities and tends to identify rather broad patterns, 

associated with unspecific functions. Biclustering offers 

increased granularity by additionally searching for local 

patterns, but allows for arbitrary jumps in time, potentially

leading to patterns that are incoherent from a temporal 

perspective. 

METHODS 

LateBiclustering is an efficient algorithm for the 

identification of transcriptional modules, here termed 

LateBiclusters. Each LateBicluster is a group of genes 

showing a similar expression pattern with potential delays, 

within a particular time frame that does not necessarily 

span the whole time course of the transcriptome.

LateBiclustering only reports maximal LateBiclusters, that 

is, those that cannot be extended and are not fully 

contained in any other LateBicluster. 

LateBiclustering takes as input a gene-time expression 

matrix of real values. Each gene expression profile is first 

normalized to zero mean and unit standard deviation. A 

discretization is further applied to discern variations 

between consecutive time points into three levels: down-trend,

no-change and up-trend. Upon discretization each 

gene profile can be seen as a string. 
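
The normalization and discretization step can be sketched in Python as below. The symbols D, N and U and the threshold of 0.5 standard deviations are assumptions made for illustration; the abstract does not state how the three levels are delimited.

```python
from statistics import mean, pstdev


def discretize_profile(profile, threshold=0.5):
    """Turn one gene expression profile into a string over {D, N, U}.

    The profile is first scaled to zero mean and unit standard deviation;
    each symbol then describes the change between consecutive time points."""
    mu, sd = mean(profile), pstdev(profile)
    z = [(x - mu) / sd if sd else 0.0 for x in profile]
    symbols = []
    for previous, current in zip(z, z[1:]):
        delta = current - previous
        if delta > threshold:
            symbols.append("U")  # up-trend
        elif delta < -threshold:
            symbols.append("D")  # down-trend
        else:
            symbols.append("N")  # no-change
    return "".join(symbols)


print(discretize_profile([1.0, 2.0, 3.0, 3.0, 1.0]))
```

A profile with T time points yields a string of length T - 1, so each symbol position corresponds to a pair of consecutive time points.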

 

 

A generalized suffix tree is built to find common
patterns in the gene profiles. Internal nodes
satisfying certain properties are marked for their
potential to denote LateBiclusters.

When an internal node does not satisfy the basic
conditions for LateBicluster maximality, a
procedure is applied to remove occurrences
leading to non-maximal LateBiclusters. For this
purpose, LateBiclustering uses a bit array
representing the occurrences underlying each
internal node. During the maximality update
procedure, the bit array of the inspected node is
compared against those of internal children nodes
(right-max) and nodes from which the inspected
node receives suffix links (left-max).

Finally, LateBiclustering comes with different
heuristics to report a single pattern occurrence per
gene in each maximal LateBicluster. A heuristic
is necessary because there may be multiple
occurrences of a pattern in the profile of a given
gene, which is a direct consequence of allowing
the discovery of delayed patterns.
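
To make the notion of a shared pattern with delays concrete, the sketch below enumerates substrings by brute force. It is deliberately not the suffix-tree algorithm described above: it ignores maximality and runs in time quadratic in the profile length per gene, which is the kind of cost the suffix tree avoids. The function name and the D/N/U alphabet are hypothetical.

```python
from collections import defaultdict


def shared_patterns(strings, min_len=3, min_genes=2):
    """Map each pattern to {gene: [start positions]} for patterns that occur
    in at least `min_genes` genes. Different start positions across genes
    correspond to temporal delays."""
    occurrences = defaultdict(lambda: defaultdict(list))
    for gene, s in strings.items():
        for start in range(len(s)):
            for end in range(start + min_len, len(s) + 1):
                occurrences[s[start:end]][gene].append(start)
    return {
        pattern: dict(genes)
        for pattern, genes in occurrences.items()
        if len(genes) >= min_genes
    }


# g2 shows the same pattern as g1, delayed by one time point.
toy_strings = {"g1": "UUDN", "g2": "NUUD", "g3": "DDDD"}
print(shared_patterns(toy_strings))
```

A gene can list several start positions for one pattern, which is why a heuristic is needed to report a single occurrence per gene.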

RESULTS & DISCUSSION

LateBiclustering is the first efficient algorithm suitable for 

the discovery of biclusters with temporal delays. It runs in 

polynomial time, while previous methods yielded 

exponential time complexity. LateBiclustering was able to 

find planted biclusters in synthetic data. It also identified 

biologically relevant LateBiclusters associated with 

Saccharomyces cerevisiae’s response to heat stress, and 

interesting time-lagged responses. 

FIGURE 1. Schematic of the LateBiclustering algorithm. 

REFERENCES 

Gonçalves JP & Madeira SC. IEEE/ACM Transactions on 

Computational Biology and Bioinformatics, 11(5), 801–813 

(2014). 






O5. INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL 

RNA 

Robrecht Cannoodt 1,2,3* , Katleen De Preter 3 & Yvan Saeys 1,2 . 

Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center, Ghent 1 ; Department of 

Respiratory Medicine, Ghent University Hospital, Ghent 2 ; Center of Medical Genetics, Ghent University Hospital, 

Ghent 3 . * robrecht.cannoodt@ugent.be 

With the advent of single cell RNA sequencing, it is now possible to analyse the transcriptomes of hundreds of individual 

cells in an unbiased manner. Reconstructing the developmental chronology of differentiating cells is a challenging task, 

and doing so in an unsupervised and robust manner is a hitherto untackled problem. We developed a truly unsupervised

developmental chronology inference technique, and evaluated its performance and robustness using multiple datasets. 

INTRODUCTION 

Early attempts at inferring the chronologies of single cells 

include MONOCLE (Trapnell et al., 2014) and NBOR

(Schlitzer et al., 2015). However, these techniques are not 

unsupervised as they require knowledge of the cell type of 

each cell prior to analysis, which biases the results to prior 

knowledge and possibly obstructs the discovery of novel 

subpopulations. 

METHODS 

Our approach consists of four steps. 

In the first step, the feature space (~30000 genes) is 

reduced to three dimensions. 

Secondly, outliers are detected and removed, using a

K-nearest neighbour approach. After outlier removal, the

original feature space is again reduced to three dimensions. 

Next, a nonparametric nonlinear curve is iteratively fitted 

to the data. 

Finally, each cell is projected onto the curve, thus 

resulting in a cell chronology. 
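
Two of the four steps, outlier scoring and projection onto the curve, can be sketched as follows. The abstract does not specify the dimensionality reduction or the curve-fitting procedure, so both are taken as given here: cells are assumed to be already embedded in three dimensions and the fitted curve is assumed to be available as a polyline. The scoring rule (mean distance to the k nearest neighbours) and all names are assumptions.

```python
from math import dist


def knn_outlier_scores(points, k=3):
    """Mean distance of each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


def project_onto_polyline(point, curve):
    """Arc-length position along `curve` of the curve point closest to `point`."""
    best_distance, best_position, travelled = float("inf"), 0.0, 0.0
    for a, b in zip(curve, curve[1:]):
        segment = [bi - ai for ai, bi in zip(a, b)]
        length_sq = sum(c * c for c in segment)
        if length_sq == 0:
            t = 0.0
        else:
            dot = sum((pi - ai) * c for pi, ai, c in zip(point, a, segment))
            t = max(0.0, min(1.0, dot / length_sq))  # clamp to the segment
        closest = [ai + t * c for ai, c in zip(a, segment)]
        length = length_sq ** 0.5
        d = dist(point, closest)
        if d < best_distance:
            best_distance, best_position = d, travelled + t * length
        travelled += length
    return best_position


def order_cells(points, curve):
    """Indices of the cells sorted by their position along the curve."""
    return sorted(range(len(points)), key=lambda i: project_onto_polyline(points[i], curve))


toy_curve = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
toy_cells = [(0.8, 0.1, 0.0), (0.1, 0.0, 0.0), (1.0, 0.8, 0.0)]
print(order_cells(toy_cells, toy_curve))
```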

RESULTS & DISCUSSION

A single-cell RNAseq dataset (Schlitzer et al., 2015) 

contains profiles of DC progenitor cells. These cells are

expected to differentiate from MDP to CDP to PreDC. Our 

method is able to intuitively visualise known population 

groups (Figure 1), as well as infer the developmental 

chronology of the individual cells (Figure 2). 

We evaluated our method on four datasets (Shalek et al., 

2014; Trapnell et al., 2014; Buettner et al., 2015 and 

Schlitzer et al., 2015), and found it to perform better and 

more robustly than the existing methods MONOCLE and

NBOR. 

This approach opens opportunities to further study known 

mechanisms or investigate unknown key regulatory 

structures in cell differentiation, or detect novel 

subpopulations in a truly unsupervised manner. 

REFERENCES 

Buettner F et al. Nature Biotechnology 33, 155-160 (2015). 

Schlitzer A et al. Nature Immunology 16, 718-726 (2015). 

Shalek A et al. Nature 509, 363-369 (2014). 

Trapnell C et al. Nature Biotechnology 32, 381-386 (2014). 

FIGURE 1. After feature space reduction and outlier detection of 244 DC 

progenitor cells (Schlitzer et al., 2015), our method can intuitively 

visualise known populations. 

FIGURE 2. An iterative curve fitting results in a smooth curve reflecting 

the developmental chronology. After projecting each cell to the curve, 

regulatory patterns in expression which correlate with this timeline can 

be investigated. 






O6. COMBINING TREE-BASED AND DYNAMICAL SYSTEMS 

FOR THE INFERENCE OF GENE REGULATORY NETWORKS 

Vân Anh Huynh-Thu 1* & Guido Sanguinetti 2,3 . 

GIGA-R & Department of Electrical Engineering and Computer Science, University of Liège 1 ; School of Informatics, 

University of Edinburgh 2 ; SynthSys – Systems and Synthetic Biology, University of Edinburgh 3 . * vahuynh@ulg.ac.be 

INTRODUCTION 

Reconstructing the topology of gene regulatory networks 

(GRNs) from time series of gene expression data remains 

an important open problem in computational systems 

biology. Current approaches can be broadly divided into 

model-based and model-free approaches, and face one of 

two limitations: model-free methods are scalable but 

suffer from a lack of interpretability, and cannot in general 

be used for out of sample predictions. On the other hand, 

model-based methods focus on identifying a dynamical 

model of the system; these are clearly interpretable and 

can be used for predictions; however, they rely on strong

assumptions and are typically very demanding 

computationally. Here, we aim to bridge the gap between 

model-based and model-free methods by proposing a 

hybrid approach to the GRN inference problem, called 

Jump3 (Huynh-Thu & Sanguinetti, 2015). Our approach 

combines formal dynamical modelling with the efficiency 

of a nonparametric, tree-based method, allowing the 

reconstruction of GRNs of hundreds of genes. 

METHODS 

Gene expression model. At the heart of the Jump3 

framework, we use the on/off model of gene expression 

(Ptashne & Gann, 2002), where the rate of transcription of 

a gene can vary between two levels depending on the 

activity state μ of the promoter of the gene. The expression 

x of a gene is modelled through the following stochastic 

differential equation: 

$$dx_i = \left(A_i \mu_i(t) + b_i - \lambda_i x_i\right)dt + \sigma\,d\omega(t),$$

where subscript $i$ refers to the $i$-th target gene. Here, the

promoter state $\mu_i(t)$ is a binary variable (the promoter is

either active or inactive) that depends on the expression

levels of the transcription factors (TFs) that bind to the

promoter. $A_i$, $b_i$ and $\lambda_i$ are kinetic parameters, and the term

$\sigma\,d\omega(t)$ represents a white noise-driving process with

variance $\sigma^2$.
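
To see how the on/off model behaves, the stochastic differential equation above can be simulated with a simple Euler-Maruyama scheme. This is an illustrative sketch only, not part of Jump3: the step size, the seed, the parameter values and the function name are assumptions.

```python
import random
from math import sqrt


def simulate_on_off(mu, A, b, lam, sigma, x0=0.0, dt=0.01, seed=0):
    """Euler-Maruyama simulation of dx = (A*mu(t) + b - lam*x) dt + sigma dw.

    `mu` is a sequence of promoter states (0 or 1), one per time step."""
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for state in mu:
        drift = A * state + b - lam * x
        # Brownian increment over a step of length dt has variance dt.
        x += drift * dt + sigma * sqrt(dt) * rng.gauss(0.0, 1.0)
        trajectory.append(x)
    return trajectory


# Promoter off for 500 steps, then on for 500 steps (toy parameters).
states = [0] * 500 + [1] * 500
path = simulate_on_off(states, A=2.0, b=0.5, lam=1.0, sigma=0.1)
print(round(path[500], 2), round(path[-1], 2))
```

Without noise the expression relaxes towards b/λ while the promoter is off and towards (A + b)/λ while it is on, which is what makes the latent promoter state recoverable from the expression time series.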

Network reconstruction with jump trees. Recovering 

the regulatory links pointing to gene i amounts to finding 

the genes whose expression is predictive of the promoter 

state $\mu_i$. To achieve this goal, we propose a procedure that

learns, for each target gene $i$, an ensemble of decision trees

predicting the promoter state $\mu_i$ at any time $t$ from the

expression levels of the candidate regulators at the same

time $t$. However, standard tree-based methods cannot be

applied here since the output $\mu_i(t)$ is a latent variable. We

therefore propose a new decision tree algorithm called 

“jump tree”, which splits the observations by maximising 

the marginal likelihood of the dynamical on/off model. 

The learned tree-based model is then used to derive an 

importance score for each candidate regulator, computed 

as the sum of the likelihood gains that are obtained at all 

the tree nodes where this regulator was selected to split the 

observations. The importance of a candidate regulator j is 

used as weight for the putative regulatory link of the 

network that is directed from gene j to gene i. 
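
The importance score can be illustrated with a small aggregation sketch. It assumes the trees have already been learned and that each split node is available as a (regulator, likelihood gain) pair; how gains are combined across the trees of an ensemble is an assumption here (a plain sum), and all names are hypothetical.

```python
from collections import defaultdict


def regulator_importances(split_nodes):
    """Sum the likelihood gains of all split nodes that use each regulator.

    `split_nodes` is an iterable of (regulator, likelihood_gain) pairs
    collected from the trees learned for one target gene."""
    scores = defaultdict(float)
    for regulator, gain in split_nodes:
        scores[regulator] += gain
    return dict(scores)


def ranked_links(importances_by_target):
    """Flatten {target: {regulator: score}} into links sorted by weight."""
    links = [
        (regulator, target, score)
        for target, importances in importances_by_target.items()
        for regulator, score in importances.items()
    ]
    return sorted(links, key=lambda link: link[2], reverse=True)


# Toy split nodes for two target genes.
toy = {
    "gene1": regulator_importances([("TF1", 2.0), ("TF2", 0.5), ("TF1", 1.5)]),
    "gene2": regulator_importances([("TF2", 1.0)]),
}
print(ranked_links(toy))
```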

RESULTS & DISCUSSION

We evaluated Jump3 on the networks of the DREAM4 In 

Silico Network challenge (Prill et al., 2010). For each 

network topology, two types of simulated expression data 

were used: data simulated using the on/off model (toy 

data) and the time series data that was provided in the 

context of the DREAM4 challenge. We compared Jump3 

to other GRN inference methods: two model-free methods, 

which are time-lagged variants of GENIE3 (Huynh-Thu et 

al., 2010) and CLR (Faith et al., 2007) respectively; two 

model-based methods, namely Inferelator (Greenfield et 

al., 2010) and TSNI (Bansal et al., 2006), and G1DBN 

(Lèbre, 2009), a method based on dynamic Bayesian 

networks. Areas Under the Precision-Recall curves 

(AUPRs) obtained for size-100 networks are shown in 

Table 1. Jump3 yields the highest AUPR in the case of the 

toy data. As expected, its performance decreases when the 

networks are inferred from the DREAM4 data, due to the 

mismatch between the on/off model and the one used to 

simulate the data. However, Jump3 still outperforms the 

other methods. 

Method        Toy              DREAM4
Jump3         0.272 ± 0.060    0.187 ± 0.058
GENIE3-lag    0.114 ± 0.010    0.176 ± 0.056
CLR-lag       0.088 ± 0.008    0.169 ± 0.047
Inferelator   0.069 ± 0.006    0.144 ± 0.036
TSNI          0.020 ± 0.003    0.042 ± 0.010
G1DBN         0.104 ± 0.024    0.114 ± 0.043

TABLE 1. Comparison of network inference methods (mean AUPR and standard deviation).
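
For readers unfamiliar with the metric, the area under the precision-recall curve of a ranked list of predicted links can be estimated as below. The sketch uses average precision, a common step-wise estimate; whether the values in Table 1 were computed with this or another interpolation is not stated in the abstract, and the names are hypothetical.

```python
def average_precision(ranked_links, true_links):
    """Step-wise estimate of the area under the precision-recall curve.

    `ranked_links` lists predicted (regulator, target) pairs from most to
    least confident; `true_links` is the gold standard set of pairs."""
    true_links = set(true_links)
    if not true_links:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, link in enumerate(ranked_links, start=1):
        if link in true_links:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    # True links that were never predicted contribute zero precision.
    return precision_sum / len(true_links)


toy_ranking = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
toy_truth = {("a", "b"), ("e", "f")}
print(average_precision(toy_ranking, toy_truth))
```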

We also applied Jump3 to gene expression data from 

murine bone marrow-derived macrophages treated with 

interferon gamma (Blanc et al., 2011). Several of the hub 

TFs in the predicted network have biologically relevant 

annotations. They include interferon genes, one gene 

associated with cytomegalovirus infection, and cancer-associated

genes, showing the potential of Jump3 for 

biologically meaningful hypothesis generation. 

REFERENCES 

Bansal M et al. Bioinformatics 22, 815-822 (2006). 

Blanc M et al. PLoS Biol 9, e1000598 (2011). 

Faith JJ et al. PLoS Biol 5, e8 (2007). 

Greenfield A et al. PLoS ONE 5, e13397 (2010).

Huynh-Thu VA & Sanguinetti G. Bioinformatics 31, 1614-1622 (2015). 

Huynh-Thu VA et al. PLoS ONE 5, e12776 (2010). 

Lèbre S. Stat Appl Genet Mol Biol 8, Article 9 (2009). 

Prill RJ et al. PLoS ONE 5, e9202 (2010). 

Ptashne M & Gann A. Genes and Signals. Cold Spring Harbor

Laboratory Press (2002). 






O7. MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT

STIMULATION AND GSK3 INHIBITION 

Annika Jacobsen 1 , Nika Heijmans 2 , Renée van Amerongen 2 , Folkert Verkaar 3 ,

Martine J. Smit 3 , Jaap Heringa 1 & K. Anton Feenstra 1 *. 

1 Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands; 2 Van Leeuwenhoek Centre 

for Advanced Microscopy and Section of Molecular Cytology, Swammerdam Institute for Life Sciences, University of 

Amsterdam, The Netherlands; 3 Division of Medicinal Chemistry, VU University Amsterdam, The Netherlands. 

*k.a.feenstra@vu.nl 

The Wnt/β-catenin signaling pathway is crucial for stem cell self-renewal, proliferation and differentiation. Hyperactive 

Wnt/β-catenin signaling caused by genetic alterations plays an important role in oncogenesis. In our newly developed 

Petri net model, GSK3 inhibition leads to significantly higher pathway activation (high β-catenin levels) compared to 

WNT stimulation, which is confirmed experimentally by TCF/LEF luciferase reporter assays. Using this validated model

we can now simulate changes in Wnt/β-catenin signaling resulting from different mutations found in breast and 

colorectal cancer. We propose that this model can be used further to investigate different players affecting Wnt/β-catenin 

signaling during oncogenic transformation and the effect of drug treatment. 

WNT/β-CATENIN

Wnt/β-catenin signaling is important for stem cell 

maintenance and developmental processes and is highly 

conserved in all multicellular organisms (1, 2). The 

pathway regulates the expression of specific target genes 

by changing the levels of the transcriptional co-activator

β-catenin, which activates the TCF/LEF transcription

factors. Wnt/β-catenin signaling is active in stem cells

located in Wnt-rich environments.

APC and AXIN are key proteins of the destruction 

complex, which targets β-catenin for destruction. 

Mutations in APC, AXIN and β-catenin play important 

roles in oncogenesis (2, 3). To better understand the role of

the pathway in oncogenesis, we here create a Petri net (PN) model of the

Wnt/β-catenin signaling pathway that uses available

coarse-grained data, such as binary interactions and semi-quantitative

protein levels. Using this model and

validation experiments, we show how different strengths of

Wnt stimulation and GSK3 inhibition activate signaling 

over time. 

PETRI-NET MODELLING 

We built a PN model of Wnt/β-catenin signaling describing 

the logic of known (inter)actions, cf. our previous 

work (5). In a PN, a place represents an entity (e.g. gene), 

a transition indicates the activity occurring between the 

places (e.g. gene expression), and these are connected by 

directed edges called arcs that represent their interactions 

(e.g., activation of gene expression by a protein). 
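The place/transition/arc vocabulary can be illustrated with a minimal sketch. The toy fragment below (the places `Wnt`, `AXIN` and `AXIN_membrane` and the transition `sequester_AXIN`) is a hypothetical illustration, not the authors' network, and the firing rule is the textbook Petri net rule rather than the specific execution semantics of reference (5).

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """A transition with weighted input and output arcs."""
    name: str
    inputs: dict   # place name -> tokens consumed per firing
    outputs: dict  # place name -> tokens produced per firing


def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in transition.inputs.items())


def fire(marking, transition):
    """Return the new marking after firing one enabled transition."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition.name} is not enabled")
    new_marking = dict(marking)
    for place, n in transition.inputs.items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition.outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking


# Hypothetical toy fragment: Wnt moves AXIN to the membrane and is not used up,
# so Wnt appears on both an input and an output arc.
sequester_axin = Transition(
    name="sequester_AXIN",
    inputs={"Wnt": 1, "AXIN": 1},
    outputs={"Wnt": 1, "AXIN_membrane": 1},
)
marking = {"Wnt": 2, "AXIN": 1, "AXIN_membrane": 0}
after = fire(marking, sequester_axin)
```

Token counts play the role of the semi-quantitative levels mentioned above: once the single `AXIN` token has moved, the transition is no longer enabled.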

TRANSCRIPTION AND PROTEIN ASSAYS 

TCF/LEF transcription was measured by TOPFLASH

reporter activity at several time points and at different 

concentrations of Wnt3a stimulation and GSK3 inhibition 

by CHIR99021. Active and total β-catenin (CTNNB1) 

levels were measured by Western blot. 

VALIDATED ACTIVATION & INHIBITION 

We simulate the model with initial Wnt and GSK3 token 

levels ranging from 0 to 5 to represent addition of Wnt and 

inhibition of GSK3. Figure 1 shows the four different

β-catenin responses for Wnt addition (purple) and GSK3

inhibition (green). At low GSK3 levels, β-catenin increases

linearly, but at high GSK3 levels β-catenin remains low.

At high Wnt levels, β-catenin shows a transient response, 

with the peak height increasing with Wnt levels. The 

increase of β-catenin is due to sequestration of AXIN to 

the cell membrane, which inactivates the destruction 

complex. Increase in β-catenin activates transcription of 

AXIN2 which triggers the negative feedback. 

FIGURE 1. Pathway response for different levels of Wnt and activity of 

GSK3. When adding Wnt, the pathway transiently activates but GSK3 

inhibition permanently activates. 

TCF/LEF reporter assay validation experiments for both 

perturbations show that transcriptional activity of 

TCF/LEF is both dosage- and time-dependent,

corresponding well with the model for GSK3 inhibition. Wnt3a stimulation,

on the other hand, does activate expression, but we 

do not observe the β-catenin dosage or time effect 

predicted by our model. Measuring β-catenin by Western 

blot reveals a consistent increase upon pathway activation;

however, protein levels and changes are on the border of

experimental sensitivity. 

In conclusion, our Petri net model recapitulates much of 

the known behavior of the Wnt/β-catenin pathway upon 

Wnt stimulation and GSK3 inhibition, and hints at 

subtleties in the mechanism that will help us gain further 

understanding of the role of this pathway in development

and oncogenesis. 

REFERENCES 

1. Clevers & Nusse (2012) Cell. 149:1192-1205 

2. Holstein (2012) Cold Spring Harb Perspect Biol. 4:a007922 

3. MacDonald, Tamai & He (2009) Dev Cell. 17:9-26 

4. Klaus & Birchmeier (2008) Nat. Rev. Cancer. 8:387-398 

5. Bonzanni et al., (2009) Bioinformatics. 25:2049-2056 






O8. RANKED TILING BASED APPROACH TO DISCOVERING PATIENT 

SUBTYPES 

Thanh Le Van 1,* , Jimmy Van den Eynden 3 , Dries De Maeyer 2 , Ana Carolina Fierro 5 , Lieven Verbeke 5 , Matthijs van 

Leeuwen 4 , Siegfried Nijssen 1,4 , Luc De Raedt 1 & Kathleen Marchal 5,6 . 

Department of Computer Science 1 , Centre of Microbial and Plant Genetics 2 , KULeuven, Belgium; Department of 

Medical Biochemistry, University of Gothenburg 3 , Sweden; Leiden Institute for Advanced Computer Science 4 , 

Universiteit Leiden, The Netherlands; Department of Plant Biotechnology and Bioinformatics 5 , Department of 

Information Technology, iMinds 6 , Ghent University, Belgium. * thanh.levan@cs.kuleuven.be 

Cancer is a heterogeneous disease consisting of many subtypes that usually have both shared and distinguishing 

mechanisms. To derive good subtypes, it is essential to have a computational model that can score their homogeneity 

from different angles, for example, mutated pathways and gene expression. In this paper, we introduce our ongoing work 

which studies a constraint-based optimisation model to discover patient subtypes as well as their perturbed pathways 

from mutation, transcription and interaction data. We propose a way to solve the optimisation problem based on 

constraint programming principles. Experiments on a TCGA breast cancer dataset demonstrate the promise of the 

approach. 

INTRODUCTION 

Discovering patient subtypes and understanding their 

mechanisms are essential to provide precise treatments to 

patients. There have been efforts to understand how 

mutation causes subtypes, such as the work by Hofree et

al. (2013). However, to the best of our knowledge,

it is still an open question how to combine mutation

and expression data to derive good subtypes. Therefore,

we study a new computational model that can discover

subtypes as well as their specific mutated genes and 

expressed genes from mutation, transcription and 

interaction data. 

METHODS 

We conjecture that a subtype consists of a number of 

patients who have the same set of differentially expressed 

genes and a set of mutated genes that hit the same 

pathways. 

To find both mutations and expressions of patient subtypes, 

we extend our recent ranked tiling method (Le Van et al., 

2014). Ranked tiling is a data mining method proposed to 

mine regions with high average rank values in a rank 

matrix. In this type of matrix, each row is a complete 

ranking of the columns. We find that rank matrices are a 

good abstraction for numeric data and are useful to 

integrate datasets that are at different scales. 

To apply the ranked tiling method, we first transform the 

given numeric expression matrix, where rows are 

expressed genes and columns are patients, into a ranked 

expression matrix. Then, we search for a region in the 

transformed matrix that has high average rank scores. 

However, unlike the original ranked tiling method, we

impose a further constraint that the columns (patients) of

the region should also have a number of mutated genes

that have high rank scores with respect to a

network model. We formalise this as a constraint

optimisation problem and use a constraint solver to solve 

it. 
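The rank transformation and the score of a candidate region can be sketched as follows. This is a minimal illustration only: the function names `to_rank_matrix` and `tile_average_rank` and the toy data are hypothetical, ties are broken by column order (an assumption), and the sketch scores a given region rather than performing the constraint-based search or handling the mutation constraint.

```python
def to_rank_matrix(matrix):
    """Replace each row by the ranks of its values (1 = smallest value)."""
    ranked = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for rank, j in enumerate(order, start=1):
            ranks[j] = rank
        ranked.append(ranks)
    return ranked


def tile_average_rank(rank_matrix, rows, cols):
    """Average rank inside the region defined by the given rows and columns."""
    total = sum(rank_matrix[i][j] for i in rows for j in cols)
    return total / (len(rows) * len(cols))


# Toy expression matrix: rows are genes, columns are patients.
expression = [
    [0.1, 2.5, 1.7, 0.3],
    [5.0, 9.1, 8.4, 4.2],
    [3.3, 0.2, 0.9, 2.8],
]
ranks = to_rank_matrix(expression)
score = tile_average_rank(ranks, rows=[0, 1], cols=[1, 2])
```

Because every row is ranked independently, genes measured on very different scales (rows 0 and 1 above) contribute comparably to the score of a region.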


We apply our method to a TCGA breast cancer dataset and

discover eight subtypes. Compared to PAM50 annotations,

our method divides the Basal subtype into three sub-groups

named S2, S3 and S6. The LumA subtype is divided into

four smaller groups, namely S1, S4, S7 and S8. Finally, our

method could recover the Her2 subtype in S5.

To validate the mined subtypes in the patient dimension, 

we assume PAM50 annotations are true labels for them. 

Then, grouping patients into subtypes can be seen as a 

multi-class prediction problem, for which we can calculate 

the F1 score to measure the average accuracy. We also

compare our scores with state-of-the-art methods, including

iCluster+ (Mo, Q. et al., 2013), NBS (Hofree et al., 2013)

and SNF (Wang, B. et al., 2014). The result (not shown)

illustrates that our subtypes are more homogeneous than 

the ones produced by iCluster+ and NBS and are 

comparable to those by SNF. 
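As an illustration of this evaluation, the sketch below computes a macro-averaged F1 score. Two points are assumptions rather than details given in the abstract: that the per-class F1 scores are averaged with equal weight, and that each mined subtype has already been mapped to a PAM50 label. The labels in the example are made up.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up example: six patients, one LumA patient assigned to a Her2 group.
pam50 = ["Basal", "Basal", "LumA", "LumA", "Her2", "Her2"]
mined = ["Basal", "Basal", "LumA", "Her2", "Her2", "Her2"]
f1 = macro_f1(pam50, mined)
```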

To validate the mined subtypes in the gene dimension, we 

perform hypergeometric tests to see how their mutated genes

and expressed genes are related to cancer pathways.

Figure 1 is a heatmap showing the log10 p-values

of the tests. In this figure, we can see that the discovered

subtypes have specific perturbed pathways.
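Pathway over-representation is commonly assessed with a hypergeometric test. Assuming that is the kind of test used here, a minimal sketch of the p-value computation is shown below; the function name and the gene counts in the example are hypothetical.

```python
from math import comb, log10


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    favourable = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(population, selected)


# Hypothetical counts: 20 genes in total, 5 in the pathway,
# 4 genes mined for a subtype, 2 of which fall in the pathway.
p_value = hypergeom_enrichment_p(population=20, pathway_size=5, selected=4, overlap=2)
log_p = log10(p_value)  # the quantity a heatmap of log10 p-values would display
```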

FIGURE 1. Cancer pathway enrichment analysis using mined mutated 

genes and expressed genes of subtypes 

REFERENCES 

Hofree et al., Nat Methods 10(11), 1108–15 (2013). 

Le Van et al., ECML/PKDD 2014 (2), 98–113 (2014).

Mo, Q. et al., PNAS 110(11), 4245–50 (2013).

Wang, B. et al., Nat Methods 11(3), 333–7 (2014).






O9. DEVELOPMENT OF A DNA METHYLATION-BASED SCORE 

REFLECTING TUMOUR INFILTRATING LYMPHOCYTES 

Martin Bizet 1,2,3*# , Jana Jeschke 1# , Christine Desmedt 4 , Emilie Calonne 1 , Sarah Dedeurwaerder 1 , 

Gianluca Bontempi 2,3 , Matthieu Defrance 1,2 , Christos Sotiriou 4 and Francois Fuks 1 

Laboratory of Cancer Epigenetics, Faculty of Medicine, Université Libre de Bruxelles 1 ; Interuniversity Institute of 

Bioinformatics in Brussels, Université Libre de Bruxelles & Vrije Universiteit Brussel 2 ; Machine Learning Group, 

Computer Science Department, Université Libre de Bruxelles, Brussels 3 ; Breast Cancer Translational Research 

Laboratory, Jules Bordet Institute, Université Libre de Bruxelles 4 ; # These authors contributed equally to this work; 

* mbizet@ulb.ac.be 

Tumour infiltrating lymphocytes (TIL) are increasingly recognised as one of the key features to predict outcome and

therapy response in malignancies. However, measuring quantities of TIL remains challenging since it relies on subjective 

and spatially-restricted measurements from a pathologist. In this study we used genome-scale DNA-methylation profiles 

from breast tumours to develop a so-called MeTIL score, which reflects TIL level within whole-tumour samples. We 

demonstrate the robustness to noise of the MeTIL score using simulated data as well as the ability of the MeTIL score to 

sensitively measure TIL in patient samples and to improve prediction of outcome. 

INTRODUCTION 

Breast cancer (BC) is one of the most common and 

deadliest diseases in women from Western countries. 

Tumour infiltrating lymphocytes (TIL) have emerged as one of

the key features to predict outcome and response to

treatment in this disease [1]. However, the measurement of

TIL levels remains challenging because it relies on manual

readings of a tumour slide by a pathologist, which

is subjective by nature and does not necessarily reflect the

whole-tumour TIL content. In this study we took

advantage of the high tissue-specificity of DNA-methylation

patterns [2] to develop a so-called MeTIL

score, which predicts the amount of lymphocytes within 

the tumour. 

METHODS 

The MeTIL score was developed in three key steps:

1. We first used genome-scale DNA-methylation

profile data from 11 cell lines (8 normal or

cancerous epithelial breast and 3 T-lymphocytes)

to extract 29 cytosines specifically unmethylated

in T-lymphocytes (delta-beta < -0.8 and standard

deviation between groups < 0.1).

2. We then applied a cross-validated pipeline,

combining mRMR feature selection and a random-forest

algorithm, to 118 BC samples to extract a

minimal set of cytosines whose methylation level

is predictive of TIL quantities.

3. Finally, we used a “normalised PCA” approach to

compute a single MeTIL score from the

individual methylation values.

The robustness of the relation between the MeTIL score 

and TIL levels was also assessed using the Spearman

correlation computed from 10 000 simulations with

varying proportions of TIL (Fig. 1B&C). The simulated

data took two sources of noise into account:

• Technical noise, modeled as Gaussian noise.

• Perturbations due to the presence of other cell types

within the tumour microenvironment that

are not lymphocytic or epithelial, modeled by a

methylation value sampled randomly from the

array.
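The simulation scheme can be sketched as a weighted mixture, following the Figure 1C caption: each marker value is f1·M1 + f2·M2 + f3·M3 + e. Several choices in the sketch are assumptions made for illustration: the marker methylation levels (0.1 in lymphocytes, 0.9 in epithelial cells), drawing M3 uniformly from [0, 1] instead of from the array, the way the non-perturbed fraction is split between lymphocytes and epithelium, and the use of the negative mean methylation as a placeholder for the MeTIL score (the “normalised PCA” step is not reproduced).

```python
import random


def simulate_sample(til_fraction, other_fraction, markers, noise_sd, rng):
    """Simulate marker methylation as f1*M1 + f2*M2 + f3*M3 + e."""
    f3 = other_fraction                      # unknown cell types
    f1 = til_fraction * (1.0 - f3)           # lymphocytes
    f2 = (1.0 - til_fraction) * (1.0 - f3)   # epithelial cells
    values = []
    for m1, m2 in markers:
        m3 = rng.random()                    # stand-in for a random array value
        e = rng.gauss(0.0, noise_sd)         # technical Gaussian noise
        values.append(f1 * m1 + f2 * m2 + f3 * m3 + e)
    return values


def ranks(values):
    """Ranks starting at 1; ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def simulated_correlation(other_fraction, noise_sd, n_samples=200, seed=0):
    rng = random.Random(seed)
    markers = [(0.1, 0.9)] * 5  # (lymphocyte, epithelial) levels, illustrative
    til, score = [], []
    for _ in range(n_samples):
        t = rng.random()
        sample = simulate_sample(t, other_fraction, markers, noise_sd, rng)
        til.append(t)
        score.append(-sum(sample) / len(sample))  # placeholder score
    return spearman(til, score)


rho_clean = simulated_correlation(other_fraction=0.0, noise_sd=0.0)
rho_noisy = simulated_correlation(other_fraction=0.7, noise_sd=0.05)
```

Repeating `simulated_correlation` over a grid of noise levels and perturbing fractions would give a correlation map of the kind described for Figure 1B.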

Lastly, we measured TIL quantities with the MeTIL score 

in three independent BC cohorts and applied Cox

regression models to evaluate the prognostic value of the 

MeTIL score. 


We first applied a hierarchical clustering analysis and 

observed that BC samples with high TIL infiltration show 

a hypomethylated pattern for all MeTIL markers (Fig.1A). 

Furthermore, we demonstrated, using simulations, a strong

correlation between the MeTIL score and TIL levels, even

when a high level of noise (0.7 times the standard deviation)

and a high proportion of perturbing unknown cell types

(70%) were included in the model (Fig. 1B).


FIGURE 1. The MeTIL score reflects TIL levels. (A) Heatmap showing the

methylation values of the 5 MeTIL markers. A ‘TIL high’ group with a

hypomethylated pattern (orange) appeared. (B) Color map of the

Spearman correlation between MeTIL score and TIL level for increasing

noise (y-axis) and abundance of unknown cell types (x-axis) based on

simulations. (C) The methylation value of each MeTIL marker was simulated

as the sum of the methylation levels in lymphocytes (M1), epithelial cells

(M2) and other cell types (random value M3), weighted by their

proportions in the tissue (f1, f2, f3), plus Gaussian noise (e).

Finally, we observed consistent patterns of TIL levels 

within BC subtypes in independent cohorts suggesting the 

robust nature of our score to evaluate TIL levels. 

Furthermore, Cox regression analysis revealed a

prognostic value for the MeTIL score in triple negative 

and luminal BC (p-value < 0.05). 

REFERENCES 

[1] Loi, S., et al. Ann. Oncol. 25, 1544-1550 (2014).

[2] Jeschke, J., Collignon, E., Fuks, F. FEBS J. 282(9), 1801-1814 (2015).






O10. PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES 

USING MACHINE LEARNING TECHNIQUES 

Aliaksei S Vasilevich 1 *, Shantanu Singh 2 , Aurélie Carlier 1 & Jan de Boer 1 .

Laboratory for Cell Biology-inspired Tissue Engineering, MERLN Institute, Maastricht University 1 , Imaging Platform,

Broad Institute of MIT and Harvard 2 . *a.vasilevich@maastrichtuniversity.nl 

Topographical cues have been repeatedly shown to influence cell fate dramatically (Bettinger et al., 2009). This

phenomenon opens new opportunities to design the interaction between biomaterials and biological tissues in a 

predictable manner. Unfortunately, the exact mechanism of topographical control of cell behavior remains largely 

unknown. We have therefore developed a technology in our laboratory to determine an optimal surface topography for 

virtually any application in the biomedical field. Previously, we reported that we can control cell shape with our surfaces

in a predictable manner (Hulsman et al., 2015). Here we demonstrate that we can successfully predict not only cell shape,

but also cell response at the protein level based on the properties of our topographies. The results of our study show that we

are able to design materials for biomedical applications that require a particular cell behavior. 

INTRODUCTION 

The TopoChip, a micro topography screening platform, 

enables the assessment of cell response to 2176 unique 

topographies in a single high-throughput screen. The 

topographical features were randomly selected from an in 

silico library of more than 150 million topographies,

which were designed by an algorithm that synthesized

patterns based on simple geometric elements – circles,

triangles and rectangles (Unadkat et al., 2011). In our

previous studies, we have demonstrated that these surface

topographies exert a mitogenic effect on hMSCs (Unadkat

et al., 2011) and influence cell shape (Hulsman et al.,

2015). In this paper, we show that these topographies can

also be used to modulate ALP expression in human

mesenchymal stromal cells, as well as pluripotency in

human induced pluripotent stem (iPS) cells. We further

show that computational models can be built to predict

these protein levels using surface topography parameters. 

METHODS 

Cell response to topography was captured by high-content 

imaging. Using image analysis and data mining methods 

described previously (Hulsman et al., 2015),

multiparametric “profiles” of cellular response were 

obtained. Multiple replicates of each topography were 

used to estimate the median level of a cellular response of 

interest – either ALP in human mesenchymal stromal cells 

(hMSCs), or the median number of Oct4-positive cells in

a population of human induced pluripotent stem cells

(hIPSCs). We aimed to predict the cellular response based

on surface topography parameters using machine learning

methods. To train and validate these methods (specifically,

classifiers), the data were split into training and testing

sets in a 3:1 ratio. In the training step,

we performed a 10-fold cross-validation to obtain optimal 

parameters for each classifier. The caret package (Kuhn 

M., 2008) in R (R Core Team, 2015) was used to perform

the analysis. 
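The validation scheme (a 3:1 train/test split followed by 10-fold cross-validation on the training part) can be sketched independently of any particular library. The analysis itself was done with caret in R; the Python sketch below only illustrates how the indices are partitioned, and the helper names, the fixed seed and the example size of 200 samples are assumptions.

```python
import random


def train_test_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out a fraction as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


# Example: 200 topographies -> 150 for training, 50 held out for testing.
train_idx, test_idx = train_test_split(200)
splits = list(k_folds(train_idx, k=10))
```

Classifier parameters would be tuned using only the `splits` drawn from `train_idx`, so that `test_idx` remains untouched until the final accuracy estimate.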


In the first project, we conducted a screening on the 

TopoChip with hMSCs in order to find topographies that 

would be able to increase the ALP level, a protein that is 

an early marker of osteogenesis. We were able to 

successfully find such surfaces and confirm results 

experimentally (publication in preparation). To move 

further we decided to check how accurately we can make a 

prediction of ALP level in hMSCs based on topographical 

features. Focussing only on extreme examples, we 

selected 100 high- and low-scoring topographies and

used the model validation scheme described in Methods to 

find the most accurate binary classifier for our data set. 

We tested several classifiers and identified random forest 

as the most accurate, with an accuracy of 96% on

the held-out test set. 

In a second project, we aim to find a topography that will 

increase proliferation and pluripotency of hIPSCs. We 

used Oct4 as a marker of pluripotency. The screening was 

performed on one half of the TopoChip (1000+ surfaces),

which were then ranked based on the number of Oct4 

positive cells. One hundred high- and low-scoring surfaces 

were chosen to train a classifier. Using logistic regression,

we obtained 72% accuracy on a held-out test set. We used 

this model to predict surfaces that would increase 

pluripotency in hIPSCs among surfaces that were not 

included in the initial screening. Topographies were 

ranked according to their predicted probability score and 

the top 30 surfaces were chosen for experimental validation.

We found that 79% of selected surfaces were predicted 

accurately. 

In summary, the combination of our screening methods 

and machine learning algorithms opens new avenues to

design surfaces with desired properties for various

applications. Our next step will be to find a surface with 

maximum ALP level from our virtual library based on our 

screening data. 

REFERENCES 

Bettinger C J, Langer R, & Borenstein J T. “Engineering Substrate 

Micro- and Nanotopography to Control Cell Function.” Angewandte 

Chemie (International ed. in English) 48.30 (2009). 

Hulsman M et al., Analysis of high-throughput screening reveals the

effect of surface topographies on cellular morphology, Acta 

Biomaterialia, 15, (2015). 

Kuhn M. “Building Predictive Models in R Using the caret Package” 

Journal of Statistical Software, Vol. 28, (2008) 

R Core Team. R: A language and environment for statistical computing. 

R Foundation for Statistical Computing, Vienna, Austria. URL 

http://www.R-project.org/. (2015) 

Unadkat H V. et al. “An Algorithm-Based Topographical Biomaterials 

Library to Instruct Cell Fate.” Proceedings of the National Academy 

of Sciences of the United States of America 108.40 (2011). 






O11. ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS 

Wout Bittremieux 1 , Pieter Meysman 1 , Lennart Martens 2 , Bart Goethals 1 , Dirk Valkenborg 3 & Kris Laukens 1 . 

Advanced Database Research and Modeling (ADReM) & Biomedical Informatics Research Center Antwerp (biomina), 

University of Antwerp / Antwerp University Hospital 1 ; Department of Biochemistry & Department of Medical Protein 

Research, Ghent University / VIB 2 ; Flemish Institute for Technological Research (VITO) 3 . 

* wout.bittremieux@uantwerpen.be 

Mass-spectrometry-based proteomics is a powerful analytical technique to identify complex protein samples; however,

its results are still subject to a large variability. Lately, several quality control metrics have been introduced to assess the

performance of a mass spectrometry experiment. Unfortunately, these metrics are generally not thoroughly

understood. For this reason, we present a few powerful techniques to analyse multiple experiments based on quality 

control metrics, identify low-performance experiments, and provide an interpretation of outlying experiments. 

INTRODUCTION 

Mass-spectrometry-based proteomics is a powerful 

analytical technique that can be used to identify complex 

protein samples. Despite many technological and 

computational advances, performing a mass spectrometry 

experiment is still a highly complicated task and its results 

are subject to a large variability. To understand and 

evaluate how technical variability affects the results of an 

experiment, lately several quality control (QC) and 

performance metrics have been introduced. Unfortunately, 

despite the availability of such QC metrics covering a 

wide range of qualitative information, a systematic 

approach to quality control is often still lacking. 

As most quality control tools are able to generate several 

dozens of metrics, any single experiment can be 

characterized by multiple QC metrics. Therefore it is 

often not clear which metrics are most interesting in 

general, or even which metrics are relevant in a specific 

situation. To take into account the multidimensional data 

space formed by the numerous metrics, we have applied 

advanced techniques to visualize, analyze, and interpret 

the QC metrics. 

METHODS 

Outlier detection can be used to detect deviating 

experiments with a low performance or a high level of 

(unexplained) variability. These outlying experiments can 

subsequently be analyzed to discover the source of the 

reduced performance and to enhance the quality of future 

experiments. 

However, it is insufficient to know that a specific 

experiment is an outlier; it is also of vital importance to 

know the reason. To understand why an experiment is an 

outlier, we have used the subspace of QC metrics in which 

the outlying experiment can be differentiated from the 

other experiments. This provides crucial information on 

how to interpret an outlier, which can be used by domain 

experts to increase interpretability and investigate the 

performance of the experiment. 
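The idea of an explanatory subspace can be illustrated with a deliberately simple stand-in. The abstract does not name the outlier detection or subspace extraction algorithms, so the sketch below is not the authors' method: it scores each metric of one experiment with a robust z-score (median and MAD) and reports the metrics that deviate most. The metric names, the values and the threshold of 3 are made up.

```python
from statistics import median


def robust_z_scores(experiments, target):
    """Robust z-score of every QC metric of experiment `target`."""
    scores = {}
    for metric in experiments[target]:
        values = [e[metric] for e in experiments]
        centre = median(values)
        mad = median(abs(v - centre) for v in values)
        scale = 1.4826 * mad if mad > 0 else 1.0  # fallback avoids division by zero
        scores[metric] = (experiments[target][metric] - centre) / scale
    return scores


def explanatory_subspace(experiments, target, threshold=3.0):
    """Metrics in which the target experiment deviates, most deviating first."""
    z = robust_z_scores(experiments, target)
    deviating = [m for m in z if abs(z[m]) > threshold]
    return sorted(deviating, key=lambda m: -abs(z[m]))


# Made-up QC metrics for six experiments; the last one performs poorly.
experiments = [
    {"ms2_scans": 10000, "peak_width_s": 20.0, "mass_error_ppm": 1.0},
    {"ms2_scans": 10200, "peak_width_s": 21.0, "mass_error_ppm": 1.2},
    {"ms2_scans": 9900, "peak_width_s": 19.5, "mass_error_ppm": 0.9},
    {"ms2_scans": 10100, "peak_width_s": 20.5, "mass_error_ppm": 1.1},
    {"ms2_scans": 9800, "peak_width_s": 20.2, "mass_error_ppm": 1.0},
    {"ms2_scans": 6000, "peak_width_s": 35.0, "mass_error_ppm": 1.1},
]
subspace = explanatory_subspace(experiments, target=5)
```

In this toy example two metrics separate the last experiment from the rest, which is the kind of information a domain expert would use to trace the cause of the reduced performance.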


Figure 1 shows an example of interpreting a specific 

experiment that has been identified as an outlier. As can 

be seen, two QC metrics mainly contribute to this 

experiment being an outlier. The explanatory subspace 

formed by these QC metrics can be extracted, which can 

then be interpreted by domain experts, resulting in insights

into relationships between various QC metrics.

FIGURE 1. QC metric importances for interpreting an outlying

experiment. 

Next, by combining the explanatory subspaces for all 

individual outliers, it is possible to get a general view on 

which QC metrics are most relevant when detecting 

deviating experiments. When taking the various 

explanatory subspaces for all different outliers into 

account, a distinction between several of the outliers can 

be made in terms of the number of identified spectra 

(PSMs). As can be seen in Figure 2, for some specific QC

metrics (highlighted in italics) the outliers result in a

notably lower number of PSMs compared to the non-outlying

experiments. 

Because monitoring a large number of QC metrics on a 

regular basis is often impractical, it is more convenient to

focus on a small number of user-friendly, well-understood, 

and discriminating metrics. As the QC metrics highlighted 

in Figure 2 are shown to indicate low-performance 

experiments, these metrics are prime candidates to monitor 

on a continuous basis to quickly detect faulty experiments. 

FIGURE 2. Comparison of the number of PSMs between the non-outlying

and the outlying experiments. 






O12. XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM 

Şule Yılmaz 1,2,3* , Masa Cernic 4 , Friedel Drepper 5 , Bettina Warscheid 5 , Lennart Martens 1,2,3 & Elien Vandermarliere 1,2,3 . 

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ; 

Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Department of Biochemistry, Molecular and 

Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia 4 ; Functional Proteomics and Biochemistry, Department of 

Biochemistry and Functional Proteomics, Institute for Biology II and BIOSS Centre for Biological Signaling Studies, 

University of Freiburg, Freiburg, Germany 5 . *sule.yilmaz@ugent.be 

Chemical cross-linking coupled with mass spectrometry (XL-MS) facilitates the determination of protein structure and 

the understanding of protein interactions. The current computational approaches rely on different strategies with a limited 

number of open-source and easy-to-use search algorithms. We therefore built a novel cross-linked peptide identification 

algorithm, called Xilmass, which has a novel database construction and a new scoring function adapted from traditional

database search algorithms. We compared the performance of Xilmass against one of the most popular and publicly 

available algorithms: pLink, and a recently published algorithm Kojak. We found that Xilmass identified 140 spectra 

whereas Kojak and pLink identified 119 and 35, respectively. We mapped the cross-linking sites on the structure which 

resulted in the identification of 20 possible cross-linking sites. These findings show that Xilmass allows the identification 

of cross-linking sites. 

INTRODUCTION 

The structure of a protein is crucial for its functionality. 

Protein structure is commonly determined by X-ray 

crystallography or nuclear magnetic resonance (NMR).

X-ray crystallography is only feasible for crystallizable

proteins and NMR has a protein size limitation. Due to 

these restrictions, protein complexes are much more 

difficult to approach with these classical methods. 

However, chemical cross-linking of the complex coupled 

with mass spectrometry (XL-MS) allows the study of these

protein complexes. The identification of the measured

fragmentation spectra is a challenging task. One approach

to identify cross-linked peptides is to linearize cross-linked

peptide pairs in order to generate a database that can be

searched with traditional search engines (Maiolica et al., 2007).

However, a traditional search engine is not directly

applicable to the identification of cross-linked peptides. Another

approach is to rely on the use of labeled cross-linkers,

but this has a decreased performance when unlabeled

cross-linkers are used. We therefore built an algorithm,

Xilmass, which is designed for the identification of

XL-MS fragmentation spectra without linearization of peptides

or the requirement of labeled cross-linkers. We also

introduced a new way of representing a cross-linked

peptide database and directly implemented a new scoring

function. 

METHODS 

The data sets were derived from human calmodulin (CaM) 

and the actin binding domain of plectin (plectin-ABD) 

which were cross-linked by DSS. The data sets were 

analyzed on a Velos Orbitrap Elite. 

Cross-linked peptides were identified by Xilmass, pLink 

(Yang et al., 2012) and Kojak (Hoopmann et al., 2015). 

The identifications of both Xilmass and Kojak were 

validated by Percolator (Käll et al., 2007) at q-value=0.05. 

pLink returned a validated list at FDR=0.05. 

The findings on cross-linking sites were validated with the 

aid of the available structures (Plectin PDB-entry: 4Q57 

and calmodulin PDB-entry: 2F3Y). The cross-linking sites 

were predicted by X-Walk (Kahraman et al., 2011) and 

PyMOL was used for the visualization. 


We compared the number of identified spectra and cross-linking

sites from Xilmass, pLink and Kojak. Xilmass

identified 140 spectra whereas Kojak and pLink identified 

119 and 35 spectra, respectively (at FDR=0.05). Xilmass 

identified 53 cross-linking sites from the 140 spectra with 

37 obtained from at least 2 peptide-to-spectrum matches 

(PSMs). Kojak identified more cross-linking sites (60);

however, only 26 of these have at least 2 PSMs.

The cross-linking sites identified by Xilmass were

manually verified on the structure (Figure 1). We classified

20 cross-linking sites as possible (Cα-Cα distances within

30 Å; orange) and the remainder as not-predicted (Cα-Cα distances

exceeding 30 Å; blue). These findings show that Xilmass

allows the identification of cross-linking sites. 
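The 30 Å criterion can be expressed as a short check on the Cα coordinates of the two linked residues. The sketch assumes a straight-line (Euclidean) Cα-Cα distance; the coordinates in the example are invented and would in practice be read from the PDB entry, and the function name is hypothetical.

```python
from math import dist

MAX_CA_DISTANCE = 30.0  # Å, threshold stated in the text


def classify_cross_link(ca_a, ca_b, max_distance=MAX_CA_DISTANCE):
    """Label a cross-linking site by the Euclidean distance between two Cα atoms."""
    distance = dist(ca_a, ca_b)
    label = "possible" if distance <= max_distance else "not-predicted"
    return label, distance


# Invented coordinates (in Å) for two residue pairs.
near_label, near_distance = classify_cross_link((0.0, 0.0, 0.0), (3.0, 4.0, 12.0))
far_label, far_distance = classify_cross_link((0.0, 0.0, 0.0), (20.0, 20.0, 20.0))
```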

FIGURE 1. The identified cross-linking sites were mapped on the plectin 

protein structure to manually verify them (PDB-entry:4Q57) 

REFERENCES 

Hoopmann,M.R. et al. Journal of Proteome Research, 14, 2190–2198

(2015) 

Kahraman,A. et al. Bioinformatics, 27, 2163–2164 (2011) 

Käll,L. et al. Nature Methods, 4, 923–925 (2007) 

Maiolica,A. et al. Molecular & cellular proteomics:MCP, 6, 2200–2211 

(2007) 

Yang,B. et al. Nature Methods, 9, 904–906 (2012) 






O13. AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES 

BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS 

Nico Verbeeck 1* , Jeffrey Spraggins 2 , Yousef El Aalamat 3,4 , Junhai Yang 2 ,

Richard M. Caprioli 2 , Bart De Moor 3,4 , Etienne Waelkens 5,6 & Raf Van de Plas 1,2 .

Delft Center for Systems and Control (DCSC), Delft University of Technology 1 ; Mass Spectrometry Research Center 

(MSRC), Vanderbilt University 2 ; STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Dept.

of Electrical Engineering (ESAT), KU Leuven 3 ; iMinds Medical IT, KU Leuven 4 ; Dept. of Cellular and Molecular 

Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . * n.verbeeck@tudelft.nl 

Imaging mass spectrometry (IMS) is a powerful molecular imaging technology that generates large amounts of data, 

making manual analysis often practically infeasible. In this work we aid the differential analysis of multiple IMS datasets 

by linking these data to an anatomical atlas. Using matrix factorization based multivariate analysis techniques, we are 

able to identify differential biomolecular signals between individual tissue samples in an obesity case study on mouse 

brain. The resulting differential signals are then automatically interpreted in terms of anatomical structures using a 

convex optimization approach and the Allen Mouse Brain Atlas. The automated anatomical interpretation facilitates 

much deeper exploration by the biomedical expert for these types of very rich data sets. 

INTRODUCTION 

Imaging Mass Spectrometry (IMS) is a relatively new 

molecular imaging technology that enables a user to 

monitor the spatial distributions of hundreds of 

biomolecules in a tissue slice simultaneously. This unique 

property makes IMS an immensely valuable technology in 

biomedical research. However, it also leads to very large 

amounts of data in a single analysis (e.g. >1 TB), making 

manual analysis of these data increasingly impractical. In 

order to aid the exploration of these data, we have recently 

developed a framework that integrates IMS data with an 

anatomical atlas. The framework uses the anatomical data 

in the atlas to automatically interpret the IMS data in terms 

of anatomical structures, and guides the user towards 

relevant findings within a single tissue section. In this 

work, we extend this framework towards the automated 

interpretation of biomolecular differences between 

multiple IMS datasets. 

METHODS 

We demonstrate our method on IMS data of multiple 

mouse brain sections, and use the Allen Mouse Brain 

Atlas as the curated anatomical data source that is linked 

to the MALDI-based IMS measurements. We spatially 

map the data of each individual IMS dataset to the 

anatomical atlas using both rigid and non-rigid registration 

techniques. This establishes a common reference space 

and allows for direct comparison of spatial locations 

between the different IMS datasets. Group Independent 

Component Analysis (GICA) is then used to automatically 

extract the differentially expressed biomolecular patterns, 

after which convex optimization is used to automatically 

interpret the differential components in terms of known 

anatomical structures (Verbeeck et al., 2014), directly 

listing the anatomical areas in which changes occur. 
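The interpretation step can be illustrated with a minimal sketch. It assumes that each atlas structure is available as a binary mask on the same pixel grid as the registered IMS data, and it uses plain non-negative least squares as one simple convex program; the formulation in Verbeeck et al. (2014) may differ, and the function name, structure names and toy data below are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def interpret_pattern(pattern, masks):
    """Approximate a spatial pattern as a non-negative mix of atlas masks.

    pattern: 1-D array of length n_pixels (e.g. one differential component).
    masks:   dict mapping structure name -> 1-D 0/1 array of length n_pixels.
    Returns (name, weight) pairs sorted by decreasing weight.
    """
    names = list(masks)
    # One column per anatomical structure, one row per pixel.
    A = np.column_stack([masks[n] for n in names]).astype(float)
    weights, _residual_norm = nnls(A, np.asarray(pattern, dtype=float))
    return sorted(zip(names, weights), key=lambda t: -t[1])


# Toy example: 6 pixels and three non-overlapping, hypothetical structures.
masks = {
    "cortex": np.array([1, 1, 0, 0, 0, 0]),
    "hippocampus": np.array([0, 0, 1, 1, 0, 0]),
    "thalamus": np.array([0, 0, 0, 0, 1, 1]),
}
pattern = 2.0 * masks["cortex"] + 0.5 * masks["thalamus"]
ranking = interpret_pattern(pattern, masks)
```

The sorted weights give a ranked list of structures that together approximate the pattern, which is the kind of output the text describes as "directly listing the anatomical areas in which changes occur".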


We demonstrate our approach in an obesity case study on 

mouse brain. All tissue sections are cryosectioned at 10 

μm and thaw-mounted onto ITO coated glass slides after 

which they are sublimated with CMBT matrix. MALDI 

IMS images are collected using the Bruker 15T solariX 

FTICR MS with a spatial resolution of 50 μm, collecting 

approximately 35,000 pixels per experiment. 

The IMS data of the different experiments are registered to 

the anatomical reference space provided by the Allen 

Mouse Brain Atlas, establishing an inter-experiment 

study-wide reference space. Analysis of the IMS 

measurements using GICA reveals multiple biomolecular 

patterns that differentiate between the various dietary 

conditions examined by the study. The retrieved 

differentially expressed biomolecular patterns are then 

translated to combinations of anatomical structures using 

our convex optimization approach, similar to what a 

human investigator intends to do. This automated 

interpretation of inter-experiment differences can serve as 

a great accelerator in the exploration of IMS data, as it 

avoids the time- and resource-intensive step of having a 

histological expert manually interpret the differential 

patterns. 

FIGURE 1. Automated anatomical interpretation of a biomolecular 

pattern that is differentially expressed in coronal mouse brain sections 

between a high fat and a low fat diet in our obesity case study. 

REFERENCES 

Verbeeck, N. et al. Automated anatomical interpretation of ion 

distributions in tissue: linking imaging mass spectrometry to curated 

atlases. Anal. Chem. 86, 8974–8982 (2014). 






O14. ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA 

THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS 

Yousef El Aalamat 1,2* , Xian Mao 1,2 , Nico Verbeeck 3 , Junhai Yang 4 , Bart De Moor 1,2 , 

Richard M. Caprioli 4 , Etienne Waelkens 5,6 & Raf Van de Plas 3,4 . 

Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data 

Analytics, KU Leuven 1 ; iMinds Medical IT, KU Leuven 2 ; Delft Center for Systems and Control, Delft University of 

Technology 3 ; Mass Spectrometry Research Center (MSRC), Vanderbilt University 4 ; Department of Cellular and 

Molecular Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . *yelaalam@esat.kuleuven.be 

Imaging mass spectrometry (IMS) is rapidly evolving as a label-free, spatially resolved molecular imaging tool for the 

direct analysis of biological samples. However, mass spectrometry (MS) measurements are subject to different types of 

noise. In IMS, one of the most abundant noise types in ion images is the presence of localized intensity spikes, known 

also as sparse intensity variations, which occur on top of the biological ion distribution pattern. In this study, we develop 

a method that addresses the issue of sparse intensity noise. We use low-rank approximations of the IMS data to separate 

and filter sparse intensity variations from the MS signals. The efficiency of the developed method is tested using MS 

measurements of coronal sections of mouse brain and strong de-noising performance is demonstrated both along the 

spatial and the spectral domain. 

INTRODUCTION 

Imaging mass spectrometry (IMS) provides unique 

capabilities for biomedical and biological research. 

However, its measurements tend to be subject to different 

types of noise. One of the more abundant noise types in 

IMS consists of localized intensity spikes, which can be seen as 

sparse intensity variations on top of the true biological ion 

patterns. This kind of noise can have a substantial impact, 

particularly on low ion intensity measurements where the 

signal-to-noise ratio (SNR) can be significantly affected. 

We present a method to filter sparse intensity variations 

from IMS data, and demonstrate its use to de-noise IMS 

measurements both along the spatial and the spectral 

domain. 

METHODS 

We introduce a de-noising algorithm based on low-rank 

approximation, a concept from linear algebra. The method 

can separate sparse intensity variations from biological 

and tissue sample patterns, which hold up across multiple 

ions and pixels. This approach decomposes IMS data into 

two parts, namely a structured data matrix and a sparse 

data matrix. Since the noise tends to be sparse in nature, it 

will have a propensity to be collected into the sparse data 

part. The structured part tends to capture the de-noised 

IMS signals, effectively de-noising the ion images and the 

spectral profiles in the process. This de-noising method 

allows us to automatically filter sparse intensity variations 

from the underlying tissue signal without requiring any 

parameter tuning. 
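As an illustration of a structured-plus-sparse split, the sketch below uses principal component pursuit (robust PCA) solved with a basic augmented Lagrangian iteration. The abstract does not state which low-rank algorithm the authors use, so this is a stand-in under that assumption. The default weight lam = 1/sqrt(max(m, n)) is the standard choice for this technique, so no parameter is tuned by hand here. The function names and synthetic data are hypothetical; in an IMS setting, rows would be pixels and columns m/z bins.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding: pull every entry towards zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Soft-threshold the singular values of X (promotes low rank)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def robust_pca(M, max_iter=1000, tol=1e-7):
    """Split M into a low-rank part L and a sparse part S with M ~ L + S."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard default weight
    mu = m * n / (4.0 * np.abs(M).sum())    # common step-size heuristic
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                    # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S


# Synthetic check: a rank-2 "tissue" signal plus 5% random intensity spikes.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(60, 2)) @ rng.normal(size=(2, 50))
spikes = np.zeros_like(low_rank)
idx = rng.random(low_rank.shape) < 0.05
spikes[idx] = 5.0 * rng.choice([-1.0, 1.0], size=idx.sum())

structured, sparse = robust_pca(low_rank + spikes)
rel_err = np.linalg.norm(structured - low_rank) / np.linalg.norm(low_rank)
```

Here `structured` plays the role of the de-noised data matrix and `sparse` collects the spikes; `rel_err` measures how closely the structured part matches the spike-free signal.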


The filter method is demonstrated on two IMS 

experiments (one lipid-focused and one protein-focused) 

acquired from coronal sections of mouse brain. For the 

protein experiment, the tissue section was coated with 

sinapinic acid, and measurements were acquired using a 

Bruker AutoFlex MALDI-TOF/TOF in positive linear 

mode at a spatial resolution of 100 μm and with a mass 

range extending from m/z 3000 to 22000. For the lipid 

experiment, the tissue section was sublimated with 1,5- 

diaminonaphthalene, and the measurements were acquired 

using a Bruker AutoFlex MALDI-TOF/TOF in negative 

reflectron mode at a spatial resolution of 80 μm and with a 

mass range extending from m/z 400 to 1000. The case 

studies demonstrate robust de-noising performance, 

retrieving the underlying tissue signal efficiently and 

consistently using the structured data matrix. On the 

spatial side, we observe a clean-up effect in the spatial 

distributions of both high- and low-intensity ions. The 

effect is especially impactful for low-intensity ions, 

showing a strong increase in the amount of spatial 

structure that can be retrieved from low SNR 

measurements and revealing patterns that would have 

gone unnoticed otherwise. On the spectral side, we 

observe an improved SNR after applying the method. 

Thus, at the cost of computational analysis, the de-noising 

method described here provides a means of increasing the 

amount of information that can be extracted from an IMS 

experiment, without requiring user interaction or 

additional measurement. 

FIGURE 1. Impact on both spatial and spectral domain. Top: example of 

de-noised ion image. Bottom: plot of a spectrum before (blue) and after 

(red) removal of sparse intensity variations. 






O15. DETERMINANTS OF COMMUNITY STRUCTURE 

IN THE PLANKTON INTERACTOME 

Gipsi Lima-Mendez 1,2* , Karoline Faust 1,2,3 , Nicolas Henry 4 , Johan Decelle 4 , Sébastien Colin 4 , Fabrizio Carcillo 2,3,5 , 

Simon Roux 6 , Gianluca Bontempi 5 , Matthew B. Sullivan 6 , Chris Bowler 7 , Eric Karsenti 7,8 , Colomban de Vargas 4 & 

Jeroen Raes 1,2 . 

Department of Microbiology and Immunology, Rega Institute KU Leuven 1 ; VIB Center for the Biology of Disease 2 ; 

Laboratory of Microbiology, Vrije Universiteit Brussel, Belgium 3 ; CNRS, UMR 7144, Station Biologique de Roscoff 4 ; 

Interuniversity Institute of Bioinformatics in Brussels (IB)², Machine Learning Group, Université Libre de Bruxelles 5 ; 

Department of Ecology and Evolutionary Biology, University of Arizona, USA 6 ; Ecole Normale Supérieure, Institut de 

Biologie (IBENS), France 7 ; European Molecular Biology Laboratory 8 . * Gipsi.limamendez@vib-kuleuven.be 

Identifying the abiotic and biotic factors that shape species interactions is a fundamental yet unsolved goal in ecology. 

Here, we integrate organismal abundances and environmental measures from Tara Oceans to reconstruct the first global 

photic-zone co-occurrence network. Environmental factors are incomplete predictors of community structure. Putative 

biotic interactions are non-randomly distributed across phylogenetic groups, and show both local and global patterns. 

Known and novel interactions were identified among grazers, primary producers, viruses and symbionts. The high 

prevalence of parasitism suggests that parasites are important regulators in the ocean food web. Together, this effort 

provides a foundational resource for ocean food web research and integrating biological components into ocean models. 

INTRODUCTION 

Determining the relative importance of both biotic and 

abiotic processes represents a grand challenge in ecology. 

Here we analyze sequence data on plankton organisms and 

environmental data from the Tara-Oceans project. We 

applied network inference methods to construct a global-ocean 

cross-kingdom species interaction network and 

disentangled the biotic and abiotic signals shaping this 

interactome (Lima-Mendez, et al., 2015). 

METHODS 

Methods are described in detail in (Lima-Mendez, et al., 

2015). Briefly: 

• Network inference. Taxon-taxon networks were 

constructed as in (Faust, et al., 2012), selecting 

Spearman and Kullback-Leibler dissimilarity. 

Edges with merged multiple-test-corrected 

p-values below 0.05 were kept. Taxon-environment 

networks were computed with the same 

procedure and merged with taxon-taxon networks 

for environmental triplet detection. 

• Indirect taxon edge detection. For each triplet 

consisting of two taxa and one environmental 

parameter, we computed the interaction 

information (II) and taxon edges were considered 

indirect when II





O16. BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON 

SEQUENCING DATA FOR BIODIVERSITY ANALYSIS 

Mohamed Mysara 1-3 , Yvan Saeys 4,5 , Natalie Leys 1 , Jeroen Raes 2,6 & Pieter Monsieurs 1* . 

Unit of Microbiology, Belgian Nuclear Research Centre SCK•CEN, Mol, Belgium 1 ; Department of Bioscience 

Engineering, Vrije Universiteit Brussel VUB, Brussels, Belgium 2 ; Department of Structural Biology, Vlaams Instituut 

voor Biotechnologie VIB, Brussels, Belgium 3 ; Data Mining and Modeling Group, VIB Inflammation Research Center, 

Ghent, Belgium 4 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 5 ; Department of 

Microbiology and Immunology, REGA institute, KU Leuven, Belgium 6 . * pmonsieu@sckcen.be 

High-throughput sequencing technologies have created a wide range of new applications, also in the field of microbial 

ecology. Yet when used in 16S rRNA biodiversity studies, they suffer from two important problems: the presence of PCR 

artefacts (called chimeras) and sequencing errors introduced by the sequencing technologies themselves. In this work 

three artificial intelligence-based algorithms are proposed, CATCh, NoDe and IPED, to handle these two problems. A 

benchmarking study was performed comparing CATCh/NoDe (for 454 pyrosequencing) or CATCh/IPED (for Illumina 

MiSeq sequencing) with other state-of-the-art tools, showing a clear improvement in chimera detection and reduction of 

sequencing errors respectively, and in general leading to more accurate clustering of the sequencing reads in Operational 

Taxonomic Units (OTUs). All algorithms are available via http://science.sckcen.be/en/Institutes/EHS/MCB/MIC 

/Bioinformatics/. 

INTRODUCTION 

The revolution in new sequencing technologies has led to 

an explosion of possible applications, including new 

opportunities for microbial ecological studies via the 

usage of 16S rDNA amplicon sequencing. However, 

within such studies, all sequencing technologies suffer 

from the presence of erroneous sequences, i.e. (i) chimeras, 

introduced by wrong target amplification in PCR, and (ii) 

sequencing errors originating from different factors during 

the sequencing process. As such, there is a need for 

effective algorithms to remove those erroneous sequences 

to be able to accurately assess the microbial diversity. 

METHODS 

First, a new algorithm called CATCh (Combining 

Algorithms to Track Chimeras) was developed by 

integrating the output of existing chimera detection tools 

into a new more powerful method. Second, NoDe (Noise 

Detector) was introduced, an algorithm that identifies and 

corrects erroneous positions in 454-pyrosequencing reads. 

Third, the IPED (Illumina Paired End Denoiser) algorithm 

was developed as the first tool in the field to handle error 

correction in Illumina MiSeq sequencing data. After 

identifying those positions likely to contain an error, those 

sequencing reads are subsequently clustered with correct 

reads resulting in error-free consensus reads. The three 

algorithms were benchmarked with state-of-the-art tools. 


Via a comparative study with other chimera detection 

tools, CATCh was shown to outperform all other tools, 

thereby increasing the sensitivity by up to 14% (see 

Figure 1). 

FIGURE 1. Plot indicating the effect of applying 5% indels (shown on the 

left) and 5% mismatches (shown on the right), on the performance of 

different chimera detection tools. CATCh was found to outperform other 

existing tools. 

Similarly, NoDe and IPED were benchmarked against 

other denoising algorithms, thereby showing a significant 

improvement in reduction of the error rate up to 55% and 

75% respectively (see Figure 2). The combined effect of 

our algorithms for chimera removal and error correction 

also had a positive effect on the clustering of reads in 

operational taxonomic units (OTUs), with an almost 

perfect correlation between the number of OTUs and the 

number of species present in the mock communities. 

Indeed, when applying our improved pipeline containing 

CATCh and NoDe on a 454 pyrosequencing mock dataset, 

our pipeline could reduce the number of OTUs to 28 (i.e. 

close to 18, the correct number of species). In contrast, 

running the straightforward pipeline without our 

algorithms included would inflate the number of OTUs to 

98. Similarly, when tested on Illumina MiSeq sequencing 

data obtained for a mock community, using a pipeline 

integrating CATCh and IPED, the number of OTUs 

returned was 33 (i.e. close to the real number of 21 

species), while 86 OTUs were obtained using the default 

mothur pipeline. 

REFERENCES 

Mysara M., Leys N., Raes J., Monsieurs P.- NoDe: a fast error-correction 

algorithm for pyrosequencing amplicon reads.- In: BMC 

Bioinformatics, 16:88(2015), p. 1-15.- ISSN 1471-2105 

Mysara M., Saeys Y., Leys N., Raes J., Monsieurs P.- CATCh, an 

Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing 

Studies.- In: Applied and Environmental Microbiology, 81:5(2015), 

p. 1573-1584.- ISSN 0099-2240 






O17. GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND 

CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS- 

BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS 

Sjoerd M.H. Huisman 1,2* , Else Eising 3 , Ahmed Mahfouz 1,2 , Lisanne Vijfhuizen 3 , International Headache Genetics 

Consortium, Boudewijn P.F. Lelieveldt 2 , Arn M.J.M. van den Maagdenberg 3,4 & Marcel J.T. Reinders 1 . 

DBL, Dept. of Intelligent Systems, Delft University of Technology, The Netherlands 1 ; LKEB, Dept. of Radiology, Leiden 

University Medical Center, The Netherlands 2 ; Dept. of Human Genetics, Leiden University Medical Center, The 

Netherlands 3 ; Dept. of Neurology, Leiden University Medical Center, The Netherlands 4 . * s.m.h.huisman@tudelft.nl 

Migraine is a common brain disorder, with a heritability of around 50%. To understand the genetic component of this 

disease, a large genome wide association study has been carried out. Several loci were identified, but their interpretation 

remained challenging. We integrated the GWAS results with gene expression data, from healthy human brains, to 

identify anatomical regions and biological pathways implicated in migraine pathophysiology. 

INTRODUCTION 

Genome Wide Association Studies (GWAS) are 

frequently used to find common variants with small effect 

sizes. However, they often provide researchers with short 

lists of single nucleotide polymorphisms (SNPs) with 

uncertain connections to biological functions. 

We present an analysis of GWAS data for migraine, where 

the full list of SNP statistics is used to find groups of 

functionally related migraine-associated genes. To this 

end we make use of gene co-expression in the healthy 

human brain. 

We performed genome wide clustering of genes, followed 

by enrichment analysis for migraine candidate genes. In 

addition, we constructed local co-expression networks 

around high-confidence genes. Both approaches converge 

on distinct biological functions and brain regions of 

interest. 

METHODS 

Migraine GWAS data was obtained from the International 

Headache Genetics Consortium, with 23,285 cases and 

95,425 controls (Anttila et al., 2013). Genes were scored 

by SNP load and divided into high-confidence genes, 

migraine candidate genes, and non-migraine genes. 

Spatial gene expression data in the healthy adult human 

brain was obtained from the Allen Brain Institute 

(Hawrylycz et al., 2012). It contains microarray 

expression values of 3702 samples from 6 donors. Robust 

gene co-expressions were used to cluster genes into 18 

modules, which were then tested for enrichment of 

migraine candidate genes, and functionally characterized. 

In a second approach, local co-expression networks were 

built around the high-confidence migraine genes. These 

local networks were then compared to the modules of the 

first approach. 
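The module enrichment step can be sketched as follows. The abstract does not name the statistical test; a one-sided hypergeometric test is a common choice for this kind of question and is assumed here. The function name and all counts in the example are hypothetical and serve only as an illustration.

```python
from math import comb


def enrichment_pvalue(n_genes, n_candidates, module_size, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    X is the number of candidate genes found when module_size genes are drawn
    without replacement from n_genes genes, n_candidates of which are
    migraine candidate genes.
    """
    total = comb(n_genes, module_size)
    upper = min(n_candidates, module_size)
    tail = sum(
        comb(n_candidates, k) * comb(n_genes - n_candidates, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: a 500-gene module that holds 50 of the 1000
# candidate genes in a 20000-gene background (25 expected by chance).
p = enrichment_pvalue(n_genes=20000, n_candidates=1000,
                      module_size=500, overlap=50)
```

With 18 modules tested, the resulting p-values would normally also be corrected for multiple testing.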


The genome wide analysis revealed several modules of 

genes enriched in migraine candidates. Two modules have 

preferential expression in the cerebral cortex and are 

enriched in synapse related annotations and neuron 

specific genes. A third module contains oligodendrocyte-specific 

genes and genes preferentially expressed in subcortical regions. 

The local co-expression networks, of the second approach, 

converge on the same pathways and expression patterns, 

even though the high confidence genes lie mostly outside 

of the modules of interest. This provides a control to the 

results of the first approach. 

FIGURE 1. The co-expression network around high confidence migraine 

genes of the second approach. Genes (and links between them) of the 

migraine modules of the first approach are coloured in red, yellow, blue, 

and green. 

The analyses confirm the previously observed link 

between migraine and cortical neurotransmission. They 

also point to the involvement of subcortical myelination, 

which is in line with recent tentative findings. These 

results show that more relevant information can be 

extracted from GWAS results, using (publicly available) 

tissue specific expression patterns. 

REFERENCES 

Anttila V. et al. Genome-wide meta-analysis identifies new susceptibility 

loci for migraine. Nat. Genet. 45, 912–7, (2013). 

Hawrylycz M.J. et al. An anatomically comprehensive atlas of the adult 

human brain transcriptome. Nature 489, 391–9, (2012). 






O18. SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN 

THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION 

MECHANISMS 

Ahmed Mahfouz 1,2* , Boudewijn P.F. Lelieveldt 1,2 , Aldo Grefhorst 3 , Isabel M. Mol 4 , Hetty C.M. Sips 4 , José K. van den 

Heuvel 4 , Jenny A. Visser 3 , Marcel J.T. Reinders 2 , & Onno C. Meijer 4 . 

Department of Radiology, Leiden University Medical Center 1 ; Delft Bioinformatics Lab, Delft University of 

Technology 2 ; Department of Internal Medicine, Erasmus University Medical Center 3 ; Department of Internal Medicine, 

Leiden University Medical Center 4 . * a.mahfouz@lumc.nl 

Steroid hormones coordinate the activity of many brain regions by binding to nuclear receptors that act as transcription 

factors. This study uses genome wide correlation of gene expression in the mouse brain to discover 1) brain regions that 

respond in a similar manner to particular steroids, 2) signaling pathways that are used in a steroid receptor and brain 

region-specific manner, and 3) potential target genes and relationships between groups of target genes. The data 

constitute a rich repository for the research community to support new insights in neuroendocrine relationships, and to 

develop novel ways to manipulate brain activity in research or clinical settings. 

INTRODUCTION 

Steroid receptors are pleiotropic transcription factors that 

coordinate adaptation to different physiological states. An 

important target organ is the brain, but its complexity 

hampers the understanding of their modulation. 

METHODS 

We used the Allen Brain Atlas (ABA) (Lein et al., 2007), 

the most comprehensive repository of in situ 

hybridization-based gene expression in the adult mouse 

brain, to identify genes that have three dimensional (3D) 

spatial gene expression profiles similar to steroid receptors. 

To validate the functional relevance of this approach, we 

analyzed the co-expression relationship of the 

glucocorticoid receptor (Gr) and estrogen receptor alpha 

(Esr1) and their known transcriptional targets in their 

brain regions of action. Next, we studied the region-specific 

co-expression of nuclear receptors and their co-regulators 

to identify potential partners mediating the 

hormonal effects on dopaminergic transmission. Finally, 

to illustrate the potential of using spatial co-expression to 

predict region-specific steroid receptor targets in the brain, 

we identified and validated genes that responded to 

changes in estrogen in the arcuate nucleus and medial 

preoptic area of the mouse hypothalamus. 


For each steroid receptor, we ranked genes based on their 

spatial co-expression across the whole brain as well as in 

each of the aforementioned 12 brain structures separately. 

For each steroid receptor, strongly co-expressed genes 

within a brain region are likely related to the localized 

functional role of the receptor. For example, out of the top 

10 genes co-expressed with Esr1 across the whole brain, 4 

were previously shown to be regulated by Esr1 and/or 

estrogens in various tissues (Gpr101, Calcr, Ngb, and 

Gpx3). 

We assessed the extent of co-expression of glucocorticoid 

(GC)-responsive genes (Datson et al., 2012) with Gr in the 

whole brain, the hippocampus and its substructures the 

dentate gyrus (DG) and the different subregions of the 

cornu ammonis (CA). GC-responsive genes were 

significantly co-expressed with Gr in the DG, but 

interestingly also in the whole brain and in the CA3 region 

(FDR-corrected p < 1.8×10⁻³; Mann-Whitney U-test). 

Similarly, a Mann-Whitney U-test showed that a set of 15 

genes that are sensitive to gonadal steroids (Xu et al., 

2012) is significantly correlated to Esr1 across the whole 

brain (FDR-corrected p = 8.69×10⁻¹⁴), as well as in the 

hypothalamus (p = 3.85×10⁻¹⁰), the brain region 

responsible for the sexual behavior in animals. 
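The comparison of a known target gene set against all other genes can be sketched as below. The data are synthetic and the function name is hypothetical. Pearson correlation across voxels is assumed as the co-expression measure, which the abstract does not specify, and the FDR correction is omitted for brevity.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def coexpression_with(receptor, expression):
    """Pearson correlation of each gene's spatial profile with the receptor.

    receptor:   array of shape (n_voxels,)
    expression: array of shape (n_genes, n_voxels)
    """
    x = expression - expression.mean(axis=1, keepdims=True)
    r = receptor - receptor.mean()
    return (x @ r) / (np.linalg.norm(x, axis=1) * np.linalg.norm(r))


rng = np.random.default_rng(0)
n_voxels = 200
receptor = rng.normal(size=n_voxels)                  # e.g. an Esr1 profile
background = rng.normal(size=(300, n_voxels))         # unrelated genes
# Hypothetical target set: profiles that partly follow the receptor.
targets = 0.5 * receptor + rng.normal(size=(15, n_voxels))

corr_bg = coexpression_with(receptor, background)
corr_targets = coexpression_with(receptor, targets)

# One-sided test: are target genes more strongly co-expressed than the rest?
stat, p_value = mannwhitneyu(corr_targets, corr_bg, alternative="greater")
```

Restricting the voxel axis to a single brain structure gives the region-specific variant of the same test.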

In order to identify putative region-dependent co-regulators 

of steroid receptors, we analyzed the co-expression 

relationships of each steroid receptor and a 

set of 62 nuclear receptor co-regulators as present on a 

peptide array (Nwachukwu et al., 2014). We focused our 

analysis on well-established target regions of steroid 

hormone action, dopaminergic brain regions (ventral 

tegmental area; VTA & substantia nigra; SN). We found 

three significantly co-expressed co-regulators with 

androgen receptor (Ar): Pnrc2, Pak6 and Trerf1, 

suggesting that these co-regulators may be involved in 

mediating Ar effects on dopaminergic transmission. 

In order to validate the predictive value of highly correlated 

expression with a steroid receptor, we analyzed the 

response of the top 10 genes that are strongly co-expressed 

with Esr1 in the hypothalamus to the estrogen 

diethylstilbestrol (DES) in castrated male mice using 

qPCR. We performed quantitative double in situ 

hybridization (dISH) for Esr1 and the six mRNAs (Irs4, 

Magel2, Adck4, Unc5, Ngb, and Gdpd2) that showed more 

than 1.3-fold enrichment in qPCR. We found Irs4 and 

Magel2 mRNA were both significantly upregulated by 

DES treatment (1.9 and 2.4-fold, respectively). 

REFERENCES 

Lein E. et al. Nature 445, 168–76 (2007). 

Datson N. et al. Hippocampus 22, 359–71 (2012). 

Xu X. et al. Cell 148, 596–607 (2012). 

Nwachukwu J. et al. eLife 3, e02057 (2014). 






O19. A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI 

Bart Cuypers 1,2,3* , Pieter Meysman 1,2 , Manu Vanaerschot 3 , Maya Berg 3 , Malgorzata Domagalska 3 , Jean-Claude 

Dujardin 3,4# & Kris Laukens 1,2# . 

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center 

Antwerpen (biomina) 2 ; Molecular Parasitology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine, 

Antwerp 3 ; Department of Biomedical Sciences, University of Antwerp 4 . * bart.cuypers@uantwerpen.be # shared senior 

authors 

Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health 

due to increasing drug resistance. Little is known about its very peculiar molecular biology and there has been little 

‘omics integration effort so far. Here we present an integrative database or ‘omics compendium that contains all 

genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for 

Leishmania donovani. Additionally, the user interface contains analysis tools for new datasets that use smart data mining 

strategies like frequent itemset mining to link results from different ‘omics layers. 

INTRODUCTION 

The protozoan parasite Leishmania donovani causes 

visceral leishmaniasis (VL), a life threatening disease 

which affects 500 000 people each year. With only four 

drugs available and rapidly emerging drug resistance, 

knowledge about the parasite’s resistance mechanisms is 

essential to boost the development of new drugs. However, 

little is known about the gene regulation of 

Leishmania and the few findings indicate major 

differences to known gene expression systems. Indeed, no 

polymerase II promoters have ever been found in 

Leishmania 1 . Genes are constitutively transcribed in large 

polycistronic units and subsequently spliced into 

individual mRNAs (trans-splicing) 1 . A modified thymine, 

Base J, marks the end of transcription units and functions 

as a stop signal for the RNA polymerase 2 . Gene 

expression is then assumed to be regulated at the post-transcriptional 

level (mRNA stability, translation 

efficiency, epigenetic factors, etc.) but evidence to 

support this is scarce 1 . Integration of different ‘omics 

could shed light on these gene regulatory mechanisms, but 

there has been little integration effort so far. 

METHODS 

We developed an easy to use tool, able to import and 

connect all existing L. donovani ‘omics experiments. 

Genomics, epigenomics, transcriptomics, proteomics, 

metabolomics and phenotypic data was collected and 

added to a MySQL database compendium, further 

complemented with publicly available data. Relations 

between different ‘omics layers were explicitly defined 

and provided with a level of confidence. Python scripts 

were developed to preprocess, analyse and import the data. 

To allow comparability between different experiments, 

platforms and labs, the three integration principles of the 

COLOMBOS bacterial expression compendium were 

adapted 3 . 1) Use the same data-analysis pipeline for all 

data. 2) Work with contrasts to a control condition instead 

of expression values. 3) Annotate these contrasts in a 

unified and structured manner. 

Next to this vast data source, a set of integrative data-analysis 

tools was developed based on data mining 

strategies. For example, one tool uses frequent itemset 

mining algorithms to detect which proteins and 

metabolites frequently exhibit the same behaviour under 

different conditions. Another tool converts several ‘omics 

layers to a network format that can be opened in 

Cytoscape and can thus be the basis for network analysis. 

The Django and Twitter Bootstrap frameworks were used 

to create a web portal to make the tools accessible to any 

Leishmania researcher. 
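The frequent itemset idea can be sketched in a few lines. This is a brute-force illustration and not the compendium's implementation: it treats each contrast as a "transaction" listing the entities that changed in it, and reports the combinations that recur in at least a minimum number of contrasts. The function name and the entity labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return itemsets (as sorted tuples) seen in >= min_support transactions.

    Every subset of up to max_size items is enumerated per transaction, which
    is only practical for small inputs; Apriori or FP-growth scale better.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Each set lists the entities (hypothetical labels) upregulated in one contrast.
contrasts = [
    {"protA_up", "metX_up", "metY_up"},
    {"protA_up", "metX_up"},
    {"protA_up", "metX_up", "protB_up"},
    {"protB_up", "metY_up"},
]
result = frequent_itemsets(contrasts, min_support=3)
```

In this toy input, protein A and metabolite X are upregulated together in three of the four contrasts, so the pair is reported as a frequent itemset alongside its two members.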


Excellent public gene, protein, metabolite annotation 

databases for Leishmania and related species are already 

available (e.g. TriTrypDB and GeneDB). However, the 

strength of our tool is that it links these annotation data to 

‘omics experiments that are either provided by the user, or 

that are publicly available. New experiments can quickly 

be preprocessed, analysed and integrated in the database 

via its Python back end. The compendium is therefore not 

only a look-up tool (e.g. under which conditions is this 

gene or metabolite upregulated?), but has tools available 

to also analyse the user-provided data with intelligent data 

mining tools (e.g. which metabolites/genes are typically 

upregulated in drug-resistant strains?). These new 

experiments provide additional confidence and 

information about the biological entities in the database. 

Unlike many other databases, the compendium has an 

elaborate quality control system. Every result provided by 

the tools can be traced back to the experimental data, 

which contains the necessary quality control plots to 

support the experiment’s validity. Additionally, it contains 

all relevant information about the extractions and the 

origin of the biological material. 

Using the compendium and its tools, we characterized the 

development and drug resistance in a systems biology 

context of Leishmania donovani. The genomes of more 

than 200 strains were examined for associations with 

phenotypical features and a subset was linked to 

transcriptomics, proteomics and metabolomics results. The 

compendium and its scripts were designed to be generic 

and can therefore be used for other organisms with only 

minor changes. 

REFERENCES 

1. Donelson, J. (1999) PNAS. 96, 2579–258. 

2. Van Luenen, H. G. a M. et al. (2012) Cell. 150, 909–21. 

3. Meysman, P. et al. (2014) Nucleic Acids Research. 42, D649–D653. 






O20. MULTI-OMICS INTEGRATION: RIBOSOME PROFILING 

APPLICATIONS 

Volodimir Olexiouk 1 , Elvis Ndah 1 , Sandra Steyaert 1 , Steven Verbruggen 1 , Eline De Schutter 1 , Alexander Koch 1 , Daria 

Gawron 2 , Wim Van Criekinge 1 , Petra Van Damme 2 , Gerben Menschaert 1,* . 

Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and 

Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 ; Dept. Medical Protein Research, VIB-Ghent 

University 2 . * Gerben.menschaert@ugent.be 

Ribosome profiling is a relatively new NGS technology that enables the monitoring of the in vivo synthesis of mRNA-encoded 

translation products measured at the genome-wide level. The technique, also sometimes referred to as RIBOseq, 

uses the property of translating ribosomes to protect mRNA fragments from nuclease digestion and allows to determine 

genomic positions of translating ribosomes with sub-codon to single-nucleotide precision. Since the advent of the 

technology, several bioinformatics solutions have been devised to investigate this type of data. Here we will present 

several solutions to detect novel proteoforms by combining RIBOseq and mass spectrometry data, to detect putatively 

coding small open reading frames (sORFs), and to evaluate the impact of DNA and RNA methylation on the translation 

level. 

INTRODUCTION 

Integration of different OMICS technologies is routinely

adopted to investigate biological systems. Our lab focuses

on high-throughput data analysis and the development of

novel data integration methodologies. Our current focus

is on ribosome profiling (Ingolia et al., 2011), an NGS

based technique to measure the so-called translatome (i.e. 

the mRNA that shows ribosome occupancy). This 

technique is applied in combination with other sequencing 

based protocols to measure expression (RNAseq), 

translation (mass spectrometry) and to chart maps of 

regulatory elements such as DNA methylation (reduced 

representation bisulfite sequencing, RRBS) and RNA 

methylation (m6Aseq) to address several biological

questions. 

METHODS 

For the integration of RIBOseq and mass spectrometry 

(MS), we devised a tool called PROTEOFORMER 

(www.biobix.be/proteoformer). This proteogenomics tool 

consists of several steps. It starts with the mapping of 

ribosome-protected fragments (RPFs) and quality control 

of subsequent alignments. It further includes modules for 

identification of transcripts undergoing protein synthesis, 

positions of translation initiation with sub-codon 

specificity and single nucleotide polymorphisms (SNPs). 

We used PROTEOFORMER to create protein sequence 

search databases from publicly available mouse and in-house

performed human RIBOseq experiments and

evaluated these with matching proteomics data (Crappé et 

al., 2015). 

Another pipeline based on RIBOseq data is built around 

the discovery of putatively coding small open reading 

frames (sORFs). Herein, the first step is to delineate 

sORFs based on RPF coverage throughout the coding 

sequence and at the translation initiation site. Afterwards, 

state-of-the-art tools and metrics assessing the coding

potential of sORFs are applied and a list of candidate

sORFs for downstream analysis is compiled (e.g. MS-based

identification).

To assess the impact of DNA-methylation at the 

translation level a double knockout DNMT model was 

studied (WT and DNMT1 + 3B knockout HCT116 cell 

line). Genome-wide DNA methylation profiling was 

performed using RRBS, while ribosome profiling, 

quantitative shotgun and positional proteomics (N-terminal

COFRADIC) were used to obtain protein

expression data. 

An initial experiment to integrate m6Aseq (measuring the 

m6A epitranscriptome) and ribosome profiling has also 

been executed on HCT116 cells. 


The RIBOseq-MS integration (through 

PROTEOFORMER) increases the overall protein 

identification rates by 3% and 11% (improved and new

identifications) for human and mouse respectively and 

enables proteome-wide detection of 5’-extended 

proteoforms, upstream ORF (uORF) translation and near-cognate

translation start sites. The PROTEOFORMER

tool is available as a stand-alone pipeline and has been 

implemented in the Galaxy framework for ease of use.

The sORF pipeline was tested and curated on three 

different cell-lines (HCT116: human, E14 mESC: mouse, 

and S2: fruit fly). The public repository has been made

available at www.sorfs.org (Olexiouk V. et al., in review), 

and so far includes the datasets mentioned above. 

In the study of the effect of DNA methylation at the

proteome level in the DNMT double knockout, we found

that the knockout cells show more significantly upregulated

than downregulated genes and that these upregulated

genes were characterized by higher levels of

promoter methylation in the wild-type cells. Both the MS

and RIBOseq analyses corroborated these findings. 

Preliminary results based on the m6A sequencing confirm 

previous findings on known m6A sequence motifs and

enrichment of m6A sites in specific functional regions 

(around translation start sites and in 3’UTR regions) and 

moreover some examples hint at an effect of m6A on 

ribosomal pausing, after integrating m6A- and RIBOseq 

data. 

REFERENCES 

Ingolia N. et al. Cell 11;147(4):789-802 (2011). 

Crappé, J., Ndah, E. et al. NAR 11;43(5):e29 (2015). 






O21. CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS 

AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION 

APPROACH TO SCORING DOCKING DECOYS 

Qingzhen Hou 1* , Kamil K. Belau 2 , Marc F. Lensink 3 , Jaap Heringa 1 & K. Anton Feenstra 1* . 

Center for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV 

Amsterdam, The Netherlands 1 ; Intercollegiate Faculty of Biotechnology, University of Gdańsk - Medical University of 

Gdańsk, Kładki 24, 80-822 Gdańsk, Poland 2 ; Institute for Structural and Functional Glycobiology (UGSF), CNRS 

UMR8576, FRABio FR3688, University Lille, 59000, Lille, France 3 . 

Protein-protein Interactions (PPIs) play a central role in all cellular processes. Large-scale identification of native binding 

orientations is essential to understand the role of particular protein-protein interactions in their biological context. We 

estimate the binding free energy using coarse-grained simulations with the MARTINI forcefield, and use those to rank 

decoys for 15 CAPRI benchmark targets. In our top 100 and top 10 ranked structures, for the 'easier' targets that have 

many near-native conformations, we obtain a strong enrichment of acceptable or better quality structures; for the 'hard' 

targets with very few near-native complexes in the decoys, our method is still able to retain structures which have native 

interface contacts. Moreover, CLUB-MARTINI is rather precise for some targets and able to pinpoint near-native 

binding modes in top 1, 5, 10 and 20 selections. 

INTRODUCTION 

Measuring binding free energy is essential to understand the 

relevance of particular protein-protein interactions in their 

biological context. Moreover, at the atomic scale, molecular 

simulations give us insight into the physically realistic details 

of these interactions. In our recent study, we successfully 

applied coarse-grained molecular dynamics simulations to 

estimate binding free energy with accuracy similar to, and

500-fold less time than, full atomistic simulation

(May et al., 2014). The approach relied on the availability of 

crystal structures of the protein complex of interest. Here, we 

investigate the effectiveness of this approach as a scoring 

method to identify stable binding conformations among

the decoys produced by protein docking.

We apply our method as an evaluation method to rank more 

than 19 000 docked protein conformations, or ‘decoys’, for 

15 benchmark targets from the Critical Assessment of 

PRedicted Interactions (CAPRI) (Lensink & Wodak, 2014). 

METHODS 

For each target, the binding free energy of all decoys was 

calculated, using the MARTINI forcefield as introduced 

before (May et al., 2014). In short, for a set of closely spaced 

separation distances, we calculate the constraint force applied 

to maintain the set distance. Integrating this force yields a 

potential of mean force (PMF), from which the binding free 

energy is extracted as the highest minus the lowest value. 

Previously, for accuracy, we used up to 20 replicate 

simulations for each distance in the PMF, but for efficiency, 

here we use only a single replicate initially. We then selected 

the lowest-scoring half to run an additional four replicates to 

obtain better sampling and more accurate estimates of the 

binding free energy. In total, we used approximately 800 000 

core-hours of compute time. 
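The free energy extraction described above can be pictured with a minimal Python sketch: the mean constraint force is integrated over the separation distance with the trapezoidal rule, and the binding free energy is taken as the highest minus the lowest value of the resulting PMF. The function names, the toy distances and forces, and the sign convention of the integral are assumptions made for illustration only; they are not taken from the workflow of May et al. (2014).

```python
from typing import List, Sequence


def potential_of_mean_force(distances: Sequence[float],
                            mean_forces: Sequence[float]) -> List[float]:
    """Cumulative trapezoidal integral of the mean force over distance.

    Assumption: the PMF is set to zero at the first distance and the
    force is integrated without a sign flip.
    """
    if len(distances) != len(mean_forces) or len(distances) < 2:
        raise ValueError("need at least two matching distance/force points")
    pmf = [0.0]
    for i in range(1, len(distances)):
        step = distances[i] - distances[i - 1]
        area = 0.5 * (mean_forces[i] + mean_forces[i - 1]) * step
        pmf.append(pmf[-1] + area)
    return pmf


def binding_free_energy(pmf: Sequence[float]) -> float:
    """Highest minus lowest PMF value, as described in the text."""
    return max(pmf) - min(pmf)


# Hypothetical toy profile: distances in nm, mean forces in kJ/mol/nm.
distances = [0.0, 1.0, 2.0, 3.0]
forces = [0.0, 2.0, 2.0, 0.0]

pmf = potential_of_mean_force(distances, forces)
dg = binding_free_energy(pmf)
print(pmf, dg)
```

Because the estimate is a difference between two PMF values, it does not depend on where the zero of the PMF is placed, and flipping the sign of the force leaves it unchanged. With several replicates per distance, the per-distance forces would first be averaged before integration.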


We obtained strong enrichment of acceptable and high 

quality structures in the TOP 100 based on our PMF free 

energies, as shown in Figure 1. We estimate the error of our 

energies to be significant. This can be improved by increasing

sampling, but remains very expensive.

Moreover, for several targets, we can select near-native 

structures in top 1, top 5 and top 10 as shown in Table 1, 

which means that, overall, our method is rather precise. From 

estimates of the error, we expect we can improve accuracy by 

extending the amount of sampling done at each distance. In 

conclusion, our approach can find favorable interactions from 

available candidates produced by docking programs. To the 

best of our knowledge, this is the first time interaction free 

energy from a coarse-grained force field is used as a scoring 

method to rank docking solutions at a large scale. 
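The success percentages reported for the top-ranked selections correspond to the share of acceptable-or-better structures among the N best-ranked decoys of a target. A minimal sketch of that bookkeeping is given below. The quality labels follow the CAPRI categories named in the text; the function name and the example ranking are hypothetical, and the example is constructed only so that its counts reproduce the T41 row of the TOP 5 selection.

```python
from typing import Dict, List

QUALITY_CLASSES = ("high", "medium", "acceptable")


def top_n_success(ranked_quality: List[str], n: int) -> Dict[str, float]:
    """Count quality classes in the n best-ranked decoys.

    ranked_quality lists the CAPRI quality label of each decoy, ordered
    from the most to the least favorable estimated binding free energy.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_quality[:n]
    result: Dict[str, float] = {q: top.count(q) for q in QUALITY_CLASSES}
    hits = sum(result[q] for q in QUALITY_CLASSES)
    result["total_percent"] = 100.0 * hits / n
    return result


# Hypothetical ranking for one target (best-scoring decoy first).
ranking = ["acceptable", "acceptable", "incorrect", "acceptable",
           "acceptable", "incorrect", "medium"]

summary = top_n_success(ranking, 5)
print(summary)
```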

FIG. 1. Enrichment in percentage of acceptable or better structures. For each of the 13 targets with acceptable or

better decoys, two columns (from left to right) stand for CAPRI Score_set and top 100 in our rank of binding

free energy calculation. Red, orange and yellow represent the fractions of high, medium and acceptable quality

structures over the number of all or selected docking decoys. The order (left to right) is based on the fraction

of acceptable structures in each target (easy to difficult)

Table 1. Success selections of top ranked structures

Selection   Target   High   Medium   Acceptable   Total (%)
TOP 1       T47         1        0            0         100
TOP 1       T53         0        0            1         100
TOP 5       T47         3        2            0         100
TOP 5       T41         0        0            4          80
TOP 5       T53         0        0            3          60
TOP 5       T37         0        2            0          40
TOP 10      T47         7        3            0         100
TOP 10      T41         0        1            7          80
TOP 10      T53         0        1            5          60
TOP 10      T37         0        3            0          30
TOP 10      T50         0        0            1          10
TOP 20      T47        14        6            0         100
TOP 20      T41         0        4           13          85
TOP 20      T53         0        3            9          60
TOP 20      T37         0        4            2          30
TOP 20      T50         0        0            3          15
TOP 20      T40         1        2            0          15
TOP 20      T46         0        0            1           5

REFERENCES 

May, Pool, Van Dijk, Bijlard, Abeln, Heringa & Feenstra. Coarse-grained

versus atomistic simulations: realistic interaction free energies

for real proteins. Bioinformatics (2014) 30: 326-334. 

Lensink & Wodak. Score_set: A CAPRI benchmark for scoring protein 

complexes. Proteins (2014) 82:3163-3169. 






O22. PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS 

DATA 

Elien Vandermarliere 1,2* , Davy Maddelein 1,2 , Niels Hulstaert 1,2 , Elisabeth Stes 1,2 , Michela Di Michele 1,2 , 

Kris Gevaert 1,2 , Edgar Jacoby 3 , Dirk Brehmer 3 & Lennart Martens 1,2 . 

Department of Medical Protein Research, VIB 1 ; Department of Biochemistry, Ghent University 2 ; Oncology Discovery, 

Janssen Research and Development – Janssen Pharmaceutica, Beerse 3 . * elien.vandermarliere@ugent.be 

Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational 

modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and 

their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to 

obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based 

approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labelling, are frequently used 

to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and time 

consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based 

conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels. 

Moreover, PepShell allows the comparison of experiments under different conditions which include proteolysis times or 

binding of the protein to different substrates or inhibitors. 

INTRODUCTION 

The study of protein structure with mass spectrometry, 

called conformational proteomics, is frequently used to 

characterize protein conformations and dynamics. Most of 

these methods exploit the surface accessibility of amino 

acids within the native protein conformation or more 

specifically, the differences in protein surface accessibility 

in different situations within a protein structure. 

The experimental setup and subsequent workflow of a 

conformational proteomics experiment do not deviate 

drastically from that of a classic mass spectrometry-based 

experiment in which peptides present in a complex peptide 

mixture are identified. The final outcome of a 

conformational proteomics experiment is a list of peptides. 

These peptide lists typically span multiple experimental 

conditions across which the structural observations are to 

be compared; the peptide lists have to be combined and, if 

available, mapped onto the structure of the protein. 

To fulfill these latter steps, we developed PepShell 

(Vandermarliere et al., 2015), to guide the interpretation 

of mass spectrometry-based proteomics data in the context 

of protein structure and dynamics. 

TOOL DESCRIPTION 

PepShell aids the user in the interpretation of the outcome 

of conformational proteomics experiments and is 

composed of three panels: the experiment comparison 

panel, the PDB view panel, and the statistics panel. 

The data to analyze 

PepShell accepts input from limited proteolysis,

hydrogen-deuterium exchange, MS footprinting and 

stable-isotope labelling experiments. The data have to 

be present in a comma-separated text file format. The 

project selection interface allows the user to select a 

reference project and to indicate which setups need to 

be compared with each other. 

Experiment comparison 

This panel allows the comparison of the selected 

experimental setups at the sequence level. For each 

experimental condition, the identified and quantified 

peptides are mapped onto the sequence of the protein 

of interest. 

The PDB view panel 

Here, the detected peptides are mapped on the protein 

structure. The main requirement is the availability of a 

3D structure of the protein of interest. 

Statistics within PepShell 

In this panel, the peptides of interest can be analyzed 

in more detail. The tryptic cleavage probability predicted

by CP-DT (Fannes et al., 2013) is given for each tryptic

cleavage position. In addition, a detailed

comparison of the peptide ratios over the different

experimental setups is allowed. 
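The sequence-level mapping performed in the experiment comparison panel can be pictured with the short sketch below, which locates each identified peptide in the protein sequence and compares the covered positions between two conditions. PepShell itself is a Java application, and this Python fragment is not its implementation. The protein sequence, the peptides, the condition names and the function names are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple

Hit = Tuple[str, int, int]  # (peptide, start, end), 1-based and inclusive


def map_peptides(protein: str, peptides: List[str]) -> List[Hit]:
    """Find every occurrence of each peptide in the protein sequence."""
    hits: List[Hit] = []
    for peptide in peptides:
        if not peptide:
            continue  # ignore empty entries in the peptide list
        start = protein.find(peptide)
        while start != -1:
            hits.append((peptide, start + 1, start + len(peptide)))
            start = protein.find(peptide, start + 1)
    return hits


def covered_positions(hits: List[Hit]) -> Set[int]:
    """Residue positions covered by at least one mapped peptide."""
    positions: Set[int] = set()
    for _, start, end in hits:
        positions.update(range(start, end + 1))
    return positions


# Hypothetical protein and peptide lists for two experimental conditions.
protein = "MKTAYIAKQRQISFVK"
conditions: Dict[str, List[str]] = {
    "control": ["TAYIAK", "QISFVK"],
    "inhibitor": ["TAYIAK"],
}

coverage = {name: covered_positions(map_peptides(protein, peptides))
            for name, peptides in conditions.items()}

# Residues observed in the control but no longer observed with the inhibitor.
only_in_control = sorted(coverage["control"] - coverage["inhibitor"])
print(only_in_control)
```

Working with residue positions rather than peptide strings makes conditions comparable even when the same region is covered by different overlapping peptides.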

CONCLUSIONS 

The increasing popularity of structural proteomics is in

stark contrast with the limited availability of efficient tools to

visualize this multitude of data. Some tools that aid data

interpretation are available, but these are

approach-specific and are aimed primarily at mass 

spectrometrists with a specific focus on the experimental 

mass spectrometry data and their processing and 

interpretation. PepShell on the other hand is intended to 

support downstream users to interpret the results obtained 

from a variety of conformational proteomics approaches. 

PepShell uses the peptide lists to compare different 

experimental conditions and allows the visualization of 

these differences onto the structure of the protein. As such, 

PepShell bridges the gap between mass spectrometry-based

proteomics data and their interpretation in the

context of protein structure and dynamics. 

PepShell is an open source Java application. Its binaries, 

source code and documentation can be found at: 

compomics.github.io/projects/pepshell.html 

REFERENCES 

Fannes T et al. J Proteome Res 12, 2253-2259 (2013). 

Vandermarliere E et al. J Proteome Res 14, 1987-1990 (2015). 






O23. INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK 

Thomas Moerman 1,2,5* , Dries Decap 3,5 , Toni Verbeiren 2,5 , Jan Fostier 3,5 , Joke Reumers 4,5 , Jan Aerts 2,5 . 

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Visual Data Analysis Lab, ESAT – 

STADIUS, Dept. of Electrical Engineering, KU Leuven – iMinds Medical IT 2 ; Department of Information Technology, 

Ghent University – iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium 3 ; Janssen Research & Development, 

a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium 4 ; ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, 

Belgium 5 . * thomas.moerman@esat.kuleuven.be 

Researchers benefit greatly from tools that allow hands-on, interactive and visual experimentation with data, unimpeded

by setup complexities or scaling issues resulting from large data sizes. In our contribution we present an implementation

of an interactive VCF comparison tool, making use of a technology stack based on Apache Spark [1], Big Data 

Genomics Adam [2] and Spark Notebook [3]. 

INTRODUCTION 

Current genomics data formats and processing pipelines 

are not designed to scale well to large datasets [1]. They 

were also not conceived to be used in an interactive 

environment. The bioinformatics field typically struggles 

with these difficulties as high-throughput, next-generation 

sequencing jobs produce large data files. Although many 

high-quality bioinformatics processing tools exist, it is 

often hard to express analyses in a consolidated and 

reproducible fashion. These tools typically do not allow users to

iterate interactively on an analysis while visualizing

results. 

OBJECTIVE 

Analysis tools preferably provide the expressive power to 

define ad hoc queries on data. Biologists or clinical 

researchers, when dealing with genomic variants encoded 

in VCF files, typically perform queries comparing one 

protocol to another, tumor to normal, treated to untreated 

cell lines and so on. Ideally these comparisons make use 

of all quality-related metrics stored in VCF files (e.g. 

coverage depth, quality score) as well as the actual region 

annotations (e.g. repeat regions, exonic regions) and 

generate visual output. We aim to implement a tool that 

provides the necessary expressiveness as well as the 

computational power needed for making these types of 

analyses practical and interactive. 

APPROACH 

Recent advances in computation platform technology 

(Spark) and notebook technologies (Spark Notebook) 

enable orchestration of distributed jobs on cluster 

infrastructure from a programmable environment running 

in a browser. These technologies, combined with Adam 

[2], a library specifically designed for processing next-generation

sequencing data, provide the necessary

architectural bedrock for our purposes. 

Analyses are expressed in a high-level programming 

language (Scala), operating on specialized data structures 

(Spark resilient distributed datasets, or RDDs [1]) that 

abstract away the complexity of defining distributed

computations on data sets too large for single node 

processing. Adam meets the need for an explicit data 

schema for abstraction of the different bioinformatics file 

formats. 

RESULTS & CONTRIBUTIONS 

Our work focuses on the pairwise comparison of annotated 

VCF files. Our contributions consist of two open-source 

Scala libraries: VCF-comp [4] and Adam-FX [5]. VCF-comp

implements the concordance by variant position

algorithm, which segregates the variants from two VCF 

inputs (A, B) into 5 categories: A/B-unique, concordant 

(equal variants on position) and A/B-discordant (different 

variants on position). This results in a distributed data 

structure from which we project visualizations, presented 

to the user by means of the Spark Notebook interface. 
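The five categories produced by the concordance by variant position step can be illustrated with a small single-machine sketch. VCF-comp is a Scala library operating on Spark RDDs, so the Python code below is not its implementation and does not use its API. The variant tuple layout, the function name, and the rule that a position counts as discordant whenever the two inputs report different sets of variants there are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)
Position = Tuple[str, int]


def group_by_position(variants: Iterable[Variant]) -> Dict[Position, Set[Variant]]:
    """Index variants by (chromosome, position)."""
    grouped: Dict[Position, Set[Variant]] = defaultdict(set)
    for variant in variants:
        grouped[(variant[0], variant[1])].add(variant)
    return grouped


def concordance_by_position(a: Iterable[Variant],
                            b: Iterable[Variant]) -> Dict[str, List[Variant]]:
    """Split the variants of inputs A and B into five categories."""
    by_pos_a, by_pos_b = group_by_position(a), group_by_position(b)
    result: Dict[str, List[Variant]] = {
        "A-unique": [], "B-unique": [], "concordant": [],
        "A-discordant": [], "B-discordant": [],
    }
    for pos in sorted(set(by_pos_a) | set(by_pos_b)):
        in_a = by_pos_a.get(pos, set())
        in_b = by_pos_b.get(pos, set())
        if not in_b:
            result["A-unique"].extend(sorted(in_a))
        elif not in_a:
            result["B-unique"].extend(sorted(in_b))
        elif in_a == in_b:
            result["concordant"].extend(sorted(in_a))
        else:  # same position, different variants (assumed rule)
            result["A-discordant"].extend(sorted(in_a))
            result["B-discordant"].extend(sorted(in_b))
    return result


# Hypothetical toy inputs, e.g. tumor (A) versus normal (B).
vcf_a = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")]
vcf_b = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "A"), ("chr3", 10, "T", "G")]

categories = concordance_by_position(vcf_a, vcf_b)
for name, variants in categories.items():
    print(name, variants)
```

In a distributed setting the same grouping would typically be expressed as a keyed join on position; the dictionary-based version here only shows the categorization logic.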

FIGURE 1 Allele frequency distribution for concordant and unique 

variants in a tumor vs. normal VCF comparison. 

FIGURE 2 Functional impact (SnpEff annotation) histogram for 

concordant, unique and discordant variants in a tumor vs. normal VCF 

comparison. 

Adam-FX extends the Adam data structures and file 

parsing logic in order to support queries on SnpEff [6], 

SnpSift [7], dbSNP and Clinvar annotations. 

We believe our tool facilitates the comparison of 

annotated VCF files in an interactive manner while 

reducing runtime by leveraging the Spark platform. 

REFERENCES 

[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant 

abstraction for in-memory cluster computing." 

[2] Massie, Matt, et al. "Adam: Genomics formats and processing 

patterns for cloud scale computing." 

[3] https://github.com/andypetrella/spark-notebook 

[4] https://github.com/tmoerman/vcf-comp 

[5] https://github.com/tmoerman/adam-fx 

[6] Cingolani, P, et al. "A program for annotating and predicting the 

effects of single nucleotide polymorphisms, SnpEff: SNPs in the 

genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Fly 

(Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672 






O24. 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL 

LONG-RANGE INTERACTIONS WITH CANCER GENES 

Sepideh Babaei 1 , Waseem Akhtar 2 , Johann de Jong 3 , Marcel Reinders 1 & Jeroen de Ridder 1* . 

Delft Bioinformatics Lab, Delft University of Technology 1 ; Division of Molecular Genetics 2 ; 

Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 3 . * j.deridder@tudelft.nl 

Genomically distal mutations can contribute to deregulation of cancer genes by engaging in chromatin interactions. To 

study this, we overlay viral cancer-causing insertions obtained in a murine retroviral insertional mutagenesis screen with 

genome-wide chromatin conformation capture data. In this talk, we show that insertions tend to cluster in 3D hotspots 

within the nucleus. The identified hotspots are significantly enriched for known cancer genes, and bear the expected 

characteristics of bona-fide regulatory interactions, such as enrichment for transcription factor binding sites. 

Additionally, we observe a striking pattern of mutually exclusive integration. This is an indication that insertions in these

loci target the same gene, either in their linear genomic vicinity or in their 3D spatial vicinity. Our findings shed new

light on the repertoire of targets obtained from insertional mutagenesis screening and underline the importance of

considering the genome as a 3D structure when studying effects of genomic perturbations. 

Evidence is mounting that the organization of the genome 

in the cell nucleus is extremely important for gene 

regulation. This finding is facilitated by recent 

technological advances (i.e. Hi-C) that enabled researchers 

to accurately capture the 3D conformation of 

chromosomes in the cellular nucleus at a high resolution. 

We have exploited a large existing Hi-C dataset to take 3D 

chromosome conformation into account while determining 

hotspots of viral cancer-causing mutations. These 

identified hotspots are significantly enriched for known 

cancer genes, and bear the expected characteristics of 

bona-fide regulatory interactions, such as enrichment for 

transcription factor binding sites. Additionally, we observe 

a striking pattern of mutually exclusive integration. This is

an indication that insertions in these loci target the same 

gene through long-range interactions (1). 

In a second study (2), we performed a similar analysis that 

shows a striking relation between genome conformation 

and expression correlation in the brain. Although recent 

studies have shown that a strong correlation between

chromatin interactions and gene co-expression exists, 

predicting gene co-expression from frequent long-range 

chromatin interactions remains challenging. We address 

this by characterizing the topology of the cortical 

chromatin interaction network using scale-aware 

topological measures. We demonstrate that based on these 

characterizations it is possible to accurately predict spatial 

co-expression between genes in the mouse cortex. 

Consistent with previous findings, we find that the 

chromatin interaction profile of a gene-pair is a good 

predictor of their spatial co-expression. However, the 

accuracy of the prediction can be substantially improved 

when chromatin interactions are described using scale-aware

topological measures of the multi-resolution

chromatin interaction network. We conclude that, for co-expression

prediction, it is necessary to take into account 

different levels of chromatin interactions ranging from 

direct interaction between genes (i.e. small-scale) to 

chromatin compartment interactions (i.e. large-scale). 

In this talk, I will focus on the computational and 

statistical methods that are required to insightfully

overlay high-resolution conformation maps obtained

using Hi-C with ~20,000 cancer-causing retroviral

mutations and expression maps from the Allen Brain 

Atlas. 

FIGURE 1. Circos visualization of the insertion clusters that co-localize

with the Notch1 locus. 

REFERENCES 

(1) Babaei, S. et al. Nature Communications (2015). 

(2) Babaei and Mahfouz et al. PLoS Computational Biology (2015) 



Poster 


P1. KNN-MDR APPROACH FOR DETECTING GENE-GENE 

INTERACTIONS 

Sinan Abo alchamlat 1 & Frédéric Farnir 1,* . 

Fundamental and Applied Research for Animals & Health (FARAH), Sustainable Animal Production, University of 

Liège 1 . * f.farnir@ulg.ac.be 

These last years have seen the emergence of a wealth of biological information. Facilitated access to the genome 

sequence, along with massive data on gene expression and on proteins, has revolutionized research in many fields

of biology. For example, the identification of up to several million SNPs in many species and the development of chips

allowing for an effective genotyping of these SNPs in large cohorts have triggered the need for statistical models able to 

identify the effects of individual and of interacting SNPs on phenotypic traits in this new high-dimensional landscape. 

Our work is a contribution to this field.

INTRODUCTION 

GWAS has allowed the identification of hundreds of

genetic variants associated with complex diseases and traits,

and provided valuable insight into their genetic

architecture (Wu M et al., 2010). Nevertheless, most

variants identified so far have been found to provide

relatively little information about the relationship

between changes at the genomic level and phenotypes 

because of the lack of reproducibility of the findings, or 

because these variants most of the time explain only a 

small proportion of the underlying genetic variation (Fang 

G et al., 2012). This observation, referred to as the ‘missing

heritability’ problem (Manolio T et al., 2009), of course

raises the question: where does the unexplained genetic 

variation come from? A tentative explanation is that genes 

do not work in isolation, leading to the idea that sets of 

genes (or gene networks) could have a major effect on the

tested traits while almost no marginal – i.e. individual 

gene – effect is detectable. Consequently, an important 

question concerns the exact relationship between the 

genomic configuration, including the interactions between 

the involved genes, and the phenotypic expression. 

METHODS 

To tackle this subject, different statistical methods such as 

MDR (Multifactor Dimensionality Reduction) have been proposed

for detecting gene-gene interaction (Ritchie, D., et al., 

2001); their relative performances remain largely unclear, 

and their extension to situations combining many variants 

turns out to be challenging. We therefore propose a novel MDR

approach using K-Nearest Neighbors (KNN) methodology 

(KNN-MDR) for detecting gene-gene interaction as a 

possible alternative, especially when the number of 

involved determinants is potentially high. The idea behind 

our method is to replace the status allocation used in 

classical MDR methods by a KNN approach: the majority 

vote occurs in the k (a parameter that must be tuned and 

depends on the various possible scenarios) nearest 

neighbors instead of within the (potentially empty) cell 

determined by the tested attributes of the individual to be 

classified. The steps other than classification are identical 

in both methods (i.e. cross-validation, attributes selection, 

training and tests balanced accuracy computations, best 

model selection procedure). 
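The classification step that distinguishes KNN-MDR from classical MDR can be sketched as follows: instead of taking the status from the genotype cell of the individual, the status is decided by a majority vote among the k nearest training individuals on the tested SNPs. The abstract does not specify the distance measure or the tie-breaking rule, so the Manhattan distance on 0/1/2 genotype codes, the rule that ties are classified as controls, the function names and the toy data below are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Genotypes = Tuple[int, ...]  # genotype codes (0, 1 or 2) for the tested SNPs


def knn_status(query: Genotypes,
               training: Sequence[Tuple[Genotypes, int]],
               k: int) -> int:
    """Majority vote (1 = case, 0 = control) among the k nearest individuals."""
    def distance(x: Genotypes, y: Genotypes) -> int:
        # Assumption: Manhattan distance on genotype codes.
        return sum(abs(a - b) for a, b in zip(x, y))

    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    cases = sum(status for _, status in nearest)
    # Assumption: a tie is classified as control.
    return 1 if 2 * cases > len(nearest) else 0


def balanced_accuracy(true: List[int], predicted: List[int]) -> float:
    """Mean of sensitivity and specificity."""
    positives = sum(true)
    negatives = len(true) - positives
    tp = sum(1 for t, p in zip(true, predicted) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(true, predicted) if t == 0 and p == 0)
    return 0.5 * (tp / positives + tn / negatives)


# Hypothetical training set for two SNPs: (genotypes, status).
training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
            ((2, 2), 1), ((2, 1), 1), ((1, 2), 1)]

# Hypothetical test individuals with their true status.
test_individuals = [((0, 0), 0), ((2, 2), 1)]
true_status = [status for _, status in test_individuals]
predicted_status = [knn_status(g, training, k=3) for g, _ in test_individuals]

print(predicted_status, balanced_accuracy(true_status, predicted_status))
```

A query whose genotype combination never occurs in the training data, which would fall into an empty cell in classical MDR, still receives a status here because the vote is taken over its neighbors.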


Experimental results on both simulated data and real 

genome-wide data from Wellcome Trust Case Control 

Consortium (WTCCC) (Wellcome Trust Case Control C., 

2007) show that KNN-MDR has interesting properties in 

terms of accuracy and power, and that, in many cases, it 

significantly outperforms its recent competitors. 

FIGURE 1. Comparison of the inter-chromosomal interactions detected 

on the WTCCC dataset by KNN-MDR and other interaction methods 

using this same dataset as example (Shchetynsky et al. (2015); Zhang et 

al. (2012)) 

The results of this study allow us to draw some 

conclusions about the performance of KNN-MDR: on the 

one hand, the performance of KNN-MDR in

detecting gene-gene interactions is similar to that

of MDR for small problems. On the other

hand, KNN-MDR has significant advantages for large

samples and large numbers of markers (as in GWAS) when

detecting gene effects. KNN-MDR can therefore be

seen as a new and more comprehensive method than MDR

and other competitors for detecting gene-gene interaction. 

REFERENCES 

Wu M et al. American journal of human genetics 86, 929-942 (2010). 

Fang G et al. PloS one 7, 1932-6203 (2012). 

Manolio T et al. Nature 461, 747-753 (2009). 

Ritchie, D., et al. Am J Hum Genet,69, 138-147 (2001). 

Wellcome Trust Case Control C. Nature, 447(7145):661-678 (2007). 

Shchetynsky K et al. Clinical immunology 158(1):19-28 (2015). 

Zhang J et al. American Medical Journal 3(1) (2015). 



P2. CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC 

PATHWAYS IN FUNGI 

Maria Victoria Aguilar Pontes*, Eline Majoor, Claire Khosravi, Ronald P. de Vries, Miaomiao Zhou 

Fungal Physiology, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands; Fungal Molecular Physiology, 

Utrecht University, The Netherlands.*v.aguilar@cbs.knaw.nl, e.majoor@cbs.knaw.nl, c.khosravi@cbs.knaw.nl, 

r.devries@cbs.knaw.nl, m.zhou@cbs.knaw.nl 

INTRODUCTION 

Plant polysaccharides are among the major substrates for 

many fungi. After extracellular degradation, the 

monomeric components (mainly monosaccharides) are 

taken up by the cells and used as carbon sources to enable 

the fungus to grow. This would also imply that the range 

of catabolic pathways of a fungus may be correlated with the

composition of the polysaccharides it can degrade.

Several carbon catabolic pathways have been studied in 

different fungi able to grow on plant biomass such as 

Aspergillus niger (De Vries, et al., 2012). 

In this study we have tested this hypothesis by identifying

the presence of genes of a number of catabolic pathways 

in selected fungi from the Ascomycota and the 

Basidiomycota. 

METHODS 

A total of 104 fungal genomes were identified from the 

JGI fungal program (Grigoriev IV, et al., 2011), Broad 

Institute of Harvard and MIT, AspGD (Arnaud, et al., 

2012) and NCBI genbank (Benson, et al., 2012) (data 

version March 2013). 

We identified A. niger genes involved in individual 

pathways from literature. Genome-scale protein ortholog

clusters were detected with OrthoMCL (Li, et al., 2003), using

inflation factor 1, E-value cutoff 1E-3 and percentage match

cutoff 60%, as for the identification of distant homologs

(Boekhorst, et al., 2007). The all-vs-all BlastP search

required by OrthoMCL was carried out in parallel on a grid of 500

computers. The ortholog clusters were

then curated manually by expert knowledge and literature 

search. Manual curation was aided by aligning the amino 

acid sequences of the hits for each query together with a 

suitable outgroup by MAFFT (Katoh, et al., 2009; Katoh, 

et al., 2005), after which neighbor joining trees were 

generated using MEGA5 with 1000 bootstraps. Genes that 

were clearly separated from the query branch in the trees 

were removed from the results. 
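The similarity cutoffs described above could be applied to all-vs-all BlastP output roughly as follows. This is a minimal sketch and not the authors' pipeline: it assumes BLAST tabular output (`-outfmt 6`) and reads "percentage match" as aligned length relative to query length. The function name, the `query_lengths` mapping and the sample hits are hypothetical.

```python
import csv
import io

def filter_hits(blast_tsv, query_lengths, max_evalue=1e-3, min_match_pct=60.0):
    """Keep BLAST hits passing the E-value and percentage-match cutoffs.

    blast_tsv: text in BLAST tabular format (-outfmt 6).
    query_lengths: hypothetical mapping of query ID -> protein length.
    """
    kept = []
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue
        query, subject = row[0], row[1]
        aln_len, evalue = int(row[3]), float(row[10])
        if query == subject:
            continue  # skip self hits from the all-vs-all search
        # Assumption: "percentage match" = aligned length / query length.
        match_pct = 100.0 * aln_len / query_lengths[query]
        if evalue <= max_evalue and match_pct >= min_match_pct:
            kept.append((query, subject, evalue, match_pct))
    return kept

# Made-up hits for a 200-residue query.
SAMPLE = "\n".join([
    "queryA\tgeneB\t72.5\t180\t48\t2\t1\t180\t5\t184\t1e-50\t210",
    "queryA\tgeneC\t35.0\t60\t39\t0\t10\t69\t1\t60\t1e-4\t45",
    "queryA\tgeneD\t80.0\t190\t38\t0\t1\t190\t1\t190\t0.5\t30",
])
print(filter_hits(SAMPLE, {"queryA": 200}))
```

Only the first hit survives: the second covers too little of the query and the third has too high an E-value.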


Patterns of pathway gene presence are conserved among 

clades. Galacturonic acid and rhamnose pathways are 

missing in yeast. The pentose pathway is conserved in 

Pezizomycetes and Basidiomycota, which explains their 

ability to grow on pentose as carbon source (www.fung-growth.org). 

These results may indicate that different evolutionary 

tracks have led to different metabolic strategies. 

The expression of metabolic genes will be evaluated for 

those species for which transcriptome data are available. 

The results will be compared to growth profiling data of 

the species on a set of plant-related poly- and 

monosaccharides to determine to which extent the genome 

content fits the physiological ability of the species. 

ACKNOWLEDGEMENTS 

The comparative genomics analysis was carried out on the 

Dutch national e-infrastructure with the support of SURF 

Foundation (e-infra1300787). 

REFERENCES 

Arnaud, M.B., et al., Nucleic Acids Res, 40, 653-659 (2012). 

Benson, D.A., et al., Nucleic Acids Res, 40, 48-53 (2012). 

Boekhorst, J., et al., BMC Bioinformatics, 8, 356-363 (2007). 

De Vries, R.P., et al. Pan Stanford Publishing Pte. Ltd, Singapore (2012). 

Grigoriev, I.V., et al., Mycology, 2, 192-209 (2011). 

Katoh, K., et al., Methods Mol Biol, 537, 39-64 (2009). 

Katoh, K., et al., Nucleic Acids Res, 33, 511-518 (2005). 

Li, L., et al., Genome Res, 13, 2178-2189 (2003). 




Poster 


P3. VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS 

USING POLIMERO AND POLIMERO-BIO 

Daniel Alcaide 1,2* , Ryo Sakai 1,2 , Raf Winand 1,2 , Toni Verbeiren 1,2 , Thomas Moerman 1,2 , Jansi Thiyagarajan & Jan Aerts. 

KU Leuven Department of Electrical Engineering-ESAT, STADIUS, VDA-lab, Belgium 1 ; iMinds Medical IT, Leuven, 

Belgium 2 . * daniel.alcaide@esat.kuleuven.be 

Although there are currently several tools for fast prototyping in data visualization, the specifics of the biological domain 

often require the development of custom visuals. This leads to the issue that we end up re-implementing the base visuals 

over and over if we want to build them into a specific analysis tool. This work presents a proof-of-principle library for 

creating composable linked data visualizations, including an initial collection of parsers and visuals with an emphasis on 

biology. With Polimero and Polimero-bio, we want to create a library to build scalable domain-specific visual data 

exploration tools using a collection of D3-based reusable web components. 

INTRODUCTION 

As a visual data analysis lab, we often combine 

(brush/link) well-known data visualization techniques 

(scatterplots, barcharts, etc.). Although it is possible to use 

general-purpose tools like Tableau or Excel, the specific 

needs of the biological field usually demand 

custom data visualizations that are not included in 

these commercial solutions (Figure 1). 

These visual implementations have to be re-implemented for each new tool. The present solution offers an alternative for creating composable linked data visualizations. 

FIGURE 1. Klaudia-plot - Visualization created with Polimero that shows the read pairs mapped around a deletion in the NA12878 genome on chromosome 20. 

METHODS 

Polimero is a library that uses the Polymer implementation (www.polymer-project.org) to create visual web components. Web components are an emerging W3C standard for extending the HTML platform to create web-based apps. This new technology includes custom elements, HTML templates, shadow DOM, and HTML imports (Figure 2). The D3-based custom elements that Polimero and Polimero-bio offer allow us to create a scalable framework for building domain-specific visual data exploration tools. 

Leveraging the web components concepts, the main characteristics of the Polimero library are: 

• Modular: Each element is an independent module that has a specific purpose (data, visualization, computation). 

• Composable: The elements can be combined to set up new functionality (linking, filtering, reading different data sources). 

• Encapsulated: Web components aim to provide the user with a simple element interface, so that there is no need to deal with the underlying code. 

• Reusable: The same element can be used in the same project for different objectives. 

• Linkable: Polimero elements can speak to each other, allowing the use of events for brushing and linking. 

• Embeddable: The elements can be added to any existing framework that uses HTML (e.g. ipython notebook). 

FIGURE 2. HTML example - Representing Polimero elements to create visualization. 


This library makes it possible to create applications that 

are composable, encapsulated, and reusable. This is 

valuable both for the developer/designer, who can easily 

create and plug in custom visual encodings, and for the 

end-user, who can create linked visualizations by dragging 

existing components onto a canvas using the Polimero-designer. 

Polimero and Polimero-bio are still in development but 

they are available at www.bitbucket.org/vda-lab/polimero. 




Poster 


P4. DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND 

Ganna Androsova 1* , Reinhard Schneider 1 & Roland Krause 1 . 

Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg 1 . 

* ganna.androsova@uni.lu 

INTRODUCTION 

Molecular interaction networks are dense structures of 

protein interactions, from which we would like to extract 

relevant sub-networks specific to the disease of interest. 

Such a disease-specific network is often constructed by the 

seed-and-extend algorithm, which extracts the relevant 

genes from an organism-wide, weighted interaction 

network, typically as its first-neighbourhood. Seed-and-extend 

is suitable when disease biomarkers are poorly 

investigated and the knowledge about biomarker 

interaction partners is missing or when the interacting 

partners are established but the connections are missing 

between them. 

Our syndrome of interest is the postoperative cognitive 

impairment frequently experienced by elderly patients, 

characterized by progressive cognitive and sensory decline. 

The acute phase of cognitive impairment is postoperative 

delirium (POD). The underlying pathophysiological 

mechanisms have not been studied in depth due to the 

multifactorial pathogenesis of this postoperative cognitive 

impairment. The known POD-related genes can be 

integrated into the draft network for exploration on a 

systems level. 

Here, we investigate how stable the results of such 

analysis are when the input set of seed genes is varied, and 

what role stringency plays in the initial selection of the 

networks. Ideally, we would like to find the “sweet spot” 

that provides a biologically meaningful trade-off between 

false positives and false negatives to be used for such analyses. 

METHODS 

The list of disease-related genes/proteins was retrieved 

from literature studies in the PubMed database. 

We extended the seed list with directly linked interactors 

by seed-and-extend from protein-protein interaction 

network databases. We extracted all interactions between 

seeds and connected neighbours, which resulted in the 

first-degree network. 
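The extension step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: interactions are given as weighted pairs, `min_weight` stands in for the stringency of the initial network selection, and all identifiers in the toy data are hypothetical.

```python
def seed_and_extend(interactions, seeds, min_weight=0.0):
    """Return the first-degree network around the seed proteins.

    interactions: iterable of (protein_a, protein_b, weight) tuples.
    min_weight: hypothetical stringency cutoff on the interaction weight.
    """
    seeds = set(seeds)
    edges = [(a, b, w) for a, b, w in interactions if w >= min_weight and a != b]
    # Extend: collect every direct interactor of a seed.
    nodes = set(seeds)
    for a, b, _ in edges:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)
    # Extract all interactions among seeds and their connected neighbours.
    network = [(a, b, w) for a, b, w in edges if a in nodes and b in nodes]
    return nodes, network

# Toy network: S1 and S2 are seeds, N1 and N2 their direct interactors.
TOY = [
    ("S1", "N1", 0.9), ("N1", "N2", 0.8), ("S2", "N2", 0.7),
    ("N2", "X", 0.9), ("X", "Y", 0.9), ("S1", "Z", 0.2),
]
nodes, network = seed_and_extend(TOY, {"S1", "S2"}, min_weight=0.5)
print(sorted(nodes), len(network))
```

Raising or lowering `min_weight` changes which neighbours are pulled in, which is one simple way to probe how sensitive the resulting network is to stringency.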

Next, we evaluated the biological enrichment of the 

extracted network, its topological parameters and its overlap with 

other diseases, and clustered the network into smaller 

sub-networks. 


The POD network (Figure 1) follows a scale-free 

distribution and consists of 541 proteins with 5,242 

interactions between them. 

FIGURE 1. Postoperative delirium molecular network. 

The network was evaluated topologically by degree 

assortativity, density, shortest path, eccentricity and other 

measures. Pathway enrichment analysis showed 

glucocorticoid receptor signalling, immune response, and 

dopamine signalling as relevant to POD (Figure 2). 

FIGURE 2. Postoperative delirium pathway enrichment analysis. 

The top 5 hub proteins included UBC_HUMAN, 

GCR_HUMAN, P53_HUMAN, HS90A_HUMAN and 

EGFR_HUMAN. The appearance of p53 and other very 

frequently reported genes among the top 5 hubs, in our study and in several 

others, motivated us to investigate their relevance to 

the disease and to question possible data bias. We 

compare how the size, specificity and completeness of the 

input seed list affect the resulting network and the 

retrieval of other disease-related proteins. 




Poster 


P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW 

COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE 

AND HIVE 

Amin Ardeshirdavani 1* , Erika Souche 2 , Martijn Oldenhof 3 & Yves Moreau 1 . 

KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytic 1; KU Leuven 

Department of Human Genetics 2; KU Leuven Facilities for Research 3. *amin.ardeshirdavani@esat.kuleuven.be 

Next Generation Sequencing (NGS) technologies allow the sequencing of the whole human genome to, among others, 

efficiently study human genetic disorders. However, the flood of sequencing data requires high computational power and 

an optimized programming structure for data analysis. Many researchers use scale-out networks to emulate a 

supercomputer. In many use cases Apache Hadoop and HBase have been used to coordinate distributed computation and 

act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data 

from NGS, except for sequencing read assembly. In our study, we propose a Big Data solution that integrates Apache 

Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files. 

INTRODUCTION 

The goal of this project is to bridge the gap between massive NGS data and limited data processing capacity. We propose a data processing and storage model specifically for NGS data, and we developed an application based on this model to test whether processing capacity increases substantially. The target users of this application are researchers with intermediate-level computer skills. The new model should meet certain demands, namely scalability, fault tolerance and high availability. Data import should be fast and occupy as little storage as possible. Querying should also be fast and possible from a remote location. To meet these demands, three open source projects, Apache Hadoop, HBase and Hive, are integrated as the backbone, and an application with a user-friendly interface is developed on top of them to make this integration more straightforward. 

METHODS 

Generally, Hadoop provides distributed MapReduce data processing, HBase is a platform for storing complex structured data, and Hive retrieves data from HBase using Structured Query Language (SQL) syntax. Although Hadoop and HBase have become popular recently, the combination of Hadoop, HBase and Hive is rarely implemented in the bioinformatics field. 

Here we mainly discuss gene variation data analysis. The development of the application therefore focuses on parsing and storing VCF (Variant Call Format) files. The application is designed to adapt dynamically to VCF file structures, which differ between variant callers. For example, UnifiedGenotyper calls SNPs and InDels separately, treating each variant as independent, whereas HaplotypeCaller calls variants using local assembly. For gene variation analysis, the VCF files of different samples need to be queried and it should be possible to export the results for further use. A VCF file for a single sample or a group of samples is normally considerably large, so processing efficiency is crucial. 
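Parsing a VCF data line in a caller-independent way could look like the sketch below. It is only an illustration: the INFO column is kept as a free-form map because its keys differ between callers, and the `row_key` layout is a hypothetical choice (HBase stores rows sorted by key, so zero-padding the position keeps variants of one chromosome in genomic order), not the schema used in the application.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed fields, INFO and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
        "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
        # INFO keys differ between callers, so keep them as a free-form map;
        # flags without a value (e.g. DB) are stored as True.
        "INFO": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
        "SAMPLES": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ
        for name, values in zip(sample_names, fields[9:]):
            record["SAMPLES"][name] = dict(zip(keys, values.split(":")))
    return record

def row_key(record):
    """Hypothetical row key: chromosome, zero-padded position, alleles."""
    alt = ",".join(record["ALT"])
    return f"{record['CHROM']}:{record['POS']:09d}:{record['REF']}>{alt}"

# Example data line with two hypothetical samples S1 and S2.
LINE = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ\t0|0:48\t1|0:48"
rec = parse_vcf_line(LINE, ["S1", "S2"])
print(row_key(rec), rec["INFO"], rec["SAMPLES"]["S2"])
```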

The model we chose integrates Hadoop, HBase and Hive: Hadoop is used for data processing, HBase for storage and Hive for querying. Since all of these projects need a distributed cluster to perform optimally, it is crucial to choose a suitable architecture for our application. The cluster is the major processing and storage platform. A single server outside the cluster acts as a client for users. Our application can connect remotely to the Hive server on behalf of researchers. 


The tests clearly show that the Apache integration performs much better than the SQL model when dealing with large VCF files. For small VCF files, the performance of the integration is also acceptable. We therefore conclude that the Apache integration could be a good solution for this kind of file management. Our newly developed application H3 VCF, with its user-friendly interface, allows users without advanced IT knowledge to conveniently use the integration to handle VCF files. Users can either build their own local computer cluster or use Amazon EMR to easily create a cluster with the Apache projects for a few dollars. 




Poster 


P6. ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING 

LONG-TERM PATIENT GUT COLONIZATION 

Jumamurat R. Bayjanov 1* , Jery Baan 1 , Mark de Been 1 , Mick Watson 2 & Willem van Schaik 1 . 

Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands 1 ; Edinburgh 

Genomics, The University of Edinburgh, Edinburgh, Scotland 2 . * J.Bayjanov@umcutrecht.nl 

Enterococcus faecium, a recently evolved multi-drug resistant nosocomial pathogen, is able to rapidly colonize the human gut. Previous work on animal, healthy human and clinical E. faecium strains has shown that clinical isolates form a distinct lineage. However, these studies lack a detailed niche-specific and longitudinal analysis of the evolutionary dynamics of this organism. Here we present a longitudinal analysis of the within-host evolutionary dynamics of E. faecium gut isolates, which were sampled from five patients over a period of 8 years. Whole-genome sequencing analysis showed that the rapid diversification of E. faecium clones in the patient gut is mainly due to recombination and phages. High diversification allows E. faecium clones to acquire new genes, including antibiotic resistance genes, which in turn allows this bacterium to rapidly colonize hostile environments. 

INTRODUCTION 

In recent decades, Enterococcus faecium, normally a 

harmless gut commensal, has emerged as an important 

multi-drug resistant nosocomial pathogen. Previous work 

has shown that clinical isolates of E. faecium form a subpopulation 

that is distinct from strains isolated from 

animals and healthy humans (Lebreton et al., 2013). We 

used whole-genome sequencing to characterize how 

clinical E. faecium strains evolve during long-term patient 

gut colonization. 

METHODS 

The genomes of 96 E. faecium gut isolates, obtained over 

8 years from 5 different patients, were sequenced using 

Illumina HiSeq 2x100bp paired-end sequencing. Quality 

filtering of sequence reads was performed using Nesoni 

(version 0.117) (Nesoni, 2014) and high-quality reads 

were assembled into contiguous sequences using the SPAdes 

assembler (version 3.1.0) (Bankevich et al., 2012). 

Subsequently, assembled sequences were annotated using 

Prokka (v 1.10) (Seemann, 2014). In addition to these 96 

genomes, we also included publicly available genome 

sequences of 70 E. faecium strains, which were 

downloaded from the NCBI GenBank database. In the set of 

166 strains, orthology between genes was identified using 

orthAgogue (Ekseth et al., 2014) and orthologous genes 

were clustered into ortholog groups using the MCL algorithm 

(Enright et al., 2002). Core genome alignments were then 

constructed by concatenating core gene sequences and 

were filtered for recombinations using Gubbins (Croucher 

et al., 2015). Subsequently, recombination-filtered core 

genome alignments were used to construct a phylogenetic 

tree. In addition to core-genome based analyses, we have 

also studied gene gain and loss across time. 
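Building a core genome alignment from ortholog groups can be sketched as below. This is a simplified illustration, not the pipeline used in the study: it assumes that a core gene is one present as a single copy in every strain and that each core gene has already been aligned; all names and sequences are made up.

```python
def core_groups(ortholog_groups, strains):
    """Ortholog groups with exactly one gene in every strain (assumed core definition)."""
    core = {}
    for group_id, genes in ortholog_groups.items():
        by_strain = {}
        for strain, gene in genes:
            by_strain.setdefault(strain, []).append(gene)
        single_copy = all(len(g) == 1 for g in by_strain.values())
        if set(by_strain) == set(strains) and single_copy:
            core[group_id] = {s: g[0] for s, g in by_strain.items()}
    return core

def concatenate_core(core, aligned, strains):
    """Join per-gene alignments (aligned[group][strain]) into one sequence per strain."""
    order = sorted(core)  # fixed gene order so columns line up across strains
    return {s: "".join(aligned[gid][s] for gid in order) for s in strains}

GROUPS = {
    "g1": [("A", "a1"), ("B", "b1")],
    "g2": [("A", "a2")],                          # missing in strain B
    "g3": [("A", "a3"), ("A", "a4"), ("B", "b3")],  # duplicated in strain A
    "g4": [("A", "a5"), ("B", "b5")],
}
ALIGNED = {"g1": {"A": "ATG-", "B": "ATGC"}, "g4": {"A": "CC", "B": "CT"}}
core = core_groups(GROUPS, ["A", "B"])
print(sorted(core), concatenate_core(core, ALIGNED, ["A", "B"]))
```

The concatenated sequences would then be passed to a recombination filter and tree-building step, which are outside the scope of this sketch.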


As expected, all 96 isolates were grouped in E. faecium 

clade A, with only one strain clustering in clade A-2, 

which mainly contains animal isolates. The remaining 95 

strains were assigned to clade A-1, which consists almost 

exclusively of clinical isolates. The 

phylogenetic tree showed 5 clusters of closely related 

patient strains, revealing the microevolution of E. 

faecium strains during gut colonization. We also suspect 

that direct transfer of strains occurred between 

patients during hospitalization in the same ward. 

Additionally, analysis of gene gain and loss across time 

showed that loss and gain of prophages is an important 

factor in generating genetic diversity during gut 

colonization. 

This study highlights the ability of E. faecium clones to 

rapidly diversify, which may contribute to the ability of 

this bacterium to efficiently colonize new environments 

and rapidly acquire antibiotic resistance determinants. 

REFERENCES 

Lebreton F, et al. "Emergence of epidemic multidrug-resistant Enterococcus faecium from animal and commensal strains". MBio. 4(4):e00534-13, 2013. 

Nesoni. https://github.com/Victorian-Bioinformatics-Consortium/nesoni 

Bankevich A, et al. "SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing". Journal of Computational Biology. 19(5):455-477, 2012. 

Seemann T. "Prokka: rapid prokaryotic genome annotation". Bioinformatics. 30(14):2068-9, 2014. 

Ekseth OK, et al. "orthAgogue: an agile tool for the rapid prediction of orthology relations". Bioinformatics. 30(5):734-6, 2014. 

Enright AJ, et al. "An efficient algorithm for large-scale detection of protein families". Nucleic Acids Res. 30(7):1575-1584, 2002. 

Croucher NJ, et al. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Res. 43(3):e15, 2015. 




Poster 


P7. XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC 

Charlie Beirnaert 1,2* , Matthias Cuykx 3 , Adrian Covaci 3 & Kris Laukens 1,2 . 

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical Informatics Research Centre 

Antwerp (biomina) 2 ; Toxicological Centre, University of Antwerp 3 . * charlie.beirnaert@uantwerpen.be 

In high-throughput untargeted metabolomics studies, quality control is still a prominent bottleneck. In analogy to a 

recently developed QC tool for proteomics, work in our research group aims to develop a QC environment specific for 

metabolomics. One component in this work is the XCMS analysis software for LC-MS data, which is very sensitive to its input parameters. 

The presented work deals with the automatic optimisation of the XCMS parameters by building 

further upon an existing framework for XCMS optimisation. The additions to this framework will be the inclusion of 

quantified resolution data by using the otherwise ignored profile-data and intelligent use of the isotopic profile of 

measured compounds. 

INTRODUCTION 

Metabolomics is the study of small molecules or 

metabolites. These metabolites have an enormous 

chemical diversity and are only now starting to be 

identified in a high-throughput fashion. The reason for this is 

the adoption of high-performance liquid chromatography 

mass spectrometry and nuclear magnetic resonance 

spectroscopy. However, the analysis of these large 

datasets is not trivial; for LC-MS in particular there are 

almost more ways of analysing data than there are 

researchers. Arguably, the most commonly used software 

platform for the initial analysis is XCMS (Smith et al., 

2006). However, the output of XCMS is very dependent 

on the input parameters. Often the default parameters are 

chosen or they are adapted to the intuition of the 

researcher, without accounting for the introduction of false 

positives. Optimization algorithms have been 

constructed by using a dilution series (Eliasson et al., 

2012) and by using the carbon isotope (Libiseller et al., 

2015). In this work, we build further upon the latter by 

including quantified information from the profile m/z 

domain (the continuous data in the m/z dimension) where 

accurate resolutions can be obtained for the mono-isotopic 

peaks and other isotopes. The developed optimisation can 

be used for both the data analysis and the quality control 

framework that is under development. 

METHODS 

The proposed work uses XCMS to find the peaks of 

interest in the data. To optimise this process, the results 

from XCMS are analysed for the occurrence of peaks and 

their isotopes. In this step, the raw profile data are inspected 

around the peaks identified by XCMS to 

quantify the peak resolution and to detect 

missed isotopes. 

Centroid vs Profile data: Modern day MS specialists use 

centroid data because the file size is considerably lower. 

The mass spectrometer converts the continuous data in the 

m/z dimension to a collection of spikes where each 

approximately Gaussian peak is converted to a single 

spike (delta function with the same height as the original 

peak). All other data is discarded. The result is a huge 

reduction in the file size but a loss of the peak shape and, 

as a result, no quantification of the resolution is possible. 
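To illustrate what profile data make possible, the sketch below estimates the resolution of a single peak from its profile points. It assumes the common definition of resolution as the apex m/z divided by the full width at half maximum (FWHM), and it finds the half-height crossings by linear interpolation. The function names and the toy peak are hypothetical, and a real implementation would probably fit a peak model instead.

```python
def fwhm(mz, intensity):
    """Full width at half maximum of a single profile-mode peak."""
    apex = max(range(len(intensity)), key=intensity.__getitem__)
    half = intensity[apex] / 2.0

    def crossing(i, step):
        # Walk away from the apex while the signal stays at or above half
        # height, then interpolate linearly between the bracketing points.
        while 0 <= i + step < len(mz) and intensity[i + step] >= half:
            i += step
        j = i + step
        if not 0 <= j < len(mz):
            raise ValueError("peak is truncated before reaching half height")
        frac = (intensity[i] - half) / (intensity[i] - intensity[j])
        return mz[i] + frac * (mz[j] - mz[i])

    return crossing(apex, +1) - crossing(apex, -1)

def resolution(mz, intensity):
    """Resolution as apex m/z divided by FWHM (assumed definition)."""
    apex = max(range(len(intensity)), key=intensity.__getitem__)
    return mz[apex] / fwhm(mz, intensity)

# Toy triangular peak centred at m/z 500 with a width of 0.02 at half height.
MZ = [499.98, 499.99, 500.00, 500.01, 500.02]
INTENSITY = [0.0, 50.0, 100.0, 50.0, 0.0]
print(round(resolution(MZ, INTENSITY)))
```

With centroid data only the apex spike would remain, so neither the FWHM nor the resolution could be computed this way.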

Optimization parameter: The peaks and their isotopes 

are characterized by a Gaussian in the chromatographic 

dimension and spaced apart by 1.0033 Da in the m/z 

dimension. When an isotope is missing or the extracted 

peak does not appear in enough samples (for example in 

50% of the samples in the sample group), the peak is 

categorized as “unreliable”. When a peak is present in all 

samples or has a clear isotopic distribution it is considered 

as “reliable”. With these measures a so called peak picking 

score can be calculated, which in turn can be optimised by 

a variety of methods. This results in an increase in reliable 

peaks, while not increasing false positives. 
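One possible reading of the reliability rule above is sketched here. It is an illustration rather than the scoring used in this work: it assumes singly charged ions, a 13C-12C spacing of about 1.00335 Da, hypothetical m/z and retention time tolerances, and a made-up score (reliable peaks squared over all peaks) that stands in for the actual peak picking score.

```python
ISOTOPE_SPACING = 1.00335  # Da, approximate 13C - 12C mass difference (assumption)

def classify_peaks(peaks, n_samples, mz_tol=0.005, rt_tol=2.0):
    """Label each peak 'reliable' or 'unreliable'.

    peaks: list of dicts with 'mz', 'rt' (seconds) and 'n_found'
    (number of samples in which the peak was extracted).
    """
    labels = []
    for p in peaks:
        # A peak belongs to an isotope pair if a co-eluting peak lies one
        # isotope spacing above or below it in m/z.
        in_isotope_pair = any(
            abs(abs(q["mz"] - p["mz"]) - ISOTOPE_SPACING) <= mz_tol
            and abs(q["rt"] - p["rt"]) <= rt_tol
            for q in peaks
        )
        if p["n_found"] == n_samples or in_isotope_pair:
            labels.append("reliable")
        else:
            labels.append("unreliable")
    return labels

def peak_picking_score(labels):
    """Hypothetical score that rewards reliable peaks and penalises the rest."""
    reliable = labels.count("reliable")
    return reliable ** 2 / len(labels) if labels else 0.0

PEAKS = [
    {"mz": 200.0000, "rt": 60.0, "n_found": 3},
    {"mz": 201.0034, "rt": 60.5, "n_found": 3},   # isotope partner of the first peak
    {"mz": 350.1000, "rt": 120.0, "n_found": 6},  # present in all samples
    {"mz": 410.2000, "rt": 200.0, "n_found": 1},  # isolated, rarely found
]
labels = classify_peaks(PEAKS, n_samples=6)
print(labels, peak_picking_score(labels))
```

A parameter optimiser would rerun peak picking with different settings and keep the settings that maximise such a score.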

Analysis & Quality control: The optimisation of the 

XCMS parameters is useful both for the analysis of the data 

itself and for quality control in large-scale LC-MS 

experiments. By quantifying the 

resolutions of all relevant peaks in a dataset corresponding 

to a control sample, it is possible to monitor the quality of 

spectra, and by combining this with other QC 

frameworks, such as iMonDB (Bittremieux et al., 2015), it is 

possible to assure the quality of all experiments in a 

long-lasting study. 


The aim is to use the profile data to improve the available 

optimization algorithms. It remains to be seen 

whether the extra information in this data (compared to 

centroid data) justifies the increased need of computer 

resources. Nonetheless, profile data provides a valuable 

contribution in LC-MS optimization, because it enables 

researchers to evaluate (quantitatively) and improve the 

m/z resolution. 

REFERENCES 

Smith CA et al. Anal. Chem. 78(3), 779-789, (2006). 

Eliasson M. et al. Anal. Chem. 84(15), 6869-6876, (2012). 

Libiseller G. et al. BMC Bioinformatics 16:118, (2015). 

Bittremieux W. et al. J. Proteome Res. 14(5), 2360-2366, (2015). 




Poster 


P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA 

Vincent Branders 1,2* , Chedly Kastally 2 & Patrick Mardulyn 2 . 

Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied 

Mathematics (ICTEAM), Université catholique de Louvain 1 ; Evolutionary Biology and Ecology, Université libre de 

Bruxelles 2 . * vincent.branders@uclouvain.be 

Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Due to their 

similarity with mitochondrial DNA sequences, numts have led to many misinterpretations from overestimation of 

diversity to wrong association between cystic fibrosis and mitochondrial genome variation. To avoid such bias induced 

by numts, these sequences have to be identified. Current methodologies are based on comparisons of existing nuclear 

and mitochondrial sequences and searches for similarities. The new Pacific Biosciences (PacBio) technology generates 

sequencing reads that span thousands of base pairs, which gives the opportunity to identify numts by looking for reads 

with regions similar to mitochondrial sequences and surrounded by regions highly different from it. It should allow the 

systematic identification of numts without a complete known nuclear reference. 

INTRODUCTION 

The transfer of DNA from mitochondria to the nucleus 

generates nuclear copies of mitochondrial DNA (numts). 

Numts have been found in many species including yeasts, 

rodents and plants. Due to their similarity to mitochondrial 

DNA, numts are responsible for many misinterpretations, 

both in mitochondrial disease studies and phylogenetic 

reconstructions (Hazkani-Covo et al., 2010). Numt 

variation has commonly been misreported as 

mitochondrial mutations in patients (Yao et al., 2008). 

Moreover, DNA barcoding was found to overestimate the 

number of species when numts are coamplified (Song et 

al., 2008). Current methods identify such sequences by 

aligning mitochondrial sequences against the nuclear 

genome and identifying similar regions (Figure 1, left). 

The PacBio technology allows the sequencing of DNA 

fragments spanning thousands of base pairs. This size 

should allow the identification of numts without the need 

for a complete nuclear reference (which is unavailable for the insect species 

Gonioctena intermedia, for example). Indeed, it should be 

possible to use a mitochondrial assembly to identify 

PacBio reads with a central region similar to the 

mitochondrial sequence enclosed by nuclear regions that 

are dissimilar to it (Figure 1, right). 

FIGURE 1. Identification of numts – Existing methods (left) and proposed 

method (right). Comparison of mitochondrial sequence to nuclear 

sequence (left) or long reads (right). 

METHODS 

The proposed approach aligns PacBio reads to a 

mitochondrial genome (here de novo assemblies of PacBio 

reads and Illumina HiSeq 2000 reads are used). In these 

long reads, numts are identified as a region similar 

to the mitochondrial genome that is surrounded by regions 

that are not similar. We introduce different criteria to 

distinguish reads that are presumably numts from reads of 

mitochondrial origin (Figure 2). The DNA sequences come 

from an insect (Gonioctena intermedia) without a reference 

genome. 

FIGURE 2. Mitochondrial reads and numts with nuclear borders. 
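The distinction between mitochondrial reads and reads with a potential numt can be sketched from alignment coordinates alone. The criteria below are a simple illustration and not the ones used in this work: the thresholds `min_flank` and `min_hit`, the labels and the input layout (intervals of the read that align to the mitochondrial assembly) are all hypothetical.

```python
def classify_read(read_length, hits, min_flank=500, min_hit=200):
    """Label a long read from its alignment blocks against the mitochondrial assembly.

    hits: list of (read_start, read_end) intervals on the read,
    0-based with exclusive end, that align to the mitochondrial genome.
    """
    if not hits:
        return "nuclear_or_unknown"
    start = min(s for s, _ in hits)
    end = max(e for _, e in hits)
    if end - start < min_hit:
        return "nuclear_or_unknown"
    left_flank, right_flank = start, read_length - end
    if left_flank >= min_flank and right_flank >= min_flank:
        # Mitochondria-like core enclosed by dissimilar sequence on both sides.
        return "numt_candidate"
    if left_flank < min_flank and right_flank < min_flank:
        # Similar to the mitochondrial genome over (almost) its whole length.
        return "mitochondrial"
    return "ambiguous"

print(classify_read(6000, [(2000, 3200)]))
print(classify_read(5000, [(10, 2500), (2480, 4990)]))
print(classify_read(4000, [(0, 1500)]))
```

Reads with only one dissimilar flank are left as ambiguous here, since they could be numts sequenced from their edge or chimeric reads.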


A systematic identification of potential numts is proposed: 

through alignments, we identify 10 mitochondrial reads 

and 34 reads with a potential numt for one particular 

mitochondrial region (the widely studied cytochrome 

oxidase I gene). As exploratory research, we highlight 

the usefulness of Pacific Biosciences data in the 

identification of numts when no nuclear reference is 

available. It only requires PacBio reads and a 

mitochondrial assembly. The proposed approach is more 

efficient than an identification of numts through short 

reads that would require the complete reconstruction of 

both mitochondrial and nuclear genomes. A systematic 

identification of numts in non-model organisms should 

avoid misinterpretations in studies where numts could be 

sources of bias. Our current distinction between numts and 

mitochondrial reads is quite simple. A more detailed analysis of 

this distinction is a possible direction for improvement. 

REFERENCES 

Hazkani-Covo E. et al. PLOS Genetics 6, 1-11 (2010). 

Song H. et al. PNAS 105, 13486-13491 (2008). 

Yao Y. G. et al. Journal of Medical Genetics 45, 769-772 (2008). 




Poster 


P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING 

SCHEMES FOR BACTERIA 

Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*. 

Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium. 

INTRODUCTION 

As next-generation sequencing in general, and whole 

genome sequencing (WGS) in particular, is increasingly 

adopted in public health for routine surveillance tasks, 

there is a clear need to incorporate this new technology in 

the day-to-day operational workflow of a public health 

institute. As cluster detection based on WGS data is 

evolving into a commodity, thanks to technologies such as 

whole genome multi-locus sequence typing (wgMLST), 

the question remains as to how WGS-based data analysis 

can be used to build up a human-friendly but high-precision 

and epidemiologically consistent naming 

strategy for communication purposes. 

METHODS 

For various organisms, the use of so-called ‘SNP 

addresses’ (based on single nucleotide polymorphisms or 

SNPs) has been proposed to build up a hierarchical 

naming scheme (see [1], [2]). This idea relies on single 

linkage clustering of isolates at different levels of 

similarity or distance, hence leading to a hierarchical name. 

However, the main difficulty here is to define the 

appropriate levels of similarity to cluster on, and the 

dependence of the naming scheme on the samples at hand. 

Moreover, the SNP approach might not provide the best 

type of data for this due to its relatively large volatility. 
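The idea of a hierarchical name built from single linkage clustering at several distance levels can be sketched as follows. This is a generic illustration and not the scheme proposed in this work: the thresholds are arbitrary, cluster numbers are assigned in order of first appearance, and the toy distances are made up.

```python
def single_linkage(n, distances, threshold):
    """Cluster labels for n isolates; distances maps (i, j) pairs to a distance."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage: any pair within the threshold joins two clusters.
    for (i, j), d in distances.items():
        if d <= threshold:
            parent[find(i)] = find(j)
    # Number clusters in order of first appearance.
    ids, labels = {}, []
    for i in range(n):
        labels.append(ids.setdefault(find(i), len(ids) + 1))
    return labels

def addresses(n, distances, thresholds):
    """One dot-separated name per isolate, from the loosest to the strictest level."""
    levels = [single_linkage(n, distances, t) for t in sorted(thresholds, reverse=True)]
    return [".".join(str(level[i]) for level in levels) for i in range(n)]

# Toy pairwise distances (e.g. allele or SNP differences) between 4 isolates.
DIST = {(0, 1): 2, (0, 2): 30, (1, 2): 28, (0, 3): 200, (1, 3): 190, (2, 3): 210}
print(addresses(4, DIST, thresholds=[50, 5]))
# -> ['1.1', '1.1', '1.2', '2.3']
```

Because the numbering depends on which isolates are present and in what order, the sketch also shows why a naming scheme built this way depends on the samples at hand.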

In this work, we present a mathematical framework to 

define the levels of similarity upon which single linkage 

clustering makes sense. For this, we model the observed 

multimodal distribution of pairwise similarities between 

samples to obtain a theoretical model of the similarity 

distribution, and from there infer the most likely breaking 

points for stable similarity cutoffs. This is done in a data-independent 

manner, and is therefore applicable to SNP 

data, but also to wgMLST data and even gene presence-absence 

data. We assess the stability of the naming 

scheme by using a cross-validation approach. 


We apply our methods to propose a wgMLST-based 

naming scheme for Listeria monocytogenes. Using a 

reference dataset of the diversity within Listeria 

monocytogenes, and an extensive data set of over 4000 

isolates from real-time surveillance, we show the stability 

of the naming scheme, and the epidemiological 

concordance. 

REFERENCES 

[1] Dallman T et al., Applying phylogenomics to understand the emergence of Shiga Toxin producing Escherichia coli O157:H7 strains causing severe human disease in the United Kingdom. Microbial Genomics, 10.1099/mgen.0.000029 

[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium tuberculosis, Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi: 10.1016/j.tube.2014.02.005 




Poster 


P10. FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN 

BIOLOGICAL INTERPRETATION OF GWAS RESULTS 

Elisa Cirillo 1,* , Michiel Adriaens 2 & Chris T Evelo 1,2 . 

1 Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands 

2 Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, The Netherlands 

* elisa.cirillo@maastrichtuniversity.nl 

Pathway and network analysis are established and powerful methods for providing a biological context for a variety of 

omics data, including transcriptomics, proteomics and metabolomics. These approaches could in theory also be a boon 

for the interpretation of genetic variation data, for instance in the context of Genome Wide Association Studies (GWAS), 

as it would allow the study of genetic variants in the context of the biological processes in which the implicated genes 

and proteins are involved. However, currently genetic variation data cannot easily be integrated into pathways. 

Additionally, it is not clear how to visualise and interpret genetic variation data once connected to pathway content. In 

this project we take up that challenge and aim to (i) visualise SNPs from a Type 2 Diabetes Mellitus (T2DM) GWAS 

dataset on pathways and (ii) generate and analyze a network of all associated genes and pathways. Together, this could 

enable a comprehensive pathway and network interpretation of genetic variations in the context of T2DM. 

INTRODUCTION 

GWAS has become a common approach for discovery of 

gene-disease relationships, in particular for complex 

diseases like T2DM (Wellcome Trust Case Control Consortium, 

2007). However, biological interpretation remains a 

challenge, especially when it concerns connecting genetic 

findings with known biological processes. We wish to 

improve the interpretation of GWAS results, using a 

meaningful network representation that links SNPs to 

biological processes. 

METHODS 

We selected a GWAS data set related to T2DM from a 

meta-GWAS resource for diseases created by Johnson and 

O'Donnell (2009), and we extracted 1971 SNPs associated with 

T2DM. 

We identified the location of each SNP using the Ensembl Variant 

Effect Predictor (VEP) (http://www.ensembl.org) and 

classified the SNPs into 5 categories (Figure 1): exonic, 3' UTR, 

5' UTR, intronic and intergenic. SNPs located in the first 

three categories are easily connected to genes using 

Ensembl BioMart (http://www.ensembl.org/). Pathways 

related to these genes are identified from the curated 

collection of WikiPathways (Kutmon et al., 2015). SNPs, 

genes and pathways are visualized in networks using 

Cytoscape (Shannon et al., 2003). 
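
The mapping steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_network` is a hypothetical helper, the interaction labels are invented, and any identifiers passed to it stand in for real SNP-to-gene annotations (e.g. from BioMart) and gene-to-pathway annotations (e.g. from WikiPathways). It shows how the two annotation tables combine into one edge list, and how genes without any pathway are reported separately.

```python
def build_network(snp_to_gene, gene_to_pathways):
    """Return (edges, unmapped_genes) for a SNP-gene-pathway network.

    snp_to_gene:      dict mapping a SNP identifier to a gene symbol
    gene_to_pathways: dict mapping a gene symbol to a list of pathway names
    edges:            list of (source, interaction, target) tuples
    unmapped_genes:   genes that carry a SNP but occur in no pathway
    """
    edges = []
    unmapped = set()
    # SNP -> gene edges (labels are illustrative, not a standard vocabulary)
    for snp, gene in sorted(snp_to_gene.items()):
        edges.append((snp, "located_in", gene))
    # gene -> pathway edges, one per pathway membership
    for gene in sorted(set(snp_to_gene.values())):
        pathways = gene_to_pathways.get(gene, [])
        if not pathways:
            unmapped.add(gene)
        for pathway in sorted(pathways):
            edges.append((gene, "member_of", pathway))
    return edges, unmapped


def to_tsv(edges):
    """Serialise the edge list as tab-separated text for network import."""
    return "\n".join("\t".join(edge) for edge in edges)
```

Keeping the unmapped genes as a separate set makes the gap between disease genes and pathway content explicit instead of silently dropping those nodes.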


We analysed four gene-related SNP categories: 3' and 5' 

UTR, intronic and exonic. The exonic category was 

divided into 8 SNP sub-categories based on sequence 

interpretation: up- and downstream, splice region, 

synonymous, missense, stop gained, transcription factor 

binding, and non-coding transcript. For each of the 11 

resulting categories we created a SNP-disease gene-pathway 

network. Disease-related genes are not always 

included in pathways, and this is also the case for disease 

genes in which GWAS-derived SNPs were found. For the 

SNPs that are related to genes in pathways we did a 

pathway gene set enrichment analysis and evaluated 

whether the resulting pathways were already known to be 

related to T2DM. 

SNPs in intergenic regions need to be analysed and 

visualized differently. One possible approach is to use 

expression quantitative trait locus (eQTL) data, which 

relate SNPs in intergenic regions to the modulation of distal gene 

expression. Such datasets are available for many 

different human tissues and can provide additional 

regulatory information for pathways and the genes they 

comprise. 

FIGURE 1. Pie chart of the 5 SNP categories. The total number of SNPs 

is 2767. 

REFERENCES 

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 

cases of seven common diseases and 3,000 shared controls. Nature. 

2007;447(7145):661-78. 

Johnson A, O'Donnell C. An Open Access Database of Genome-wide 

Association Results. BMC Medical Genetics. 2009;10(1):6. 

Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen E, Bohler A, 

Mélius J, Waagmeester A, Sinha S, Miller R, Coort S, Cirillo E, 

Smeets B, Evelo C, Pico A. WikiPathways: Capturing the Full 

Diversity of Pathway Knowledge. Accepted September 2015, 

NAR-02735-E, Database issue 2016. 

Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. 

Cytoscape: A Software Environment for Integrated Models of 

Biomolecular Interaction Networks. Genome Research. 

2003;13(11):2498-504. 




Poster 


P11. IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS 

IN SETS OF FUNCTIONALLY RELATED GENES 

Pieter De Bleser 1,2,4* , Arne Soetens 1,2,4 & Yvan Saeys 1,3,4 . 

VIB Inflammation Research Center 1 ; Department of Biomedical Molecular Biology 2 , Department of Respiratory 

Medicine 3 , Ghent University 4 . * pieterdb@irc.vib-ugent.be 

Co-associations between transcription factors (TFs) have been studied genome-wide and resulted in the identification of 

frequently co-associated pairs of TFs. Co-association of TFs at distinct binding sites is contextual: different combinations 

of TFs co-associate at different genomic locations, producing a condition-dependent gene expression profile for a cell. 

Here, we present a novel method to identify these condition-dependent co-associations of TFs in sets of functionally 

related genes. 

INTRODUCTION 

The functional expression of genes is achieved by 

particular interactions of regulatory transcription factors 

(TFs) operating at specific DNA binding sites of their 

target genes. Dissecting the specific co-associations of TFs 

that bind each target gene represents a difficult challenge. 

Co-associations of transcription factor pairs have been 

studied genome-wide and resulted in the identification of 

frequently co-associated pairs of TFs (ENCODE Project 

Consortium, 2012). It was found that TFs co-associate in a 

context-specific fashion: different combinations of TFs 

bind different target sites and the binding of one TF might 

influence the preferred binding partners of other TFs. Here, 

we present a tool to identify these condition-dependent co-associations 

of TFs in sets of functionally related genes 

(e.g. metabolic pathways, tissues, sets of TF target genes, 

sets of differentially regulated genes). 

METHODS 

In a first step, we determine the set of regulatory TFs for 

each gene (Tang et al., 2011) in the set using the ChIP-Seq 

binding data for 237 TFs from the ReMap database 

(Griffon et al., 2015). This results in a number of 

regulatory ChIP-Seq binding regions per TF per gene, 

represented as a matrix in which each row corresponds to 

a gene and each column corresponds to a TF. In the 

next step, this matrix is used as input to the distance 

difference matrix (DDM) algorithm, modified to 

accommodate these data. The DDM algorithm is a method 

that simultaneously integrates statistical over-representation 

and co-association of TFs (De Bleser et al., 

2007). The result matrix is subsequently reduced, retaining 

only the columns of over-represented and co-associated 

TFs. Visualization is done by (1) hierarchical clustering of 

the reduced result matrix and reordering of the columns 

and (2) conversion of the reduced result matrix into a SIF 

(simple interaction file format) file, summarizing the 

regulator-regulated relationships between transcription 

factors and target genes. This SIF file can be imported into 

Cytoscape for visualization of the regulatory network. 
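
The conversion to SIF described above can be illustrated with a short sketch. It assumes the reduced result matrix is available as a list of rows (one per gene) with one count per retained TF; the function name `matrix_to_sif` and the interaction label `regulates` are illustrative choices, not taken from the authors' tool. SIF itself is a plain-text format with one `source interaction target` triple per line.

```python
def matrix_to_sif(genes, tfs, counts, interaction="regulates"):
    """Convert a gene-by-TF count matrix into SIF lines.

    counts[i][j] is the number of binding regions of tfs[j] assigned to
    genes[i]. Every non-zero cell becomes one 'TF <interaction> gene'
    line, so the TF is the regulator and the gene is the regulated node.
    """
    lines = []
    for i, gene in enumerate(genes):
        for j, tf in enumerate(tfs):
            if counts[i][j] > 0:
                lines.append(f"{tf}\t{interaction}\t{gene}")
    return lines


def write_sif(path, lines):
    """Write the SIF lines to a file that a network viewer can import."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Using a tab as the delimiter keeps the file unambiguous if a node name ever contains a space.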


FOXF1, TBX3, GATA6, IRX3, PITX2, DLL1 and 

NKX2-5 are experimentally verified target genes of the 

EZH2 transcription factor (Grote et al., 2013). 

Running the transcription factor co-association analysis 

method on this data set results in the clustering solution 

plot shown in Figure 1. 

The strongest associations between TFs are found between 

EZH2, POU5F1, SUZ12 and CTBP2. A secondary cluster 

of transcription factor associations is composed of 

EOMES, SMAD2+3 and NANOG. 

The finding of SUZ12 as a cofactor can be accounted for: 

EZH2 and SUZ12 are subunits of Polycomb repressive 

complex 2 (PRC2), which is responsible for the repressive 

histone 3 lysine 27 trimethylation (H3K27me3) chromatin 

modification (Yoo and Hennighausen, 2012). CTBP2 is a 

known transcriptional repressor (Turner and Crossley, 

2001). 

The method has been applied previously for the 

identification of TFs associated with both high tissue-specificity 

and high gene expression levels (Rincon et al., 

2015). The method will be made available as a web tool. 

FIGURE 1. Transcription factor co-associations in the EZH2 data set. 

Note the tendency of EZH2 to co-localize with POU5F1, SUZ12 and 

CTBP2. 

REFERENCES 

De Bleser,P. et al. (2007) A distance difference matrix approach to identifying 

transcription factors that regulate differential gene expression. Genome Biol., 8, 

R83. 

ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements 

in the human genome. Nature, 489, 57–74. 

Griffon,A. et al. (2015) Integrative analysis of public ChIP-seq experiments reveals 

a complex multi-cell regulatory landscape. Nucleic Acids Res., 43, e27. 

Grote,P. et al. (2013) The tissue-specific lncRNA Fendrr is an essential regulator of 

heart and body wall development in the mouse. Dev. Cell, 24, 206–214. 

Rincon,M.Y. et al. (2015) Genome-wide computational analysis reveals 

cardiomyocyte-specific transcriptional Cis-regulatory motifs that enable 

efficient cardiac gene therapy. Mol. Ther. J. Am. Soc. Gene Ther., 23, 43–52. 

Tang,Q. et al. (2011) A comprehensive view of nuclear receptor cancer cistromes. 

Cancer Res., 71, 6940–6947. 

Turner,J. and Crossley,M. (2001) The CtBP family: enigmatic and enzymatic 

transcriptional co-repressors. BioEssays News Rev. Mol. Cell. Dev. Biol., 23, 

683–690. 

Yoo,K.H. and Hennighausen,L. (2012) EZH2 methyltransferase and H3K27 

methylation in breast cancer. Int. J. Biol. Sci., 8, 59–65. 




Poster 


P12. PHENETIC: MULTI-OMICS DATA INTERPRETATION USING 

INTERACTION NETWORKS 

Dries De Maeyer 1,2,3* , Bram Weytjens 1,2,3 , Luc De Raedt 4 & Kathleen Marchal 2,3 . 

Centre for Microbial and Plant Genetics, KULeuven 1 ; Department for Information Sciences (INTEC, IMinds), UGent 2 ; 

Department for Plant Biotechnology and Bioinformatics, UGent 3 ; Department of Computer Science, KULeuven 4 . 

* dries.demaeyer@biw.kuleuven.be 

The omics revolution has introduced new challenges when studying interesting phenotypes. High throughput omics 

technologies such as next-generation sequencing and microarray technologies generate large amounts of data. 

Interpreting the resulting data from these experiments is not trivial due to the data’s size and the inherent noise of the 

underlying technologies. In addition to this, the “omics” technologies have led to an ever-expanding body of biological 

knowledge that has to be taken into account when interpreting new experimental results. Interaction networks in 

combination with subnetwork inference methods provide a solution to this problem by mining the current public 

interactomics knowledge using experimental omics data to better understand the molecular mechanisms driving the 

interesting phenotypes under study. 

INTRODUCTION 

Computational methods are becoming essential for 

analyzing large scale omics datasets in the light of current 

knowledge. By representing publicly available 

interactomics knowledge as interaction networks, 

subnetwork inference methods can extract the actual 

molecular mechanisms that drive an interesting phenotype. 

The PheNetic framework is such a method that allows for 

mining interaction networks with multi-omics datasets. 

Using this framework, different types of biological 

applications have been analyzed in the past, such as KO-transcriptomics 

interpretation (De Maeyer, 2013), 

expression analysis (De Maeyer, 2015) and distinguishing 

driver from passenger mutations in eQTL experiments 

(De Maeyer, submitted). 

METHODS 

Interaction networks provide a flexible representation of 

public biological interactomics knowledge. These 

networks represent the physical interactions between 

genes and their corresponding gene products in the 

interactome of the organism under research (Cloots, 2011). 

The interaction network integrates different layers of 

homogeneous interactomics data, e.g. signalling, protein-protein, 

(post)transcriptional and metabolic interactomics 

data, into a single heterogeneous network representation. 

The PheNetic framework uses interaction networks to find 

biologically valid paths which connect (in)activated genes 

selected from multi-omics data sets. These paths provide a 

biological explanation of how the genes from these data 

sets can trigger each other. Finding the best explanations 

or paths in the interaction network corresponds to finding 

the subnetwork that best explains the observed results and 

provides an insight into the molecular mechanisms that 

drive the phenotype of interest. Depending on the type of 

biological application and the data provided, different types of 

paths can be used to infer the subnetwork, for example for KO-transcriptomics 

interpretation (De Maeyer, 2013), 

expression analysis (De Maeyer, 2015) and the interpretation of 

eQTL experiments (De Maeyer, submitted). 
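
To make the idea of a "best path" concrete, the sketch below finds the single most probable directed path between two genes. It rests on assumptions that this abstract does not state: that every interaction carries a probability, and that a path is scored by the product of its edge probabilities. It is an illustration of path search in a weighted interaction network, not the PheNetic inference procedure, and all names are hypothetical.

```python
import heapq
import math


def most_probable_path(edges, source, target):
    """Find the path that maximises the product of edge probabilities.

    edges: iterable of (u, v, p) directed edges with 0 < p <= 1.
    Returns (probability, [nodes]), or (0.0, []) if target is unreachable.
    Maximising a product of probabilities equals minimising the sum of
    -log(p), so Dijkstra's algorithm can be applied to the -log weights.
    """
    graph = {}
    for u, v, p in edges:
        graph.setdefault(u, []).append((v, -math.log(p)))

    best = {source: 0.0}   # lowest cost found so far per node
    previous = {}          # back-pointers for path reconstruction
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))

    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path
```

A subnetwork could then be assembled, under the same assumptions, as the union of such paths between the genes selected from the omics data.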


In a first setup PheNetic was used to study the pathways 

and processes involved in acid resistance in Escherichia 

coli (De Maeyer, 2013). Using our framework we were 

able to determine the different molecular pathways that 

drive acid resistance and identify the regulators that 

underlie this phenotype. It was shown that subnetwork 

inference methods outperform naïve gene rankings in 

identifying the biological pathways associated with the 

phenotype under study. 

In a second setup PheNetic was used to interpret 

expression data (De Maeyer, 2015) and to extract from the 

interaction network those parts 

that show differences in expression. This method is 

provided as a web server that can be accessed at 

http://bioinformatics.intec.ugent.be/phenetic 

and that allows for an intuitive and visual 

interpretation of the inferred subnetworks. 

In a third setup PheNetic was used to distinguish driver 

mutations from passenger mutations in coupled genetic-transcriptomics 

data sets from evolution experiments (De 

Maeyer, submitted). Evolved strains with the same phenotype are 

expected to have consistent changes in the same pathways. 

Therefore, finding the subnetwork that best connects the 

mutations to the differentially expressed genes over all 

strains is expected to separate the driver mutations from the 

passenger mutations while also identifying the 

molecular mechanisms that induce the observed change in 

phenotype. This approach provides a systemic insight into 

both the biological processes and the genetic background that 

induce the phenotype. 

Based on the different approaches it can be concluded that 

PheNetic is a flexible framework for subnetwork selection 

that allows for solving a large variety of biological 

applications using multi-omics data sets. 

REFERENCES 

Cloots, L., & Marchal, K. (2011). Curr Opin Microbiol, 14(5), 599-607. 

De Maeyer, D., Renkens, J., Cloots, L., De Raedt, L., & Marchal, K. 

(2013). Mol Biosyst, 9(7), 1594-1603. 

De Maeyer, D., Weytjens, B., Renkens, J., De Raedt, L., & Marchal, K. 

(2015). Nucleic Acids Res, 43(W1), W244-250. 

De Maeyer, D., Weytjens, B., De Raedt, L., & Marchal, K. Molecular 

biology and evolution. Submitted 




Poster 


P13. THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS 

SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS 

Nicolas De Neuter 1,2* , Benson Ogunjimi 3 , Anke Verlinden 4 , Kris Laukens 1,2 & Pieter Meysman 1,2 . 


Antwerpen (biomina) 2 ; Centre for Health Economics Research and Modeling Infectious Diseases (CHERMID), Vaccine 

and Infectious Disease Institute, University of Antwerp 3 ; Antwerp University Hospital 4 . 

* nicolas.deneuter@uantwerpen.be 

In this study, we aim to characterize those HLA alleles that increase or decrease the risk of cytomegalovirus infections 

following tissue or organ transplants. This HLA-dependent susceptibility will then be explained using state-of-the-art 

HLA peptide affinity methods to identify the underlying molecular reason. This insight can greatly aid prediction of 

those transplantation patients that are most at risk from cytomegalovirus infection. 

INTRODUCTION 

Patients suffering from disorders of the hematopoietic 

system or with chemo-, radio-, or immuno- sensitive 

malignancies such as leukemia often receive 

hematopoietic stem cell transplantation therapy (HSCT). 

The transplantation is preceded by a conditioning regimen 

that eradicates the recipient’s malignant cell population 

through intensive chemotherapy and irradiation, 

simultaneously ablating the recipient’s bone marrow. Self 

(autologous) or non-self (allogeneic) hematopoietic stem 

cells are then reintroduced into the recipient after which 

they are allowed to reestablish hematopoietic functions. 

HSCT is associated with high morbidity and mortality and 

requires careful monitoring of patients during the weeks 

following transplantation. Opportunistic cytomegalovirus 

(CMV) infections are one of the major causes of this high 

morbidity and mortality and can occur in up to 80% of 

HSCT patients, depending on the use of prophylactic 

treatment or pre-emptive therapy and the serological CMV 

status of donor and recipient. CMV disease can manifest 

itself as life-threatening pneumonia, gastrointestinal 

disease, retinitis, encephalitis or hepatitis. 

The relevance of HLA alleles in varicella zoster virus 

associated disease has recently been demonstrated by our 

group (Meysman et al., 2015) and similar insights might 

be gained in CMV related disease. Several studies have 

already shown a correlation between the incidence of 

CMV infection and the presence of certain human 

leukocyte antigen (HLA) alleles in the transplant 

recipient. However, the alleles identified in previous 

studies are highly inconsistent, likely due to small sample 

sizes and type I errors arising from multiple testing. 

METHODS 

Anonymized patient records on the HLA alleles, CMV 

infection and serological status of 1284 transplant 

recipients were collected from the Antwerp University 

Hospital (UZA). This data set was further extended with 

publicly available HLA data from transplant patients, and 

the counts for the HLA alleles present at each locus were 

combined. A hypergeometric distribution was used to test 

HLA loci (A, B, C, DRB1, DQB1 and DPB1) for 

statistical over- or underrepresentation of their respective 

alleles. HLA alleles were tested for over- or 

underrepresentation in two test populations: recipients 

who were seropositive for CMV before transplantation 

and recipients who developed a CMV infection post-transplantation. 

In the latter case, we also examined whether 

donor seropositivity had an influence on the CMV 

infection status. The P value cutoff was 0.05, 

adjusted with a Bonferroni correction for multiple testing, 

in this case the number of alleles tested per locus. 
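
The test described above can be sketched with the standard library alone. The framing is an assumption made for illustration: `population` is the total allele count at a locus, `successes` the count of one allele in that total, `draws` the allele count in the test group, and `k` the count of that allele within the test group. Function names are hypothetical and the authors' actual implementation is not shown in this abstract.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """Probability of exactly k successes in draws without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def representation_p_values(k, population, successes, draws):
    """Return (p_over, p_under) for observing k copies of an allele.

    p_over  = P(X >= k), the test for over-representation
    p_under = P(X <= k), the test for under-representation
    """
    lowest = max(0, draws - (population - successes))
    highest = min(draws, successes)
    p_over = sum(hypergeom_pmf(i, population, successes, draws)
                 for i in range(k, highest + 1))
    p_under = sum(hypergeom_pmf(i, population, successes, draws)
                  for i in range(lowest, k + 1))
    return p_over, p_under


def bonferroni_significant(p_values, alpha=0.05):
    """Flag alleles whose p-value passes alpha / (number of alleles tested).

    p_values: dict mapping allele name -> p-value for ONE locus, so the
    correction factor is the number of alleles tested at that locus.
    """
    threshold = alpha / len(p_values)
    return {allele: p <= threshold for allele, p in p_values.items()}
```

Dividing the cutoff by the number of alleles at a locus mirrors the per-locus correction stated in the text; a locus with many alleles therefore needs a much smaller raw p-value to be called significant.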

Putative nonameric peptides were generated in silico from 

CMV protein sequences available in online protein 

sequence repositories such as the UniProt Knowledgebase. 

Three complementary methods were employed to predict 

the affinity of each putative nonameric peptide to the 

significantly enriched or depleted HLA alleles. The 

methods used were: NetCTLpan, the stabilized matrix 

method (SMM) and an in-house-developed approach 

called CRFMHC. Peptide-binding affinity results of each 

predictor were normalized against the affinity of a 

restricted panel of human proteins and used to compare 

results between predictors. Additionally, each CMV 

protein was assessed for depletion of high-affinity 

peptides using a hypergeometric distribution. 
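
Generating putative nonamers in silico amounts to sliding a window of nine residues along each protein sequence. The sketch below shows that step only; the function name and the choice to return peptides in sequence order are illustrative assumptions, and the affinity predictors named above are not called here.

```python
def nonamers(sequence, length=9):
    """Return all overlapping peptides of the given length, in order.

    A protein of n residues yields n - length + 1 peptides; sequences
    shorter than the window yield an empty list.
    """
    sequence = sequence.strip().upper()
    return [sequence[i:i + length]
            for i in range(len(sequence) - length + 1)]


def unique_nonamers(sequences, length=9):
    """Collect the distinct peptides across several protein sequences."""
    seen = set()
    for sequence in sequences:
        seen.update(nonamers(sequence, length))
    return seen
```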

RESULTS 

Preliminary results on a small portion of the UZA data 

reveal HLA alleles associated with either CMV seropositivity 

or CMV infection that show a trend towards significance but do 

not reach the Bonferroni-corrected threshold. We expect 

the additional data to increase the power of the analysis. 

REFERENCES 

Meysman,P. et al. (2015) Varicella-Zoster Virus-Derived Major 

Histocompatibility Complex Class I-Restricted Peptide Affinity Is 

a Determining Factor in the HLA Risk Profile for the 

Development of Postherpetic Neuralgia. J. Virol., 89, 962–969. 




Poster 


P14. NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM 

WHOLE GENOME NGS DATA 

Nicolas Dierckxsens 1,2* , Olivier Hardy 2 , Ludwig Triest 3 , Patrick Mardulyn 2 & Guillaume Smits 1,4 . 

Interuniversity Institute of Bioinformatics Brussels (IB2), ULB-VUB, Triomflaan CP 263, 1050 Brussels, Belgium 1 ; 

Evolutionary Biology and Ecology Unit, CP 160/12, Faculté des Sciences, Université Libre de Bruxelles, Av. F. D. 

Roosevelt 50, B-1050 Brussels, Belgium 2 ; Plant Biology and Nature Management, Vrije Universiteit Brussel, Brussels, 

Belgium 3 ; Department of Paediatrics, Hôpital Universitaire des Enfants Reine Fabiola (HUDERF), Université Libre de 

Bruxelles (ULB), Brussels, Belgium 4 . * nicolasdierckxsens@hotmail.com 

Thanks to advances in next-generation sequencing (NGS) technology, whole genome data can be readily obtained 

from a variety of samples. There are many algorithms available to assemble these reads, but few of them focus on 

assembling plastid genomes. We therefore developed a new algorithm that assembles only the plastid genomes 

from whole genome data, starting from a single seed. The algorithm takes full advantage of very high 

coverage, which allows it to assemble even through problematic (AT-rich) regions. The algorithm has been 

tested on several whole genome Illumina datasets, where it outperformed other assemblers in runtime and specificity. Every 

assembly resulted in a single contig for the chloroplast or mitochondrial genome, always within 

30 minutes. 

INTRODUCTION 

Chloroplasts and mitochondria are both responsible for 

generating metabolic energy within eukaryotic cells. Both 

organelles are maternally inherited and have a persistent gene 

organization, which makes them ideal for phylogenetic 

studies or as a barcode in plant and food identification 

(Brozynska et al., 2014). But assembling these plastid 

genomes is not always straightforward with the 

currently available tools. Therefore we developed a new 

algorithm, specifically for the assembly of plastid 

genomes from whole genome data. 

METHODS 

The algorithm is written in Perl. All assemblies were 

executed on an Intel Xeon machine with 24 cores 

at 2.93 GHz and a total of 96.8 GB of RAM. All non-human 

samples were sequenced on the Illumina HiSeq 

platform (101 bp paired-end reads). The human 

mitochondria samples (PCR-free) were sequenced on the 

Illumina HiSeqX platform (150 bp paired-end reads). The 

Gonioctena intermedia sample was also sequenced on the 

PacBio platform. 


Algorithm. The algorithm is similar to string overlap 

algorithms like SSAKE (Warren et al., 2007) and VCAKE 

(Jeck et al., 2007). It starts by reading the sequences into 

a hash table, which allows quick access. The 

assembly has to be initiated by a seed that will be 

extended bidirectionally in iterations. The seed input is 

quite flexible: it can be one sequence read, a conserved 

gene or even a complete mitochondrial genome from a 

distant species. Every base extension is determined by a 

consensus between the overlapping reads. Unlike most 

assemblers, NOVOPlasty does not try to assemble every 

read, but will extend the given seed until the circular 

plastid is formed. 
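
The seed-and-extend idea can be illustrated with a deliberately simplified sketch. It is not the NOVOPlasty implementation (which is written in Perl): it extends in one direction only, ignores reverse complements, read pairs and sequencing errors, and the k-mer size, function names and toy genome are all assumptions made for the example. Reads are indexed in a hash table keyed by k-mer, each extension base is chosen by majority vote, and extension stops when the start of the seed reappears, which is taken as the circle closing.

```python
from collections import Counter, defaultdict


def index_reads(reads, k):
    """Hash table: k-mer -> Counter of the bases that follow it in any read."""
    table = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            table[read[i:i + k]][read[i + k]] += 1
    return table


def extend_seed(seed, reads, k=5, max_length=10000):
    """Extend seed to the right, one consensus base at a time.

    Returns (contig, circular). Extension stops when no read supports a
    further base, when the two best-supported bases are tied, when the
    first k bases of the seed reappear at the contig end (circular), or
    when max_length is reached.
    """
    table = index_reads(reads, k)
    contig = seed
    start = seed[:k]
    while len(contig) < max_length:
        votes = table.get(contig[-k:])
        if not votes:
            break
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            break  # ambiguous consensus, stop rather than guess
        contig += ranked[0][0]
        if len(contig) >= len(seed) + k and contig.endswith(start):
            return contig[:-k], True  # drop the repeated start
    return contig, False


# Toy circular genome (invented) and error-free reads of length 8.
genome = "ACGTTGCAATCCGGAT"
wrapped = genome + genome[:7]
reads = [wrapped[i:i + 8] for i in range(len(genome))]
contig, circular = extend_seed(genome[:7], reads, k=5)
```

Because only reads that overlap the growing contig are ever consulted, reads from the nuclear genome are never assembled, which is the property the text contrasts with general-purpose assemblers.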

Assemblies. NOVOPlasty has currently been tested for the 

assembly of 8 chloroplasts and 6 mitochondria. Since 

chloroplasts contain an inverted repeat, two versions of the 

assembly are generated. They differ only in the orientation 

of the region between the two repeats; the correct one will 

have to be resolved manually. Except for the mitochondrion 

of the leaf beetle Gonioctena intermedia, all assemblies 

resulted in a complete circular genome. A comparative 

study of four assemblers for the mitochondrial genome of 

G. intermedia clearly shows the speed and specificity of 

NOVOPlasty (Table 1). 

                          NOVOPlasty   MIRA   MITObim    ARC
Duration (min)                    12    536     4777*    586
Memory (GB)                       15   57.6      63.4    1.9
Storage (GB)                       0    144       418     12
Total contigs                      1   3434      2221   2502
Mitochondrial contigs              1      1         4     48
Coverage (%)                      98     94        94     84
Mismatches                        10     25        26      2
Unidentified nucleotides          43    194       197      0

TABLE 1. Benchmarking results for four assemblies of the 

mitochondrial genome of Gonioctena intermedia. The assemblies were 

constructed with MITObim (Hahn et al., 2013), MIRA (Chevreux et al., 

1999), ARC (Hunter et al., 2015) and NOVOPlasty. * Manually terminated. 

Discussion. Despite the many available assemblers, many 

researchers still struggle to find a good assembler for 

plastid genomes. NOVOPlasty offers an assembler 

specifically designed for plastids that will deliver the 

complete genome within 30 minutes. The algorithm will 

be tested on more datasets and a comparative study with 

other assemblers is in progress. 

REFERENCES 

Brozynska et al. PLoS One 9 (2014). 

Chevreux et al. Computer Science and Biology: Proceedings of the 

German Conference on Bioinformatics (GCB) (1999). 

Hahn et al. Nucleic Acids Research, 1-9 (2013). 

Hunter et al. http://dx.doi.org/10.1101/014662 (2015). 

Jeck et al. Bioinformatics 23, 2942-2944 (2007). 

Warren et al. Bioinformatics 23, 500-501 (2007). 




Poster 


P15. ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR 

NANOMATERIAL SAFETY EVALUATION 

Friederike Ehrhart 1 , Linda Rieswijk 1 , Chris T. Evelo 1 , Haralambos Sarimveis 2 , Philip Doganis 2 , Georgios Drakakis 2 , 

Bengt Fadeel 3 , Barry Hardy 4 , Janna Hastings 5 , Christoph Helma 6 , Nina Jeliazkova 7 , Vedrin Jeliazkov 7 , Pekka Kohonen 8,9 , 

Roland Grafström 9 , Pantelis Sopasakis 10 , Georgia Tsiliki 2 & Egon Willighagen 1 . 

Department of Bioinformatics - BiGCaT, Maastricht University 1 ; National Technical University of Athens 2 ; Karolinska 

Institutet 3 ; Douglas Connect 4 ; European Molecular Biology Laboratory – European Bioinformatics Institute 5 ; In silico 

toxicology 6 ; Ideaconsult Ltd. 7 ; VTT Technical Research Centre of Finland 8 ; Misvik Biology 9 ; IMT Institute for Advanced 

Studies 10 . *friederike.ehrhart@maastrichtuniversity.nl 

eNanoMapper is an open computational infrastructure for engineered nanomaterial data: it comprises a semantic web 

supported database, ontology, and user applications for up- and download of experimental data, and tools for modelling. 

INTRODUCTION 

Nanomaterials are defined by size: between 1 nm and 100 

nm in at least one dimension. The properties of these 

materials do not always resemble those of the bulk 

material, i.e. micro-sized and larger particles, or solutions. 

Nanomaterials can differ in reactivity and in toxicity to 

biological organisms and ecosystems depending on their 

size and surface properties and the possibility of 

“leakage” of the material they are made of. That is why it is 

so difficult to assess the safety of nanomaterials and why 

the NanoSafety Cluster defined a need for a new 

computational infrastructure in 2012. eNanoMapper is a 

European project with partners from eight European 

countries. This project has been developing a 

computational infrastructure consisting of a semantic web 

assisted database, a modular ontology, and tools to use 

them for nanomaterial safety assessment. Data sharing, 

data storage, data analysis tools, and web services are 

currently being developed, tested, 

and put into production use. The project website can be 

found at www.enanomapper.net. 

PROBLEM 

The eNanoMapper platform is designed to support hosting 

of data on nanomaterial properties relevant for nanosafety 

assessment as found in existing databases like the 

NanoMaterial Registry, DaNa Knowledge Base, 

Nanoparticle Information Library (NIL), Nanomaterial-Biological 

Interactions Knowledgebase, caNanoLab, 

InterNano, Nano-EHS Database Analysis Tool, nanoHUB, 

etc. Each of them has different data formats and 

descriptors, like CODATA-VAMAS’ Universal 

Description System, ISO-Tab(-Nano), OECD templates, 

custom spreadsheets, and images. Interoperability is a 

main aim: semi-automatic import or upload of 

information and its integration into the eNanoMapper data 

structure are being enabled. Conversely, retrieval or 

download of experimental data from the database for (re-)analysis 

should be provided too, using programmable 

interfaces to the data and the ontology. Database and 

search functionality should be semantic web compatible: 

the project develops and maintains a nanosafety ontology 

to support this. This eNanoMapper ontology was 

developed using the Web Ontology Language and the 

challenge is to map nanomaterial terms to their multiple 

ontology terms, namely physico-chemical properties, 

biological and ecological impact, experimental assay 

description, and known safety aspects. 


The current eNanoMapper demo database instance, 

available at https://data.enanomapper.net/, contains the 

physico-chemical, biological and environmental properties 

of 465 different nanomaterials [1]. Data can be loaded 

into the database in various formats, including 

the OECD Harmonized Templates and the data structure 

used by the NanoWiki [2]. A web interface is designed to 

support all interactions with the database you may want to 

perform, including uploading of experimental data, as well 

as querying data to support analysis and modelling of 

nanoparticle properties. The eNanoMapper ontology is 

available at 

http://purl.enanomapper.net/onto/enanomapper.owl and is 

based on a multi-faceted description of nanoparticles 

concerning nanoparticle types, physico-chemical 

description, life cycle, biological and environmental 

characterisation including experimental methods and 

protocols, and safety information [3]. The terms are verified 

against the definitions of REACH, ISO, or common 

practices used in science in general. The often confused 

meanings of endpoints and assays were 

distinguished in the definitions, e.g. size versus size 

measurement assay. It was partly possible to use existing 

ontologies as a basis, e.g. NPO, ChEBI and GO, but many 

terms had to be added manually. Currently, there are 4592 

classes defined. Users can access and download the 

ontology from the U.S. National Center for BioMedical 

Ontologies BioPortal platform, 

http://bioportal.bioontology.org/ontologies/ENM. 

REFERENCES 

[1] Jeliazkova, N. et al. The eNanoMapper database for 

nanomaterial safety information. Beilstein Journal of 

Nanotechnology 6, 1609-1634, doi:10.3762/bjnano.6.165 

(2015). 

[2] Willighagen, E. doi:10.6084/m9.figshare.1330208 

[3] Hastings, J. et al. eNanoMapper: harnessing ontologies to 

enable data integration for nanomaterial risk assessment. J 

Biomed Semantics 6, 10, doi:10.1186/s13326-015-0005-5 (2015). 





Poster 


P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: 

SOMETIMES LESS IS MORE 

Sarah ElShal 1,2* , Jesse Davis 3 & Yves Moreau 1,2 . 

Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data 

Analytics Department, KU Leuven 1 ; iMinds Future Health Department, KU Leuven 2 ; Department of Computer Science, 

KU Leuven 3 . * sarah.elshal@esat.kuleuven.be 

Biomedical text is increasingly being made available online in either abstract or full article formats. This goes hand in hand 

with the desire to extract knowledge from such text (e.g. finding links between diseases and genes). 

Consequently text mining is very popular in the biomedical domain given that it provides the possibility to automatically 

analyze these texts in order to extract knowledge. One of the big challenges in text mining is recognizing named entities 

(e.g. disease and gene entities) inside a given text, which is widely known as Named Entity Recognition (NER). We 

studied two biomedical taggers that apply different NER methods on MEDLINE abstracts. Here, we compare the 

contribution of each of the two taggers in associating genes with diseases. We show that with fewer recognized entities 

we gain more knowledge and we better associate genes with diseases. 

INTRODUCTION 

MEDLINE currently has more than 25 million biomedical 

citations from different journals all over the world. With 

this vast amount of text available, it is increasingly 

important to mine such data and find the best ways to 

extract relevant knowledge out of it. One example of such 

knowledge is links between diseases and genes. However, 
it is very challenging and time-consuming to recognize 
biomedical entities inside a given text, given the growing 
number of dictionaries and tagging strategies. Different 
taggers exist that map MEDLINE abstracts to biomedical 
entities. Such tagged entities can be used to generate 
disease and gene profiles and, by applying certain 
similarity measures, we can extract knowledge and 
generate disease-gene hypotheses. 

METHODS 

We compare two MEDLINE taggers that map the whole 

set of MEDLINE abstracts to biomedical entities (e.g. 

genes, diseases, GO and MeSH terms …). The first one is 

MetaMap (Aronson et al., 2010), and the second one has 

been used as a text mining pipeline in many resources, 

most recently in DISEASES (Pletscher-Frankild et al., 2015). For 

the sake of simplicity, we refer to the second tagger as 

m_tagger throughout the rest of the abstract. For each 

MEDLINE abstract we could obtain two sets of mapped 

entities: (1) the metamap set, and (2) the m_tagger set. The 

metamap set (given all the abstracts) corresponds to 

78,298 distinct entities vs. 29,536 for the m_tagger set. 

In order to compare the contribution of each tagger to the 

disease-gene association process, we proceeded as follows. 

First, we generated a validation set from the OMIM 

database to acquire a list of experimentally-validated 

disease-gene pairs. Second, we generated an entity profile 

for every gene in our database and for every disease in our 

validation set. Each profile holds the TF-IDF score of 

every entity it contains, calculated over the set of 

abstracts found to be linked with that disease or gene. 

Then, for every disease, we computed the 

cosine similarity between its profile and all the gene 

profiles. Hence we could have a similarity score for each 

disease and gene pair, which we used to rank the genes for 

a given disease. We computed the average recall at the top 

10, 25, 50, and 100 ranked genes. We ran this analysis 

once according to the metamap set and once according to 

the m_tagger set. We also tried another association 

measure where we filtered the profiles such that they only 

contain gene entities. Then we ranked the genes according 

to their TF-IDF scores in a given disease profile. This 

corresponds to 9,290 gene entities in the metamap set, and 

10,003 entities in the m_tagger set. Again we measured 

the average recall at the different rank thresholds, and we 

repeated the analysis using the metamap and m_tagger 

profiles. 
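The ranking procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy entity counts and the exact TF-IDF variant (term frequency times log inverse profile frequency, computed over all profiles together) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_profiles(entity_counts):
    """Turn {profile_id: {entity: count}} into {profile_id: {entity: tf-idf}}.

    Assumption: each disease or gene profile is treated as one "document".
    """
    n_profiles = len(entity_counts)
    profile_freq = Counter()
    for counts in entity_counts.values():
        profile_freq.update(counts.keys())
    profiles = {}
    for pid, counts in entity_counts.items():
        total = sum(counts.values())
        profiles[pid] = {
            entity: (count / total) * math.log(n_profiles / profile_freq[entity])
            for entity, count in counts.items()
        }
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(weight * b[entity] for entity, weight in a.items() if entity in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_genes(disease_profile, gene_profiles):
    """Return gene ids sorted by decreasing similarity to the disease profile."""
    return sorted(gene_profiles,
                  key=lambda g: cosine(disease_profile, gene_profiles[g]),
                  reverse=True)


def recall_at_k(ranked_genes, true_genes, k):
    """Fraction of validated genes found among the top-k ranked genes."""
    return len(set(ranked_genes[:k]) & set(true_genes)) / len(true_genes)


# Toy example with hypothetical entity counts.
counts = {
    "BRCA1": {"breast cancer": 3, "DNA repair": 2},
    "CFTR": {"cystic fibrosis": 4, "chloride channel": 1},
    "breast carcinoma": {"breast cancer": 5, "DNA repair": 1},
}
profiles = tfidf_profiles(counts)
genes = {g: profiles[g] for g in ("BRCA1", "CFTR")}
ranking = rank_genes(profiles["breast carcinoma"], genes)
print(ranking, recall_at_k(ranking, {"BRCA1"}, 1))
```

Averaging `recall_at_k` over all diseases in a validation set, for k = 10, 25, 50 and 100, would give the kind of curve reported in the results.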


Figure 1 presents the recall results on the OMIM 

validation set. We observe that MetaMap and M_tagger 

result in comparable recall when ranking the genes 

according to their cosine similarity with the disease 

profiles. We also observe that M_tagger results in the best 

recall when simply ranking the genes according to their 

TF-IDF scores inside the disease profile. 

FIGURE 1. Recall results on the OMIM validation set: comparing the 

contribution of MetaMap and M_tagger, once with cosine similarity and 

once with TF-IDF ranks. 

Even though using the m_tagger set implies using fewer 

entities than the metamap one, we could gain the same 

knowledge to associate genes with diseases. Moreover, 

when we further reduced this set of entities to only genes, 

we gained even more knowledge and better associated 

genes with diseases. 

REFERENCES 

Aronson A.R. et al. An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Inform. Assoc. 17, 229-236 (2010). 

Pletscher-Frankild S. et al. DISEASES: text mining and data integration of disease-gene associations. Methods. 74, 83-89 (2015). 




Poster 


P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS 

Bertrand Escaliere 1,2 , Nicolas Simonis 1,3 , Gianluca Bontempi 1,2 & Guillaume Smits 1,4 . 

Interuniversity Institute of Bioinformatics in Brussels 1 ; Machine Learning Group, Université Libre de Bruxelles 2 ; Institut 

de Pathologie et de Génétique 3 ; Hopital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles 4 . 

Optimization of NGS analysis software and pipelines is crucial in order to improve the discovery of (new) disease-causing 
variants. A better combination of existing tools and the right choice of parameters can lead to more specific and 
sensitive calling. Simulated datasets allow the step-by-step development of new alignment or calling software. Creating a 
simulator able to insert known human variants at a realistic minor allele frequency, and artificial variants in a tunable, controlled 
way, would make it possible to overcome three optimization limits: complete knowledge of the input dataset, which allows 
exact calling sensitivity and accuracy to be determined; optimization on the appropriate population; and the capacity to dynamically test a 
pipeline one variable at a time. 

INTRODUCTION 

Identification of anomalies causing genetic disorders is 
difficult. It can be limited by the rarity of the disorder 
concerned, by its genetic heterogeneity, or by 
phenotypic pleiotropy associated with anomalies in a 
single gene. Exome and genome sequencing have allowed the 
identification of the causes of many genetic diseases whose 
origin had remained inaccessible to the usual 
techniques of genetics research (Ng et al., 2009; 
Gilissen et al., 2012; Yang et al., 2013; Gilissen et al., 
2014). Exome and genome sequencing data analysis 
pipelines consist of several steps (roughly: 
alignment, quality filtering, variant calling), and several 
software packages are available for each step. Evaluating and 
comparing those tools is crucial in order to improve 
pipeline accuracy. Exome and genome sequencing 
simulations make it possible to determine the veracity of 
called variants (false positives and false negatives). 

METHODS 

We implemented TuneSIM, a wrapper around the NGS read 
simulator dwgsim (http://sourceforge.net/projects/dnaa/) that 
adds realistic mutations. Generated reads contain 
real mutations from the 1KG project and dbSNP138; read 
generation itself is delegated to dwgsim. In order to 
generate data as realistic as possible, we decided to keep 
the haplotype block structure. We computed blocks with 
Plink (Purcell et al., 2007) using VCF files from 1KG 
project phase 3 for European individuals. For each block, we 
obtained the frequency of each combination of variants and 
used these frequencies for block selection. We also 
insert variants independently, using their 
frequencies in dbSNP (Smigielski et al., 2000). Using 33 
in-house samples, we computed the global allele frequency 
distributions of variants in coding and non-coding regions, 
and we select variants according to those frequencies. 
A similar operation was performed for CNV insertion 
using 1KG data. We are developing a web interface 
allowing users to download existing generated datasets. 
After running their pipelines, they can upload their output 
and see the accuracy of their pipelines. 
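The two ideas in this section, frequency-weighted selection of variants and scoring a pipeline against the known inserted variants, can be sketched as follows. This is an illustrative sketch only and not TuneSIM code: the function names, the data layout and the toy variant identifiers are hypothetical.

```python
import random


def sample_haplotype(block_haplotypes, rng):
    """Pick one variant combination for a haplotype block.

    block_haplotypes: list of (tuple_of_variant_ids, frequency) pairs,
    e.g. as could be derived from population VCF files (assumed layout).
    """
    combos = [combo for combo, _ in block_haplotypes]
    freqs = [freq for _, freq in block_haplotypes]
    return rng.choices(combos, weights=freqs, k=1)[0]


def sample_independent_variants(variant_freqs, rng):
    """Insert each variant independently with probability equal to its frequency."""
    return {variant for variant, freq in variant_freqs.items() if rng.random() < freq}


def score_calls(truth, called):
    """Compare called variants with the inserted (true) variants."""
    true_pos = len(truth & called)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "sensitivity": true_pos / (true_pos + false_neg) if truth else 0.0,
        "precision": true_pos / (true_pos + false_pos) if called else 0.0,
    }


rng = random.Random(42)  # fixed seed so that a simulation can be reproduced
block = [(("rs1", "rs2"), 0.7), (("rs1",), 0.2), ((), 0.1)]
inserted = set(sample_haplotype(block, rng))
inserted |= sample_independent_variants({"rs10": 0.05, "rs11": 0.4}, rng)
print(inserted, score_calls(inserted, {"rs1", "rs11"}))
```

Because the simulator knows exactly which variants were inserted, false positives and false negatives can be counted directly instead of being estimated.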


Simulations with different coverage and indel rates have 
been performed and analysed with different pipelines. 
Results will be presented. 

REFERENCES 

Gilissen, et al. (2012). Disease gene identification strategies for exome 

sequencing. Eur J Hum Genet, 20, 490–497. 

Gilissen, et al. (2014). Genome sequencing identifies major causes of 

severe intellectual disability. Nature, 511, 344–347. 

Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a 

mendelian disorder. Nature Genetics, 42, 30–35. 

Purcell, et al. (2007). PLINK: a tool set for whole-genome association 

and population-based linkage analyses. American journal of human 

genetics, 81, 559–575. 

Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP: 

a database of single nucleotide polymorphisms. Nucleic Acids 

Research, 28, 352–355. 

Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis 

of Mendelian Disorders. N Engl J Med, 369, 1502–1511. 




Poster 


P18. RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH 

ALTERNATIVE FUNCTIONALITY IN MUSHROOMS 

Thies Gehrmann 1 , Jordi F. Pelkmans 2 , Han Wösten 2 , Marcel J.T. Reinders 1 & Thomas Abeel 1* . 

Delft Bioinformatics Lab, Delft Technical University 1 ; Fungal Microbiology, Science Faculty, Utrecht University 2 ; 

* T.Abeel@tudelft.nl 

Alternative splicing is well studied in mammalian genomes, and alternative transcripts are often associated with disease 

and their role in regulation is gradually being unveiled. In fungi, the study of alternative splicing has only scratched the 

surface. Using RNA-Seq data, we predict alternative transcripts based on existing gene predictions in two mushroom 

forming fungi. We study the alternative functionality of genes through functional domains, developmental stages, tissue 

and time. This analysis reveals the amount of alternative functionality induced by alternative splicing which was 

previously unknown in fungi, and asserts the need for further research. 

INTRODUCTION 

Transcript reconstruction algorithms rely on the sparsity 
(intergenic regions) of the genome in order to distinguish 
between genes. In fungi, due to the density of the genome, 
transcripts overlap in the up- and downstream untranslated 
regions (UTRs), which prevents the use of existing tools for 
transcript prediction (Roberts et al. 2011). Previous 
studies (Xie et al. 2015, Zhao et al. 2013) were limited 
to the study of splice junctions and lacked more advanced functional 
analyses. We transform the genomes of S. commune and A. 
bisporus in order to enable the prediction of alternative 
transcripts by applying existing transcript reconstruction 
algorithms to RNA-Seq data from different tissue types 
and developmental stages. We present a functional 
analysis of the resulting transcripts. 

METHODS 

We apply a transformation on our fungal genomes in order 

to reduce the impact of overlapping UTRs which prevent 

the prediction of alternative transcripts. We split the 

genome into chunks, with each chunk being defined by 

existing gene annotations. Thus, the transformation 

essentially removes intergenic regions (which contain the 

UTRs). Each chunk is then analyzed separately by 

Cufflinks (Roberts et al. 2011). Predicted transcripts are 

filtered based on read information and ORF sanity. Protein 

domain annotations are predicted for each transcript using 

InterPro (Zdobnov & Apweiler 2001). 

For each gene with multiple alternative transcripts, we 

construct a consensus sequence which allows us to call 

specific splicing events without the influence of erroneous 

reference annotations. 
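The genome transformation can be pictured with a small sketch. It is an assumption-laden illustration, not the authors' code: genes are given as hypothetical (id, start, end) tuples with 0-based, half-open coordinates on a single scaffold, and each chunk simply keeps the sequence of one annotated gene together with its offset so that predictions can be mapped back.

```python
def split_into_chunks(scaffold_seq, genes):
    """Cut a scaffold into one chunk per annotated gene.

    genes: iterable of (gene_id, start, end), 0-based and half-open (assumed).
    Intergenic sequence, and hence overlapping UTRs, is dropped.
    """
    chunks = {}
    for gene_id, start, end in genes:
        chunks[gene_id] = {"offset": start, "sequence": scaffold_seq[start:end]}
    return chunks


def chunk_to_genome(chunks, gene_id, chunk_pos):
    """Map a position inside a chunk back to scaffold coordinates."""
    return chunks[gene_id]["offset"] + chunk_pos


scaffold = "AAACCCGGGTTT"
annotation = [("gene1", 0, 3), ("gene2", 6, 9)]
chunks = split_into_chunks(scaffold, annotation)
print(chunks["gene2"]["sequence"], chunk_to_genome(chunks, "gene2", 1))
```

Each chunk can then be handed to a transcript reconstruction tool on its own, so that reads from a neighbouring gene cannot be merged into the same transcript.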


For both fungi, we find that alternative splicing is 

prevalent and many genes have multiple alternative 

transcripts (see Table 1). 

              # Orig. Genes   # Filt. Genes   # Transcripts 
S. commune    16,319          14,615          20,077 
A. bisporus   10,438          9,612           14,320 

TABLE 1. The number of originally annotated genes in S. commune and 
A. bisporus decreases after filtering based on RNA-Seq data. 
The number of new transcripts predicted indicates that 
alternative splicing is not a rare event in these fungi. 

The frequencies of specific events in the two fungi are 
similar and match what is seen in humans (Sammeth et al. 
2008). However, there are significant differences in 
event usage. While most transcripts in S. commune 
have only one event associated with them, most transcripts in 
A. bisporus have at least two events. We show that this is a 
result of co-operative events. 

As our dataset consists of multiple developmental timepoints 

and tissue types, we are able to observe the 

alternative use of transcripts through time. If a gene swaps 

transcript usage at a certain time point, this is indicative of 

a functional involvement of that particular transcript (Lees 

et al. 2015). We find multiple transcripts in both S. 

commune and A. bisporus which are activated in specific 

developmental stages of the mushroom. Furthermore, in A. 

bisporus, we are able to identify transcripts which are 

activated specifically for certain tissue types through 

development. 

Using protein domain predictions for each transcript in a 

gene, we can measure how gene functionality changes 

across its transcripts. Figure 1 shows that functional 

annotations are not always preserved across all transcripts, 

indicating alternative functionality. 
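One simple way to quantify this is to compare the sets of predicted domains across the transcripts of a gene. The sketch below assumes a hypothetical input layout (gene to transcript to set of domain identifiers) and is only meant to illustrate the comparison, not to reproduce the analysis behind Figure 1.

```python
def domain_variation(transcript_domains):
    """Report domains shared by all transcripts of a gene and domains that vary.

    transcript_domains: {gene_id: {transcript_id: set_of_domain_ids}} (assumed layout).
    Genes with a single transcript are skipped.
    """
    report = {}
    for gene_id, transcripts in transcript_domains.items():
        if len(transcripts) < 2:
            continue
        domain_sets = [set(domains) for domains in transcripts.values()]
        shared = set.intersection(*domain_sets)
        variable = set.union(*domain_sets) - shared
        report[gene_id] = {
            "shared": shared,
            "variable": variable,
            "alternative_functionality": bool(variable),
        }
    return report


# Hypothetical domain annotations for two genes.
example = {
    "geneA": {"t1": {"IPR000001", "IPR000002"}, "t2": {"IPR000001"}},
    "geneB": {"t1": {"IPR000003"}, "t2": {"IPR000003"}},
}
for gene, result in domain_variation(example).items():
    print(gene, result["alternative_functionality"], sorted(result["variable"]))
```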

FIGURE 1. Many genes in S. commune demonstrate alternative 

functionality through alternative splicing 

This is the first genome-wide functional analysis of 

alternative splicing in fungi from RNA-Seq data. We find 

a wealth of alternative splicing events in two fungi, 

resulting in many newly discovered transcripts. Although 

their functional influence is not yet demonstrated, we 

present evidence to suggest that they are relevant to 

mushroom development. 

REFERENCES 

Lees, J. G., et al. BMC Genomics, 16:1 (2015). 

Roberts, A., et al. Bioinformatics 27:17, 2325–2329 (2011). 

Sammeth, M., et al. PLoS Computational Biology, 4:8 (2008). 

Xie, B.-B., et al. BMC Genomics, 16:54 (2015). 

Zdobnov, E. M., & Apweiler, R. Bioinformatics 17:9 (2001). 

Zhao, C., et al. BMC Genomics, 14:21 (2013). 




Poster 


P19. MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE 

QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED 

QUANTITATIVE PROTEOMICS 

Ludger Goeminne 1,2,3* , Kris Gevaert 2,3 & Lieven Clement 1 . 

Department of Applied Mathematics, Computer Science and Statistics, Ghent University 1 ; VIB Medical Biotechnology 

Center 2 ; Department of Biochemistry, Ghent University 3 . * ludger.goeminne@UGent.be 

MSqRob is an R/Bioconductor package that uses robust ridge regression on peptide-level data for robust relative 

quantification of proteins in label-free data-dependent acquisition (DDA) mass spectrometry (MS)-based proteomic 

experiments. It has been shown that statistical methods inferring at the peptide-level outperform workflows that 

summarize peptide intensities prior to inference. MSqRob improves upon existing peptide-level methods by three 

modular extensions: (1) ridge regression, (2) empirical Bayes variance estimation and (3) M-estimation with Huber 

weights. The extensions make MSqRob less sensitive towards outliers and missing peptides, enabling more proteins to be 

processed. Our software provides streamlined data analysis pipelines for experiments with simple layouts as well as for 

more complex multi-factorial designs. Using a spike-in dataset, we illustrate that MSqRob grants more stable protein fold 

change estimates and improves the differential abundance (DA) ranking. 

INTRODUCTION 

In a typical label-free DDA LC-MS/MS-based proteomic 

workflow, proteins are digested to peptides, separated by 

RP-HPLC and analyzed by a mass spectrometer. However, 

several issues inherent to the protocol make data analysis 

non-trivial. Most of the common data analysis procedures 

use summarization-based workflows. We have previously 

shown that inference at the peptide level outperforms these 

summarization-based approaches (Goeminne et al., 2015). 

However, even these pipelines are sensitive to outliers and 

suffer from overfitting. Here, we present MSqRob, an 

R/Bioconductor package that starts from peptide-level data 

and provides robust inference on DA at the protein level. 

METHODS 

Dataset. To demonstrate the performance of our package, 

we use the CPTAC dataset, in which 48 known human 

proteins were spiked-in at different concentrations in a 

yeast proteome background. Ideally, when comparing 

different spike-in conditions, only the human proteins 

should be flagged as differentially abundant. 

Competing analytical methods. MaxLFQ+Perseus, 

which summarizes peptide data, followed by pairwise t-tests. 

LM model. Generally, peptide-based models are 

constructed as follows: 

$$y_{ijklmn} = \mathrm{treat}_{ij} + \mathrm{pep}_{ik} + \mathrm{biorep}_{il} + \mathrm{techrep}_{im} + \varepsilon_{ijklmn}$$

with $y_{ijklmn}$ the $n$th $\log_2$-transformed normalized feature 
intensity for the $i$th protein under the $j$th treatment $\mathrm{treat}_{ij}$, 
the $k$th peptide sequence $\mathrm{pep}_{ik}$, the $l$th biological repeat 
$\mathrm{biorep}_{il}$ and the $m$th technical repeat $\mathrm{techrep}_{im}$, and 
$\varepsilon_{ijklmn}$ a normally distributed error term with mean zero 
and variance $\sigma_i^2$. 

MSqRob. MSqRob adds the following improvements to 

the LM model: 

1. Ridge regression: shrink parameter estimates 

towards 0 by adding a ridge penalty term to the 

loss function. 

2. Stabilize variance estimation by borrowing 

information across proteins with empirical 

Bayes (EB): shrink individual variances towards 

the pooled variance. 

3. M estimation with Huber weights: weigh down 

observations with large errors. 
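The three extensions listed above can each be shown in a one-parameter setting. The following sketch is not MSqRob code (MSqRob is an R package that fits the full peptide-level model); the function names, the Huber tuning constant 1.345, the MAD-based scale and the moderated-variance formula with prior degrees of freedom are common choices assumed here for illustration.

```python
from statistics import median


def ridge_effect(values, penalty):
    """Ridge estimate of one effect b: argmin sum((y - b)^2) + penalty * b^2."""
    return sum(values) / (len(values) + penalty)


def squeeze_variance(s2, df, s2_prior, df_prior):
    """Shrink a protein-wise variance towards a prior (pooled) variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


def huber_weights(residuals, scale, k=1.345):
    """Weight 1 for small residuals, k*scale/|r| for large ones."""
    weights = []
    for r in residuals:
        size = abs(r)
        weights.append(1.0 if size <= k * scale else k * scale / size)
    return weights


def huber_location(values, k=1.345, n_iter=50):
    """M-estimate of a location parameter by iteratively reweighted least squares."""
    mu = median(values)
    mad = median([abs(v - mu) for v in values])
    scale = 1.4826 * mad if mad > 0 else 1.0  # fall back to 1 if the MAD is zero
    for _ in range(n_iter):
        weights = huber_weights([v - mu for v in values], scale, k)
        mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return mu


log2_intensities = [1.0, 1.0, 1.0, 1.0, 100.0]  # one gross outlier
print(sum(log2_intensities) / len(log2_intensities), huber_location(log2_intensities))
print(ridge_effect([2.0, 2.0], penalty=2.0), squeeze_variance(4.0, 2, 1.0, 2))
```

In the example the ordinary mean is pulled far towards the outlier, whereas the Huber estimate stays close to the bulk of the observations; the ridge and variance functions show how both kinds of shrinkage pull an estimate towards zero or towards the pooled value, respectively.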


MSqRob uses MaxQuant or Mascot peptide-level data as 
input. It performs preprocessing and robust model fitting, and 
returns log2 fold change estimates and FDR-corrected 
p-values for all model parameters and/or (user-specified) 
contrasts. Advanced users have the flexibility to (a) adopt 
their own preprocessing pipeline (e.g. transformation, 
normalization, removal of contaminants…) and (b) specify the 
appropriate model structure. Compared to competing 
methods, MSqRob returns more stable log2 fold change 
estimates, improves DA ranking (Figure 1) and is able to 
discern between consistently strong DA and an accidental 
hit caused by outliers or by a small variance due to random 
chance in low-abundant proteins. 

FIGURE 1. Receiver operating characteristic (ROC) curves showing the 

superior performance of MSqRob compared to a simple linear model 

(LM) and a summarization-based approach (MaxLFQ+Perseus) when 

comparing the lowest spike-in concentration 6A with the second lowest 

spike-in concentration 6B. Stars denote the methods’ cut-off at an 

estimated 5 % FDR. 

REFERENCES 

Goeminne LJE et al. Journal of Proteome Research 14, 2457-2465 (2015). 





Poster 


P20. A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF 

MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN 

CANCER 

Tine Goovaerts 1 , Sandra Steyaert 1 , Jeroen Galle 1 , Wim Van Criekinge 1 & Tim De Meyer 1* . 

BIOBIX lab of Bioinformatics and Computational Genomics, Department of Mathematical Modelling, 

Statistics and Bioinformatics, Ghent University 1 . * tim.demeyer@ugent.be 

Imprinting is a phenomenon characterized by parent-specific monoallelic gene expression. Its deregulation has been 
associated with non-Mendelian inherited genetic diseases but is also a common feature of cancer. As imprinting does not 
alter the genome yet is mitotically inherited, epigenetics is deemed to be a key regulator. Current knowledge in the field 
is particularly hampered by a lack of accurate computational techniques suitable for omics data. Here we introduce a 
mixture model for the identification of monoallelically expressed loci based on large-scale omics data, which can also be 
exploited to identify samples and loci characterized by loss of imprinting / monoallelic expression. 

INTRODUCTION 

The genome-wide identification of mono-allelically 

expressed or epigenetically modified loci typically 

requires the presence of SNPs to discriminate both alleles. 

Current methods predominantly rely on genotyping for the 

identification of heterozygous loci in a limited sample set, 

followed by testing whether the expression/epigenetic 

modification levels for both alleles deviate from a 1:1 ratio 

for those loci (Wang et al., 2014). This approach is limited 

by the genotyping step and the required presence of 

heterozygous individuals. As large scale omics data is 

becoming increasingly available, an alternative strategy 

may be to screen larger numbers (e.g. hundreds) of 

samples, ensuring the presence of heterozygous 

individuals at predictable rates, thereby also avoiding the 

need for and limitations of a prior genotyping step. 

Based on this concept, a previous strategy (Steyaert et al., 

2014) enabled us to identify and validate approximately 80 

loci characterized by monoallelic DNA methylation, but had 

several drawbacks, such as computational inefficiency, 

heavy reliance on Hardy-Weinberg equilibrium (HWE), 

need for 100% imprinting and low power, which limited 

its practical use. Here we present a novel mixture model 

for the identification of monoallelically modified or 

expressed loci from large-scale omics data (without 

known genotypes) that largely circumvents previous 

drawbacks. 

METHODS 

The rationale of the methodology is that RNA-seq and 
ChIP-seq(-like) derived SNP data for monoallelic loci are 
characterized by a general lack of apparent heterozygosity. 
More specifically, under the null hypothesis (no 
imprinting) the homozygous and heterozygous sample 
fractions can be modelled as a mixture of (beta-)binomial 
distributions, with weights either according to HWE or 
empirically derived. For imprinted loci, however, the 
heterozygous fraction is split and shifted towards the two 
homozygous fractions (Figure 1), which can be evaluated 
with a likelihood ratio test. The model does not require 
prior genotyping data but can incorporate it, and it allows for 
deviation from HWE, sequencing errors, efficiency 
differences and partial monoallelic events. Once loci 
characterized by monoallelic events have been identified in 
control data, a loss-of-imprinting index can be calculated 
for each non-normal sample based on the mixture model 
likelihoods, and loci generally characterized by loss of 
imprinting in the pathology under study can be identified. 
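A minimal sketch of such a mixture likelihood is given below. It is not the authors' model: it uses plain binomial components with HWE weights, a known reference allele frequency, a fixed sequencing error rate and a simple grid search over the degree of imprinting, all of which are simplifying assumptions, and every function name is hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k reference reads out of n."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def sample_likelihood(ref_reads, total_reads, ref_freq, imprinting, seq_error=0.01):
    """Mixture likelihood of one sample; imprinting ranges from 0 (none) to 1 (full)."""
    p, q = ref_freq, 1.0 - ref_freq

    def observed(fraction):
        # Expected reference read fraction after allowing for sequencing errors.
        return fraction * (1.0 - seq_error) + (1.0 - fraction) * seq_error

    hom_ref = p * p * binom_pmf(ref_reads, total_reads, observed(1.0))
    hom_alt = q * q * binom_pmf(ref_reads, total_reads, observed(0.0))
    # Heterozygotes are split in two halves shifted towards the homozygous peaks.
    het_hi = binom_pmf(ref_reads, total_reads, observed(0.5 + imprinting / 2.0))
    het_lo = binom_pmf(ref_reads, total_reads, observed(0.5 - imprinting / 2.0))
    het = 2.0 * p * q * 0.5 * (het_hi + het_lo)
    return hom_ref + het + hom_alt


def log_likelihood(samples, ref_freq, imprinting, seq_error=0.01):
    """Sum of log mixture likelihoods over (ref_reads, total_reads) pairs."""
    return sum(math.log(sample_likelihood(k, n, ref_freq, imprinting, seq_error))
               for k, n in samples)


def likelihood_ratio_test(samples, ref_freq, grid_size=100):
    """Return (LRT statistic, best degree of imprinting found on the grid)."""
    null_ll = log_likelihood(samples, ref_freq, 0.0)
    best_i, best_ll = 0.0, null_ll
    for step in range(1, grid_size + 1):
        degree = step / grid_size
        ll = log_likelihood(samples, ref_freq, degree)
        if ll > best_ll:
            best_i, best_ll = degree, ll
    return 2.0 * (best_ll - null_ll), best_i


# Toy SNP: every sample looks homozygous although the allele frequency is 0.5.
counts = [(20, 20), (0, 20), (19, 20), (1, 20), (20, 20), (0, 20)]
print(likelihood_ratio_test(counts, ref_freq=0.5))
```

With imprinting set to 0 the heterozygous component reduces to the usual 50:50 binomial, so the null model is nested in the alternative, as a likelihood ratio test requires.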


We demonstrate the applicability of the novel mixture 

model with simulations and a proof of concept study using 

breast cancer and control RNA-seq data from The Cancer 

Genome Atlas (TCGA Research Network, 2008). Well 

known imprinted loci such as IGF2 (Figure 1) and H19 

were indeed identified. Ongoing efforts are directed 

towards artefact-free RNA/ChIP-seq data based allele 

frequency inference and the efficient implementation of a 

beta-binomial based mixture. 

FIGURE 1. Observed (red) and modelled (green) allele frequencies for a 

100% (right, no observable heterozygotes) and a partially imprinted 

(left) SNP of the IGF2 gene 

In conclusion, we introduce a novel mixture model for the 

identification of loci characterized by monoallelic events, which 

can subsequently be exploited to determine their 

deregulation in the pathology of interest. 

REFERENCES 

Steyaert S et al. Nucleic Acids Research 42, e157 (2014). 

TCGA Research Network. Nature 455, 1061-1068 (2008). 

Wang X & Clark AG. Heredity 113, 156-166 (2014). 




Poster 


P21. GEVACT: GENOMIC VARIANT CLASSIFIER TOOL 

Isel Grau 1,4 , Dorien Daneels 2,3 , Sonia Van Dooren 2,3 , Maryse Bonduelle 2 , 

Dewan Md. Farid 1,3 , Didier Croes 2,3 , Ann Nowé 1,3 & Dipankar Sengupta 1,3* . 

Como - Artificial Intelligence Lab, Vrije Universiteit Brussel 1 ; Centre for Medical Genetics, Reproduction and Genetics, 

Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel,UZ Brussel 2 ; Interuniversity Institute of 

Bioinformatics in Brussels, ULB-VUB 3 ; Department of Computer Sciences, Universidad Central de Las Villas 4 . 

* Dipankar.Sengupta@vub.ac.be 

High-throughput screening (HTS) techniques, like genome or exome screening, are becoming the norm in conventional 
clinical analysis. However, classifying the identified variants as pathogenic, potentially pathogenic or non-pathogenic 
is still a manual, tedious and time-consuming process for clinicians and geneticists. Thus, to facilitate the 
variant classification process, we have developed GeVaCT, a Java-based tool built around an algorithm that is based on the 
existing literature and the knowledge of clinical geneticists. GeVaCT can classify variants annotated by Alamut Batch, and 
we plan to support input from other annotation software in the future. 

INTRODUCTION 

With the emergence of new screening techniques, targeted 

or whole exome and genome screening are becoming 

standard diagnostic norms in clinical settings to identify 

the variants for a genetic disease (Ng et al., 2010; 

Saunders et al., 2012). However, the development of 
bioinformatics solutions for pathogenicity classification of 
the variants remains a big challenge, which makes 
the process cumbersome for geneticists and 
clinicians. In this work, we describe GeVaCT (Genomic 

Variant Classifier Tool), a tool for classification of 

genomic single nucleotide and short insertion/deletion 

variants. The aim of this study was to design and 

implement a variant classification algorithm, based on a 

literature review of cardiac arrhythmia syndromes 

(Hofman et al., 2013; Schulze-Bahr et al., 2000; Wilde & 

Tan, 2007) and existing knowledge of clinical geneticists. 

METHODS 

The algorithm we propose for GeVaCT is based on a 
published variant classification schema for cardiac 
arrhythmia syndromes. This schema is based on the yield 
of DNA testing over a time span of 15 years (1996-2011), 
compared between probands with isolated and familial cases, and 
between probands with and without clear disease-specific 
clinical characteristics (Hofman et al., 2013). It proposes 
two approaches: one to classify missense variants 
and another to classify nonsense and frameshift variants. 

The algorithm is implemented in two phases: pre-processing 
and classification. In the pre-processing phase, 
the annotated tab-delimited variant file (vcf.ann) from 
Alamut Batch is restricted to the gene list for the 
disease of interest, so as to reduce the number of variants 
to analyse. Filters are then applied to look for variants that 
have already been reported in the Human Gene 
Mutation Database (Stenson et al., 2003) and in ClinVar 
(Landrum et al., 2014), or that have previously been 
detected and classified in an internal patient population. 
Lastly, the variants are filtered based on their location 
in the genome and their coding effect, followed by a 
check of the minor allele frequency of the variant in a control 
population (Sherry ST et al. 2001). 

In the classification phase, the filtered variants are classified 
either as missense or as nonsense and frameshift variants. For 
missense variants, the classification is based on the following 
parameters: amino acid substitution and its impact on 
protein function (Adzhubei et al., 2010; Kumar et al., 
2009), biochemical variation (Mathe et al., 2006), 
conservation (Pollard et al., 2010), frequency of variant 
alleles in a control population (ExAC, 2015), effects on 
splicing (Desmet et al., 2009), family and phenotype 
information, and functional analysis. For 
nonsense and frameshift variants, it is based on effects on 
splicing, frequency of variant alleles in a control 
population, family and phenotype information, and 
functional analysis. For each parameter, a score is given to 
the variant, and the scores are subsequently summed. 
Finally, based on the cumulative score, each variant 
is classified into one of five categories: Class I - 
Non-Pathogenic; Class II - VUS1 (unlikely pathogenic); Class 
III - VUS2 (unclear); Class IV - VUS3 (likely 
pathogenic); Class V - Pathogenic (Sharon et al., 2008). 
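The final step, turning per-parameter scores into one of the five classes, can be sketched as follows. The class labels come from the text, but the score values and the class boundaries below are purely hypothetical placeholders; the actual scores and thresholds are defined by the published schema and are not given in this abstract. The sketch is in Python for brevity, whereas the tool itself is Java based.

```python
# Hypothetical upper bounds (exclusive) of the cumulative score for each class.
CLASS_BOUNDARIES = [
    (2, "Class I - Non-Pathogenic"),
    (4, "Class II - VUS1 (unlikely pathogenic)"),
    (6, "Class III - VUS2 (unclear)"),
    (8, "Class IV - VUS3 (likely pathogenic)"),
]
TOP_CLASS = "Class V - Pathogenic"


def classify_variant(parameter_scores):
    """Sum the per-parameter scores and map the total to a class label.

    parameter_scores: {parameter_name: score}; names and values are illustrative.
    """
    total = sum(parameter_scores.values())
    for upper_bound, label in CLASS_BOUNDARIES:
        if total < upper_bound:
            return total, label
    return total, TOP_CLASS


missense_scores = {
    "protein_function_impact": 2,
    "biochemical_variation": 1,
    "conservation": 1,
    "control_population_frequency": 1,
    "splicing_effect": 0,
    "family_and_phenotype": 0,
    "functional_analysis": 0,
}
print(classify_variant(missense_scores))
```

Keeping the boundaries in a single table makes it straightforward to swap in a different scoring schema, for example when moving to another disease.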


In this study, we report a Java-based tool called GeVaCT, 
developed for the classification of genomic variants. The input for 
the tool is an annotated vcf file, while the output reports 
the cumulative classification score along with the class 
label for each variant. The tool was tested on a dataset of 130 
cardiac arrhythmia syndrome patients, available at UZ 
Brussel. The results of the variant classification made by 
the tool were cross-validated by manual curation, 
performed by a clinical geneticist. Overall, the 
study indicates that the tool is promising, but it needs to be 
further validated on datasets from other diseases. In 
addition, we are working on making the tool adaptable to 
file inputs from other annotation software. 

REFERENCES 

Adzhubei IA et al. Nat Methods 7(4), 248-249 (2010). 

Desmet et al. Nucleic Acids Res 37 (9): e67 (2009). 

Exome Aggregation Consortium (ExAC), Cambridge, MA (2015). 

Hofman N et al. Circulation 128(14),1513-21 (2013). 

Kumar P et al. Nat Protoc 4(7), 1073–1081 (2009). 

Landrum MJ et al. Nucleic Acids Res 42(1), D980-5 (2014). 

Mathe E et al. Nucleic Acids Res 34(5),1317-25 (2006). 

Ng SB et al. Nat Genetics 42, 30–35 (2010). 

Pollard K et al. Genome Res 20, 110-121 (2010). 

Saunders CJ et al. Sci Transl Med 4, 154ra135 (2012). 

Sharon EP et al. Hum Mutat. 29(11), 1282–1291 (2008). 

Sherry ST et al. Nucleic Acids Res 29(1),308-11 (2001). 

Schulze-Bahr E et al. Z Kardiol 89 Suppl 4:IV12-22 (2000). 

Stenson et al. Hum Mutat. 21:577-581 (2003). 

Wilde AA & Tan HL Circ J 71, Suppl A:A12-9 (2007). 




Poster 


P22. MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH 

THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT 

EXPERIMENTS 

Surya Gupta 1,2,3 , Jan Tavernier 1,2 & Lennart Martens 1,2,3 . 

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ; 

Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 . 

INTRODUCTION 

Proteins are highly interesting objects of study, involved 

in different cellular and molecular functions. Identification 

and quantification of these proteins along with their 

interacting proteins, nucleic acids and molecules can 

provide insight into development and disease mechanisms 

at the systems level. Yet studying these interactions is not 

trivial. In vivo methods exist to determine these 

interactions, but these suffer from several drawbacks [4]. 

To overcome existing problems, an innovative approach 

called MAPPIT (Mammalian Protein-Protein Interaction 

Trap) [2] has been established in the Cytokine Receptor 

Lab to determine interacting partners of proteins in 

mammalian cells. To allow screening of thousands of 

interactors simultaneously, MAPPIT has been parallelized 

in the array MAPPIT system [3]. 

AIM 

However, no effective pipeline existed to process the high-throughput 

data generated from array MAPPIT. We 

therefore established an automated high-throughput data 

analysis system called MAPPI-DAT (Mappit Array 

Protein Protein Interaction- Database & Analysis Tool). 

METHODS 

In the array-MAPPIT platform, the interaction of two 
proteins (bait and prey) restores a mutated JAK-STAT 
signaling pathway, which leads to the expression of 
fluorescence-emitting genes. In order to rank the positive 
interactions based on fluorescence intensity, RankProd [1] 
is used. This method was originally developed to 
determine differentially expressed genes in microarray 
experiments and is available as an R package. To minimize 
false positive hits in the RankProd output, quartile-based 
filtering was applied. MySQL was used to build 
the data management system for the array-MAPPIT 
system. 
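The core of the rank product statistic can be sketched in a few lines. This is not the RankProd package, which is written in R and also estimates significance by permutation; the sketch only computes the geometric mean of ranks across replicates, and the input layout (one dictionary of prey to signal per replicate) and the prey names are hypothetical.

```python
import math


def rank_product(replicates):
    """Geometric mean of the per-replicate ranks of each prey.

    replicates: list of {prey_id: signal}, all with the same prey ids (assumed).
    The strongest signal gets rank 1, so small rank products indicate
    consistently strong interactions.
    """
    preys = list(replicates[0])
    log_rank_sum = {prey: 0.0 for prey in preys}
    for signals in replicates:
        ordered = sorted(preys, key=lambda prey: signals[prey], reverse=True)
        for rank, prey in enumerate(ordered, start=1):
            log_rank_sum[prey] += math.log(rank)
    n_replicates = len(replicates)
    return {prey: math.exp(log_rank_sum[prey] / n_replicates) for prey in preys}


# Hypothetical fluorescence signals for three preys in two replicates.
replicate_signals = [
    {"preyA": 5.0, "preyB": 3.0, "preyC": 1.0},
    {"preyA": 4.0, "preyB": 6.0, "preyC": 2.0},
]
for prey, value in sorted(rank_product(replicate_signals).items(), key=lambda item: item[1]):
    print(prey, round(value, 3))
```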

RESULTS 

To extend and ease the usage of the analysis pipeline and 

database system, an interface has been developed called 

MAPPI-DAT. MAPPI-DAT is capable of processing 

many thousands of data points for each experiment, and 

comprises a data storage system that stores the 

experimental data in a structured way for meta-analysis. 

REFERENCES 

[1] Breitling, R., Armengaud, P., Amtmann, A., & Herzyk, P. (2004). 

Rank products: A simple, yet powerful, new method to detect 

differentially regulated genes in replicated microarray experiments. 

FEBS Letters, 573(1-3), 83–92. 

[2] Lievens, S., Peelman, F., De Bosscher, K., Lemmens, I., & 

Tavernier, J. (2011). MAPPIT: A protein interaction toolbox built on 

insights in cytokine receptor signaling. Cytokine and Growth Factor 

Reviews, 22(5-6), 321–329. 

[3] Lievens, S., Vanderroost, N., Van der Heyden, J., Gesellchen, V.,

Vidal, M., & Tavernier, J. (2009). Array MAPPIT: High-Throughput

Interactome Analysis in Mammalian Cells, 877–886.

[4] S.Gopichandran and S.Ranganathan. (2013). Protein-protein 

Interactions and Prediction: A Comprehensive Overview. Protein and 

Peptide Letters, 779–789 




Poster 


P23. HIGHLANDER: VARIANT FILTERING MADE EASIER 

Raphael Helaers 1* & Miikka Vikkula 1 . 

Human Molecular Genetics (GEHU), de Duve Institute, Université catholique de Louvain 1 . 

* Raphael.helaers@UCLouvain.be 

The field of human genetics is being revolutionized by exome and genome sequencing. A massive amount of data is 

being produced at ever-increasing rates. Targeted exome sequencing can be completed in a few days using NGS, 

allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false 

positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the 

identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various 

sources, as well as advanced filtering capabilities. We have developed Highlander, a Java application coupled to a MySQL

database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that 

are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life 

Technologies’ SOLiD or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’

LifeScope). Variant calls are annotated using dbNSFP (providing predictions from 6 different programs, and MAF from

1000G and ESP), GoNL and SnpEff, and subsequently imported into the database. The database is used to compute global

statistics, allowing for the discrimination of variants based on their representation in the database. The Highlander GUI 

easily allows for complex queries to this database, using shortcuts for certain standard criteria, such as “sample-specific 

variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query 

results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools, 

including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions. 
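Two of the shortcut criteria named above, “sample-specific variants” and “variants common to specific samples”, reduce to set operations over per-sample variant calls. The sketch below only illustrates that logic. It is not Highlander's code or its database queries, and the function names, the variant tuple layout `(chromosome, position, ref, alt)` and the sample names are hypothetical.

```python
def sample_specific(calls, sample):
    """Variants called in `sample` and in no other sample."""
    others = set().union(*(v for s, v in calls.items() if s != sample))
    return calls[sample] - others


def common_to(calls, samples):
    """Variants called in every one of the listed samples (list must be non-empty)."""
    return set.intersection(*(calls[s] for s in samples))


# Hypothetical calls: sample name -> set of (chromosome, position, ref, alt)
calls = {
    "patient1": {("1", 100, "A", "G"), ("2", 5, "C", "T")},
    "patient2": {("2", 5, "C", "T")},
    "control": {("2", 5, "C", "T"), ("3", 9, "G", "A")},
}
print(sample_specific(calls, "patient1"))
print(common_to(calls, ["patient1", "patient2"]))
```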




Poster 


P24. DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR 

GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION 

DATA WITH MULTIPLE DOSES AND TIME POINTS 

Diana M Hendrickx 1* , Danyel G J Jennen 1 & Jos C S Kleinjans 1 . 

Department of Toxicogenomics, Maastricht University, The Netherlands 1 . 

*d.hendrickx@maastrichtuniversity.nl 

Toxicogenomics, the application of ‘omics’ technologies to toxicology, is a rapidly growing field due to the need for 

alternatives to animal experiments for toxicity testing of compounds. Identification of gene regulatory networks affected 

by compounds is important to gain more insight into the mode of action of a toxic compound. The response to a toxic 

compound is both time and dose dependent. Therefore, toxicogenomics data are often measured across several time 

points and doses. However, to our knowledge, no method exists for gene regulatory network inference that

takes into account both time and dose dependencies. Here we present Dose-Time Network Identification (DTNI), a novel 

gene regulatory network inference algorithm that takes into account both dose and time dependencies in the data. We 

show that DTNI can be used to infer gene regulatory networks affected by a group of compounds with the same mode of 

action. This is illustrated with gene expression (microarray) data from COX inhibitors, measured in human hepatocytes. 

INTRODUCTION 

Identifying and understanding gene regulatory networks 

(GRN) influenced by chemical compounds is one of the 

main challenges of systems toxicology. A GRN affected 

by one or more compounds evolves over time and with 

dose. The analysis of gene expression data measured at 

multiple time points and for multiple doses can provide 

more insight into the effects of compounds. Therefore, there

is a need for mathematical approaches for GRN 

identification from this type of data. 

METHODS 

One of the mathematical approaches currently used for 

GRN inference is based on ordinary differential equations 

(ODE), where changes in gene expression over time are 

related to each other and to the external perturbation (i.e. 

the dose of the compound). Because gene expression data 

usually have fewer data points than variables (genes), ODE

approaches are often combined with interpolation and/or 

dimension reduction techniques (PCA). A current method 

that combines ODE with both interpolation and dimension 

reduction techniques is Time Series Network 

Identification (TSNI) (Bansal et al., 2006). 

Here, we present Dose-Time Network Identification 

(DTNI), a method that extends TSNI by including ODE 

that describe changes in gene expression over dose in 

relation to each other and to time. We also adapted the 

original method so that it can include data from multiple 

perturbations (compounds). 
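To make the ODE idea concrete, the following is a deliberately reduced sketch and not the TSNI or DTNI algorithm: it fits a single-gene linear model dx/dt ≈ a·x + b·u by least squares on forward differences, without interpolation or dimension reduction and without the additional dose equations. The function name `fit_linear_ode`, the single-gene restriction and the toy data are assumptions made for illustration.

```python
def fit_linear_ode(times, x, u):
    """Least-squares estimate of (a, b) in dx/dt = a*x + b*u.

    times, x, u: equal-length sequences (time points, expression, dose).
    Derivatives are approximated by forward differences.
    """
    rows = []
    for k in range(len(times) - 1):
        dxdt = (x[k + 1] - x[k]) / (times[k + 1] - times[k])
        rows.append((x[k], u[k], dxdt))
    # Normal equations of the two-parameter linear model
    sxx = sum(r[0] * r[0] for r in rows)
    sxu = sum(r[0] * r[1] for r in rows)
    suu = sum(r[1] * r[1] for r in rows)
    sxy = sum(r[0] * r[2] for r in rows)
    suy = sum(r[1] * r[2] for r in rows)
    det = sxx * suu - sxu * sxu
    if abs(det) < 1e-12:
        raise ValueError("x and u are collinear; parameters not identifiable")
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b


# Toy series generated by forward Euler with a = -0.5, b = 2.0, constant dose 1
times = [0.1 * k for k in range(6)]
dose = [1.0] * 6
expr = [0.0]
for k in range(5):
    expr.append(expr[-1] + 0.1 * (-0.5 * expr[-1] + 2.0 * dose[k]))
a_hat, b_hat = fit_linear_ode(times, expr, dose)
print(a_hat, b_hat)
```

With many genes the same least-squares problem has more unknowns than data points, which is the reason the text gives for combining ODE models with interpolation and dimension reduction.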


RESULTS

By exploiting simulated data, we show that including

ODE for expression changes over dose leads to improved 

GRN identification compared with including only ODE 

that describe changes over time. Furthermore, we show 

that DTNI performs better when including data from 

multiple perturbations (compounds) than when applying 

DTNI to data from a single perturbation. This suggests 

that the method is suitable to infer a GRN affected by 

compounds with the same mode of action. As an example, 

we infer the network affected by COX inhibitors from 

public microarray data of 6 COX inhibitors, measured in 

human hepatocytes, available from Open TG-Gates 

(http://toxico.nibio.go.jp/english/index.html) (Noriyuki et 

al., 2012). The interactions in the inferred network were 

compared to interactions from ConsensusPathDB, a 

database including interactions from 32 different sources 

(Kamburov et al., 2013). The inferred network was 

validated by leave-one-out cross-validation (LOOCV). Six

datasets were created from the original data by leaving out 

the data of one compound. The network constructed from 

the whole data set showed large overlap with the networks 

constructed from each of the LOOCV datasets. Edges in 

the network constructed from the whole data set, but not in 

the networks constructed from the LOOCV datasets were 

removed from the network. The remaining novel 

interactions, i.e. those that are not in ConsensusPathDB, 

have to be validated experimentally, e.g. by gene-knockdown

experiments.
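The edge-pruning step of the LOOCV validation can be sketched as a set intersection. The abstract does not say whether an edge must be missing from one or from all LOOCV networks to be removed; the sketch assumes the stricter reading, in which an edge is kept only if it appears in every LOOCV network. The function name and the gene names are hypothetical.

```python
def stable_edges(full_network, loocv_networks):
    """Edges of the full-data network that recur in every LOOCV network.

    Each network is an iterable of (regulator, target) pairs.
    """
    kept = set(full_network)
    for network in loocv_networks:
        kept &= set(network)
    return kept


# Hypothetical networks; the real analysis uses six LOOCV datasets
full = {("G1", "G2"), ("G2", "G3"), ("G1", "G4")}
loocv = [
    {("G1", "G2"), ("G2", "G3")},
    {("G1", "G2"), ("G1", "G4"), ("G2", "G3")},
]
print(sorted(stable_edges(full, loocv)))
```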

FIGURE 1. Workflow for identifying a gene regulatory network affected 

by a group of compounds with the same mode of action. 

REFERENCES 

Bansal M et al. Bioinformatics 22, 815-822 (2006). 

Noriyuki N et al. J Toxicol Sci 37,791-801 (2012). 

Kamburov A et al. Nucl Acids Res 41, D793-D800 (2013). 




Poster


P25. IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS 

USING A “DUMMY” LIGAND APPROACH 

Susanne M.A. Hermans, Christopher Pfleger & Holger Gohlke * . 

Department of Mathematics and Natural Sciences, Institute for Pharmaceutical and Medicinal Chemistry, Heinrich- 

Heine-University, Düsseldorf, Germany. * gohlke@uni-duesseldorf.de 

Targeting allosteric sites is a promising strategy in drug discovery due to their regulatory role in almost all cellular 

processes. Currently, there is no standard method to identify novel pockets and to detect whether a pocket has a 

regulatory effect on the protein. Here, we present a new and efficient approach to probe information transfer through 

proteins in the context of dynamically dominated allostery that exploits “dummy” ligands as surrogates for allosteric 

modulators. 

INTRODUCTION 

Allosteric regulation is the coupling between separated 

sites in biomacromolecules such that an action at one site 

changes the function at a distant site. Allosteric drugs are

popular because they often have fewer side effects than orthosteric

drugs, as allosteric sites are less conserved. The

identification of novel allosteric pockets is complicated by 

the large variation in allosteric regulation, ranging from 

rigid body motions to disorder/order transitions, with 

dynamically dominated allostery in between (Motlagh et 

al., 2014). Here we focus on dynamically dominated 

allostery with minimal or no conformational changes. 

Novel pockets have no known ligand; therefore, we

generate “dummy” ligands to function as surrogates for 

allosteric ligands. We have developed an efficient 

approach to probe information transfer through proteins 

using “dummy” ligands and detect if allosteric coupling is 

present between the novel pocket and the orthosteric site. 

METHODS 

In a preliminary study to test the general feasibility, the 

approach was applied to conformations extracted from an

MD trajectory of the holo and apo structures of LFA1. 

The grid-based PocketAnalyzer program (Craig et al., 

2011) is used to detect putative binding sites. “Dummy” 

ligands were generated for each detected pocket along the 

ensemble. Finally, the Constraint Network Analysis 

(CNA) software, which links biomacromolecular structure, 

(thermo-)stability, and function, is used to probe the 

allosteric response by monitoring altered stability 

characteristics of the protein due to the presence of the 

“dummy” ligand (Pfleger et al., 2013; Krüger et al., 2013; 

Pfleger, 2014). The results were compared to those of the 

holo structure with the bound allosteric ligand to validate 

the “dummy” ligand approach. 


RESULTS

Remarkably, the usage of “dummy” ligands almost

perfectly reproduced the results obtained from the known 

allosteric effector. Although it turned out that the intrinsic 

rigidity of the “dummy” ligands over-stabilizes the LFA1 

structure, these results are already encouraging. Even for 

the LFA1 apo structures, where the allosteric pocket is 

partially closed, the results are in agreement with known 

allosteric effectors. Overall, the results obtained from the 

validation of the “dummy” ligand approach are 

encouraging. This suggests that our “dummy” ligand 

approach for the characterization of unexplored allosteric 

pockets is a promising step towards identifying novel drug 

targets. 

REFERENCES 

Craig, I.R. et al. J. Chem. Inf. Model. 51 2666–2679 (2011). 

Krüger, D. M. et al. Nucleic Acids Res. 41 340–348 (2013). 

Motlagh, H.N. et al. Nature 508(7496) 331–339 (2014).

Pfleger, C. et al. J. Chem. Inf. Model. 53 1007–1015 (2013). 

Pfleger, C. Doctoral Thesis, Heinrich Heine University, Düsseldorf, 

Germany (2014). 




Poster 


P26. PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL 

GENETICALLY MODIFIED CONGENIC MICE 

Paco Hulpiau 1,2,3 *, Liesbet Martens 1,2,3 *, Yvan Saeys 1,2,3 , Peter Vandenabeele 1,2,4 & Tom Vanden Berghe 1,2 . 

Inflammation Research Center, VIB, Ghent, Belgium 1 ; Department of Biomedical Molecular Biology, Ghent University, 

Ghent, Belgium 2 ; Data Mining and Modelling for Biomedicine (DaMBi), Ghent, Belgium 3 ; Methusalem Program, Ghent 

University, Belgium 4 . *paco.hulpiau@irc.vib-ugent.be, liesbet.martens@irc.vib-ugent.be 

Targeted mutagenesis in mice is a powerful tool for functional analysis of genes. However, genetic variation between 

embryonic stem cells (ESCs) used for targeting (previously almost exclusively 129-derived) and recipient strains (often 

C57BL/6J) typically results in congenic mice in which the targeted gene is flanked by ESC-derived passenger DNA 

potentially containing mutations. Comparative genomic analysis of 129 and C57BL/6J mouse strains revealed indels and 

single nucleotide polymorphisms resulting in alternative or aberrant amino acid sequences in 1,084 genes in the 129-strain

genome.

INTRODUCTION 

Annotating the passenger mutations to the reported 

genetically modified congenic mice that were generated 

using 129-strain ESCs revealed that nearly all these mice 

possess multiple passenger mutations potentially 

influencing the phenotypic outcome. We illustrated this 

phenotypic interference of 129-derived passenger 

mutations with several case studies and developed the

Me-PaMuFind-It web tool to estimate the number and possible

effect of passenger mutations in transgenic mice of interest. 

METHODS 

We analyzed the SNP data release v3 from the Mouse 

Genome Project available at Sanger Institute (Keane et al., 

2011). The data in the indel vcf file and SNP vcf file were 

filtered to retrieve indels and SNPs present in at least one 

of the three 129 strains (129P2/OlaH, 129S1/SvIm and 

129S5SvEvB) and affecting the protein coding sequence 

of the genes. These so-called protein coding variants are 

based on the following sequence ontology (SO) terms: 

stop gained, stop lost, inframe insertion, inframe deletion, 

frameshift variant, splice donor variant, splice acceptor 

variant, and coding sequence variant. In total, 949 indels 

and 446 SNPs affecting 1,084 mouse genes were retained. 

We gathered chromosome and gene start and end positions 

for 1,084 genes covering 1,395 variations. The Ensembl 

gene ID was used to find the most upstream and 

downstream start and stop in all Ensembl transcripts for 

that gene. Next these genome coordinates were used to 

search for flanking genes within 2, 10, and 20 Mbp

upstream and downstream. We then downloaded all mouse 

phenotypic allele data from the MGI resource and 

extracted the data of genetically modified mouse lines. 

Information on 5,322 genes (corresponding to 7,979

129-derived genetically modified mouse lines) was connected

to genes with passenger mutations and affected genes. 

Additionally we filtered the data to identify putative 

regulatory variants. All data were stored in a MySQL 

database and can be queried using the publicly available 

web tool Me-PaMuFind-It: 

http://me-pamufind-it.org/ 
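The two filtering steps described above (keeping protein coding variants by sequence ontology term, then collecting variants that flank a targeted gene) can be sketched as follows. This is an illustration only, not the Me-PaMuFind-It implementation: the record layout, field names and example coordinates are hypothetical, and VCF parsing is left out.

```python
# SO terms listed in the text, written in their underscore form
PROTEIN_CODING_SO_TERMS = {
    "stop_gained", "stop_lost", "inframe_insertion", "inframe_deletion",
    "frameshift_variant", "splice_donor_variant", "splice_acceptor_variant",
    "coding_sequence_variant",
}
MBP = 1_000_000


def protein_coding_variants(variants):
    """Keep variants whose consequence is one of the listed SO terms."""
    return [v for v in variants if v["consequence"] in PROTEIN_CODING_SO_TERMS]


def flanking_passenger_variants(target, variants, flank_mbp):
    """Variants on the target's chromosome within flank_mbp of the gene."""
    lo = target["start"] - flank_mbp * MBP
    hi = target["end"] + flank_mbp * MBP
    return [
        v for v in variants
        if v["chrom"] == target["chrom"] and lo <= v["pos"] <= hi
    ]


# Hypothetical targeted gene and variants
target = {"chrom": "9", "start": 50_000_000, "end": 50_020_000}
variants = [
    {"chrom": "9", "pos": 51_500_000, "consequence": "stop_gained"},
    {"chrom": "9", "pos": 75_000_000, "consequence": "frameshift_variant"},
    {"chrom": "9", "pos": 49_000_000, "consequence": "intron_variant"},
    {"chrom": "4", "pos": 50_010_000, "consequence": "stop_lost"},
]
coding = protein_coding_variants(variants)
for flank in (2, 10, 20):
    print(flank, len(flanking_passenger_variants(target, coding, flank)))
```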

Passenger genome mutations in gene-targeted mice (Nechanitzky and 

Mak, 2015) 


RESULTS

The vast majority of existing and well-characterized

genetically engineered congenic mice have been created 

using 129 ESCs. 99.5% of these mouse lines are affected 

by a median number of 20 passenger mutations within a 

10 cM flanking region. This implies that nearly all 

genetically modified congenic mice contain multiple 

passenger mutations despite intensive backcrossing. 

Consequently, the phenotypes observed in these mice 

might be due to flanking passenger mutations rather than a 

defect in the targeted gene (Vanden Berghe et al, 2015). 

REFERENCES 

Keane, T.M., Goodstadt, L., Danecek, P., White, M.A., Wong, K., Yalcin, 

B., Heger, A., Agam, A., Slater, G., Goodson, M., et al. (2011). 

Mouse genomic variation and its effect on phenotypes and gene 

regulation. Nature 477, 289–294. 

Nechanitzky R, Mak TW (2015). Passenger Mutations Identified in the 

Blink of an Eye. Immunity 43(1), 9-11. 

Vanden Berghe, T., Hulpiau, P., Martens, L. et al (2015). Passenger 

Mutations Confound Interpretation of All Genetically Modified 

Congenic Mice. Immunity 43(1), 200-9. 



Poster


P27. DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION 

AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA 

Arlin Keo 1 & Thomas Abeel 1,2,* . 

Delft Bioinformatics Lab, Delft University of Technology , Delft, the Netherlands 1 ; Broad Institute of MIT 

and Harvard, Cambridge, MA, USA 2 . * t.abeel@tudelft.nl

Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis and infects millions of people. When a 

person is infected with more than one distinct strain type of tuberculosis (TB), referred to as mixed infection, diagnosis 

and treatment are complicated. Because diagnosis is difficult, the prevalence of mixed infections among TB patients

remains uncertain. Whole genome sequencing (WGS) yields a great number of single nucleotide polymorphisms (SNPs)

and offers increased resolution to distinguish distinct strains. Here, we present a tool that maps sample reads against 21 

bp cluster specific SNP markers to detect putative mixed infections and estimate the frequencies of the present 

subpopulations. 

INTRODUCTION 

Mycobacterium tuberculosis is a clonal, bacterial pathogen 

that causes the pulmonary disease tuberculosis (TB), and it 

infects and kills millions of people worldwide [1]. The 

study of genetic diversity within the M. tuberculosis 

complex (MTBC) is complicated by mixed TB infections, 

which occur when a person is infected with more than

one distinct strain type of MTBC. This often results in 

poor diagnosis and treatment of patients as the bacterial 

subpopulation may have undetected differences in drug 

susceptibility [2]. A strain typing method should be able to 

distinguish closely related strains, to also allow the 

detection of a mixed infection at finer resolutions [3]. This 

study aims to detect a possible mixed TB infection at 

different levels in MTBC and to determine the frequencies 

of the present strains based on established tree paths in the 

MTBC phylogenetic tree. 

METHODS 

A global comprehensive dataset of 5992 MTBC strains 

was used for analysis, and 226570 SNPs were extracted 

from this set to construct a SNP-based phylogenetic tree 

with RAxML. In this bifurcating tree, each branch 

represents a cluster of strains and splits into two new 

monophyletic subclusters of genetically more closely 

related strains. These “splits” were used to define clusters

and subclusters that contain more than 10 strains. Global SNP

association was done for each cluster to obtain cluster-specific

SNPs, i.e. those for which the true positive rate, true

negative rate, positive predictive value, and negative 

predictive value were >0.95. Markers were generated from 

these SNPs by extending them with 10 bp sequence on 

each side based on reference genome H37Rv. Each 

hierarchical cluster now has a set of specific SNP markers. 

By mapping sample reads against these 21 bp cluster-specific

SNP markers the tool determines the presence of

paths in the phylogenetic tree that start at the MTBC root 

node. Paths that split indicate the presence of multiple 

strains and thus a mixed infection. 
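The marker definition can be sketched in two small functions: one applies the four >0.95 criteria to a SNP, the other builds the 21 bp marker from 10 bp of reference sequence on each side. This is a minimal sketch under stated assumptions, not the authors' tool. The counts are assumed to be strains inside the cluster with (tp) or without (fn) the SNP and strains outside the cluster with (fp) or without (tn) it, and the function names and the toy reference string are hypothetical.

```python
def is_cluster_specific(tp, fp, tn, fn, threshold=0.95):
    """True if TPR, TNR, PPV and NPV all exceed the threshold."""
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        return False  # a rate is undefined, so the SNP is not usable
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return min(tpr, tnr, ppv, npv) > threshold


def make_marker(reference, snp_index, allele, flank=10):
    """Marker of length 2*flank+1 centred on a 0-based SNP position."""
    if snp_index < flank or snp_index + flank >= len(reference):
        raise ValueError("SNP too close to the end of the reference")
    left = reference[snp_index - flank:snp_index]
    right = reference[snp_index + 1:snp_index + 1 + flank]
    return left + allele + right


# Toy reference; a real run would use the H37Rv genome sequence
reference = "ACGT" * 10
marker = make_marker(reference, 20, "T")
print(marker, len(marker))
```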

The read depth at the root node represents a frequency of 1 

of the present MTBC species. If the path splits further in 

the tree, the total read depth is divided over the two 

subpaths and determines the frequencies of those present 

subclusters (Figure 1). 
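The frequency estimate can be sketched as a walk down the cluster tree in which each detected child receives a share of its parent's frequency proportional to its read depth. The abstract does not give the exact formula, so this proportional split is an assumption, as are the function name, the dictionary layout and the cluster labels.

```python
def subcluster_frequencies(children, depth, root="root"):
    """Propagate a frequency of 1 at the root down the detected paths.

    children: dict node -> list of child nodes in the phylogenetic tree.
    depth: dict node -> read depth on that node's markers (0 = not detected).
    """
    freqs = {root: 1.0}
    stack = [root]
    while stack:
        node = stack.pop()
        present = [c for c in children.get(node, []) if depth.get(c, 0) > 0]
        total = sum(depth[c] for c in present)
        for child in present:
            freqs[child] = freqs[node] * depth[child] / total
            stack.append(child)
    return freqs


# Hypothetical sample: the path splits at the root (mixed infection)
children = {"root": ["L2", "L4"], "L4": ["L4.1", "L4.2"]}
depth = {"L2": 30, "L4": 70, "L4.1": 70, "L4.2": 0}
freqs = subcluster_frequencies(children, depth)
print(freqs)
```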

FIGURE 1. Detection of mixed TB infection with hierarchical clusters. 

The detected strains are combined with detected drug 

susceptibility profiles. A minimized reference genome 

consisting of drug resistance genes and 1000 bp flanking 

regions is used to map sample reads with BWA, and call 

variants with Pilon. Ambiguous variation calls may 

indicate that present strains in a mixed infection sample 

also have differences in drug susceptibility. 


RESULTS

In the phylogenetic tree 308 clusters (MTBC root

excluded) were defined and there are 14823 SNP markers 

in total that are specific to a cluster and unique within the 

cluster. The known MTBC lineages 1 to 6 each have between

355 and 614 markers.

Of the 7661 TB samples tested, the present strain(s) and

their frequencies could be predicted for 7495 samples, of which

914 (~12%) are mixed infections (Table 1).

# of subpopulations | 1 | 2 | 3 | >3

# of samples | 6581 | 798 | 95 | 21

TABLE 1. Number of detected subpopulations per sample; 914 of 7495 samples are mixed infections.

REFERENCES 

1. World Health Organization. Global Tuberculosis Report. World 

Health Organization, Geneva, Switzerland, 2014. 

2. Zetola et al. Mixed Mycobacterium tuberculosis complex infections 

and false-negative results for rifampicin resistance by GeneXpert 

MTB/RIF are associated with poor clinical outcomes. Journal of

Clinical Microbiology, 52:2422–2429, 2014.

3. G. Plazzotta, T. Cohen, and C. Colijn. Magnitude and sources of bias 

in the detection of mixed strain M. tuberculosis infection. Journal of 

theoretical biology, 368:67–73, 2015. 




Poster 


P28. APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO 

CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG- 

INDUCED LIVER INJURY 

Julian Krauskopf 1* , Florian Caiment 1 , Sandra Claessen 1 , Kent J. Johnson 2 , Roscoe L. Warner 2 , Shelli J. Schomaker 3 , 

Deborah A. Burt 3 , Jiri Aubrecht 3 , Jos C. Kleinjans 1 . 

Department of Toxicogenomics, Maastricht University, Maastricht 6200 MD, The Netherlands 1 ; Pathology Department, 

University of Michigan, Ann Arbor, MI 48109, USA 2 ; Drug Safety Research and Development, Pfizer, Inc., Groton, CT 

06340, USA 3 . *j.krauskopf@maastrichtuniversity.nl

Drug-induced liver-injury (DILI) is a leading cause of acute liver failure and the major reason for withdrawal of drugs 

from the market. Preclinical evaluation of drug candidates has failed to detect about 40% of potentially hepatotoxic 

compounds in humans. At the onset of liver injury in humans, currently used biomarkers have difficulty differentiating 

severe DILI from mild DILI and/or predicting the outcome of injury for individual subjects. Therefore, new biomarker

approaches for predicting and diagnosing DILI in humans are urgently needed. Recently, circulating microRNAs 

(miRNAs) such as miR-122 and miR-192 have emerged as promising biomarkers of liver injury in preclinical species 

and in DILI patients. In this study, we focused on examining global circulating miRNA profiles in serum samples from 

subjects with liver injury caused by accidental acetaminophen (APAP)-overdose. Upon applying next-generation high-throughput

sequencing of small RNA libraries, we identified 36 miRNAs, including three novel miRNA-like small 

nuclear RNAs, which were enriched in serum of APAP overdosed subjects. The set comprised miRNAs that are 

functionally associated with liver-specific biological processes and relevant to APAP toxic mechanisms. Although more 

patients need to be investigated, our study suggests that profiles of circulating miRNAs in human serum might provide 

additional biomarker candidates and possibly mechanistic information relevant to liver injury. 




Poster 


P29. INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION 

Ajay Anand Kumar 1,2 * , Geert Vandeweyer 1,2 , Lut Van Laer 1,2 & Bart Loeys 1,2 . 

Department of Medical Genetics, University of Antwerp 1 ; Biomedical informatics, Antwerp University Hospital 2 . 

*ajay.kumar@uantwerpen.be 

The identification of top candidate genes involved in human diseases from a list of candidate genes remains 

computationally challenging. Many tools exist for this computational prioritization, of which the core typically utilizes 

fusion or integration of various genomic annotation data sources. However, due to the rapid generation of novel data from

high-throughput experiments, annotation sources often become outdated, leading to annotation errors. Hence, predictions

based on these computational tools may be unreliable. To tackle this, we propose an information-theoretic model that

effectively fuses annotation sources with a regression model under a Bayesian framework to prioritize candidate genes. Our

method is fast and outperforms four existing tools on their own benchmark dataset.

INTRODUCTION 

Gene prioritization has become a central research problem in the bioinformatics domain. With the advent of exome sequencing in clinical genetics, it has become necessary to automate the identification of the genes most likely involved in a disease from a given pool of affected genes. Various annotation sources can be integrated or fused to learn the multiple functions of genes and then design a classifier or regressor for prioritization. We propose here an early data integration method that implements an information retrieval model to fuse the data at the functional feature level and then builds a discriminative regression model in a Bayesian framework to prioritize candidate genes.

METHODS 

The principle behind our approach is guilt-by-association: genes that are known to be associated with a disease might also share similar functions. The idea is that a classifier or regressor can be trained on the linear mapping between the functional proximity profiles of genes and their phenotypic proximity profiles. We implemented a Bayesian regressor to infer the degree of association of the test genes with the query disease. The workflow is shown in Figure 1. The details are:

1. Functional annotation: text, ontologies (GO, MPO), sequence similarity, pathways, interactions. Phenotype annotation: Human Phenotype Ontology (HPO), Disease Ontology (DO), HuGE/MeSH terms and GAD.

2. The TF-IDF (term frequency – inverse document frequency) methodology is used to assign statistical weights to the functional attributes of genes from these annotation sources. TF-IDF is a data-driven model traditionally used for information retrieval; we apply the same methodology for weighting features. Together, this gives gene-by-gene functional and phenotypic proximity profiles.

3. Finally, for a given set of query-disease (training) genes, the Bayesian linear regression model learns the linear mapping between the functional and phenotypic proximity profiles: Y = βX + η, where η is Gaussian distributed. We have incorporated traditional non-informative Normal-Inverse-Gamma (NIG) priors for estimating the unknowns, namely β and σ.
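The TF-IDF weighting of step 2 can be sketched by treating each gene as a "document" whose "words" are its annotation terms. This is a minimal sketch, not the authors' implementation: the abstract does not state how proximity between weighted profiles is measured, so cosine similarity is an assumption here, and the function names, gene names and terms are hypothetical.

```python
import math
from collections import Counter


def tfidf_profiles(annotations):
    """TF-IDF weight of every annotation term for every gene.

    annotations: dict gene -> list of annotation terms (repeats allowed).
    """
    n_genes = len(annotations)
    doc_freq = Counter()
    for terms in annotations.values():
        doc_freq.update(set(terms))
    profiles = {}
    for gene, terms in annotations.items():
        counts = Counter(terms)
        total = sum(counts.values())
        profiles[gene] = {
            term: (count / total) * math.log(n_genes / doc_freq[term])
            for term, count in counts.items()
        }
    return profiles


def cosine(p, q):
    """Cosine similarity of two sparse weight profiles (assumed proximity)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical annotations for three genes
annotations = {
    "geneA": ["binding", "kinase"],
    "geneB": ["binding", "transport"],
    "geneC": ["binding", "kinase"],
}
profiles = tfidf_profiles(annotations)
print(cosine(profiles["geneA"], profiles["geneC"]))
print(cosine(profiles["geneA"], profiles["geneB"]))
```

A term shared by every gene, such as "binding" in this toy input, receives a weight of zero, so it does not contribute to the proximity between genes.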


RESULTS

We performed a leave-one-out cross-validation experiment

on the benchmark data set that was used to compare four other tools whose design principles are similar to those of our method [1]. Our dataset consisted of 1040 disease genes categorized under 12 manually curated disease classes [2]. In our preliminary results for 1154 prioritizations, under cut-offs of the top 5%, 10% and 30% of genes ranked in a random control dataset, we achieved an AUROC of 86.31% against the best score of 83.0% achieved by those tools. This indicates that our method compares favorably with the other tools in the comparative analysis.

FIGURE 1. Workflow of Bayesian regression model for gene 

prioritization. 

Currently, we are conducting a large-scale cross-validation with 6762 manually curated disease-gene associations and a larger number of tools and benchmark datasets [3]. Additionally, we plan to develop a probabilistic generative approach to model co-occurrences and dependencies of features for effective data fusion, which can help in finding novel disease-causing genes.

REFERENCES 

1. Chen B et al. BMC Med Genomics. 2015;8 Suppl 3:S2

2. Goh et al. Proc Natl Acad Sci USA 2007, 104(21):8685-8690

3. Börnigen, Daniela, et al. Bioinformatics 28.23 (2012): 3081-3088. 




Poster 


P30. GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS 

FROM GENE EXPRESSION DATA 

Griet Laenen 1,2,* , Amin Ardeshirdavani 1,2 , Yves Moreau 1,2 & Lieven Thorrez 1,3 . 

Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, 

KU Leuven 1 ; iMinds Medical IT Dept., KU Leuven 2 ; Dept. of Development and Regeneration @ Kulak, KU Leuven 3 . 

* griet.laenen@esat.kuleuven.be 

Galahad (https://galahad.esat.kuleuven.be) is a web-based application for the analysis of gene expression data from drug 

treatment versus control experiments, aimed at predicting a drug’s molecular targets and biological effects. Galahad 

provides data quality assessment and exploratory analysis, as well as computation of differential expression. Based on 

the obtained differential expression values, drug target prioritization and both pathway and disease enrichment can be 

calculated and visualized. Drug target prioritization is based on the integration of the gene expression data with a 

functional protein association network. 

INTRODUCTION 

Gene expression analysis is frequently employed to study 

the effects of drug compounds on cells. The observed 

transcriptional patterns can provide valuable information 

for identifying compound–protein interactions as well as

resulting biological effects. To facilitate the analysis of 

this particular data type and enable an in-depth exploration 

of a drug’s mode of effect, we have developed Galahad [1].

INPUT 

The main input for Galahad is raw Affymetrix human,

mouse or rat DNA microarray data derived from both 

untreated control samples and samples treated with a drug 

of interest. In addition, Galahad provides the possibility to 

start from differential expression data derived with other 

platforms to perform drug target prioritization and 

enrichment analysis. 

METHODS 

The different analyses are depicted in Figure 1 and 

include: 

• preprocessing of the raw data with RMA or MAS5.0, as indicated by the user;

• quality assessment and exploratory analysis to ascertain data quality, uncover experimental issues, and help in deciding whether certain arrays need to be considered as outlying;

• differential expression analysis to determine the significance of gene up- and downregulation following drug treatment;

• genome-wide drug target prioritization by means of an in-house developed algorithm for network neighborhood analysis integrating the expression data with functional protein association information [2];

• prediction of molecular pathways involved in the drug’s mode of effect;

• identification of associated disease phenotypes enabling side effect prediction and drug repositioning.
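The pathway and disease enrichment steps report a P-value per gene set. The abstract does not state which statistic Galahad uses; a common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched here purely as an illustration. The function name, parameter names and example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution.

    n_genome: number of genes measured.
    n_in_set: genes annotated to the pathway or disease.
    n_selected: differentially expressed genes.
    n_overlap: differentially expressed genes that are in the set.
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_genome, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_genome - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 40 of 200 differentially expressed genes fall in a
# 300-gene pathway, out of 10000 measured genes
print(enrichment_p_value(10000, 300, 200, 40))
```

Because many gene sets are tested at once, such P-values would normally be corrected for multiple testing before interpretation.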

OUTPUT 

The output is displayed in a series of tabs corresponding to 

the different analyses selected by the user: 

• in the Quality Control and Data Exploration tabs, several diagnostic plots are displayed along with a short explanation;

• the Differential Expression tab contains a sorted table listing all genes together with their log2 ratios and P-values for differential expression, as well as links to the corresponding GeneCards sections;

• in the Drug Target Prioritization tab, a ranked list of genes as potential targets of the drug can be found, together with the network diffusion-based scores and P-values for prioritization, and links to the corresponding GeneCards section; in addition, a network-based visualization is available for each gene, showing the 10 interaction partners contributing most to the gene’s ranking;

• the tabs summarizing the results for Pathway and Disease Enrichment contain a sorted table with pathway or disease ontology IDs, names, and database links, together with the number of differentially expressed genes in the corresponding gene sets and the accompanying P-values; in addition, network graphs are available, consisting of the top 10 most significant pathways or disease phenotypes, along with their associated genes colored according to fold change.

FIGURE 1. Overview of the Galahad analysis steps. 

REFERENCES 

1. Laenen G. et al. Nucl Acids Res 43, W208-W212 (2015). 

2. Laenen G. et al. Mol BioSyst 9, 1676-1685 (2013). 





P31. KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT 

FOR INTRINSICALLY DISORDERED PROTEINS 

Joanna Lange 1,2 , Lucjan S Wyrwicz 1 & Gert Vriend 2* . 

Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center; 

Institute of Oncology 1 , CMBI, Radboud University Nijmegen 2 . * vriend@cmbi.ru.nl 

INTRODUCTION 

Intrinsically disordered proteins (IDPs) lack tertiary 

structure and thus differ from globular proteins in terms of 

their sequence – structure – function relations. IDPs have a 

lower sequence conservation, different types of active 

sites, and a different distribution of functionally important 

regions, which altogether makes their multiple sequence 

alignment (MSA) difficult. 

Algorithms underlying existing MSA programs are 

directly or indirectly based on knowledge obtained from 

studying three-dimensional protein structures. Here we

introduce a tool for Knowledge based Multiple sequence 

Alignment for intrinsically Disordered proteins, KMAD, 

that incorporates SLiM, domain, and PTM annotations to 

improve the alignments. 

The KMAD web server is accessible at

http://www.cmbi.ru.nl/kmad/. A standalone version is 

freely available. 

METHODS 

A dataset of proteins experimentally proven to be disordered

was obtained from DisProt (Sickmeier et al., 2007). For 

each IDP all homologous sequences were extracted from 

SwissProt (The Uniprot Consortium, 2014) using BLAST. 

The sequence sets were aligned with several MSA tools. 

Apart from manual validation we also performed a 

benchmark validation on reference sets from BAliBASE 

(Thompson et al., 2005) and PREFAB, which hold structure-based

'gold standard' sequence alignments. For this

purpose we used KMAD and a modified version of 

KMAD, which performs a ’refinement’ of Clustal Omega 

(Sievers et al., 2011) alignments. 


Manual validation showed that KMAD bypasses many 

mistakes made by Clustal Omega. An example of an 

alignment mistake is shown in Figure 1.

a) Clustal Omega 

b) KMAD 

FIGURE 1. Excerpts from Clustal Omega and KMAD alignments of 

human sialoprotein (SIAL_HUMAN) with four homologues. Different PTM

types are highlighted in bright colours.

In the field of sequence alignment research it is common 

practice to compare the sequence alignments obtained with 

MSA software with those that are obtained from structure 

superpositions. IDPs do not possess a static 3D structure,

so this method is not applicable to KMAD alignments.

Both of the validation methods that we used have their 

disadvantages, but so far there is no alternative. Validation 

on benchmark alignments of structured proteins is biased 

towards Clustal Omega, because it was optimized to work 

with structured proteins. On the other hand, the manual 

inspection based on the same features that influence the 

alignment is not a very elegant method, but given the 

nature of IDPs it is probably the best we can do.

REFERENCES 

Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high 

accuracy and high throughput. Nucleic Acids Research, 32(5), 1792– 

1797. 

Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., 

Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.

D., and Higgins, D. G. (2011). Fast, scalable generation of high-quality

protein multiple sequence alignments using Clustal Omega.

Molecular Systems Biology, 7(539), 539.

Sickmeier, M., Hamilton, J. A., LeGall, T., Vacic, V., Cortese, M. S.,

Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V. N.,

Obradovic, Z., and Dunker, A. K. (2007). DisProt: the Database of

Disordered Proteins. Nucleic Acids Research, 35(Database issue), 

D786–93. 

The Uniprot Consortium (2014). Activities at the Universal Protein 

Resource (UniProt). Nucleic Acids Research, 42(Database issue), 

D191–8. 

Thompson, J. D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE 

3.0: latest developments of the multiple sequence alignment 

benchmark. Proteins: Structure, Function, and Bioinformatics, 

61(1), 127–136. 




Poster 


P32. ON THE LZ DISTANCE FOR DEREPLICATING 

REDUNDANT PROKARYOTIC GENOMES 

Raphaël R. Léonard 1,2* , Damien Sirjacobs², Eric Sauvage 1 , Frédéric Kerff 1 & Denis Baurain². 

Centre for Protein Engineering, University of Liège 1 ; PhytoSYSTEMS, University of Liège 2 . * rleonard@doct.ulg.ac.be 

The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem 

when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of 

the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to 

single species, such as Escherichia coli (almost 2000 genomes as of Sept 2015), while the continuous flow of newly 

discovered phyla calls for regular updates. This situation makes it difficult to maintain sets of representative genomes

combining lesser-known phyla, for which only few species are available, and sound subsets of highly abundant phyla. A

straightforward automated method is required, but none is publicly available. The LZ distance, in conjunction with the

quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes 

without redundancy. We are planning to release this tool on a website that will be made publicly available. 

INTRODUCTION 

The LZ distance (Ziv and Lempel, 1977; Otu and Sayood,

2003) is inspired by compression algorithms, such as gzip 

or WinRAR. This distance, amongst others, has already 

been used in attempts to produce alignment-free 

phylogenetic trees (Bacha and Baurain, 2005; Höhl and Ragan,

2007), though the results were disappointing in such a

context (due to the heterogeneity of the substitution 

process at large evolutionary scales). However, the LZ 

distance is likely to provide enough resolving power to 

identify groups of redundant genomes and to keep only 

one representative for each group. 

METHODS 

For each pair of genomes A and B, the LZ distance is 

computed from the gzip-compressed file lengths of the 

corresponding nucleotide assemblies s(A) and s(B) and of 

their concatenations s(A+B) and s(B+A). These distances, 

along with taxonomic information, are stored in a 

database. 
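As an illustration of how such a compression-based distance can be computed, the minimal Python sketch below uses zlib (the DEFLATE algorithm that also underlies gzip). The abstract does not state the exact formula, so the normalisation used here (a normalised compression distance, symmetrised over both concatenation orders) is an assumption, as are the function names and the toy sequences. DEFLATE only looks back 32 KB, so this toy setting with short sequences is not representative of whole genomes.

```python
import random
import zlib


def compressed_length(seq: str) -> int:
    """Length in bytes of the DEFLATE-compressed sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def compression_distance(a: str, b: str) -> float:
    """Assumed normalisation (not taken from the abstract):
    (min(s(A+B), s(B+A)) - min(s(A), s(B))) / max(s(A), s(B))."""
    s_a, s_b = compressed_length(a), compressed_length(b)
    s_ab, s_ba = compressed_length(a + b), compressed_length(b + a)
    return (min(s_ab, s_ba) - min(s_a, s_b)) / max(s_a, s_b)


# Hypothetical toy data: a random sequence, a near-copy of it, and an unrelated one.
rng = random.Random(0)
genome_a = "".join(rng.choice("ACGT") for _ in range(2000))
genome_a_variant = genome_a[:1000] + genome_a[1010:]  # same sequence with a 10-nt deletion
genome_b = "".join(rng.choice("ACGT") for _ in range(2000))

d_close = compression_distance(genome_a, genome_a_variant)
d_far = compression_distance(genome_a, genome_b)
print(f"near-copy: {d_close:.3f}  unrelated: {d_far:.3f}")
```

Redundant genomes should yield a small distance because the second copy adds little to the compressed size of the concatenation.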

A clustering method is then applied to group similar

genomes into a user-specified number of groups. For each 

of these groups, a representative is chosen based on the 

quality of the genomic assemblies (chromosomes rather 

than scaffolds) and of the protein annotations (e.g., few 

rather than many “unknown proteins”). 
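The choice of a representative per group could look like the sketch below. The field names (`assembly_level`, `unknown_fraction`), the ranking of assembly levels and the tie-breaking order are hypothetical; the abstract only states that chromosomes are preferred over scaffolds and that few "unknown proteins" are preferred over many.

```python
# Lower rank = better assembly (assumed ordering for illustration).
ASSEMBLY_RANK = {"chromosome": 0, "scaffold": 1, "contig": 2}


def pick_representative(cluster):
    """Return the genome with the best assembly level, then the lowest
    fraction of proteins annotated as unknown, then the smallest name."""
    def quality_key(genome):
        level = ASSEMBLY_RANK.get(genome["assembly_level"], len(ASSEMBLY_RANK))
        return (level, genome["unknown_fraction"], genome["name"])
    return min(cluster, key=quality_key)


# Hypothetical cluster of three redundant genomes.
cluster = [
    {"name": "strain_1", "assembly_level": "scaffold", "unknown_fraction": 0.10},
    {"name": "strain_2", "assembly_level": "chromosome", "unknown_fraction": 0.25},
    {"name": "strain_3", "assembly_level": "chromosome", "unknown_fraction": 0.15},
]
representative = pick_representative(cluster)
print(representative["name"])
```

Here the assembly level takes priority over the annotation quality; a weighted score would be an equally plausible design.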


Our method using the LZ distance is currently under 

development using the genomes from release 28 of

Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/ 

bacteria/release-28/). It contains 20,950 unique 

prokaryotic genomes, composed of 286 Archaea and 

20,664 Bacteria. The three most represented phyla are the 

Proteobacteria (8642, of which 1980 E. coli), the 

Firmicutes (7766) and the Actinobacteria (2673). These 

genomes are already the result of a pre-processing step 

designed to remove extra assemblies for strains present in 

multiple copies (due to parallel sequencing or 

resequencing in different labs). 

We are working on different approaches for validating our 

dereplication method, based on (1) current taxonomy, (2) 

16S rRNA phylogeny, and (3) clustering using genomic 

signatures (Moreno-Hagelsieb et al. 2013). 

First, we compute a central measure of the taxonomic 

“purity” of all genome clusters, which reflects the amount 

of “mixture” at different taxonomic levels (phylum, class, 

order, etc.). A good clustering should group different

genera (or species) without amalgamating distinct classes 

(or phyla). Second, we cut the branches of a large 16S 

rRNA tree based on the same genome collection to 

produce an equal number of groups to compare with our 

clustering method. We then compute a statistic of the 

overlap between the 16S subtrees and the LZ clusters. A 

good clustering should have a reasonable overlap with the 

gold standard that is the 16S rRNA tree. Third, using the 

same overlap metric, we compare the LZ clusters to 

clusters obtained using the genomic signature. 
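The taxonomic "purity" check described first can be sketched as follows. The abstract does not define the measure, so this sketch assumes that the purity of a cluster is the fraction of its genomes carrying the most common taxon at a given rank, and that the central measure is the median over clusters; the data are hypothetical.

```python
from collections import Counter
from statistics import median


def cluster_purity(taxa):
    """Fraction of cluster members that belong to the most common taxon."""
    counts = Counter(taxa)
    return counts.most_common(1)[0][1] / len(taxa)


def median_purity(clusters, rank):
    """Median purity over all clusters at one taxonomic rank."""
    return median(cluster_purity([g[rank] for g in cluster]) for cluster in clusters)


# Hypothetical clusters; each genome carries its taxonomy at two ranks.
clusters = [
    [{"phylum": "Proteobacteria", "genus": "Escherichia"},
     {"phylum": "Proteobacteria", "genus": "Salmonella"}],
    [{"phylum": "Firmicutes", "genus": "Bacillus"},
     {"phylum": "Firmicutes", "genus": "Bacillus"},
     {"phylum": "Actinobacteria", "genus": "Streptomyces"},
     {"phylum": "Firmicutes", "genus": "Listeria"}],
    [{"phylum": "Actinobacteria", "genus": "Streptomyces"}],
]
for rank in ("phylum", "genus"):
    print(rank, median_purity(clusters, rank))
```

Under this definition, a good clustering would show high purity at the phylum and class levels while tolerating lower purity at the genus or species level.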

Finally, an interactive tool will be made available through 

a website. It will allow users to download precomputed

sets of representative genomes for either the 

complete database or for taxonomic subsets. We are also 

planning to allow users to upload their own genomes to 

cluster them with the LZ method. 

REFERENCES 

Ziv, J. and A. Lempel. 1977. ‘A Universal Algorithm for Sequential Data

Compression.’ IEEE Transactions on Information Theory 23.3. 

doi:10.1109/TIT.1977.1055714. 

Otu, H. H. and K. Sayood. 2003. ‘A New Sequence Distance Measure for 

Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130. 

doi:10.1093/bioinformatics/btg295. 

Moreno-Hagelsieb, G., Z. Wang, S. Walsh and A. Elsherbiny. 2013. 

‘Phylogenomic Clustering for Selecting Non-Redundant Genomes 

for Comparative Genomics.’ Bioinformatics 29.7: 947–949.

doi:10.1093/bioinformatics/btt064. 

Höhl, M. and M. A. Ragan. 2007. ‘Is Multiple-Sequence Alignment

Required for Accurate Inference of Phylogeny?’ Systematic Biology

56.2: 206–221. doi:10.1080/10635150701294741. 

Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity 

to alignment-free sequence comparison of protein families’. 

Benelux Bioinformatics Conference 2005. 

http://hdl.handle.net/2268/80179 




Poster 


P33. THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE 

Ashley Lu 1,2* , Annerieke Sierksma 1,2 , Bart De Strooper 1,2 & Mark Fiers 1,2 . 

VIB Center for the Biology of Disease 1 ; KU Leuven Center for Human Genetics 2 . * ashley.lu@cme.vib-kuleuven.be 

MicroRNAs (miRNAs) play an important role in post-transcriptional regulation and have been shown to be dysregulated in

Alzheimer’s disease. By analysing the hippocampal miRNA and mRNA expression of two mouse models of Alzheimer’s 

disease, we identify a set of miRNAs that are dysregulated with the onset of cognitive impairments. Using GO 

enrichment analysis we aim to identify miRNAs that likely play a role in learning and memory. 

INTRODUCTION 

MiRNAs are small non-coding RNAs involved in post-transcriptional

regulation through mRNA inhibition or

degradation. Past studies have suggested that miRNAs play

a direct role in Alzheimer’s disease (AD), e.g. by

modulating the expression of genes involved in the 

formation of neuropathological protein aggregates (Lau P 

& De Strooper B, 2010). In this study, we investigated the 

changes in miRNA and mRNA expression in two AD 

mouse models: APPswe/PS1 L166P (Radde R, 2006) and 

Thy-Tau22 (Schindowski K, 2006), which have similar 

patterns of cognitive impairment, but different pathology. 

We aim to better understand the functional role of 

miRNAs in AD-related cognitive impairments. 

METHODS 

RNA was extracted from the left hippocampus of 96 mice. 

The experiment covers the two models (APPswe/PS1 L166P 

& Thy-Tau22), with wild type controls for each. All 

genotypes were tested at two ages (4 and 10 months), i.e., before

and after the onset of cognitive impairment. This yields eight

experimental groups with twelve mice each. 

Expression profiles of miRNAs and mRNAs were 

generated using Illumina single-end sequencing. 

Differential Expression (DE) analysis was performed 

using the limma package of R/Bioconductor with a linear 

model to test the effects of age, genotype and their 

interaction. 
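To make the interaction term concrete: in a two-by-two design (genotype by age), the age*genotype interaction for one gene is the difference between the age effect in transgenic mice and the age effect in wild-type mice. The sketch below computes this contrast from group means of hypothetical log-expression values; it is not the limma implementation, which additionally fits the model per gene and moderates the variance estimates.

```python
from statistics import mean


def interaction_effect(expr):
    """(transgenic 10m - transgenic 4m) - (wild type 10m - wild type 4m)."""
    tg_change = mean(expr[("transgenic", "10m")]) - mean(expr[("transgenic", "4m")])
    wt_change = mean(expr[("wildtype", "10m")]) - mean(expr[("wildtype", "4m")])
    return tg_change - wt_change


# Hypothetical log2 expression values of one gene in the four groups of one model.
expr = {
    ("wildtype", "4m"): [8.0, 8.2, 7.8],
    ("wildtype", "10m"): [8.1, 8.3, 7.9],
    ("transgenic", "4m"): [8.0, 8.1, 7.9],
    ("transgenic", "10m"): [9.0, 9.2, 8.8],
}
effect = interaction_effect(expr)
print(round(effect, 3))
```

A value near zero means the gene changes with age in the same way in both genotypes; a large absolute value indicates a genotype-specific age effect.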

Functional analyses of the mRNAs and miRNAs were

conducted separately. For mRNAs, gene ontology analysis

was applied to sets of the most up- and down-regulated

genes. 

To determine the functional impact of dysregulated 

miRNAs we determined which mRNAs are the most likely 

direct targets of each miRNA using the following 

approach: 1) For each miRNA we calculated Pearson’s

correlation coefficient with each mRNA based on the

miRNA and mRNA expression data. 2) For each miRNA 

we extracted the predicted set of targets from Targetscan 

(Lewis BP & Burge CB & Bartel DP, 2005), with Diana 

(Maragkakis M et al. 2011) as backup when Targetscan 

had no record. 3) We filtered the miRNA target genes by 

determining the leading edge set in a GSEA PreRanked 

analysis (Subramanian A. et al, 2005) using the predicted 

target mRNAs of each miRNA against the mRNAs ranked 

according to the Pearson correlation coefficients generated in step 1. We

additionally investigated target sets based on Pearson

correlation coefficient cut-offs of -0.2, -0.3, and -0.4. 4)

Gene-ontology analysis was then applied to these 

candidate target sets to infer the likely biological function 

of each miRNA. 
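The cut-off variant of this approach (steps 1 and 2 followed by a correlation threshold) can be sketched as below. The gene names, expression values and function names are hypothetical, and the GSEA leading-edge filter of step 3 is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def candidate_targets(mirna_expr, mrna_expr, predicted_targets, cutoff=-0.3):
    """Keep predicted targets whose expression is anti-correlated with the miRNA."""
    kept = {}
    for gene in predicted_targets:
        if gene in mrna_expr:
            r = pearson(mirna_expr, mrna_expr[gene])
            if r <= cutoff:
                kept[gene] = r
    return kept


# Hypothetical expression profiles across four samples.
mirna_expr = [1.0, 2.0, 3.0, 4.0]
mrna_expr = {
    "GeneA": [4.0, 3.0, 2.0, 1.0],  # anti-correlated, predicted target
    "GeneB": [1.0, 2.0, 3.0, 4.0],  # positively correlated, predicted target
    "GeneC": [4.0, 3.1, 2.2, 0.9],  # anti-correlated, but not a predicted target
}
targets = candidate_targets(mirna_expr, mrna_expr, {"GeneA", "GeneB"})
print(sorted(targets))
```

Requiring both a sequence-based prediction and a negative correlation is what narrows the target set before gene-ontology analysis.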


DE analysis showed that the direction of expression level

changes in mRNAs is similar between APPswe/PS1 L166P

and Thy-Tau22 in terms of age*genotype interaction

effects. However, for the miRNAs the expression pattern

is less clear. Overall, the effect size is more pronounced

in APPswe/PS1 L166P mice than in Thy-Tau22 mice for both

miRNAs and mRNAs.

Functional analyses of the down-regulated mRNAs show a 

clear enrichment in cognition and neural development 

related categories, whereas up-regulated genes show a 

clear inflammatory signature. 

Combining miRNA target prediction with miRNA/mRNA 

correlation analysis shows a marked increase of GO 

enrichment scores. This analysis strongly suggests a 

regulatory role for miRNAs in the down-regulation of

genes involved in learning, cognition and related 

categories. 

This analysis workflow has allowed us to focus on a list of

miRNAs that likely play a direct role in the observed

learning and memory deficits in AD mouse models, and

has been used to select candidate miRNAs for

downstream in vivo experiments, which will hopefully

provide a deeper understanding of the impact of AD on

learning and cognition.

REFERENCES 

Lau P & De Strooper B. Seminars in Cell & Developmental Biology, 

21(7), 768–773, (2010). 

Radde R. EMBO reports, 7(9), 940–946, (2006). 

Schindowski K. The American Journal of Pathology, 169(2), 599–616,

(2006). 

Lewis BP & Burge CB & Bartel DP. Cell, 120, 15–20 (2005).

Maragkakis M et al. Nucleic Acids Research (2011) 

Subramanian A. et al. Proceedings of the National Academy of Sciences 

of the United States of America, 102(43), 15545–15550, (2005) 




Poster 


P34. FUNCTIONAL SUBGRAPH ENRICHMENTS 

FOR NODE SETS IN REGULATORY NETWORKS 

Pieter Meysman 1,2* , Yvan Saeys 3,4 , Ehsan Sabaghian 5,6 , Wout Bittremieux 1,2 , 

Yves van de Peer 5,6 , Bart Goethals 1 & Kris Laukens 1,2 . 


Antwerpen (biomina) 2 ; VIB Inflammation Research Center 3 ; Department of Respiratory Medicine, Ghent University 4 ; 

Department of Plant Biotechnology and Bioinformatics, Ghent University 5 ; Department of Plant Systems Biology, 

VIB/Ghent University 6 . * pieter.meysman@uantwerpen.be 

We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given 

set of nodes. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment 

based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused 

extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it 

on two transcriptional regulatory networks and show that we can find relevant functional subgraphs enriched for the 

selected nodes. 

INTRODUCTION 

Frequent subgraph mining (FSM) is a common but 

complex problem within the data mining field that has 

gained in importance as more graph data has become 

available. However, traditional FSM finds all frequent

subgraphs within the graph dataset, while often a more 

interesting query is to find the subgraphs that are most 

associated with a specific set of nodes. Nodes of interest 

might be those that are associated with a specific disease, 

or those that are differentially expressed in an omics 

experiment. 

METHODS 

To address this issue, we developed a novel subgraph 

mining algorithm that can efficiently construct, match and 

test candidate subgraphs against the given graph for 

enrichment within a specific set of nodes (Meysman et al. 

2015). To allow the enrichment testing, each candidate 

subgraph is built around a ‘source’ node. A subgraph 

match where the source node corresponds to a node of 

interest is counted as a ‘hit’. If the source node is not a 

node of interest, it is counted as a background hit. In this

manner, enrichment can easily be tested

using a hypergeometric test. Furthermore, we show that

this definition of enrichment allows us to drastically prune 

the search space that the algorithm must traverse to find all 

enriched subgraphs. 
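The enrichment test itself can be illustrated with a small sketch. The parameterisation is an assumption for illustration: N is the number of nodes in the graph, K the number of nodes of interest, n the number of source nodes at which the candidate subgraph matches (hits plus background hits), and k the number of hits. The counts in the example are hypothetical, and the search-space pruning of the actual algorithm is not shown.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n), computed exactly."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(p_value, n_tests):
    """Bonferroni correction, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical example: 1000 nodes, 50 of interest; a candidate subgraph
# matches at 40 source nodes, 10 of which are nodes of interest.
p = hypergeom_sf(10, N=1000, K=50, n=40)
p_adj = bonferroni(p, n_tests=200)  # 200 candidate subgraphs tested (assumed)
print(p, p_adj)
```

With only two hits expected by chance in this example, ten hits give a very small probability, which would remain significant after correction.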

An implementation of the algorithm is available at 

http://adrem.ua.ac.be/sigsubgraph. 


The first data set concerned the yeast genes that have 

remained in duplicate following the most recent whole 

genome duplication. Within the yeast transcriptional 

network, we found that these duplicate genes were 

enriched for self-regulating motifs (e.g. feedback loops

and self-edges), which matches the duplicated nature of

these genes (Figure 1). 

FIGURE 1. Enriched subgraphs for yeast duplicated genes 

The second data set concerned mining the subgraphs 

associated with the homologs of the PhoR transcription 

factor across seven different inferred bacterial regulatory 

networks from Colombos expression data (Meysman et al. 

2014). These PhoR homologs were found to be 

significantly associated with several complex regulatory 

motifs. 

REFERENCES 

Meysman P et al. Discovery of Significantly Enriched 

Subgraphs Associated with Selected Vertices in a 

Single Graph. Proceedings of the 14th International 

Workshop on Data Mining in Bioinformatics (2015). 

Meysman P et al. COLOMBOS v2.0: an ever expanding

collection of bacterial expression compendia. Nucleic

Acids Research 42 (D1), D649-D653 (2014).




Category: Poster 


P35. HUMANS DROVE THE INTRODUCTION & SPREAD OF 

MYCOBACTERIUM ULCERANS IN AFRICA 

Koen Vandelannoote 1,2,* , Conor Meehan 1* , Miriam Eddyani 1 , Dissou Affolabi 3 , Delphin Mavinga Phanzu 4 , Sara 

Eyangoh 5 , Kurt Jordaens 6 , Françoise Portaels 1 , Kirstie Mangas 7 , Torsten Seemann 7 , Herwig Leirs 2 , Tim Stinear 7 & 

Bouke C. de Jong 1 . 

Institute of Tropical Medicine, Antwerp, Belgium 1 ; Evolutionary Ecology Group, University of Antwerp, Antwerp, 

Belgium 2 ; Laboratoire de Référence des Mycobactéries, Cotonou, Benin 3 ; Institut Médical Evangélique, Kimpese, 

Democratic Republic of Congo 4 ; Centre Pasteur du Cameroun, Yaoundé, Cameroun 5 ; Joint Experimental Molecular 

Unit, Royal Museum for Central Africa, Tervuren, Belgium 6 ; Department of Microbiology and Immunology, University 

of Melbourne, Melbourne, Australia 7 . *cmeehan@itg.be 

Buruli ulcer (BU) is an insidious neglected tropical disease. BU is reported around the world but the rural regions of 

West and Central Africa are most affected. How BU is transmitted and spreads has remained a mystery, even though the 

causative agent, Mycobacterium ulcerans, has been known for more than 70 years. Here, using the tools of population 

genomics, we reconstruct the evolutionary history of M. ulcerans by comparing 167 isolates spanning 48 years and 

representing 11 endemic countries across Africa. The genetic diversity of African M. ulcerans proved very limited 

because of its slow substitution rate coupled with its recent origin. We show for the first time how M. ulcerans has 

existed in Africa for several hundreds of years but was recently re-introduced during the period of Neo-imperialism. We 

also provide evidence of the role that the so-called “Scramble for Africa” played in the spread of the disease. 

INTRODUCTION 

The clonal population structure of M. ulcerans has meant 

that conventional genetic fingerprinting methods have 

largely failed to differentiate clinical disease isolates, 

complicating molecular analyses aimed at elucidating the

population structure and the evolutionary history of the

pathogen. Whole genome sequencing (WGS) is currently 

replacing conventional genotyping methods for M. 

ulcerans. 

METHODS 

We analyzed a panel of 165 M. ulcerans disease isolates 

originating from disease foci in 11 different African 

countries that had been cultured between 1964 and 2012. 

Index-tagged paired-end sequencing-ready libraries were 

prepared from gDNA extracts. Genome sequencing was 

performed on the Illumina HiSeq 2000 DNA sequencer or 

the Illumina MiSeq sequencing platform with, respectively,

2x150 bp and 2x250 bp paired-end sequencing chemistry.

Read mapping and SNP detection were performed using 

the Snippy v.2.6 pipeline. Bayesian model-based inference 

of the genetic population structure was performed using 

BAPS v.6.0 1 . Evidence for recombination between

different BAPS clusters was assessed using BRAT-NextGen 2 .

We used BEAST2 v2.2.1 3 to date evolutionary

events, determine the substitution rate and produce a time-tree

of African M. ulcerans. A permutation test was used 

to assess the validity of the temporal signal in the data. To 

assess the geospatial distribution of African M. ulcerans 

through time, an additional BEAST2 analysis was 

performed with a discrete BSSVS geospatial model 4 . 


Resulting sequence reads were mapped to the Ghanaian M. 

ulcerans Agy99 reference genome and, after excluding 

mobile repetitive elements and small indels, we detected a 

total of 9,193 SNPs randomly distributed across the M. 

ulcerans chromosome with approximately 1 SNP per 613 

bp (0.15% nucleotide divergence). We explored the 

distribution of DNA chromosomal deletions and identified 

differential genome reduction that strongly supports the 

existence of two specific M. ulcerans lineages within the 

African continent, hereafter referred to as Lineage Africa I 

(Mu_A1) and Lineage Africa II (Mu_A2). Subsequent 

SNP-based exploration of the genetic population structure 

agreed with the above deletion analysis and subdivided the 

African M. ulcerans population into four major clusters. 

BRAT-NextGen did not detect any recombined segments 

in any isolate, supporting a strongly clonal population 

structure for M. ulcerans that is evolving by vertically 

inherited mutations. Within the phylogenetic tree, isolates 

formed tight, shallow-rooted phylogenetic clusters which 

are suggestive of contemporary dispersal. We estimated a 

very slow mean genome-wide substitution rate of 6.32E-8

per site per year. The Bayesian analysis demonstrated that 

Mu_A1 has existed in Africa for several hundreds of years 

and that Mu_A2 was recently introduced on the continent. 

The re-introduction event coincides well with a historical 

event of particular interest: the period of Neo-imperialism 

(1881-1914). Since tMRCA(Mu_A2) did not predate

colonization, it seems very likely that lineage Mu_A2 was

introduced after the instigation of colonial rule through an 

influx of BU infected humans. The time-tree of African M. 

ulcerans also reveals evidence of the likely role that the 

so-called “Scramble for Africa” played in the spread of 

endemic Mu_A1 clones in three hydrological basins 

(Congo, Oueme & Nyong) that are particularly well 

covered by our isolate panel. 

REFERENCES 

1. Corander, J., et al. (2008) BMC bioinformatics. 9: p. 539. 

2. Marttinen, P., et al. (2012) Nucleic acids research. 40(1): p. e6. 

3. Bouckaert, R., et al. (2014) PLoS computational biology. 10(4): p. 

e1003537. 

4. Lemey, P., et al., (2009) PLoS computational biology. 5(9): p. 

e1000520. 




Poster 


P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA 

DETECTION AND CLASSIFICATION IN PLANTS 

Lionel Morgado 1* & Frank Johannes 2,3 . 

Groningen Bioinformatics Centre (GBiC), University of Groningen 1 ; Department of Plant Sciences, Center of Life and 

Food Sciences Weihenstephan, Technical University Munich 2 ; Institute of Advanced Studies, Technical University 

Munich 3 . * lionelmorgado@gmail.com 

Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional 

silencing or the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a 

given sRNA is closely related to the Argonaute protein (AGO) to which it binds. High-throughput sequencing in

combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNAs that are

bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing 

data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our 

SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and 

classification in plants. 

INTRODUCTION 

Small RNA molecules are known to have an important 

role in gene expression control. It is therefore of extreme 

interest to be able to detect them and determine the 

regulatory pathways in which they are involved. With

current laboratory methods it is infeasible to test the large

number of sRNA candidates, but there are computational 

methods that can greatly narrow down the list. 

Nevertheless, sRNA activity is still far from being fully 

understood and that is reflected in the very high false 

positive rate of the prediction tools currently available. 

High-throughput sequencing in combination with

immunoprecipitation (IP) techniques now makes it

possible to access sRNA sequences associated with

specific AGOs. AGO-sRNA binding is a fundamental step

for the activation of specific silencing pathways. Here, 

AGO-sRNA data acquired from A. thaliana is explored 

with SVM-based algorithms to learn which sequence 

features drive different AGO-sRNA associations. Using 

this knowledge, a framework for in silico sRNA detection 

and classification in plants is presented. 

METHODS 

A system with 3 layers of classifiers (see Figure 1) was

designed to identify different kinds of sRNA: the 1st layer

includes a binary SVM model that filters out sequences

that do not bind to AGO and are therefore most probably

inactive; the 2nd layer is composed of an ensemble of binary

classifiers, each trained to explore the differences in sRNA

bound to a specific AGO against all others; and finally, the

3rd layer comprises a multiclass linear model to assign the

best-matching AGO to a given sRNA, using scores produced

in the previous layer.

Diverse AGO-sRNA libraries from A. thaliana were 

explored, namely from AGO: 1, 2, 4, 5, 6, 7, 9 and 10. 

After the typical RNA-seq library preprocessing, quality 

check and genome mapping, several features were 

extracted from the remaining sequences, namely: position 

specific base composition, sequence length, k-mer 

composition and entropy scores. The different feature sets 

were explored separately and in different combinations. 
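The feature types listed above (position-specific base composition, sequence length, k-mer composition and entropy) could be extracted along the following lines. This is a sketch under assumptions: the RNA alphabet, the number of encoded positions, the value of k and the feature names are all chosen for illustration and are not taken from the abstract.

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "ACGU"  # assumed alphabet; mapped reads may use T instead of U


def shannon_entropy(seq):
    """Shannon entropy (bits) of the base composition of a sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())


def srna_features(seq, k=2, n_positions=5):
    """Build a flat feature dictionary for one sRNA sequence."""
    feats = {"length": len(seq), "entropy": shannon_entropy(seq)}
    # One-hot encoding of the base at each of the first n_positions.
    for pos in range(n_positions):
        for base in BASES:
            hit = pos < len(seq) and seq[pos] == base
            feats[f"pos{pos + 1}_{base}"] = 1 if hit else 0
    # Relative frequency of every possible k-mer.
    n_kmers = max(1, len(seq) - k + 1)
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer in map("".join, product(BASES, repeat=k)):
        feats[f"kmer_{kmer}"] = kmers[kmer] / n_kmers
    return feats


features = srna_features("UGACAGAAGAGAGUGAGCAC")  # example 20-nt sequence
print(features["length"], features["pos1_U"], round(features["entropy"], 3))
```

Each sequence then becomes a fixed-length numeric vector, which is the form an SVM requires.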

Initially, highly correlated features (Pearson correlation > 0.75)

were removed, and the remaining ones were further 

subjected to selection using SVM-RFE (Guyon et al., 

2002) with a linear kernel to handle the large data set size. 

A 10-fold cross-validation procedure was executed to

capture the variation in the data; the best features

of each round were determined as the ones with the highest

average weight across the models with the best ROC-AUC

score in each cross-validation subset. In each round, the worst-performing

third of the remaining features was

eliminated, and the process was repeated until no more

features were available. The best features found were then

used to train the final classifiers using RBF kernels with 

optimal parameters. This was repeated for all models in 

layers 1 and 2. 
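The initial removal of highly correlated features can be sketched as a greedy filter. Two choices below are assumptions: the threshold is applied to the absolute Pearson correlation, and features are kept in their given order so that the first member of each correlated group survives. The SVM-RFE step that follows in the text is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_correlated(features, threshold=0.75):
    """Keep a feature only if it is not highly correlated with any kept feature."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[other])) <= threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns measured on four sequences.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.1],   # almost a multiple of f1
    "f3": [1.0, -1.0, 1.0, -1.0],
}
selected = drop_correlated(features)
print(selected)
```

Removing near-duplicate features first keeps the subsequent recursive elimination from splitting weight across redundant columns.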

FIGURE 1. Proposed architecture for the SVM-based framework. Layer 1: AGO vs noAGO. Layer 2: AGO1 vs otherAGO, AGO2 vs otherAGO, …, AGO10 vs otherAGO. Layer 3: final AGO prediction.

Although the classifiers are still being optimized, 

preliminary results from the 2nd layer of the framework

(see Figure 1) show that the features ranked highest by SVM-RFE

indeed reflect significant biological patterns for

AGO-sRNA association. Among others, the relevance of 

the 5’ terminal nucleotide was observed, in agreement 

with findings from previous work (Mi et al., 2008). 

Additionally, the accuracies of the trained models

range from 71% to 86%, showing their

capacity to recognize specific AGO-binding patterns. 

REFERENCES 

Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn

46:389-422 (2002) 

Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the

5’ terminal nucleotide. Cell 133(1): 116-27 (2008).

Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5: 

413, 209-215 (2014). 




Poster 


P37. ANALYSIS OF RELATIONSHIP PATTERNS 

IN UNASSIGNED MS/MS SPECTRA 

Aida Mrzic 1,2* , Wout Bittremieux 1,2 , Trung Nghia Vu 4 , Dirk Valkenborg 3,5,6 , Bart Goethals 1 & Kris Laukens 1,2 . 


Antwerpen (biomina) 2 ; Flemish Institute for Technological Research (VITO), Mol 3 ; Karolinska Institutet, Stockholm 4 ; 

CFP, University of Antwerp 5 ; I-BioStat, Hasselt University 6 . * aida.mrzic@uantwerpen.be 

Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of 

unexplained peaks, despite continuous improvements in search engines. Here we use a pattern mining technique to determine

the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed 

post-translational modifications (PTMs). 

INTRODUCTION 

Despite being a rich source of information, mass

spectra acquired in mass spectrometry proteomics 

experiments often contain a significant number of 

unexplained peaks, or even remain completely 

unidentified. The unexplained fraction of mass spectra 

may come from low-quality or chimeric MS/MS spectra, 

or unexpected PTMs. To interpret the unexplained data, 

we propose a structured analysis of the peaks occurring in 

MS/MS spectra. We employ an unsupervised pattern 

mining technique (Naulaerts et al., 2015) to discover

which peaks are associated with each other, and therefore 

are likely to have a common origin. 

METHODS 

Frequent itemset mining 

The technique we used to discover relationships between 

frequently co-occurring peaks in MS/MS data is frequent 

itemset mining, a class of data mining techniques that is 

specifically designed to discover co-occurring items in 

transactional datasets. The typical example of frequent 

itemset mining is the discovery of sets of products that are 

frequently bought together. Here, every set of products 

purchased together represents a single transaction, which 

results in a dataset consisting of a large number of 

supermarket basket transactions that can be mined for 

frequent patterns (Figure 1). In our approach a transaction 

consists of the mass differences between relevant peaks in 

the MS/MS spectrum. 

FIGURE 1. Frequent itemset mining principle. 
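The supermarket example can be made concrete with a deliberately simple sketch that counts every item combination up to a fixed size and keeps those reaching a minimum support. It enumerates combinations by brute force rather than using the candidate pruning of dedicated algorithms such as Apriori, so it is only suitable for toy data; the baskets and parameter values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


# Hypothetical supermarket baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

In the MS/MS setting described next, the items are mass differences instead of products and each spectrum plays the role of a basket.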

Mass differences associations 

In order to detect relationships between different types of 

mass spectrometry peaks, a distinction is made between 

peaks that were relevant for spectrum identification 

(assigned peaks) and peaks that were not used for the 

identification (unassigned peaks) (Vu et al., 2014). The

mass differences between peaks (either assigned, 

unassigned, or both) are then calculated so that for each 

MS/MS spectrum in the dataset there is a single 

transaction consisting of all its mass differences. 
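
The construction of one transaction per spectrum can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the bin width, the maximum difference and the use of integer bin indices as items are illustrative choices, and peak positions are treated directly as masses (singly charged peaks).

```python
from itertools import combinations


def mass_difference_transaction(peaks, bin_width=0.01, max_diff=300.0):
    """Turn the peak masses of one spectrum into a set of binned mass differences.

    Items are integer bin indices (difference / bin_width, rounded), so that
    nearly identical differences from different spectra map to the same item.
    """
    items = set()
    for low, high in combinations(sorted(peaks), 2):
        diff = high - low
        if diff <= max_diff:
            items.add(int(round(diff / bin_width)))
    return items


# Hypothetical spectrum with three peaks.
transaction = mass_difference_transaction([100.0, 157.02, 228.06])
```

The resulting sets, one per spectrum, are the transactions that a frequent itemset miner takes as input.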

After obtaining these transactions for all MS/MS spectra 

in the dataset, frequent itemset mining can be employed to 

detect relationship patterns (Figure 2). These patterns can 

indicate previously unknown characteristics of the spectra, 

or even detect novel PTMs. 

FIGURE 2. Outline of the approach. 


In order to evaluate our approach, we used MS/MS 

datasets from the PRoteomics IDEntifications (PRIDE) 

database (Vizcaino et al., 2013). This database contains a 

large number of publicly available datasets from mass spectrometry-based

proteomics experiments. However, the

quality of the submitted datasets can vary widely,

which makes the database a suitable candidate for our

pattern mining approach.

Preliminary results show that the detected patterns are able 

to capture valid information in a spectrum. The obtained 

patterns indicate peaks originating from the same peptide 

in case of chimeric spectra and mass differences 

originating from common PTMs. 

REFERENCES 

Naulaerts et al. Brief Bioinform, 16(2): 216–231 (2015). 

Vizcaino et al. Nucleic Acids Res, 41(D1):D1063-9 (2013). 

Vu et al. Proteome Science, 12:54 (2014). 



P38. MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION 

Stefan Naulaerts 1,2* , Pieter Meysman 1,2 , Bart Goethals 1 , Wim Vanden Berghe 3 & Kris Laukens 1,2 .


Antwerpen (biomina) 2 ; Department for Biomedical Sciences, University of Antwerp 3 . * stefan.naulaerts@uantwerpen.be 

Drug resistance and response have traditionally been investigated by means of case-by-case studies. Profiling

drug compounds is time- and resource-intensive. Large-scale information on gene expression, protein

abundance and protein interactions, as well as functional and pathway annotations, is available nowadays, together with freely

accessible repositories for drug targets. Structural evidence for selected drug compounds is also publicly available. These

data offer an enormous opportunity for data integration and pattern mining efforts across each of these levels. Here, we 

apply frequent itemset mining to identify structurally similar compounds, and to detect patterns within the biological 

effect profiles of these chemical compound families. Next, we explore how we can link both types of patterns to meta-information

(such as drug interactions) in a bid to identify promising compounds and speed up the drug discovery 

process by means of candidate prioritization. 

INTRODUCTION 

In the last decades, several widely used databases have 

emerged. These vary from gene expression data and mass-spectrometric

protein identifications to resources covering 

interaction graphs or functional annotations of proteins 

and chemicals. 

The presence of these resources offers interesting

opportunities to gain deeper insight into drug mode of action,

and to help reduce important bottlenecks with regard

to the speed of novel drug discovery or drug repurposing,

by intelligently prioritizing potentially interesting

compounds.

METHODS 

To integrate the listed kinds of data, we use pattern mining 

methods that are collectively known as “frequent itemset 

mining”. This set of techniques uses efficient search strategies to

find sets of items that occur together at least as often as a

minimum support threshold. In this work, we identified several

pattern types based on their source:

• Expression itemsets

• Metadata itemsets

• Graph patterns (protein-protein, protein-drug and chemical structures)

For subgraph mining, we used GASTON 1 . All other data 

sources were analysed with Apriori 2 . 

To deal with the extreme numbers of patterns that result

from mining this kind of data, we used a filter which

incorporates several quality measures: objective

data mining measures (e.g. lift), as well as more

biologically inspired measures (e.g. functional coherence in

the Gene Ontology 3 tree).
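
As an illustration of one such objective measure, lift compares how often two itemsets occur together with what would be expected if they were independent; values above 1 indicate a positive association. The sketch below uses the standard definition of lift; the item names and transactions are hypothetical and the filter actually used in this work is not reproduced here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def lift(transactions, antecedent, consequent):
    """lift(A -> B) = support(A and B) / (support(A) * support(B)).

    Assumes both A and B occur at least once, otherwise the ratio is undefined.
    """
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (support(transactions, antecedent) * support(transactions, consequent))


# Hypothetical expression transactions (one per compound).
profiles = [
    {"geneA_up", "geneB_up"},
    {"geneA_up", "geneB_up"},
    {"geneA_up"},
    {"geneC_up"},
]
score = lift(profiles, {"geneA_up"}, {"geneB_up"})
```

A pattern filter could then keep only patterns whose lift exceeds a chosen cut-off.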

Simple classification based on the patterns was performed 

with CBA 4 . 


We were able to identify several backbone patterns within 

the chemical structures studied and used these to define 

“chemical compound families”. Next, we used this 

classification as starting point to group experimental 

evidence (bio-assays, interactions and metadata). After 

applying cut-offs based on the quality measures, all 

patterns remaining were significant and made sense 

biologically. 

Unsurprisingly, structurally similar compound families 

show significant pattern overlaps in drug-drug interactions, 

gene expression, term co-occurrence and conserved 

protein-protein interactions. We found that specific 

patterns in the biological profile often correlate with 

specific discriminative structural patterns. Moreover, these 

collections of structural frequent subgraphs seemed highly 

relevant for the mode in which a compound connects to 

the “core” proteome. This central proteome performs 

essential functions of the cell (e.g. energy metabolism) and 

it is known to be conserved across cell types. Structurally 

distinct compound families converge much later (if at all) 

to the same “core proteins” than more similar chemicals 

do. This observation corresponds to currently known 

pathway knowledge and tissue biology. 

We were further able to associate previously unseen 

compounds to chemicals present in the database, based on 

the subgraph collection and by extension to the biological 

profile patterns. Manual survey of literature indicated that 

several compounds not covered by our database have 

recently been approved or are in testing as alternative 

drugs to the compounds we hypothesized as being 

substantially similar. 

FIGURE 1. Visualizing the dexamethasone environment. Both predictions 

and experimental evidence (drug-target and protein-protein interactions) 

are shown. 

REFERENCES 

1. Nijssen S & Kok J. ENTCS 127, 77-87 (2005). 

2. Agrawal R & Srikant R. Proc 20th Int Conf on Very Large Databases 

(1994). 

3. Ashburner M et al. Nat Genet 25, 25-29 (2000). 

4. Liu B et al. KDD (1998). 



P39. ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX 

HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS 

ARABIDOPSIS 

Polina Novikova 1 , Nora Hohmann 2 , Marcus Koch 2 & Magnus Nordborg 1 . 

Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), A-1030 Vienna, Austria 1 ; Centre for 

Organismal Studies Heidelberg, University of Heidelberg, D-69120 Heidelberg, Germany 2 . 

*magnus.nordborg@gmi.oeaw.ac.at 

The prevailing notion of species rests on the concept of reproductive isolation. Under this model, sister taxa should not 

share genetic variation unless they still hybridize, or diverged too recently for genetic drift to have eliminated shared 

ancestral polymorphism, and gene trees should generally agree with species trees. Advances in sequencing technology 

are finally making it possible to evaluate this model. We sequenced (Illumina 100 bp paired-end reads) multiple individuals

from 26 proposed taxa in the genus Arabidopsis. Cluster analysis identified seven distinct groups, corresponding to four 

common species — the model species A. thaliana, plus A. arenosa, A. halleri and A. lyrata — and three species with 

very limited geographical distribution. However, at the level of gene trees, only the separation of A. thaliana from the 

remaining taxa was universally supported, and even in this case there was abundant sharing of ancestral polymorphism 

with the other taxa, demonstrating that reproductive isolation must be fairly recent. By considering the distribution of 

derived alleles, we were also able to reject a bifurcating species tree because there is clear evidence for asymmetrical 

gene flow between taxa. Finally, we show that the pattern of sharing and divergence between taxa differs between gene 

ontologies, suggesting a role for selection. 



P40. RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN 

READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES 

Volodimir Olexiouk 1,* , Jeroen Crappé 1 , Steven Verbruggen 1 & Gerben Menschaert 1,* . 

Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and 

Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 . 

INTRODUCTION 

Evidence for micropeptides, defined as translation

products from small open reading frames (sORFs), has

recently emerged. While limitations attributed to

sequencing technologies as well as proteomics have

stalled the discovery of micropeptides, the advent of

ribosome profiling (RIBO-SEQ), a next-generation

sequencing technique revealing the translation machinery

at sub-codon resolution, provided evidence in favor

of translated sORFs. RIBO-SEQ captures and

subsequently sequences the mRNA fragments of approximately 30 nt

protected by ribosomes, providing a means to identify

translated sORFs, possibly encoding functional

micropeptides. Since the advent of ribosome profiling,

several micropeptides with important cellular

functions have been described (e.g. Toddler, Pri-peptides,

Sarcolipin and Myoregulin).

METHODS 

RIBO-SEQ allows the identification of sORFs with

ribosomal activity; however, in order to further assess the

coding potential (the potential of sORFs to truly encode

functional micropeptides), downstream analysis is

necessary. Here we propose a pipeline which starts from

RIBO-SEQ, implements state-of-the-art tools and metrics

assessing the coding potential of sORFs and creates a list

of candidate sORFs for downstream analysis (e.g.

proteomic identification). In summary, assessment of the

coding potential includes: PhyloCSF (conservation

analysis), FLOSS-score (ribosome protected fragment

(RPF) length distribution analysis), ORFscore (distribution

analysis of RPFs towards the first frame of a coding

sequence (CDS)), BLASTp (sequence similarity) and VarAn

(genetic variation analysis). In an attempt to set a

community standard and to make sORFs accessible

to a larger audience, a public database (www.sorfs.org) is

provided in which publicly available datasets were processed

by this pipeline, allowing users to browse, query and

export identified ORFs. Furthermore, a PRIDE-respin

pipeline was developed in order to periodically search the

PRIDE database for proteomic evidence.
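
To make the notion of a candidate sORF concrete, the sketch below enumerates ATG-to-stop open reading frames on the forward strand of a nucleotide sequence and keeps those within a length range. It is not part of the pipeline described above, which starts from RIBO-SEQ evidence; the length cut-offs (10 to 100 codons), the function name and the restriction to ATG start codons are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(sequence, min_codons=10, max_codons=100):
    """Return (start, end, frame) for forward-strand ORFs within the length range.

    start is the index of the A of ATG, end is one past the stop codon,
    and the codon count excludes the stop codon.
    """
    sequence = sequence.upper()
    hits = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return hits


# Hypothetical transcript: one 12-codon ORF in frame 2.
example = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
sorfs = find_sorfs(example)
```

Candidates found this way would still need ribosome profiling support and the coding potential metrics listed above before being considered further.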


The pipeline has been tested and curated on three different

cell lines: HCT116 (human), E14

mESC (mouse) and S2 (fruit fly). The results obtained

were similar to those reported in recent

literature, supporting the relevance of the pipeline. All metrics, as stated

above, have been carefully inspected for their biological

relevance and contributed significantly to the detection of

sORFs. The pipeline is currently being finalized, but

is available upon request. The public repository is

accessible at http://www.sorfs.org, and includes the

datasets mentioned above, resulting in 263354 sORFs. Two

querying interfaces were implemented: a default query

interface intended for browsing sORFs and a BioMart

query interface for advanced querying and export

functions. Each sORF has its own detail page, visualizing

the above-discussed metrics and ribosome profiling data;

a link to the UCSC browser is also provided, visualizing

the RIBO-SEQ data.

REFERENCES 

Pauli,A., Norris,M.L., Valen,E., Chew,G.-L., Gagnon,J. a, 

Zimmerman,S., Mitchell,A., Ma,J., Dubrulle,J., Reyon,D., et al. 

(2014) Toddler: an embryonic signal that promotes cell movement 

via Apelin receptors. Science, 343, 1248636. 





Crappé,J., Ndah,E., Koch,A., Steyaert,S., Gawron,D., De Keulenaer,S., 

De Meester,E., De Meyer,T., Van Criekinge,W., Van Damme,P., et 

al. (2014) PROTEOFORMER: deep proteome coverage through 

ribosome profiling and MS integration. Nucleic Acids Res., 

10.1093/nar/gku1283. 

Ingolia,N.T. (2014) Ribosome profiling: new views of translation, from 

single codons to genome scale. Nat. Rev. Genet., 15, 205–13. 

Crappé,J., Van Criekinge,W., Trooskens,G., Hayakawa,E., Luyten,W., 

Baggerman,G. and Menschaert,G. (2013) Combining in silico 

prediction and ribosome profiling in a genome-wide search for novel 

putatively coding sORFs. BMC Genomics, 14, 648. 





Chanut-Delalande,H., Hashimoto,Y., Pelissier-Monier,A., Spokony,R., 

Dib,A., Kondo,T., Bohère,J., Niimi,K., Latapie,Y., Inagaki,S., et al. 

(2014) Pri peptides are mediators of ecdysone for the temporal 

control of development. Nat. Cell Biol., 16 



P41. RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE 

ALIGNMENT 

Gabriele Orlando 1,2,3,4 , Wim Vranken 1,2,3 & Tom Lenaerts 1,4,5 .

Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, CP 263 1 ; Structural

Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2 2 ; Structural Biology Research Center, VIB, 1050 Brussels,

Belgium 3 ; Machine Learning group, Université Libre de Bruxelles, Brussels, 1050, Belgium 4 ; Artificial Intelligence

lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium 5 .

INTRODUCTION 

Reliable protein alignments are a central problem for 

many bioinformatics tools, such as homology modelling. 

Over the years many different algorithms have been 

developed and different kinds of information have been 

used to align very divergent sequences [1]. Here we 

present a pairwise alignment tool, called Rigapollo, based 

on pairwise HMM-SVM, which includes backbone 

dynamics predictions [2] in the alignment process: recent 

work suggests that protein backbone dynamics is often

evolutionarily conserved and contains information

orthogonal to the amino acid conservation.

METHODS 

Rigapollo uses a pairwise HMM-SVM alignment 

approach to infer the optimal alignment between two 

proteins, taking into consideration both sequence and 

dynamic information. The model (described in Figure 1) is

composed of three states: M (match), G1 (gap in the first

sequence) and G2 (gap in the second sequence). The 

transition probabilities are defined in the same way as a 

standard HMM. This new alignment tool is further 

designed in the following manner: 

Defining the N-dimensional feature vectors: 

Each amino acid in the sequences is described by an

N-dimensional feature vector. That vector can be defined

using any kind of information, ranging from evolutionary

information (e.g. a PSSM calculated with HHblits [3]) to

dynamics predictions (using the DynaMine predictor [2]).

While standard pairwise HMMs require the definition of a

finite and discrete alphabet of observable states, our model

works directly with these feature vectors (which may or may not be

orthonormal), evaluating the emission

probability with a support vector machine (SVM).

Definition of the emission probability:

We define the emission probability using an SVM trained

to discriminate matches from mismatches. We define as

matches all the positions in the reference pairwise

alignments that do not contain gaps, and we use the

concatenation of the previously defined feature vectors to

describe them. These matches are considered positive examples.

For the mismatches, we perform the same

procedure, but couple positions that, in the reference

alignment, are shifted by a number of amino acids varying

between 5 and 10. After training, the predicted

emission probability for the M state, given the

concatenation of two feature vectors, is a function of

the distance D from the decision hyperplane of the SVM

(called f(D)). The corresponding emission probabilities for

the states G1 and G2 are modeled as 1-f(D).
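
The construction of the SVM training pairs described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the Rigapollo code: it works on residue index pairs only (the feature vectors would be concatenated afterwards), it assumes the shift may go in either direction, and it simply drops shifted partners that fall outside the second sequence. The toy alignment and all names are hypothetical.

```python
import random

GAP = "-"


def aligned_index_pairs(aln_a, aln_b):
    """Residue index pairs (i, j) for alignment columns that contain no gap."""
    pairs = []
    i = j = 0
    for a, b in zip(aln_a, aln_b):
        if a != GAP and b != GAP:
            pairs.append((i, j))
        if a != GAP:
            i += 1
        if b != GAP:
            j += 1
    return pairs


def training_pairs(aln_a, aln_b, rng, min_shift=5, max_shift=10):
    """Return (matches, mismatches) as lists of residue index pairs."""
    matches = aligned_index_pairs(aln_a, aln_b)
    len_b = sum(1 for c in aln_b if c != GAP)
    mismatches = []
    for i, j in matches:
        # Couple residue i with a partner shifted by 5 to 10 positions.
        shift = rng.randint(min_shift, max_shift) * rng.choice((-1, 1))
        if 0 <= j + shift < len_b:
            mismatches.append((i, j + shift))
    return matches, mismatches


# Hypothetical reference alignment of two short sequences.
matches, mismatches = training_pairs(
    "ACDE-FGHIKLMNPQ", "AC-EYFGHIKLMNPQ", random.Random(0)
)
```

Each index pair would then be turned into one training example by concatenating the two residues' feature vectors, labelled positive for matches and negative for mismatches.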


To evaluate the performance of Rigapollo, we

adopted two publicly available subsets of the BAliBASE and

SABmark alignment datasets, already used to evaluate

other pairwise alignment tools [1]. From the MSAs, all

pairwise alignments were extracted, and all those

sharing a percentage of sequence identity equal to the median

of that of the full database were put in the subset.

The datasets consist of 38 and 123 manually

curated, structure-based pairwise alignments, respectively, and they

share very low sequence identity. For the evaluation

we performed a 10-fold randomized cross-validation.

Rigapollo increases the quality of low sequence

identity pairwise alignments by 5 to 10% with respect to

state-of-the-art methods, and it appears that the

increase in performance is more marked for very

divergent sequences, such as those in the

SABmark dataset, where the dynamics information seems

to significantly increase the quality of the alignment. This

is probably due to the fact that dynamics are often well

conserved in functional patterns, even when the sequence

is not preserved [2].

Figure 1: Structure of the pairwise HMM-SVM model

REFERENCES 

[1] Do, Chuong B., et al. Research in Computational Molecular Biology.

Springer Berlin Heidelberg, 2006.

[2] Cilia, Elisa, et al. Nucleic Acids Research 42.W1 (2014): W264-W270.

[3] Remmert, Michael, et al. Nature Methods 9.2 (2012): 173-175.



P42. EARLY FOLDING AND LOCAL INTERACTIONS 

R. Pancsa 1 , M. Varadi 1 , E. Cilia 2,3 , D. Raimondi 1,2,3 & W. F. Vranken 1,3,* . 

Structural Biology Research Centre, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium 1 ; 

Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium 2 ; Interuniversity Institute of Bioinformatics

in Brussels (IB)², Brussels, Belgium 3 . * wvranken@vub.ac.be

INTRODUCTION 

Protein folding is in its early stages largely determined by 

the protein sequence and complex local interactions 

between amino acids, resulting in the formation of foldons 

that provide the context for further folding into the native 

state. These early folding processes are therefore 

important to understand subsequent folding steps and their 

influence on, for example, aggregation, but they are 

difficult to study experimentally. We here address this 

issue computationally by assembling and analysing a 

dataset on early folding residues from hydrogen deuterium 

exchange (HDX) data from NMR and MS, and analyse 

how they relate to the sequence-based backbone dynamics 

predictions from DynaMine (Cilia et al. 2013, 2014) and 

evolutionary information from multiple sequence 

alignments. 

METHODS 

We assembled a dataset of HDX experimental data from 

NMR and MS from literature for 57 proteins totalling 

4172 residues. The data were classified into early,

intermediate and late classes depending on the folding 

time where protection of the backbone NH was observed, 

and into strong, medium and weak classes depending on 

how long the amides remain protected upon unfolding the 

native state. This resulted in 219 residue sets that are 

organised in XML files and loaded into a database that is 

made available online via http://start2fold.eu. 

The DynaMine predictions were run locally with a new 

version of the software that handles C- and N-terminal 

effects. These original predictions were then normalised

by shifting them so that the maximum prediction value for

each protein is always 1.0. This does not affect the relative

differences between the prediction values within each

protein, but effectively normalises the values between

different proteins. MSAs were generated for each

sequence in the dataset using HHblits and Jackhmmer with

3 iterations and an E-value threshold of 10^-4. All the retrieved

homologs have a minimum of 90% coverage of the query

sequence. By using HHfilter, a post-processing tool

provided in the HHblits package, we built two different 

sets of MSAs by varying the maximum pairwise sequence 

identity threshold between the collected homologs in each 

MSA. The (ungapped) sequences in the MSAs were 

predicted without normalisation in order to preserve the 

differences within a protein family, and mapped back to 

the full (gapped) MSA. 
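
The shift normalisation of the per-protein predictions described above amounts to adding a constant offset. A minimal sketch, with a hypothetical function name and made-up prediction values:

```python
def normalise_by_shift(predictions):
    """Shift all values of one protein so that its maximum becomes 1.0.

    Adding the same offset to every residue keeps the differences between
    residues of the same protein unchanged.
    """
    offset = 1.0 - max(predictions)
    return [value + offset for value in predictions]


# Hypothetical per-residue predictions for one protein.
original = [0.5, 0.75, 0.875]
normalised = normalise_by_shift(original)
```

Here the offset is 0.125, so the values become 0.625, 0.875 and 1.0, and the gaps between residues are the same as before.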

Our analysis shows that the DynaMine-predicted rigidity 

of the protein backbone represents where the protein is 

likely to adopt specific lower free energy conformations 

based on sequence-encoded local interactions, as 

evidenced by the HDX data on early folding (Figure 1). 

This effect is also present on a per-residue basis. 

FIGURE 1. Distribution of DynaMine predictions for early folding 

residues (green) and non-early folding residues (brown) for the original 

(left) and normalized (right) values. 

When relating the secondary structure elements as 

observed in the native fold to the early folding residues, 

we observe that the ‘early folding’ secondary structure 

elements also tend to be more rigid overall. Finally, we 

examined whether early folding is conserved in evolution 

on the basis of multiple sequence alignments. Although 

there is no conservation of individual amino acids, the 

physical characteristic of a rigid backbone seems to be 

conserved. 

We therefore propose that the backbone dynamics of the 

protein is a fundamental physical feature conserved by 

proteins that can provide important insights into their 

folding mechanisms and stability. 

REFERENCES 

Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013). 

From protein sequence to dynamics and disorder with DynaMine. 

Nature Communications, 4, 2741. 

http://doi.org/10.1038/ncomms3741 

Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2014). 

The DynaMine webserver: predicting protein dynamics from 

sequence. Nucleic Acids Research, 42(Web Server), W264–W270.

http://doi.org/10.1093/nar/gku270 




P43. BINDING SITE SIMILARITY DRUG REPOSITIONING: 

A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY 

AND SIDE EFFECTS DETECTION 

Daniele Parisi & Yves Moreau. 

I developed a protocol based on the prediction of druggable cavities, comparison of these putative binding sites and cross-docking

between bound ligands and the binding sites detected as similar to that of the complex, in order to study the

cross-reactivity of known compounds. It is a general method because it can find applications both in drug repositioning

and in the study of adverse effects, and it is systematic because it consists of several consecutive steps. It would indicate

ligands to screen, reducing the number of candidates and allowing companies or universities to save the money and time

spent on unnecessary tests.

INTRODUCTION 

The ability of small molecules to interact with multiple

proteins is referred to as polypharmacology [1], and the

strategy that aims to exploit the positive aspects of

polypharmacology is drug repositioning, whereby existing

drugs are investigated for efficacy against targets for other

indications. Existing drugs are privileged structures with

verified bioavailability and compatibility. Furthermore,

virtual screening makes it possible to reposition

existing drugs against novel disease targets without the

expense of purchasing thousands of compounds [2]. The

combination of structure-based virtual screening (such as

estimation of the similarity of protein-ligand binding sites and

subsequent cross-docking) and drug repositioning

represents a highly efficient and fast methodology for

predicting cross-reactivity and putative side effects of drug

candidates [3].

METHODS 

Each step of my work relies on a bioinformatics

technique or tool, so that the protocol couples several

software packages.

1. Choice of the query (a single protein

as a PDB file) and the templates (a set of PDB

structures). At least one of the two categories has to

present a ligand bound in a cavity;

2. Prediction of druggable cavities in all the protein

structures using a geometry-based or an energy-based

algorithm (Fpocket, a geometry-based tool, in my case);

3. Comparison of the query binding sites to the binding

sites of the templates to assess their similarity. It can

be carried out by an alignment-based or alignment-free

algorithm (I used APoc, an alignment-based tool);

4. Cross-docking of the ligand available in the pair of

similar binding sites into the other cavity, in order to

study the binding with a different target for toxicity or

new therapeutic indications (AutoDock Vina);

5. Fingerprinting of the new ligand-cavity complex for

scoring the docking poses.
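
One common way to use fingerprints for ranking poses, as in step 5, is to compare the interaction fingerprint of each docked pose with that of a reference complex using the Tanimoto coefficient. The sketch below illustrates this idea only; the abstract does not state which fingerprint or similarity measure was used, and the residue labels, interaction types and pose names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of features."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def rank_poses(reference_fp, pose_fps):
    """Return pose names sorted from most to least similar to the reference."""
    return sorted(
        pose_fps,
        key=lambda name: tanimoto(reference_fp, pose_fps[name]),
        reverse=True,
    )


# Hypothetical interaction fingerprints: (residue, interaction type) pairs.
reference = {("HIS57", "hbond"), ("SER195", "hbond"), ("TRP215", "hydrophobic")}
poses = {
    "pose_2": {("SER195", "hbond"), ("GLY216", "hbond")},
    "pose_1": {("HIS57", "hbond"), ("SER195", "hbond")},
}
ranking = rank_poses(reference, poses)
```

In this toy case pose_1 shares two of three reference interactions (Tanimoto 2/3) and is ranked above pose_2 (Tanimoto 1/4).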

I applied this protocol to two different queries (thrombin

and dihydrofolate reductase), using a data set of 1067

druggable proteins as templates (Druggable Cavity

Directory).


The method works well in repositioning ligands among

proteins of the same family (intraprotein), but is not able

to detect interprotein similarities (among unrelated

proteins). This is due to the large size of the

predicted cavities (larger than the mere space occupied by

the ligand) coupled with the alignment-based algorithm used,

which makes it difficult to reach a sufficient similarity score

and greatly increases the number of false negatives. In

future work I will divide the cavity space into subpockets,

decouple the similarity from the sequence by using

pharmacophoric maps, and combine the structure-based

similarity with ligand-based and network-based similarity. All the

information will be fused with data integration algorithms.

REFERENCES 

[1] On the origins of drug polypharmacology, Xavier Jalencas and Jordi

Mestres, Med. Chem. Commun., 2013, 4, 80.

[2] Drug repositioning by structure-based virtual screening, Dik-Lung Ma,

Daniel Shiu-Hin Chan and Chung-Hang Leung, Chem. Soc. Rev.,

2013, 42, 2130.

[3] Comparison and Druggability Prediction of Protein−Ligand Binding

Sites from Pharmacophore-Annotated Cavity Shapes, Jérémy

Desaphy, Karima Azdimousa, Esther Kellenberger, and Didier

Rognan, J. Chem. Inf. Model. 2012, 52, 2287−2299.



P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS 

OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO 

THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC 

APPROACH 

Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx * . 

Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering 

Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, 

Brussels, Belgium. *Stefan.Weckx@vub.ac.be 

Acetobacter ghanensis LMG 23848 T and Acetobacter senegalensis 108B are acetic acid bacteria strains that originate

from a spontaneous cocoa bean heap fermentation process. They have been indicated as strains with interesting

functionalities through extensive metabolic and kinetic studies. Whole-genome sequencing of A. ghanensis LMG 23848 T

and A. senegalensis 108B made it possible to unravel their genetic adaptations to the cocoa bean fermentation ecosystem.

INTRODUCTION 

Fermented dry cocoa beans are the basic raw material for 

chocolate production. The cocoa pulp-bean mass contents 

of the cocoa pods undergo, once taken out of the pods, a 

spontaneous fermentation process that lasts four to six 

days. This process is characterised by a succession of 

yeasts, lactic acid bacteria (LAB), and acetic acid bacteria 

(AAB) coming from the environment (De Vuyst et al., 

2015). 

METHODS 

Total genomic DNA isolation and purification of A. 

ghanensis LMG 23848 T and A. senegalensis 108B was 

followed by the construction of an 8-kb paired-end library, 

454 pyrosequencing, and assembly of the sequence reads 

using the GS De Novo Assembler version 2.5.3 with 

default parameters. Genome finishing was performed by 

PCR assays to close gaps in the draft assembly using 

CONSED 23.0. Automated gene prediction and annotation 

of the assembled genome sequences were carried out using 

the bacterial genome sequence annotation platform 

GenDB v2.2 (Meyer et al., 2003). The predicted genes 

were functionally characterised using searches in public 

databases and bioinformatics tools, and annotations were 

manually curated. Comparative analysis of the genome 

sequences of the cocoa-derived strains A. ghanensis LMG 

23848 T (this study), A. senegalensis 108B (this study), and 

A. pasteurianus 386B (Illeghems et al., 2013) was 

accomplished by the EDGAR framework (Blom et al., 

2009). 


The genomes of the strains investigated consisted of a 

circular chromosomal DNA sequence with a size of 2.7 

Mbp and two plasmids for A. ghanensis LMG 23848 T and 

a circular chromosomal DNA sequence with a size of 3.9 

Mbp and one plasmid for A. senegalensis 108B (Figure 1). 

Comparative analysis revealed that the order of 

orthologous genes was highly conserved between the 

genome sequences of A. pasteurianus 386B and A. 

ghanensis LMG 23848 T . Evidence was found that both 

species possessed the genetic ability to be involved in 

citrate assimilation and they displayed adaptations in their 

respiratory chain. As is the case for many AAB, the 

missing gene encoding phosphofructokinase in the 

genome sequences of both A. ghanensis LMG 23848 T and 

A. senegalensis 108B resulted in a non-functional upper 

part of the Embden–Meyerhof–Parnas pathway. However, 

the presence of genes coding for membrane-bound PQQ-dependent

dehydrogenases enabled the AAB strains 

examined to rapidly oxidise ethanol into acetic acid. 

Furthermore, an alternative TCA cycle, characterised by 

genes coding for a succinyl-CoA:acetate-CoA transferase 

and a malate:quinone oxidoreductase, was present. 

In addition, evidence was found in both genome

sequences that glycerol, mannitol and lactate could be 

used as energy sources. Thus, although both species 

displayed genetic adaptations to the cocoa bean 

fermentation process, their dependence on glycerol, 

mannitol and lactate may partly explain their low 

competitiveness during cocoa bean fermentation processes, 

as these substrates have to be formed through yeast or 

LAB activities, respectively. 

FIGURE 1. Graphical representation of the genomes of A. ghanensis 

LMG 23848 T (A) and A. senegalensis 108B (B). 

REFERENCES 

Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., 

Goesmann, A., 2009. EDGAR: a software framework for the comparative 

analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14. 

De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa

bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.).

Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell,

Ames, IA, USA. In press.

Illeghems, K., De Vuyst, L., Weckx, S., 2013.

Complete genome sequence and comparative analysis of Acetobacter

pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation

ecosystem. BMC Genomics 14, 526.

Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003. 

GenDB - an open source genome annotation system for prokaryote genomes. 

Nucleic Acids Res. 31, 2187-2195. 





P45. REPRESENTATIONAL POWER OF GENE FEATURES 

FOR FUNCTION PREDICTION 

Konstantinos Pliakos 1* , Isaac Triguero 2,3 , Dragi Kocev 4 & Celine Vens 1 . 

Department of Public Health and Primary Care, KU Leuven Kulak 1 ; Department of Respiratory Medicine, Ghent 

University 2 ; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center 3 ; Department of 

Knowledge Technologies, Jožef Stefan Institute 4 . * konstantinos.pliakos@kuleuven-kulak.be 

We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature 

representation, as well as the effect of this issue on hierarchical multi-label classification algorithms. 

INTRODUCTION 

This study focuses on hierarchical multi-label 

classification (HMC). HMC is a variant of classification 

where one sample can be assigned to several classes 

simultaneously. It differs though from multi-label 

classification as these classes are organized in a hierarchy. 

That means that a sample belonging to a class 

automatically belongs to all its super-classes. Typical 

HMC tasks include gene function prediction or text 

classification. Here, we focus on the former. 

A typical characteristic of genes is that they can be 

described in several ways: using information about their 

sequence, homology to well-characterized genes, 

expression profiles, secondary structure of their derived 

proteins, etc. The HMC community has multiple research 

datasets at its disposal on gene functions (e.g., (Vens et al., 

2008) or (Schietgat et al., 2010)), each representing genes 

by one type of features. Researchers should certainly take advantage of this wealth of data, but the question arises of how “good” these datasets are: how discriminative are the features describing a gene? This short study highlights existing data-related problems and offers answers to these questions.

DATA STUDY & RESULTS 

After careful experimentation on various publicly available datasets, we noted that some of them contain a large number of duplicate feature vectors: there are genes that, despite having different functions, have exactly the same feature representation. Table 1 quantifies this problem for datasets drawn from the 20 gene function prediction datasets described in (Vens et al., 2008) and (Schietgat et al., 2010).

Organism       Dataset   Nb of genes   Nb of unique gene representations
S. cerevisiae  church    3755          2352
               pheno     1591          514
               hom       3854          3646
               seq       3919          3913
               struc     3838          3785
A. thaliana    scop      9843          9415
               struc     11763         11689

TABLE 1. Datasets, the number of genes and their unique representations. 

As shown, the church (micro-array expression) and pheno (phenotype features) datasets suffer the most. In the pheno dataset, 67.7% of the gene representations are duplicates. The most frequent feature vector appears 315 times: 197 times in the training set and 118 times in the test set. As a result, 20% of the 582 test examples present the same feature vector as input for prediction. In a decision tree model, for example, these genes end up in the same leaf and receive the same prediction (the average class vector of the 197 training examples), but receive different error terms because they are a priori associated with different class label sets. In the training phase, there may still be a lot of variation in the class vectors of the 197 genes, but no split exists to separate them. In the church dataset, the 3755 genes correspond to only 2352 unique feature descriptors. In the hom and struc datasets the number of duplicates is lower but still notable, considering the very large size of the feature vectors in these datasets.
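The duplicate counts discussed above can be checked with a few lines of code. The following is a minimal sketch, assuming each gene is available as a sequence of hashable feature values; the function name `duplicate_stats` and the toy data are illustrative and not part of the original study.

```python
from collections import Counter


def duplicate_stats(feature_vectors):
    """Return (n_samples, n_unique, duplicate_fraction, max_multiplicity).

    duplicate_fraction is 1 - n_unique / n_samples, i.e. the share of
    samples that do not add a new feature representation.
    """
    counts = Counter(tuple(v) for v in feature_vectors)
    n_samples = sum(counts.values())
    n_unique = len(counts)
    duplicate_fraction = 1.0 - n_unique / n_samples
    return n_samples, n_unique, duplicate_fraction, max(counts.values())


# Toy data (hypothetical): five genes, three distinct feature vectors.
toy = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0), (0, 0, 1)]
toy_stats = duplicate_stats(toy)

# The same arithmetic applied to the pheno counts reported in Table 1.
pheno_duplicate_fraction = 1.0 - 514 / 1591
```

Applied to the pheno counts of Table 1 (1591 genes, 514 unique representations), the same formula gives the 67.7% quoted in the text.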

For evaluation purposes, ML-KNN (Zhang & Zhou, 2007) was employed to demonstrate the effect of the studied problem on the average precision for the FunCat-annotated datasets. Here, “unique” refers to the datasets obtained after removing all duplicates, so that any feature vector can be included only once in a gene’s neighbour set. We report the average over 10 “unique” versions, each one using a different gene’s class labels as ground truth for the feature vector.

                  K = 1              K = 5              K = 17
Dataset  Version  Train  Test (5cv)  Train  Test (5cv)  Train  Test (5cv)
pheno    initial  51.59  23.62       39.55  24.14       32.76  23.59
         unique   100    24.21       55.62  24.90       39.70  25.01
hom      initial  98.30  39.32       63.64  39.45       48.96  37.28
         unique   100    39.14       64.64  39.67       49.28  37.53

TABLE 2. Average Precision rates (%) using ML-KNN. 

The table shows that a less discriminant feature representation can affect ML-KNN and decrease the precision of multi-label classification. The same problem can be expected to be more pronounced, or even disastrous, for two-class or multi-class classification problems.

CONCLUSION 

The major point of this study was to inform the research 

community of the relatively low representational power of 

the features present in some widely used gene function 

prediction datasets, making them even more difficult and 

challenging datasets from a machine learning perspective.

We observed the same issue in datasets of other HMC 

application domains like text categorization. 

REFERENCES 

Zhang M. L. & Zhou Z. H. ML-KNN: A lazy learning approach to multi-label learning, Pattern 

recognition 40, 2038-2048, (2007). 

Vens C. et al. Decision trees for hierarchical multi-label classification, Machine Learning 73, 185-214, 

(2008). 

Schietgat L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles, BMC 

Bioinformatics 11, (2010). 



P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY 

PREDICTION 

Fabrizio Pucci 1,* , Katrien Bernaerts 1,2 , Fabian Teheux 1 , Dimitri Gilis 1 & Marianne Rooman 1 . 

Department of BioModeling, BioInformatics & BioProcesses 1 , Université Libre de Bruxelles, 1050 Brussels, Belgium; 

BioBased Materials, Faculty of Humanities and Sciences 2 , Maastricht University, 6200 Maastricht, The Netherlands. 

* fapucci@ulb.ac.be 

In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG). First, we measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms described in the literature. Then we show how important it is to exploit the symmetry of the model to avoid biasing. Finally, we briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with the aim of understanding whether the distribution is universal and how much it is biased towards the training dataset.

INTRODUCTION 

The accurate prediction of the stability changes on a large 

scale is still a challenge in protein science. Despite the 

large amount of work done in the last years, the results 

frequently suffer from hidden biases towards the training 

dataset and this makes the evaluation of the real 

performances a difficult task. 

Here we study the “bias problem” in the case of the prediction of protein thermodynamic stability changes upon point mutations, and more precisely of its best descriptor ΔΔG, the change of folding free energy upon mutation from the wild-type protein W to the mutant M. In principle, the predicted ΔΔG value of the inverse mutation (M to W) has to be exactly equal to minus the ΔΔG of the direct mutation (W to M), since the free energy is a state function.

Unfortunately, the asymmetry of the training dataset towards destabilizing mutations (reflecting the evolutionary optimization of protein stability) makes the prediction of inverse mutations less accurate than that of direct ones. This introduces a series of distortions in the prediction model, which we analyze here.

METHODS 

We computed the ΔΔG value for a set of almost 200 mutations for which the structures of both the wild-type protein and the mutant are known, using a series of prediction tools, i.e. PoPMuSiC [1], I-Mutant, FoldX, Duet, AutoMute, CupSat, Eris and ProSMS. We then computed the ratio (RID) of the standard deviation between the predicted and the experimental values of ΔΔG for the inverse mutations to that for the direct mutations (which should be one in the case of a perfectly symmetric prediction) and compared the results of the different programs.
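The RID defined above can be illustrated with a short sketch. It assumes that the "standard deviation between predicted and experimental values" is the root-mean-square deviation, and that the experimental ΔΔG of an inverse mutation is minus that of the direct mutation; both are assumptions made for illustration, as are the function names and the toy numbers.

```python
import math


def rms_deviation(predicted, experimental):
    """Root-mean-square deviation between predicted and experimental values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def rid(pred_direct, pred_inverse, exp_direct):
    """Ratio of the inverse-mutation deviation to the direct-mutation deviation.

    Assumes exp(M -> W) = -exp(W -> M), as implied by free energy being a
    state function. A perfectly symmetric predictor gives a ratio of 1.
    """
    exp_inverse = [-x for x in exp_direct]
    sigma_dir = rms_deviation(pred_direct, exp_direct)
    sigma_inv = rms_deviation(pred_inverse, exp_inverse)
    return sigma_inv / sigma_dir


# Toy values (hypothetical, in kcal/mol).
exp_direct = [1.0, 2.0, -0.5]
pred_direct = [1.5, 1.5, 0.0]

# A symmetric predictor returns the negated direct prediction.
rid_symmetric = rid(pred_direct, [-p for p in pred_direct], exp_direct)

# A predictor biased towards destabilizing values for inverse mutations.
rid_biased = rid(pred_direct, [0.0, -0.5, 1.0], exp_direct)
```

In this toy case the symmetric predictor yields a ratio of 1, while the biased one yields a ratio above 2. The ratio is undefined when the direct-mutation deviation is zero.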

If the functional structure of the model is known, as in the case of the artificial neural network of PoPMuSiC, one can further understand which terms contribute most to the deviation of the RID from unity, and thus propose new model structures in which the biases are correctly avoided [2].

In the more blind machine learning approaches (such as methods based on Random Forest or Support Vector Machine), in which the functional form is not explicitly known, the asymmetry correction is less obvious.

In a second part, we investigated how the symmetry of the ΔΔG value distribution in the training dataset influences the prediction of the ΔΔG distribution for all possible mutations in a series of proteins with known structures.


The asymmetry estimated for a series of available prediction methods gives RID values between 1 for bias-corrected methods and about 3 for the most biased programs. These results show that the correct use of symmetry in setting up the model structure helps to avoid unwanted biases towards destabilizing mutations.

Furthermore, the distribution of the ΔΔG values for all point mutations in some proteins has been analyzed and shows a dependence on the ΔΔG distribution of the training dataset when the RID deviates significantly from one. Understanding the relation between the two distributions is an important step towards comprehending the universality of the distribution [3] and how much proteins are optimized to minimize the impact of single-site amino acid substitutions.

REFERENCES 

[1] Y. Dehouck, Jean Marc Kwasigroch, D. Gilis, M. Rooman (2011), 

PopMusic 2.1 : a web server for the estimation of the protein 

stability changes upon mutation and sequence optimality. BMC 

Bioinformatics. 12, 151 

[2] F. Pucci, K. Bernaerts, F. Teheux, D. Gilis, M. Rooman, Symmetry 

Principles in Optimization Problems: an application to Protein 

Stability Prediction (2015), IFAC-PapersOnLine 48-1, 458-463 

[3] Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS, The 

stability effects of protein mutations appear to be universally 

distributed (2007), J Mol Biol, 356, 1318-1332. 



P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC 

VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF 

THEIR DELETERIOUS EFFECTS 

Daniele Raimondi 1,2,3,4 , Andrea Gazzo 1,2 , Marianne Rooman 1,6 , Tom Lenaerts 1,2,5 & Wim Vranken 1,2,3,4 . 

Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium 1 ; Machine Learning group, 

Université Libre de Bruxelles, Brussels, 1050, Belgium 2 ; Structural Biology Brussels, Vrije Universiteit Brussel, 

Brussels, 1050, Belgium 3 ; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium 4 ; Artificial Intelligence lab, 

Vrije Universiteit Brussel, Brussels, 1050 Belgium 5 ; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050, 

Belgium 6 . * daniele.raimondi@vub.ac.be 

The increasing availability of genome sequence data led to the development of predictors that are capable of identifying 

the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs). 

Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical 

and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable 

successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are 

not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and 

functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality 

of variant-effect predictors. 

INTRODUCTION 

The phenotypical interpretation at the organism level of 

protein-level alterations is the ultimate goal of the variant-effect

prediction field. This causal relationship is still far 

from being completely understood and is confounded by 

many aspects related to the intrinsic complexity of cell life. A 

crucial restriction of variant-effect prediction is that an 

alteration of the protein’s molecular phenotype, even if it is a 

sine qua non condition for the disease phenotype in the 

carrier individual, may not constitute in itself a sufficient

cause for the disease: this also depends on the particular role 

that the affected protein plays in the well-being of the 

organism. Even the most commonly used features, which 

relate evolutionary constraints with likely functional damage, 

offer only a partial correlation with the pathogenicity of the 

variant. Consequently, additional information that bridges the 

variant-phenotype gap is crucial to improve variant-effect 

predictions. 

METHODS 

We address the inherently complex variant-effect prediction 

problem through the integration of different sources of 

information. By describing each (protein, variant) pair from 

different perspectives corresponding to different levels of 

contextualisation, we assembled the most relevant and 

accessible pieces of information that are currently available, 

with the aim to elucidate the fuzzy and complex mapping 

between molecular-level alterations and the individual-level 

phenotypic outcome. We use three variant-oriented features 

with different characteristics: the log-odds ratio (LOR) score

and Conservation index (CI) [1], which are column-wise 

measures of the conservation of a mutated column within a 

multiple-sequence alignment (MSA), and the PROVEAN [2] 

predictions (PROV), which provide a sequence-wide measure 

of the change in evolutionary distance between the mutated 

target protein and close functional homologs that correlates 

with the deleteriousness of variants. The protein-oriented 

features use pathway [4] and protein-protein interaction 

networks information [5] (DGR) as well as genetic and 

clinical information, for instance an evaluation of how 

tolerant the affected genes are to homozygous loss-of-function

mutations (REC) [3]. 


DEOGEN is our novel variant effect predictor that can 

natively handle both SNVs and inframe INDELs. By 

integrating information from different biological scales and 

mimicking the complex mixture of effects that lead from the 

variant to the phenotype, we obtain significant improvements 

in the variant-effect prediction results. Next to the typical 

variant-oriented features based on the evolutionary 

conservation of the mutated positions, we added a collection 

of protein-oriented features that are based on functional 

aspects of the gene affected. We cross-validated DEOGEN on 

36825 polymorphisms, 20821 deleterious SNVs and 1038 

INDELs from SwissProt. 

Method               Missing SNVs  Sen  Spe  Pre  Bac  MCC
PROVEAN              0.0           78   79   68   79   56
SIFT                 2.0           85   69   61   77   52
Mutation Assessor    0.6           85   71   63   78   54
PolyPhen2 (HumDiv)   4.0           89   63   57   76   50
CADD                 7.0           82   75   66   78   55
EFIN                 0.0           86   80   87   83   64
MutationTaster       20.7          86   75   69   81   60
GERP++               20.7          97   24   45   61   28
DEOGEN               4.4           77   92   85   84   71

FIGURE 1. Comparison of the performances of 8 variant-effect predictors 

with DEOGEN on Humsavar 2013 dataset. 
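For readers less familiar with the column headers, the sketch below shows how sensitivity (Sen), specificity (Spe), precision (Pre), balanced accuracy (Bac) and the Matthews correlation coefficient (MCC) are commonly derived from a confusion matrix. It assumes deleterious variants are the positive class and uses hypothetical counts; it is not the evaluation code used for the comparison above.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)          # true positive rate
    spe = tn / (tn + fp)          # true negative rate
    pre = tp / (tp + fp)          # positive predictive value
    bac = (sen + spe) / 2         # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Bac": bac, "MCC": mcc}


# Hypothetical counts: 100 deleterious and 100 neutral variants.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

The values are fractions; multiplying by 100 gives percentages in the style of the table.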

REFERENCES 

[1] Calabrese, R. et al. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum. Mutat. 30, 1237-44 (2009).

[2] Choi, Y. et al. Predicting the functional effect of amino acid substitutions and indels. PLoS One 7, e46688 (2012).

[3] MacArthur, D.G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335 (6070), 823-828 (2012).

[4] Kamburov, A. et al. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Res. 39, D712-717 (2011).



P48. NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN 

DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND 

INTRINSIC DISORDER 

J. Ramiro Lorenzo 1 , Leonardo G. Alonso 2 & Ignacio E. Sánchez 1* . 

Protein Physiology Laboratory, Facultad de Ciencias Exactas y Naturales and IQUIBICEN - CONICET, Universidad de 

Buenos Aires, Argentina 1 ; Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and 

IIBBA - CONICET, Buenos Aires, Argentina 2 . *isanchez@qb.fcen.uba.ar 

Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a 

molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein 

local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins, in the absence of structural data, from sequence-based predictions of secondary structure and intrinsic disorder. NGOME may help the user identify deamidation-prone

asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological 

processes. 

INTRODUCTION 

Protein deamidation is a post-translational modification in 

which the side chain amide group of a glutamine or 

asparagine (Asn) residue is transformed into an acidic 

carboxylate group. Deamidation often, but not always, 

leads to loss of protein function [1,2]. Deamidation rates in

proteins vary widely, with halftimes for particular Asn 

residues ranging from several days to years. In contrast 

with the ubiquity and importance of Asn deamidation, 

there is currently no publicly available algorithm for the 

prediction of Asn deamidation. A structure-based algorithm was published [3], but is no longer available online

and is not useful for proteins of unknown structure or 

those that are intrinsically disordered. 

METHODS 

Dataset. We collected from the literature experimental 

reports of deamidation of Asn residues in proteins using 

mass spectrometry or Edman sequencing. Since 

deamidation rates depend strongly on pH and temperature, 

we only included experiments at neutral or slightly basic 

pH and up to 313K. An Asn residue was considered a 

positive if unequivocal change to aspartic or isoaspartic 

residue was observed. Asn residues for which direct 

experimental evidence was not obtained were not taken 

into account. 

NGOME training. We trained the algorithm by randomly 

splitting the dataset into training and test sets 100 times, 

while keeping a similar number of positive and negative 

Asn-Xaa dipeptides in the two sets. For each splitting, we 

selected the weights for disorder [4] and alpha helix prediction [5] in the NGOME algorithm to maximize the area

under the ROC curve for the training set. For the test set, 

the area under the ROC curve for NGOME was larger than 

for sequence-based prediction 97 out of 100 times. Finally, 

we selected the average values of weights for NGOME. 
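The quantity optimized during training, the area under the ROC curve, can be computed directly from scores and labels. The sketch below uses the pairwise (Mann-Whitney) formulation and assumes that a higher score means a residue is more likely to be a positive; the scores are hypothetical, and the way NGOME combines the disorder and helix weights is not reproduced here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical deamidation-propensity scores for five Asn residues.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]  # 1 = deamidation observed, 0 = not observed
auc = roc_auc(scores, labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives.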


Both protein sequence and structure can influence Asn deamidation kinetics. In the absence of secondary and tertiary structure, Asn deamidation rates are governed by the identity of the N+1 amino acid [3]. In model peptides, the Asn-Gly dipeptide is by far the fastest to deamidate, with bulky N+1 side chains generally slowing down the reaction. Several structural features that decrease Asn deamidation rates have also been identified, including alpha helix formation and hydrogen bond formation by the Asn side chain, the N+1 backbone amide and the neighbouring residues [3].

We compiled a database of 281 Asn residues (67 positives and 214 negatives) in 39 proteins to train NGOME. We computed t50 for all Asn in the dataset and generated a ROC curve by considering as positives Asn residues with different values of t50. The area under the ROC curve is larger for the NGOME predictions (0.9640) than for the sequence-based predictions (0.9270) (p-value 6×10⁻³). NGOME also performs better for threshold values yielding few false positives. NGOME can also discriminate between positive and negative Asn-Gly dipeptides, whereas sequence-based prediction cannot: the area under the ROC curve is 0.7051 for the NGOME predictions, larger than the random value of 0.5 for sequence-based prediction (p-value 9×10⁻³). Since NGOME requires only a protein sequence as input and not a three-dimensional structure, we envision that NGOME will be useful to systematically evaluate whole-proteome data and in the study of intrinsically disordered proteins, for which structural data are scarce. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/, in the subpage “Protein and nucleic acid structure and sequence analysis”.

REFERENCES

1. Curnis, F., et al. J Biol Chem 281:36466-36476 (2006).

2. Reissner, K.J. and Aswad, D.W. Cell Mol Life Sci 60:1281-1295 (2003).

3. Robinson, N.E. and Robinson, A.B. Proc Natl Acad Sci U S A 98:4367-4372 (2001).

4. Dosztanyi, Z., et al. Bioinformatics 21:3433-3434 (2005).

5. Cole, C., et al. Nucleic Acids Res 36:W197-201 (2008).



P49. OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL 

MODELS 

Jérôme Renaux 1,* , Alexandros Sarafianos 1 , Kurt De Grave 1 & Jan Ramon 1 . 

Department of Computer Science, KU Leuven. 1 * Jerome.renaux@cs.kuleuven.be 

Targeted proteomics techniques such as Selected Reaction Monitoring (SRM) have become very popular for protein 

quantification due to their high sensitivity and reproducibility. However, these rely on the selection of optimal transitions, 

which are not always known in advance and may require expensive and time-consuming discovery experiments to 

identify. We propose a computer program for the automated identification of optimal transitions using machine learning 

and show encouraging results when compared to a widely used spectral library. 

INTRODUCTION 

A major issue with SRM is knowing which transitions

to monitor in order to maximally detect a specific protein, 

these being different from one protein to another. Good 

candidates are transitions whose chemical properties will 

make them likely to occur and easy to detect by the mass 

spectrometer, while being sufficiently specific indicators 

of their parent protein. 

Traditionally, targeted proteomics assays, which consist of 

lists of ions or transitions to monitor, are designed through 

costly exploratory experiments. Recently, attempts have 

been made to produce software to help design optimal 

assays. These efforts rely to some extent on collaborative

databases of mass spectra which are mined to identify the 

best possible peptides to include in the assays. While 

successful, these approaches still depend on past 

exploratory analyses and on the coverage of the exploited 

databases. Therefore, their performance decreases in cases where such databases cannot be leveraged, such as when dealing with little-studied organisms or rare, low-abundance

proteins. 

We propose an approach called SIMPOPE (Sequence of 

Inductive Models for the Prediction and Optimization of 

Proteomics Experiments) that models all the steps of the 

typical tandem mass spectrometry (MS/MS) workflow in 

order to accurately predict the properties of peptide and 

fragment ions within a given proteome, and subsequently 

identify optimal assays among them. 

METHODS 

SIMPOPE consists of a sequential suite of predictive 

models for each step of the MS/MS workflow. It exploits 

knowledge from public databases and combines it with the 

generalizing power of machine learning models to 

compensate for noisy or missing data. All models are probabilistic, which makes it possible to keep track of the inherent uncertainty of the successive predictions and to weight the results accordingly for the assay prediction.

Enzymatic cleavage is modelled using CP-DT (Fannes et

al., 2013), which models the behaviour of the trypsin 

enzyme using random forests. Retention time prediction is 

achieved using the Elude tool from the Percolator suite 

(Moruz et al., 2010). The charge distribution of 

electrospray precursor ions is also modelled using random 

forests trained on experimental data mined from PRIDE 

(Vizcaino et al., 2013). Fragmentation patterns and 

product ion intensity are predicted with the help of random 

forest models trained on MS-LIMS data (Degroeve & 

Martens 2013; De Grave et al., 2014). Finally, prior 

knowledge about the abundance of proteins within a given 

proteome is incorporated as prior probabilities, obtained 

when available from PaxDB. 

On the human proteome, these steps yield a total of 321 

000 000 transitions together with their relevant chemical 

properties. We then compute a score for every single 

transition, based on these properties and on their aliasing 

with other transitions in terms of Q1 and Q3 m/z. 


We validated our approach by computing scores for 2000 

reference transitions from the SRMAtlas database (Picotti 

et al., 2014). Based on these scores, we can rank the 

reference transitions among all possible transitions. 

Intuitively, reference transitions should score well and therefore have a numerically low rank (ideally, in the top five). Based

on the average number of transitions per protein in our 

reference set, a perfect median rank would be 3.2, while a 

totally random scoring system should yield a median rank 

of 151. The approach we propose achieved a median rank 

of 15, signifying that using our scoring method, 50% of 

the reference transitions are ranked in the top 15. This 

result is encouraging as it shows that the scores predicted 

by SIMPOPE do correlate with the quality of the 

transitions. We can subsequently use that score as a 

feature to train an additional model on top of the ones 

described here to refine the assay prediction process 

(further results on the poster). 
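The rank-based validation described above can be sketched as follows. The sketch assumes that transitions are ranked per protein by descending score and that ranks are 1-based; the abstract does not spell out these details, so they are assumptions, as are the function name, identifiers and toy scores.

```python
from statistics import median


def reference_ranks(scores_by_protein, reference_by_protein):
    """Return the 1-based rank of every reference transition.

    scores_by_protein: {protein: {transition_id: score}}, higher is better.
    reference_by_protein: {protein: [transition_id, ...]} reference transitions.
    Ranks are computed among all candidate transitions of the same protein.
    """
    ranks = []
    for protein, references in reference_by_protein.items():
        scores = scores_by_protein[protein]
        ordered = sorted(scores, key=scores.get, reverse=True)
        position = {t: i + 1 for i, t in enumerate(ordered)}
        ranks.extend(position[t] for t in references)
    return ranks


# Toy example (hypothetical identifiers and scores).
scores_by_protein = {
    "P1": {"t1": 0.9, "t2": 0.2, "t3": 0.7, "t4": 0.1},
    "P2": {"u1": 0.3, "u2": 0.8, "u3": 0.5},
}
reference_by_protein = {"P1": ["t1", "t3"], "P2": ["u3"]}

ranks = reference_ranks(scores_by_protein, reference_by_protein)
median_rank = median(ranks)
```

A median rank close to the number of reference transitions per protein indicates that the scoring places reference transitions near the top of the candidate list.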

REFERENCES 

Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak 

intensity prediction. Bioinformatics, 29, pp.3199–203 (2013). 

Fannes, T. et al. Journal of Proteome Research, 12(5), pp.2253–2259 

(2013). 

De Grave, K. et al. Prediction of peptide fragment ion intensity: a

priori partitioning reconsidered. International Mass Spectrometry 

Conference 2014, (2014). 

Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust 

calibration of retention time models for targeted proteomics. Journal 

of Proteome Research, 9(10), pp.5209–5216 (2010). 

Picotti, P. et al. A complete mass-spectrometric map of the yeast 

proteome applied to quantitative trait analysis. Nature, 494(7436), 

pp.266–270 (2014). 

Vizcaino, J. a. et al. The Proteomics Identifications (PRIDE) database 

and associated tools: status in 2013. Nucleic Acids Research, 41(D1), 

pp.D1063–D1069 (2013). 



P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION 

ACROSS MULTIPLE MICROBIAL GENOMES 

Alex Salazar 1,2 & Thomas Abeel 1,2* . 

Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands 1 ; Genome Sequencing and Analysis 

Program, Broad Institute of MIT and Harvard 2 . * T.Abeel@tudelft.nl 

Comparing large structural variants—such as large insertions and deletions (indels)—across multiple genomes can reveal 

important insights in microbial organisms. Unfortunately, most studies that compare sequence variants only focus on 

single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap normalization across all samples will ensure uniform breakpoint coordinates of identical large variants, which can then be incorporated into existing association pipelines.

INTRODUCTION 

Structural sequence variants, such as large insertions and deletions (indels), along with small sequence variants (e.g.

single nucleotide variants and small indels) can enable more 

robust comparisons of microbial populations. Unfortunately, 

limitations in variant calling methods restrict investigations to 

compare only small variants across multiple microbial 

genomes—thereby ignoring larger variants (e.g. indels of size 

greater than 50nt). The recent development of structural 

variant detection tools now provides an opportunity to

compare and associate large indels with phenotype and 

population structure across a collection of samples. However, 

these tools have only been benchmarked against a single 

genome and their ability to consistently call large events 

across multiple genomes remains uncharacterized. 

METHODS 

In this study, we systematically benchmarked the robustness 

of large indel identification across multiple genomes using 

four recently developed structural variant detection tools:

Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014), 

BreakSeek (Zhao et al., 2015), and MindTheGap (Rizk et al., 

2014). Using a manually-curated reference genome for 

M. tuberculosis (H37Rv), we simulated nearly 10,000 

deletions and 8,000 insertions, ranging from 50nt to 550nt. Overall, the simulation experiment resulted in a total of 1.6 million expected deletions and 1.3 million expected

insertions when we aligned short-reads from a data set of 161 

clinical strains of M. tuberculosis (Zhang et al., 2013). 

After identifying the simulated indels using the variant 

detecting tools, we used a distance test to investigate each 

tool’s robustness in breakpoint and genotype prediction. For 

each simulated indel prediction, we computed the distance of 

the predicted breakpoint coordinate to the expected 

breakpoint coordinate. We also calculated a genotype 

similarity score using the Damerau-Levenshtein distance. 
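The two measures of the distance test can be sketched in a few lines. The sketch assumes that the breakpoint offset is the absolute coordinate difference and that the similarity score is one minus the edit distance divided by the longer sequence length; neither normalization is stated in the abstract. The edit distance implemented here is the optimal string alignment variant (restricted Damerau-Levenshtein), chosen for brevity.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def genotype_similarity(predicted_seq, expected_seq):
    """Similarity in [0, 1]; 1 means identical sequences (assumed normalization)."""
    longest = max(len(predicted_seq), len(expected_seq))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(predicted_seq, expected_seq) / longest


def breakpoint_offset(predicted_pos, expected_pos):
    """Absolute distance in nucleotides between predicted and expected breakpoints."""
    return abs(predicted_pos - expected_pos)


# Toy calls (hypothetical coordinates and sequences).
offset = breakpoint_offset(1005, 1000)
similarity = genotype_similarity("ACGT", "ACTG")
```

An offset of zero corresponds to an exact breakpoint call, and a similarity of 1 to a predicted sequence identical to the simulated one.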


We found that all tools are able to precisely predict the breakpoint coordinate of the same large event present across multiple genomes. For deletions, Breseq and BreakSeek consistently identified more than 96% of all simulated deletions regardless of size. For Pilon, this number ranged from 87% to 93% and correlated with decreasing deletion size. Breseq and Pilon correctly predicted the exact breakpoint coordinate for about two-thirds of all identified simulated indels. For BreakSeek calls, this number ranged from 1% to 7% and inversely correlated with increasing deletion size.

For insertions, MindTheGap consistently identified approximately 97% of all simulated insertions, but Pilon’s performance worsened: the number of insertions that it identified ranged from 69% to 93%, and again we observed a direct correlation between missed calls and increasing insertion size. Both tools correctly predicted the exact breakpoint coordinate for about two-thirds of all identified simulated indels. Nevertheless, we found that 99% of the predicted breakpoint coordinates made by the four tools were within 10nt of the expected breakpoint coordinate.

Our results also indicate that Pilon, Breseq, BreakSeek, and 

MindTheGap are robust when predicting the genotype of 

large indels across multiple samples. The large majority of 

identified simulated deletions had a size and genotype 

similarity of more than 98%. For insertions, the size similarity 

varied widely in both MindTheGap and Pilon calls, 

indicating that both tools have a difficult time 

determining the exact length of an insertion sequence. 

Overall, these results show that breakpoint detection is 

precise when identifying deletions and insertions of any size. 

Therefore, a simple normalization procedure, such as leftmost-overlap 

normalization across samples, will ensure 

consistent breakpoint location for identical large events. This 

will enable researchers to incorporate large variants into 

existing association pipelines, opening novel opportunities to 

associate large variants with phenotype and population 

structure. 
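
One way to read the normalization suggested above is sketched below. The abstract does not define the procedure, so everything here is an assumption: calls of the same type are sorted by position, calls lying within a tolerance of the leftmost call in a group are treated as the same event, and all of them are assigned that leftmost coordinate. The default tolerance of 10nt is borrowed from the observation that 99% of predictions fell within 10nt of the expected breakpoint.

```python
from collections import defaultdict


def normalize_breakpoints(calls, tolerance=10):
    """Assign each call the leftmost coordinate of its group.

    calls: list of (sample, kind, position) tuples, e.g. ("s1", "DEL", 1003).
    Returns a list in the same order with normalized positions.
    """
    by_kind = defaultdict(list)
    for idx, (_sample, kind, pos) in enumerate(calls):
        by_kind[kind].append((pos, idx))

    normalized = [None] * len(calls)
    for kind, items in by_kind.items():
        items.sort()
        anchor = None
        for pos, idx in items:
            # Start a new group when the call is too far from the current anchor
            if anchor is None or pos - anchor > tolerance:
                anchor = pos
            normalized[idx] = (calls[idx][0], kind, anchor)
    return normalized
```

Deletions called at 1003, 1000 and 1008 in three samples would all be reported at 1000, so that an association pipeline sees them as a single variant.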

REFERENCES 

Barrick,J.E. et al. (2014) Identifying structural variation in haploid 

microbial genomes from short-read resequencing data using breseq. 

BMC Genomics, 15, 1039. 

Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of 

short and long insertions. Bioinformatics, 30, 1–7. 

Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive 

microbial variant detection and genome assembly improvement. 

PLoS One, 9, e112963. 

Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium 

tuberculosis isolates from China identifies genes and intergenic 

regions associated with drug resistance. Nat. Genet., 45, 1255–60. 

Zhao,H. and Zhao,F. (2015) BreakSeek: a breakpoint-based algorithm for 

full spectral range INDEL detection. Nucleic Acids Res., 1–13. 




10th Benelux Bioinformatics Conference Poster 


P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES 

FOR PREDICTING CLINICAL CODES 

Elyne Scheurwegs 1,3* , Kim Luyckx 2 , Léon Luyten 2 , Walter Daelemans 3 & Tim Van den Bulcke 1 . 

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Antwerp University Hospital 2 ; Center 

for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp 3 ; * elyne.scheurwegs@uantwerpen.be 

Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to 

various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both 

in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple 

days during the stay). This work studies the complementarity of information derived from these different sources to 

enhance clinical code prediction. 

INTRODUCTION 

The increased accessibility of healthcare data through the 

large-scale adoption of electronic health records stimulates 

the development of algorithms that monitor hospital 

activities, such as clinical coding applications. 

Clinical coding consists of the translation of information 

found in a patient file to diagnostic and procedural codes 

originating from a medical ontology. 

In our work, we investigate if unstructured (textual) and 

structured data sources, present in electronic health 

records, can be combined to assign clinical diagnostic and 

procedural codes (specifically ICD-9-CM) to patient stays. 

Our main objective is to evaluate if integrating these 

heterogeneous data types improves prediction strength 

compared to using the data types in isolation. 

METHODS 

Several datasets were collected from the clinical data 

warehouse of the Antwerp University Hospital (UZA). 

The resulting dataset consists of a randomized subset of 

anonymized data of patient stays, in 14 different medical 

specialties. Two separate data integration approaches were 

evaluated on each dataset from a medical specialty. 

With early data integration, multiple sources are combined 

prior to training a model. This is achieved by using a 

single bag of features that are given to the prediction 

pipeline. Feature selection is performed with tf-idf for 

unstructured sources, and with gain ratio and minimum 

redundancy, maximum relevance (mRMR) for structured 

sources. 

The late data integration method trains a separate model 

on each data source, and then combines the prediction 

output for each code in a meta-learner. This meta-learner 

is mainly used to find which sources perform best for a 

certain code. 
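
A minimal sketch of late integration for a single clinical code is given below. The abstract does not specify the meta-learner, so a simple stand-in is used here as an assumption: each source is weighted by its F-measure on held-out stays, and the weighted vote is thresholded. The source names and the 0.5 threshold are hypothetical.

```python
def f_measure(truth, predicted):
    """F-measure (harmonic mean of precision and recall) for binary labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_source_weights(truth, source_predictions):
    """Weight each source by its held-out F-measure for this code."""
    scores = {s: f_measure(truth, preds) for s, preds in source_predictions.items()}
    total = sum(scores.values())
    if total == 0:
        return {s: 1.0 / len(scores) for s in scores}
    return {s: v / total for s, v in scores.items()}


def combine(weights, source_predictions, threshold=0.5):
    """Weighted vote over the per-source binary predictions."""
    n_stays = len(next(iter(source_predictions.values())))
    return [
        int(sum(weights[s] * source_predictions[s][i] for s in weights) >= threshold)
        for i in range(n_stays)
    ]
```

In this sketch a source that predicts a code poorly on held-out stays receives a weight near zero, which illustrates why adding an uninformative source has little effect on the combined prediction.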

The prediction task in both approaches was cast as a multiclass 

classification task, in which an array of binary 

predictions was made (one for each clinical code). 


Late data integration improves the prediction of ICD-9-CM 

diagnostic codes in comparison to the best 

individual prediction source (i.e. overall F-measure 

increased from 30.6% to 38.3%). Early data integration 

does not show this trend and only performs well with a 

limited number of combinations of sources. ICD-9-CM 

procedure codes also show this trend, with the exception 

of the RIZIV data source, which shows a better prediction 

when used individually. The predictive strength of the 

models varies strongly between different medical 

specialties. 

The results show that the data sources, independent of 

their structured or unstructured nature, are able to provide 

complementary information when predicting ICD-9-CM 

codes, particularly when combined within the late data 

integration approach. This approach also allows for 

including as many sources as possible, as 

including a source that does not contain any additional 

information barely influences the end result. This is an 

advantage when the information content of a data source is 

not previously known. A disadvantage is the loss of 

information due to the strong generalisation as each data 

source is effectively reduced to a single feature for the 

meta-learner. 

Early data integration seems to suffer when combining 

sources that have features with a largely differing 

information content and different numbers of features. An 

unstructured data source typically renders 30,000 

different, weak features, while a structured source often 

contains only 500 different features. 

CONCLUSIONS 

Models using multiple electronic health record data 

sources systematically outperform models using data 

sources in isolation in the task of predicting ICD-9-CM 

codes over a broad range of medical specialties. 

ACKNOWLEDGEMENT 

This work is supported by a doctoral research grant (nr. 

131137) by the Agency for Innovation by Science and 

Technology in Flanders (IWT). The datasets used in this 

research were made available by the Antwerp University 

Hospital (UZA) for restricted use. 

REFERENCES 

Scheurwegs, E et al. Data integration of structured and unstructured 

sources for assigning clinical codes to patient stays. Journal of the 

American Medical Informatics Association (2015): ocv115. 



P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS 

Jaak Simm 1,2,3* , Adam Arany 1,2 , Sarah ElShal 1,2 & Yves Moreau 1,2 . 

Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data 

Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium 1 ; iMinds Medical IT, Kasteelpark 

Arenberg 10, box 2446, 3001 Leuven, Belgium 2 ; Institute of Gene Technology, Tallinn University of Technology, 

Akadeemia tee 15A, Estonia 3 . * jaak.simm@esat.kuleuven.be 

Scientific publications contain rich information about genetic disorders. Text mining these publications provides an 

automatic way to quickly query and summarize the information. We propose a supervised learning approach that takes 

advantage of the well-known unsupervised approach TF-IDF (term frequency–inverse document frequency) and 

integrates it with a supervised approach using a logistic loss. The preliminary results on the OMIM dataset look 

promising. 

INTRODUCTION 

Scientific publications contain rich information about 

genetic disorders. Text mining these publications provides 

an automatic way to quickly query and summarize the 

information. 

Traditional approaches employ unsupervised text 

mining methods such as TF-IDF (term frequency–inverse 

document frequency) or Latent Dirichlet Allocation 

(LDA; Blei et al., 2003) for linking terms to genes and 

diseases. Beegle (ElShal et al., 2015), a recent text mining 

tool developed for linking diseases and genes, has 

taken this approach using TF-IDF as its similarity metric. 

PROPOSED METHOD 

Our work proposes a supervised learning of the 

importance of the textual terms, which can automatically 

filter out many terms that are unnecessary for the task at 

hand. We formulate it as a prediction of supervised values 

y given the terms for all genes g and all diseases d where i 

is the index of the term: 

and w_i is the weight for the term i and σ is the sigmoid 

function. The main idea is to learn the weight vector w that 

minimizes the difference between known values y and 

predictions. The minimization can be transformed into a 

logistic regression. 
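
The idea of learning term weights with a logistic loss can be sketched as below. The exact formula of the abstract is not reproduced here, so the model form is an assumption: the prediction for a gene-disease pair is taken to be σ(Σ_i w_i · x_gi · x_di), where x_gi and x_di are the scores of term i for the gene and the disease. The function names, learning rate and number of epochs are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pair_features(gene_vec, disease_vec):
    """Assumed per-term feature: product of the gene and disease term scores."""
    return [g * d for g, d in zip(gene_vec, disease_vec)]


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train_weights(features, labels, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic loss (no bias term)."""
    w = [0.0] * len(features[0])
    n = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w
```

In a toy setting where the first term co-occurs with linked pairs and the second appears in both linked and unlinked pairs, the learned weight of the first term ends up positive and that of the second does not, which is the filtering effect described above.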

For the supervised values we use the OMIM database 

(Hamosh et al., 2005). More specifically, y corresponds to 

1 if there is a link between the given gene-disease pair and 

0 if there is no link. Intuitively, in this setup the text 

mining is transformed into a classification problem. We 

use a dataset of 330 OMIM terms and their linked genes and 

randomly sample genes as negatives for each disease. 

For the textual terms we use MEDLINE abstracts as the 

source of biomedical text. We employ MetaMap (Aronson 

& Lang, 2010) to link terms with abstracts. We use GeneRIF 

to link genes with abstracts, and PubMed to link diseases 

with abstracts. We apply a TF-IDF transformation to score 

a term with a given disease or gene based on the abstracts 

linked to each entity. We only use the terms linked to 

abstracts that belong to genes. Hence our vocabulary 

consists of 66,883 terms. 
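
The TF-IDF scoring step can be illustrated with the sketch below. It uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency), which is an assumption, as the abstract does not state which variant was applied. Each "document" stands for the pooled terms of all abstracts linked to one gene or disease; the entity and term names in the usage are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score terms per entity.

    documents: dict mapping an entity (gene or disease) to the list of terms
    found in its linked abstracts. Returns dict entity -> {term: score}.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per entity

    scores = {}
    for entity, terms in documents.items():
        counts = Counter(terms)
        total = len(terms)
        scores[entity] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

A term that occurs for every entity gets a score of zero, while a term specific to one entity gets a positive score.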


The preliminary results show that supervised learning 

automatically picks up the keywords that are 

informative, improving the recall of the genes that are 

related to genetic disorders. We will present more detailed 

results in the poster. 

We are also investigating how to integrate the supervised 

approach into Beegle so that it can be used to answer online 

queries. 

REFERENCES 

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet 

allocation. Journal of Machine Learning Research, 3, 993-1022. 

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick, 

V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a 

knowledgebase of human genes and genetic disorders. Nucleic acids 

research, 33(suppl 1), D514-D517. 

ElShal, S., Tranchevent L.C., Sifrim A., Ardeshirdavani A., Davis J., 

Moreau Y. (2015). Beegle: from literature mining to disease-gene 

discovery. Nucleic Acids Res, gkv905. 

Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap: 

historical perspective and recent advances. Journal of the American 

Medical Informatics Association, 17(3), 229-236. 



P53. FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND 

COMPARE CYTOMETRY DATA IN THE BROWSER 

Arne Soete 2 , Sofie Van Gassen 1,2,3 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 . 

Department of Information Technology, Ghent University-iMinds, Ghent, Belgium 1 ; Inflammation Research Center, VIB, 

Ghent, Belgium 2 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 3 . 

We developed FlowSOM Web, a web-tool which visualizes cytometry data based on Self-Organizing Maps. Similar cells 

are clustered and visualized via star charts. This allows us to process and display millions of cells efficiently. 

Additionally, different biological samples (e.g. healthy versus diseased mice) can be compared. 

INTRODUCTION 

Cytometry data describes cell characteristics in 

biological samples. Cells are labeled with fluorescent 

antibodies and a flow cytometer measures the properties 

of millions of cells one by one. Biologists use this 

information to gain more insight into diseases and to 

diagnose patients. Most of them still analyse this data 

manually to differentiate between the different cell types 

present. This is done by plotting the data in 2D scatter 

plots and selecting groups of cells in a hierarchical way. 

This process is called `gating'. Recently, the number of 

properties that can be measured simultaneously has 

strongly increased. As the number of possible 2D scatter 

plots increases rapidly with the number of 

properties measured, it becomes infeasible to analyze 

them all and relevant information that is present in the 

data might be missed. 

METHODS 

We present FlowSOM, a new algorithm for the 

visualization and interpretation of cytometry data (Van 

Gassen et al., 2015). Using a two-level clustering and 

star charts, our algorithm helps to obtain a clear 

overview of how all markers are behaving on all cells, 

and to detect subsets that might be missed otherwise. 

Our algorithm consists of 4 steps: pre-processing the 

data, building a self-organizing map, building a minimal 

spanning tree and computing a meta-clustering result. 
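
Two of the four steps, building the self-organizing map and the minimal spanning tree, can be sketched as follows. This is not the FlowSOM R implementation: the grid size, learning-rate schedule and neighbourhood radius are assumptions chosen to keep the example small, and pre-processing and meta-clustering are left out.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_som(cells, grid_w=2, grid_h=2, epochs=20, seed=0):
    """Train a small self-organizing map; returns one codebook vector per node."""
    rng = random.Random(seed)
    coords = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    nodes = [list(rng.choice(cells)) for _ in coords]
    steps, step = epochs * len(cells), 0
    for _ in range(epochs):
        for cell in rng.sample(cells, len(cells)):
            frac = step / steps
            lr = 0.5 * (1 - frac)                             # decaying learning rate
            radius = max(grid_w, grid_h) / 2 * (1 - frac)     # shrinking neighbourhood
            best = min(range(len(nodes)), key=lambda k: dist(nodes[k], cell))
            for k, node in enumerate(nodes):
                if dist(coords[k], coords[best]) <= radius:
                    for d in range(len(node)):
                        node[d] += lr * (cell[d] - node[d])
            step += 1
    return nodes


def assign(cells, nodes):
    """Index of the closest node for every cell (first clustering level)."""
    return [min(range(len(nodes)), key=lambda k: dist(nodes[k], c)) for c in cells]


def minimal_spanning_tree(nodes):
    """Prim's algorithm on the codebook vectors; returns a list of (i, j) edges."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(nodes):
        i, j = min(
            ((i, j) for i in in_tree for j in range(len(nodes)) if j not in in_tree),
            key=lambda e: dist(nodes[e[0]], nodes[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

Because every cell is reduced to its closest node, the tree is drawn over a fixed number of nodes regardless of how many cells were measured, which is what keeps the visualization tractable for millions of cells.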


Although our results are quite similar to SPADE, another 

state-of-the-art algorithm for the visualization of 

cytometry data, our results can be computed much faster 

and use less memory. By providing star-charts and an 

automatic meta-clustering step, much more information 

can be visualised in a single tree than is done by the 

SPADE algorithm. 

Additionally, multiple states can be compared (e.g. 

healthy versus diseased mice) with one another and the 

differences between the two states can be visualized via 

star-charts. 

At this conference, we would like to demonstrate a 

recently developed web interface to the underlying R 

functionality. This interface allows users to upload cytometry 

data, run the aforementioned analysis, compare different 

cell states and explore the results, via interactive 

visualizations, all from the comfort of the browser. 

FIGURE 1. Example of a FlowSOM star chart. 

REFERENCES 

Van Gassen, S. et al. (2015) FlowSOM: Using self-organizing maps for 

visualization and interpretation of cytometry data. Cytometry Part A, 

87: 636–645. 



P54. TOWARDS A BELGIAN REFERENCE SET 

Erika Souche 1* , Amin Ardeshirdavani 2 , Yves Moreau 2 , Gert Matthijs 1 & Joris Vermeesch 1 . 

Department of Human Genetics, KU Leuven 1 ; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and 

Data Analytics, KU Leuven 2 . * Erika.souche@uzleuven.be 

Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. The simultaneous 

sequencing of a large number of genes leads to the detection of a large number of variants, so the bottleneck has moved 

from sequencing to variant interpretation and classification. Although publicly available databases of variant 

frequencies help distinguish causative mutations from common variants, they often lack population-specific variant 

frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence 

data of unrelated and unaffected individuals to filter out common local variants. However, the 

files/databases are rarely shared and they are mainly based on whole exome data. In this project we demonstrate the 

utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of 

information between genetic centers and mine low coverage whole genome data for common variants. 

INTRODUCTION 

Next-Generation Sequencing (NGS) is increasingly used 

to study and diagnose human disorders. The simultaneous 

sequencing of a large number of genes leads to the 

detection of a large number of variants, so the bottleneck has 

moved from sequencing to variant interpretation and 

classification. Publicly available databases of variant 

frequencies provided by, among others, the Exome 

Sequencing Project (ESP), the 1000 Genomes Project 

(McVean et al., 2012) or dbSNP (Sherry et al., 2001) help 

distinguish causative mutations from common variants, 

identifying up to 78% of variants as common for a Belgian 

exome. However, these data sets often lack population-specific 

variant frequencies and are outperformed by 

databases of local variants. For example, using GoNL 

(The Genome of the Netherlands Consortium, 2014) alone 

allowed the identification of up to 85% of variants as 

common for the same Belgian exome. The fact that the 

GoNL is based on only 498 individuals further highlights 

the importance of building and using population-specific 

databases. 

Such population-specific data can be retrieved from locally 

sequenced individuals that underwent Whole Exome 

Sequencing (WES) or Whole Genome Sequencing (WGS). 

Storing only the frequencies and genotype counts of the 

variants provides a valuable tool for variant classification 

while no sensitive information on the individuals is 

included. 

METHODS 

WES data of 350 unrelated and unaffected individuals 

have been parsed. All samples were analysed in a similar 

way, i.e. reads were aligned to the reference genome with 

BWA (Li & Durbin, 2009) and genotyping was performed 

according to GATK best practices (McKenna et al., 2010; 

DePristo et al., 2011). All samples were genotyped at all 

polymorphic positions using GATK HaplotypeCaller and 

GenotypeGVCFs. For each position, samples with a low-quality 

genotype were considered as not genotyped and 

excluded from the genotype counts. The number of 

alternate alleles, allele counts and genotypes were 

compiled in a population VCF file, in which individual 

genotypes are not accessible. 
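
The aggregation into a population-level record can be sketched as below for a single biallelic position. The quality cut-off of 20 and the input layout are assumptions made for illustration; the abstract does not state the threshold used. AC, AN and AF follow the usual VCF INFO conventions (alternate allele count, total called alleles, and alternate allele frequency).

```python
def population_counts(genotypes, min_quality=20):
    """Summarise per-sample genotypes at one biallelic site.

    genotypes: list of (alt_allele_count, genotype_quality) per sample,
    with alt_allele_count in {0, 1, 2}. Low-quality genotypes are treated
    as not genotyped and excluded from all counts.
    """
    names = ("hom_ref", "het", "hom_alt")
    counts = {name: 0 for name in names}
    for alt_count, quality in genotypes:
        if quality < min_quality:
            continue  # considered as not genotyped
        counts[names[alt_count]] += 1

    called = sum(counts.values())
    alt_alleles = counts["het"] + 2 * counts["hom_alt"]
    total_alleles = 2 * called
    frequency = alt_alleles / total_alleles if total_alleles else None
    return {**counts, "AC": alt_alleles, "AN": total_alleles, "AF": frequency}
```

Only the summary leaves the function, so a file built from such records carries genotype counts and frequencies but no individual genotypes.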

Variant frequencies can also be extracted from low 

coverage WGS. As a pilot we processed the data of 

chromosome 21 of about 4,000 WGS. The mapping was 

performed with BWA (Li & Durbin, 2009) and the BAM 

files were merged per 200 samples. All positions were 

genotyped using freebayes (Garrison & Marth, 2012). 

Genotype information of all locations outside low 

complexity regions was then compiled for all samples 

using the integration of Apache Hadoop, HBase and Hive 

(see poster “Big data solutions for variant discovery from 

low coverage sequencing data, by integration of Hadoop, 

Hbase and Hive”). Several models were then used to 

distinguish real variants from sequencing errors: the Minor 

Allele Frequency (MAF), the transition/transversion ratio, 

the expected number of loci with a MAF of 5%, etc. 
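
Two of the quantities named above can be computed as in the sketch below; the function names are hypothetical and the way the values are turned into a filter is not specified in the abstract. A transition is a purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) change, and every other substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G and C<->T substitutions."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Transition/transversion ratio for a list of (ref, alt) single-base changes."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = sum(
        1 for ref, alt in snvs if ref != alt and not is_transition(ref, alt)
    )
    return transitions / transversions if transversions else None


def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the rarer of the two alleles at a biallelic site."""
    frequency = alt_count / total_alleles
    return min(frequency, 1 - frequency)
```

Since there are twice as many possible transversions as transitions, purely random errors would give a ratio near 0.5, whereas genuine human genome-wide variant sets typically show a ratio around 2; a call set with a markedly lower ratio is therefore likely to contain many errors.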


We demonstrated the effect of our reference set on several 

exomes. The inclusion of only 350 individuals allowed the 

identification of about 3% additional common variants, 

not listed as common by ESP, dbSNP (Sherry et al., 2001), 

1000 Genomes (McVean et al., 2012) and GoNL (The 

Genome of the Netherlands Consortium, 2014). Since only 

the frequencies of the variants in the screened populations 

are reported, this file can easily be shared between 

laboratories. Besides, the procedure used to generate the 

population VCF file can easily be applied to several 

genetic centers in order to generate a common population 

VCF file, as planned within the BeMGI project. 

Finally we expect that the data from WGS will further 

increase the performance of our reference set. A genome-wide 

variant frequency file from the local population will 

become worthwhile when WGS is routinely used in 

diagnostics. 

REFERENCES 

DePristo M et al. Nature Genetics 43, 491-498 (2011). 

Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle, 

WA (URL: http://evs.gs.washington.edu/EVS/). 

Garrison E & Marth G http://arxiv.org/abs/1207.3907 (2012). 

Li H & Durbin R Bioinformatics 25, 1754-60 (2009). 

McKenna A et al. Genome Research 20, 1297-303 (2010). 

McVean et al. Nature 491, 56–65 (2012). 

Sherry ST, et al. Nucleic Acids Res. 29, 308-11 (2001). 

The Genome of the Netherlands Consortium. Nature Genetics 46, 

818–825 (2014). 



P55. MANAGING BIG IMAGING DATA FROM MICROSCOPY: 

A DEPARTMENTAL-WIDE APPROACH 

Yves Sucaet 1* , Silke Smeets 1 , Stijn Piessens 1 , Sabrina D’Haese 1 , Chris Groven 1 , Wim Waelput 1 & Peter In’t Veld 1 . 

Department of Pathology 1 , Faculty of Medicine, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium. 

* yves.sucaet@usa.net 

With recent breakthroughs in whole slide imaging (WSI), almost any microscopic material can be digitized in an 

efficient manner. In order to mine these data efficiently, a top-down approach was employed to manage various imaging 

platforms. At Brussels Free University (VUB), we built a centralized infrastructure that integrates a variety of imaging 

platforms (brightfield, fluorescence, multi-vendor formats). With the help of the Pathomation software platform for 

digital microscopy, various datastores and image repositories were integrated. Custom coding was used to interact with 

various vendor-software and server applications, where needed. The end-result is an interconnected network of 

heterogeneous scalable information silos. We currently have two main use cases for WSI: education and biobanking. 

These applications are available to the public via http://www.diabetesbiobank.org. 

INTRODUCTION 

Too often, image analysis and data/image mining projects 

remain stuck in micro-environments because they are 

limited by vendor-specific solutions that neither scale nor 

interact with material from other departments or 

institutions. Successful roll-out of digital histopathology 

therefore requires more than a whole slide scanner. 

If the goal is for an imaging facility to allow a researcher 

to conduct a (microscopic) experiment, then that 

researcher should not be hindered by the imaging platform 

used. Similarly, an instructor integrating digital content 

into his or her course should be able to make the 

materials as accessible as possible to as many students as 

possible. 

At Brussels Free University (VUB), we currently have two 

main use cases for whole slide imaging: education and 

biobanking. We have set these up in such a way that they 

are both scalable and expandable. 

METHODS 

Whole slide imaging (WSI) has recently provided a boost 

to digital capturing of microscopic content (and an 

explosion of data, resulting in a veritable digital treasure 

trove waiting to be explored by bioinformatics). But 

researchers have been digitizing content for a long time 

already through various technologies (mounted cameras, 

inverted fluorescent microscopes with low magnification, 

…). 

We envisioned an environment whereby a researcher can 

manage and view all of the material related to an 

experiment or observation from a single interface, 

irrespective of origin or technology used. 

The following steps were taken to accomplish this: 

• Set up a central server (50TB storage) 

• Centrally store all imaging data and provide mapped drives on the individual workstations to facilitate a smooth transition for end-users 

• Install the Pathomation platform for digital microscopy (PMA.core, PMA.view, PMA.zui) for universal viewing of digital content and to provide a uniform end-user experience 

• Install Pydio (open source) for easy sharing of digital imaging content (integrated with Pathomation’s PMA.core so no duplicate user directories need to be maintained) 

• Build custom portals to highlight specific collections of microscopic content and/or serve specific target audiences 


The centralized digital imaging infrastructure is used by 

various researchers and graduate students. Recently over 

3,000 images were processed and hosted in the course of 

one month. 

Two use cases are worth highlighting: 

• For undergraduate students (Medicine, BMS) we built custom portal websites to supplement their courses in histology and pathology. These sites are available at http://histology.vub.ac.be and http://pathology.vub.ac.be and provide students with (guided) virtual microscopy without the need to install any additional software. 

• We also provide access portals to different specialized biobanks. The Willy Gepts collection represents a historic milestone in diabetes research (http://gepts.vub.ac.be) and is complementary to the Alan Foulis collection (http://foulis.vub.ac.be). Furthermore, the clinical diabetes biobank can now be consulted online, too, via http://www.diabetesbiobank.org. 

CONCLUSION 

Digital histopathology has been around for some time now, 

but often results in heterogeneous data collections. It is 

only now that we are starting to look at integrated approaches to how 

these varied data can best be handled. Digital pathology 

involves much more than the acquisition of a slide scanner. 

We have integrated five different imaging platforms into a 

single architecture. We are storing data from all modalities 

in a single storage facility, and manage it through a single 

access point. The resulting environment assists in 

rendering content to any type of display device, without 

the need for extra software or background information 

concerning the content’s origin. 



P56. ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN 

CANCER GENOMES USING ENHANCER PREDICTION MODELS AND 

MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA 

Dmitry Svetlichnyy 1* , Hana Imrichova 1 , Zeynep Kalender Atak 1 & Stein Aerts 1 . 

Laboratory of Computational Biology, University of Leuven 1 . *dmitry.svetlichnyy@med.kuleuven.be 

The prioritization of candidate driver mutations in the non-coding part of the genome is a key challenge in cancer 

genomics. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on 

their recurrence, non-coding mutations are usually not recurrent at the same position. We aim to tackle this problem 

using machine-learning methods to predict regulatory regions and cancer genome sequences in combination with sample-specific 

chromatin profiles obtained using ChIP-seq against H3K27Ac. 

INTRODUCTION 

Perturbations of gene regulatory networks in cancer cells 

can arise from mutations in transcription factors or cofactors, 

but also from mutations in regulatory regions. 

Prioritizing candidate driver mutations that have a 

significant impact on the activity of a regulatory region is 

a key challenge in cancer genomics. 

METHODS 

We have developed enhancer prediction methods using 

Random Forest classifiers to estimate the Predicted 

Regulatory Impact of a Mutation in an Enhancer 

(PRIME). We find that the recently identified driver 

mutation in the TAL1 enhancer has a high PRIME score, 

representing a “gain-of-target” for the oncogenic 

transcription factor MYB [1]. We trained enhancer models 

for 45 cancer-related transcription factors, and used these 

to score somatic mutations across more than five hundred 

breast cancer genomes. Next, we re-sequenced the genome 

of ten cancer cell lines representing six different cancer 

types (breast, lung, melanoma, ovarian, and colon) and 

profiled their active chromatin by ChIP-seq against 

H3K27Ac. 


Then we integrated these data with matched expression 

data and with the Random Forest model predictions for 

sets of oncogenic transcription factors per cancer type. 

This resulted in surprisingly few high-impact mutations 

that generate de novo regulatory (oncogenic) activity at 

the chromatin and gene expression level. Our framework 

can be applied to identify candidate cis-regulatory 

mutations using sequence information alone, and to 

samples with combined genome-epigenome-transcriptome 

data. Our results suggest the presence of only a few cis-regulatory 

driver mutations per cancer genome 

that may alter the expression levels of specific oncogenes 

and tumor suppressor genes. 

REFERENCES 

1. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A, 

Durbin AD, et al. An oncogenic super-enhancer formed through somatic 

mutation of a noncoding intergenic element. Science. 2014;346: 1373– 

1377. doi:10.1126/science.1259037 



P57. I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN 

SEQUENCE VISUALIZATION 

Ibrahim Tanyalcin 1,2* , Carla Al Assaf 3 , Alexander Gheldof 1 , Katrien Stouffs 1,4 , Willy Lissens 1,4 & Anna C. Jansen 5,2 . 

Center for Medical Genetics, UZ Brussel, Brussels, Belgium 1 ; Neurogenetics Research Group, Vrije Universiteit Brussel, 

Brussels, Belgium 2 ; Center for Human Genetics, KU Leuven and University Hospitals Leuven, 3000 Leuven, Belgium 3 ; 

Reproduction, Genetics and Regenerative Medicine, Vrije Universiteit Brussel, Brussels, Belgium 4 ; Pediatric Neurology 

Unit, Department of Pediatrics, UZ Brussel, Brussels, Belgium 5 . *ibrahim.tanyalcin@i-pv.org or itanyalc@vub.ac.be 

Summary: Today’s genome browsers and protein databanks supply vast amounts of information about proteins. The 

challenge is to concisely bring together this information in an interactive and easy to generate format. 

Availability and Implementation: We have developed an interactive CIRCOS module called i-PV to visualize user 

supplied protein sequence, conservation and SNV data in a live presentable format. I-PV can be downloaded from 

http://www.i-pv.org. 

INTRODUCTION 

Today’s genome browsers and protein databanks supply 

vast amounts of information about both the structural 

annotation and the single nucleotide variants (SNV) in 

genes. The challenge is to concisely bring together this 

information in an interactive and easy to generate format. 

Thus, we have developed an interactive CIRCOS 

(Krzywinski et al.) module combined with D3 (Bostock et 

al.) and plain JavaScript called i-PV to visualize user 

supplied protein sequence, conservation and SNV data 

while significantly easing and automating input file 

requirements and generation. 

METHODS 

To use i-PV, only 4 text files (with “.txt” extension) have 

to be supplied to the software: conservation scores, 

protein and cDNA sequences, and SNVs/Indels files. 

Protein and cDNA (or mRNA) sequence files are supplied 

in FASTA format, whereas SNV/indel files are provided as 

annotated VCF (Variant Call Format) files. The conservation 

scores are simply an array of numbers separated by newline 

characters. The input files are supplied to i-PV, data are 

automatically checked for errors or duplicates and 

matched against the user provided fasta files, and then an 

interactive html file containing the graph is automatically 

generated as shown in Fig.1. 
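
The kind of consistency check described above can be sketched as follows. This is not i-PV's own code: the function names are hypothetical, and the two rules applied (one conservation score per residue, and a coding sequence of three bases per residue with an optional stop codon) are assumptions about what matching the inputs against the FASTA files could involve. An mRNA sequence with untranslated regions would need a different length rule.

```python
def read_fasta_sequence(text):
    """Concatenate the sequence lines of a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(line for line in lines if not line.startswith(">"))


def read_conservation_scores(text):
    """Parse newline-separated numbers into a list of floats."""
    return [float(line) for line in text.splitlines() if line.strip()]


def check_inputs(protein_fasta, cdna_fasta, scores_text):
    """Return a list of human-readable problems; empty if the inputs agree."""
    protein = read_fasta_sequence(protein_fasta)
    cdna = read_fasta_sequence(cdna_fasta)
    scores = read_conservation_scores(scores_text)

    problems = []
    if len(scores) != len(protein):
        problems.append("conservation score count differs from protein length")
    # Accept a coding sequence with or without a trailing stop codon
    if len(cdna) not in (3 * len(protein), 3 * (len(protein) + 1)):
        problems.append("cDNA length does not match protein length")
    return problems
```

For a two-residue protein, a nine-base cDNA (two codons plus a stop codon) and two scores pass both checks, while a file with a single score is reported as inconsistent.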


Many sequence visualization tools focus on certain aspects 

of proteins such as conservation, variations, sequence 

alignments or topology. While all these tools are very 

useful in their own right, we pursued a more 

interactivity-based design. Therefore, i-PV is not solely designed for 

visualization but also for live presentable graphs and 

information that can selectively be displayed and 

customized. I-PV combines major sources of information 

under one html file that is easy to generate and share on 

both desktop and mobile environments. 

Last but not least, many visualization tools are based on 

a rectangular, scroll-based representation of information, 

which does not deliver a “wide angle” view of the 

sequence data, unlike circular visualization. However, as 

with all other types of visualization, circular graphs also have 

limitations when it comes to 

conveniently zooming in to a particular region or visually 

aligning tracks with different radii. We intend to further 

develop this software with several other features based on 

end user needs. The current version of i-PV can be 

downloaded from http://www.i-pv.org. 

FIGURE 1. Overview of i-PV features. (A) SNVs with mouse over 

explanation and automatically generated dbSNP links (red: Nonsynonymous,

green: Synonymous, gray: Not validated). (B) Console can 

be hidden for publication quality image. (C) Domains are colored based 

on user preference. (D) Conservation data from user generated 

alignment with mouse over information. (E) The user can define which 

amino acids to be shown on the sequence track. (F) Switch the color of 

the background to black. (G) Amino acids are plotted and split into 5 

main categories (nonpolar: gray circle, polar: magenta circle, negative: 

blue triangle, positive: red triangle, aromatic: green hexagon). (H) 

Adjustable conservation score threshold to display regions above a 

certain percentage of maximum conservation score. (I) Font-size of 

chosen amino acids can be adjusted. (J) User selectable amino acids to 

be displayed. (K) Up to 17 different amino acid properties can be chosen 

to be displayed from drop-down menu. (L) Tile track showing SNVs and

indels (red: SNVs, magenta: Indels, gray stroke: Not validated, black: 

collapsed due to over display). (M) Gene Name. (N) Buttons for mass 

selection of amino acids. (O) User defined regions are marked with 

custom name tag and mouse over information. (P) Meta-analysis of 

amino acid distributions. This information is only displayed in case of 

single amino acid comparisons. The log2 ratios are capped between -3 

and 3. The minimum and the maximum BLOSUM62 scores are -4 and 11.

Since the BLOSUM62 matrix is diagonally symmetric, the absolute values of

the log ratios are mapped to this range and a p-value is indicated based

on how close the two scores are. 

REFERENCES 

Bostock, M., et al. (2011), 'D3: Data-Driven Documents', IEEE Trans. 

Visualization & Comp. Graphics (Proc. InfoVis). 

Krzywinski, M., et al. (2009), 'Circos: an information aesthetic for 

comparative genomics', Genome Res, 19 (9), 1639-45. 




Poster 


P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY 

PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS 

Kevin Titeca 1,2 , Pieter Meysman 3,4 , Kris Gevaert 1,2 , Jan Tavernier 1,2 , 

Kris Laukens 3,4 , Lennart Martens 1,2 & Sven Eyckerman 1,2* . 

Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, B-9000 

Ghent, Belgium 2 ; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium 3 ; Biomedical 

informatics research center Antwerpen (biomina), Belgium 4 . sven.eyckerman@vib-ugent.be 

Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-protein

interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because 

of the many false positives. The ideal filter technique for these data is highly accurate, fast and user friendly without the 

need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased. 

Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward 

Filtering INdeX. 

We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX 

shows superior performance over the other approaches with accuracy increases of up to 20%, and is extremely fast. It 

does not require parameter optimization, and is absolutely independent of external resources. Both the algorithm and its 

website interface are highly intuitive with limited need for user input and the possibility of immediate network 

visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist 

interested in user-friendly and highly accurate filtering of AP-MS data. 




Poster 


P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: 

AN EXTREMELY IMBALANCED BIG DATA PROBLEM 

Isaac Triguero 1,2* , Sara del Río 3 , Victoria López 3 , Jaume Bacardit 4 , José M. Benítez 3 & Francisco Herrera 3 . 

VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Department of Computer 

Science and Artificial Intelligence 3 ; School of Computing Science, Newcastle University 4 . 

* Isaac.Triguero@irc.vib-Ugent.be 

The application of data mining and machine learning techniques to biological and biomedical data continues to be a

ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and

store large quantities of data about cells, proteins, genes, etc, that should be processed. Moreover, in many of these 

problems such as contact map prediction, it is difficult to collect representative positive examples. Learning under these 

circumstances, known as imbalanced big data classification, may not be straightforward for most of the standard machine

learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was 

concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches to deal 

with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics

classification problems.

INTRODUCTION 

The prediction of a protein’s contact map is a crucial step 

for the prediction of the complete 3D structure of a protein. 

This is one of the most challenging bioinformatics tasks 

within the field of protein structure prediction because of 

the sparseness of the contacts (i.e. few positive examples) 

and the great amount of data extracted (i.e. millions of 

instances, GBs of disk space) from a few thousand

proteins. 

This problem is an imbalanced bioinformatics big

data application, in which traditional machine learning

techniques become ineffective and inefficient due to

the size of the problem. However, with the use of

emerging cloud-based technologies, these techniques

can be redesigned to extract valuable knowledge from

such amounts of data.

The ECBDL’14 competition (http://cruncher.ncl.ac.uk/

bdcomp/) brought up a data set that modeled the contact 

map prediction problem as a classification task. 

Concretely, the training data set considered was formed by 

32 million instances, 631 attributes, 2 classes, 98% of 

negative examples and it occupies about 56GB of disk 

space. 

In this work we describe the methodology with which we

participated under the name 'Efdamis', which ranked as the

winning algorithm (Triguero et al, 2015).

METHODS 

In the proposed methodology, we focused on the 

MapReduce (Dean et al, 2008) paradigm in order to 

manage this voluminous data set. We extended the 

applicability of some pre-processing and classification 

models to deal with large-scale problems. The methodology is

composed of four main parts:

• An oversampling approach: The goal is to balance the
  highly skewed class distribution of the problem by
  randomly replicating the instances of the minority
  class (del Rio et al, 2014).

• An evolutionary feature weighting method: Due to the
  relatively high number of features of the given problem,
  we developed a feature selection scheme for large-scale
  problems that improves the classification
  performance by detecting the most significant features
  (Triguero et al, 2012).

• Building a learning model: As classifier, we focused
  on a scalable RandomForest algorithm.

• Testing the model: Even the test data can be
  considered big data (2.9 million instances), so
  the testing phase was also deployed within a parallel
  approach.
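The oversampling step in the first part can be illustrated with a small single-machine sketch. It is not the MapReduce implementation used in the competition: the function name `random_oversample`, the binary-class setting and the choice to replicate until both classes are exactly equal in size are all assumptions made for illustration.

```python
import random


def random_oversample(instances, labels, minority_label, seed=0):
    """Replicate random minority instances until both classes are equal in size.

    Assumes a two-class problem; returns new lists and leaves the inputs intact.
    """
    rng = random.Random(seed)
    minority = [x for x, y in zip(instances, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority-class instances to replicate")
    majority_count = len(labels) - len(minority)
    extra = max(0, majority_count - len(minority))
    # Sample with replacement from the minority class.
    replicated = [rng.choice(minority) for _ in range(extra)]
    new_instances = list(instances) + replicated
    new_labels = list(labels) + [minority_label] * len(replicated)
    return new_instances, new_labels


# Toy data with 2 positives among 10 instances (illustrative only).
X = [[i] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

In a MapReduce setting the same idea would have to be applied per data partition or with a global replication rate; how that was done is described in the cited work, not here.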


Table 1 presents the final results of the top 5 participants 

in terms of True Positive Rate (TPR) and True Negative 

Rate (TNR). In this particular problem, the necessity of 

balancing the TPR and TNR ratios emerged as a difficult 

challenge for most of the participants of the competition. 

In this sense, the use of scalable preprocessing techniques 

played an important role in improving the results of the

RandomForest classifier. First, the designed oversampling

approach prevented RandomForest from being

biased towards the negative class. Second, our feature weighting

approach gave us the possibility of reducing the

dimensionality of the problem by selecting the most 

relevant features. Thus, it resulted in a better performance 

as well as a notable reduction of the time requirements. 

Team          TPR      TNR      TPR * TNR
Efdamis       0.73043  0.73018  0.53335
ICOS          0.70321  0.73016  0.51345
UNSW          0.69916  0.72763  0.50873
HyperEns      0.64003  0.76338  0.48858
PUC-Rio_ICA   0.65709  0.71460  0.46956

TABLE 1: Comparison with the top 5 of the competition.
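The last column of Table 1 is the product of the two rates, and the table is sorted by it, which suggests it served as the ranking criterion. The sketch below shows how the three quantities follow from a confusion matrix and checks the reported products against the reported TPR and TNR values. The function name `tpr_tnr_product` and the confusion-matrix counts in the usage line are hypothetical; only the values in `reported` come from Table 1.

```python
def tpr_tnr_product(tp, fn, tn, fp):
    """Return (TPR, TNR, TPR * TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of positives correctly recovered
    tnr = tn / (tn + fp)  # fraction of negatives correctly rejected
    return tpr, tnr, tpr * tnr


# Values copied from Table 1: (TPR, TNR, reported product).
reported = {
    "Efdamis": (0.73043, 0.73018, 0.53335),
    "ICOS": (0.70321, 0.73016, 0.51345),
    "UNSW": (0.69916, 0.72763, 0.50873),
    "HyperEns": (0.64003, 0.76338, 0.48858),
    "PUC-Rio_ICA": (0.65709, 0.71460, 0.46956),
}

# Largest gap between the recomputed and the reported product.
max_gap = max(abs(tpr * tnr - prod) for tpr, tnr, prod in reported.values())

# Hypothetical counts, only to show the function in use.
example = tpr_tnr_product(tp=3, fn=1, tn=6, fp=2)
```

Because the product is low whenever either rate is low, a classifier that simply predicts the majority (negative) class scores zero, which is why balancing the two rates mattered on this data set.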

REFERENCES 

Dean J., Ghemawat S., Mapreduce: simplified data processing on large 

clusters, Commun. ACM 51 (1), 107–113 (2008). 

del Río S., et al., On the use of MapReduce for imbalanced big data using 

random forest, Inf. Sci. 285 (2014) 112–137. 

Triguero I. et al., Integrating a differential evolution feature weighting 

scheme into prototype generation, Neurocomputing 97 (2012) 332– 

343. 




Poster 


P60. COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF

CO-EXPRESSION NETWORKS

Oren Tzfadia 1,2 , Tim Diels 1,2,4 , Sam De Meyer 1,2 , Klaas Vandepoele 1,2 , Yves Van de Peer 1,2,3,5,* & Asaph Aharoni 6 . 

Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium 1 ; Department of Plant Biotechnology and 

Bioinformatics, Ghent University, 9052 Ghent, Belgium 2 ; Genomics Research Institute (GRI), University of Pretoria, 

0028 Pretoria, South Africa 3 ; Department of Mathematics and Computer Science, University of Antwerp, Antwerp, 

Belgium 4 ; Bioinformatics Institute Ghent, Ghent University, 9052 Ghent, Belgium 5 ; Department of Plant Sciences and 

the Environment, Weizmann Institute of Science, Rehovot 6 . 

INTRODUCTION 

Comparative transcriptomics is a common approach in 

functional gene discovery efforts. It allows for finding 

conserved co-expression patterns between orthologous 

genes in closely related plant species, suggesting that these 

genes potentially share similar function and regulation. 

Several efficient co-expression-based tools have been 

commonly used in plant research but most of these 

pipelines are limited to data from model systems, which 

greatly limits their utility. Moreover, none of

the existing pipelines allow plant researchers to make use 

of their own unpublished gene expression data for 

performing a comparative co-expression analysis and 

generating multi-species co-expression networks.

RESULTS 

We introduce CoExpNetViz, a computational tool that 

uses a set of bait genes as an input (chosen by the user) 

and a minimum of one pre-processed gene expression 

dataset. The CoExpNetViz algorithm proceeds in three

main steps: (i) for every bait gene submitted, co-expression

values are calculated using Pearson correlation 

coefficients, (ii) non-bait (or target) genes are grouped 

based on cross-species orthology, and (iii) output files are 

generated and results can be visualized as network graphs 

in Cytoscape. 
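Step (i) can be sketched as follows. This is an illustration under stated assumptions, not the CoExpNetViz implementation: the function names, the use of the absolute correlation and the cutoff of 0.8 are hypothetical choices, and the expression values are toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0
    return cov / (sx * sy)


def coexpressed_with_baits(expression, baits, cutoff=0.8):
    """Map each bait gene to the non-bait genes with |r| >= cutoff."""
    result = {}
    for bait in baits:
        hits = {}
        for gene, profile in expression.items():
            if gene in baits:
                continue
            r = pearson(expression[bait], profile)
            if abs(r) >= cutoff:
                hits[gene] = r
        result[bait] = hits
    return result


# Toy expression matrix: gene -> values across four samples.
expression = {
    "bait1": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the bait
    "geneB": [4.0, 3.0, 2.0, 1.0],  # falls as the bait rises
    "geneC": [1.0, 3.0, 2.0, 2.5],  # weakly related
}
network = coexpressed_with_baits(expression, baits={"bait1"})
```

The resulting bait-to-target dictionary corresponds to the edges of a single-species network; grouping targets by orthology across species (step ii) would be applied on top of it.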

AVAILABILITY AND IMPLEMENTATION 

The CoExpNetViz tool is freely available both as a PHP 

web server (link: 

http://bioinformatics.psb.ugent.be/webtools/coexpr/) 

(implemented in C++) and as a Cytoscape plugin 

(implemented in Java). Both versions of the CoExpNetViz 

tool support Linux and Windows platforms.




Poster 


P61. THE DETECTION OF PURIFYING SELECTION DURING TUMOUR 

EVOLUTION UNVEILS CANCER VULNERABILITIES 

Jimmy Van den Eynden 1* & Erik Larsson 1 . 

Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University 

of Gothenburg, Sweden. * jimmy.van.den.eynden@gu.se 

Identification of somatic mutation patterns indicative of positive selection arguably has become the major goal of cancer 

genomics. This is motivated by a search for cancer driver genes and pathways that are recurrently activated in tumours 

but not normal cells, thus providing possible therapeutic windows. However, cancer cells additionally depend on a large 

number of basic cellular processes, and elevated sensitivity to inhibition of certain essential non-driver genes has been 

demonstrated in some cases. While such vulnerability genes should in theory be identifiable based on strong purifying 

(negative) selection in tumors, these patterns have been elusive and purifying selection remains underexplored in cancer. 

We established a new methodology and, using mutational data from 25 TCGA tumor types, we show for the first time 

that negative selection in candidate vulnerability genes can be detected. 

INTRODUCTION 

Recently it was shown that a hemizygous deletion of the 

well-known tumour suppressor gene TP53 creates

therapeutic vulnerability in colorectal cancer due to 

concomitant loss of the neighbouring gene POLR2A (Liu 

et al., 2015). 

As any damaging mutation occurring in the single allele of 

a hemizygously deleted essential gene, like POLR2A, is 

expected to lead to cell death, we hypothesized that 

purifying selection in these genes could be unveiled by 

demonstrating a lower number of damaging mutations 

than could be expected in the absence of any selection.

Therefore we used the POLR2A case as a proof-of-concept

to develop a methodology to detect purifying 

selection in large genome sequencing datasets. 

METHODS 

Mutation and copy number data from 25 different cancer

types and 7,871 samples were downloaded from the

TCGA data portal and pooled together in a large pan-cancer

dataset. Different mutational functional impact 

scores were calculated using Annovar. Copy number data 

were analyzed using Gistic 2.0 to differentiate POLR2A 

copy number neutral from hemizygously deleted samples. 


POLR2A was found to be hemizygously deleted in 29% of 

all samples. As expected, in over 99% this deletion was 

part of the TP53 (driving) deletion on chromosome 17. 

POLR2A was mutated 228 times in 2.3% of all samples. 

While 14 nonsense mutations and small out-of-frame 

insertions or deletions occurred in the copy number 

neutral group, none of these damaging mutations were 

found in the deletion group (p=0.03, Fisher's exact test),

suggesting purifying selection against this type of 

mutations. 
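The comparison of truncating mutations between the two groups is a 2x2 contingency problem. The sketch below implements a one-sided Fisher's exact test from the hypergeometric distribution. The function name `fisher_exact_less` is hypothetical, the abstract does not state whether the reported test was one- or two-sided, and the counts in the usage lines are toy numbers, since the number of non-truncating mutations per group is not given here.

```python
from math import comb


def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X <= a), where X is the top-left cell under the hypergeometric
    null distribution with all row and column totals fixed.
    """
    row1 = a + b          # e.g. mutations in the deletion group
    col1 = a + c          # e.g. truncating mutations in both groups
    n = a + b + c + d
    lowest = max(0, col1 - (c + d))  # smallest feasible top-left count
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(lowest, a + 1)
    )
    return numerator / comb(n, col1)


# Toy table (not the study's data): rows are deletion / copy number neutral,
# columns are truncating / other mutations.
p_depleted = fisher_exact_less(0, 10, 5, 5)
p_trivial = fisher_exact_less(5, 5, 0, 10)
```

A small `p_depleted` indicates fewer truncating mutations in the first group than expected by chance, which is the pattern interpreted above as purifying selection.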

Next to these truncating mutations, also missense 

mutations that have a damaging effect on the gene’s 

protein function are expected to be selected against. 

Therefore we predicted the functional impact of all 

mutations using different functional impact scores. The 

median (PolyPhen-2) functional impact score was found 

to be significantly lower in the deletion group compared to

the copy number neutral group (p=0.002, Wilcoxon test, 

fig.1), further confirming that purifying selection has 

taken place in POLR2A during tumour evolution. 

These preliminary findings confirm that purifying 

selection is detectable in vulnerability genes like POLR2A 

and this approach could be used to detect other, new 

candidate vulnerability genes. 

FIGURE 1. Negative selection against POLR2A high impact mutations in 

hemizygously deleted tumour samples. 

REFERENCES 

Liu, Y., Zhang, X., Han, C., Wan, G., Huang, X., Ivan, C., … Lu, X. 

(2015). TP53 loss creates therapeutic vulnerability in colorectal 

cancer. Nature, 520(7549), 697–701. 

http://doi.org/10.1038/nature14418 




Poster 


P62. FLOREMI: SURVIVAL TIME PREDICTION 

BASED ON FLOW CYTOMETRY DATA 

Sofie Van Gassen 1,2,3* , Celine Vens 2,3,4 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 . 

Department of Information Technology, Ghent University—iMinds 1 ; VIB Inflammation Research Center 2 ; Department of 

Respiratory Medicine, Ghent University 3 ; Department of Public Health and Primary Care, KU Leuven Kulak 4 .

* sofie.vangassen@irc.vib-ugent.be 

Flow cytometry is a high-throughput technique for single cell analysis. It enables researchers and pathologists to study 

blood and tissue samples by measuring several cell properties, such as cell size, granularity and the presence of cellular 

markers. While this technique provides a wealth of information, it becomes hard to analyze all data manually. To 

investigate alternative automatic analysis methods, the FlowCAP challenges were organized. We will present an 

algorithm that obtained the best results on the FlowCAP IV challenge, predicting the time of progression to AIDS for 

HIV patients. 

INTRODUCTION 

The main task of the most recent FlowCAP IV challenge 

was a survival modeling challenge: participants had to 

predict the time of progression to AIDS for HIV patients, 

based on flow cytometry data of an unstimulated and a 

stimulated blood sample. Additionally, a secondary task 

was the identification of cell populations that could be 

indicative of this progression rate. Several challenges 

needed to be taken into account: the raw dataset was about 

20GB large and about eighty percent of the survival times 

were censored. 

METHODS 

We developed a new algorithm, FloReMi, which 

combined several preprocessing steps with a density based 

clustering algorithm, a feature selection step and a random 

survival forest (Van Gassen et al., 2015). 

The input for our algorithm consisted of 2 flow cytometry 

samples for each patient: one unstimulated PBMC sample 

and one PBMC sample stimulated with HIV antigens. For 

each of these samples, 16 parameters were measured for 

hundreds of thousands of cells. 

First, we included quality control to remove erroneous 

measurements from the samples. We also made an 

automatic selection of live T cells to focus on the cells of 

interest in this specific flow cytometry staining. 

Once the dataset was cleaned up, we extracted features for 

each patient. This was done by clustering the cells using 

the flowDensity (Malek et al., 2015) and flowType 

algorithms (Aghaeepour et al., 2012). These algorithms 

divide the values for each feature into either “high” or 

“low” and use all combinatorial options of “high”, “low” 

or “neutral” marker values to group the cells. This resulted 

in 3^10 different cell subsets.

For each of these subsets, we computed the number of 

cells assigned to it and the mean fluorescence intensity for 

13 markers. Per patient, we collected these numbers for 

both samples and also computed the differences between 

the two. This resulted in a total of 2,480,058 features per 

patient. 
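The reported feature count can be reproduced from the numbers given above. The short calculation below assumes that each of the 3^10 subsets contributes one cell count plus 13 mean fluorescence intensities, and that these values are collected three times per patient (unstimulated sample, stimulated sample, and their difference); the variable names are illustrative.

```python
states_per_marker = 3        # "high", "low" or "neutral"
markers_in_subsets = 10      # exponent of the 3^10 subsets
subsets = states_per_marker ** markers_in_subsets

values_per_subset = 1 + 13   # cell count + mean intensity of 13 markers
blocks_per_patient = 3       # unstimulated, stimulated, difference

features_per_patient = subsets * values_per_subset * blocks_per_patient
```

With these assumptions the product is 59,049 x 14 x 3 = 2,480,058, matching the total stated in the text.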

Because traditional machine learning algorithms cannot 

handle this number of features, we then applied a feature

selection step. To estimate the usefulness of a feature, we 

applied a Cox proportional hazards model on each feature. 

The resulting p-value indicates how well the feature 

corresponds with the known survival times for the training 

set. We ordered the features based on these scores, and 

picked only those that were uncorrelated with the others. 

This resulted in a final selection of 13 features, on which 

we applied several machine learning techniques. We 

compared the results of the Cox Proportional Hazards 

model, the Additive Hazards model and the Random 

Survival Forest. 
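The selection of well-scoring, mutually uncorrelated features can be sketched as a greedy pass over the ranked list. This is a minimal sketch and not the FloReMi code: it assumes the per-feature Cox p-values have already been computed, and the function names and the correlation threshold `max_abs_r` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant feature: treated as uncorrelated
    return cov / (sx * sy)


def select_uncorrelated(features, p_values, max_abs_r=0.2, limit=None):
    """Walk features from best to worst p-value, keeping a feature only if
    it is weakly correlated with every feature already kept."""
    ranked = sorted(features, key=lambda name: p_values[name])
    chosen = []
    for name in ranked:
        if all(
            abs(pearson(features[name], features[kept])) <= max_abs_r
            for kept in chosen
        ):
            chosen.append(name)
        if limit is not None and len(chosen) == limit:
            break
    return chosen


# Toy data: feature name -> value per patient, plus assumed p-values.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.5],    # nearly a copy of f1
    "f3": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with f1
}
p_values = {"f1": 0.001, "f2": 0.002, "f3": 0.01}
selected = select_uncorrelated(features, p_values)
```

Here `f2` is dropped despite its good p-value because it duplicates the information in `f1`, which is the purpose of the correlation filter.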


All three methods performed well on the training dataset. 

However, on the test dataset, both the Cox Proportional 

Hazards model and the Additive Hazards model obtained 

bad results, probably due to overfitting on the training data. 

Only the Random Survival Forest obtained good results on 

the test dataset (Figure 1). This method outperformed all 

other methods submitted to the challenge. 

FIGURE 1. On the training dataset, there was a strong correlation 

between the scores and the actual survival times for all models. On the 

test dataset, only the Random Survival Forest performed well. 

One important challenge remains: the biological 

interpretation of our final features. Although they correlate 

with the transition times from HIV to AIDS, it is hard to 

interpret them as known cell types, due to our 

unsupervised feature extraction. Our method delivers a 

first step towards new insights into the progression from HIV to

AIDS. 

REFERENCES 

Malek M et al. Bioinformatics 31.4, 606-607 (2015). 

Aghaeepour N et al. Bioinformatics 28, 1009-1016 (2012). 

Van Gassen S et al. Cytometry A, DOI 10.1002/cyto.a.22734 




Poster 


P63. STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO 

UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS 

Sebastiaan Vanuytven 1* , Jonas Demeulemeester 1 , Zeger Debyser 1 & Rik Gijsbers 1,2 . 

Laboratory for Molecular Virology and Gene Therapy, KU Leuven 1 ; Leuven Viral Vector Core, KU Leuven 2 . 

* Sebastiaan.vanuytven@student.kuleuven.be 

Integrating retroviral vectors are used to treat genetic and acquired disorders that, theoretically, can be cured by 

introducing specific gene expression cassettes into patient cells. Clinical trials held over the past two decades have 

proven that this approach is effective in curing genetic disorders and can produce better results than the standard therapy 

(Touzot, F et al., 2015). Nevertheless, adverse events in a limited number of patients treated with gamma-retroviral 

vectors have deterred their widespread application. Specifically, vector integration occurring in proximity of proto-oncogenes

resulted in insertional mutagenesis and clonal expansion of the cells (Hacein-Bey-Abina S et al., 2003). 

INTRODUCTION 

Retroviruses and their derived viral vectors do not 

integrate at random. Their overall integration pattern is 

dictated by cellular cofactors that are co-opted by the 

invading viral complex. For gammaretroviral vectors 

(prototype MLV) the cellular bromo- and extraterminal 

domain (BET) family of proteins (BRD2, BRD3 and 

BRD4) tethers the viral integrase to the host cell 

chromatin (De Rijck J et al., 2013). At the moment the 

only available ChIP-seq data derives from HEK-293T 

cells exogenously overexpressing FLAG-tagged versions 

of the BET proteins (LeRoy G et al., 2012). Yet, the 

detailed chromatin binding profile of endogenous BET 

proteins in human cells is currently unknown. Here we 

report on the chromatin occupation of the endogenous 

BET proteins in K562 and human primary CD4+ T cells. 

METHODS 

Following fixation, all three BET proteins were pulled down

with specific antibodies (Bethyl Laboratories, α- 

BRD2: A302-583A; α-BRD3: A302-368A; α-BRD4: 

A301-985A or Abcam ab84776). Subsequently, 1x10^7

cells per sample were processed for ChIP as previously 

described (Pradeepa MM et al., 2012). ChIPed DNA was 

amplified with WGA2 using the manufacturer's protocol 

(Sigma Aldrich). All ChIP experiments were done with at 

least two biological replicates in K562 and CD4+ T cells. 

After processing of the ChIP-seq data, we compared the 

obtained BET protein-binding sites with MLV integration 

sites, histone modifications and other genetic features. 

Furthermore, we used motif discovery in the 

neighbourhood of BET binding sites and MLV integration 

sites to try and discover potential new players in the MLV 

integration process. 
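Comparing binding sites with integration sites amounts to asking what fraction of integration positions falls inside a binding interval. The sketch below shows one way to compute this. The function name `fraction_overlapping`, the half-open interval convention and the toy coordinates are assumptions; the abstract does not specify the overlap criterion or any window size that was used.

```python
from bisect import bisect_right


def fraction_overlapping(sites, intervals):
    """Fraction of (chrom, pos) sites that lie inside any interval.

    intervals: iterable of (chrom, start, end), half-open [start, end).
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    # Merge overlapping intervals per chromosome so a binary search suffices.
    merged = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        out = [list(ivs[0])]
        for start, end in ivs[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])

    hits = 0
    for chrom, pos in sites:
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos < ends[i]:
            hits += 1
    return hits / len(sites) if sites else 0.0


# Toy coordinates (illustrative only).
binding = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
integrations = [("chr1", 120), ("chr1", 299), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
overlap = fraction_overlapping(integrations, binding)
```

In practice such comparisons are usually run with dedicated genome-interval tools; the sketch only makes the definition of the overlap percentage explicit.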


Analysis showed that 24% of the MLV integration sites 

overlap with a BET-binding site in K562 cells, the 

majority of which are BRD4 sites. In addition, BET 

binding sites located in promoter and enhancer regions are 

preferred for MLV integration. Further, evaluation 

demonstrated a strong correlation between MLV integration

in these sites and the occurrence of the 

transcription factor recognition motifs for MAX, GATA2, 

EGR1, GABPA and YY1, suggesting a role for these

proteins or the underlying chromatin structures in 

targeting integration of MLV to these locations in the 

genome via interaction with BET proteins and/or the MLV 

long terminal repeat sequences. Recently, we generated 

MLV-based vectors that no longer recognize BET-proteins, 

BET independent MLV-based (BinMLV) vectors (El 

Ashkar S et al., 2014). Integration preferences of BinMLV 

vectors are shifted away from epigenetic marks associated 

with enhancers and promoters as shown in a PCA analysis, 

but they also associate less with BET and MAX binding 

sites. Even though BinMLV vectors still did not integrate

at random, their distribution can overall be described as

safer, with 3% more integration sites in so-called

genomic "safe-harbor" regions (Sadelain M et al., 2012). 

REFERENCES 

De Rijck J et al. The BET family of proteins targets moloney murine 

leukemia virus integration near transcription start sites, Cell Rep, 5, 

886-894, (2013). 

El Ashkar S et al. BET-independent MLV-based Vectors Target Away 

From Promoters and Regulatory Elements, Mol Ther Nucleic Acids, 

3, e179, (2014). 

Hacein-Bey-Abina S et al. LMO2-associated clonal T cell proliferation in 

two patients after gene therapy for SCID-X1, Science, 302, 415-419, 

(2003). 

LeRoy G et al. Proteogenomic characterization and mapping of 

nucleosomes decoded by Brd and HP1 proteins, Genome Biol, 13, 

R68, (2012). 

Pradeepa MM et al. Psip1/Ledgf p52 binds methylated histone H3K36 

and splicing factors and contributes to the regulation of alternative 

splicing, PLoS Genet, 8, e1002717, (2012). 

Sadelain M, Papapetrou EP and Bushman FD. Safe harbours for the 

integration of new DNA in the human genome, Nat Rev Cancer, 12, 

51-58, (2012). 

Touzot, F et al. Faster T-cell development following gene therapy 

compared with haploidentical HSCT in the treatment of SCID-X1, 

Blood, 125, 3563-3569, (2015). 




Poster 


P64. THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS 

FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO 

THE SOURDOUGH FERMENTATION PROCESS 

Marko Verce, Koen Illeghems, Luc De Vuyst & Stefan Weckx * . 

Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering 

Sciences, Vrije Universiteit Brussel, Brussels, Belgium. * stefan.weckx@vub.ac.be 

The genome of the lactic acid bacterium species Lactobacillus fermentum IMDO 130101, capable of dominating 

sourdough fermentation processes, was sequenced, annotated, and curated. Further, this genome sequence of 2.09 Mbp 

was compared to other complete genomes of different strains of L. fermentum to elucidate the potential of L. fermentum 

IMDO 130101 as a sourdough starter culture strain. As opposed to the other strains, L. fermentum IMDO 130101 

contained unique genes related to carbohydrate import and metabolism as well as a gene coding for a phenolic acid 

decarboxylase and a gene encoding a 4,6-α-glucanotransferase. The latter enzyme activity may result in the production

of isomalto/malto-polysaccharides. All these features make L. fermentum IMDO 130101 attractive for further study as a 

candidate sourdough starter culture strain. 

INTRODUCTION 

Lactobacillus fermentum is a heterofermentative lactic 

acid bacterium often found in fermented food products, 

including sourdough. Strain L. fermentum IMDO 130101, 

a dominant sourdough strain originally isolated from a rye 

sourdough (Weckx et al., 2010) and extensively described 

previously (e.g., Vrancken et al., 2008), was sequenced 

and compared to other L. fermentum strains with 

completed genomes to elucidate unique adaptations of the 

strain studied to the sourdough environment. 

METHODS 

High-quality genomic DNA was used to construct an 8-kb 

paired-end library for 454 pyrosequencing. The 

pyrosequencing reads were assembled using the GS De 

Novo Assembler version 2.5.3 with default parameters. 

Primers for gap closure were designed using CONSED 

23.0, the gaps amplified with polymerase chain reaction 

(PCR) assays and the amplicons sequenced using Sanger 

sequencing. The sequences were imported into CONSED 

23.0 and used to close the gaps. The genome was 

annotated using the automated genome annotation 

platform GenDB v2.2 (Meyer et al., 2003), followed by 

extensive manual curation. Publicly available genome 

sequences of L. fermentum F-6 (Sun et al., 2015), L. 

fermentum IFO 3956 (Morita et al., 2008), and L. 

fermentum CECT 5716 (Jiménez et al., 2010) were 

acquired from RefSeq. Whole-genome comparisons with 

the other three L. fermentum strains and ortholog findings 

were performed using the progressiveMauve algorithm 

(Darling et al., 2010). 


The 2.09 Mbp genome was assembled from 403,466 reads, 

resulting in 74 contigs. No plasmids were found. The 

comparative genome analysis with other strains showed 

that 477 coding sequences were found in L. fermentum 

IMDO 130101 solely (Figure 1). 
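Counting strain-specific and shared coding sequences, as summarised in a Venn diagram, reduces to set operations once the coding sequences have been assigned to ortholog groups. The sketch below assumes such an assignment already exists. The function names and the group identifiers `og1` to `og4` are hypothetical toy data and do not reflect the real genomes.

```python
def strain_specific(cds_by_strain, strain):
    """Ortholog groups present in `strain` and in no other strain."""
    others = set().union(
        *(ids for name, ids in cds_by_strain.items() if name != strain)
    )
    return cds_by_strain[strain] - others


def venn_regions(cds_by_strain):
    """Count ortholog groups for each exact combination of strains."""
    regions = {}
    for group in set().union(*cds_by_strain.values()):
        key = frozenset(
            name for name, ids in cds_by_strain.items() if group in ids
        )
        regions[key] = regions.get(key, 0) + 1
    return regions


# Toy assignment of ortholog groups to the four strains (illustrative only).
cds_by_strain = {
    "IMDO 130101": {"og1", "og2", "og3"},
    "F-6": {"og1", "og4"},
    "IFO 3956": {"og1", "og2"},
    "CECT 5716": {"og1"},
}
unique = strain_specific(cds_by_strain, "IMDO 130101")
regions = venn_regions(cds_by_strain)
```

Each key of `regions` corresponds to one area of a four-set Venn diagram, and the entry for the single-strain key gives the count of coding sequences found in that strain only.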

L. fermentum IMDO 130101 was predicted to be able to 

import and utilise glucose, fructose, xylose, mannose, N- 

acetylglucosamine, maltose, sucrose, lactose and gluconic 

acid via the heterolactic fermentation pathway. Also, the 

ability to degrade raffinose and arabinose was predicted. 

Consumption of glucose, fructose, maltose and sucrose 

was shown in previous research, although growth with 

sucrose as the sole energy source was impaired (Vrancken 

et al., 2008). The strain possibly imports isomaltose and 

maltodextrins, hence elaborating glucose subunits. The 

-glucosidase-encoding gene was not found in the 

genomes of the other three strains considered, and neither 

were the putative maltodextrin import-related genes, the 

trehalose-6-phosphate phosphorylase-encoding gene and a 

putative -glucanase-encoding gene, which all may be 

adaptations of L. fermentum IMDO 130101 to the 

sourdough environment. The presence of the arginine 

deiminase gene cluster was confirmed. Also, L. fermentum 

IMDO 130101 contained a gene for a phenolic acid 

decarboxylase, which may have an impact on sourdough 

aroma. Further, a 4,6-α-glucanotransferase-encoding gene

was present in strain IMDO 130101 solely, which could 

result in isomalto/malto-polysaccharide production, a 

soluble dietary fibre with prebiotic properties. 

Overall, comparative genome analysis revealed metabolic 

traits that are of interest for the use of L. fermentum IMDO 

130101 as a functional starter culture for sourdough 

fermentation processes. 

FIGURE 1. Venn diagram of shared coding sequences between four 

different strains of Lactobacillus fermentum. 

REFERENCES 

Darling et al. PLoS ONE 5, e11147 (2010). 

Jiménez E. et al. J. Bacteriol. 192, 4800-4800 (2010). 

Meyer et al. Nucleic Acids Res. 31, 2187-2195 (2003). 

Morita et al. DNA Res. 15: 151-161 (2008). 

Sun et al. J. Biotechnol. 194, 110-111 (2015). 

Vrancken et al. Int. J. Food Microbiol. 128, 58-66 (2008). 

Weckx et al. Food Microbiol. 27, 1000-1008 (2010). 




Poster 


P65. ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN 

SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC 

PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST 

Ben Verhees 1* , Kris Laukens 1,2 , Stefan Naulaerts 1,2 , Pieter Meysman 1,2 & Xaveer Van Ostade 3 . 

Biomedical informatics research center Antwerpen (biomina) 1 ; Advanced Database Research and Modeling (ADReM), 

University of Antwerp 2 ; Laboratory of Protein Science, Proteomics and Epigenetic Signalling (PPES) and Centre for 

Proteomics and Mass spectrometry (CFP-CeProMa), University of Antwerp 3 . * ben.verhees@student.uantwerpen.be 

Ebola virus is a zoonosis, but its reservoir host has not yet been identified. Recent findings suggest, however, that Mops

condylurus, an insect-eating bat, is a likely candidate. Studying the interactions between Ebola virus and its reservoir 

host could prove highly informative, as reservoir hosts of zoonotic pathogens often appear to tolerate infections with 

these pathogens with little evidence of disease. In this study, a protein-protein interaction network (PPIN) was created 

between Ebola virus and human proteins. Orthology data in Myotis lucifugus – a model organism often used for bat 

studies – was employed to determine which of the human first neighbors of Ebola virus proteins do not possess an 

orthologue in M. lucifugus. Subsequent GO enrichment analysis suggested that these proteins are mostly involved in 

epigenetic processes, and thus we hypothesize that Ebola virus displays reduced interference with epigenetic processes in 

its reservoir host. 

INTRODUCTION 

The idea that bats serve as reservoirs for a wide range of 

zoonotic pathogens has been the topic of much recent 

research. Previous studies on human and bat orthology in 

this context have mainly focused on specific genes, 

important in fighting off viral infection. 

Our study differs, however, in that it focuses on 

proteins the Ebola virus immediately interacts with in 

humans, and the existence of orthologues of these proteins 

in bats. 

METHODS 

Construction of an Ebola virus – human PPIN 

An Ebola virus – human PPIN was constructed from in 

silico data. All network analysis was done using 

Cytoscape v. 3.2.1. 

Orthology analysis 

Identification of orthologues was performed using the 

OMA orthology database, release: September 2015. 
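
The orthology step can be illustrated with a minimal sketch, assuming the network is available as an edge list and the orthology lookup as a set of human proteins that have a bat orthologue. The function name, the interactions and the human protein identifiers below are invented for illustration and are not taken from the PPIN or from the OMA database.

```python
def first_neighbours_without_orthologue(edges, viral_proteins, has_orthologue):
    """Return human proteins that directly interact with a viral protein
    and have no orthologue in the model bat species."""
    neighbours = set()
    for a, b in edges:
        # Edges are undirected, so check both orientations.
        if a in viral_proteins and b not in viral_proteins:
            neighbours.add(b)
        elif b in viral_proteins and a not in viral_proteins:
            neighbours.add(a)
    return sorted(p for p in neighbours if p not in has_orthologue)


# Toy data: the interactions and human identifiers are hypothetical.
edges = [("VP35", "HUMAN_A"), ("HUMAN_B", "VP24"),
         ("HUMAN_A", "HUMAN_C"), ("NP", "HUMAN_D")]
viral = {"VP35", "VP24", "NP"}
with_orthologue = {"HUMAN_A", "HUMAN_C"}

missing = first_neighbours_without_orthologue(edges, viral, with_orthologue)
print(missing)
```

The resulting list is the kind of protein set that would then be passed on to GO enrichment analysis.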

Statistics 

For the statistical analysis, the hypergeometric test was 

performed. 
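
For readers unfamiliar with the hypergeometric test, the sketch below computes its upper-tail p-value using only the Python standard library. The counts in the example are invented and do not correspond to the numbers behind the p-value reported in this abstract; the function name is likewise an assumption.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items without replacement from a
    population of M items, n of which are 'successes'."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


# Hypothetical counts: 20 proteins in the background, 15 of them with an
# orthologue; 5 first neighbours drawn, all 5 with an orthologue.
p_value = hypergeom_sf(k=5, M=20, n=15, N=5)
print(p_value)
```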

GO enrichment 

GO enrichment analysis was performed using ClueGO v. 

1.2.7, a Cytoscape plug-in. Default settings were used, and 

all ontologies/pathways were examined. 

RESULTS


Myotis lucifugus as a model for Mops condylurus 

In this study, Myotis lucifugus was used as a model to 

study interactions between Ebola virus and Mops 

condylurus, its suspected reservoir. 

Ebola virus – human PPIN and orthology in M. 

lucifugus 

An Ebola virus – human PPIN was created, and the human first neighbors of Ebola virus proteins were examined for the existence of orthologues in M. lucifugus. The hypergeometric test revealed an over-representation of human proteins with orthologues in M. lucifugus (p = 0.019). 

GO enrichment suggests reduced interference of Ebola 

virus with epigenetic processes in its reservoir host 

Gene ontology (GO) enrichment analysis was performed on the human first neighbors of Ebola virus proteins that do not possess an orthologue in M. lucifugus. The analysis revealed that these proteins are mostly involved in epigenetic processes (Figure 1). 

FIGURE 1. GO enrichment analysis of human first neighbors of Ebola 

virus proteins which do not possess an orthologue in M. lucifugus. 

Discussion 

Using this novel approach, we have shown, first, that Ebola virus is likely able to interfere with epigenetic processes in humans and, second, that this ability is likely reduced or altered in its reservoir host. 

While the idea that viruses are able to interact with host 

epigenetic mechanisms is fairly recent, over the past few 

years significant research has been done exploring this 

topic. In a comprehensive review, Li et al. (2014) describe 

how specific viral proteins are able to modulate the 

activity of chromatin modification complexes, e.g. HATs, 

HDACs, HMTs, and HDMTs, and even directly bind 

histone proteins. These findings lend support to our results, which suggest that Ebola virus is also able to interact with HDACs, HMTs and several histone proteins in humans. 

REFERENCES 

Li S et al. Rev Med Virol 24, 223-241 (2014). 




Poster 


P66. PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING 

Kenneth Verheggen 1,2,3* , Harald Barsnes 4,5 , Lennart Martens 1,2,3 & Marc Vaudel 4 . 

Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ; 

Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Proteomics Unit, Department of 

Biomedicine, University of Bergen, Norway 4 ; KG Jebsen Center for Diabetes Research, Department of Clinical Science, 

University of Bergen, Norway 5 . *kenneth.verheggen@vib-ugent.be 

The use of proteomics bioinformatics substantially contributes to an improved understanding of proteomes, but this novel 

and in-depth knowledge comes at the cost of increased computational complexity. Parallelization across multiple 

computers, a strategy termed distributed computing, can be used to handle this increased complexity. However, setting 

up and maintaining a distributed computing infrastructure requires resources and skills that are not readily available to 

most research groups. 

Here, we propose a free and open source framework named Pladipus that greatly facilitates the establishment of 

distributed computing networks for proteomics bioinformatics tools. 

INTRODUCTION 

Various modern day bioinformatics-related fields have a 

growing focus on large scale data processing. This 

inevitably leads to an increased complexity, as is 

illustrated by the recent efforts to elaborate a 

comprehensive MS-based human proteome 

characterization (Kim et al., 2014; Wilhelm et al., 2014). 

Such high-throughput, complex studies are becoming 

increasingly popular, but require high performance 

computational setups in order to be analyzed swiftly. 

METHODS 

Here, we present a generic platform for distributed 

proteomics software, called Pladipus. It provides an 

end-user-oriented solution to distribute 

bioinformatics tasks over a network of computers, 

managed through an intuitive graphical user interface 

(GUI). 

Pladipus comes with several modules that work out 

of the box. They include SearchGUI (Vaudel et al., 

2011), PeptideShaker (Vaudel et al., 2015), 

DeNovoGUI (Muth et al., 2014), MsConvert (part of 

Proteowizard (Kessner et al., 2008)) and three 

common forms of the BLAST (Altschul et al., 1990) 

algorithm (blastn, blastp and blastx). These modules can be linked together to set up pipelines tailored to specific needs, including custom in-house algorithms, and the whole pipeline can be executed on an inexpensive, scalable cluster infrastructure without additional cost or expert maintenance. Pladipus can even be set up to allow existing (idle) hardware to hook into the network and participate in the processing. 
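
The idea of chaining modules into a pipeline and handing jobs to whichever worker is free can be sketched as follows. This is not the Pladipus API: local threads stand in for networked machines, and the step functions, file suffixes and sample names are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def convert(sample):
    # Stand-in for a format-conversion step.
    return f"{sample}.mgf"


def search(peak_file):
    # Stand-in for a search-engine step.
    return f"{peak_file}.ids"


def run_pipeline(sample, steps):
    """Apply the steps in order; the output of one step feeds the next."""
    result = sample
    for step in steps:
        result = step(result)
    return result


samples = [f"run{i:02d}" for i in range(1, 5)]
steps = [convert, search]

# Each worker thread picks up the next pending sample, much like idle
# machines pulling jobs from a shared queue.
with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = list(pool.map(lambda s: run_pipeline(s, steps), samples))

print(outputs)
```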

RESULTS & DISCUSSION 

To numerically assess the benefits of using a distributed computing framework, 52 CPTAC experiments (LTQ-Study6: Orbitrap@86) (Paulovich et al., 2010) were searched three times against a protein sequence database (UniProtKB/SwissProt, release 2015_05) on Pladipus networks of various sizes. A selection of three search engines was applied: X!Tandem, Tide and MS-GF+. As expected for a distributed system, the wall time was very reproducible and decreased nearly exponentially with the number of workers. 
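
To give a feel for how wall time depends on the number of workers, the sketch below simulates an idealised shared job queue. It is a toy model, not the benchmark itself: the job count of 156 and the uniform 10-minute duration are assumptions, and network and scheduling overheads are ignored.

```python
import heapq


def wall_time(job_durations, n_workers):
    """Simulate a shared job queue: each job goes to the worker that
    becomes free first. Returns the time at which the last job ends."""
    free_at = [0.0] * n_workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for duration in job_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


jobs = [10.0] * 156  # assumed: 156 equally long jobs of 10 minutes each
for workers in (1, 2, 4, 8):
    print(workers, wall_time(jobs, workers))
```

In this idealised model the wall time falls roughly in inverse proportion to the number of workers, until the job count no longer divides evenly over them.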

FIGURE 1. Benchmarking of a Pladipus network (16 GB RAM, 12 cores, 250 GB disk space, Ubuntu Precise). 

Pladipus is freely available as open source under the permissive Apache 2.0 license. Documentation, including example files, an installer and a video tutorial, can be found at https://compomics.github.io/projects/pladipus.html. 

REFERENCES 

Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. 

Biol., 215, 403–10. 

Kessner,D. et al. (2008) ProteoWizard: open source software for rapid 

proteomics tools development. Bioinformatics, 24, 2534–6. 

Kim,M.-S. et al. (2014) A draft map of the human proteome. Nature, 

509, 575–81. 

Muth,T. et al. (2014) DeNovoGUI: an open source graphical user 

interface for de novo sequencing of tandem mass spectra. J. 

Proteome Res., 13, 1143–6. 

Paulovich,A.G. et al. (2010) Interlaboratory study characterizing a yeast 

performance standard for benchmarking LC-MS platform 

performance. Mol. Cell. Proteomics, 9, 242–54. 

Vaudel,M. et al. (2015) PeptideShaker enables reanalysis of MS-derived 

proteomics data sets. Nat. Biotechnol., 33, 22–24. 

Vaudel,M. et al. (2011) SearchGUI: An open-source graphical user 

interface for simultaneous OMSSA and X!Tandem searches. 

Proteomics, 11, 996–9. 

Wilhelm,M. et al. (2014) Mass-spectrometry-based draft of the human 

proteome. Nature, 509, 582–7. 




Poster 


P67. IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING 

A NETWORK-BASED APPROACH 

Bram Weytjens 1,2,3,4 , Dries De Maeyer 1,2,3,4 & Kathleen Marchal 1,2,4 *. 

Dept. of Information Technology (INTEC, iMINDS), UGent, Ghent, 9052, Belgium 1 ; Dept. of Plant Biotechnology and 

Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium 2 ; Dept. of Microbial and Molecular 

Systems, KU Leuven, Kasteelpark Arenberg 20, B-3001 Leuven, Belgium 3 , Bioinformatics Institute Ghent, Ghent 

University, Ghent B-9000, Belgium 4 . * kathleen.marchal@intec.ugent.be 

Antibiotic resistance is a growing public health concern as the effectiveness of multiple types of antibiotics is decreasing. 

To prevent and combat the further spread of antibiotic resistance in bacteria there is the need to better understand the 

relationship between genetic alterations and the (molecular) phenotype of antibiotic resistant strains. As several (-omics) 

experiments regarding the attainment of antibiotic resistance by bacteria have already been performed and are publicly 

available, we re-analysed a laboratory evolution experiment by Suzuki et al. (2014) in order to demonstrate the 

power of a network-based approach in identifying mutations and molecular pathways driving the resistance phenotype. 

INTRODUCTION 

While network-based approaches are no longer new in 

high-throughput (-omics) analysis, they are not yet widely 

used in standard analysis pipelines. We analysed a dataset 

consisting of multiple E. coli MDS42 strains, each 

independently evolved in the presence of a specific 

antibiotic (10 in total). By adapting PheNetic (De Maeyer et al., 

2013), an algorithm which connects genetic alterations to 

their differentially expressed genes over a genome-wide 

interaction network, we were able to automatically 

identify mutations in genes which are known to induce 

antibiotic resistance. 

METHODS 

For every strain, whole-genome sequencing data and microarray data (eQTL data) were available. By finding the most probable connections between the mutations of every strain and the strain’s respective expression data over a biological network, PheNetic was able not only to uncover potential driver genes and molecular pathways for the resistance phenotype but also to prioritize the identified mutations based on the likelihood that they truly drive the resistance phenotype. Such a network-based approach has the following advantages: 

• Integration of interactomics (network), genomics and transcriptomics data 

• Multiple related datasets can be analyzed together 
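
The general idea of linking mutations to expression changes over a network can be illustrated with a deliberately simplified stand-in. The sketch below is not the PheNetic algorithm, which searches for the most probable connections in a probabilistic network; it only scores each mutated gene by the number of differentially expressed genes reachable within a few steps. The graph and all gene names are hypothetical.

```python
from collections import deque


def reachable_within(graph, source, max_steps):
    """Breadth-first search: nodes reachable from source in <= max_steps edges."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if seen[node] == max_steps:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return set(seen)


def rank_mutations(graph, mutated, diff_expressed, max_steps=2):
    """Score each mutated gene by how many differentially expressed genes
    lie within max_steps of it; highest score first, ties by name."""
    scores = {
        gene: len((reachable_within(graph, gene, max_steps) - {gene}) & diff_expressed)
        for gene in mutated
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy directed interaction network with hypothetical gene names.
graph = {"mutA": ["g1", "g2"], "g1": ["g3"], "mutB": ["g4"]}
ranking = rank_mutations(graph, mutated={"mutA", "mutB"},
                         diff_expressed={"g2", "g3", "g4"})
print(ranking)
```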

FIGURE 1: Part of Amikacin resistance network. 

RESULTS & DISCUSSION 


In the case of amikacin resistance (Figure 1), we uncovered, in two of the four strains, a gain-of-function mutation in cpxA, a gene of a two-component signal transduction system that is known to be involved in amikacin resistance. In the other two strains, deleterious cyoB mutations were found, which are known to lead to intracellular oxidized copper and eventually to multidrug resistance. Furthermore, these genes were ranked highest by PheNetic. 

REFERENCES 

Suzuki S et al. Nat Commun 5, 5792 (2014). 

De Maeyer D et al. Mol Biosyst 9: 1594-1603 (2013). 




Poster 


P68. DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT 

LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING 

Sander Wuyts 1,2* , Eline Oerlemans 1 , Ilke De Boeck 1 , Wenke Smets 1 , Dieter Vandenheuvel, Ingmar Claes 1 & Sarah 

Lebeer 1 . 

Laboratory of Applied Microbiology and Biotechnology, University of Antwerp 1 ; Research Group of Industrial 

Microbiology and Food Biotechnology (IMDO), Vrije Universiteit Brussel 2 * Sander.Wuyts@UAntwerp.be 

Next-Generation Sequencing (NGS) has revolutionized the field of microbial community analysis. Thanks to these high-throughput DNA technologies, microbiologists are now able to perform more in-depth analyses of various microbial communities than with culture-dependent methods. In our lab, we have successfully deployed 16S rDNA amplicon sequencing on the MiSeq platform (Illumina). A bioinformatic pipeline has been built based on mothur (Schloss et al. 

2009), UPARSE (Edgar 2013) and Phyloseq (McMurdie & Holmes 2013) to analyse different microbial community 

datasets. The focus is on functional analysis of lactobacilli and other lactic acid bacteria in different ecological niches: 

ranging from the human upper respiratory tract to naturally fermented plant-based foods. 

INTRODUCTION 

16S metagenomics is a technique that makes use of the highly conserved bacterial 16S rRNA gene. This gene codes for an RNA molecule that is a component of the 30S small subunit of bacterial ribosomes. It contains nine hypervariable regions, flanked by conserved regions for which primer pairs for PCR and sequencing can be designed. Owing to these characteristics and to its slow rate of evolution, this gene has been widely used in bacterial phylogeny and taxonomy. NGS technologies such as Illumina MiSeq have made it possible to study all the different 16S rRNA gene copies from an environmental sample and to use these to identify the bacteria present in the sample. However, the use of these high-throughput technologies comes at a cost: the need for a more in-depth bioinformatic analysis. 

METHODS 

Wetlab: 

DNA is extracted using sample-dependent extraction protocols. A barcoded PCR is performed on the V4 region of the 16S rRNA gene as described in Kozich et al. (2013). For each sample a different set of primers is used; each primer set contains a unique combination of barcodes. The PCR products are cleaned using AMPure XP (Agencourt) bead purification and quantified using Qubit (Life Technologies). All samples are pooled in equimolar amounts into a single library. A negative control (an “empty” DNA extraction) and a positive control (“Mock” communities HM-276D and HM-782D) are always processed together with the samples. The library is sequenced using a dual-index sequencing strategy (Kozich et al. 2013) and a 2 x 250 bp kit on the Illumina MiSeq. 

Bio-informatic analysis: 

Samples are demultiplexed on the MiSeq itself, allowing a 1 bp difference in the barcodes. The general quality of the reads is checked using FastQC (Babraham Bioinformatics). The paired-end reads are merged using mothur’s make.contigs command. Quality control in mothur consists of screen.seqs, alignment to the SILVA database and removal of sequences that do not map to the database, removal of chimeras using chimera.uchime, and removal of sequences that classify to the lineages “Mitochondria” and “Chloroplast”. 
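
The demultiplexing rule mentioned at the start of this step, accepting a barcode with at most one mismatch, can be sketched as below. This is an illustration of the principle only and not the instrument's own implementation; the barcodes, sample names and function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, barcode_to_sample, max_mismatches=1):
    """Return the sample whose barcode is within max_mismatches of the
    observed barcode, or None if no barcode or more than one matches."""
    hits = [
        sample
        for barcode, sample in barcode_to_sample.items()
        if len(barcode) == len(read_barcode)
        and hamming(barcode, read_barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


barcodes = {"ACGTACGT": "sample_1", "TTGCAAGC": "sample_2"}  # hypothetical
print(assign_sample("ACGTACGA", barcodes))  # one mismatch to the first barcode
print(assign_sample("GGGGGGGG", barcodes))  # far from both barcodes
```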

Distances between sequences are calculated using mothur’s dist.seqs command, and the sequences are clustered at 97 % sequence similarity using mothur’s cluster command. Alternatively, the UPARSE clustering algorithm can be used for these last two steps. Sequences are classified using the RDP database and the complete dataset is exported as a .biom file. 
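
What clustering at 97 % sequence similarity means can be shown with a simplified greedy scheme. This sketch is neither mothur's cluster command nor UPARSE: it assumes sequences are already aligned and of equal length, and it assigns each sequence to the first cluster representative it matches. All sequences are synthetic.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.97):
    """Assign each sequence to the first cluster whose representative it
    matches at >= threshold identity; otherwise start a new cluster."""
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for representative, members in clusters:
            if identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


def mutate(seq, positions):
    """Replace the base at each given position with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for pos in positions:
        chars[pos] = swap[chars[pos]]
    return "".join(chars)


reference = "ACGT" * 25                          # 100 bp toy sequence
close = mutate(reference, [3, 50])               # 2 substitutions
distant = mutate(reference, range(0, 100, 10))   # 10 substitutions

clusters = greedy_cluster([reference, close, distant])
print([len(members) for _, members in clusters])
```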

Visualisation and statistical analysis are performed using the R package Phyloseq. This analysis depends on the experimental design but generally consists of a normalisation step (using rarefying, proportions or a statistical mixture model (McMurdie & Holmes 2014)), the calculation of alpha diversity measures, and the calculation and visualisation of beta diversity. 
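
The quantities named in this step can be made concrete with a small standard-library sketch. It is not Phyloseq code; it shows normalisation to proportions, the Shannon index as one common alpha diversity measure, and Bray-Curtis dissimilarity as one common beta diversity measure. The OTU counts are invented.

```python
from math import log


def proportions(counts):
    """Convert raw OTU counts to relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon(counts):
    """Shannon diversity index (natural log) of one sample."""
    return -sum(p * log(p) for p in proportions(counts) if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples, computed on
    relative abundances so that sequencing depth does not dominate."""
    pa, pb = proportions(counts_a), proportions(counts_b)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / (sum(pa) + sum(pb))


sample_a = [50, 30, 20, 0]   # hypothetical OTU counts
sample_b = [10, 10, 10, 10]

print(shannon(sample_a), shannon(sample_b))
print(bray_curtis(sample_a, sample_b))
```

A perfectly even sample of four OTUs, such as `sample_b`, reaches the maximum Shannon index for four taxa, ln(4).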

RESULTS & DISCUSSION 

The method described above was optimised and shown to work. We successfully used this technique to obtain better insights into the role of lactobacilli in different ecological niches, e.g. the murine gastrointestinal tract, vegetable fermentations and the human upper respiratory tract. 

REFERENCES 

Edgar, R.C., 2013. UPARSE: highly accurate OTU sequences from 

microbial amplicon reads. Nature methods, 10(10), pp.996–8. 

Kozich, J.J. et al., 2013. Development of a dual-index sequencing 

strategy and curation pipeline for analyzing amplicon sequence 

data on the MiSeq Illumina sequencing platform. Applied and 

environmental microbiology, 79(17), pp.5112–20. 

McMurdie, P.J. & Holmes, S., 2013. Phyloseq: An R Package for 

Reproducible Interactive Analysis and Graphics of Microbiome 

Census Data. PLoS ONE, 8(4). 

McMurdie, P.J. & Holmes, S., 2014. Waste not, want not: why rarefying 

microbiome data is inadmissible. PLoS computational biology, 

10(4), p.e1003531. 

Schloss, P.D. et al., 2009. Introducing mothur: Open-source, platform-independent, 

community-supported software for describing and 

comparing microbial communities. Applied and Environmental 

Microbiology, 75(23), pp.7537–7541. 




Poster 


P69. HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES 

USING MATRIX FACTORIZATION 

Pooya Zakeri 1,2,* , Jaak Simm 1,2 , Adam Arany 1,2 , Sarah Elshal 1,2 & Yves Moreau 1,2 . 

Department of Electrical Engineering, STADIUS, KU Leuven, Leuven 3001, Belgium 1 ; iMinds Medical IT, Leuven 3001, 

Belgium 2 . * pooya.zakeri@esat.kuleuven.be 

In the last decade, the identification of phenotype-associated genes has received growing attention, yet it remains one of the most challenging problems in biology. In particular, determining disease-associated genes is a demanding process that plays a crucial role in understanding the relationship between disease phenotypes and genes. Typical approaches to gene prioritization often model each disease individually, which fails to capture the common patterns in the data. This motivates us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values. Experimental results on the updated version of the Endeavour benchmark demonstrate that our proposed model can effectively improve on the accuracy of the state-of-the-art gene prioritization model. 

INTRODUCTION 

In biology, there is often a need to identify the most promising genes in a large list of candidate genes for further investigation. While a single data source might not be informative enough, fusing several complementary genomic data sources results in more accurate predictions. Moreover, fusing the phenotypic similarity of diseases and sharing information about known disease genes across both diseases and genes through a multi-task approach enables us to handle gene prioritization for diseases with very few known genes and for genes with limited available information. Typical strategies for hunting phenotype-associated genes often model each phenotype individually [1, 2, 3, 4], which fails to capture the common patterns in the data. This motivates us to formulate the task of hunting phenotype-associated genes as the factorization of an incompletely filled gene-phenotype matrix, where the objective is to predict the unknown values. 

METHODS 

We consider the OMIM database, a database of associations between human genes and disease phenotypes. OMIM focuses on the relationship between human genotype and associated diseases. The OMIM database can be seen as an incomplete matrix in which each row is a gene and each column is a phenotype (disease). 

The idea behind factorizing the M×N OMIM matrix is to represent each row and each column by a latent vector of size D. The OMIM matrix can then be modeled as the product of an M×D gene matrix G and the transpose of an N×D disease matrix P. 
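
As a minimal sketch of completing a matrix by factorization, the code below fits latent vectors to the observed entries of a toy matrix whose rows are genes and whose columns are phenotypes. It uses plain regularised factorization trained by stochastic gradient descent as a simple stand-in; it is not the Bayesian model with side information described in this abstract. The matrix values, hyperparameters and function names are assumptions.

```python
import random


def factorize(observed, n_rows, n_cols, dim=2, epochs=3000, lr=0.05, reg=0.01, seed=0):
    """Fit R[i][j] ~ dot(G[i], P[j]) on the observed entries only,
    using stochastic gradient descent with L2 regularisation."""
    rng = random.Random(seed)
    G = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_rows)]
    P = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_cols)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), value in entries:
            error = value - sum(g * p for g, p in zip(G[i], P[j]))
            for d in range(dim):
                g, p = G[i][d], P[j][d]
                G[i][d] += lr * (error * p - reg * g)
                P[j][d] += lr * (error * g - reg * p)
    return G, P


def predict(G, P, i, j):
    """Predicted association score for gene i and phenotype j."""
    return sum(g * p for g, p in zip(G[i], P[j]))


# Toy 4 x 3 gene-phenotype matrix; entry (1, 1) is left unobserved.
observed = {
    (0, 0): 1.0, (0, 1): 1.0, (0, 2): 0.0,
    (1, 0): 1.0,              (1, 2): 0.0,
    (2, 0): 0.0, (2, 1): 0.0, (2, 2): 1.0,
    (3, 0): 0.0, (3, 1): 0.0, (3, 2): 1.0,
}
G, P = factorize(observed, n_rows=4, n_cols=3)
print(predict(G, P, 1, 1))
```

Because gene 1 behaves like gene 0 on the observed phenotypes, the model is expected to score the missing entry close to the value observed for gene 0.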

Bayesian probabilistic matrix factorization (BPMF) [5] is a well-known method for filling in such an incomplete matrix. However, BPMF uses no side information, which results in an inaccurate completion of the gene-phenotype matrix. 

We propose an extended version of BPMF that can work with multiple sources of side information when completing the gene-phenotype matrix [6], which allows out-of-matrix ranking, i.e. ranking for genes and phenotypes that are not part of the gene-phenotype matrix. In our proposed framework we are also able to integrate both genomic data sources and phenotypic information, whereas earlier approaches for hunting phenotype-associated genes are limited to fusing genomic information only. This modification is made by adding genomic and phenotypic features to the corresponding latent variables [6]. In this study, we consider several genomic data sources, including annotation-based data sources such as UniProt annotation and literature-based data sources on each gene, as well as literature-based phenotypic information on each disease, as in [1, 4, 9]. The framework of our Bayesian data fusion model for gene prioritization is illustrated in Figure 1. 

FIGURE 1. The framework of our Bayesian data fusion model for gene 

prioritization. 

RESULTS & DISCUSSION 


We report the average true positive rate (TPR) when considering the top 1%, 5%, 10% and 30% of the ranked genes. Experimental results on the updated version of the Endeavour [3] benchmark demonstrate that our proposed model can effectively improve on the accuracy of the state-of-the-art gene prioritization model. 
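
One way to compute a TPR at a top-k% cutoff is sketched below, assuming that each held-out true gene receives a rank among a fixed number of candidates. The exact evaluation protocol of the benchmark is not given in this abstract, so the definition, the function name and the ranks are illustrative assumptions.

```python
from math import ceil


def tpr_at_top(ranks, n_candidates, percent):
    """Fraction of held-out true genes whose rank (1 = best) falls
    within the top `percent` % of n_candidates ranked genes."""
    cutoff = ceil(percent * n_candidates / 100)
    return sum(rank <= cutoff for rank in ranks) / len(ranks)


# Hypothetical ranks of five held-out disease genes among 100 candidates.
ranks = [1, 4, 9, 27, 60]
for percent in (1, 5, 10, 30):
    print(percent, tpr_at_top(ranks, 100, percent))
```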

REFERENCES 

[1] Aerts S, et al. Nat Biotech, 24(5), 537–544, (2006). 

[2] De Bie T, Tranchevent LC, van Oeffelen LMM, Moreau Y. Bioinformatics, 23(13):i125-i132, (2007). 

[3] Tranchevent LC, et al. NAR, (35) W377-W384, (2008). 

[4] ElShal S, et al. Davis J. Moreau Y. NAR, (2015). 

[5] Salakhutdinov R and Mnih A. 25th ICML, 880–887. ACM, (2008). 

[6] Simm J, et al. arXiv:1509.04610 [stat.ML], (2015). 




Poster 


P70. THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS 

DISTRIBUTION 

A. Zouaoui 1 , M. Kahli 2 , E. Besnard 3 , R. Desprat 1 , N. Kirsten 4 , P. Ben-sadoun 1 & J.M. Lemaitre 1 . 

Institute for Regenerative Medicine and Biotherapy, France 1 ; Institut de Biologie de l’École Normale Supérieure (ENS), 

France 2 ; The Gladstone Institutes, University of California San Francisco (UCSF), United States 3 ; Helmholtz Zentrum 

München, Research Unit Gene Vectors, Munich, Germany 4 . 

Proliferating cells can undergo an irreversible cell-cycle arrest called cellular senescence, which can induce the development of cancer and ageing. Senescence is characterized by the formation of senescence-associated heterochromatin foci (SAHF) and a decline in DNA replication. When overexpressed, High-Mobility Group A (HMGA) proteins promote SAHF formation and proliferative arrest, and stabilize senescence. 

In a cell, DNA replication is regulated at several genomic sites called replication origins (“Oris”). A pre-replication protein complex is required for DNA replication to occur. Within this pre-replication complex, the ORC1 protein is involved in recognition of the origin of replication. DNA autoradiography of eukaryotic cells showed that human replication origins are bidirectional and spaced at 20-400 kb intervals (Huberman and Riggs, 1968). At each origin, replication forks are formed and new short nascent strands are synthesized. A popular method to map replication origins is the purification of short nascent strands (SNS). Several laboratories have identified up to 50 000 origins using microarray and sequencing techniques. Our laboratory has developed an origin mapping method and applied it to four cell types: IMR90, H9, iPSC and HeLa (Besnard et al., 2012). Short nascent strands were isolated, sequenced and analyzed, and 250 000 origin peaks were identified with a peak detection tool named Sole-Search (Blahnik et al., 2010). 

The objective is to find the most sensitive method for analyzing the origin distribution in proliferative and senescent cells, in order to observe whether senescence has an impact on that distribution. The involvement of HMGA proteins in DNA replication is also investigated. Two new methods, based on two more sensitive tools, are in development to analyze replication origins. In the first method, we search for origin peaks with the MACS2 tool (Zhang et al., 2008), which uses a new statistical model and algorithm. In the second, origin enrichment is assessed with the HOMER tool (Heinz et al., 2010). 

Two methods are currently in development to identify replication origin sites by Illumina GAII sequencing of short nascent strands. Human SNS-seq reads of 36 bp were mapped to the human genome build GRCh38 with the BWA tool (ref). Origin peaks were called with MACS2 and origin enrichment was assessed with HOMER. To compare the two methods, active origins in HeLa cells were detected with each method. The correlation between ORC1 peaks and the identified origins is calculated to choose the most sensitive method. The impact of pre-senescence is assessed by comparing the origin distributions observed in proliferative and senescent cells. Origin distributions are also compared before and after induction of HMGA proteins to investigate the involvement of these proteins in DNA replication during senescence. 
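
One simple way to compare two sets of called origins against ORC1 peaks is to count the fraction of ORC1 peaks that overlap at least one called origin. The sketch below uses this overlap fraction as an assumed stand-in for the correlation mentioned above; the coordinates, method labels and function names are invented for illustration.

```python
def overlaps(a, b):
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_overlapping(reference_peaks, called_peaks):
    """Fraction of reference peaks (e.g. ORC1) overlapped by at least one
    called origin peak. Both inputs map chromosome -> list of (start, end)."""
    total = hits = 0
    for chrom, peaks in reference_peaks.items():
        candidates = called_peaks.get(chrom, [])
        for peak in peaks:
            total += 1
            hits += any(overlaps(peak, c) for c in candidates)
    return hits / total if total else 0.0


# Hypothetical coordinates for illustration only.
orc1 = {"chr1": [(100, 200), (1000, 1100)], "chr2": [(50, 150)]}
method_a = {"chr1": [(150, 400)], "chr2": [(140, 300)]}
method_b = {"chr1": [(1050, 1200)]}

print(fraction_overlapping(orc1, method_a))
print(fraction_overlapping(orc1, method_b))
```

For genome-scale peak sets, a sorted sweep or an interval tree would replace the quadratic scan used here.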

REFERENCES 

Besnard et al. Best practices for mapping replication origins in 

eukaryotic chromosomes. Current Protoc Cell Biol. 2014 Sep 2; 

64:22.18.1-22.18.13 

Besnard et al. Unraveling cell type-specific and reprogrammable human 

replication origin signatures associated with G-quadruplex consensus 

motifs. Nat Struct Mol Biol. 2012 Aug; 19, 837-44 

Blahnik KR, Dou L, O’Geen H, et al. Sole-Search: an integrated analysis 

program for peak detection and functional annotation using ChIP-seq 

data. Nucleic Acids Res. 2010; 38:e13 

Fu H et al. Mapping replication origin sequences in eukaryotic 

chromosomes. Curr Protoc Cell Biol. 2014 Dec 1; 65:22.20.1- 

22.20.17 

Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of 

Lineage-Determining Transcription Factors Prime cis-Regulatory 

Elements Required for Macrophage and B Cell Identities. Mol Cell 

2010 May 28; 38, 576-589 

Huberman JA et al. On the mechanism of DNA replication in 

mammalian chromosomes. J Mol Biol 1968 Mar 14; 32, 327-41 

Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 

(2008) 9 pp. R13 
