10th Benelux Bioinformatics Conference<br />
<strong>bbc</strong> <strong>2015</strong><br />
December 7 - 8, <strong>2015</strong><br />
Antwerp, Belgium<br />
www.<strong>bbc</strong><strong>2015</strong>.be<br />
10th Benelux Bioinformatics Conference<br />
<strong>bbc</strong> <strong>2015</strong><br />
PROCEEDINGS<br />
December 7 and 8, <strong>2015</strong><br />
Antwerp, Belgium<br />
Elzenveld, Lange Gasthuisstraat 45, 2000 Antwerp, Belgium<br />
Welcome to the 10th Benelux Bioinformatics Conference!<br />
Dear attendee,<br />
It is our great pleasure to welcome you to the 10th Benelux Bioinformatics Conference in Antwerp (Belgium)!<br />
We are especially proud to host this conference, for the first time ever, in Antwerp, the diamond city.<br />
Ten years of BBC is worth some celebration. The meeting has always struck the right balance between<br />
strengthening the regional network and offering a scientifically strong program. From its inception 10 years<br />
ago, the BBC has always been a prominent platform for the thriving regional bioinformatics community to<br />
present their latest research. Many young bioinformatics scientists got their first experience of<br />
presenting their work, as a poster or an oral presentation, at one of the BBC editions, and the conference has always attracted a<br />
healthy mix of presenters and attendees from all career stages and diverse backgrounds.<br />
The program of this year's edition again demonstrates the wide range of life science disciplines in which<br />
bioinformatics plays a key role nowadays. First, we are delighted to introduce two eminent keynote speakers:<br />
Cedric Notredame (Center for Genomic Regulation) and Lars Juhl Jensen (Novo Nordisk Foundation Center for<br />
Protein Research). Second, a program committee of 36 scientists has critically reviewed a large number of<br />
submissions and selected 24 authors to deliver an oral presentation. In addition, we have two special<br />
corporate talks. Furthermore, we again have a large number of poster presentations that promise a very<br />
interactive poster session, and our corporate sponsors present their activities at their respective booths. Last<br />
but not least, our special guest Pierre Rouzé will bring us a perspective on the history of bioinformatics and 10<br />
years of Benelux Bioinformatics Conferences.<br />
For this edition, we would like to congratulate 10 (mostly master) students who were selected from a large<br />
pool of submissions to enjoy a student fellowship. For many of them it is their first chance to actively<br />
participate in a scientific conference, and we hope that it inspires them for their future bioinformatics career.<br />
The program also leaves plenty of room for social interaction and networking. The conference dinner,<br />
the coffee and lunch breaks, and the farewell drink are perfect opportunities to strengthen the network even<br />
further.<br />
We cannot close this foreword without a heartfelt word of thanks to the many people who made this<br />
event possible. Thanks to the sponsors for their crucial support, to the keynote speakers and all other<br />
presenters for presenting their work, to the program committee for reviewing many abstracts, and to the many<br />
volunteers and administrative staff of the University of Antwerp for lending a helping hand in many<br />
different ways.<br />
Last but not least, thank you for being here and being part of yet another great BBC edition. We wish you an<br />
enjoyable and very illuminating meeting.<br />
On behalf of the organizing committee,<br />
Kris Laukens & Pieter Meysman<br />
BBC<strong>2015</strong> chairs<br />
University of Antwerp<br />
Special thanks to the BBC <strong>2015</strong> sponsors!<br />
Gold sponsors:<br />
Silver sponsors:<br />
Bronze sponsors:<br />
Affiliations:<br />
Organizing committee<br />
Kris Laukens, University of Antwerp, Belgium<br />
Pieter Meysman, University of Antwerp, Belgium<br />
Geert Vandeweyer, University of Antwerp, Belgium<br />
Yvan Saeys, Ghent University, Belgium<br />
Thomas Abeel, Delft University of Technology, The Netherlands<br />
Programme committee<br />
Thomas Abeel, Delft University of Technology, The Netherlands<br />
Stein Aerts, University of Leuven, Belgium<br />
Francisco Azuaje, Luxembourg Institute of Health, Luxembourg<br />
Gianluca Bontempi, Université libre de Bruxelles, Belgium<br />
Tomasz Burzykowski, Hasselt University, Belgium<br />
Susan Coort, Maastricht University, The Netherlands<br />
Tim De Meyer, Ghent University, Belgium<br />
Jeroen De Ridder, Delft University of Technology, The Netherlands<br />
Dick De Ridder, Delft University of Technology, The Netherlands<br />
Peter De Rijk, University of Antwerp, Belgium<br />
Pierre Dupont, Université catholique de Louvain, Belgium<br />
Pierre Geurts, University of Liège, Belgium<br />
Peter Horvatovich, University of Groningen, The Netherlands<br />
Jan Ramon, University of Leuven, Belgium<br />
Rob Jelier, University of Leuven, Belgium<br />
Gunnar Klau, Centrum Wiskunde & Informatica, The Netherlands<br />
Andreas Kremer, ITTM S.A., Luxembourg<br />
Kris Laukens, University of Antwerp, Belgium<br />
Tom Lenaerts, Université libre de Bruxelles, Belgium<br />
Steven Maere, Ghent University / VIB, Belgium<br />
Lennart Martens, Ghent University / VIB, Belgium<br />
Pieter Meysman, University of Antwerp, Belgium<br />
Perry Moerland, University of Amsterdam, The Netherlands<br />
Pieter Monsieurs, SCK-CEN, Belgium<br />
Yves Moreau, University of Leuven, Belgium<br />
Yvan Saeys, Ghent University / VIB, Belgium<br />
Thomas Sauter, University of Luxembourg, Luxembourg<br />
Alexander Schoenhuth, Centrum Wiskunde & Informatica, The Netherlands<br />
Berend Snel, Utrecht University, The Netherlands<br />
Dirk Valkenborg, VITO, Belgium<br />
Raf Van de Plas, Delft University of Technology, The Netherlands<br />
Vera van Noort, University of Leuven, Belgium<br />
Natal van Riel, Eindhoven University of Technology, The Netherlands<br />
Klaas Vandepoele, Ghent University / VIB, Belgium<br />
Geert Vandeweyer, University of Antwerp, Belgium<br />
Wim Vrancken, Vrije Universiteit Brussel, Belgium<br />
Local Organizing Committee<br />
Charlie Beirnaert, University of Antwerp<br />
Wout Bittremieux, University of Antwerp<br />
Bart Cuypers, University of Antwerp<br />
Nicolas De Neuter, University of Antwerp<br />
Aida Mrzic, University of Antwerp<br />
Stefan Naulaerts, University of Antwerp<br />
The results published in this book of abstracts are under the full responsibility of the authors. The<br />
organizing committee cannot be held responsible for any errors in this publication or potential<br />
consequences thereof.<br />
Conference agenda 1/2<br />
December 6, <strong>2015</strong>: Satellite events<br />
12.30 – 19.00 Student-run satellite meeting at the Institute of Tropical Medicine, Antwerp.<br />
19.00 - … Guided sightseeing tour of Antwerp for early arrivals.<br />
December 7, <strong>2015</strong>: Main Conference<br />
8.30 - 9.30 Registration and welcome coffee.<br />
9.30 - 9.50<br />
Welcome and conference opening, with foreword by UAntwerpen Rector Prof.<br />
Alain Verschoren.<br />
9.50 - 10.50<br />
K1 Invited keynote: Lars Juhl Jensen. Medical data and text mining: Linking<br />
diseases, drugs, and adverse reactions.<br />
10.50 - 11.10 Coffee break.<br />
Selected talks session 1<br />
11.10 - 11.25<br />
O1 Mafalda Galhardo, Philipp Berninger, Thanh-Phuong Nguyen, Thomas Sauter and Lasse<br />
Sinkkonen. Cell type-selective disease association of genes under high regulatory load.<br />
11.25 - 11.40<br />
O2 Andrea M. Gazzo, Dorien Daneels, Maryse Bonduelle, Sonia Van Dooren, Guillaume<br />
Smits and Tom Lenaerts. Predicting oligogenic effects using digenic disease data.<br />
11.40 - 11.55<br />
O3 Wouter Saelens, Robrecht Cannoodt, Bart N. Lambrecht and Yvan Saeys. A<br />
comprehensive comparison of module detection methods for gene expression data.<br />
11.55 - 12.10<br />
O4 Joana P. Gonçalves and Sara C. Madeira. LateBiclustering: Efficient discovery of temporal<br />
local patterns with potential delays.<br />
12.10 - 12.30<br />
C1 Nicolas Goffard. Illumina software platforms to transform the path to knowledge and<br />
discovery. (Corporate presentation: Illumina)<br />
12.30 - 15.00 Lunch break & poster session.<br />
Selected talks session 2<br />
15.00 - 15.15<br />
O5 Robrecht Cannoodt, Katleen De Preter and Yvan Saeys. Inferring developmental<br />
chronologies from single cell RNA.<br />
15.15 - 15.30<br />
O6 Vân Anh Huynh-Thu and Guido Sanguinetti. Combining tree-based and dynamical<br />
systems for the inference of gene regulatory networks.<br />
15.30 - 15.45<br />
O7 Annika Jacobsen, Nika Heijmans, Renée van Amerongen, Martine Smit, Jaap Heringa<br />
and K. Anton Feenstra. Modeling the Regulation of β-Catenin Signalling by WNT stimulation<br />
and GSK3 inhibition.<br />
15.45 - 16.00<br />
O8 Thanh Le Van, Jimmy Van den Eynden, Dries De Maeyer, Ana Carolina Fierro, Lieven<br />
Verbeke, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal.<br />
Ranked tiling based approach to discovering patient subtypes.<br />
16.00 - 16.15<br />
O9 Martin Bizet, Jana Jeschke, Matthieu Defrance, François Fuks and Gianluca Bontempi.<br />
Development of a DNA methylation-based score reflecting Tumour Infiltrating Lymphocytes.<br />
16.15 - 16.30<br />
O10 Aliaksei Vasilevich, Shantanu Singh, Aurélie Carlier and Jan de Boer. Prediction of cell<br />
responses to surface topographies using machine learning techniques.<br />
16.30 - 17.00 Coffee break.<br />
Selected talks session 3<br />
17.00 - 17.15<br />
O11 Wout Bittremieux, Pieter Meysman, Lennart Martens, Bart Goethals, Dirk Valkenborg<br />
and Kris Laukens. Analysis of mass spectrometry quality control metrics.<br />
17.15 - 17.30<br />
O12 Şule Yılmaz, Masa Cernic, Friedel Drepper, Bettina Warscheid, Lennart Martens and<br />
Elien Vandermarliere. Xilmass: A cross-linked peptide identification algorithm.<br />
17.30 - 17.45<br />
O13 Nico Verbeeck, Jeffrey Spraggins, Yousef El Aalamat, Junhai Yang, Richard M. Caprioli,<br />
Bart De Moor, Etienne Waelkens and Raf Van de Plas. Automated anatomical interpretation<br />
of differences between imaging mass spectrometry experiments.<br />
17.45 - 18.00<br />
O14 Yousef El Aalamat, Xian Mao, Nico Verbeeck, Junhai Yang, Bart De Moor, Richard M.<br />
Caprioli, Etienne Waelkens and Raf Van de Plas. Enhancement of imaging mass spectrometry<br />
data through removal of sparse intensity variations.<br />
18.10 - 18.30 Walk to the gala dinner leaving from conference venue.<br />
18.30 - 22.00 Gala dinner at Pelgrom – Pelgrimstraat 15, Antwerpen.<br />
Conference agenda 2/2<br />
December 8, <strong>2015</strong>: Main Conference<br />
8.30 - 9.30 Welcome coffee.<br />
9.30 - 9.40 Opening and announcements.<br />
Selected talks session 4<br />
9.40 - 9.55<br />
O15 Gipsi Lima Mendez, Karoline Faust, Nicolas Henry, Johan Decelle, Sébastien Colin,<br />
Fabrizio Carcillo, Simon Roux, Gianluca Bontempi, Matthew B. Sullivan, Chris Bowler, Eric<br />
Karsenti, Colomban de Vargas and Jeroen Raes. Determinants of community structure in the<br />
plankton interactome.<br />
9.55 - 10.10<br />
O16 Mohamed Mysara, Yvan Saeys, Natalie Leys, Jeroen Raes and Pieter Monsieurs.<br />
Bioinformatics tools for accurate analysis of amplicon sequencing data for<br />
biodiversity analysis.<br />
10.10 - 10.25<br />
O17 Sjoerd M. H. Huisman, Else Eising, Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Arn<br />
van den Maagdenberg and Marcel Reinders. Gene co-expression analysis identifies brain<br />
regions and cell types involved in migraine pathophysiology: a GWAS-based study using the<br />
Allen Human Brain Atlas.<br />
10.25 - 10.40<br />
O18 Ahmed Mahfouz, Boudewijn P.F. Lelieveldt, Aldo Grefhorst, Isabel Mol, Hetty Sips,<br />
Jose van den Heuvel, Jenny Visser, Marcel Reinders and Onno Meijer. Spatial co-expression<br />
analysis of steroid receptors in the mouse brain identifies region-specific<br />
regulation mechanisms.<br />
10.40 - 11.10 Coffee break.<br />
Selected talks session 5<br />
11.10 - 11.25<br />
O19 Bart Cuypers, Pieter Meysman, Manu Vanaerschot, Maya Berg, Malgorzata<br />
Domagalksa, Jean-Claude Dujardin and Kris Laukens. A systems biology compendium for<br />
Leishmania donovani.<br />
11.25 - 11.40<br />
O20 Volodimir Olexiouk, Elvis Ndah, Sandra Steyaert, Steven Verbruggen, Eline De Schutter,<br />
Alexander Koch, Daria Gawron, Wim Van Criekinge, Petra Van Damme and Gerben<br />
Menschaert. Multi-omics integration: Ribosome profiling applications.<br />
11.40 - 11.55<br />
O21 Qingzhen Hou, Kamil Krystian Belau, Marc Lensink, Jaap Heringa and K. Anton<br />
Feenstra. CLUB-MARTINI: Selecting favorable interactions amongst available candidates: A<br />
coarse-grained simulation approach to scoring docking decoys.<br />
11.55 - 12.10<br />
O22 Elien Vandermarliere, Davy Maddelein, Niels Hulstaert, Elisabeth Stes, Michela Di<br />
Michele, Kris Gevaert, Edgar Jacoby, Dirk Brehmer and Lennart Martens. Pepshell:<br />
Visualization of conformational proteomics data.<br />
12.10 - 12.30<br />
C2 Carine Poussin. The systems toxicology computational challenge: Identification of<br />
exposure response markers. (Corporate presentation: sbv IMPROVER)<br />
12.30 - 13.30 Lunch break.<br />
13.30 - 14.30<br />
K2 Invited keynote: Cedric Notredame. Multiple survival strategies to deal with the<br />
multiplication of multiple sequence alignment methods.<br />
Selected talks session 6<br />
14.30 - 14.45<br />
O23 Thomas Moerman, Dries Decap and Toni Verbeiren. Interactive VCF comparison using<br />
Spark Notebook.<br />
14.45 - 15.00<br />
O24 Sepideh Babaei, Waseem Akhtar, Johann de Jong, Marcel Reinders and Jeroen de<br />
Ridder. 3D hotspots of recurrent retroviral insertions reveal long-range interactions with<br />
cancer genes.<br />
15.00 - 15.30 Coffee break.<br />
15.30 - 16.00 K3 Invited keynote: Pierre Rouzé. Thirty years in Bioinformatics.<br />
16.00 - 16.30 Closing and awards.<br />
16.30 - 17.00 Closing reception.<br />
Gala dinner<br />
The gala event will take place at the Pelgrom, a medieval-style restaurant within walking distance of<br />
the Elzenveld conference location, on the evening of Monday December 7th, after the conference<br />
programme, from 18h30 until 22h00. Gala dinner participation is optional, although highly<br />
recommended!<br />
The Pelgrom is one of Antwerp’s most historic eating and drinking places, situated in authentic 15th-century<br />
cellars that were used by merchants for temporary storage during the two big annual<br />
Antwerp fairs. Prepare to feast on a medieval buffet in the style of Antwerp’s Golden Century!<br />
The Pelgrom is within walking distance of the Elzenveld conference location. For people using<br />
public transportation after the gala dinner, the Antwerp-Central train station can easily<br />
be reached by tram from the Groenplaats station (10 minutes) or on foot (20 minutes).<br />
Where? Restaurant Pelgrom, Pelgrimstraat 15, 2000 Antwerp<br />
When? Monday December 7th, <strong>2015</strong>; 18h30 - 22h00<br />
List of abstracts<br />
Keynotes<br />
K1 MEDICAL DATA AND TEXT MINING: LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS<br />
K2 MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS<br />
Corporate presentations<br />
C1 ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO KNOWLEDGE AND DISCOVERY<br />
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS<br />
Selected oral presentations<br />
O1 CELL TYPE-SELECTIVE DISEASE ASSOCIATION OF GENES UNDER HIGH REGULATORY LOAD<br />
O2 PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA<br />
O3 A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS FOR GENE EXPRESSION DATA<br />
O4 LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL PATTERNS WITH POTENTIAL DELAYS<br />
O5 INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL RNA<br />
O6 COMBINING TREE-BASED AND DYNAMICAL SYSTEMS FOR THE INFERENCE OF GENE REGULATORY NETWORKS<br />
O7 MODELING THE REGULATION OF β-CATENIN SIGNALLING BY WNT STIMULATION AND GSK3 INHIBITION<br />
O8 RANKED TILING BASED APPROACH TO DISCOVERING PATIENT SUBTYPES<br />
O9 DEVELOPMENT OF A DNA METHYLATION-BASED SCORE REFLECTING TUMOUR INFILTRATING LYMPHOCYTES<br />
O10 PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES USING MACHINE LEARNING TECHNIQUES<br />
O11 ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS<br />
O12 XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM<br />
O13 AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS<br />
O14 ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS<br />
O15 DETERMINANTS OF COMMUNITY STRUCTURE IN THE PLANKTON INTERACTOME<br />
O16 BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON SEQUENCING DATA FOR BIODIVERSITY ANALYSIS<br />
O17 GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS<br />
O18 SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION MECHANISMS<br />
O19 A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI<br />
O20 MULTI-OMICS INTEGRATION: RIBOSOME PROFILING APPLICATIONS<br />
O21 CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION APPROACH TO SCORING DOCKING DECOYS<br />
O22 PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS DATA<br />
O23 INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK<br />
O24 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL LONG-RANGE INTERACTIONS WITH CANCER GENES<br />
Poster presentations<br />
P1 KNN-MDR APPROACH FOR DETECTING GENE-GENE INTERACTIONS<br />
P2 CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC PATHWAYS IN FUNGI<br />
P3 VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS USING POLIMERO AND POLIMERO-BIO<br />
P4 DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND<br />
P5 BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE<br />
P6 ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING LONG-TERM PATIENT GUT COLONIZATION<br />
P7 XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC<br />
P8 IDENTIFICATION OF NUMTS THROUGH NGS DATA<br />
P9 MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA<br />
P10 FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN BIOLOGICAL INTERPRETATION OF GWAS RESULTS<br />
P11 IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS IN SETS OF FUNCTIONALLY RELATED GENES<br />
P12 PHENETIC: MULTI-OMICS DATA INTERPRETATION USING INTERACTION NETWORKS<br />
P13 THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS<br />
P14 NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM WHOLE GENOME NGS DATA<br />
P15 ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR NANOMATERIAL SAFETY EVALUATION<br />
P16 BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY: SOMETIMES LESS IS MORE<br />
P17 TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS<br />
P18 RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH ALTERNATIVE FUNCTIONALITY IN MUSHROOMS<br />
P19 MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED QUANTITATIVE PROTEOMICS<br />
P20 A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN CANCER<br />
P21 GEVACT: GENOMIC VARIANT CLASSIFIER TOOL<br />
P22 MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT EXPERIMENTS<br />
P23 HIGHLANDER: VARIANT FILTERING MADE EASIER<br />
P24 DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION DATA WITH MULTIPLE DOSES AND TIME POINTS<br />
P25 IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS USING A “DUMMY” LIGAND APPROACH<br />
P26 PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL GENETICALLY MODIFIED CONGENIC MICE<br />
P27 DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA<br />
P28 APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-INDUCED LIVER INJURY<br />
P29 INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION<br />
P30 GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS FROM GENE EXPRESSION DATA<br />
P31 KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT FOR INTRINSICALLY DISORDERED PROTEINS<br />
P32 ON THE LZ DISTANCE FOR DEREPLICATING REDUNDANT PROKARYOTIC GENOMES<br />
P33 THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE<br />
P34 FUNCTIONAL SUBGRAPH ENRICHMENTS FOR NODE SETS IN REGULATORY NETWORKS<br />
P35 HUMANS DROVE THE INTRODUCTION & SPREAD OF MYCOBACTERIUM ULCERANS IN AFRICA<br />
P36 LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS<br />
P37 ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA<br />
P38 MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION<br />
P39 ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS ARABIDOPSIS<br />
P40 RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES<br />
P41 RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE ALIGNMENT<br />
P42 EARLY FOLDING AND LOCAL INTERACTIONS<br />
P43 BINDING SITE SIMILARITY DRUG REPOSITIONING: A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY AND SIDE EFFECTS DETECTION<br />
P44 ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC APPROACH<br />
P45 REPRESENTATIONAL POWER OF GENE FEATURES FOR FUNCTION PREDICTION<br />
P46 ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY PREDICTION<br />
P47 MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF THEIR DELETERIOUS EFFECTS<br />
P48 NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND INTRINSIC DISORDER<br />
P49 OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL MODELS<br />
P50 EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION ACROSS MULTIPLE MICROBIAL GENOMES<br />
P51 INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES FOR PREDICTING CLINICAL CODES<br />
P52 SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS<br />
P53 FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND COMPARE CYTOMETRY DATA IN THE BROWSER<br />
P54 TOWARDS A BELGIAN REFERENCE SET<br />
P55 MANAGING BIG IMAGING DATA FROM MICROSCOPY: A DEPARTMENTAL-WIDE APPROACH<br />
P56 ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN CANCER GENOMES USING ENHANCER PREDICTION MODELS AND MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA<br />
P57 I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN SEQUENCE VISUALIZATION<br />
P58 SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS<br />
P59 MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM<br />
P60 COEXPNETVIZ: THE CONSTRUCTION AND VIZUALISATION OF CO-EXPRESSION NETWORKS<br />
P61 THE DETECTION OF PURIFYING SELECTION DURING TUMOUR EVOLUTION UNVEILS CANCER VULNERABILITIES<br />
P62 FLOREMI: SURVIVAL TIME PREDICTION BASED ON FLOW CYTOMETRY DATA<br />
P63 STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS<br />
P64 THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO THE SOURDOUGH FERMENTATION PROCESS<br />
P65 ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST<br />
P66 PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING<br />
P67 IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING A NETWORK-BASED APPROACH<br />
P68 DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING<br />
P69 HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES USING MATRIX FACTORIZATION<br />
P70 THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS DISTRIBUTION<br />
Corporate poster presentations<br />
C2 THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE: IDENTIFICATION OF EXPOSURE RESPONSE MARKERS<br />
K1. MEDICAL DATA AND TEXT MINING:<br />
LINKING DISEASES, DRUGS, AND ADVERSE REACTIONS<br />
Lars Juhl Jensen<br />
Clinical data describing the phenotypes and treatment of patients is an underused data source that has much greater<br />
research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for revealing<br />
unknown disease correlations and for improving post-approval monitoring of drugs. In my presentation I will introduce<br />
the centralized Danish health registries and show how we use them for identification of temporal disease correlations and<br />
discovery of common diagnosis trajectories of patients. I will also describe how we perform text mining of the clinical<br />
narrative from electronic health records and use this for identification of new adverse reactions of drugs.<br />
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: K2<br />
Keynote<br />
K2. MULTIPLE SURVIVAL STRATEGIES TO DEAL WITH THE<br />
MULTIPLICATION OF MULTIPLE SEQUENCE ALIGNMENT METHODS<br />
Cedric Notredame<br />
In this seminar I will introduce some of the latest developments in the field of multiple sequence alignment construction,<br />
including some of the work from my group. I will briefly review the main challenges and the latest work in the field,<br />
including ClustalO and phylogeny-aware aligners like SATe, and how these aligners relate to consistency-based<br />
methods like T-Coffee. I will also look at the complex relationship between multiple sequence alignment accuracy,<br />
structural modeling and phylogenetic tree reconstruction, and introduce the notion of a reliability index while reviewing<br />
some of the latest advances in this field, including the TCS (transitive consistency score). I will show how this index can<br />
be used to identify both structurally correct positions in an alignment and evolutionarily informative sites, thus suggesting<br />
more unity than initially thought between these two parameters. I will then introduce the structure-based clustering<br />
method we recently developed to further test these hypotheses. I will finish with some considerations on the main<br />
challenges that need to be confronted for the accurate modeling of relationships between biological sequences, with special<br />
attention to genomic and RNA sequences. All methods are available from www.tcoffee.org.<br />
REFERENCES<br />
TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Chang<br />
JM, Di Tommaso P, Notredame C. Mol Biol Evol. 2014 Jun;31(6):1625-37. doi: 10.1093/molbev/msu117. Epub 2014 Apr 1.<br />
Using tertiary structure for the computation of highly accurate multiple RNA alignments with the SARA-Coffee package. Kemena C, Bussotti G,<br />
Capriotti E, Marti-Renom MA, Notredame C. Bioinformatics. 2013 May 1;29(9):1112-9. doi: 10.1093/bioinformatics/btt096. Epub 2013 Feb 28.<br />
Alignathon: a competitive assessment of whole-genome alignment methods. Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K,<br />
Seledtsov I, Molodtsov V, Raney BJ, Clawson H, Kim J, Kemena C, Chang JM, Erb I, Poliakov A, Hou M, Herrero J, Kent WJ, Solovyev V,<br />
Darling AE, Ma J, Notredame C, Brudno M, Dubchak I, Haussler D, Paten B. Genome Res. 2014 Dec;24(12):2077-89. doi: 10.1101/gr.174920.114.<br />
Epub 2014 Oct 1.<br />
Epistasis as the primary factor in molecular evolution. Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Nature. 2012 Oct<br />
25;490(7421):535-8. doi: 10.1038/nature11510. Epub 2012 Oct 14.<br />
Abstract ID: C1<br />
Corporate presentation<br />
C1. ILLUMINA SOFTWARE PLATFORMS TO TRANSFORM THE PATH TO<br />
KNOWLEDGE AND DISCOVERY<br />
Nicolas Goffard<br />
Illumina, Inc. ngoffard@illumina.com<br />
The main bottleneck in the biological sample-to-answer workflow has undoubtedly moved beyond the generation of<br />
the raw data towards its initial processing and analysis, and even more so towards its biological and medical interpretation. There<br />
are two main reasons why this is particularly challenging for research organisations. Firstly,<br />
there is a need to easily and securely analyse, archive and share sequencing data, as well as to simplify and accelerate the<br />
data analysis with push-button tools using widely validated and scientifically accepted algorithms. Secondly, there is a<br />
requirement to normalize, standardize and curate not just their proprietary data from multiple studies, but to do so in a<br />
way that allows them to compare it in real time to data produced from public domain studies. Illumina provides two<br />
integrated software platforms to overcome these challenges, BaseSpace and NextBio. This presentation provides<br />
an overview of the capabilities of both, which empower biologists and informaticians to interactively explore the<br />
data.<br />
Abstract ID: C2<br />
Corporate presentation<br />
C2. THE SYSTEMS TOXICOLOGY COMPUTATIONAL CHALLENGE:<br />
IDENTIFICATION OF EXPOSURE RESPONSE MARKERS<br />
Carine Poussin, Vincenzo Belcastro, Stéphanie Boué, Florian Martin,<br />
Alain Sewer, Bjoern Titz, Manuel C. Peitsch & Julia Hoeng.<br />
Philip Morris International Research and Development, Philip Morris Product SA,<br />
Quai Jeanrenaud 5, CH-2000 Neuchâtel, Switzerland<br />
INTRODUCTION<br />
Risk assessment in the context of 21st century<br />
toxicology relies on the identification of specific<br />
exposure response markers and the elucidation of<br />
mechanisms of toxicity, which can lead to adverse<br />
events. As a foundation for this future predictive risk<br />
assessment, diverse sets of chemicals or mixtures are<br />
tested in different biological systems, and datasets are<br />
generated using high-throughput technologies.<br />
However, the development of effective computational<br />
approaches for the analysis and integration of these data<br />
sets remains challenging.<br />
METHODS<br />
The sbv IMPROVER (Industrial Methodology for<br />
Process Verification in Research;<br />
http://sbvimprover.com/) project aims to verify methods<br />
and concepts in systems biology research via challenges<br />
posed to the scientific community. In fall 2015, the 4th<br />
sbv IMPROVER computational challenge will be<br />
launched, aimed at evaluating algorithms for<br />
the identification of specific markers of chemical<br />
mixture exposure response in blood of humans or<br />
rodents. Blood is an easily accessible matrix<br />
but remains a complex biofluid to analyze. This<br />
computational challenge will address questions related<br />
to the classification of samples based on transcriptomics<br />
profiles from well-defined sample cohorts. Moreover, it<br />
will address whether gene expression data derived from<br />
human or rodent whole blood are sufficiently<br />
informative to identify human-specific or species-independent<br />
blood gene signatures predictive of the<br />
exposure status of a subject to chemical mixtures<br />
(current/former/non-exposure).<br />
RESULTS & DISCUSSION<br />
Participants will be provided with high quality datasets<br />
to develop predictive models/classifiers and the<br />
predictions will be scored by an independent scoring<br />
panel. The results and post-challenge analyses will be<br />
shared with the scientific community, and will open<br />
new avenues in the field of systems toxicology.<br />
REFERENCES<br />
Meyer et al. Industrial methodology for process verification in<br />
research (IMPROVER): toward systems biology verification.<br />
Bioinformatics, 2012<br />
Meyer et al. Verification of systems biology research in the age of<br />
collaborative competition. Nat Biotechnol, 2011<br />
Tarca et al. Strengths and limitations of microarray-based phenotype<br />
prediction: lessons learned from the IMPROVER Diagnostic<br />
Signature Challenge. Bioinformatics, 2013<br />
Hartung, T. Lessons learned from alternative methods and their<br />
validation for a new toxicology in the 21st century. Journal of<br />
toxicology and environmental health, 2010<br />
Hoeng et al. A network-based approach to quantifying the impact of<br />
biologically active substances. Drug Discov Today, 2012.<br />
Abstract ID: O1<br />
Oral presentation<br />
O1. CELL TYPE-SELECTIVE DISEASE ASSOCIATION<br />
OF GENES UNDER HIGH REGULATORY LOAD<br />
Mafalda Galhardo 1 , Philipp Berninger 2 , Thanh-Phuong Nguyen 1 , Thomas Sauter 1 & Lasse Sinkkonen 1*.<br />
Life Sciences Research Unit, University of Luxembourg, Luxembourg, Luxembourg 1 ; Biozentrum, University of Basel<br />
and Swiss Institute of Bioinformatics, Basel, Switzerland 2 . * lasse.sinkkonen@uni.lu<br />
Identification of biomarkers and drug targets is a key task of biomedical research. We previously showed that disease-linked<br />
metabolic genes are often under combinatorial regulation (Galhardo et al. 2014). Here we extend this analysis to<br />
include almost 100 transcription factors (TFs) and key histone modifications from over 100 samples to show that genes<br />
under high regulatory load (HRL) are enriched for disease association across cell types. Network and pathway analysis<br />
suggests that the central role of HRL genes in biological networks, under heavy regulation at both the transcriptional and post-transcriptional<br />
level, is a possible explanation for the observed enrichment. Thus, epigenomic mapping of enhancers<br />
presents an unbiased approach for identification of novel disease-associated genes.<br />
INTRODUCTION<br />
Identification of disease-relevant genes and gene products<br />
as biomarkers and drug targets is one of the key tasks of<br />
biomedical research. Still, a great majority of research is<br />
focused on a small minority of genes while many remain<br />
unstudied (Pandey et al. 2014). Unbiased prioritization<br />
within these ignored genes would be important to harvest<br />
the full potential of genomics in understanding diseases.<br />
Many databases to catalog disease-associated genes have<br />
been created, including DisGeNET that draws from<br />
multiple sources (Bauer-Mehren et al. 2010). In addition,<br />
large amounts of publicly available epigenomic data on<br />
the cell type-selective regulation of these genes have been<br />
produced. The importance of epigenetic regulation for<br />
disease development is increasingly recognized, for<br />
example in the analysis of GWAS data, where causal SNPs<br />
are mostly located within gene regulatory regions<br />
(Maurano et al. 2012).<br />
METHODS<br />
Public ChIP-seq data produced by the ENCODE project<br />
(Dunham et al. 2012), the BLUEPRINT Epigenome<br />
project (Martens et al. 2013) and the NIH Epigenomic<br />
Roadmap project (Kundaje et al. 2015) were downloaded<br />
in May 2014. The data were used to rank active protein-coding<br />
genes (based on NCBI Entrez and marked by<br />
H3K4me3) by their regulatory load, defined as the number<br />
of TFs or enhancer (H3K27ac) regions associated with them using<br />
the GREAT tool. The enrichment of disease genes from<br />
DisGeNET among HRL genes was tested using either<br />
the Matlab® hypergeometric cumulative distribution function,<br />
adjusted for multiple testing with the Benjamini and<br />
Hochberg method, or a normalized enrichment score.<br />
Enriched diseases were clustered using the R package<br />
“blockcluster”. Peak calling for super-enhancers was done<br />
using HOMER. A liver disease gene network with 8278<br />
interactions was constructed from HPRD based on liver disease genes<br />
from MeSH and genes from CTD.<br />
Statistical analysis of KEGG pathway<br />
enrichments and betweenness centrality was done using<br />
random sampling tests. miRNA target predictions were<br />
obtained from TargetScan 6.2. Further details of the<br />
methods can be found in Galhardo et al. 2015.<br />
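As an illustration of the enrichment test described above, the following is a minimal sketch of a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment.<br />
It is not the authors' Matlab code: the function names and the toy numbers are assumptions, and only the Python standard library is used.<br />

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of seeing at least k disease genes among n HRL genes,
    when K of the N background genes are disease genes."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example (hypothetical counts): one p-value per disease, then adjust.
raw = [hypergeom_sf(2, 10, 3, 4), hypergeom_sf(1, 10, 3, 4)]
adjusted = benjamini_hochberg(raw)
```

In practice one such p-value would be computed per disease and per cell type before the adjustment step.<br />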
RESULTS & DISCUSSION<br />
Using ENCODE ChIP-Seq profiles for 93 transcription<br />
factors (TFs) in nine cell lines, we show that HRL genes<br />
are enriched for disease-association across cell types<br />
(Figure 1). TF load correlates with the enhancer load of<br />
the genes, allowing the identification of HRL genes by<br />
epigenomic mapping of active enhancers marked by<br />
H3K27ac modifications. Identification of the HRL genes<br />
across 139 samples from 96 different cell and tissue types<br />
reveals a consistent enrichment for disease-associated<br />
genes in a cell type-selective manner.<br />
The HRL genes are involved in more pathways than<br />
expected by chance, exhibit increased betweenness<br />
centrality in the interaction network of liver disease genes,<br />
and carry longer 3’UTRs with more microRNA binding<br />
sites than genes on average, suggesting a role as hubs<br />
within regulatory networks.<br />
Thus, epigenomic mapping of enhancers presents an<br />
unbiased approach for identification of novel disease-associated<br />
genes (Galhardo et al. 2015).<br />
FIGURE 1. Workflow of the disease-gene enrichment analysis. Human ChIP-seq data (transcription factor binding sites for 93 TFs in 9 ENCODE cell lines: A549, GM12878, H1hESC, HCT116, HeLaS3, HepG2, HUVEC, K562, MCF7; active enhancers marked by H3K27ac in 139 samples comprising 96 tissue or cell types) are used to rank genes by regulatory load (number of TFs or enhancers per gene). High regulatory load genes are enriched for disease association (disease genes, min score 0.08).<br />
REFERENCES<br />
Pandey AK et al. PLoS One, 9:e88889 (2014).<br />
Bauer-Mehren A et al. Nucleic Acids Res., 33:D514-D517 (2010).<br />
Maurano et al. Science, 337:1190-1195 (2012).<br />
Galhardo et al. Nucleic Acids Res., 42:1474-1496 (2014).<br />
Dunham et al. Nature, 489:57-74 (2012).<br />
Martens et al. Haematologica, 98:1487-1489 (2013).<br />
Kundaje et al. Nature, 518:317-330 (2015).<br />
Galhardo et al. Nucleic Acids Res., 10.1093/nar/gkv863 (2015).<br />
Abstract ID: O2<br />
Oral presentation<br />
O2. PREDICTING OLIGOGENIC EFFECTS USING DIGENIC DISEASE DATA<br />
Andrea M. Gazzo 1,2,3* , Dorien Daneels 1,3 , Maryse Bonduelle 3 , Sonia Van Dooren 1,3 , Guillaume Smits 1,4 & Tom<br />
Lenaerts 1,2,5 .<br />
Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium 1 ; MLG, Departement d'Informatique,<br />
Universite Libre de Bruxelles, Brussels, Belgium 2 ; Center for Medical Genetics, Reproduction and Genetics,<br />
Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussel, Belgium 3 ; Genetics,<br />
Hopital Universitaire des Enfants Reine Fabiola, Universite Libre de Bruxelles, Brussels, Belgium 4 ;<br />
Computerwetenschappen, Vrije Universiteit Brussel, Brussel, Belgium 5 . * Andrea.Gazzo@ulb.ac.be<br />
Recent research has shown that disorders may be better described by more complex inheritance mechanisms, suggesting<br />
that some monogenic diseases may in fact be oligogenic. Understanding how the combined interplay and weight of<br />
variants leads to disease may provide improved and novel insights into diseases classically considered monogenic.<br />
Here we present a unique classification method that separates two types of digenic diseases, i.e. those that require<br />
variants in both genes to induce the disease and those where one is causative and the second increases the severity. Our<br />
results show that a clear separation can be made between the two classes using gene- and variant-level features extracted<br />
from DIDA.<br />
INTRODUCTION<br />
DIDA is a novel database that provides, for the first time,<br />
detailed information on genes and associated genetic<br />
variants involved in digenic diseases, the simplest form of<br />
oligogenic inheritance [1]. The database is accessible via<br />
http://dida.ibsquare.be and currently includes 213 digenic<br />
combinations involved in 44 different digenic diseases [2].<br />
These combinations are composed of 364 distinct variants,<br />
which are distributed over 136 distinct genes. Creating this<br />
new repository was essential, as current databases do not<br />
allow one to retrieve detailed records regarding digenic<br />
combinations. Genes, variants, diseases and digenic<br />
combinations in DIDA are annotated with manually<br />
curated information and information mined from other<br />
online resources. Each digenic combination was<br />
categorized into one of two effect classes: either "on/off",<br />
in which variant combinations in both genes are required<br />
to develop the disease, or "severity", in which variants in<br />
one gene are enough to develop the disease and carrying<br />
variant combinations in two genes increases the severity or<br />
affects the age of onset. In this work we present a predictor<br />
capable of distinguishing between the digenic effect<br />
classes. We analyse the result of this predictor in relation<br />
to specific features collected for the different digenic<br />
combinations in DIDA, such as the<br />
haploinsufficiency of the genes, their zygosity and the<br />
relationship between them, providing insight into the<br />
biological meaning of the result.<br />
METHODS<br />
We used a machine learning approach to determine the<br />
class, i.e. "severity" or "on/off", of a digenic<br />
combination. Starting with feature selection, we chose the<br />
most informative features to classify each digenic<br />
combination into one of the two classes. For each of the two genes<br />
involved in a digenic combination, the following are used as features<br />
in the predictor: zygosity (heterozygous, homozygous, etc.), recessiveness<br />
probability, haploinsufficiency score, known recessive<br />
information, and whether the gene is essential (based on<br />
mouse knockout experimental data).<br />
At the variant level, we used as features the<br />
pathogenicity predictions from the SIFT and PolyPhen-2 tools.<br />
Finally, we also encode the relationship between the two<br />
genes, defining the relations "Similar function", "Directly<br />
interacting" and "Pathway membership". After different<br />
tests we decided to use a random forest algorithm, as this<br />
approach gave the best results.<br />
RESULTS & DISCUSSION<br />
After 10-fold cross-validation we obtained promising<br />
performance, with an MCC of 0.67 and an AUROC of 0.92.<br />
Unfortunately, this performance is an overestimate: since<br />
the gene-based features are the most important, many<br />
examples with mutations mapped to the same gene pair<br />
lead to the same oligogenic effect class. A stratification<br />
that ensures that the same pair of genes is never in both<br />
the training and the testing set was required. We<br />
manually created 5 subsets, where the instances with the<br />
same gene pair belong to the same subset. After this<br />
procedure we assessed the performance again, obtaining<br />
an MCC of 0.36 and an AUROC of 0.78. In order to verify<br />
the significance of this performance we retrained the<br />
random forest on a randomization of the data. This<br />
randomization was obtained by shuffling all the features<br />
for each instance while keeping the class unchanged. This<br />
reshuffling resulted in an MCC close to zero and an<br />
AUROC close to 0.5, as expected. This additional test<br />
confirms the significance of the stratified results.<br />
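The gene-pair stratification and the MCC score discussed above can be sketched as follows.<br />
This is only an illustration under stated assumptions: the abstract says the subsets were created manually, whereas the greedy balancing below, the function names and the gene labels are all hypothetical.<br />

```python
from collections import defaultdict
from math import sqrt


def gene_pair_folds(instances, n_folds=5):
    """Assign instance indices to folds so that a gene pair never spans two folds.

    `instances` is a list of (gene_a, gene_b) tuples; pairs are treated as unordered.
    """
    groups = defaultdict(list)
    for idx, (gene_a, gene_b) in enumerate(instances):
        groups[frozenset((gene_a, gene_b))].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption): largest groups first, into the smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical digenic combinations, identified by their gene pair.
instances = [("A", "B"), ("B", "A"), ("C", "D"), ("A", "B"), ("E", "F"), ("C", "D")]
folds = gene_pair_folds(instances, n_folds=3)
```

Each fold is then used once as the test set while the remaining folds form the training set, so that no gene pair seen in training reappears at test time.<br />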
As a next step we are analysing the relationship between<br />
the oligogenic effect and the features used, particularly in<br />
terms of biological and molecular interpretation. As a<br />
future perspective, the benefit at the clinical level is very<br />
promising: one goal of medical genetics is to assign<br />
predictive value to the genotype, so that it can assist in<br />
diagnosis and disease management. If we can infer, based<br />
on the genotype, what the digenic/oligogenic effect will be,<br />
we can potentially anticipate the treatment.<br />
REFERENCES<br />
[1] Gazzo, A. et al., DIDA: a curated and annotated digenic diseases<br />
database, under review on NAR database issue (2016).<br />
[2] Schäffer, A. A. (2013) Digenic inheritance in medical genetics.<br />
J. Med. Genet., 50, 641–652.<br />
Abstract ID: O3<br />
Oral presentation<br />
O3. A COMPREHENSIVE COMPARISON OF MODULE DETECTION METHODS<br />
FOR GENE EXPRESSION DATA<br />
Wouter Saelens 1,2* , Robrecht Cannoodt 1,2,3 , Bart N. Lambrecht 1,2 & Yvan Saeys 1,2 .<br />
VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Center for Medical<br />
Genetics, Ghent University Hospital 3 . * wouter.saelens@ugent.be<br />
Module detection is central in every analysis of large scale gene expression data. While numerous methods have been<br />
developed, the relative merits and drawbacks of these different approaches are still unclear. In this work we use known<br />
gene regulatory networks to do an unbiased comparison of 41 module detection methods, spanning clustering,<br />
biclustering, decomposition, direct network inference and iterative network inference. This analysis showed that<br />
decomposition methods outperform current clustering methods. Our work provides a first comprehensive evaluation to<br />
guide the biologist in their choice but also serves as a protocol for the evaluation of novel module detection methods.<br />
INTRODUCTION<br />
Module detection methods form a cornerstone in the<br />
analysis of genome wide gene expression compendia.<br />
Modules in this context are defined as groups of genes<br />
with a similar expression profile, which therefore frequently<br />
share certain functions, are co-regulated and cooperate to<br />
produce a certain phenotype.<br />
Over the last years, dozens of module detection methods<br />
have been developed, which can be classified into five<br />
different categories. The most popular method is<br />
undoubtedly clustering, which groups genes into<br />
modules based on global similarity in expression profiles.<br />
Within the transcriptomics community these methods have<br />
received a considerable amount of criticism. This is<br />
mainly due to three drawbacks: (i) clustering cannot detect<br />
so-called local co-expression effects, (ii) most clustering<br />
methods are unable to detect overlapping modules and (iii)<br />
clustering methods do not model the underlying gene<br />
regulatory network. Alternative approaches have therefore<br />
been developed which either handle both overlap and local<br />
co-expression (biclustering and decomposition) or model<br />
the gene regulatory network (direct network inference and<br />
iterative network inference).<br />
Given this methodological diversity, it is important that<br />
existing and new approaches are evaluated on robust and<br />
objective benchmarks. However, evaluation studies in the<br />
past were limited in the number of methods, used synthetic<br />
data or did not correctly assess the balance between false<br />
positives and false negatives. In this study we therefore<br />
provide a novel unbiased and comprehensive evaluation<br />
strategy (Figure 1), and use it to evaluate 41 state-of-the-art<br />
module detection methods.<br />
METHODS<br />
The key to our approach is that we use gold standard<br />
regulatory networks to define sets of known modules.<br />
These can be used to directly assess the sensitivity and<br />
specificity of the different module detection methods. We<br />
used four different large scale gene expression compendia,<br />
two from E. coli and two from S. cerevisiae. For each of<br />
these organisms a substantial part of the regulatory<br />
network is already known, either based on the integration<br />
of small-scale experiments or based on large, genome-wide<br />
datasets. We use these networks to define groups of<br />
known modules by looking at genes which either<br />
share one regulator, share all regulators or are strongly<br />
interconnected. We used four different metrics to compare<br />
a set of observed modules with known modules: recovery<br />
and recall control the type II errors, while relevance<br />
and specificity control the type I errors.<br />
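The abstract does not define the four metrics. As a hedged illustration, the sketch below uses one common formulation of recovery and relevance, the mean best-match Jaccard index between two sets of modules; this definition, the function names and the gene labels are assumptions and may differ from the authors' implementation.<br />

```python
def jaccard(a, b):
    """Jaccard index of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_best_match(from_modules, to_modules):
    """Average, over from_modules, of the best Jaccard overlap with any module in to_modules."""
    if not from_modules or not to_modules:
        return 0.0
    best = [max(jaccard(m, other) for other in to_modules) for m in from_modules]
    return sum(best) / len(best)


def recovery(known, observed):
    """How well the known modules are found (penalises missed modules)."""
    return mean_best_match(known, observed)


def relevance(known, observed):
    """How well the observed modules match known ones (penalises spurious modules)."""
    return mean_best_match(observed, known)


# Hypothetical example: one of two known modules is recovered exactly.
known = [{"g1", "g2", "g3"}, {"g4", "g5"}]
observed = [{"g1", "g2", "g3"}]
```

With these toy modules the recovery is 0.5 (one known module is missed) while the relevance is 1.0 (the only observed module is correct), which shows why both directions are needed.<br />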
Parameter tuning is a necessary but often overlooked<br />
challenge of module detection methods. As default<br />
parameters of a tool are usually optimized for some<br />
specific test cases by the authors, they do not necessarily<br />
reflect general good performance on other datasets. On the<br />
other hand, one should be careful of overfitting parameters<br />
on specific characteristics of the data, as such parameters<br />
will lead to suboptimal results when using the same<br />
parameter settings on other datasets. In this study we first<br />
optimized parameters using a grid-based approach. Next,<br />
to avoid overfitting we used the optimal parameters on one<br />
dataset to score the performance on another dataset, in an<br />
approach akin to cross-validation.<br />
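The tuning scheme described above, optimizing parameters on one dataset and scoring them on another, can be sketched as follows.<br />
The `score` callable, the parameter grid and the dataset names are hypothetical placeholders, not part of the original study.<br />

```python
from itertools import product


def cross_dataset_scores(datasets, param_grid, score):
    """For each ordered pair (train, test) of distinct datasets, tune on train and score on test.

    `score(params, dataset)` is a hypothetical callable returning a number (higher is better).
    """
    names = sorted(param_grid)
    grid = [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]
    results = {}
    for train in datasets:
        # Grid search: keep the parameter setting that scores best on the training dataset.
        best = max(grid, key=lambda params: score(params, train))
        for test in datasets:
            if test != train:
                results[(train, test)] = score(best, test)
    return results


# Toy scoring function (an assumption): each dataset has its own optimal k.
optimum = {"ecoli": 3, "yeast": 5}
toy_score = lambda params, dataset: -(params["k"] - optimum[dataset]) ** 2
results = cross_dataset_scores(["ecoli", "yeast"], {"k": [1, 3, 5]}, toy_score)
```

The scores reported this way are lower than those obtained by tuning and testing on the same dataset, which is the overfitting effect the procedure is meant to expose.<br />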
RESULTS & DISCUSSION<br />
We evaluated 41 different module detection methods<br />
covering all five approaches. Overall, our analysis showed<br />
that certain decomposition methods, those based on<br />
independent component analysis, outperform current state-of-the-art<br />
clustering methods. However, despite their<br />
theoretical advantages, neither biclustering nor network<br />
inference methods are able to outperform clustering<br />
methods. Importantly, our results are stable across datasets,<br />
module definitions and scoring metrics, demonstrating the<br />
robustness of our evaluation methodology.<br />
FIGURE 1. Overview of our evaluation methodology.<br />
The applications of our work are twofold. First, if local co-expression<br />
and overlap are of interest, we discourage the<br />
use of biclustering methods and suggest the use of<br />
decomposition instead. Secondly, we provide a new<br />
comprehensive evaluation methodology which can be used<br />
to compare novel methods with the current state-of-the-art.<br />
Abstract ID: O4<br />
Oral presentation<br />
O4. LATEBICLUSTERING: EFFICIENT DISCOVERY OF TEMPORAL LOCAL<br />
PATTERNS WITH POTENTIAL DELAYS<br />
Joana P. Gonçalves 1,2* & Sara C. Madeira 3,4 .<br />
Pattern Recognition and Bioinformatics Group, Department of Intelligent Systems, Delft University of Technology 1 ;<br />
Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 2 ; Department of Computer Science and<br />
Engineering, Instituto Superior Técnico, Universidade de Lisboa 3 ; INESC-ID 4 . * research@joanagoncalves.org<br />
Temporal transcriptomes can provide valuable insight into the dynamics of transcriptional response and gene regulation.<br />
In particular, many studies seek to uncover functional biological units by identifying and grouping genes with common<br />
expression patterns. Nevertheless, most analytical tools available for this purpose fall short in their ability to consider<br />
biologically reasonable models and adequately incorporate the temporal dimension. Each biological task is likely to<br />
occur within a time period that does not necessarily span the whole time course of the experiment, and genes involved in<br />
such a task are expected to coordinate only while the task is ongoing. LateBiclustering is an efficient algorithm to<br />
identify this type of coordinated activity, while allowing genes to participate in distinct biological tasks with multiple<br />
partners over time. Additionally, LateBiclustering is able to capture temporal delays suggestive of transcriptional<br />
cascades: one of the hallmarks of gene expression and regulation.<br />
INTRODUCTION<br />
The discovery of patterns in temporal transcriptomes<br />
exposes gene expression dynamics and contributes to<br />
understanding the machinery involved in its modulation.<br />
Various analytical tools are employed in this regard.<br />
Differential expression summarizes an entire time course<br />
into one feature, thus lacking detail. Clustering<br />
respects the chronological order, but focuses on global<br />
similarities and tends to identify rather broad patterns,<br />
associated with unspecific functions. Biclustering offers<br />
increased granularity by additionally searching for local<br />
patterns, but allows for arbitrary jumps in time, potentially<br />
leading to patterns that are incoherent from a temporal<br />
perspective.<br />
METHODS<br />
LateBiclustering is an efficient algorithm for the<br />
identification of transcriptional modules, here termed<br />
LateBiclusters. Each LateBicluster is a group of genes<br />
showing a similar expression pattern with potential delays,<br />
within a particular time frame that does not necessarily<br />
span the whole time course of the transcriptome.<br />
LateBiclustering only reports maximal LateBiclusters, that<br />
is, those that cannot be extended and are not fully<br />
contained in any other LateBicluster.<br />
LateBiclustering takes as input a gene-time expression<br />
matrix of real values. Each gene expression profile is first<br />
normalized to zero mean and unit standard deviation. A<br />
discretization is then applied to classify variations<br />
between consecutive time points into three levels: down-trend,<br />
no-change and up-trend. Upon discretization each<br />
gene profile can be seen as a string.<br />
A generalized suffix tree is built to find common<br />
patterns in the gene profiles. Internal nodes<br />
satisfying certain properties are marked for their<br />
potential to denote LateBiclusters.<br />
When an internal node does not satisfy the basic<br />
conditions for LateBicluster maximality, a<br />
procedure is applied to remove occurrences<br />
leading to non-maximal LateBiclusters. For this<br />
purpose, LateBiclustering uses a bit array<br />
representing the occurrences underlying each<br />
internal node. During the maximality update<br />
procedure, the bit array of the inspected node is<br />
compared against those of internal child nodes<br />
(right-max) and nodes from which the inspected<br />
node receives suffix links (left-max).<br />
Finally, LateBiclustering comes with different<br />
heuristics to report a single pattern occurrence per<br />
gene in each maximal LateBicluster. A heuristic<br />
is necessary because there may be multiple<br />
occurrences of a pattern in the profile of a given<br />
gene, which is a direct consequence of allowing<br />
the discovery of delayed patterns.<br />
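To make the discretization step and the notion of a shared pattern with delays concrete, here is a deliberately naive sketch.<br />
It is not the LateBiclustering algorithm: it enumerates substrings by brute force instead of using a generalized suffix tree, performs no maximality filtering, and the change threshold, the D/N/U alphabet, the use of the population standard deviation and the "earliest occurrence" heuristic are all assumptions.<br />

```python
from collections import defaultdict
from statistics import mean, pstdev


def discretize_profile(values, threshold=0.5):
    """Normalize a profile to zero mean and unit SD, then encode each
    consecutive change as D (down-trend), N (no-change) or U (up-trend)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return "N" * (len(values) - 1)
    z = [(v - mu) / sigma for v in values]
    symbols = []
    for prev, curr in zip(z, z[1:]):
        delta = curr - prev
        symbols.append("U" if delta > threshold else "D" if delta < -threshold else "N")
    return "".join(symbols)


def shared_patterns(strings, min_len=2, min_genes=2):
    """Map each pattern to {gene: start index} for patterns found in >= min_genes genes.

    Different start indices for the same pattern correspond to temporal delays.
    """
    occurrences = defaultdict(dict)
    for gene, s in strings.items():
        for start in range(len(s)):
            for end in range(start + min_len, len(s) + 1):
                # Keep only the earliest occurrence per gene (one simple heuristic).
                occurrences[s[start:end]].setdefault(gene, start)
    return {p: hits for p, hits in occurrences.items() if len(hits) >= min_genes}


# Hypothetical discretized profiles: geneB repeats geneA's pattern one step later.
patterns = shared_patterns({"geneA": "UUDN", "geneB": "NUUD"})
```

Here the pattern "UUD" starts at index 0 in geneA and at index 1 in geneB, i.e. with a delay of one time step.<br />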
RESULTS & DISCUSSION<br />
LateBiclustering is the first efficient algorithm suitable for<br />
the discovery of biclusters with temporal delays. It runs in<br />
polynomial time, while previous methods yielded<br />
exponential time complexity. LateBiclustering was able to<br />
find planted biclusters in synthetic data. It also identified<br />
biologically relevant LateBiclusters associated with<br />
Saccharomyces cerevisiae’s response to heat stress, and<br />
interesting time-lagged responses.<br />
FIGURE 1. Schematic of the LateBiclustering algorithm.<br />
REFERENCES<br />
Gonçalves JP & Madeira SC. IEEE/ACM Transactions on<br />
Computational Biology and Bioinformatics, 11(5), 801–813<br />
(2014).<br />
Abstract ID: O5<br />
Oral presentation<br />
O5. INFERRING DEVELOPMENTAL CHRONOLOGIES FROM SINGLE CELL<br />
RNA<br />
Robrecht Cannoodt 1,2,3* , Katleen De Preter 3 & Yvan Saeys 1,2 .<br />
Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center, Ghent 1 ; Department of<br />
Respiratory Medicine, Ghent University Hospital, Ghent 2 ; Center of Medical Genetics, Ghent University Hospital,<br />
Ghent 3 . * robrecht.cannoodt@ugent.be<br />
With the advent of single cell RNA sequencing, it is now possible to analyse the transcriptomes of hundreds of individual<br />
cells in an unbiased manner. Reconstructing the developmental chronology of differentiating cells is a challenging task,<br />
and doing so in an unsupervised and robust manner is a hitherto untackled problem. We developed a truly unsupervised<br />
developmental chronology inference technique, and evaluated its performance and robustness using multiple datasets.<br />
INTRODUCTION<br />
Early attempts at inferring the chronologies of single cells<br />
are MONOCLE (Trapnell et al., 2014) and NBOR<br />
(Schlitzer et al., 2015). However, these techniques are not<br />
unsupervised, as they require knowledge of the cell type of<br />
each cell prior to analysis, which biases the results towards prior<br />
knowledge and possibly obstructs the discovery of novel<br />
subpopulations.<br />
METHODS<br />
Our approach consists of four steps.<br />
In the first step, the feature space (~30000 genes) is<br />
reduced to three dimensions.<br />
Secondly, outliers are detected and removed, using a K-<br />
nearest neighbour approach. After outlier removal, the<br />
original feature space is again reduced to three dimensions.<br />
Next, a nonparametric nonlinear curve is iteratively fitted<br />
to the data.<br />
Finally, each cell is projected onto the curve, thus<br />
resulting in a cell chronology.<br />
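Two of the four steps above, outlier removal and projection onto a curve, can be sketched with the standard library alone.<br />
The abstract does not give the outlier criterion or the curve-fitting procedure, so the cutoff rule (mean plus two standard deviations of the k-nearest-neighbour scores), the polyline representation of the curve and all names below are assumptions; dimensionality reduction and the iterative fit are not shown.<br />

```python
from math import dist
from statistics import mean, pstdev


def knn_outliers(points, k=3, n_sd=2.0):
    """Indices of points whose mean distance to their k nearest neighbours is unusually large."""
    scores = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(mean(d[:k]))
    cutoff = mean(scores) + n_sd * pstdev(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]


def project_onto_curve(point, curve):
    """Arc-length position of the point on the polyline `curve` closest to `point`."""
    best_d, best_pos, travelled = float("inf"), 0.0, 0.0
    for a, b in zip(curve, curve[1:]):
        seg = [bi - ai for ai, bi in zip(a, b)]
        seg_len2 = sum(c * c for c in seg)
        t = 0.0 if seg_len2 == 0 else sum(
            (pi - ai) * c for pi, ai, c in zip(point, a, seg)) / seg_len2
        t = min(1.0, max(0.0, t))  # clamp the projection to the segment
        closest = [ai + t * c for ai, c in zip(a, seg)]
        d = dist(point, closest)
        if d < best_d:
            best_d, best_pos = d, travelled + t * seg_len2 ** 0.5
        travelled += seg_len2 ** 0.5
    return best_pos


# Hypothetical 3-D coordinates: a fitted curve and three cells to be ordered.
curve = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
cells = [(1.5, -2, 0), (0.5, 1, 0), (0.1, 0, 0)]
chronology = sorted(range(len(cells)), key=lambda i: project_onto_curve(cells[i], curve))
```

Sorting cells by their position along the curve yields the inferred chronology; in this toy example the third cell comes first and the first cell last.<br />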
RESULTS & DISCUSSION<br />
A single-cell RNAseq dataset (Schlitzer et al., 2015)<br />
contains profiles of DC progenitor cells. These cells are<br />
expected to differentiate from MDP to CDP to PreDC. Our<br />
method is able to intuitively visualise known population<br />
groups (Figure 1), as well as infer the developmental<br />
chronology of the individual cells (Figure 2).<br />
We evaluated our method on four datasets (Shalek et al.,<br />
2014; Trapnell et al., 2014; Buettner et al., 2015 and<br />
Schlitzer et al., 2015), and found it to perform better and<br />
more robustly than the existing methods MONOCLE and<br />
NBOR.<br />
This approach opens opportunities to further study known<br />
mechanisms or investigate unknown key regulatory<br />
structures in cell differentiation, or detect novel<br />
subpopulations in a truly unsupervised manner.<br />
REFERENCES<br />
Buettner F et al. Nature Biotechnology 33, 155-160 (2015).<br />
Schlitzer A et al. Nature Immunology 16, 718-726 (2015).<br />
Shalek A et al. Nature 509, 363-369 (2014).<br />
Trapnell C et al. Nature Biotechnology 32, 381-386 (2014).<br />
FIGURE 1. After feature space reduction and outlier detection of 244 DC<br />
progenitor cells (Schlitzer et al., <strong>2015</strong>), our method can intuitively<br />
visualise known populations.<br />
FIGURE 2. An iterative curve fitting results in a smooth curve reflecting<br />
the developmental chronology. After projecting each cell to the curve,<br />
regulatory patterns in expression which correlate with this timeline can<br />
be investigated.<br />
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: O6<br />
Oral presentation<br />
O6. COMBINING TREE-BASED AND DYNAMICAL SYSTEMS<br />
FOR THE INFERENCE OF GENE REGULATORY NETWORKS<br />
Vân Anh Huynh-Thu 1* & Guido Sanguinetti 2,3 .<br />
GIGA-R & Department of Electrical Engineering and Computer Science, University of Liège 1 ; School of Informatics,<br />
University of Edinburgh 2 ; SynthSys – Systems and Synthetic Biology, University of Edinburgh 3 . * vahuynh@ulg.ac.be<br />
INTRODUCTION<br />
Reconstructing the topology of gene regulatory networks<br />
(GRNs) from time series of gene expression data remains<br />
an important open problem in computational systems<br />
biology. Current approaches can be broadly divided into<br />
model-based and model-free approaches, and face one of<br />
two limitations: model-free methods are scalable but<br />
suffer from a lack of interpretability, and cannot in general<br />
be used for out-of-sample predictions. On the other hand,<br />
model-based methods focus on identifying a dynamical<br />
model of the system; these are clearly interpretable and<br />
can be used for predictions, but they rely on strong<br />
assumptions and are typically very demanding<br />
computationally. Here, we aim to bridge the gap between<br />
model-based and model-free methods by proposing a<br />
hybrid approach to the GRN inference problem, called<br />
Jump3 (Huynh-Thu & Sanguinetti, <strong>2015</strong>). Our approach<br />
combines formal dynamical modelling with the efficiency<br />
of a nonparametric, tree-based method, allowing the<br />
reconstruction of GRNs of hundreds of genes.<br />
METHODS<br />
Gene expression model. At the heart of the Jump3<br />
framework, we use the on/off model of gene expression<br />
(Ptashne & Gann, 2002), where the rate of transcription of<br />
a gene can vary between two levels depending on the<br />
activity state μ of the promoter of the gene. The expression<br />
x of a gene is modelled through the following stochastic<br />
differential equation:<br />
$$dx_i = \left(A_i \mu_i(t) + b_i - \lambda_i x_i\right)dt + \sigma\, d\omega(t),$$<br />
where subscript i refers to the i-th target gene. Here, the<br />
promoter state $\mu_i(t)$ is a binary variable (the promoter is<br />
either active or inactive) that depends on the expression<br />
levels of the transcription factors (TFs) that bind to the<br />
promoter. $A_i$, $b_i$ and $\lambda_i$ are kinetic parameters, and the term<br />
$\sigma\, d\omega(t)$ represents a white-noise driving process with<br />
variance $\sigma^2$.<br />
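To make the on/off model concrete, the sketch below integrates the stochastic differential equation with the Euler-Maruyama scheme for a given promoter trajectory. The parameter values, the step size and the function name `simulate_on_off` are illustrative assumptions, not taken from Jump3.<br />

```python
import math
import random

def simulate_on_off(mu, A, b, lam, sigma, x0, dt, seed=0):
    """Euler-Maruyama integration of dx = (A*mu(t) + b - lam*x) dt + sigma dW.

    `mu` is a sequence of binary promoter states, one per time step.
    """
    rng = random.Random(seed)
    x, path = x0, [x0]
    for state in mu:
        drift = A * state + b - lam * x
        # Brownian increment over a step of length dt has standard deviation sqrt(dt).
        x = x + drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

# Promoter off for 500 steps, then on for 500 steps (illustrative values).
mu = [0] * 500 + [1] * 500
path = simulate_on_off(mu, A=2.0, b=0.5, lam=1.0, sigma=0.0, x0=0.5, dt=0.01)
noisy = simulate_on_off(mu, A=2.0, b=0.5, lam=1.0, sigma=0.1, x0=0.5, dt=0.01)
```

With the noise switched off, expression stays at b/λ while the promoter is inactive and relaxes towards (A + b)/λ once it becomes active, which is the two-level behaviour the model describes.<br />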
Network reconstruction with jump trees. Recovering<br />
the regulatory links pointing to gene i amounts to finding<br />
the genes whose expression is predictive of the promoter<br />
state $\mu_i$. To achieve this goal, we propose a procedure that<br />
learns, for each target gene i, an ensemble of decision trees<br />
predicting the promoter state $\mu_i$ at any time t from the<br />
expression levels of the candidate regulators at the same<br />
time t. However, standard tree-based methods cannot be<br />
applied here since the output $\mu_i(t)$ is a latent variable. We<br />
therefore propose a new decision tree algorithm called<br />
“jump tree”, which splits the observations by maximising<br />
the marginal likelihood of the dynamical on/off model.<br />
The learned tree-based model is then used to derive an<br />
importance score for each candidate regulator, computed<br />
as the sum of the likelihood gains that are obtained at all<br />
the tree nodes where this regulator was selected to split the<br />
observations. The importance of a candidate regulator j is<br />
used as weight for the putative regulatory link of the<br />
network that is directed from gene j to gene i.<br />
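The importance score described above can be illustrated with a short sketch. It assumes a hypothetical representation in which each tree of the ensemble is reduced to a list of (splitting regulator, likelihood gain) pairs, one per internal node; how the jump trees themselves are grown is not shown.<br />

```python
from collections import defaultdict

def importance_scores(trees):
    """Sum the likelihood gains per regulator over all split nodes of all trees."""
    scores = defaultdict(float)
    for tree in trees:
        for regulator, gain in tree:
            scores[regulator] += gain
    return dict(scores)

# Hypothetical ensemble learned for one target gene i.
ensemble = [
    [("g1", 3.0), ("g2", 1.5)],
    [("g1", 2.0), ("g3", 0.5)],
]
weights = importance_scores(ensemble)
```

Each entry of `weights` would then serve as the weight of the putative link from that regulator to the target gene.<br />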
RESULTS & DISCUSSION<br />
We evaluated Jump3 on the networks of the DREAM4 In<br />
Silico Network challenge (Prill et al., 2010). For each<br />
network topology, two types of simulated expression data<br />
were used: data simulated using the on/off model (toy<br />
data) and the time series data that was provided in the<br />
context of the DREAM4 challenge. We compared Jump3<br />
to other GRN inference methods: two model-free methods,<br />
which are time-lagged variants of GENIE3 (Huynh-Thu et<br />
al., 2010) and CLR (Faith et al., 2007) respectively; two<br />
model-based methods, namely Inferelator (Greenfield et<br />
al., 2010) and TSNI (Bansal et al., 2006); and G1DBN<br />
(Lèbre, 2009), a method based on dynamic Bayesian<br />
networks. Areas Under the Precision-Recall curves<br />
(AUPRs) obtained for size-100 networks are shown in<br />
Table 1. Jump3 yields the highest AUPR in the case of the<br />
toy data. As expected, its performance decreases when the<br />
networks are inferred from the DREAM4 data, due to the<br />
mismatch between the on/off model and the one used to<br />
simulate the data. However, Jump3 still outperforms the<br />
other methods.<br />
Method | Toy data | DREAM4 data<br />
Jump3 | 0.272 ± 0.060 | 0.187 ± 0.058<br />
GENIE3-lag | 0.114 ± 0.010 | 0.176 ± 0.056<br />
CLR-lag | 0.088 ± 0.008 | 0.169 ± 0.047<br />
Inferelator | 0.069 ± 0.006 | 0.144 ± 0.036<br />
TSNI | 0.020 ± 0.003 | 0.042 ± 0.010<br />
G1DBN | 0.104 ± 0.024 | 0.114 ± 0.043<br />
TABLE 1. Comparison of network inference methods (mean AUPR and<br />
standard deviation).<br />
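As a reminder of what an AUPR value measures, the sketch below scores a ranked list of predicted edges against a gold-standard network. It uses the step-wise average precision as an approximation of the area under the precision-recall curve; the DREAM4 evaluation may interpolate the curve differently, and the edge lists here are made up.<br />

```python
def average_precision(ranked_edges, true_edges):
    """Step-wise approximation of the area under the precision-recall curve."""
    hits, total = 0, 0.0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            hits += 1
            total += hits / rank  # precision at each recovered true edge
    return total / len(true_edges)

# Hypothetical predictions, ordered from highest to lowest weight.
ranked = [("g1", "g2"), ("g3", "g2"), ("g1", "g3"), ("g2", "g1")]
gold = {("g1", "g2"), ("g1", "g3")}
aupr = average_precision(ranked, gold)
```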
We also applied Jump3 to gene expression data from<br />
murine bone marrow-derived macrophages treated with<br />
interferon gamma (Blanc et al., 2011). Several of the hub<br />
TFs in the predicted network have biologically relevant<br />
annotations. They include interferon genes, one gene<br />
associated with cytomegalovirus infection, and cancer-associated<br />
genes, showing the potential of Jump3 for<br />
biologically meaningful hypothesis generation.<br />
REFERENCES<br />
Bansal M et al. Bioinformatics 22, 815-822 (2006).<br />
Blanc M et al. PLoS Biol 9, e1000598 (2011).<br />
Faith JJ et al. PLoS Biol 5, e8 (2007).<br />
Greenfield A et al. PLoS ONE 5, e13397 (2010).<br />
Huynh-Thu VA & Sanguinetti G. Bioinformatics 31, 1614-1622 (<strong>2015</strong>).<br />
Huynh-Thu VA et al. PLoS ONE 5, e12776 (2010).<br />
Lèbre S. Stat Appl Genet Mol Biol 8, Article 9 (2009).<br />
Prill RJ et al. PLoS ONE 5, e9202 (2010).<br />
Ptashne M & Gann A. Genes and Signals. Cold Spring Harbor<br />
Laboratory Press (2002).<br />
Abstract ID: O7<br />
Oral presentation<br />
O7. MODELING THE REGULATION OF Β-CATENIN SIGNALLING BY WNT<br />
STIMULATION AND GSK3 INHIBITION<br />
Annika Jacobsen 1 , Nika Heijmans 2 , Renée van Amerongen 2 , Folkert Verkaar 3 ,<br />
Martine J. Smit 3 , Jaap Heringa 1 & K. Anton Feenstra 1 *.<br />
1 Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, The Netherlands; 2 Van Leeuwenhoek Centre<br />
for Advanced Microscopy and Section of Molecular Cytology, Swammerdam Institute for Life Sciences, University of<br />
Amsterdam, The Netherlands; 3 Division of Medicinal Chemistry, VU University Amsterdam, The Netherlands.<br />
*k.a.feenstra@vu.nl<br />
The Wnt/β-catenin signaling pathway is crucial for stem cell self-renewal, proliferation and differentiation. Hyperactive<br />
Wnt/β-catenin signaling caused by genetic alterations plays an important role in oncogenesis. In our newly developed<br />
Petri net model, GSK3 inhibition leads to significantly higher pathway activation (high β-catenin levels) than<br />
Wnt stimulation, which we confirmed experimentally with TCF/LEF luciferase reporter assays. Using this validated model<br />
we can now simulate changes in Wnt/β-catenin signaling resulting from different mutations found in breast and<br />
colorectal cancer. We propose that this model can be used further to investigate different players affecting Wnt/β-catenin<br />
signaling during oncogenic transformation and the effect of drug treatment.<br />
WNT/Β-CATENIN<br />
Wnt/β-catenin signaling is important for stem cell<br />
maintenance and developmental processes and is highly<br />
conserved in all multicellular organisms (1, 2). The<br />
pathway regulates the expression of specific target genes<br />
by changing the levels of the transcriptional co-activator<br />
β-catenin, which activates the TCF/LEF transcription<br />
factors. Wnt/β-catenin signaling is active in stem cells<br />
located in Wnt-rich environments.<br />
APC and AXIN are key proteins of the destruction<br />
complex, which targets β-catenin for destruction.<br />
Mutations in APC, AXIN and β-catenin play important<br />
roles in oncogenesis (2, 3). To better understand the role of the<br />
pathway in oncogenesis, we create a Petri net (PN) model of the<br />
Wnt/β-catenin signaling pathway that uses available<br />
coarse-grained data, such as binary interactions and<br />
semi-quantitative protein levels. Using this model and<br />
validation experiments, we show how different strengths of<br />
Wnt stimulation and GSK3 inhibition activate signaling<br />
over time.<br />
PETRI-NET MODELLING<br />
We built a PN model of Wnt/β-catenin signaling describing<br />
the logic of known (inter)actions, cf. our previous<br />
work (5). In a PN, a place represents an entity (e.g. gene),<br />
a transition indicates the activity occurring between the<br />
places (e.g. gene expression), and these are connected by<br />
directed edges called arcs that represent their interactions<br />
(e.g., activation of gene expression by a protein).<br />
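The place/transition/arc vocabulary can be made concrete with a toy example. The sketch below is not the authors' Wnt/β-catenin model: it defines a single hypothetical transition in which the destruction complex removes one β-catenin token per firing, and the dictionary layout (`consume`, `produce`) is an assumption made for illustration.<br />

```python
def is_enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n
               for place, n in transition["consume"].items())

def fire(marking, transition):
    """Return the new marking after consuming input and producing output tokens."""
    new_marking = dict(marking)
    for place, n in transition["consume"].items():
        new_marking[place] -= n
    for place, n in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking

# Hypothetical transition: the complex degrades beta-catenin and is itself kept
# (an input arc and an output arc of weight 1 on the same place).
degrade = {
    "consume": {"destruction_complex": 1, "beta_catenin": 1},
    "produce": {"destruction_complex": 1},
}
marking = {"destruction_complex": 1, "beta_catenin": 3}

steps = 0
while is_enabled(marking, degrade):
    marking = fire(marking, degrade)
    steps += 1
```

Starting the same net with zero tokens in `destruction_complex` leaves the transition disabled, so β-catenin tokens are never removed; this is the kind of qualitative behaviour a token-based model can express.<br />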
TRANSCRIPTION AND PROTEIN ASSAYS<br />
TCF/LEF transcription was measured by TOPFLASH<br />
reporter activity at several time points and at different<br />
concentrations of Wnt3a stimulation and GSK3 inhibition<br />
by CHIR99021. Active and total β-catenin (CTNNB1)<br />
levels were measured by Western blot.<br />
VALIDATED ACTIVATION & INHIBITION<br />
We simulate the model with initial Wnt and GSK3 token<br />
levels ranging from 0 to 5 to represent addition of Wnt and<br />
inhibition of GSK3. Figure 1 shows the four different β-<br />
catenin responses for Wnt addition (purple) and GSK3<br />
inhibition (green). At low GSK3 levels, β-catenin linearly<br />
increases, but at high GSK3 levels β-catenin remains low.<br />
At high Wnt levels, β-catenin shows a transient response,<br />
with the peak height increasing with Wnt levels. The<br />
increase of β-catenin is due to sequestration of AXIN to<br />
the cell membrane, which inactivates the destruction<br />
complex. Increase in β-catenin activates transcription of<br />
AXIN2 which triggers the negative feedback.<br />
FIGURE 1. Pathway response for different levels of Wnt and activity of<br />
GSK3. Adding Wnt activates the pathway transiently, whereas GSK3<br />
inhibition activates it permanently.<br />
TCF/LEF reporter assay validation experiments for both<br />
perturbations show that transcriptional activity of<br />
TCF/LEF is both dosage and time dependent,<br />
corresponding well with the model for GSK3 inhibition. Wnt3a stimulation,<br />
on the other hand, does activate expression, but we<br />
do not observe the β-catenin dosage or time effect<br />
predicted by our model. Measuring β-catenin by Western<br />
blot reveals a consistent increase upon pathway activation;<br />
however, protein levels and changes are at the limit of<br />
experimental sensitivity.<br />
In conclusion, our Petri net model recapitulates much of<br />
the known behavior of the Wnt/β-catenin pathway upon<br />
Wnt stimulation and GSK3 inhibition, and hints at<br />
subtleties in the mechanism that will help us gain further<br />
understanding of the role of this pathway in development<br />
and oncogenesis.<br />
REFERENCES<br />
1. Clevers & Nusse (2012) Cell. 149:1192-1205<br />
2. Holstein (2012) Cold Spring Harb Perspect Biol. 4:a007922<br />
3. MacDonald, Tamai & He (2009) Dev Cell. 17:9-26<br />
4. Klaus & Birchmeier (2008) Nat. Rev. Cancer. 8:387-398<br />
5. Bonzanni et al., (2009) Bioinformatics. 25:2049-2056<br />
Abstract ID: O8<br />
Oral presentation<br />
O8. RANKED TILING BASED APPROACH TO DISCOVERING PATIENT<br />
SUBTYPES<br />
Thanh Le Van 1,* , Jimmy Van den Eynden 3 , Dries De Maeyer 2 , Ana Carolina Fierro 5 , Lieven Verbeke 5 , Matthijs van<br />
Leeuwen 4 , Siegfried Nijssen 1,4 , Luc De Raedt 1 & Kathleen Marchal 5,6 .<br />
Department of Computer Science 1 , Centre of Microbial and Plant Genetics 2 , KULeuven, Belgium; Department of<br />
Medical Biochemistry, University of Gothenburg 3 , Sweden; Leiden Institute for Advanced Computer Science 4 ,<br />
Universiteit Leiden, The Netherlands; Department of Plant Biotechnology and Bioinformatics 5 , Department of<br />
Information Technology, iMinds 6 , Ghent University, Belgium. * thanh.levan@cs.kuleuven.be<br />
Cancer is a heterogeneous disease consisting of many subtypes that usually have both shared and distinguishing<br />
mechanisms. To derive good subtypes, it is essential to have a computational model that can score their homogeneity<br />
from different angles, for example, mutated pathways and gene expression. In this paper, we introduce our ongoing work<br />
which studies a constraint-based optimisation model to discover patient subtypes as well as their perturbed pathways<br />
from mutation, transcription and interaction data. We propose a way to solve the optimisation problem based on<br />
constraint programming principles. Experiments on a TCGA breast cancer dataset demonstrate the promise of the<br />
approach.<br />
INTRODUCTION<br />
Discovering patient subtypes and understanding their<br />
mechanisms are essential to provide precise treatments to<br />
patients. There have been efforts to understand how<br />
mutations give rise to subtypes, such as the work by Hofree et<br />
al. (2013). However, to the best of our knowledge,<br />
how to combine mutation and expression data to derive<br />
good subtypes is still an open question. Therefore,<br />
we study a new computational model that can discover<br />
subtypes as well as their specific mutated genes and<br />
expressed genes from mutation, transcription and<br />
interaction data.<br />
METHODS<br />
We conjecture that a subtype consists of a number of<br />
patients who have the same set of differentially expressed<br />
genes and a set of mutated genes that hit the same<br />
pathways.<br />
To find both the mutated and the expressed genes of patient subtypes,<br />
we extend our recent ranked tiling method (Le Van et al.,<br />
2014). Ranked tiling is a data mining method proposed to<br />
mine regions with high average rank values in a rank<br />
matrix. In this type of matrix, each row is a complete<br />
ranking of the columns. We find that rank matrices are a<br />
good abstraction for numeric data and are useful to<br />
integrate datasets that are at different scales.<br />
To apply the ranked tiling method, we first transform the<br />
given numeric expression matrix, where rows are<br />
expressed genes and columns are patients, into a ranked<br />
expression matrix. Then, we search for a region in the<br />
transformed matrix that has high average rank scores.<br />
Unlike the original ranked tiling method, however, we<br />
impose a further constraint that the columns (patients) of<br />
the region should also have a number of mutated genes<br />
that have high rank scores in a network with respect to a<br />
network model. We formalise this as a constraint<br />
optimisation problem and solve it with a constraint<br />
solver.<br />
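The rank transformation and the notion of a region with high average rank can be sketched as follows. This is only an illustration of the data representation: the search for the best region and the mutation constraint, which the authors delegate to a constraint solver, are not reproduced, ties are broken by column order as a simplifying assumption, and the expression values are made up.<br />

```python
def rank_rows(matrix):
    """Replace each row by the ranks of its values (1 = lowest value in the row)."""
    ranked = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for rank, j in enumerate(order, start=1):
            ranks[j] = rank
        ranked.append(ranks)
    return ranked

def average_rank(ranked, rows, cols):
    """Average rank inside the region defined by a set of rows and columns."""
    values = [ranked[r][c] for r in rows for c in cols]
    return sum(values) / len(values)

# Hypothetical expression matrix: rows are genes, columns are patients.
expression = [
    [5.0, 0.1, 4.2, 0.3],
    [9.1, 1.0, 8.7, 2.2],
    [0.2, 3.3, 0.1, 3.1],
]
ranked = rank_rows(expression)
tile_score = average_rank(ranked, rows=[0, 1], cols=[0, 2])
```

Here genes 0 and 1 are ranked highest in patients 0 and 2, so that region has a high average rank and would be a candidate tile.<br />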
RESULTS & DISCUSSION<br />
We apply our method to a TCGA breast cancer dataset and<br />
discover eight subtypes. Compared to PAM50 annotations,<br />
our method divides the Basal subtype into three sub-groups<br />
named S2, S3 and S6. The LumA subtype is divided into<br />
four smaller groups, namely S1, S4, S7 and S8. Finally, our<br />
method recovers the Her2 subtype in S5.<br />
To validate the mined subtypes in the patient dimension,<br />
we assume PAM50 annotations are true labels for them.<br />
Then, grouping patients into subtypes can be seen as a<br />
multi-class prediction problem, for which we can calculate<br />
the F1 score to measure the average accuracy. We also<br />
compare our scores with state-of-the-art methods, including<br />
iCluster+ (Mo et al., 2013), NBS (Hofree et al., 2013)<br />
and SNF (Wang et al., 2014). The results (not shown)<br />
indicate that our subtypes are more homogeneous than<br />
the ones produced by iCluster+ and NBS and are<br />
comparable to those by SNF.<br />
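The F1-based validation can be sketched as below. The sketch assumes that each discovered subtype has already been mapped to a PAM50 label (how the authors perform that mapping is not stated) and uses the macro-averaged F1 score as one plausible reading of "average accuracy"; the labels are invented.<br />

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 scores."""
    f1_values = []
    for c in sorted(set(true_labels)):
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)

# Hypothetical PAM50 annotations and subtype-derived predictions.
pam50 = ["Basal", "Basal", "LumA", "LumA", "Her2"]
mined = ["Basal", "LumA", "LumA", "LumA", "Her2"]
score = macro_f1(pam50, mined)
```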
To validate the mined subtypes in the gene dimension, we<br />
perform hypergeometric tests to see how their mutated genes<br />
and expressed genes are related to cancer pathways.<br />
Figure 1 is a heatmap showing the log10 p-values<br />
of the tests. In this figure, we can see that the discovered<br />
subtypes have specific perturbed pathways.<br />
FIGURE 1. Cancer pathway enrichment analysis using mined mutated<br />
genes and expressed genes of subtypes<br />
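Pathway enrichment of this kind is commonly assessed with a hypergeometric test; assuming that is the test meant here, a minimal sketch of the p-value computation looks as follows. The gene counts are invented and the function name is hypothetical.<br />

```python
from math import comb

def hypergeom_pvalue(overlap, subtype_genes, pathway_genes, universe):
    """P(X >= overlap) when `subtype_genes` genes are drawn without replacement
    from `universe` genes, of which `pathway_genes` belong to the pathway."""
    upper = min(subtype_genes, pathway_genes)
    total = comb(universe, subtype_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(universe - pathway_genes, subtype_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 3 subtype genes, 2 of which fall in a 4-gene pathway,
# out of a universe of 10 genes.
p = hypergeom_pvalue(overlap=2, subtype_genes=3, pathway_genes=4, universe=10)
```

Taking the base-10 logarithm of such p-values for every subtype and pathway would give a matrix that can be drawn as a heatmap.<br />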
REFERENCES<br />
Hofree et al., Nat Methods 10(11), 1108–15 (2013).<br />
Le Van et al., ECML/PKDD 2014 (2), 98–113 (2014).<br />
Mo et al., PNAS 110(11), 4245–50 (2013).<br />
Wang et al., Nat Methods 11(3), 333–7 (2014).<br />
Abstract ID: O9<br />
Oral presentation<br />
O9. DEVELOPMENT OF A DNA METHYLATION-BASED SCORE<br />
REFLECTING TUMOUR INFILTRATING LYMPHOCYTES<br />
Martin Bizet 1,2,3*# , Jana Jeschke 1# , Christine Desmedt 4 , Emilie Calonne 1 , Sarah Dedeurwaerder 1 ,<br />
Gianluca Bontempi 2,3 , Matthieu Defrance 1,2 , Christos Sotiriou 4 and Francois Fuks 1<br />
Laboratory of Cancer Epigenetics, Faculty of Medicine, Université Libre de Bruxelles 1 ; Interuniversity Institute of<br />
Bioinformatics in Brussels, Université Libre de Bruxelles & Vrije Universiteit Brussel 2 ; Machine Learning Group,<br />
Computer Science Department, Université Libre de Bruxelles, Brussels 3 ; Breast Cancer Translational Research<br />
Laboratory, Jules Bordet Institute, Université Libre de Bruxelles 4 ; # These authors contributed equally to this work;<br />
* mbizet@ulb.ac.be<br />
Tumour infiltrating lymphocytes (TIL) are increasingly recognised as one of the key features to predict outcome and<br />
therapy response in malignancies. However, measuring quantities of TIL remains challenging since it relies on subjective<br />
and spatially-restricted measurements from a pathologist. In this study we used genome-scale DNA-methylation profiles<br />
from breast tumours to develop a so-called MeTIL score, which reflects TIL level within whole-tumour samples. We<br />
demonstrate the robustness to noise of the MeTIL score using simulated data as well as the ability of the MeTIL score to<br />
sensitively measure TIL in patient samples and to improve prediction of outcome.<br />
INTRODUCTION<br />
Breast cancer (BC) is one of the most common and<br />
deadliest diseases in women from Western countries.<br />
Tumour infiltrating lymphocytes (TIL) have emerged as one of<br />
the key features to predict outcome and response to<br />
treatment in this disease [1]. However, the measurement of<br />
TIL levels remains challenging because it relies on manual<br />
readings of a tumour slide by a pathologist, which<br />
is subjective by nature and does not necessarily reflect the<br />
whole-tumour TIL content. In this study we took<br />
advantage of the high tissue-specificity of DNA-methylation<br />
patterns [2] to develop a so-called MeTIL<br />
score, which predicts the amount of lymphocytes within<br />
the tumour.<br />
METHODS<br />
The MeTIL score was developed in three key steps:<br />
1. We first used genome-scale DNA-methylation<br />
profiles from 11 cell lines (8 normal or<br />
cancerous breast epithelial and 3 T-lymphocyte lines)<br />
to extract 29 cytosines specifically unmethylated<br />
in T-lymphocytes (delta-beta < -0.8 and standard<br />
deviation between groups < 0.1).<br />
2. We then applied a cross-validated pipeline,<br />
combining mRMR feature selection and a random-forest<br />
algorithm, to 118 BC samples to extract a<br />
minimal set of cytosines whose methylation level<br />
is predictive of TIL quantities.<br />
3. Finally, we used a “normalised PCA” approach to<br />
compute a single MeTIL score from the<br />
individual methylation values.<br />
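The first selection step can be illustrated with a small filter. The sketch below makes two assumptions that the text does not settle: delta-beta is taken as the mean methylation in T-lymphocytes minus the mean in epithelial lines, and the standard-deviation criterion is applied within each group. Probe names and beta values are invented.<br />

```python
from statistics import mean, pstdev

def select_unmethylated(beta_lymph, beta_epith, delta_max=-0.8, sd_max=0.1):
    """Keep cytosines that are strongly less methylated in lymphocytes than in
    epithelial lines and that vary little within each group (assumed reading)."""
    selected = []
    for cpg in beta_lymph:
        delta = mean(beta_lymph[cpg]) - mean(beta_epith[cpg])
        stable = (pstdev(beta_lymph[cpg]) < sd_max
                  and pstdev(beta_epith[cpg]) < sd_max)
        if delta < delta_max and stable:
            selected.append(cpg)
    return selected

# Hypothetical beta values (fraction methylated) for three cytosines.
beta_lymph = {"cg_a": [0.05, 0.08, 0.04], "cg_b": [0.40, 0.50, 0.45],
              "cg_c": [0.02, 0.03, 0.01]}
beta_epith = {"cg_a": [0.92, 0.95, 0.90], "cg_b": [0.90, 0.95, 0.92],
              "cg_c": [0.60, 0.99, 0.95]}
markers = select_unmethylated(beta_lymph, beta_epith)
```

In this toy data only `cg_a` passes: `cg_b` fails the delta-beta threshold and `cg_c` varies too much among the epithelial lines.<br />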
The robustness of the relation between the MeTIL score<br />
and TIL levels was also assessed using the Spearman<br />
correlation computed from 10 000 simulations with<br />
varying proportions of TIL (Fig. 1B and C). The simulated<br />
data took two sources of noise into account:<br />
- Technical noise, modelled as Gaussian noise.<br />
- Perturbations due to the presence of other cell types<br />
within the tumour microenvironment that<br />
are not lymphocytic or epithelial, modelled by a<br />
methylation value sampled randomly from the<br />
array.<br />
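The simulation design can be sketched as a weighted mixture, following the description in the Figure 1C caption (lymphocyte, epithelial and other cell types weighted by their proportions, plus Gaussian noise). Everything numeric below is an assumption for illustration: the methylation levels, the uniform draw used for the unknown cell types, the noise level, and the stand-in score (one minus the mean marker methylation), which is not the normalised-PCA score of the abstract.<br />

```python
import random

def ranks(values):
    """Rank positions (1 = smallest); assumes no ties, as for continuous data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def simulate_marker(f1, f3, m1, m2, noise_sd, rng):
    """Methylation = f1*M1 + f2*M2 + f3*M3 + e, with M3 drawn at random."""
    f2 = 1.0 - f1 - f3
    m3 = rng.random()  # assumed: unknown cell types get a uniform random value
    return f1 * m1 + f2 * m2 + f3 * m3 + rng.gauss(0.0, noise_sd)

rng = random.Random(1)
til_fractions = [rng.uniform(0.0, 0.5) for _ in range(200)]
scores = []
for f1 in til_fractions:
    markers = [simulate_marker(f1, 0.2, 0.05, 0.9, 0.02, rng) for _ in range(5)]
    scores.append(1.0 - sum(markers) / len(markers))  # stand-in score
rho = spearman(til_fractions, scores)
```

Repeating this for a grid of noise levels and unknown-cell-type proportions would produce a correlation map of the kind shown in Figure 1B.<br />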
Lastly, we measured TIL quantities with the MeTIL score<br />
in three independent BC cohorts and applied Cox<br />
regression models to evaluate the prognostic value of the<br />
MeTIL score.<br />
RESULTS & DISCUSSION<br />
We first applied a hierarchical clustering analysis and<br />
observed that BC samples with high TIL infiltration show<br />
a hypomethylated pattern for all MeTIL markers (Fig.1A).<br />
Furthermore, we demonstrated, using simulations, a strong<br />
correlation between the MeTIL score and TIL levels, even<br />
when a high level of noise (0.7 times the standard deviation)<br />
and a high proportion of perturbing unknown cell types<br />
(70%) were included in the model (Fig. 1B).<br />
FIGURE 1. The MeTIL score reflects TIL levels. (A) Heatmap showing the<br />
methylation values of the 5 MeTIL markers. A ‘TIL high’ group with a<br />
hypomethylated pattern (orange) appeared. (B) Color-map of the<br />
Spearman correlation between MeTIL score and TIL level for increasing<br />
noise (y-axis) and abundance of unknown cell types (x-axis) based on<br />
simulations. (C) Methylation value of each MeTIL marker was simulated<br />
as the sum of the methylation level in lymphocyte (M1), epithelial cell<br />
(M2) and other cell types (random value M3) weighted by their<br />
proportion in the tissue (f1, f2, f3), plus a Gaussian noise term (e).<br />
Finally, we observed consistent patterns of TIL levels<br />
within BC subtypes in independent cohorts suggesting the<br />
robust nature of our score to evaluate TIL levels.<br />
Furthermore, Cox regression analysis revealed a<br />
prognostic value for the MeTIL score in triple negative<br />
and luminal BC (p-value < 0.05).<br />
REFERENCES<br />
[1] Loi, S., et al. Annals of Oncology: official journal of the European Society for Medical Oncology /<br />
ESMO 25, 1544-1550 (2014).<br />
[2] Jeschke, J., Collignon, E., Fuks, F. FEBS J. 282(9), 1801-1814 (<strong>2015</strong>).<br />
Abstract ID: O10<br />
Oral presentation<br />
O10. PREDICTION OF CELL RESPONSES TO SURFACE TOPOGRAPHIES<br />
USING MACHINE LEARNING TECHNIQUES<br />
Aliaksei S Vasilevich 1 *, Shantanu Singh 2 , Aurélie Carlier 1 & Jan de Boer 1 .<br />
Laboratory for Cell Biology-inspired Tissue Engineering, MERLN Institute, Maastricht University 1 ; Imaging Platform,<br />
Broad Institute of MIT and Harvard 2 . *a.vasilevich@maastrichtuniversity.nl<br />
Topographical cues have been repeatedly shown to influence cell fate dramatically (Bettinger et al., 2009). This<br />
phenomenon opens new opportunities to design the interaction between biomaterials and biological tissues in a<br />
predictable manner. Unfortunately, the exact mechanism of topographical control of cell behavior remains largely<br />
unknown. We have therefore developed a technology in our laboratory to determine an optimal surface topography for<br />
virtually any application in the biomedical field. We previously reported that we can control cell shape with our surfaces<br />
in a predictable manner (Hulsman et al., <strong>2015</strong>). Here we demonstrate that we can successfully predict not only cell shape,<br />
but also cell response at the protein level based on the properties of our topographies. The results of our study show that we<br />
are able to design materials for biomedical applications that require a particular cell behavior.<br />
INTRODUCTION<br />
The TopoChip, a micro topography screening platform,<br />
enables the assessment of cell response to 2176 unique<br />
topographies in a single high-throughput screen. The<br />
topographical features were randomly selected from an in<br />
silico library of more than 150 million topographies,<br />
which were designed by an algorithm that synthesized<br />
patterns based on simple geometric elements – circles,<br />
triangles and rectangles (Unadkat et al., 2011). In our<br />
previous studies, we have demonstrated that these surface<br />
topographies exert a mitogenic effect on human mesenchymal<br />
stromal cells (hMSCs) (Unadkat et al., 2011) and affect cell shape (Hulsman et al.,<br />
<strong>2015</strong>). In this paper, we show that these topographies can<br />
also be used to modulate ALP expression in<br />
hMSCs, as well as pluripotency in<br />
human induced pluripotent stem (iPS) cells. We further<br />
show that computational models can be built to predict<br />
these protein levels using surface topography parameters.<br />
METHODS<br />
Cell response to topography was captured by high-content<br />
imaging. Using image analysis and data mining methods<br />
described previously (Hulsman et al., <strong>2015</strong>),<br />
multiparametric “profiles” of cellular response were<br />
obtained. Multiple replicates of each topography were<br />
used to estimate the median level of a cellular response of<br />
interest – either ALP in human mesenchymal stromal cells<br />
(hMSCs), or the median number of Oct4-positive cells in<br />
a population of human induced pluripotent stem cells<br />
(hIPSCs). We aimed to predict the cellular response based<br />
on surface topography parameters using machine learning<br />
methods. To train and validate these methods (specifically,<br />
classifiers), the data were split into training and testing<br />
sets in a 3:1 ratio. In the training step,<br />
we performed a 10-fold cross-validation to obtain optimal<br />
parameters for each classifier. The caret package (Kuhn,<br />
2008) in R (R Core Team, <strong>2015</strong>) was used to perform<br />
the analysis.<br />
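The validation scheme (a 3:1 split into training and test sets, followed by 10-fold cross-validation on the training part) can be sketched as index bookkeeping. The authors used the caret package in R; the Python sketch below only illustrates the splitting logic, with hypothetical function names and an illustrative sample count of 200.<br />

```python
import random

def train_test_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out a fraction for testing (3:1 split)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_folds(indices, k=10):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

train, test = train_test_split(200)
splits = list(k_folds(train))
```

Each candidate parameter setting would be scored on the ten validation folds, and only the chosen model would be evaluated once on the held-out `test` indices.<br />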
RESULTS & DISCUSSION<br />
In the first project, we conducted a screening on the<br />
TopoChip with hMSCs in order to find topographies that<br />
would be able to increase the ALP level, a protein that is<br />
an early marker of osteogenesis. We were able to<br />
successfully find such surfaces and confirm results<br />
experimentally (publication in preparation). Next, we<br />
assessed how accurately the ALP level in hMSCs can be<br />
predicted from topographical<br />
features. Focussing only on extreme examples, we<br />
selected 100 high- and low-scoring topographies and<br />
used the model validation scheme described in Methods to<br />
find the most accurate binary classifier for our data set.<br />
We tested several classifiers and identified random forest<br />
as the most accurate, with an accuracy of 96% on<br />
the held-out test set.<br />
In a second project, we aim to find a topography that will<br />
increase proliferation and pluripotency of hIPSCs. We<br />
used Oct4 as a marker of pluripotency. The screening was<br />
performed on one half of the TopoChip (1000+ surfaces),<br />
which were then ranked based on the number of Oct4-positive<br />
cells. One hundred high- and low-scoring surfaces<br />
were chosen to train a classifier. Using logistic regression,<br />
we obtained 72% accuracy on a held-out test set. We used<br />
this model to predict surfaces that would increase<br />
pluripotency in hIPSCs among surfaces that were not<br />
included in the initial screening. Topographies were<br />
ranked according to their predicted probability score and<br />
the top 30 surfaces were chosen for experimental validation.<br />
We found that 79% of the selected surfaces were predicted<br />
accurately.<br />
In summary, the combination of our screening methods<br />
and machine learning algorithms opens new avenues to<br />
design surfaces with desired properties for various<br />
applications. Our next step will be to find a surface with<br />
maximum ALP level from our virtual library based on our<br />
screening data.<br />
REFERENCES<br />
Bettinger C J, Langer R, & Borenstein J T. “Engineering Substrate<br />
Micro- and Nanotopography to Control Cell Function.” Angewandte<br />
Chemie (International ed. in English) 48.30 (2009).<br />
Hulsman M et al., Analysis of high-throughput screening reveals the<br />
effect of surface topographies on cellular morphology, Acta<br />
Biomaterialia, 15, (<strong>2015</strong>).<br />
Kuhn M. “Building Predictive Models in R Using the caret Package”<br />
Journal of Statistical Software, Vol. 28, (2008)<br />
R Core Team. R: A language and environment for statistical computing.<br />
R Foundation for Statistical Computing, Vienna, Austria. URL<br />
http://www.R-project.org/. (<strong>2015</strong>)<br />
Unadkat H V. et al. “An Algorithm-Based Topographical Biomaterials<br />
Library to Instruct Cell Fate.” Proceedings of the National Academy<br />
of Sciences of the United States of America 108.40 (2011).<br />
Abstract ID: O11<br />
Oral presentation<br />
O11. ANALYSIS OF MASS SPECTROMETRY QUALITY CONTROL METRICS<br />
Wout Bittremieux 1 , Pieter Meysman 1 , Lennart Martens 2 , Bart Goethals 1 , Dirk Valkenborg 3 & Kris Laukens 1 .<br />
Advanced Database Research and Modeling (ADReM) & Biomedical Informatics Research Center Antwerp (biomina),<br />
University of Antwerp / Antwerp University Hospital 1 ; Department of Biochemistry & Department of Medical Protein<br />
Research, Ghent University / VIB 2 ; Flemish Institute for Technological Research (VITO) 3 .<br />
* wout.bittremieux@uantwerpen.be<br />
Mass-spectrometry-based proteomics is a powerful analytical technique to identify complex protein samples; however,<br />
its results are still subject to large variability. Recently, several quality control metrics have been introduced to assess the<br />
performance of a mass spectrometry experiment. Unfortunately, these metrics are generally not thoroughly<br />
understood. For this reason, we present a few powerful techniques to analyse multiple experiments based on quality<br />
control metrics, identify low-performance experiments, and provide an interpretation of outlying experiments.<br />
INTRODUCTION<br />
Mass-spectrometry-based proteomics is a powerful<br />
analytical technique that can be used to identify complex<br />
protein samples. Despite many technological and<br />
computational advances, performing a mass spectrometry<br />
experiment is still a highly complicated task and its results<br />
are subject to a large variability. To understand and<br />
evaluate how technical variability affects the results of an<br />
experiment, lately several quality control (QC) and<br />
performance metrics have been introduced. Unfortunately,<br />
despite the availability of such QC metrics covering a<br />
wide range of qualitative information, a systematic<br />
approach to quality control is often still lacking.<br />
As most quality control tools are able to generate several<br />
dozens of metrics, any single experiment can be<br />
characterized by multiple QC metrics. Therefore it is<br />
often not clear which metrics are most interesting in<br />
general, or even which metrics are relevant in a specific<br />
situation. To take into account the multidimensional data<br />
space formed by the numerous metrics, we have applied<br />
advanced techniques to visualize, analyze, and interpret<br />
the QC metrics.<br />
METHODS<br />
Outlier detection can be used to detect deviating<br />
experiments with a low performance or a high level of<br />
(unexplained) variability. These outlying experiments can<br />
subsequently be analyzed to discover the source of the<br />
reduced performance and to enhance the quality of future<br />
experiments.<br />
However, it is insufficient to know that a specific<br />
experiment is an outlier; it is also of vital importance to<br />
know the reason. To understand why an experiment is an<br />
outlier, we have used the subspace of QC metrics in which<br />
the outlying experiment can be differentiated from the<br />
other experiments. This provides crucial information on<br />
how to interpret an outlier, which can be used by domain<br />
experts to increase interpretability and investigate the<br />
performance of the experiment.<br />
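The idea of explaining an outlier by the metrics in which it deviates can be illustrated with a small sketch.<br />
This is not the algorithm used in this work: per-metric robust z-scores (median and MAD) are a simple stand-in<br />
for subspace-based outlier explanation, and the metric names and the 3.5 threshold are hypothetical.<br />

```python
from statistics import median


def robust_z_scores(experiments):
    """Per-metric robust z-scores (median/MAD) for a list of metric dicts."""
    metrics = sorted(experiments[0])
    scores = [dict() for _ in experiments]
    for m in metrics:
        values = [e[m] for e in experiments]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # 1.4826 rescales the MAD to the standard deviation of a normal distribution.
        scale = 1.4826 * mad if mad > 0 else 1.0
        for s, v in zip(scores, values):
            s[m] = (v - med) / scale
    return scores


def explain_outliers(experiments, threshold=3.5):
    """Map each outlying experiment index to the metrics that make it an outlier."""
    result = {}
    for i, s in enumerate(robust_z_scores(experiments)):
        subspace = [m for m, z in s.items() if abs(z) > threshold]
        if subspace:
            result[i] = subspace
    return result


# Toy QC table: one dict per experiment (metric names are made up).
experiments = [
    {"ms1_scans": 100, "peak_width": 10.0},
    {"ms1_scans": 102, "peak_width": 10.5},
    {"ms1_scans": 98, "peak_width": 9.5},
    {"ms1_scans": 101, "peak_width": 10.2},
    {"ms1_scans": 99, "peak_width": 30.0},
]
outliers = explain_outliers(experiments)
print(outliers)
```

In this toy table only the last experiment is flagged, and the returned metric list plays the role of its explanatory subspace.<br />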
RESULTS & DISCUSSION<br />
Figure 1 shows an example of interpreting a specific<br />
experiment that has been identified as an outlier. As can<br />
be seen, two QC metrics mainly contribute to this<br />
experiment being an outlier. The explanatory subspace<br />
formed by these QC metrics can be extracted, which can<br />
then be interpreted by domain experts, resulting in insights<br />
into relationships between various QC metrics.<br />
FIGURE 1. QC metric importances for interpreting an outlying<br />
experiment.<br />
Next, by combining the explanatory subspaces for all<br />
individual outliers, it is possible to get a general view on<br />
which QC metrics are most relevant when detecting<br />
deviating experiments. When taking the various<br />
explanatory subspaces for all different outliers into<br />
account, a distinction between several of the outliers can<br />
be made in terms of the number of identified spectra<br />
(PSMs). As can be seen in Figure 2, for some specific QC<br />
metrics (highlighted in italics) the outliers result in a<br />
notably lower number of PSMs compared to the non-outlying<br />
experiments.<br />
Because monitoring a large number of QC metrics on a<br />
regular basis is often impractical, it is more convenient to<br />
focus on a small number of user-friendly, well-understood,<br />
and discriminating metrics. As the QC metrics highlighted<br />
in Figure 2 are shown to indicate low-performance<br />
experiments, these metrics are prime candidates to monitor<br />
on a continuous basis to quickly detect faulty experiments.<br />
FIGURE 2. Comparison of the number of PSMs between the non-outlying<br />
and the outlying experiments.<br />
Abstract ID: O12<br />
Oral presentation<br />
O12. XILMASS: A CROSS-LINKED PEPTIDE IDENTIFICATION ALGORITHM<br />
Şule Yılmaz 1,2,3* , Masa Cernic 4 , Friedel Drepper 5 , Bettina Warscheid 5 , Lennart Martens 1,2,3 & Elien Vandermarliere 1,2,3 .<br />
Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ;<br />
Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Department of Biochemistry, Molecular and<br />
Structural Biology, Jožef Stefan Institute, Ljubljana, Slovenia 4 ; Functional Proteomics and Biochemistry, Department of<br />
Biochemistry and Functional Proteomics, Institute for Biology II and BIOSS Centre for Biological Signaling Studies,<br />
University of Freiburg, Freiburg, Germany 5 . *sule.yilmaz@ugent.be<br />
Chemical cross-linking coupled with mass spectrometry (XL-MS) facilitates the determination of protein structure and<br />
the understanding of protein interactions. Current computational approaches rely on different strategies, and only a limited<br />
number of open-source, easy-to-use search algorithms are available. We therefore built a novel cross-linked peptide identification<br />
algorithm, called Xilmass, which uses a novel database construction and a new scoring function adapted from traditional<br />
database search algorithms. We compared the performance of Xilmass against pLink, one of the most popular publicly<br />
available algorithms, and Kojak, a recently published algorithm. We found that Xilmass identified 140 spectra,<br />
whereas Kojak and pLink identified 119 and 35, respectively. We mapped the cross-linking sites onto the structure, which<br />
resulted in the identification of 20 possible cross-linking sites. These findings show that Xilmass allows the identification<br />
of cross-linking sites.<br />
INTRODUCTION<br />
The structure of a protein is crucial for its functionality.<br />
Protein structure is commonly determined by X-ray<br />
crystallography or nuclear magnetic resonance (NMR). X-<br />
ray crystallography is only feasible for crystallizable<br />
proteins and NMR has a protein size limitation. Due to<br />
these restrictions, protein complexes are much more<br />
difficult to approach with these classical methods.<br />
However, chemical cross-linking of the complex coupled<br />
with mass spectrometry (XL-MS) allows the study of these<br />
protein complexes. The identification of the measured<br />
fragmentation spectra is a challenging task. One approach<br />
to identify cross-linked peptides is to linearize cross-linked<br />
peptide pairs in order to generate a database that can be<br />
searched with traditional search engines (Maiolica et al., 2007).<br />
However, a traditional search engine is not directly<br />
applicable to identify cross-linked peptides. Another<br />
approach is to rely on the usage of labeled cross-linkers,<br />
but this has a decreased performance when unlabeled<br />
cross-linkers are used. We therefore built an algorithm,<br />
Xilmass, which is designed for the identification of XL-MS<br />
fragmentation spectra without requiring linearization of peptides<br />
or labeled cross-linkers. We also<br />
introduced a new way of representing a cross-linked<br />
peptide database and directly implemented a new scoring<br />
function.<br />
METHODS<br />
The data sets were derived from human calmodulin (CaM)<br />
and the actin binding domain of plectin (plectin-ABD)<br />
which were cross-linked by DSS. The data sets were<br />
analyzed on a Velos Orbitrap Elite.<br />
Cross-linked peptides were identified by Xilmass, pLink<br />
(Yang et al., 2012) and Kojak (Hoopmann et al., 2015).<br />
The identifications of both Xilmass and Kojak were<br />
validated by Percolator (Käll et al., 2007) at q-value=0.05.<br />
pLink returned a validated list at FDR=0.05.<br />
The findings on cross-linking sites were validated with the<br />
aid of the available structures (Plectin PDB-entry: 4Q57<br />
and calmodulin PDB-entry: 2F3Y). The cross-linking sites<br />
were predicted by X-Walk (Kahraman et al., 2011) and<br />
PyMOL was used for the visualization.<br />
RESULTS & DISCUSSION<br />
We compared the number of identified spectra and cross-linking<br />
sites from Xilmass, pLink and Kojak. Xilmass<br />
identified 140 spectra, whereas Kojak and pLink identified<br />
119 and 35 spectra, respectively (at FDR=0.05). Xilmass<br />
identified 53 cross-linking sites from the 140 spectra, with<br />
37 supported by at least 2 peptide-to-spectrum matches<br />
(PSMs). Kojak identified more cross-linking sites (60);<br />
however, only 26 of them are supported by at least 2 PSMs.<br />
The cross-linking sites identified by Xilmass were<br />
manually verified on the structure (Figure 1). We classified<br />
each site as possible (Cα-Cα distance within<br />
30 Å; orange) or not-predicted (Cα-Cα distance<br />
exceeding 30 Å; blue); 20 sites were classified as possible. These findings show that Xilmass<br />
allows the identification of cross-linking sites.<br />
FIGURE 1. The identified cross-linking sites were mapped on the plectin<br />
protein structure to manually verify them (PDB-entry:4Q57)<br />
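The distance criterion used for the structural validation can be written down in a few lines. This is a minimal<br />
sketch, assuming the Cα coordinates (in Å) have already been read from the PDB entry; the function names and the<br />
example coordinates are hypothetical, and a straight-line Cα-Cα distance is used here.<br />

```python
import math

MAX_CA_DISTANCE = 30.0  # angstrom, the cutoff stated in the text


def ca_distance(ca_a, ca_b):
    """Euclidean distance between two C-alpha coordinates (x, y, z) in angstrom."""
    return math.dist(ca_a, ca_b)


def classify_site(ca_a, ca_b, cutoff=MAX_CA_DISTANCE):
    """Label a cross-linking site by its C-alpha to C-alpha distance."""
    return "possible" if ca_distance(ca_a, ca_b) <= cutoff else "not-predicted"


# Made-up coordinates for two residue pairs.
near_pair = ((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))   # 5 angstrom apart
far_pair = ((0.0, 0.0, 0.0), (0.0, 0.0, 31.0))   # 31 angstrom apart
print(classify_site(*near_pair), classify_site(*far_pair))
```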
REFERENCES<br />
Hoopmann,M.R. et al. Journal of Proteome Research, 14, 2190–2198<br />
(2015)<br />
Kahraman,A. et al. Bioinformatics, 27, 2163–2164 (2011)<br />
Käll,L. et al. Nature Methods, 4, 923–925 (2007)<br />
Maiolica,A. et al. Molecular & cellular proteomics:MCP, 6, 2200–2211<br />
(2007)<br />
Yang,B. et al. Nature Methods, 9, 904–906 (2012)<br />
Abstract ID: O13<br />
Oral presentation<br />
O13. AUTOMATED ANATOMICAL INTERPRETATION OF DIFFERENCES<br />
BETWEEN IMAGING MASS SPECTROMETRY EXPERIMENTS<br />
Nico Verbeeck 1* , Jeffrey Spraggins 2 , Yousef El Aalamat 3,4 , Junhai Yang 2 ,<br />
Richard M. Caprioli 2 , Bart De Moor 3,4 , Etienne Waelkens 5,6 & Raf Van de Plas 1,2 .<br />
Delft Center for Systems and Control (DCSC), Delft University of Technology 1 ; Mass Spectrometry Research Center<br />
(MSRC), Vanderbilt University 2 ; STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Dept.<br />
of Electrical Engineering (ESAT), KU Leuven 3 ; iMinds Medical IT, KU Leuven 4 ; Dept. of Cellular and Molecular<br />
Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . * n.verbeeck@tudelft.nl<br />
Imaging mass spectrometry (IMS) is a powerful molecular imaging technology that generates large amounts of data,<br />
making manual analysis often practically infeasible. In this work we aid the differential analysis of multiple IMS datasets<br />
by linking these data to an anatomical atlas. Using matrix factorization based multivariate analysis techniques, we are<br />
able to identify differential biomolecular signals between individual tissue samples in an obesity case study on mouse<br />
brain. The resulting differential signals are then automatically interpreted in terms of anatomical structures using a<br />
convex optimization approach and the Allen Mouse Brain Atlas. The automated anatomical interpretation facilitates<br />
much deeper exploration of these very rich data sets by the biomedical expert.<br />
INTRODUCTION<br />
Imaging Mass Spectrometry (IMS) is a relatively new<br />
molecular imaging technology that enables a user to<br />
monitor the spatial distributions of hundreds of<br />
biomolecules in a tissue slice simultaneously. This unique<br />
property makes IMS an immensely valuable technology in<br />
biomedical research. However, it also leads to very large<br />
amounts of data in a single analysis (e.g. >1 TB), making<br />
manual analysis of these data increasingly impractical. In<br />
order to aid the exploration of these data, we have recently<br />
developed a framework that integrates IMS data with an<br />
anatomical atlas. The framework uses the anatomical data<br />
in the atlas to automatically interpret the IMS data in terms<br />
of anatomical structures, and guides the user towards<br />
relevant findings within a single tissue section. In this<br />
work, we extend this framework towards the automated<br />
interpretation of biomolecular differences between<br />
multiple IMS datasets.<br />
METHODS<br />
We demonstrate our method on IMS data of multiple<br />
mouse brain sections, and use the Allen Mouse Brain<br />
Atlas as the curated anatomical data source that is linked<br />
to the MALDI-based IMS measurements. We spatially<br />
map the data of each individual IMS dataset to the<br />
anatomical atlas using both rigid and non-rigid registration<br />
techniques. This establishes a common reference space<br />
and allows for direct comparison of spatial locations<br />
between the different IMS datasets. Group Independent<br />
Component Analysis (GICA) is then used to automatically<br />
extract the differentially expressed biomolecular patterns,<br />
after which convex optimization is used to automatically<br />
interpret the differential components in terms of known<br />
anatomical structures (Verbeeck et al., 2014), directly<br />
listing the anatomical areas in which changes occur.<br />
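To make the interpretation step concrete, the sketch below expresses a spatial pattern as a non-negative<br />
combination of anatomical structure masks. Non-negative least squares is only one simple convex formulation and<br />
is an assumption here; the objective used in Verbeeck et al. (2014) may differ. The structure names, masks and<br />
pattern are toy data on a six-pixel "image".<br />

```python
def nnls_projected_gradient(masks, pattern, iterations=500):
    """Minimise 0.5 * ||sum_j w_j * masks[j] - pattern||^2 subject to w_j >= 0."""
    n = len(masks)
    # Gram matrix and right-hand side of the normal equations.
    gram = [[sum(a * b for a, b in zip(masks[i], masks[j])) for j in range(n)]
            for i in range(n)]
    rhs = [sum(a * p for a, p in zip(masks[i], pattern)) for i in range(n)]
    # The largest absolute row sum bounds the largest eigenvalue, so this step is safe.
    step = 1.0 / max(sum(abs(g) for g in row) for row in gram)
    w = [0.0] * n
    for _ in range(iterations):
        grad = [sum(gram[i][j] * w[j] for j in range(n)) - rhs[i] for i in range(n)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wi - step * gi) for wi, gi in zip(w, grad)]
    return w


# Toy atlas: each structure is a flattened binary mask (names are hypothetical).
names = ["cortex", "hippocampus", "thalamus"]
masks = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
pattern = [2.0, 2.0, 0.0, 0.0, 0.5, 0.5]  # differential component to interpret

weights = nnls_projected_gradient(masks, pattern)
ranked = sorted(zip(names, weights), key=lambda item: -item[1])
print(ranked)
```

Sorting the structures by weight gives the kind of ranked list of anatomical areas described above.<br />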
RESULTS & DISCUSSION<br />
We demonstrate our approach in an obesity case study on<br />
mouse brain. All tissue sections are cryosectioned at 10<br />
μm and thaw-mounted onto ITO-coated glass slides, after<br />
which they are sublimated with CMBT matrix. MALDI<br />
IMS images are collected using the Bruker 15T solariX<br />
FTICR MS with a spatial resolution of 50 μm, collecting<br />
approximately 35,000 pixels per experiment.<br />
The IMS data of the different experiments are registered to<br />
the anatomical reference space provided by the Allen<br />
Mouse Brain Atlas, establishing an inter-experiment<br />
study-wide reference space. Analysis of the IMS<br />
measurements using GICA reveals multiple biomolecular<br />
patterns that differentiate between the various dietary<br />
conditions examined by the study. The retrieved<br />
differentially expressed biomolecular patterns are then<br />
translated to combinations of anatomical structures using<br />
our convex optimization approach, similar to what a<br />
human investigator would do. This automated<br />
interpretation of inter-experiment differences can serve as<br />
a great accelerator in the exploration of IMS data, as it<br />
avoids the time- and resource-intensive step of having a<br />
histological expert manually interpret the differential<br />
patterns.<br />
FIGURE 1. Automated anatomical interpretation of a biomolecular<br />
pattern that is differentially expressed in coronal mouse brain sections<br />
between a high fat and a low fat diet in our obesity case study.<br />
REFERENCES<br />
Verbeeck, N. et al. Automated anatomical interpretation of ion<br />
distributions in tissue: linking imaging mass spectrometry to curated<br />
atlases. Anal. Chem. 86, 8974–8982 (2014).<br />
Abstract ID: O14<br />
Oral presentation<br />
O14. ENHANCEMENT OF IMAGING MASS SPECTROMETRY DATA<br />
THROUGH REMOVAL OF SPARSE INTENSITY VARIATIONS<br />
Yousef El Aalamat 1,2* , Xian Mao 1,2 , Nico Verbeeck 3 , Junhai Yang 4 , Bart De Moor 1,2 ,<br />
Richard M. Caprioli 4 , Etienne Waelkens 5,6 & Raf Van de Plas 3,4 .<br />
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />
Analytics, KU Leuven 1 ; iMinds Medical IT, KU Leuven 2 ; Delft Center for Systems and Control, Delft University of<br />
Technology 3 ; Mass Spectrometry Research Center (MSRC), Vanderbilt University 4 ; Department of Cellular and<br />
Molecular Medicine, KU Leuven 5 ; Sybioma, KU Leuven 6 . *yelaalam@esat.kuleuven.be<br />
Imaging mass spectrometry (IMS) is rapidly evolving as a label-free, spatially resolved molecular imaging tool for the<br />
direct analysis of biological samples. However, mass spectrometry (MS) measurements are subject to different types of<br />
noise. In IMS, one of the most abundant noise types in ion images is the presence of localized intensity spikes, also<br />
known as sparse intensity variations, which occur on top of the biological ion distribution pattern. In this study, we develop<br />
a method that addresses the issue of sparse intensity noise. We use low-rank approximations of the IMS data to separate<br />
and filter sparse intensity variations from the MS signals. The efficiency of the developed method is tested using MS<br />
measurements of coronal sections of mouse brain and strong de-noising performance is demonstrated both along the<br />
spatial and the spectral domain.<br />
INTRODUCTION<br />
Imaging mass spectrometry (IMS) provides unique<br />
capabilities for biomedical and biological research.<br />
However, its measurements tend to be subject to different<br />
types of noise. One of the more abundant noise types in<br />
IMS is localized intensity spikes, which can be seen as<br />
sparse intensity variations on top of the true biological ion<br />
patterns. This kind of noise can have a substantial impact,<br />
particularly on low ion intensity measurements where the<br />
signal-to-noise ratio (SNR) can be significantly affected.<br />
We present a method to filter sparse intensity variations<br />
from IMS data, and demonstrate its use to de-noise IMS<br />
measurements both along the spatial and the spectral<br />
domain.<br />
METHODS<br />
We introduce a de-noising algorithm based on low-rank<br />
approximation, a concept from linear algebra. The method<br />
can separate sparse intensity variations from biological<br />
and tissue sample patterns, which hold up across multiple<br />
ions and pixels. This approach decomposes IMS data into<br />
two parts, namely a structured data matrix and a sparse<br />
data matrix. Since the noise tends to be sparse in nature, it<br />
will have a propensity to be collected into the sparse data<br />
part. The structured part tends to capture the de-noised<br />
IMS signals, effectively de-noising the ion images and the<br />
spectral profiles in the process. This de-noising method<br />
allows us to automatically filter sparse intensity variations<br />
from the underlying tissue signal without requiring any<br />
parameter tuning.<br />
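One widely used formulation with this structured-plus-sparse shape is principal component pursuit (robust PCA).<br />
Whether the method above uses exactly this objective is not stated here, so the following is an illustrative<br />
assumption. With M the IMS data matrix (pixels by m/z bins), L the structured (low-rank) part and S the sparse part:<br />

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \,\|S\|_{1}
\quad \text{subject to} \quad M = L + S,
\qquad \lambda = \frac{1}{\sqrt{\max(n_{\text{pixels}},\; n_{m/z})}}
```

Here the nuclear norm of L is the sum of its singular values and the norm of S is the entrywise l1 norm. The value<br />
of λ shown is the standard default for this formulation, which is one way such a decomposition can run without<br />
manual parameter tuning.<br />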
RESULTS & DISCUSSION<br />
The filter method is demonstrated on two IMS<br />
experiments (one lipid-focused and one protein-focused)<br />
acquired from coronal sections of mouse brain. For the<br />
protein experiment, the tissue section was coated with<br />
sinapinic acid, and measurements were acquired using a<br />
Bruker AutoFlex MALDI-TOF/TOF in positive linear<br />
mode at a spatial resolution of 100 μm and with a mass<br />
range extending from m/z 3000 to 22000. For the lipid<br />
experiment, the tissue section was sublimated with 1,5-<br />
diaminonaphthalene, and the measurements were acquired<br />
using a Bruker AutoFlex MALDI-TOF/TOF in negative<br />
reflectron mode at a spatial resolution of 80 μm and with a<br />
mass range extending from m/z 400 to 1000. The case<br />
studies demonstrate robust de-noising performance,<br />
retrieving the underlying tissue signal efficiently and<br />
consistently using the structured data matrix. On the<br />
spatial side, we observe a clean-up effect in the spatial<br />
distributions of both high- and low-intensity ions. The<br />
effect is especially impactful for low-intensity ions,<br />
showing a strong increase in the amount of spatial<br />
structure that can be retrieved from low SNR<br />
measurements and revealing patterns that would have<br />
gone unnoticed otherwise. On the spectral side, we<br />
observe an improved SNR after applying the method.<br />
Thus, at the cost of computational analysis, the de-noising<br />
method described here provides a means of increasing the<br />
amount of information that can be extracted from an IMS<br />
experiment, without requiring user interaction or<br />
additional measurement.<br />
FIGURE 1. Impact on both spatial and spectral domain. Top: example of<br />
de-noised ion image. Bottom: plot of a spectrum before (blue) and after<br />
(red) removal of sparse intensity variations.<br />
Abstract ID: O15<br />
Oral presentation<br />
O15. DETERMINANTS OF COMMUNITY STRUCTURE<br />
IN THE PLANKTON INTERACTOME<br />
Gipsi Lima-Mendez 1,2* , Karoline Faust 1,2,3 , Nicolas Henry 4 , Johan Decelle 4 , Sébastien Colin 4 , Fabrizio Carcillo 2,3,5 ,<br />
Simon Roux 6 , Gianluca Bontempi 5 , Matthew B. Sullivan 6 , Chris Bowler 7 , Eric Karsenti 7,8 , Colomban de Vargas 4 &<br />
Jeroen Raes 1,2 .<br />
Department of Microbiology and Immunology, Rega Institute KU Leuven 1 ; VIB Center for the Biology of Disease 2 ;<br />
Laboratory of Microbiology, Vrije Universiteit Brussel, Belgium 3 ; CNRS, UMR 7144, Station Biologique de Roscoff 4 ;<br />
Interuniversity Institute of Bioinformatics in Brussels (IB)², Machine Learning Group, Université Libre de Bruxelles 5 ;<br />
Department of Ecology and Evolutionary Biology, University of Arizona, USA 6 ; Ecole Normale Supérieure, Institut de<br />
Biologie (IBENS), France 7 ; European Molecular Biology Laboratory 8 .*Gipsi.limamendez@vib-kuleuven.be<br />
Identifying the abiotic and biotic factors that shape species interactions is a fundamental yet unsolved goal in ecology.<br />
Here, we integrate organismal abundances and environmental measures from Tara Oceans to reconstruct the first global<br />
photic-zone co-occurrence network. Environmental factors are incomplete predictors of community structure. Putative<br />
biotic interactions are non-randomly distributed across phylogenetic groups, and show both local and global patterns.<br />
Known and novel interactions were identified among grazers, primary producers, viruses and symbionts. The high<br />
prevalence of parasitism suggests that parasites are important regulators in the ocean food web. Together, this effort<br />
provides a foundational resource for ocean food web research and integrating biological components into ocean models.<br />
INTRODUCTION<br />
Determining the relative importance of both biotic and<br />
abiotic processes represents a grand challenge in ecology.<br />
Here we analyze sequence data on plankton organisms and<br />
environmental data from the Tara Oceans project. We<br />
applied network inference methods to construct a global-ocean<br />
cross-kingdom species interaction network and<br />
disentangled the biotic and abiotic signals shaping this<br />
interactome (Lima-Mendez et al., 2015).<br />
METHODS<br />
Methods are described in detail in Lima-Mendez et al.<br />
(2015). Briefly:<br />
Network inference. Taxon-taxon networks were<br />
constructed as in (Faust, et al., 2012), selecting<br />
Spearman and Kullback-Leibler dissimilarity.<br />
Edges with merged multiple-test-corrected p-<br />
values below 0.05 were kept. Taxon-environment<br />
networks were computed with the same<br />
procedure and merged with taxon-taxon networks<br />
for environmental triplet detection.<br />
Indirect taxon edge detection. For each triplet<br />
consisting of two taxa and one environmental<br />
parameter, we computed the interaction<br />
information (II) and taxon edges were considered<br />
indirect when II
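The correlation part of the network inference step can be sketched as follows. This is a simplified illustration<br />
only: it computes Spearman correlations between taxon abundance profiles and keeps pairs above a hypothetical<br />
cutoff, whereas the procedure described above combines two measures and filters on merged, multiple-test-corrected<br />
p-values. The taxon names and abundances are made up.<br />

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cooccurrence_edges(abundances, cutoff=0.8):
    """Return (taxon_a, taxon_b, rho) for pairs with |rho| >= cutoff."""
    taxa = sorted(abundances)
    edges = []
    for i, a in enumerate(taxa):
        for b in taxa[i + 1:]:
            rho = spearman(abundances[a], abundances[b])
            if abs(rho) >= cutoff:
                edges.append((a, b, rho))
    return edges


# Abundance of each taxon across five samples (toy data).
abundances = {
    "taxon_A": [1, 2, 3, 4, 5],
    "taxon_B": [2, 4, 6, 8, 50],
    "taxon_C": [3, 1, 4, 2, 5],
}
edges = cooccurrence_edges(abundances)
print(edges)
```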
Abstract ID: O16<br />
Oral presentation<br />
O16. BIOINFORMATICS TOOLS FOR ACCURATE ANALYSIS OF AMPLICON<br />
SEQUENCING DATA FOR BIODIVERSITY ANALYSIS<br />
Mohamed Mysara 1-3 , Yvan Saeys 4,5 , Natalie Leys 1 , Jeroen Raes 2,6 & Pieter Monsieurs 1* .<br />
Unit of Microbiology, Belgian Nuclear Research Centre SCK•CEN, Mol, Belgium 1 ; Department of Bioscience<br />
Engineering, Vrije Universiteit Brussel VUB, Brussels, Belgium 2 ; Department of Structural Biology, Vlaams Instituut<br />
voor Biotechnologie VIB, Brussels, Belgium 3 ; Data Mining and Modeling Group, VIB Inflammation Research Center,<br />
Ghent, Belgium 4 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 5 ; Department of<br />
Microbiology and Immunology, REGA institute, KU Leuven, Belgium 6 . * pmonsieu@sckcen.be<br />
High-throughput sequencing technologies have created a wide range of new applications, also in the field of microbial<br />
ecology. Yet when used in 16S rRNA biodiversity studies, they suffer from two important problems: the presence of PCR<br />
artefacts (called chimeras) and sequencing errors introduced by the sequencing technologies. In this work,<br />
three artificial-intelligence-based algorithms, CATCh, NoDe and IPED, are proposed to handle these two problems. A<br />
benchmarking study was performed comparing CATCh/NoDe (for 454 pyrosequencing) or CATCh/IPED (for Illumina<br />
MiSeq sequencing) with other state-of-the-art tools, showing a clear improvement in chimera detection and reduction of<br />
sequencing errors, respectively, and in general leading to more accurate clustering of the sequencing reads into Operational<br />
Taxonomic Units (OTUs). All algorithms are available via<br />
http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/.<br />
INTRODUCTION<br />
The revolution in new sequencing technologies has led to<br />
an explosion of possible applications, including new<br />
opportunities for microbial ecological studies via the<br />
usage of 16S rDNA amplicon sequencing. However,<br />
within such studies, all sequencing technologies suffer<br />
from the presence of erroneous sequences, i.e. (i) chimeras,<br />
introduced by incorrect target amplification in PCR, and (ii)<br />
sequencing errors originating from different factors during<br />
the sequencing process. As such, there is a need for<br />
effective algorithms to remove those erroneous sequences<br />
to be able to accurately assess the microbial diversity.<br />
METHODS<br />
First, a new algorithm called CATCh (Combining<br />
Algorithms to Track Chimeras) was developed by<br />
integrating the output of existing chimera detection tools<br />
into a new more powerful method. Second, NoDe (Noise<br />
Detector) was introduced, an algorithm that identifies and<br />
corrects erroneous positions in 454-pyrosequencing reads.<br />
Third, the IPED (Illumina Paired End Denoiser) algorithm<br />
was developed, the first tool in the field to handle error<br />
correction in Illumina MiSeq sequencing data. After<br />
identifying those positions likely to contain an error, those<br />
sequencing reads are subsequently clustered with correct<br />
reads resulting in error-free consensus reads. The three<br />
algorithms were benchmarked with state-of-the-art tools.<br />
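The final consensus step described above can be illustrated with a per-position majority vote. This sketch covers<br />
only that step and is not the NoDe or IPED implementation: it assumes the reads of one cluster are already aligned<br />
to the same length, and it does not model how error-prone positions are identified.<br />

```python
from collections import Counter


def consensus_read(reads):
    """Per-position majority vote over equal-length reads in one cluster."""
    if not reads:
        raise ValueError("cluster is empty")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Toy cluster: two correct reads and two reads with one substitution each.
cluster = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC"]
consensus = consensus_read(cluster)
print(consensus)
```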
RESULTS & DISCUSSION<br />
Via a comparative study with other chimera detection<br />
tools, CATCh was shown to outperform all other tools,<br />
increasing the sensitivity by up to 14% (see<br />
Figure 1).<br />
FIGURE 1. Plot indicating the effect of applying 5% indels (shown on the<br />
left) and 5% mismatches (shown on the right), on the performance of<br />
different chimera detection tools. CATCh was found to outperform other<br />
existing tools.<br />
Similarly, NoDe and IPED were benchmarked against<br />
other denoising algorithms, showing a significant<br />
reduction of the error rate of up to 55% and<br />
75%, respectively (see Figure 2). The combined effect of<br />
our algorithms for chimera removal and error correction<br />
also had a positive effect on the clustering of reads in<br />
operational taxonomic units (OTUs), with an almost<br />
perfect correlation between the number of OTUs and the<br />
number of species present in the mock communities.<br />
Indeed, when applying our improved pipeline containing<br />
CATCh and NoDe on a 454 pyrosequencing mock dataset,<br />
our pipeline could reduce the number of OTUs to 28 (i.e.<br />
close to 18, the correct number of species). In contrast,<br />
running the straightforward pipeline without our<br />
algorithms included would inflate the number of OTUs to<br />
98. Similarly, when tested on Illumina MiSeq sequencing<br />
data obtained for a mock community, using a pipeline<br />
integrating CATCh and IPED, the number of OTUs<br />
returned was 33 (i.e. close to the real number of 21<br />
species), while 86 OTUs were obtained using the default<br />
mothur pipeline.<br />
REFERENCES<br />
Mysara M., Leys N., Raes J., Monsieurs P. NoDe: a fast error-correction<br />
algorithm for pyrosequencing amplicon reads. BMC<br />
Bioinformatics, 16:88 (2015), p. 1-15. ISSN 1471-2105<br />
Mysara M., Saeys Y., Leys N., Raes J., Monsieurs P. CATCh, an<br />
Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing<br />
Studies. Applied and Environmental Microbiology, 81:5 (2015),<br />
p. 1573-1584. ISSN 0099-2240<br />
Abstract ID: O17<br />
Oral presentation<br />
O17. GENE CO-EXPRESSION ANALYSIS IDENTIFIES BRAIN REGIONS AND<br />
CELL TYPES INVOLVED IN MIGRAINE PATHOPHYSIOLOGY: A GWAS-<br />
BASED STUDY USING THE ALLEN HUMAN BRAIN ATLAS<br />
Sjoerd M.H. Huisman 1,2* , Else Eising 3 , Ahmed Mahfouz 1,2 , Lisanne Vijfhuizen 3 , International Headache Genetics<br />
Consortium, Boudewijn P.F. Lelieveldt 2 , Arn M.J.M. van den Maagdenberg 3,4 & Marcel J.T. Reinders 1 .<br />
DBL, Dept. of Intelligent Systems, Delft University of Technology, The Netherlands 1 ; LKEB, Dept. of Radiology, Leiden<br />
University Medical Center, The Netherlands 2 ; Dept. of Human Genetics, Leiden University Medical Center, The<br />
Netherlands 3 ; Dept. of Neurology, Leiden University Medical Center, The Netherlands 4 . * s.m.h.huisman@tudelft.nl<br />
Migraine is a common brain disorder, with a heritability of around 50%. To understand the genetic component of this<br />
disease, a large genome-wide association study has been carried out. Several loci were identified, but their interpretation<br />
remained challenging. We integrated the GWAS results with gene expression data, from healthy human brains, to<br />
identify anatomical regions and biological pathways implicated in migraine pathophysiology.<br />
INTRODUCTION<br />
Genome Wide Association Studies (GWAS) are<br />
frequently used to find common variants with small effect<br />
sizes. However, they often provide researchers with short<br />
lists of single nucleotide polymorphisms (SNPs) with<br />
uncertain connections to biological functions.<br />
We present an analysis of GWAS data for migraine, where<br />
the full list of SNP statistics is used to find groups of<br />
functionally related migraine-associated genes. To this<br />
end we make use of gene co-expression in the healthy<br />
human brain.<br />
We performed genome wide clustering of genes, followed<br />
by enrichment analysis for migraine candidate genes. In<br />
addition, we constructed local co-expression networks<br />
around high-confidence genes. Both approaches converge<br />
on distinct biological functions and brain regions of<br />
interest.<br />
METHODS<br />
Migraine GWAS data was obtained from the International<br />
Headache Genetics Consortium, with 23,285 cases and<br />
95,425 controls (Anttila et al., 2013). Genes were scored<br />
by SNP load and divided into high-confidence genes,<br />
migraine candidate genes, and non-migraine genes.<br />
Spatial gene expression data in the healthy adult human<br />
brain was obtained from the Allen Brain Institute<br />
(Hawrylycz et al., 2012). It contains microarray<br />
expression values of 3702 samples from 6 donors. Robust<br />
gene co-expressions were used to cluster genes into 18<br />
modules, which were then tested for enrichment of<br />
migraine candidate genes, and functionally characterized.<br />
In a second approach, local co-expression networks were<br />
built around the high-confidence migraine genes. These<br />
local networks were then compared to the modules of the<br />
first approach.<br />
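The module enrichment step can be illustrated with a one-sided hypergeometric test.<br />
This is a minimal sketch only: the abstract does not state which enrichment statistic was used,<br />
and the function name and its parameters are hypothetical.<br />

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_candidates, module_size, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    Assumed setting (not specified in the abstract): a module of
    `module_size` genes is treated as a draw without replacement from
    `n_genes` genes, of which `n_candidates` are migraine candidates.
    """
    upper = min(n_candidates, module_size)
    total = comb(n_genes, module_size)
    tail = sum(
        comb(n_candidates, k) * comb(n_genes - n_candidates, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 3 candidates, a module of 4 genes with 2 candidates.
p_value = hypergeom_enrichment_p(10, 3, 4, 2)
```

With 18 modules, the resulting p-values would additionally need a multiple-testing correction.<br />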
RESULTS & DISCUSSION<br />
The genome wide analysis revealed several modules of<br />
genes enriched in migraine candidates. Two modules have<br />
preferential expression in the cerebral cortex and are<br />
enriched in synapse-related annotations and neuron-<br />
specific genes. A third module contains oligodendrocyte<br />
genes and genes preferentially expressed in subcortical regions.<br />
The local co-expression networks of the second approach<br />
converge on the same pathways and expression patterns,<br />
even though the high-confidence genes lie mostly outside<br />
the modules of interest. The second approach thus serves<br />
as a control for the results of the first.<br />
FIGURE 1. The co-expression network around high confidence migraine<br />
genes of the second approach. Genes (and links between them) of the<br />
migraine modules of the first approach are coloured in red, yellow, blue,<br />
and green.<br />
The analyses confirm the previously observed link<br />
between migraine and cortical neurotransmission. They<br />
also point to the involvement of subcortical myelination,<br />
which is in line with recent tentative findings. These<br />
results show that more relevant information can be<br />
extracted from GWAS results, using (publicly available)<br />
tissue specific expression patterns.<br />
REFERENCES<br />
Anttila V. et al. Genome-wide meta-analysis identifies new susceptibility<br />
loci for migraine. Nat. Genet. 45, 912–7, (2013).<br />
Hawrylycz M.J. et al. An anatomically comprehensive atlas of the adult<br />
human brain transcriptome. Nature 489, 391–9, (2012).<br />
Abstract ID: O18<br />
Oral presentation<br />
O18. SPATIAL CO-EXPRESSION ANALYSIS OF STEROID RECEPTORS IN<br />
THE MOUSE BRAIN IDENTIFIES REGION-SPECIFIC REGULATION<br />
MECHANISMS<br />
Ahmed Mahfouz 1,2* , Boudewijn P.F. Lelieveldt 1,2 , Aldo Grefhorst 3 , Isabel M. Mol 4 , Hetty C.M. Sips 4 , José K. van den<br />
Heuvel 4 , Jenny A. Visser 3 , Marcel J.T. Reinders 2 , & Onno C. Meijer 4 .<br />
Department of Radiology, Leiden University Medical Center 1 ; Delft Bioinformatics Lab, Delft University of<br />
Technology 2 ; Department of Internal Medicine, Erasmus University Medical Center 3 ; Department of Internal Medicine,<br />
Leiden University Medical Center 4 . * a.mahfouz@lumc.nl<br />
Steroid hormones coordinate the activity of many brain regions by binding to nuclear receptors that act as transcription<br />
factors. This study uses genome wide correlation of gene expression in the mouse brain to discover 1) brain regions that<br />
respond in a similar manner to particular steroids, 2) signaling pathways that are used in a steroid receptor and brain<br />
region-specific manner, and 3) potential target genes and relationships between groups of target genes. The data<br />
constitute a rich repository for the research community to support new insights in neuroendocrine relationships, and to<br />
develop novel ways to manipulate brain activity in research or clinical settings.<br />
INTRODUCTION<br />
Steroid receptors are pleiotropic transcription factors that<br />
coordinate adaptation to different physiological states. An<br />
important target organ is the brain, but its complexity<br />
hampers the understanding of their modulation.<br />
METHODS<br />
We used the Allen Brain Atlas (ABA) (Lein et al., 2007),<br />
the most comprehensive repository of in situ<br />
hybridization-based gene expression in the adult mouse<br />
brain, to identify genes that have three-dimensional (3D)<br />
spatial gene expression profiles similar to steroid receptors.<br />
To validate the functional relevance of this approach, we<br />
analyzed the co-expression relationship of the<br />
glucocorticoid receptor (Gr) and estrogen receptor alpha<br />
(Esr1) and their known transcriptional targets in their<br />
brain regions of action. Next, we studied the region-specific<br />
co-expression of nuclear receptors and their co-regulators<br />
to identify potential partners mediating the<br />
hormonal effects on dopaminergic transmission. Finally,<br />
to illustrate the potential of using spatial co-expression to<br />
predict region-specific steroid receptor targets in the brain,<br />
we identified and validated genes that responded to<br />
changes in estrogen in the arcuate nucleus and medial<br />
preoptic area of the mouse hypothalamus.<br />
RESULTS & DISCUSSION<br />
For each steroid receptor, we ranked genes based on their<br />
spatial co-expression across the whole brain as well as in<br />
each of 12 brain structures separately.<br />
For each steroid receptor, strongly co-expressed genes<br />
within a brain region are likely related to the localized<br />
functional role of the receptor. For example, out of the top<br />
10 genes co-expressed with Esr1 across the whole brain, 4<br />
were previously shown to be regulated by Esr1 and/or<br />
estrogens in various tissues (Gpr101, Calcr, Ngb, and<br />
Gpx3).<br />
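The ranking described above can be sketched as follows. This is an illustration under assumptions:<br />
Pearson correlation is used as the co-expression measure (the abstract does not name the measure),<br />
and the function names, the dictionary layout and the `voxel_idx` parameter are hypothetical.<br />

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(vx * vy)


def rank_coexpressed(receptor_profile, profiles, voxel_idx=None):
    """Rank genes by spatial correlation with a receptor profile.

    profiles:  {gene: [expression per voxel]} (assumed layout)
    voxel_idx: optional voxel indices restricting the analysis to one
               brain structure; None means the whole brain.
    """
    def select(values):
        return values if voxel_idx is None else [values[i] for i in voxel_idx]

    reference = select(receptor_profile)
    scored = [(gene, pearson(reference, select(v))) for gene, v in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: four voxels, three genes.
toy_profiles = {"A": [2, 4, 6, 8], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]}
ranking = rank_coexpressed([1, 2, 3, 4], toy_profiles)
```

Passing the voxel indices of a single structure as `voxel_idx` gives the region-specific ranking.<br />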
We assessed the extent of co-expression of glucocorticoid<br />
(GC)-responsive genes (Datson et al., 2012) with Gr in the<br />
whole brain, the hippocampus and its substructures the<br />
dentate gyrus (DG) and the different subregions of the<br />
cornu ammonis (CA). GC-responsive genes were<br />
significantly co-expressed with Gr in the DG, but<br />
interestingly also in the whole brain and in the CA3 region<br />
(FDR-corrected p < 1.8×10⁻³; Mann-Whitney U-test).<br />
Similarly, a Mann-Whitney U-test showed that a set of 15<br />
genes that are sensitive to gonadal steroids (Xu et al.,<br />
2012) is significantly correlated to Esr1 across the whole<br />
brain (FDR-corrected p = 8.69×10⁻¹⁴), as well as in the<br />
hypothalamus (p = 3.85×10⁻¹⁰), the brain region<br />
responsible for sexual behavior in animals.<br />
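A minimal sketch of this kind of test is given below: a one-sided Mann-Whitney U-test comparing the<br />
correlations of a gene set with those of the remaining genes, followed by an FDR correction.<br />
The normal approximation (without tie correction of the variance) and the Benjamini-Hochberg procedure are<br />
assumptions made for illustration; the abstract does not specify either.<br />

```python
from math import erfc, sqrt


def midranks(values):
    """Ranks starting at 1, with ties receiving the average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic of x and one-sided p-value that x tends to exceed y."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, 0.5 * erfc(z / sqrt(2))  # upper-tail normal probability


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=pvals.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy correlations with the receptor: target gene set vs. background genes.
u_stat, p_one_sided = mann_whitney_u([5, 6, 7], [1, 2, 3, 4])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```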
In order to identify putative region-dependent co-regulators<br />
of steroid receptors, we analyzed the co-expression<br />
relationships of each steroid receptor and a<br />
set of 62 nuclear receptor co-regulators as present on a<br />
peptide array (Nwachukwu et al., 2014). We focused our<br />
analysis on well-established target regions of steroid<br />
hormone action, the dopaminergic brain regions (ventral<br />
tegmental area, VTA, and substantia nigra, SN). We found<br />
three co-regulators significantly co-expressed with the<br />
androgen receptor (Ar): Pnrc2, Pak6 and Trerf1,<br />
suggesting that these co-regulators may be involved in<br />
mediating Ar effects on dopaminergic transmission.<br />
In order to validate the predictive value of highly correlated<br />
expression with a steroid receptor, we analyzed the<br />
response of the top 10 genes that are strongly co-expressed<br />
with Esr1 in the hypothalamus to the estrogen<br />
diethylstilbestrol (DES) in castrated male mice using<br />
qPCR. We performed quantitative double in situ<br />
hybridization (dISH) for Esr1 and the six mRNAs (Irs4,<br />
Magel2, Adck4, Unc5, Ngb, and Gdpd2) that showed more<br />
than 1.3-fold enrichment in qPCR. We found that Irs4 and<br />
Magel2 mRNA were both significantly upregulated by<br />
DES treatment (1.9- and 2.4-fold, respectively).<br />
REFERENCES<br />
Lein E. et al. Nature 445, 168–76 (2007).<br />
Datson N. et al. Hippocampus 22, 359–71 (2012).<br />
Xu X. et al. Cell 148, 596–607 (2012).<br />
Nwachukwu J. et al. eLife 3, e02057 (2014).<br />
Abstract ID: O19<br />
Oral presentation<br />
O19. A SYSTEMS BIOLOGY COMPENDIUM FOR LEISHMANIA DONOVANI<br />
Bart Cuypers 1,2,3* , Pieter Meysman 1,2 , Manu Vanaerschot 3 , Maya Berg 3 , Malgorzata Domagalska 3 , Jean-Claude<br />
Dujardin 3,4# & Kris Laukens 1,2# .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />
Antwerpen (biomina) 2 ; Molecular Parasitology Unit, Department of Biomedical Sciences, Institute of Tropical Medicine,<br />
Antwerp 3 ; Department of Biomedical Sciences, University of Antwerp 4 . * bart.cuypers@uantwerpen.be # shared senior<br />
authors<br />
Leishmania donovani is the cause of visceral leishmaniasis in the Indian subcontinent and poses a threat to public health<br />
due to increasing drug resistance. Little is known about its very peculiar molecular biology and there has been little<br />
‘omics integration effort so far. Here we present an integrative database or ‘omics compendium that contains all<br />
genomics, transcriptomics, proteomics and metabolomics experiments that are currently publicly available for<br />
Leishmania donovani. Additionally, the user interface contains analysis tools for new datasets that use smart data mining<br />
strategies like frequent itemset mining to link results from different ‘omics layers.<br />
INTRODUCTION<br />
The protozoan parasite Leishmania donovani causes<br />
visceral leishmaniasis (VL), a life-threatening disease<br />
which affects 500 000 people each year. With only four<br />
drugs available and rapidly emerging drug resistance,<br />
knowledge about the parasite’s resistance mechanisms is<br />
essential to boost the development of new drugs. However,<br />
little is known about the gene regulation of<br />
Leishmania and the few findings indicate major<br />
differences to known gene expression systems. Indeed, no<br />
polymerase II promoters have ever been found in<br />
Leishmania [1]. Genes are constitutively transcribed in large<br />
polycistronic units and subsequently spliced into<br />
individual mRNAs (trans-splicing) [1]. A modified thymine,<br />
Base J, marks the end of transcription units and functions<br />
as a stop signal for the RNA polymerase [2]. Gene<br />
expression is then assumed to be regulated at the post-transcriptional<br />
level (mRNA stability, translation<br />
efficiency, epigenetic factors, etc.) but evidence to<br />
support this is scarce [1]. Integration of different ‘omics<br />
could shed light on these gene regulatory mechanisms, but<br />
there has been little integration effort so far.<br />
METHODS<br />
We developed an easy-to-use tool, able to import and<br />
connect all existing L. donovani ‘omics experiments.<br />
Genomics, epigenomics, transcriptomics, proteomics,<br />
metabolomics and phenotypic data were collected and<br />
added to a MySQL database compendium, further<br />
complemented with publicly available data. Relations<br />
between different ‘omics layers were explicitly defined<br />
and provided with a level of confidence. Python scripts<br />
were developed to preprocess, analyse and import the data.<br />
To allow comparability between different experiments,<br />
platforms and labs, the three integration principles of the<br />
COLOMBOS bacterial expression compendium were<br />
adapted [3]: 1) Use the same data-analysis pipeline for all<br />
data. 2) Work with contrasts to a control condition instead<br />
of expression values. 3) Annotate these contrasts in a<br />
unified and structured manner.<br />
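The second principle, working with contrasts rather than raw expression values, can be illustrated as follows.<br />
This is a sketch under assumptions: log2 ratios with a pseudocount are one common way to express a contrast,<br />
and the function name, data layout and condition labels are hypothetical rather than taken from the compendium.<br />

```python
from math import log2


def contrasts_to_control(expression, control, pseudocount=1.0):
    """Convert per-condition values into log2 contrasts against a control.

    expression: {condition: {gene: value}} (assumed layout)
    control:    name of the reference condition
    The pseudocount avoids division by zero for unexpressed genes.
    """
    reference = expression[control]
    contrasts = {}
    for condition, values in expression.items():
        if condition == control:
            continue
        contrasts[f"{condition}_vs_{control}"] = {
            gene: log2((values[gene] + pseudocount) / (reference[gene] + pseudocount))
            for gene in reference
            if gene in values
        }
    return contrasts


# Toy data: one wild-type control and one drug-resistant condition.
toy_expression = {
    "wt": {"g1": 3.0, "g2": 7.0},
    "resistant": {"g1": 15.0, "g2": 1.0},
}
toy_contrasts = contrasts_to_control(toy_expression, control="wt")
```

Because each contrast is relative to a control measured in the same experiment, values from different platforms and labs become more comparable than raw intensities or counts.<br />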
Next to this vast data source, a set of integrative data-analysis<br />
tools was developed based on data mining<br />
strategies. For example, one tool uses frequent itemset<br />
mining algorithms to detect which proteins and<br />
metabolites frequently exhibit the same behaviour under<br />
different conditions. Another tool converts several ‘omics<br />
layers to a network format that can be opened in<br />
Cytoscape and can thus be the basis for network analysis.<br />
The Django and Twitter Bootstrap frameworks were used<br />
to create a web portal to make the tools accessible to any<br />
Leishmania researcher.<br />
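The idea behind frequent itemset mining across ‘omics layers can be sketched as below. Each condition is treated as a<br />
transaction of observed changes, and item combinations that recur in enough conditions are reported.<br />
This is a brute-force enumeration for illustration, not Apriori or whichever algorithm the compendium uses, and the item<br />
labels and function name are hypothetical.<br />

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count item combinations and keep those seen in >= min_support transactions.

    transactions: iterable of item collections, one per condition, e.g.
                  {"protA:up", "metX:up"} (hypothetical labels).
    max_size:     largest combination size to enumerate; brute force is
                  only practical for small transactions.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Toy data: four conditions with protein and metabolite changes.
toy_transactions = [
    {"protA:up", "metX:up"},
    {"protA:up", "metX:up", "protB:down"},
    {"protA:up", "metX:up"},
    {"protB:down"},
]
frequent = frequent_itemsets(toy_transactions, min_support=3)
```

In this toy example the pair of `protA:up` and `metX:up` is frequent, which is the kind of protein-metabolite link the tool is meant to surface.<br />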
RESULTS & DISCUSSION<br />
Excellent public gene, protein and metabolite annotation<br />
databases for Leishmania and related species are already<br />
available (e.g. TriTrypDB and GeneDB). However, the<br />
strength of our tool is that it links these annotation data to<br />
‘omics experiments that are either provided by the user or<br />
publicly available. New experiments can quickly<br />
be preprocessed, analysed and integrated in the database<br />
via its Python back end. The compendium is therefore not<br />
only a look-up tool (e.g. under which conditions is this<br />
gene or metabolite upregulated?), but also offers intelligent<br />
data mining tools to analyse user-provided data<br />
(e.g. which metabolites/genes are typically<br />
upregulated in drug-resistant strains?). These new<br />
experiments provide additional confidence and<br />
information about the biological entities in the database.<br />
Unlike many other databases, the compendium has an<br />
elaborate quality control system. Every result provided by<br />
the tools can be traced back to the experimental data,<br />
which contains the necessary quality control plots to<br />
support the experiment’s validity. Additionally, it contains<br />
all relevant information about the extractions and the<br />
origin of the biological material.<br />
Using the compendium and its tools, we characterized the<br />
development and drug resistance of Leishmania donovani<br />
in a systems biology context. The genomes of more<br />
than 200 strains were examined for associations with<br />
phenotypical features and a subset was linked to<br />
transcriptomics, proteomics and metabolomics results. The<br />
compendium and its scripts were designed to be generic<br />
and can therefore be used for other organisms with only<br />
minor changes.<br />
REFERENCES<br />
1. Donelson, J. (1999) PNAS. 96, 2579–258.<br />
2. Van Luenen, H. G. A. M. et al. (2012) Cell. 150, 909–21.<br />
3. Meysman, P. et al. (2014) Nucleic Acids Research. 42, D649–D653.<br />
Abstract ID: O20<br />
Oral presentation<br />
O20. MULTI-OMICS INTEGRATION: RIBOSOME PROFILING<br />
APPLICATIONS<br />
Volodimir Olexiouk 1 , Elvis Ndah 1 , Sandra Steyaert 1 , Steven Verbruggen 1 , Eline De Schutter 1 , Alexander Koch 1 , Daria<br />
Gawron 2 , Wim Van Criekinge 1 , Petra Van Damme 2 , Gerben Menschaert 1,* .<br />
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and<br />
Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 ; Dept. Medical Protein Research, VIB-Ghent<br />
University 2 . * Gerben.menschaert@ugent.be<br />
Ribosome profiling is a relatively new NGS technology that enables the monitoring of the in vivo synthesis of mRNA-encoded<br />
translation products measured at the genome-wide level. The technique, also sometimes referred to as RIBOseq,<br />
uses the property of translating ribosomes to protect mRNA fragments from nuclease digestion and allows the determination of the<br />
genomic positions of translating ribosomes with sub-codon to single-nucleotide precision. Since the advent of the<br />
technology, several bioinformatics solutions have been devised to investigate this type of data. Here we will present<br />
several solutions to detect novel proteoforms by combining RIBOseq and mass spectrometry data, to detect putatively<br />
coding small open reading frames (sORFs), and to evaluate the impact of DNA and RNA methylation on the translation<br />
level.<br />
INTRODUCTION<br />
Integration of different OMICS technologies is routinely<br />
adopted to investigate biological systems. Our lab focuses<br />
on high-throughput data analysis and the development of<br />
novel data integration methodologies. Currently, our focus<br />
is on ribosome profiling (Ingolia et al., 2011), an NGS-based<br />
technique to measure the so-called translatome (i.e.<br />
the mRNA that shows ribosome occupancy). This<br />
technique is applied in combination with other<br />
protocols to measure expression (RNAseq) and<br />
translation (mass spectrometry) and to chart maps of<br />
regulatory elements such as DNA methylation (reduced<br />
representation bisulfite sequencing, RRBS) and RNA<br />
methylation (m6Aseq) to address several biological<br />
questions.<br />
METHODS<br />
For the integration of RIBOseq and mass spectrometry<br />
(MS), we devised a tool called PROTEOFORMER<br />
(www.biobix.be/proteoformer). This proteogenomics tool<br />
consists of several steps. It starts with the mapping of<br />
ribosome-protected fragments (RPFs) and quality control<br />
of subsequent alignments. It further includes modules for<br />
identification of transcripts undergoing protein synthesis,<br />
positions of translation initiation with sub-codon<br />
specificity and single nucleotide polymorphisms (SNPs).<br />
We used PROTEOFORMER to create protein sequence<br />
search databases from publicly available mouse and in-house<br />
performed human RIBOseq experiments and<br />
evaluated these with matching proteomics data (Crappé et<br />
al., 2015).<br />
Another pipeline based on RIBOseq data is built around<br />
the discovery of putatively coding small open reading<br />
frames (sORFs). Herein, the first step is to delineate<br />
sORFs based on RPF coverage throughout the coding<br />
sequence and at the translation initiation site. Afterwards,<br />
state-of-the-art tools and metrics assessing the coding<br />
potential of sORFs are implemented and a list of candidate<br />
sORFs for downstream analysis is compiled (e.g. MS-based<br />
identification).<br />
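The delineation step can be illustrated with a simplified sketch that scans the forward strand for ATG-initiated open<br />
reading frames and then measures how much of each candidate is covered by RPF reads.<br />
The length bounds of 10 to 100 codons, the function names and the per-nucleotide count layout are assumptions for<br />
illustration; the actual pipeline also uses coverage at the translation initiation site and handles transcript structure.<br />

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, min_codons=10, max_codons=100):
    """Return (start, end, frame) of forward-strand ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive, with the stop codon
    included. The length filter counts codons without the stop codon;
    its default bounds are an assumption, not taken from the abstract.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


def rpf_coverage(orf, rpf_counts):
    """Fraction of ORF nucleotides with at least one RPF read."""
    start, end, _ = orf
    covered = sum(1 for count in rpf_counts[start:end] if count > 0)
    return covered / (end - start)


# Toy transcript with one short ORF in frame 2 and reads on its first 9 nt.
toy_seq = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
toy_orfs = find_sorfs(toy_seq, min_codons=2)
toy_counts = [0, 0] + [1] * 9 + [0] * 7
toy_cov = rpf_coverage(toy_orfs[0], toy_counts)
```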
To assess the impact of DNA methylation at the<br />
translation level, a double knockout DNMT model was<br />
studied (WT and DNMT1 + 3B knockout HCT116 cell<br />
line). Genome-wide DNA methylation profiling was<br />
performed using RRBS, while ribosome profiling,<br />
quantitative shotgun and positional proteomics (N-terminal<br />
COFRADIC) were used to obtain protein<br />
expression data.<br />
An initial experiment to integrate m6Aseq (measuring the<br />
m6A epitranscriptome) and ribosome profiling has also<br />
been executed on HCT116 cells.<br />
RESULTS & DISCUSSION<br />
The RIBOseq-MS integration (through<br />
PROTEOFORMER) increases the overall protein<br />
identification rates by 3% and 11% (improved and new<br />
identifications) for human and mouse, respectively, and<br />
enables proteome-wide detection of 5’-extended<br />
proteoforms, upstream ORF (uORF) translation and near-cognate<br />
translation start sites. The PROTEOFORMER<br />
tool is available as a stand-alone pipeline and has been<br />
implemented in the Galaxy framework for ease of use.<br />
The sORF pipeline was tested and curated on three<br />
different cell lines (HCT116: human, E14 mESC: mouse,<br />
and S2: fruit fly). The public repository has been made<br />
available at www.sorfs.org (Olexiouk V. et al., in review),<br />
and so far includes the datasets mentioned above.<br />
In the study of the effect of DNA methylation at the<br />
proteome level in the DNMT double knockout, we found<br />
that the knockout cells show more significantly upregulated<br />
than downregulated genes and that these upregulated<br />
genes were characterized by higher levels of<br />
promoter methylation in the wild-type cells. Both the MS<br />
and RIBOseq analyses corroborated these findings.<br />
Preliminary results based on the m6A sequencing confirm<br />
previous findings on known m6A sequence motifs and<br />
enrichment of m6A sites in specific functional regions<br />
(around translation start sites and in 3’UTR regions).<br />
Moreover, after integrating m6A and RIBOseq data, some<br />
examples hint at an effect of m6A on ribosomal pausing.<br />
REFERENCES<br />
Ingolia N. et al. Cell 11;147(4):789-802 (2011).<br />
Crappé, J., Ndah, E. et al. NAR 11;43(5):e29 (2015).<br />
Abstract ID: O21<br />
Oral presentation<br />
O21. CLUB-MARTINI: SELECTING FAVORABLE INTERACTIONS<br />
AMONGST AVAILABLE CANDIDATES: A COARSE-GRAINED SIMULATION<br />
APPROACH TO SCORING DOCKING DECOYS<br />
Qingzhen Hou 1* , Kamil K. Belau 2 , Marc F. Lensink 3 , Jaap Heringa 1 & K. Anton Feenstra 1* .<br />
Center for Integrative Bioinformatics VU (IBIVU), VU University Amsterdam, De Boelelaan 1081A, 1081 HV<br />
Amsterdam, The Netherlands 1 ; Intercollegiate Faculty of Biotechnology, University of Gdańsk - Medical University of<br />
Gdańsk, Kładki 24, 80-822 Gdańsk, Poland 2 ; Institute for Structural and Functional Glycobiology (UGSF), CNRS<br />
UMR8576, FRABio FR3688, University Lille, 59000, Lille, France 3 .<br />
Protein-protein interactions (PPIs) play a central role in all cellular processes. Large-scale identification of native binding<br />
orientations is essential to understand the role of particular protein-protein interactions in their biological context. We<br />
estimate the binding free energy using coarse-grained simulations with the MARTINI forcefield, and use those to rank<br />
decoys for 15 CAPRI benchmark targets. In our top 100 and top 10 ranked structures, for the 'easier' targets that have<br />
many near-native conformations, we obtain a strong enrichment of acceptable or better quality structures; for the 'hard'<br />
targets with very few near-native complexes in the decoys, our method is still able to retain structures which have native<br />
interface contacts. Moreover, CLUB-MARTINI is rather precise for some targets and able to pinpoint near-native<br />
binding modes in top 1, 5, 10 and 20 selections.<br />
INTRODUCTION<br />
Measuring binding free energy is essential to understand the<br />
relevance of particular protein-protein interactions in their<br />
biological context. Moreover, at the atomic scale, molecular<br />
simulations give us insight into the physically realistic details<br />
of these interactions. In our recent study, we successfully<br />
applied coarse-grained molecular dynamics simulations to<br />
estimate binding free energy with accuracy similar to, and<br />
500-fold less compute time than, full atomistic simulation<br />
(May et al., 2014). The approach relied on the availability of<br />
crystal structures of the protein complex of interest. Here, we<br />
investigate the effectiveness of this approach as a scoring<br />
method to identify stable binding conformations among<br />
decoys from protein docking.<br />
We apply our method to rank more<br />
than 19 000 docked protein conformations, or ‘decoys’, for<br />
15 benchmark targets from the Critical Assessment of<br />
PRedicted Interactions (CAPRI) (Lensink & Wodak, 2014).<br />
METHODS<br />
For each target, the binding free energy of all decoys was<br />
calculated, using the MARTINI forcefield as introduced<br />
before (May et al., 2014). In short, for a set of closely spaced<br />
separation distances, we calculate the constraint force applied<br />
to maintain the set distance. Integrating this force yields a<br />
potential of mean force (PMF), from which the binding free<br />
energy is extracted as the highest minus the lowest value.<br />
Previously, for accuracy, we used up to 20 replicate<br />
simulations for each distance in the PMF, but for efficiency,<br />
here we use only a single replicate initially. We then selected<br />
the lowest-scoring half to run an additional four replicates to<br />
obtain better sampling and more accurate estimates of the<br />
binding free energy. In total, we used approximately 800 000<br />
core-hours of compute time.<br />
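The PMF construction described above can be sketched numerically as follows. This is an illustration under assumptions:<br />
trapezoidal integration of the mean constraint force over the separation distance, with hypothetical function names<br />
and toy values. The sign convention of the force is an assumption, but the highest-minus-lowest value does not depend on it.<br />

```python
def pmf_from_forces(distances, mean_forces):
    """Integrate mean constraint forces into a potential of mean force.

    distances:   increasing separation distances (e.g. in nm)
    mean_forces: time-averaged constraint force at each distance
    Uses the trapezoidal rule; the PMF is set to zero at the first distance.
    """
    pmf = [0.0]
    for i in range(1, len(distances)):
        dr = distances[i] - distances[i - 1]
        mean_f = 0.5 * (mean_forces[i] + mean_forces[i - 1])
        pmf.append(pmf[-1] - mean_f * dr)
    return pmf


def binding_free_energy(pmf):
    """Binding free energy estimate: highest minus lowest PMF value."""
    return max(pmf) - min(pmf)


# Toy profile: four distances with an attractive force in the middle.
toy_pmf = pmf_from_forces([0.0, 1.0, 2.0, 3.0], [0.0, -2.0, -2.0, 0.0])
toy_dg = binding_free_energy(toy_pmf)
```

Averaging the forces over several replicates before integration, as done for the better-scoring half of the decoys, reduces the noise in the estimate.<br />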
RESULTS & DISCUSSION<br />
We obtained strong enrichment of acceptable and high-quality<br />
structures in the TOP 100 based on our PMF free<br />
energies, as shown in Figure 1. We estimate the error of our<br />
energies to be significant. This can be improved by increasing<br />
sampling, but that remains very expensive.<br />
Moreover, for several targets, we can select near-native<br />
structures in top 1, top 5 and top 10 as shown in Table 1,<br />
which means that, overall, our method is rather precise. From<br />
estimates of the error, we expect we can improve accuracy by<br />
extending the amount of sampling done at each distance. In<br />
conclusion, our approach can find favorable interactions from<br />
available candidates produced by docking programs. To the<br />
best of our knowledge, this is the first time interaction free<br />
energy from a coarse-grained force field is used as a scoring<br />
method to rank docking solutions at a large scale.<br />
FIG. 1. Enrichment in percentage of acceptable or better structures. For each of the 13 targets with acceptable or better<br />
decoys, two columns (from left to right) stand for CAPRI Score_set and top 100 in our rank of binding free energy<br />
calculation. Red, orange and yellow represent the fractions of high, medium and acceptable quality structures over the<br />
number of all or selected docking decoys. The order (left to right) is based on the fraction of acceptable structures in<br />
each target (easy to difficult).<br />
Table 1. Successful selections of top-ranked structures.<br />
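Selection | Target | High | Medium | Acceptable | Total (%)<br />
TOP 1 | T47 | 1 | 0 | 0 | 100<br />
TOP 1 | T53 | 0 | 0 | 1 | 100<br />
TOP 5 | T47 | 3 | 2 | 0 | 100<br />
TOP 5 | T41 | 0 | 0 | 4 | 80<br />
TOP 5 | T53 | 0 | 0 | 3 | 60<br />
TOP 5 | T37 | 0 | 2 | 0 | 40<br />
TOP 10 | T47 | 7 | 3 | 0 | 100<br />
TOP 10 | T41 | 0 | 1 | 7 | 80<br />
TOP 10 | T53 | 0 | 1 | 5 | 60<br />
TOP 10 | T37 | 0 | 3 | 0 | 30<br />
TOP 10 | T50 | 0 | 0 | 1 | 10<br />
TOP 20 | T47 | 14 | 6 | 0 | 100<br />
TOP 20 | T41 | 0 | 4 | 13 | 85<br />
TOP 20 | T53 | 0 | 3 | 9 | 60<br />
TOP 20 | T37 | 0 | 4 | 2 | 30<br />
TOP 20 | T50 | 0 | 0 | 3 | 15<br />
TOP 20 | T40 | 1 | 2 | 0 | 15<br />
TOP 20 | T46 | 0 | 0 | 1 | 5<br />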
REFERENCES<br />
May, Pool, Van Dijk, Bijlard, Abeln, Heringa & Feenstra. Coarse-grained<br />
versus atomistic simulations: realistic interaction free energies<br />
for real proteins. Bioinformatics (2014) 30: 326-334.<br />
Lensink & Wodak. Score_set: A CAPRI benchmark for scoring protein<br />
complexes. Proteins (2014) 82:3163-3169.<br />
Abstract ID: O22<br />
Oral presentation<br />
O22. PEPSHELL: VISUALIZATION OF CONFORMATIONAL PROTEOMICS<br />
DATA<br />
Elien Vandermarliere 1,2* , Davy Maddelein 1,2 , Niels Hulstaert 1,2 , Elisabeth Stes 1,2 , Michela Di Michele 1,2 ,<br />
Kris Gevaert 1,2 , Edgar Jacoby 3 , Dirk Brehmer 3 & Lennart Martens 1,2 .<br />
Department of Medical Protein Research, VIB 1 ; Department of Biochemistry, Ghent University 2 ; Oncology Discovery,<br />
Janssen Research and Development – Janssen Pharmaceutica, Beerse 3 . * elien.vandermarliere@ugent.be<br />
Proteins are dynamic molecules; they undergo crucial conformational changes induced by post-translational<br />
modifications and by binding of cofactors or other molecules. The characterization of these conformational changes and<br />
their relation to protein function is a central goal of structural biology. Unfortunately, most conventional methods to<br />
obtain structural information do not provide information on protein dynamics. Therefore, mass spectrometry-based<br />
approaches, such as limited proteolysis, hydrogen-deuterium exchange, and stable-isotope labelling, are frequently used<br />
to characterize protein conformation and dynamics, yet the interpretation of these data can be cumbersome and time<br />
consuming. Here, we present PepShell, a tool that allows interactive data analysis of mass spectrometry-based<br />
conformational proteomics studies by visualization of the identified peptides both at the sequence and structure levels.<br />
Moreover, PepShell allows the comparison of experiments under different conditions which include proteolysis times or<br />
binding of the protein to different substrates or inhibitors.<br />
INTRODUCTION<br />
The study of protein structure with mass spectrometry,<br />
called conformational proteomics, is frequently used to<br />
characterize protein conformations and dynamics. Most of<br />
these methods exploit the surface accessibility of amino<br />
acids within the native protein conformation or, more<br />
specifically, the differences in surface accessibility<br />
of a protein structure under different conditions.<br />
The experimental setup and subsequent workflow of a<br />
conformational proteomics experiment do not deviate<br />
drastically from that of a classic mass spectrometry-based<br />
experiment in which peptides present in a complex peptide<br />
mixture are identified. The final outcome of a<br />
conformational proteomics experiment is a list of peptides.<br />
These peptide lists typically span multiple experimental<br />
conditions across which the structural observations are to<br />
be compared; the peptide lists have to be combined and, if<br />
available, mapped onto the structure of the protein.<br />
To carry out these latter steps, we developed PepShell<br />
(Vandermarliere et al., 2015) to guide the interpretation<br />
of mass spectrometry-based proteomics data in the context<br />
of protein structure and dynamics.<br />
TOOL DESCRIPTION<br />
PepShell aids the user in the interpretation of the outcome<br />
of conformational proteomics experiments and is<br />
composed of three panels: the experiment comparison<br />
panel, the PDB view panel, and the statistics panel.<br />
<br />
The data to analyze<br />
PepShell accepts input from limited proteolysis,<br />
hydrogen-deuterium exchange, MS footprinting and<br />
stable-isotope labelling experiments. The data have to<br />
be present in a comma-separated text file format. The<br />
project selection interface allows the user to select a<br />
reference project and to indicate which setups need to<br />
be compared with each other.<br />
<br />
Experiment comparison<br />
This panel allows the comparison of the selected<br />
experimental setups at the sequence level. For each<br />
experimental condition, the identified and quantified<br />
peptides are mapped onto the sequence of the protein<br />
of interest.<br />
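The sequence-level comparison described above can be illustrated with a minimal sketch.<br />
It is not PepShell code: the protein sequence, peptide lists, condition names and function<br />
names are all hypothetical, and peptides are located by plain substring matching.<br />

```python
# Illustrative only: not PepShell code. The sequence, the peptides and the
# condition names are invented for this example.

def map_peptides(sequence, peptides):
    """Return per-residue coverage: how many peptides cover each position."""
    coverage = [0] * len(sequence)
    for peptide in peptides:
        start = sequence.find(peptide)
        while start != -1:  # a peptide may occur more than once
            for i in range(start, start + len(peptide)):
                coverage[i] += 1
            start = sequence.find(peptide, start + 1)
    return coverage


def compare_conditions(sequence, peptides_by_condition, reference):
    """List residues (1-based) gained or lost relative to the reference condition."""
    ref_cov = map_peptides(sequence, peptides_by_condition[reference])
    differences = {}
    for condition, peptides in peptides_by_condition.items():
        if condition == reference:
            continue
        cov = map_peptides(sequence, peptides)
        gained = [i + 1 for i, (r, c) in enumerate(zip(ref_cov, cov)) if c and not r]
        lost = [i + 1 for i, (r, c) in enumerate(zip(ref_cov, cov)) if r and not c]
        differences[condition] = {"gained": gained, "lost": lost}
    return differences


sequence = "MKTAYIAKQRQISFVKSHFSRQ"
peptides_by_condition = {
    "native": ["TAYIAK", "SHFSR"],
    "denatured": ["TAYIAK", "QISFVK"],
}
result = compare_conditions(sequence, peptides_by_condition, reference="native")
print(result)
```

Residues reported as gained or lost are candidates for regions whose accessibility differs<br />
between the compared setups.<br />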
The PDB view panel<br />
Here, the detected peptides are mapped on the protein<br />
structure. The main requirement is the availability of a<br />
3D structure of the protein of interest.<br />
<br />
Statistics within PepShell<br />
In this panel, the peptides of interest can be analyzed<br />
in more detail. The outcome from CP-DT (Fannes et<br />
al., 2013) for tryptic cleavage probability for each<br />
tryptic cleavage position is given. A detailed<br />
comparison of the peptide ratios across the different<br />
experimental setups is also possible.<br />
CONCLUSIONS<br />
The increasing popularity of structural proteomics is in<br />
stark contrast with the availability of efficient tools to<br />
visualize this multitude of data. There are however some<br />
tools available that aid data interpretation; but these are<br />
approach-specific and are aimed primarily at mass<br />
spectrometrists with a specific focus on the experimental<br />
mass spectrometry data and their processing and<br />
interpretation. PepShell on the other hand is intended to<br />
support downstream users to interpret the results obtained<br />
from a variety of conformational proteomics approaches.<br />
PepShell uses the peptide lists to compare different<br />
experimental conditions and allows the visualization of<br />
these differences onto the structure of the protein. As such,<br />
PepShell bridges the gap between mass spectrometry-based<br />
proteomics data and their interpretation in the<br />
context of protein structure and dynamics.<br />
PepShell is an open source Java application. Its binaries,<br />
source code and documentation can be found at:<br />
compomics.github.io/projects/pepshell.html<br />
REFERENCES<br />
Fannes T et al. J Proteome Res 12, 2253-2259 (2013).<br />
Vandermarliere E et al. J Proteome Res 14, 1987-1990 (<strong>2015</strong>).<br />
Abstract ID: O23<br />
Oral presentation<br />
O23. INTERACTIVE VCF COMPARISON USING SPARK NOTEBOOK<br />
Thomas Moerman 1,2,5* , Dries Decap 3,5 , Toni Verbeiren 2,5 , Jan Fostier 3,5 , Joke Reumers 4,5 , Jan Aerts 2,5 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Visual Data Analysis Lab, ESAT –<br />
STADIUS, Dept. of Electrical Engineering, KU Leuven – iMinds Medical IT 2 ; Department of Information Technology,<br />
Ghent University – iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium 3 ; Janssen Research & Development,<br />
a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium 4 ; ExaScience Life Lab, Kapeldreef 75, 3001 Leuven,<br />
Belgium 5 . * thomas.moerman@esat.kuleuven.be<br />
Researchers benefit greatly from tools that allow hands-on, interactive and visual experimentation with data, unimpeded<br />
by setup complexities or scaling issues resulting from large data sizes. In our contribution we present an implementation<br />
of an interactive VCF comparison tool, making use of a technology stack based on Apache Spark [1], Big Data<br />
Genomics Adam [2] and Spark Notebook [3].<br />
INTRODUCTION<br />
Current genomics data formats and processing pipelines<br />
are not designed to scale well to large datasets [1]. They<br />
were also not conceived to be used in an interactive<br />
environment. The bioinformatics field typically struggles<br />
with these difficulties as high-throughput, next-generation<br />
sequencing jobs produce large data files. Although many<br />
high-quality bioinformatics processing tools exist, it is<br />
often hard to express analyses in a consolidated and<br />
reproducible fashion. These tools typically do not allow<br />
users to iterate interactively on an analysis while visualizing<br />
results.<br />
OBJECTIVE<br />
Analysis tools preferably provide the expressive power to<br />
define ad hoc queries on data. Biologists or clinical<br />
researchers, when dealing with genomic variants encoded<br />
in VCF files, typically perform queries comparing one<br />
protocol to another, tumor to normal, treated to untreated<br />
cell lines and so on. Ideally these comparisons make use<br />
of all quality-related metrics stored in VCF files (e.g.<br />
coverage depth, quality score) as well as the actual region<br />
annotations (e.g. repeat regions, exonic regions) and<br />
generate visual output. We aim to implement a tool that<br />
provides the necessary expressiveness as well as the<br />
computational power needed for making these types of<br />
analyses practical and interactive.<br />
APPROACH<br />
Recent advances in computation platform technology<br />
(Spark) and notebook technologies (Spark Notebook)<br />
enable orchestration of distributed jobs on cluster<br />
infrastructure from a programmable environment running<br />
in a browser. These technologies, combined with Adam<br />
[2], a library specifically designed for processing next-generation<br />
sequencing data, provide the necessary<br />
architectural bedrock for our purposes.<br />
Analyses are expressed in a high-level programming<br />
language (Scala), operating on specialized data structures<br />
(Spark resilient distributed datasets, or RDDs [1]) that<br />
abstract away the complexity of defining distributed<br />
computations on data sets too large for single-node<br />
processing. Adam meets the need for an explicit data<br />
schema for abstraction of the different bioinformatics file<br />
formats.<br />
RESULTS & CONTRIBUTIONS<br />
Our work focuses on the pairwise comparison of annotated<br />
VCF files. Our contributions consist of two open-source<br />
Scala libraries: VCF-comp [4] and Adam-FX [5]. VCF-comp<br />
implements the concordance-by-variant-position<br />
algorithm, which segregates the variants from two VCF<br />
inputs (A, B) into 5 categories: A-unique, B-unique, concordant<br />
(equal variants at a position), A-discordant and B-discordant (different<br />
variants at a position). This results in a distributed data<br />
structure from which we project visualizations, presented<br />
to the user by means of the Spark Notebook interface.<br />
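The five-way split can be sketched on a single machine as follows. This is a simplified<br />
illustration, not the VCF-comp implementation (which is written in Scala on Spark): the<br />
function name and the tuple layout are assumptions, and each input is assumed to hold<br />
at most one variant per position.<br />

```python
# Illustrative only: not the VCF-comp code. Variants are (chrom, pos, ref, alt)
# tuples and each input is assumed to hold at most one variant per position.

def concordance_by_position(variants_a, variants_b):
    """Split the variants of two call sets into five categories by position."""
    a = {(chrom, pos): (ref, alt) for chrom, pos, ref, alt in variants_a}
    b = {(chrom, pos): (ref, alt) for chrom, pos, ref, alt in variants_b}
    names = ("A-unique", "B-unique", "concordant", "A-discordant", "B-discordant")
    categories = {name: [] for name in names}
    for key in sorted(a.keys() | b.keys()):
        if key not in b:
            categories["A-unique"].append(key + a[key])
        elif key not in a:
            categories["B-unique"].append(key + b[key])
        elif a[key] == b[key]:
            categories["concordant"].append(key + a[key])
        else:  # same position, different alleles
            categories["A-discordant"].append(key + a[key])
            categories["B-discordant"].append(key + b[key])
    return categories


variants_a = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
variants_b = [("1", 100, "A", "G"), ("1", 200, "C", "A"), ("2", 75, "T", "C")]
categories = concordance_by_position(variants_a, variants_b)
for name, variants in categories.items():
    print(name, variants)
```

In a distributed setting the same logic would typically be expressed as a keyed join on<br />
(chromosome, position) rather than as in-memory dictionaries.<br />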
FIGURE 1 Allele frequency distribution for concordant and unique<br />
variants in a tumor vs. normal VCF comparison.<br />
FIGURE 2 Functional impact (SnpEff annotation) histogram for<br />
concordant, unique and discordant variants in a tumor vs. normal VCF<br />
comparison.<br />
Adam-FX extends the Adam data structures and file<br />
parsing logic in order to support queries on SnpEff [6],<br />
SnpSift [7], dbSNP and Clinvar annotations.<br />
We believe our tool facilitates the comparison of<br />
annotated VCF files in an interactive manner while<br />
reducing runtime by leveraging the Spark platform.<br />
REFERENCES<br />
[1] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant<br />
abstraction for in-memory cluster computing."<br />
[2] Massie, Matt, et al. "Adam: Genomics formats and processing<br />
patterns for cloud scale computing."<br />
[3] https://github.com/andypetrella/spark-notebook<br />
[4] https://github.com/tmoerman/vcf-comp<br />
[5] https://github.com/tmoerman/adam-fx<br />
[6] Cingolani, P, et al. "A program for annotating and predicting the<br />
effects of single nucleotide polymorphisms, SnpEff: SNPs in the<br />
genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Fly<br />
(Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672<br />
Abstract ID: O24<br />
Oral presentation<br />
O24. 3D HOTSPOTS OF RECURRENT RETROVIRAL INSERTIONS REVEAL<br />
LONG-RANGE INTERACTIONS WITH CANCER GENES<br />
Sepideh Babaei 1 , Waseem Akhtar 2 , Johann de Jong 3 , Marcel Reinders 1 & Jeroen de Ridder 1* .<br />
Delft Bioinformatics Lab, Delft University of Technology 1 ; Division of Molecular Genetics 2 ;<br />
Division of Molecular Carcinogenesis, The Netherlands Cancer Institute 3 . * j.deridder@tudelft.nl<br />
Genomically distal mutations can contribute to deregulation of cancer genes by engaging in chromatin interactions. To<br />
study this, we overlay viral cancer-causing insertions obtained in a murine retroviral insertional mutagenesis screen with<br />
genome-wide chromatin conformation capture data. In this talk, we show that insertions tend to cluster in 3D hotspots<br />
within the nucleus. The identified hotspots are significantly enriched for known cancer genes, and bear the expected<br />
characteristics of bona-fide regulatory interactions, such as enrichment for transcription factor binding sites.<br />
Additionally, we observe a striking pattern of mutually exclusive integration. This is an indication that insertions in these<br />
loci target the same gene, either in their linear genomic vicinity or in their 3D spatial vicinity. Our findings shed new<br />
light on the repertoire of targets obtained from insertional mutagenesis screening and underline the importance of<br />
considering the genome as a 3D structure when studying effects of genomic perturbations.<br />
Evidence is mounting that the organization of the genome<br />
in the cell nucleus is extremely important for gene<br />
regulation. This finding is facilitated by recent<br />
technological advances (e.g. Hi-C) that enabled researchers<br />
to accurately capture the 3D conformation of<br />
chromosomes in the cellular nucleus at a high resolution.<br />
We have exploited a large existing Hi-C dataset to take 3D<br />
chromosome conformation into account while determining<br />
hotspots of viral cancer-causing mutations. These<br />
identified hotspots are significantly enriched for known<br />
cancer genes, and bear the expected characteristics of<br />
bona-fide regulatory interactions, such as enrichment for<br />
transcription factor binding sites. Additionally, we observe<br />
a striking pattern of mutually exclusive integration. This is<br />
an indication that insertions in these loci target the same<br />
gene through long-range interactions (1).<br />
In a second study (2), we performed a similar analysis that<br />
shows a striking relation between genome conformation<br />
and expression correlation in the brain. Although recent<br />
studies have shown that a strong correlation between<br />
chromatin interactions and gene co-expression exists,<br />
predicting gene co-expression from frequent long-range<br />
chromatin interactions remains challenging. We address<br />
this by characterizing the topology of the cortical<br />
chromatin interaction network using scale-aware<br />
topological measures. We demonstrate that based on these<br />
characterizations it is possible to accurately predict spatial<br />
co-expression between genes in the mouse cortex.<br />
Consistent with previous findings, we find that the<br />
chromatin interaction profile of a gene-pair is a good<br />
predictor of their spatial co-expression. However, the<br />
accuracy of the prediction can be substantially improved<br />
when chromatin interactions are described using scale-aware<br />
topological measures of the multi-resolution<br />
chromatin interaction network. We conclude that, for co-expression<br />
prediction, it is necessary to take into account<br />
different levels of chromatin interactions ranging from<br />
direct interaction between genes (i.e. small-scale) to<br />
chromatin compartment interactions (i.e. large-scale).<br />
In this talk, I will focus on the computational and<br />
statistical methods that are required to insightfully<br />
overlay high-resolution conformation maps obtained<br />
using Hi-C with ~20,000 cancer-causing retroviral<br />
mutations and expression maps from the Allen Brain<br />
Atlas.<br />
FIGURE 1. Circos visualization of the insertion clusters that co-localize<br />
with the Notch1 locus.<br />
REFERENCES<br />
(1) Babaei, S. et al. Nature Communications (<strong>2015</strong>).<br />
(2) Babaei and Mahfouz et al. PLoS Computational Biology (<strong>2015</strong>)<br />
Abstract ID: P<br />
Poster<br />
P1. KNN-MDR APPROACH FOR DETECTING GENE-GENE<br />
INTERACTIONS<br />
Sinan Abo alchamlat 1 & Frédéric Farnir 1,* .<br />
Fundamental and Applied Research for Animals & Health (FARAH), Sustainable Animal Production, University of<br />
Liège 1 . * f.farnir@ulg.ac.be<br />
Recent years have seen the emergence of a wealth of biological information. Easier access to genome<br />
sequences, along with massive data on gene expression and on proteins, has revolutionized research in many fields<br />
of biology. For example, the identification of up to several million SNPs in many species and the development of chips<br />
allowing effective genotyping of these SNPs in large cohorts have triggered the need for statistical models able to<br />
identify the effects of individual and of interacting SNPs on phenotypic traits in this new high-dimensional landscape.<br />
Our work is a contribution to this field.<br />
INTRODUCTION<br />
GWAS has allowed the identification of hundreds of<br />
genetic variants associated with complex diseases and traits,<br />
and provided valuable insight into their genetic<br />
architecture (Wu M et al., 2010). Nevertheless, most<br />
variants identified so far have been found to provide<br />
relatively little information about the relationship<br />
between changes at the genomic level and phenotypes,<br />
either because the findings lack reproducibility or<br />
because these variants usually explain only a<br />
small proportion of the underlying genetic variation (Fang<br />
G et al., 2012). This observation, referred to as the ‘missing<br />
heritability’ problem (Manolio T et al., 2009), of course<br />
raises the question: where does the unexplained genetic<br />
variation come from? A tentative explanation is that genes<br />
do not work in isolation, leading to the idea that sets of<br />
genes (or gene networks) could have a major effect on the<br />
tested traits while almost no marginal – i.e. individual<br />
gene – effect is detectable. Consequently, an important<br />
question concerns the exact relationship between the<br />
genomic configuration, including the interactions between<br />
the involved genes, and the phenotypic expression.<br />
METHODS<br />
To tackle this subject, different statistical methods such as<br />
MDR (Multifactor Dimensionality Reduction) have been proposed<br />
for detecting gene-gene interactions (Ritchie, D., et al.,<br />
2001); their relative performance remains largely unclear,<br />
and their extension to situations combining many variants<br />
turns out to be challenging. We therefore propose a novel MDR<br />
approach using K-Nearest Neighbors (KNN) methodology<br />
(KNN-MDR) for detecting gene-gene interaction as a<br />
possible alternative, especially when the number of<br />
involved determinants is potentially high. The idea behind<br />
our method is to replace the status allocation used in<br />
classical MDR methods by a KNN approach: the majority<br />
vote occurs in the k (a parameter that must be tuned and<br />
depends on the various possible scenarios) nearest<br />
neighbors instead of within the (potentially empty) cell<br />
determined by the tested attributes of the individual to be<br />
classified. The steps other than classification are identical<br />
in both methods (i.e. cross-validation, attribute selection,<br />
training and test balanced-accuracy computation, best<br />
model selection procedure).<br />
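The classification step can be sketched as follows. This is a toy illustration, not the<br />
authors' implementation: genotypes are assumed to be coded 0/1/2, the distance is assumed<br />
to be the Manhattan distance over the tested SNPs, k is fixed at 3, a single train/test<br />
split stands in for cross-validation, and all names and data are hypothetical.<br />

```python
# Illustrative only: a toy KNN-based status allocation with invented data.
from itertools import combinations


def knn_predict(train_x, train_y, query, snps, k):
    """Majority vote (1 = case, 0 = control) among the k nearest training individuals."""
    scored = sorted(
        (sum(abs(row[s] - query[s]) for s in snps), label)
        for row, label in zip(train_x, train_y)
    )  # distance ties are broken by label, controls first
    votes = [label for _, label in scored[:k]]
    return 1 if 2 * sum(votes) > len(votes) else 0


def balanced_accuracy(truth, predicted):
    """Mean of sensitivity and specificity."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    pos, neg = truth.count(1), truth.count(0)
    sensitivity = tp / pos if pos else 0.0
    specificity = tn / neg if neg else 0.0
    return (sensitivity + specificity) / 2


def best_model(train_x, train_y, test_x, test_y, order=2, k=3):
    """Score every SNP combination of the given order and return the best one."""
    scores = {}
    for snps in combinations(range(len(train_x[0])), order):
        predicted = [knn_predict(train_x, train_y, q, snps, k) for q in test_x]
        scores[snps] = balanced_accuracy(test_y, predicted)
    return max(scores, key=scores.get), scores


# SNPs 0 and 1 interact (XOR-like pattern); SNP 2 is noise.
train_x = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1],
           [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [[0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
test_y = [0, 0, 1, 1]

best, scores = best_model(train_x, train_y, test_x, test_y)
print(best, scores)
```

Because the vote is taken among neighbours rather than within a genotype cell, an empty<br />
cell no longer prevents classification.<br />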
RESULTS & DISCUSSION<br />
Experimental results on both simulated data and real<br />
genome-wide data from Wellcome Trust Case Control<br />
Consortium (WTCCC) (Wellcome Trust Case Control C.,<br />
2007) show that KNN-MDR has interesting properties in<br />
terms of accuracy and power, and that, in many cases, it<br />
significantly outperforms its recent competitors.<br />
FIGURE 1. Comparison of the inter-chromosomal interactions detected<br />
on the WTCCC dataset by KNN-MDR and other interaction methods<br />
using this same dataset as example (Shchetynsky et al. (<strong>2015</strong>); Zhang et<br />
al. (2012))<br />
The results of this study allow us to draw some<br />
conclusions about the performance of KNN-MDR: on the<br />
one hand, the performance of KNN-MDR in<br />
detecting gene-gene interactions is similar to that<br />
of MDR for small problems. On the other<br />
hand, KNN-MDR has significant advantages for large<br />
samples and large numbers of markers (such as GWAS) when<br />
detecting gene effects. KNN-MDR can thus be<br />
seen as a new and more comprehensive method than MDR<br />
and other competitors for detecting gene-gene interactions.<br />
REFERENCES<br />
Wu M et al. American journal of human genetics 86, 929-942 (2010).<br />
Fang G et al. PloS one 7, 1932-6203 (2012).<br />
Manolio T et al. Nature 461, 747-753 (2009).<br />
Ritchie, D., et al. Am J Hum Genet,69, 138-147 (2001).<br />
Wellcome Trust Case Control C. Nature, 447(7145):661-678 (2007).<br />
Shchetynsky K et al. Clinical immunology 158(1):19-28 (<strong>2015</strong>).<br />
Zhang J et al. American Medical Journal 3(1) (<strong>2015</strong>).<br />
Abstract ID: P<br />
Poster<br />
P2. CONSERVATION AND DIVERSITY OF SUGAR-RELATED CATABOLIC<br />
PATHWAYS IN FUNGI<br />
Maria Victoria Aguilar Pontes*, Eline Majoor, Claire Khosravi, Ronald P. de Vries, Miaomiao Zhou<br />
Fungal Physiology, CBS-KNAW Fungal Biodiversity Centre, Utrecht, The Netherlands; Fungal Molecular Physiology,<br />
Utrecht University, The Netherlands.*v.aguilar@cbs.knaw.nl, e.majoor@cbs.knaw.nl, c.khosravi@cbs.knaw.nl,<br />
r.devries@cbs.knaw.nl, m.zhou@cbs.knaw.nl<br />
INTRODUCTION<br />
Plant polysaccharides are among the major substrates for<br />
many fungi. After extracellular degradation, the<br />
monomeric components (mainly monosaccharides) are<br />
taken up by the cells and used as carbon sources to enable<br />
the fungus to grow. This would also imply that the range<br />
of catabolic pathways of a fungus may be correlated to the<br />
decomposition of the polysaccharides it can degrade.<br />
Several carbon catabolic pathways have been studied in<br />
different fungi able to grow on plant biomass such as<br />
Aspergillus niger (De Vries, et al., 2012).<br />
In this study we have tested this hypothesis by identifying<br />
the presence of genes of a number of catabolic pathways<br />
in selected fungi from the Ascomycota and the<br />
Basidiomycota.<br />
METHODS<br />
A total of 104 fungal genomes were identified from the<br />
JGI fungal program (Grigoriev IV, et al., 2011), Broad<br />
Institute of Harvard and MIT, AspGD (Arnaud, et al.,<br />
2012) and NCBI genbank (Benson, et al., 2012) (data<br />
version March 2013).<br />
We identified A. niger genes involved in individual<br />
pathways from the literature. Genome-scale protein ortholog<br />
clusters were detected with OrthoMCL (Li, et al., 2003), using<br />
inflation factor 1, E-value cutoff 1E-3 and percentage match<br />
cutoff 60%, as for the identification of distant homologs<br />
(Boekhorst, et al., 2007). The all-vs-all BlastP search<br />
required by OrthoMCL was carried out in parallel on a grid of 500<br />
computers. The ortholog clusters were<br />
then curated manually by expert knowledge and literature<br />
search. Manual curation was aided by aligning the amino<br />
acid sequences of the hits for each query together with a<br />
suitable outgroup by MAFFT (Katoh, et al., 2009; Katoh,<br />
et al., 2005), after which neighbor joining trees were<br />
generated using MEGA5 with 1000 bootstraps. Genes that<br />
were clearly separated from the query branch in the trees<br />
were removed from the results.<br />
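The threshold step and the resulting presence/absence view can be sketched as follows.<br />
This is only an illustration of applying the stated cutoffs to a list of hits: it does not<br />
reproduce OrthoMCL's clustering, and the record fields, gene names and species names<br />
are hypothetical.<br />

```python
# Illustrative only: applies the E-value and percentage-match cutoffs quoted in
# the text to invented hit records. It does not perform OrthoMCL clustering.

E_VALUE_CUTOFF = 1e-3
PERCENT_MATCH_CUTOFF = 60.0


def filter_hits(hits):
    """Keep hits that pass both cutoffs."""
    return [
        h for h in hits
        if h["evalue"] <= E_VALUE_CUTOFF and h["percent_match"] >= PERCENT_MATCH_CUTOFF
    ]


def presence_by_species(hits):
    """Map each species to the set of query genes with at least one passing hit."""
    presence = {}
    for h in filter_hits(hits):
        presence.setdefault(h["species"], set()).add(h["query"])
    return presence


hits = [
    {"query": "geneA", "species": "species_1", "evalue": 1e-40, "percent_match": 92.0},
    {"query": "geneB", "species": "species_1", "evalue": 1e-2, "percent_match": 80.0},
    {"query": "geneA", "species": "species_2", "evalue": 1e-10, "percent_match": 45.0},
    {"query": "geneB", "species": "species_2", "evalue": 1e-8, "percent_match": 71.0},
]
presence = presence_by_species(hits)
print(presence)
```

A table built this way only proposes candidates; the manual curation and tree-based<br />
filtering described above remain necessary.<br />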
RESULTS & DISCUSSION<br />
Patterns of pathway gene presence are conserved among<br />
clades. Galacturonic acid and rhamnose pathways are<br />
missing in yeast. The pentose pathway is conserved in<br />
Pezizomycetes and Basidiomycota, which explains their<br />
ability to grow on pentose as carbon source (www.fung-growth.org).<br />
These results may indicate that different evolutionary<br />
tracks have led to different metabolic strategies.<br />
The expression of metabolic genes will be evaluated for<br />
those species for which transcriptome data are available.<br />
The results will be compared to growth profiling data of<br />
the species on a set of plant-related poly- and<br />
monosaccharides to determine to what extent the genome<br />
content fits the physiological ability of the species.<br />
ACKNOWLEDGEMENTS<br />
The comparative genomics analysis was carried out on the<br />
Dutch national e-infrastructure with the support of SURF<br />
Foundation (e-infra1300787).<br />
REFERENCES<br />
Arnaud, M.B., et al., Nucleic Acids Res, 40, 653-659 (2012).<br />
Benson, D.A., et al., Nucleic Acids Res, 40, 48-53 (2012).<br />
Boekhorst, J., et al., BMC Bioinformatics, 8, 356-363 (2007).<br />
De Vries, R.P., et al. Pan Stanford Publishing Pte. Ltd, Singapore (2012).<br />
Grigoriev IV, et al., Mycology, 2, 192-209 (2011).<br />
Katoh, K., et al., Methods Mol Biol, 537, 39-64 (2009).<br />
Katoh, K., et al., Nucleic Acids Res, 33, 511-518 (2005).<br />
Li, L., et al., Genome Res, 13, 2178-2189 (2003).<br />
Abstract ID: P<br />
Poster<br />
P3. VISUALIZING BIOLOGICAL DATA THROUGH WEB COMPONENTS<br />
USING POLIMERO AND POLIMERO-BIO<br />
Daniel Alcaide 1,2* , Ryo Sakai 1,2 , Raf Winand 1,2 , Toni Verbeiren 1,2 , Thomas Moerman 1,2 , Jansi Thiyagarajan & Jan Aerts.<br />
KU Leuven Department of Electrical Engineering-ESAT, STADIUS, VDA-lab, Belgium 1 ; iMinds Medical IT, Leuven,<br />
Belgium. * daniel.alcaide@esat.kuleuven.be<br />
Although there are currently several tools for fast prototyping in data visualization, the specifics of the biological domain<br />
often require the development of custom visuals. This leads to the issue that we end up re-implementing the base visuals<br />
over and over if we want to build them into a specific analysis tool. This work presents a proof-of-principle library for<br />
creating composable linked data visualizations, including an initial collection of parsers and visuals with an emphasis on<br />
biology. With Polimero and Polimero-bio, we want to create a library to build scalable domain-specific visual data<br />
exploration tools using a collection of D3-based reusable web components.<br />
INTRODUCTION<br />
As a visual data analysis lab, we often combine<br />
(brush/link) well-known data visualization techniques<br />
(scatterplots, barcharts, etc.). Although it is possible to use<br />
general-purpose tools like Tableau or Excel, the particular<br />
needs of the biological field usually demand the creation<br />
of specific data visualizations which are not included in<br />
these commercial solutions (Figure 1).<br />
These visual implementations need to be re-implemented<br />
for each new tool created. The present solution aims to be<br />
an alternative for creating composable linked data<br />
visualizations.<br />
FIGURE 1. Klaudia-plot - Visualization created with Polimero that shows<br />
the read pairs mapped around a deletion in the NA12878 genome on<br />
chromosome 20.<br />
METHODS<br />
Polimero is a library that uses the Polymer implementation of web<br />
components (www.polymer-project.org) to create visual web components.<br />
Web components are an emerging W3C standard for<br />
extending the HTML platform to create web-based apps.<br />
This new technology includes custom elements, HTML<br />
templates, shadow DOM, and HTML imports (Figure 2).<br />
The D3-based custom elements that Polimero and<br />
Polimero-bio offer allow us to create a scalable<br />
framework for building domain-specific visual data<br />
exploration tools.<br />
Leveraging the web components concepts, the main<br />
characteristics of the Polimero library are:<br />
Modular: Each element is an independent module<br />
that has a specific purpose (data, visualization,<br />
computation).<br />
Composable: The elements can be combined to<br />
set up new functionalities (linking, filtering,<br />
reading different data sources).<br />
Encapsulated: Web components aim to provide<br />
the user with a simple element interface, so that they<br />
do not have to deal with the underlying code.<br />
Reusable: The same element can be used in the<br />
same project for different objectives.<br />
Linkable: Polimero elements can speak to each<br />
other, allowing the use of events for brushing and<br />
linking.<br />
Embeddable: The elements can be added to any<br />
existing framework that uses HTML (e.g. IPython<br />
notebook).<br />
FIGURE 2. HTML example – Representing Polimero elements to create<br />
visualization.<br />
RESULTS & DISCUSSION<br />
This library makes it possible to create applications that<br />
are composable, encapsulated, and reusable. This is<br />
valuable both for the developer/designer who can easily<br />
create and plug in custom visual encodings, and for the<br />
end-user who can create linked visualizations by dragging<br />
existing components onto a canvas using the Polimero-designer.<br />
Polimero and Polimero-bio are still in development but<br />
they are available at www.bitbucket.org/vda-lab/polimero.<br />
Abstract ID: P<br />
Poster<br />
P4. DISEASE-SPECIFIC NETWORK CONSTRUCTION BY SEED-AND-EXTEND<br />
Ganna Androsova 1* , Reinhard Schneider 1 & Roland Krause 1 .<br />
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belval, Luxembourg 1 .<br />
* ganna.androsova@uni.lu<br />
INTRODUCTION<br />
Molecular interaction networks are dense structures of<br />
protein interactions, from which we would like to extract<br />
relevant sub-networks specific to the disease of interest.<br />
Such a disease-specific network is often constructed by the<br />
seed-and-extend algorithm, which extracts the relevant<br />
genes from an organism-wide, weighted interaction<br />
network, typically as its first-neighbourhood. Seed-and-extend<br />
is suitable when disease biomarkers are poorly<br />
investigated and the knowledge about biomarker<br />
interaction partners is missing or when the interacting<br />
partners are established but the connections are missing<br />
between them.<br />
Our syndrome of interest is the postoperative cognitive<br />
impairment frequently experienced by elderly patients,<br />
characterized by progressive cognitive and sensory decline.<br />
The acute phase of cognitive impairment is postoperative<br />
delirium (POD). The underlying pathophysiological<br />
mechanisms have not been studied in depth due to the<br />
multifactorial pathogenesis of this postoperative cognitive<br />
impairment. The known POD-related genes can be<br />
integrated into the draft network for exploration on a<br />
systems level.<br />
Here, we investigate how stable the results of such<br />
analysis are when the input set of seed genes is varied, and<br />
what role stringency plays in the initial selection of the<br />
networks. Ideally, we would like to find the “sweet spot”<br />
that provides a biologically meaningful trade-off between<br />
false positives and false negatives to be used for such analyses.<br />
METHODS<br />
The list of disease-related genes/proteins was retrieved<br />
from literature studies in the PubMed database.<br />
We extended the seed list with directly linked interactors<br />
by seed-and-extend from protein-protein interaction<br />
network databases. We extracted all interactions between<br />
seeds and connected neighbours, which resulted in the<br />
first-degree network.<br />
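The extension step can be sketched as follows. This is a minimal illustration rather than<br />
the authors' pipeline: the protein names are hypothetical, the interaction list is<br />
unweighted, and the sketch assumes that every interaction among the selected nodes<br />
(seeds and their direct neighbours) is kept.<br />

```python
# Illustrative only: first-neighbourhood extraction on an invented,
# unweighted interaction list.

def seed_and_extend(interactions, seeds):
    """Return the nodes and edges of the first-degree network around the seeds."""
    seeds = set(seeds)
    neighbours = set()
    for a, b in interactions:
        if a in seeds:
            neighbours.add(b)
        if b in seeds:
            neighbours.add(a)
    nodes = seeds | neighbours
    # Assumption: keep every interaction whose two partners are both selected.
    edges = {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in nodes and b in nodes and a != b
    }
    return nodes, edges


def hubs(edges, top=5):
    """Rank nodes by degree (number of distinct interaction partners)."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:top]


interactions = [
    ("SEED1", "P1"), ("SEED1", "P2"), ("P1", "P2"),
    ("P2", "P3"), ("P3", "P4"), ("SEED2", "P3"),
]
nodes, edges = seed_and_extend(interactions, seeds=["SEED1"])
print(sorted(nodes), sorted(edges), hubs(edges))
```

Varying the seed list passed to such a function is one way to probe how sensitive the<br />
extracted network and its hubs are to the input.<br />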
Next, we evaluated the biological enrichment of the<br />
extracted network, its topological parameters and its overlap with<br />
other diseases, and clustered the network into smaller<br />
sub-networks.<br />
RESULTS & DISCUSSION<br />
The POD network (Figure 1) follows a scale-free degree<br />
distribution and consists of 541 proteins with 5,242<br />
interactions between them.<br />
FIGURE 1. Postoperative delirium molecular network.<br />
The network was evaluated topologically by degree<br />
assortativity, density, shortest path, eccentricity and other<br />
measures. Pathway enrichment analysis showed<br />
glucocorticoid receptor signalling, immune response, and<br />
dopamine signalling as relevant to POD (Figure 2).<br />
FIGURE 2. Postoperative delirium pathway enrichment analysis.<br />
The top 5 hub proteins included UBC_HUMAN,<br />
GCR_HUMAN, P53_HUMAN, HS90A_HUMAN and<br />
EGFR_HUMAN. The appearance of p53 and other very<br />
frequently reported genes among the top 5 hubs, in our study and in several<br />
others, motivated us to investigate their relevance to<br />
the disease and to question possible data bias. We<br />
compare how size, specificity and completeness of the<br />
input seed list can affect the resulting network and<br />
the retrieval of other disease-related proteins.<br />
Abstract ID: P<br />
Poster<br />
P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW<br />
COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE<br />
AND HIVE<br />
Amin Ardeshirdavani 1* , Erika Souche 2 , Martijn Oldenhof 3 & Yves Moreau 1 .<br />
KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics 1; KU Leuven<br />
Department of Human Genetics 2; KU Leuven Facilities for Research 3. *amin.ardeshirdavani@esat.kuleuven.be<br />
Next Generation Sequencing (NGS) technologies allow the sequencing of the whole human genome to, among other things,<br />
efficiently study human genetic disorders. However, the flood of sequencing data requires high computational power and<br />
optimized programming structures for data analysis. Many researchers use scale-out networks to emulate a<br />
supercomputer. In many use cases Apache Hadoop and HBase have been used to coordinate distributed computation and<br />
act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data<br />
from NGS, except for sequencing read assembly. In our study, we propose a Big Data solution by integrating Apache<br />
Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.<br />
INTRODUCTION<br />
The goal of this project is to bridge the gap<br />
between massive NGS data and limited data-processing<br />
capacity. We propose a data processing and<br />
storage model specifically for NGS data. To this<br />
end, we develop an application based on this model to test<br />
whether processing capacity is substantially increased. The target<br />
users of this application are researchers with intermediate-level<br />
computer skills. The new model should meet certain<br />
demands: scalability, fault tolerance and availability.<br />
Data import should be fast and occupy the<br />
smallest possible storage volume. Querying<br />
data should also be fast and possible from remote locations. To<br />
meet these demands, three open source projects,<br />
Apache Hadoop, HBase and Hive, are integrated as the<br />
backbone, and on top of them an application with a user-friendly<br />
interface is developed to make this integration<br />
more straightforward.<br />
METHODS<br />
In general, Hadoop provides distributed MapReduce<br />
data processing, HBase is a platform for storing complex<br />
structured data, and Hive retrieves data from HBase<br />
using Structured Query Language (SQL) syntax.<br />
Although Hadoop and HBase have become popular, the<br />
combination of Hadoop, HBase and Hive is rarely<br />
implemented in the bioinformatics field.<br />
Here we mainly discuss the analysis of gene variation data, so<br />
application development focuses on parsing and<br />
storing VCF (Variant Call Format) files. The application is<br />
designed to adapt dynamically to the VCF file structures<br />
produced by different variant callers. For example,<br />
UnifiedGenotyper calls SNPs and InDels separately,<br />
treating each variant as independent, whereas<br />
HaplotypeCaller calls variants using local<br />
assembly. For gene variation analysis, the VCF files of<br />
different samples need to be queried, and the results must<br />
be exportable for further use. A VCF file<br />
for a sample or a group of samples is normally<br />
large, so processing efficiency is<br />
crucial.<br />
The model we chose integrates Hadoop,<br />
HBase and Hive: Hadoop is used for data processing,<br />
HBase for storage and Hive for querying. Since all of<br />
these projects need a distributed cluster to perform<br />
well, choosing a suitable<br />
architecture for our application is crucial. The cluster is the<br />
main processing and storage platform. A single server<br />
outside the cluster acts as a client for users, and our<br />
application connects remotely to the Hive server on behalf of<br />
researchers.<br />
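To illustrate the parsing step described above, the following is a minimal Python sketch of how one VCF data line could be flattened into a row key plus a set of columns suitable for a wide-column store. It is not the authors' implementation: the function name, the row-key layout and the column naming (`INFO.<key>`, `<sample>.<key>`) are assumptions made for illustration. Only the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples) is taken as given.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into a row key and a flat column dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    # Hypothetical row key: zero-padding the position keeps rows sorted by locus.
    row_key = f"{chrom}:{int(pos):010d}:{ref}:{alt}"
    columns = {"ID": var_id, "QUAL": qual, "FILTER": filt}

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    # Whatever keys a given caller emits become columns, so no fixed schema is needed.
    for entry in info.split(";"):
        if entry and entry != ".":
            key, _, value = entry.partition("=")
            columns[f"INFO.{key}"] = value if value else "true"

    # Per-sample fields follow the FORMAT column and are ':'-separated.
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for sample, raw in zip(sample_names, fields[9:]):
            for key, value in zip(format_keys, raw.split(":")):
                columns[f"{sample}.{key}"] = value

    return row_key, columns
```

Because the INFO and FORMAT keys are read from each line rather than fixed in advance, the same routine would accept output from callers that annotate variants differently, which is the kind of adaptation the text describes.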
RESULTS & DISCUSSION<br />
The tests clearly show that the Apache integration<br />
performs much better than the SQL model when dealing<br />
with large VCF files. For small VCF files, the<br />
performance of the integration is acceptable. We therefore conclude that<br />
the Apache integration could be a good solution for this kind<br />
of file management. Our newly developed application, H3<br />
VCF, offers a user-friendly interface for users<br />
without advanced IT knowledge, so they can conveniently<br />
use the integration to handle VCF files. Users can either<br />
build their own local computer cluster or use<br />
Amazon EMR to create a cluster with the Apache<br />
projects for a few dollars.<br />
P6. ENTEROCOCCUS FAECIUM GENOME DYNAMICS DURING<br />
LONG-TERM PATIENT GUT COLONIZATION<br />
Jumamurat R. Bayjanov 1* , Jery Baan 1 , Mark de Been 1 , Mick Watson 2 & Willem van Schaik 1 .<br />
Department of Medical Microbiology, University Medical Center Utrecht, Utrecht, The Netherlands 1 ; Edinburgh<br />
Genomics, The University of Edinburgh, Edinburgh, Scotland 2 . * J.Bayjanov@umcutrecht.nl<br />
Enterococcus faecium, a recently evolved multi-drug resistant nosocomial pathogen, is able to rapidly colonize the human<br />
gut. Previous work on animal, healthy human and clinical E. faecium strains has shown that clinical isolates form a<br />
distinct lineage. However, these studies lack a detailed niche-specific and longitudinal analysis of the evolutionary dynamics of<br />
this organism. Here we present a longitudinal analysis of the within-host evolutionary dynamics of E. faecium gut isolates, which<br />
were sampled from five patients over a period of 8 years. Whole-genome sequencing analysis showed that the rapid<br />
diversification of E. faecium clones in the patient gut is mainly due to recombination and phages. High diversification<br />
allows E. faecium clones to acquire new genes, including antibiotic resistance genes, which in turn allows this bacterium to<br />
rapidly colonize hostile environments.<br />
INTRODUCTION<br />
In recent decades, Enterococcus faecium, normally a<br />
harmless gut commensal, has emerged as an important<br />
multi-drug resistant nosocomial pathogen. Previous work<br />
has shown that clinical isolates of E. faecium form a subpopulation<br />
that is distinct from strains isolated from<br />
animals and healthy humans (Lebreton et al., 2013). We<br />
used whole-genome sequencing to characterize how<br />
clinical E. faecium strains evolve during long-term patient<br />
gut colonization.<br />
METHODS<br />
The genomes of 96 E. faecium gut isolates, obtained over<br />
8 years from 5 different patients, were sequenced using<br />
Illumina HiSeq 2x100bp paired-end sequencing. Quality<br />
filtering of sequence reads was performed using Nesoni<br />
(version 0.117) (Nesoni, 2014), and high-quality reads<br />
were assembled into contiguous sequences using the SPAdes<br />
assembler (version 3.1.0) (Bankevich et al., 2012).<br />
Subsequently, assembled sequences were annotated using<br />
Prokka (v 1.10) (Seemann, 2014). In addition to these 96<br />
genomes, we also included publicly available genome<br />
sequences of 70 E. faecium strains, which were<br />
downloaded from the NCBI GenBank database. In the set of<br />
166 strains, orthology between genes was identified using<br />
orthAgogue (Ekseth et al., 2014), and orthologous genes<br />
were clustered into ortholog groups using the MCL algorithm<br />
(Enright et al., 2002). Core genome alignments were then<br />
constructed by concatenating core gene sequences and<br />
were filtered for recombination using Gubbins (Croucher<br />
et al., 2015). Subsequently, recombination-filtered core<br />
genome alignments were used to construct a phylogenetic<br />
tree. In addition to core-genome based analyses, we<br />
also studied gene gain and loss across time.<br />
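The core-genome concatenation step can be sketched in a few lines of Python. This is an illustration only, not the pipeline used in the study: it assumes each ortholog group has already been aligned and is available as a mapping from strain name to one aligned sequence, and the function name and data layout are hypothetical.

```python
def core_alignment(groups, strains):
    """Concatenate the aligned sequences of core ortholog groups per strain.

    groups:  {group_id: {strain: aligned_sequence}} (assumed layout)
    strains: list of strain names that must all be present for a group to be core
    """
    # A group counts as "core" when every strain contributes a sequence to it.
    core = sorted(
        group_id
        for group_id, members in groups.items()
        if all(strain in members for strain in strains)
    )
    # Concatenate in a fixed group order so columns line up across strains.
    alignment = {
        strain: "".join(groups[group_id][strain] for group_id in core)
        for strain in strains
    }
    return alignment, core
```

The resulting concatenated alignment is the kind of input that a recombination filter and a tree-building program would then consume.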
RESULTS & DISCUSSION<br />
As expected, all 96 isolates were grouped in E. faecium<br />
clade A, with only one strain clustering in clade A-2,<br />
which mainly contains animal isolates. The remaining 95<br />
strains were assigned to clade A-1, which consists almost<br />
exclusively of clinical isolates. The<br />
phylogenetic tree showed 5 clusters of closely related<br />
patient strains, revealing the microevolution of E.<br />
faecium strains during gut colonization. We also suspect<br />
that direct transfer of strains occurred between<br />
patients during hospitalization in the same ward.<br />
Additionally, analysis of gene gain and loss across time<br />
showed that loss and gain of prophages is an important<br />
factor in generating genetic diversity during gut<br />
colonization.<br />
This study highlights the ability of E. faecium clones to<br />
rapidly diversify, which may contribute to the ability of<br />
this bacterium to efficiently colonize new environments<br />
and rapidly acquire antibiotic resistance determinants.<br />
REFERENCES<br />
Lebreton F, et al. "Emergence of epidemic multidrug-resistant<br />
Enterococcus faecium from animal and commensal strains". MBio.<br />
4(4):e00534-13, 2013.<br />
Nesoni. https://github.com/Victorian-Bioinformatics-Consortium/nesoni<br />
Bankevich A, et al. "SPAdes: A New Genome Assembly Algorithm and<br />
Its Applications to Single-Cell Sequencing". Journal of<br />
Computational Biology. 19(5):455-477, 2012.<br />
Seemann T. "Prokka: rapid prokaryotic genome annotation".<br />
Bioinformatics. 30(14):2068-9, 2014.<br />
Ekseth OK, et al. "orthAgogue: an agile tool for the rapid prediction of<br />
orthology relations". Bioinformatics. 30(5):734-6, 2014.<br />
Enright AJ, et al. "An efficient algorithm for large-scale detection of<br />
protein families". Nucleic Acids Res. 30(7):1575-1584, 2002.<br />
Croucher NJ, et al. "Rapid phylogenetic analysis of large samples of<br />
recombinant bacterial whole genome sequences using Gubbins".<br />
Nucleic Acids Res. 43(3):e15, 2015.<br />
P7. XCMS OPTIMISATION IN HIGH-THROUGHPUT LC-MS QC<br />
Charlie Beirnaert 1,2* , Matthias Cuykx 3 , Adrian Covaci 3 & Kris Laukens 1,2 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical Informatics Research Centre<br />
Antwerp (biomina) 2 ; Toxicological Centre, University of Antwerp 3 . * charlie.beirnaert@uantwerpen.be<br />
In high-throughput untargeted metabolomics studies, quality control is still a prominent bottleneck. In analogy to a<br />
recently developed QC tool for proteomics, work in our research group aims to develop a QC environment specific to<br />
metabolomics. One component in this work is the XCMS analysis software for LC-MS data, which is very sensitive to its input parameters.<br />
The presented work deals with the automatic optimisation of the XCMS parameters by building<br />
further upon an existing framework for XCMS optimisation. The additions to this framework will be the inclusion of<br />
quantified resolution data, obtained from the otherwise ignored profile data, and intelligent use of the isotopic profile of<br />
measured compounds.<br />
INTRODUCTION<br />
Metabolomics is the study of small molecules, or<br />
metabolites. These metabolites have an enormous<br />
chemical diversity and are only now starting to be<br />
identified in a high-throughput fashion. The reason for this is<br />
the adoption of high performance liquid chromatography<br />
mass spectrometry (LC-MS) and nuclear magnetic resonance<br />
spectroscopy. However, the analysis of these large<br />
datasets is not trivial: for LC-MS specifically, there are<br />
almost more ways of analysing data than there are<br />
researchers. Arguably, the most commonly used software<br />
platform for the initial analysis is XCMS (Smith et al.,<br />
2006). However, the output of XCMS depends strongly<br />
on the input parameters. Often the default parameters are<br />
chosen, or they are adapted according to the intuition of the<br />
researcher, without accounting for the false<br />
positives this may introduce. Optimization algorithms have been<br />
constructed using a dilution series (Eliasson et al.,<br />
2012) and using the carbon isotope (Libiseller et al.,<br />
2015). In this work, we build further upon the latter by<br />
including quantified information from the profile m/z<br />
domain (the continuous data in the m/z dimension), where<br />
accurate resolutions can be obtained for the mono-isotopic<br />
peaks and other isotopes. The developed optimisation can<br />
be used for both the data analysis and the quality control<br />
framework that is under development.<br />
METHODS<br />
The proposed work uses XCMS to find the peaks of<br />
interest in the data. To optimise this process, the results<br />
from XCMS are analysed for the occurrence of peaks and<br />
their isotopes. In this step, the raw profile data is inspected<br />
around the peaks identified by XCMS, both to<br />
quantify the peak resolution and to detect<br />
missed isotopes.<br />
Centroid vs Profile data: Modern day MS specialists use<br />
centroid data because the file size is considerably lower.<br />
The mass spectrometer converts the continuous data in the<br />
m/z dimension to a collection of spikes where each<br />
approximately Gaussian peak is converted to a single<br />
spike (delta function with the same height as the original<br />
peak). All other data is discarded. The result is a huge<br />
reduction in the file size but a loss of the peak shape and,<br />
as a result, no quantification of the resolution is possible.<br />
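To make concrete what profile data allows that centroid data does not, here is a small Python sketch that estimates the resolution of a single profile-mode peak. It assumes the common definition of resolution as m divided by the full width at half maximum (FWHM). The abstract does not state which definition the authors use, so this choice, like the function name, is an assumption of the sketch.

```python
def fwhm_resolution(mz, intensity):
    """Estimate m/z resolution (m / FWHM) of one isolated profile-mode peak."""
    apex = max(range(len(intensity)), key=intensity.__getitem__)
    half = intensity[apex] / 2.0

    def crossing(indices):
        # Walk away from the apex until the intensity drops below half maximum,
        # then interpolate linearly between the two bracketing data points.
        prev = apex
        for i in indices:
            if intensity[i] < half:
                frac = (intensity[prev] - half) / (intensity[prev] - intensity[i])
                return mz[prev] + frac * (mz[i] - mz[prev])
            prev = i
        raise ValueError("peak does not fall below half maximum on this side")

    left = crossing(range(apex - 1, -1, -1))
    right = crossing(range(apex + 1, len(intensity)))
    return mz[apex] / (right - left)
```

With centroided data only the apex position and height survive, so `right - left` cannot be computed at all, which is the loss described in the paragraph above.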
Optimization parameter: The peaks and their isotopes<br />
are characterized by a Gaussian in the chromatographic<br />
dimension and are spaced apart by about 1.0034 Da (the 13C-12C<br />
mass difference, for singly charged ions) in the m/z<br />
dimension. When an isotope is missing or the extracted<br />
peak does not appear in enough samples (for example, in at least<br />
50% of the samples in the sample group), the peak is<br />
categorized as "unreliable". When a peak is present in all<br />
samples or has a clear isotopic distribution, it is considered<br />
"reliable". With these measures a so-called peak picking<br />
score can be calculated, which in turn can be optimised by<br />
a variety of methods. This results in an increase in reliable<br />
peaks without an increase in false positives.<br />
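The reliable/unreliable classification can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' scoring code: the isotope spacing is set to the 13C-12C mass difference of about 1.00336 Da for singly charged ions, the tolerances are arbitrary, and the score formula (reliable peaks squared, divided by the number of unreliable peaks) is a placeholder chosen only so that more reliable and fewer unreliable peaks give a higher value.

```python
C13_SPACING = 1.00336  # Da; 13C-12C mass difference, singly charged ions assumed


def classify_peaks(peaks, mz_tol=0.005, rt_tol=2.0):
    """Label each peak 'reliable' or 'unreliable'.

    peaks: list of dicts with 'mz', 'rt' (seconds) and 'fraction'
           (share of samples in the group that contain the peak).
    """
    def isotope_pair(p, q):
        # Two peaks form an isotope pair if they co-elute and differ by one 13C.
        return (abs(abs(q["mz"] - p["mz"]) - C13_SPACING) <= mz_tol
                and abs(q["rt"] - p["rt"]) <= rt_tol)

    labels = []
    for p in peaks:
        has_isotope = any(isotope_pair(p, q) for q in peaks)
        in_all_samples = p["fraction"] >= 1.0
        labels.append("reliable" if has_isotope or in_all_samples else "unreliable")
    return labels


def peak_picking_score(peaks, **tolerances):
    """Placeholder score: rewards reliable peaks, penalises unreliable ones."""
    labels = classify_peaks(peaks, **tolerances)
    reliable = labels.count("reliable")
    unreliable = labels.count("unreliable")
    return reliable ** 2 / max(unreliable, 1)
```

An optimiser would rerun peak picking with different parameter settings and keep the setting that maximises such a score.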
Analysis & Quality control: The optimisation of the<br />
XCMS parameters is useful not only in the analysis of the data<br />
itself but also in quality control for large<br />
scale LC-MS experiments. By quantifying the<br />
resolutions of all relevant peaks in a dataset corresponding<br />
to a control sample, it is possible to monitor the quality of<br />
spectra. When this is combined with other QC<br />
frameworks, like iMonDB (Bittremieux et al., 2015), it is<br />
possible to assure the quality of all experiments in a<br />
long-lasting study.<br />
RESULTS & DISCUSSION<br />
The aim is to use the profile data to improve the available<br />
optimization algorithms. It remains to be seen<br />
whether the extra information in this data (compared to<br />
centroid data) justifies the increased need for computing<br />
resources. Nonetheless, profile data provides a valuable<br />
contribution to LC-MS optimization, because it enables<br />
researchers to evaluate (quantitatively) and improve the<br />
m/z resolution.<br />
REFERENCES<br />
Smith CA et al. Anal. Chem. 78(3), 779-789, (2006).<br />
Eliasson M. et al. Anal. Chem. 84(15), 6869-6876, (2012).<br />
Libiseller G. et al. BMC Bioinformatics 16:118, (2015).<br />
Bittremieux W. et al. J. Proteome Res. 14(5), 2360-2366, (2015).<br />
P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA<br />
Vincent Branders 1,2* , Chedly Kastally 2 & Patrick Mardulyn 2 .<br />
Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied<br />
Mathematics (ICTEAM), Université catholique de Louvain 1 ; Evolutionary Biology and Ecology, Université libre de<br />
Bruxelles 2 . * vincent.branders@uclouvain.be<br />
Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Due to their<br />
similarity to mitochondrial DNA sequences, numts have led to many misinterpretations, from overestimation of<br />
diversity to a spurious association between cystic fibrosis and mitochondrial genome variation. To avoid such bias induced<br />
by numts, these sequences have to be identified. Current methodologies are based on comparisons of existing nuclear<br />
and mitochondrial sequences and searches for similarities. The Pacific Biosciences (PacBio) technology generates<br />
sequencing reads that span thousands of base pairs, which gives the opportunity to identify numts by looking for reads<br />
with regions similar to mitochondrial sequences that are surrounded by regions highly different from them. This should allow the<br />
systematic identification of numts without a complete known nuclear reference.<br />
INTRODUCTION<br />
The transfer of DNA from mitochondria to the nucleus<br />
generates nuclear copies of mitochondrial DNA (numts).<br />
Numts have been found in many species including yeasts,<br />
rodents and plants. Due to their similarity to mitochondrial<br />
DNA, numts are responsible for many misinterpretations,<br />
both in mitochondrial disease studies and phylogenetic<br />
reconstructions (Hazkani-Covo et al., 2010). Numt<br />
variation has commonly been misreported as<br />
mitochondrial mutations in patients (Yao et al., 2008).<br />
Moreover, DNA barcoding was found to overestimate the<br />
number of species when numts are coamplified (Song et<br />
al., 2008). Current methods identify such sequences by<br />
aligning mitochondrial sequences against the nuclear<br />
genome and identifying similar regions (Figure 1, left).<br />
The PacBio technology allows the sequencing of DNA<br />
fragments spanning thousands of base pairs. This read length<br />
should allow the identification of numts without the need<br />
for a complete nuclear reference, which is lacking for species<br />
such as the insect Gonioctena intermedia. Indeed, it should be<br />
possible to use a mitochondrial assembly to identify<br />
PacBio reads with a central region similar to the<br />
mitochondrial sequence, enclosed by nuclear regions that<br />
are dissimilar to it (Figure 1, right).<br />
FIGURE 1. Identification of numts – Existing methods (left) and proposed<br />
method (right). Comparison of mitochondrial sequence to nuclear<br />
sequence (left) or long reads (right).<br />
METHODS<br />
The proposed approach aligns PacBio reads to a<br />
mitochondrial genome (here de novo assemblies of PacBio<br />
reads and Illumina HiSeq 2000 reads are used). In these<br />
long reads, numts are identified with one region similar<br />
to the mitochondrial genome but surrounded by regions<br />
that are not similar. We introduce different criteria to<br />
distinguish reads that are presumably numts and reads of<br />
mitochondrial origin (Figure 2). The DNA sequences come<br />
from an insect (Gonioctena intermedia) without a reference<br />
genome.<br />
FIGURE 2. Mitochondrial reads and numts with nuclear borders.<br />
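The distinction between numt-containing reads and reads of mitochondrial origin can be sketched from alignment coordinates alone. The Python function below is a hypothetical illustration, not the criteria used by the authors: it assumes each read has a single alignment interval on the read (`aln_start`, `aln_end`, 0-based, end-exclusive) against the mitochondrial assembly, and the two length thresholds are arbitrary placeholders.

```python
def classify_read(read_length, aln_start, aln_end, min_flank=500, min_hit=200):
    """Classify a long read from the span of its mitochondrial alignment.

    A read is a potential numt when a mitochondrial-like segment is flanked on
    both sides by sequence that does not align to the mitochondrial genome.
    """
    hit_length = aln_end - aln_start
    if hit_length < min_hit:
        return "unaligned"

    left_flank = aln_start                 # unaligned bases before the hit
    right_flank = read_length - aln_end    # unaligned bases after the hit

    if left_flank >= min_flank and right_flank >= min_flank:
        return "potential_numt"
    if left_flank < min_flank and right_flank < min_flank:
        return "mitochondrial"
    # Only one long unaligned flank: the numt may continue past the read end.
    return "ambiguous"
```

For example, `classify_read(8000, 3000, 4500)` describes a 1.5 kb mitochondrial-like segment in the middle of an 8 kb read and is labelled a potential numt, whereas a read aligned almost end to end is labelled mitochondrial.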
RESULTS & DISCUSSION<br />
A systematic identification of potential numts is proposed:<br />
through alignments, we identify 10 mitochondrial reads<br />
and 34 reads with potential numts for one particular<br />
mitochondrial region (the widely studied cytochrome<br />
oxidase I gene). In this exploratory study, we highlight<br />
the usefulness of Pacific Biosciences data for the<br />
identification of numts when no nuclear reference is<br />
available. The approach only requires PacBio reads and a<br />
mitochondrial assembly. It is more<br />
efficient than identifying numts through short<br />
reads, which would require the complete reconstruction of<br />
both mitochondrial and nuclear genomes. A systematic<br />
identification of numts in non-model organisms should<br />
avoid misinterpretations in studies where numts could be<br />
sources of bias. Our current distinction between numts and<br />
mitochondrial reads is quite simple. Refining<br />
this distinction is a direction for future improvement.<br />
REFERENCES<br />
Hazkani-Covo E. et al. PLOS Genetics 6, 1-11 (2010).<br />
Song H. et al. PNAS 105, 13486-13491 (2008).<br />
Yao Y. G. et al. Journal of Medical Genetics 45, 769-772 (2008).<br />
P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING<br />
SCHEMES FOR BACTERIA<br />
Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.<br />
Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.<br />
INTRODUCTION<br />
As next-generation sequencing in general, and whole<br />
genome sequencing (WGS) in particular, is increasingly<br />
adopted in public health for routine surveillance tasks,<br />
there is a clear need to incorporate this new technology in<br />
the day-to-day operational workflow of a public health<br />
institute. As cluster detection based on WGS data is<br />
evolving into a commodity, thanks to technologies such as<br />
whole genome multi-locus sequence typing (wgMLST),<br />
the question remains as to how WGS-based data analysis<br />
can be used to build up a human-friendly but high-precision<br />
and epidemiologically consistent naming<br />
strategy for communication purposes.<br />
METHODS<br />
For various organisms, the use of so-called ‘SNP<br />
addresses’ (based on single nucleotide polymorphisms or<br />
SNPs) has been proposed to build up a hierarchical<br />
naming scheme (see [1], [2]). This idea relies on single<br />
linkage clustering of isolates at different levels of<br />
similarity or distance, hence leading to a hierarchical name.<br />
However, the main difficulty here is to define the<br />
appropriate levels of similarity to cluster on, and the<br />
dependence of the naming scheme on the samples at hand.<br />
Moreover, the SNP approach might not provide the best<br />
type of data for this due to its relatively large volatility.<br />
In this work, we present a mathematical framework to<br />
define the levels of similarity upon which single linkage<br />
clustering makes sense. For this, we model the observed<br />
multimodal distribution of pairwise similarities between<br />
samples to obtain a theoretical model of the similarity<br />
distribution, and from there infer the most likely breaking<br />
points for stable similarity cutoffs. This is done in a data-independent<br />
manner, and is therefore applicable to SNP<br />
data, but also to wgMLST data and even gene presence-absence<br />
data. We assess the stability of the naming<br />
scheme by using a cross-validation approach.<br />
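The hierarchical naming idea (single linkage clustering repeated at several distance cutoffs) can be sketched in Python as follows. This is a generic illustration, not the framework presented here: the cutoffs are supplied by the caller rather than inferred from the similarity distribution, and the function names are hypothetical.

```python
def single_linkage(n, dist, threshold):
    """Cluster labels for n isolates; pairs within `threshold` are linked."""
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    # Number clusters in order of first appearance so labels are deterministic.
    labels, seen = [], {}
    for i in range(n):
        labels.append(seen.setdefault(find(i), len(seen) + 1))
    return labels


def hierarchical_names(dist, thresholds):
    """Build one dotted name per isolate, coarsest cutoff first."""
    n = len(dist)
    levels = [single_linkage(n, dist, t) for t in sorted(thresholds, reverse=True)]
    return [".".join(str(level[i]) for level in levels) for i in range(n)]
```

Note that the cluster numbers depend on the order and composition of the input, which reflects the sample dependence mentioned above as a difficulty. A stable scheme would need persistent cluster identifiers.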
RESULTS & DISCUSSION<br />
We apply our methods to propose a wgMLST-based<br />
naming scheme for Listeria monocytogenes. Using a<br />
reference dataset of the diversity within Listeria<br />
monocytogenes, and an extensive data set of over 4000<br />
isolates from real-time surveillance, we show the stability<br />
of the naming scheme, and the epidemiological<br />
concordance.<br />
REFERENCES<br />
[1] Dallman T et al., Applying phylogenomics to understand the<br />
emergence of Shiga Toxin producing Escherichia coli<br />
O157:H7 strains causing severe human disease in the<br />
United Kingdom. Microbial Genomics., 10.1099/mgen.0.000029<br />
[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium<br />
tuberculosis, Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi:<br />
10.1016/j.tube.2014.02.005<br />
P10. FROM SNPS TO PATHWAYS: AN APPROACH TO STRENGTHEN<br />
BIOLOGICAL INTERPRETATION OF GWAS RESULTS<br />
Elisa Cirillo 1,* , Michiel Adriaens 2 & Chris T Evelo 1,2 .<br />
1 Department of Bioinformatics – BiGCaT, Maastricht University, The Netherlands<br />
2 Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, The Netherlands<br />
* elisa.cirillo@maastrichtuniversity.nl<br />
Pathway and network analysis are established and powerful methods for providing a biological context for a variety of<br />
omics data, including transcriptomics, proteomics and metabolomics. These approaches could in theory also be a boon<br />
for the interpretation of genetic variation data, for instance in the context of Genome Wide Association Studies (GWAS),<br />
as it would allow the study of genetic variants in the context of the biological processes in which the implicated genes<br />
and proteins are involved. However, currently genetic variation data cannot easily be integrated into pathways.<br />
Additionally, it is not clear how to visualise and interpret genetic variation data once connected to pathway content. In<br />
this project we take up that challenge and aim to (i) visualise SNPs from a Type 2 Diabetes Mellitus (T2DM) GWAS<br />
dataset on pathways and (ii) generate and analyze a network of all associated genes and pathways. Together, this could<br />
enable a comprehensive pathway and network interpretation of genetic variations in the context of T2DM.<br />
INTRODUCTION<br />
GWAS has become a common approach for the discovery of<br />
gene-disease relationships, in particular for complex<br />
diseases like T2DM (Wellcome Trust Case Control Consortium,<br />
2007). However, biological interpretation remains a<br />
challenge, especially when it concerns connecting genetic<br />
findings with known biological processes. We wish to<br />
improve the interpretation of GWAS results, using a<br />
meaningful network representation that links SNPs to<br />
biological processes.<br />
METHODS<br />
We selected a GWAS data set related to T2DM from a<br />
meta GWAS resource for diseases created by Johnson et<br />
al. (2009), and we extracted 1971 SNPs associated with<br />
T2DM.<br />
We identified the location of each SNP using the Variant<br />
Effect Predictor (VEP) (http://www.ensembl.org) and<br />
classified the SNPs into 5 categories (Figure 1): exonic, 3' UTR,<br />
5' UTR, intronic and intergenic. SNPs located in the first<br />
three categories are easily connected to genes using<br />
BioMart Ensembl (http://www.ensembl.org/). Pathways<br />
related to these genes are identified from the curated<br />
collection of WikiPathways (Kutmon et al., 2015). SNPs,<br />
genes and pathways are visualized in networks using<br />
Cytoscape (Shannon et al., 2003).<br />
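The linking step can be pictured as building a three-layer edge list (SNP to gene, gene to pathway). The Python sketch below assumes the two mappings have already been obtained (for instance from the annotation and pathway resources named above) and are held as plain dictionaries. The function name, the edge labels and the identifiers used in the test are placeholders, not data from the study.

```python
def build_network(snp_to_gene, gene_to_pathways):
    """Return a sorted edge list linking SNPs to genes and genes to pathways.

    snp_to_gene:      {snp_id: gene_id}
    gene_to_pathways: {gene_id: iterable of pathway names}
    """
    edges = []
    for snp, gene in sorted(snp_to_gene.items()):
        edges.append((snp, "snp-gene", gene))
    # Only genes that carry at least one SNP are expanded to pathways;
    # genes absent from every pathway simply contribute no further edges.
    for gene in sorted(set(snp_to_gene.values())):
        for pathway in sorted(gene_to_pathways.get(gene, [])):
            edges.append((gene, "gene-pathway", pathway))
    return edges
```

Each tuple can be written out as one line of a network file and loaded into a network viewer, with the middle element serving as the edge type.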
RESULTS & DISCUSSION<br />
We analysed four gene-related SNP categories: 3' and 5'<br />
UTR, intronic and exonic. The exonic category was<br />
divided into 8 SNP sub-categories based on sequence<br />
interpretation: up- and downstream, splice region,<br />
synonymous, missense, stop/gain, transcription factor<br />
binding, and non-coding transcript. For each of the 11<br />
resulting categories we created a SNP-disease gene-pathway<br />
network. Disease-related genes are not always<br />
included in pathways, and this is also the case for disease<br />
genes in which SNPs from the GWAS were found. For the<br />
SNPs that are related to genes in pathways, we performed a<br />
pathway gene set enrichment analysis and evaluated<br />
whether the resulting pathways were already known to be<br />
related to T2DM.<br />
SNPs in intergenic regions need to be analysed and<br />
visualized differently. A possible approach is to use<br />
expression quantitative trait locus (eQTL) data, which<br />
relate SNPs in intergenic regions to the modulation of<br />
distal gene expression. Such datasets are available for many<br />
different human tissues and can provide additional<br />
regulatory information for pathways and the genes they<br />
comprise.<br />
FIGURE 1. Pie chart of the 5 SNPs categories. The total number of SNPs<br />
is 2767.<br />
REFERENCES<br />
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000<br />
cases of seven common diseases and 3,000 shared controls. Nature.<br />
2007;447(7145):661-78.<br />
Johnson A, O'Donnell C. An Open Access Database of Genome-wide<br />
Association Results. BMC Medical Genetics. 2009;10(1):6.<br />
Kutmon M, Riutta A, Nunes N, Hanspers K, Willighagen E, Bohler A,<br />
Mélius J, Waagmeester A, Sinha S, Miller R, Coort S, Cirillo E,<br />
Smeets B, Evelo C, Pico A. WikiPathways: Capturing the Full<br />
Diversity of Pathway Knowledge. Accepted September 2015, NAR-<br />
02735- E- Database issue 2016.<br />
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al.<br />
Cytoscape: A Software Environment for Integrated Models of<br />
Biomolecular Interaction Networks. Genome Research.<br />
2003;13(11):2498-504.<br />
P11. IDENTIFICATION OF TRANSCRIPTION FACTOR CO-ASSOCIATIONS<br />
IN SETS OF FUNCTIONALLY RELATED GENES<br />
Pieter De Bleser 1,2,4* , Arne Soetens 1,2,4 & Yvan Saeys 1,3,4 .<br />
VIB Inflammation Research Center 1 ; Department of Biomedical Molecular Biology 2 , Department of Respiratory<br />
Medicine 3 , Ghent University 4 . * pieterdb@irc.vib-ugent.be<br />
Co-associations between transcription factors (TFs) have been studied genome-wide and resulted in the identification of<br />
frequently co-associated pairs of TFs. Co-association of TFs at distinct binding sites is contextual: different combinations<br />
of TFs co-associate at different genomic locations, producing a condition-dependent gene expression profile for a cell.<br />
Here, we present a novel method to identify these condition-dependent co-associations of TFs in sets of functionally<br />
related genes.<br />
INTRODUCTION<br />
The functional expression of genes is achieved by<br />
particular interactions of regulatory transcription factors<br />
(TFs) operating at specific DNA binding sites of their<br />
target genes. Dissecting the specific co-associations of TFs<br />
that bind each target gene represents a difficult challenge.<br />
Co-associations of transcription factor pairs have been<br />
studied genome-wide and resulted in the identification of<br />
frequently co-associated pairs of TFs (ENCODE Project<br />
Consortium, 2012). It was found that TFs co-associate in a<br />
context-specific fashion: different combinations of TFs<br />
bind different target sites and the binding of one TF might<br />
influence the preferred binding partners of other TFs. Here,<br />
we present a tool to identify these condition-dependent co-associations<br />
of TFs in sets of functionally related genes<br />
(e.g. metabolic pathways, tissues, sets of TF target genes,<br />
sets of differentially regulated genes).<br />
METHODS<br />
In a first step, we determine the set of regulatory TFs for<br />
each gene (Tang et al., 2011) in the set using the ChIP-Seq<br />
binding data for 237 TFs from the ReMap database<br />
(Griffon et al., 2015). This results in a number of<br />
regulatory ChIP-Seq binding regions per TF per gene,<br />
represented as a matrix in which each row corresponds to<br />
a gene and each column corresponds to a TF. In the<br />
next step, this matrix is used as input to the distance<br />
difference matrix (DDM) algorithm, modified to<br />
accommodate this data. The DDM algorithm is a method<br />
that simultaneously integrates statistical<br />
over-representation and co-association of TFs (De Bleser et al.,<br />
2007). The result matrix is subsequently reduced, retaining<br />
only the columns of over-represented and co-associated<br />
TFs. Visualization is done by (1) hierarchical clustering of<br />
the reduced result matrix and reordering of the columns<br />
and (2) conversion of the reduced result matrix into a SIF<br />
(simple interaction file format) file, summarizing the<br />
regulator-regulated relationships between transcription<br />
factors and target genes. This SIF file can be imported into<br />
Cytoscape for visualization of the regulatory network.<br />
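The final conversion step can be illustrated with a short Python sketch that turns a gene-by-TF count matrix into SIF lines of the form "source, relation type, target", separated by tabs. This is an assumption-laden illustration rather than the authors' code: the function name and the relation label `regulates` are hypothetical, and any cell greater than zero is treated as a regulatory link.

```python
def matrix_to_sif(genes, tfs, matrix, relation="regulates"):
    """Convert a gene x TF matrix of binding-region counts into SIF text.

    genes:  row labels (target genes)
    tfs:    column labels (transcription factors)
    matrix: list of rows; matrix[i][j] is the count for genes[i] and tfs[j]
    """
    lines = []
    for gene, row in zip(genes, matrix):
        for tf, count in zip(tfs, row):
            if count > 0:
                # SIF line: source node, relationship type, target node.
                lines.append(f"{tf}\t{relation}\t{gene}")
    return "\n".join(lines)
```

Writing the returned string to a file with a `.sif` extension gives a network in which TFs point to the genes they bind. The counts themselves are dropped, since SIF carries no edge attributes.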
RESULTS & DISCUSSION<br />
FOXF1, TBX3, GATA6, IRX3, PITX2, DLL1 and<br />
NKX2-5 are experimentally verified target genes of the<br />
EZH2 transcription factor (Grote et al., 2013).<br />
Running the transcription factor co-association analysis<br />
method on this data set results in the clustering solution<br />
plot shown in Figure 1.<br />
The strongest associations between TFs are found between<br />
EZH2, POU5F1, SUZ12 and CTBP2. A secondary cluster<br />
of transcription factor associations is composed of<br />
EOMES, SMAD2+3 and NANOG.<br />
The finding of SUZ12 as a cofactor can be accounted for:<br />
EZH2 and SUZ12 are subunits of Polycomb repressive<br />
complex 2 (PRC2), which is responsible for the repressive<br />
histone 3 lysine 27 trimethylation (H3K27me3) chromatin<br />
modification (Yoo and Hennighausen, 2012). CTBP2 is a<br />
known transcriptional repressor (Turner and Crossley,<br />
2001).<br />
The method has been applied previously for the<br />
identification of TFs associated with both high tissue-specificity<br />
and high gene expression levels (Rincon et al.,<br />
2015). The method will be made available as a web tool.<br />
FIGURE 1. Transcription factor co-associations in the EZH2 data set.<br />
Note the tendency of EZH2 to co-localize with POU5F1, SUZ12 and<br />
CTBP2.<br />
REFERENCES<br />
De Bleser,P. et al. (2007) A distance difference matrix approach to identifying<br />
transcription factors that regulate differential gene expression. Genome Biol., 8,<br />
R83.<br />
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements<br />
in the human genome. Nature, 489, 57–74.<br />
Griffon,A. et al. (2015) Integrative analysis of public ChIP-seq experiments reveals<br />
a complex multi-cell regulatory landscape. Nucleic Acids Res., 43, e27.<br />
Grote,P. et al. (2013) The tissue-specific lncRNA Fendrr is an essential regulator of<br />
heart and body wall development in the mouse. Dev. Cell, 24, 206–214.<br />
Rincon,M.Y. et al. (2015) Genome-wide computational analysis reveals<br />
cardiomyocyte-specific transcriptional Cis-regulatory motifs that enable<br />
efficient cardiac gene therapy. Mol. Ther. J. Am. Soc. Gene Ther., 23, 43–52.<br />
Tang,Q. et al. (2011) A comprehensive view of nuclear receptor cancer cistromes.<br />
Cancer Res., 71, 6940–6947.<br />
Turner,J. and Crossley,M. (2001) The CtBP family: enigmatic and enzymatic<br />
transcriptional co-repressors. BioEssays News Rev. Mol. Cell. Dev. Biol., 23,<br />
683–690.<br />
Yoo,K.H. and Hennighausen,L. (2012) EZH2 methyltransferase and H3K27<br />
methylation in breast cancer. Int. J. Biol. Sci., 8, 59–65.<br />
P12. PHENETIC: MULTI-OMICS DATA INTERPRETATION USING<br />
INTERACTION NETWORKS<br />
Dries De Maeyer 1,2,3* , Bram Weytjens 1,2,3 , Luc De Raedt 4 & Kathleen Marchal 2,3 .<br />
Centre for Microbial and Plant Genetics, KULeuven 1 ; Department for Information Sciences (INTEC, IMinds), UGent 2 ;<br />
Department for Plant Biotechnology and Bioinformatics, UGent 3 ; Department of Computer Science, KULeuven 4 .<br />
* dries.demaeyer@biw.kuleuven.be<br />
The omics revolution has introduced new challenges when studying interesting phenotypes. High throughput omics<br />
technologies such as next-generation sequencing and microarray technologies generate large amounts of data.<br />
Interpreting the resulting data from these experiments is not trivial due to the data’s size and the inherent noise of the<br />
underlying technologies. In addition to this, the “omics” technologies have led to an ever expanding biological<br />
knowledge which has to be taken into account when interpreting new experimental results. Interaction networks in<br />
combination with subnetwork inference methods provide a solution to this problem by mining the current public<br />
interactomics knowledge using experimental omics data to better understand the molecular mechanisms driving the<br />
interesting phenotypes under study.<br />
INTRODUCTION<br />
Computational methods are becoming essential for<br />
analyzing large scale omics datasets in the light of current<br />
knowledge. By representing publicly available<br />
interactomics knowledge as interaction networks,<br />
subnetwork inference methods can extract the actual<br />
molecular mechanisms that drive an interesting phenotype.<br />
The PheNetic framework is such a method: it allows for<br />
mining interaction networks with multi-omics datasets.<br />
Using this framework, different types of biological<br />
applications have been analyzed in the past, such as KO-transcriptomics<br />
interpretation (De Maeyer, 2013),<br />
expression analysis (De Maeyer, 2015) and distinguishing<br />
driver from passenger mutations in eQTL experiments<br />
(De Maeyer, submitted).<br />
METHODS<br />
Interaction networks provide a flexible representation of<br />
public biological interactomics knowledge. These<br />
networks represent the physical interactions between<br />
genes and their corresponding gene products in the<br />
interactome of the organism under research (Cloots, 2011).<br />
The interaction network integrates different layers of<br />
homogeneous interactomics data, e.g. signalling, protein-protein,<br />
(post)transcriptional and metabolic interactomics<br />
data, into a single heterogeneous network representation.<br />
The PheNetic framework uses interaction networks to find<br />
biologically valid paths which connect (in)activated genes<br />
selected from multi-omics data sets. These paths provide a<br />
biological explanation of how the genes from these data<br />
sets can trigger each other. Finding the best explanations<br />
or paths in the interaction network corresponds to finding<br />
that subnetwork that best explains the observed results and<br />
provides an insight into the molecular mechanisms that<br />
drive the interesting phenotype. Depending on the type of<br />
biological application and the data provided, different types of<br />
paths can be used to infer the subnetwork, as in KO-transcriptomics<br />
interpretation (De Maeyer, 2013),<br />
expression analysis (De Maeyer, 2015) and the interpretation of<br />
eQTL experiments (De Maeyer, submitted).<br />
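The idea of connecting selected genes through paths in an interaction network can be illustrated with a deliberately simplified sketch. It is not PheNetic's actual inference procedure, which the abstract does not detail: here every edge is assumed to carry a hypothetical reliability between 0 and 1, the most probable path between each source and target gene is found with Dijkstra's algorithm on negative log probabilities, and the union of those paths is taken as the subnetwork.<br />

```python
import heapq
import math


def most_probable_path(graph, source, target):
    """graph: dict node -> list of (neighbour, probability) directed edges.

    Returns the node list of the path maximising the product of edge
    probabilities, or None if the target cannot be reached.
    """
    best = {source: 0.0}          # lowest -log(probability) found so far
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue              # stale heap entry
        for neighbour, prob in graph.get(node, []):
            new_cost = cost - math.log(prob)
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


def infer_subnetwork(graph, sources, targets):
    """Union of the best paths from every source gene to every target gene."""
    edges = set()
    for s in sources:
        for t in targets:
            path = most_probable_path(graph, s, t)
            if path:
                edges.update(zip(path, path[1:]))
    return edges


# Toy network with made-up node names and edge probabilities.
toy = {"A": [("B", 0.9), ("C", 0.5)], "B": [("D", 0.9)], "C": [("D", 0.9)]}
print(infer_subnetwork(toy, sources=["A"], targets=["D"]))
```

In this toy case the route through B is preferred because its combined probability (0.81) exceeds that of the route through C (0.45).<br />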
RESULTS & DISCUSSION<br />
In a first setup PheNetic was used to study the pathways<br />
and processes involved in acid resistance in Escherichia<br />
coli (De Maeyer, 2013). Using our framework we were<br />
able to determine the different molecular pathways that<br />
drive acid resistance and identify the regulators that<br />
underlie this phenotype. It was shown that subnetwork<br />
inference methods outperform naïve gene rankings in<br />
identifying the biological pathways associated with the<br />
phenotype under study.<br />
In a second setup PheNetic was used to interpret<br />
expression data (De Maeyer, 2015) to extract from the<br />
interaction network those parts that show differences<br />
in expression. This method is<br />
provided as a web server that can be accessed at<br />
http://bioinformatics.intec.ugent.be/phenetic<br />
and that allows for an intuitive and visual<br />
interpretation of the inferred subnetworks.<br />
In a third setup PheNetic was used to distinguish driver<br />
mutations from passenger mutations in coupled genetic-transcriptomics<br />
data sets from evolution experiments (De<br />
Maeyer, submitted). Evolved strains with the same phenotype are<br />
expected to have consistent changes in the same pathways.<br />
Therefore, finding the subnetwork that best connects the<br />
mutations to the differentially expressed genes over all<br />
strains is expected to identify the driver mutations over<br />
passenger mutations in combination with identifying the<br />
molecular mechanisms that induce the observed change in<br />
phenotype. This approach provides a systemic insight into<br />
both the biological processes and the genetic background that<br />
induce the phenotype.<br />
Based on the different approaches it can be concluded that<br />
PheNetic is a flexible framework for subnetwork selection<br />
that allows for solving a large variety of biological<br />
applications using multi-omics data sets.<br />
REFERENCES<br />
Cloots, L., & Marchal, K. (2011). Curr Opin Microbiol, 14(5), 599-607.<br />
De Maeyer, D., Renkens, J., Cloots, L., De Raedt, L., & Marchal, K.<br />
(2013). Mol Biosyst, 9(7), 1594-1603.<br />
De Maeyer, D., Weytjens, B., Renkens, J., De Raedt, L., & Marchal, K.<br />
(2015). Nucleic Acids Res, 43(W1), W244-250.<br />
De Maeyer, D., Weytjens, B., De Raedt, L., & Marchal, K. Molecular<br />
biology and evolution. Submitted<br />
P13. THE ROLE OF HLA ALLELES UNDERLYING CYTOMEGALOVIRUS<br />
SUSCEPTIBILITY IN ALLOGENEIC TRANSPLANT POPULATIONS<br />
Nicolas De Neuter 1,2* , Benson Ogunjimi 3 , Anke Verlinden 4 , Kris Laukens 1,2 & Pieter Meysman 1,2 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />
Antwerpen (biomina) 2 ; Centre for Health Economics Research and Modeling Infectious Diseases (CHERMID), Vaccine<br />
and Infectious Disease Institute, University of Antwerp 3 ; Antwerp University Hospital 4 .<br />
* nicolas.deneuter@uantwerpen.be<br />
In this study, we aim to characterize those HLA alleles that increase or decrease the risk of cytomegalovirus infections<br />
following tissue or organ transplants. This HLA-dependent susceptibility will then be explained using state-of-the-art<br />
HLA peptide affinity methods to identify the underlying molecular reason. This insight can greatly aid prediction of<br />
those transplantation patients that are most at risk from cytomegalovirus infection.<br />
INTRODUCTION<br />
Patients suffering from disorders of the hematopoietic<br />
system or with chemo-, radio-, or immuno-sensitive<br />
malignancies such as leukemia often receive<br />
hematopoietic stem cell transplantation therapy (HSCT).<br />
The transplantation is preceded by a conditioning regimen<br />
that eradicates the recipient’s malignant cell population<br />
through intensive chemotherapy and irradiation,<br />
simultaneously ablating the recipient’s bone marrow. Self<br />
(autologous) or non-self (allogeneic) hematopoietic stem<br />
cells are then reintroduced into the recipient after which<br />
they are allowed to reestablish hematopoietic functions.<br />
HSCT is associated with high morbidity and mortality and<br />
requires careful monitoring of patients during the weeks<br />
following transplantation. Opportunistic cytomegalovirus<br />
(CMV) infections are one of the major causes of this high<br />
morbidity and mortality and can occur in up to 80% of<br />
HSCT patients, depending on the use of prophylactic<br />
treatment or pre-emptive therapy and the serological CMV<br />
status of donor and recipient. CMV disease can manifest<br />
itself as life-threatening pneumonia, gastrointestinal<br />
disease, retinitis, encephalitis or hepatitis.<br />
The relevance of HLA alleles in varicella zoster virus<br />
associated disease has recently been demonstrated by our<br />
group (Meysman et al., 2015) and similar insights might<br />
be gained in CMV related disease. Several studies have<br />
already shown a correlation between the incidence of<br />
CMV infection and the presence of certain human<br />
leukocyte antigen (HLA) alleles in the transplant<br />
recipient. However, the exact alleles identified in previous<br />
studies are very inconsistent, likely due to small sample<br />
sizes and type I errors arising from multiple testing.<br />
METHODS<br />
Anonymized patient records on the HLA alleles, CMV<br />
infection and serological status of 1284 transplant<br />
recipients were collected from the Antwerp University<br />
Hospital (UZA). This data set was further extended with<br />
publicly available HLA data from transplant patients, and<br />
the counts of the HLA alleles present at each locus were<br />
combined. A hypergeometric distribution was used to test<br />
HLA loci (A, B, C, DRB1, DQB1 and DPB1) for<br />
statistical over- or underrepresentation of their respective<br />
alleles. HLA alleles were tested for over- or<br />
underrepresentation in two test populations: recipients<br />
who were seropositive for CMV before transplantation<br />
and recipients who developed a CMV infection post-transplantation.<br />
In the latter case, we also examined whether<br />
donor seropositivity had an influence on the CMV<br />
infection status. The P value cutoff of 0.05 was<br />
adjusted with a Bonferroni correction for multiple testing,<br />
in this case for the number of alleles tested per locus.<br />
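A hypergeometric test for over- or underrepresentation of one allele, with a per-locus Bonferroni threshold, can be sketched as follows. The parameter names and the example counts are assumptions made for illustration: N is the total number of allele copies at the locus in the background population, K the number of copies of the allele of interest among them, n the number of allele copies in the test population and k the copies of the allele observed there.<br />

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of drawing exactly k copies of the allele in n draws."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def over_under_pvalues(k, N, K, n):
    """One-sided p-values for over- and underrepresentation."""
    lowest = max(0, n - (N - K))   # smallest possible count
    highest = min(K, n)            # largest possible count
    p_over = sum(hypergeom_pmf(i, N, K, n) for i in range(k, highest + 1))
    p_under = sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))
    return p_over, p_under


def bonferroni_threshold(alpha, alleles_tested):
    """Significance threshold after correcting for the alleles at one locus."""
    return alpha / alleles_tested


# Made-up counts: 2000 allele copies overall, 150 of the allele of interest,
# 400 copies in the test population, 45 of them the allele of interest.
p_over, p_under = over_under_pvalues(k=45, N=2000, K=150, n=400)
threshold = bonferroni_threshold(0.05, alleles_tested=20)
print(p_over < threshold, p_under < threshold)
```

An allele would be reported as enriched or depleted only if the corresponding p-value falls below the corrected threshold for its locus.<br />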
Putative nonameric peptides were generated in silico from<br />
CMV protein sequences available in online protein<br />
sequence repositories such as the UniProt Knowledgebase.<br />
Three complementary methods were employed to predict<br />
the affinity of each putative nonameric peptide to the<br />
significantly enriched or depleted HLA alleles. The<br />
methods used were: NetCTLpan, the stabilized matrix<br />
method (SMM) and an in-house-developed approach<br />
called CRFMHC. Peptide-binding affinity results of each<br />
predictor were normalized against the affinity of a<br />
restricted panel of human proteins and used to compare<br />
results between predictors. Additionally, each CMV<br />
protein was assessed for depletion of high-affinity<br />
peptides using a hypergeometric distribution.<br />
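The in silico generation of putative nonameric peptides amounts to sliding a window of nine residues over each protein sequence. The sketch below assumes plain one-letter amino acid strings as input; the example sequence is invented and does not correspond to a CMV protein.<br />

```python
def nonamers(sequence, length=9):
    """Return every overlapping peptide of the given length, in order."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


# Invented 11-residue sequence: yields three overlapping nonamers.
for peptide in nonamers("ACDEFGHIKLM"):
    print(peptide)
```

Each peptide produced this way would then be scored by the affinity predictors for every HLA allele of interest.<br />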
RESULTS<br />
Preliminary results on a small portion of the UZA data<br />
reveal HLA alleles underlying either CMV seropositivity<br />
or CMV infection that show a trend towards significance but do<br />
not reach the Bonferroni-corrected threshold. We expect<br />
the additional data to increase the power of the analysis.<br />
REFERENCES<br />
Meysman,P. et al. (2015) Varicella-Zoster Virus-Derived Major<br />
Histocompatibility Complex Class I-Restricted Peptide Affinity Is<br />
a Determining Factor in the HLA Risk Profile for the<br />
Development of Postherpetic Neuralgia. J. Virol., 89, 962–969.<br />
P14. NOVOPLASTY: IN SILICO ASSEMBLY OF PLASTID GENOMES FROM<br />
WHOLE GENOME NGS DATA<br />
Nicolas Dierckxsens 1,2* , Olivier Hardy 2 , Ludwig Triest 3 , Patrick Mardulyn 2 & Guillaume Smits 1,4 .<br />
Interuniversity Institute of Bioinformatics Brussels (IB2), ULB-VUB, Triomflaan CP 263, 1050 Brussels, Belgium 1 ;<br />
Evolutionary Biology and Ecology Unit, CP 160/12, Faculté des Sciences, Université Libre de Bruxelles, Av. F. D.<br />
Roosevelt 50, B-1050 Brussels, Belgium 2 ; Plant Biology and Nature Management, Vrije Universiteit Brussel, Brussels,<br />
Belgium 3 ; Department of Paediatrics, Hôpital Universitaire des Enfants Reine Fabiola (HUDERF), Université Libre de<br />
Bruxelles (ULB), Brussels, Belgium 4 . * nicolasdierckxsens@hotmail.com<br />
Thanks to the evolution in next-generation sequencing (NGS) technology, whole genome data can be readily obtained<br />
from a variety of samples. There are many algorithms available to assemble these reads, but few of them focus on<br />
assembling the plastid genomes. Therefore we developed a new algorithm that solely assembles the plastid genomes<br />
from whole genome data, starting from a single seed. The algorithm takes full advantage of very high<br />
coverage, which even allows it to assemble through problematic (AT-rich) regions. The algorithm has been<br />
tested on several whole genome Illumina datasets and it outperformed other assemblers in runtime and specificity. Every<br />
assembly resulted in a single contig for any chloroplast or mitochondrial genome, always within<br />
30 minutes.<br />
INTRODUCTION<br />
Chloroplasts and mitochondria are both responsible for<br />
generating metabolic energy within eukaryotic cells. Both<br />
plastids are maternally inherited and have a persistent gene<br />
organization, which makes them ideal for phylogenetic<br />
studies or as a barcode in plant and food identification<br />
(Brozynska et al., 2014). But assembling these plastid<br />
genomes is not always straightforward with the<br />
currently available tools. Therefore we developed a new<br />
algorithm, specifically for the assembly of plastid<br />
genomes from whole genome data.<br />
METHODS<br />
The algorithm is written in Perl. All assemblies were<br />
executed on an Intel Xeon machine with 24 cores<br />
at 2.93 GHz and a total of 96.8 GB of RAM. All non-human<br />
samples were sequenced on the Illumina HiSeq<br />
platform (101 bp paired-end reads). The human<br />
mitochondria samples (PCR-free) were sequenced on the<br />
Illumina HiSeqX platform (150 bp paired-end reads). The<br />
Gonioctena intermedia sample was also sequenced on the<br />
PacBio platform.<br />
RESULTS & DISCUSSION<br />
Algorithm. The algorithm is similar to string overlap<br />
algorithms like SSAKE (Warren et al., 2007) and VCAKE<br />
(Jeck et al., 2007). It starts by reading the sequences into<br />
a hash table, which allows quick access. The<br />
assembly has to be initiated by a seed that will be<br />
extended bidirectionally in iterations. The seed input is<br />
quite flexible: it can be one sequence read, a conserved<br />
gene or even a complete mitochondrial genome from a<br />
distant species. Every base extension is determined by a<br />
consensus between the overlapping reads. Unlike most<br />
assemblers, NOVOPlasty doesn’t try to assemble every<br />
read, but will extend the given seed until the circular<br />
plastid is formed.<br />
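The seed-and-extend principle described above can be illustrated with a toy sketch. It is not the NOVOPlasty implementation (which is written in Perl and extends in both directions): this version extends to the right only, which suffices on a circular molecule, it assumes error-free reads, and the word size k, the function names and the test genome are all invented for illustration.<br />

```python
from collections import Counter, defaultdict


def index_reads(reads, k):
    """Hash table: k-mer -> counts of the base that follows it in any read."""
    index = defaultdict(Counter)
    for read in reads:
        for pos in range(len(read) - k):
            index[read[pos:pos + k]][read[pos + k]] += 1
    return index


def extend_seed(seed, reads, k=8, max_len=1_000_000):
    """Extend the seed one base at a time by consensus of overlapping reads.

    The seed must be at least k bases long. Returns (contig, is_circular);
    extension stops when no read overlaps the contig end or when the
    contig end runs back into its own start.
    """
    index = index_reads(reads, k)
    contig = seed
    while len(contig) < max_len:
        votes = index.get(contig[-k:])
        if not votes:
            return contig, False
        contig += votes.most_common(1)[0][0]   # majority base wins
        if len(contig) >= 2 * k and contig.endswith(contig[:k]):
            return contig[:-k], True           # circle closed, trim overlap
    return contig, False


# Invented 48 bp circular "genome" sampled by overlapping 20 bp reads.
genome = "ATGGCGTACCTTAGACGGATTCAAGTCCGTGAACTGCATTGGCTAACG"
reads = [(genome * 2)[i:i + 20] for i in range(len(genome))]
contig, circular = extend_seed(genome[10:22], reads)
print(len(contig), circular)
```

Because only reads overlapping the growing contig are ever consulted, reads from the rest of the genome are simply ignored, which mirrors the design choice of not trying to assemble every read.<br />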
Assemblies. NOVOPlasty has currently been tested for the<br />
assembly of 8 chloroplasts and 6 mitochondria. Since<br />
chloroplasts contain an inverted repeat, two versions of the<br />
assembly are generated. They differ only in the orientation<br />
of the region between the two repeats; the correct one will<br />
have to be resolved manually. Except for the mitochondrion<br />
of the leaf beetle Gonioctena intermedia, all assemblies<br />
resulted in a complete circular genome. A comparative<br />
study of four assemblers for the mitochondrial genome of<br />
G. intermedia clearly shows the speed and specificity of<br />
NOVOPlasty (Table 1).<br />
Metric                    NOVOPlasty  MIRA   MITObim  ARC<br />
Duration (min)            12          536    4777*    586<br />
Memory (GB)               15          57.6   63.4     1.9<br />
Storage (GB)              0           144    418      12<br />
Total contigs             1           3434   2221     2502<br />
Mitochondrial contigs     1           1      4        48<br />
Coverage (%)              98          94     94       84<br />
Mismatches                10          25     26       2<br />
Unidentified nucleotides  43          194    197      0<br />
TABLE 1. Benchmarking results between four assemblies of the<br />
mitochondrial genome of Gonioctena intermedia. The assemblies were<br />
constructed with MITObim (Hahn et al., 2013), MIRA (Chevreux et al.,<br />
1999), ARC (Hunter et al., 2015) and NOVOPlasty. *Manually terminated.<br />
Discussion. Despite the many available assemblers, many<br />
researchers still struggle to find a good assembler for<br />
plastid genomes. NOVOPlasty offers an assembler<br />
specifically designed for plastids that will deliver the<br />
complete genome within 30 minutes. The algorithm will<br />
be tested on more datasets and a comparative study with<br />
other assemblers is in progress.<br />
REFERENCES<br />
Brozynska et al. PLoS One 9 (2014).<br />
Chevreux et al. Computer Science and Biology: Proceedings of the<br />
German Conference on Bioinformatics (GCB) (1999).<br />
Hahn et al. Nucleic Acids Research, 1-9 (2013).<br />
Hunter et al. http://dx.doi.org/10.1101/014662 (2015).<br />
Jeck et al. Bioinformatics 23, 2942-2944 (2007).<br />
Warren et al. Bioinformatics 23, 500-501 (2007).<br />
P15. ENANOMAPPER - ONTOLOGY, DATABASE AND TOOLS FOR<br />
NANOMATERIAL SAFETY EVALUATION<br />
Friederike Ehrhart 1 , Linda Rieswijk 1 , Chris T. Evelo 1 , Haralambos Sarimveis 2 , Philip Doganis 2 , Georgios Drakakis 2 ,<br />
Bengt Fadeel 3 , Barry Hardy 4 , Janna Hastings 5 , Christoph Helma 6 , Nina Jeliazkova 7 , Vedrin Jeliazkov 7 , Pekka Kohonen 8,9 ,<br />
Roland Grafström 9 , Pantelis Sopasakis 10 , Georgia Tsiliki 2 & Egon Willighagen 1 .<br />
Department of Bioinformatics - BiGCaT, Maastricht University 1 ; National Technical University of Athens 2 ; Karolinska<br />
Institutet 3 ; Douglas Connect 4 ; European Molecular Biology Laboratory – European Bioinformatics Institute 5 ; In silico<br />
toxicology 6 ; Ideaconsult Ltd. 7 ; VTT Technical Research Centre of Finland 8 ; Misvik Biology 9 ; IMT Institute for Advanced<br />
Studies 10 . *friederike.ehrhart@maastrichtuniversity.nl<br />
eNanoMapper is an open computational infrastructure for engineered nanomaterial data: it comprises a semantic web<br />
supported database, ontology, and user applications for up- and download of experimental data, and tools for modelling.<br />
INTRODUCTION<br />
Nanomaterials are defined by size: between 1 nm and 100<br />
nm in at least one dimension. The properties of these<br />
materials do not always resemble those of the bulk<br />
material, i.e. micro-sized and larger particles, or solutions.<br />
Nanomaterials can differ in reactivity and in toxicity to<br />
biological organisms and ecosystems depending on their<br />
size and surface properties and the possibility of<br />
“leakage” of the material they are made of. That is why it is<br />
so difficult to assess the safety of nanomaterials and why<br />
the NanoSafety Cluster defined a need for a new<br />
computational infrastructure in 2012. eNanoMapper is a<br />
European project with partners from eight European<br />
countries. This project has been developing a<br />
computational infrastructure consisting of a semantic web<br />
assisted database, a modular ontology, and tools to use<br />
them for nanomaterial safety assessment. Data sharing,<br />
data storage, data analysis tools, and web services are<br />
currently being developed, tested,<br />
and put into production use. The project website can be<br />
found at www.enanomapper.net.<br />
PROBLEM<br />
The eNanoMapper platform is designed to support hosting<br />
of data on nanomaterial properties relevant for nanosafety<br />
assessment as found in existing databases like the<br />
NanoMaterial Registry, DaNa Knowledge Base,<br />
Nanoparticle Information Library NIL, Nanomaterial-<br />
Biological Interactions Knowledgebase, caNanoLab,<br />
InterNano, Nano-EHS Database Analysis Tool, nanoHUB,<br />
etc. Each of them has different data formats and<br />
descriptors, like CODATA-VAMAS’ Universal<br />
Description System, ISO-Tab(-Nano), OECD templates,<br />
custom spreadsheets, and images. Interoperability is a<br />
main aim: semi-automatic import or upload of<br />
information and its integration into the eNanoMapper data<br />
structure are being enabled. Conversely, retrieval or<br />
download of experimental data from the database for<br />
(re-)analysis should be provided too, using programmable<br />
interfaces to the data and the ontology. Database and<br />
search functionality should be semantic web compatible:<br />
the project develops and maintains a nanosafety ontology<br />
to support this. This eNanoMapper ontology was<br />
developed using the Web Ontology Language and the<br />
challenge is to map nanomaterial terms to their multiple<br />
ontology terms, namely physico-chemical properties,<br />
biological and ecological impact, experimental assay<br />
description, and known safety aspects.<br />
RESULTS & DISCUSSION<br />
The current eNanoMapper demo database instance,<br />
available at https://data.enanomapper.net/, contains the<br />
physico-chemical, biological and environmental properties<br />
of 465 different nanomaterials [1]. Data can be loaded<br />
into the database in various formats, including<br />
the OECD Harmonized Templates and the data structure<br />
used by the NanoWiki [2]. A web interface is designed to<br />
support all interactions with the database you may want to<br />
perform, including uploading of experimental data, as well<br />
as querying data to support analysis and modelling of<br />
nanoparticle properties. The eNanoMapper ontology is available under<br />
http://purl.enanomapper.net/onto/enanomapper.owl and is<br />
based on a multi-faceted description of nanoparticles<br />
concerning nanoparticle types, physico-chemical<br />
description, life cycle, biological and environmental<br />
characterisation including experimental methods and<br />
protocols, and safety information [3]. The terms are verified<br />
against the definitions of REACH, ISO, or common<br />
practices used in science in general. The often confused<br />
different meanings of endpoints and assays were<br />
discriminated in the definitions, e.g. size and size<br />
measurement assay. Existing ontologies, e.g. NPO, ChEBI<br />
and GO, could partly be used as a basis, but many<br />
terms had to be added manually. Currently, there are 4592<br />
classes defined. Users can access and download the<br />
ontology from the U.S. National Center for Biomedical<br />
Ontology BioPortal platform,<br />
http://bioportal.bioontology.org/ontologies/ENM.<br />
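Once an OWL file has been downloaded, the named classes it declares can be listed with the Python standard library alone. The sketch below is a rough illustration under two assumptions: the file is serialized as RDF/XML, and only classes carrying an rdf:about IRI are of interest. The inline snippet and its example.org IRIs are invented and are not part of the eNanoMapper ontology.<br />

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def named_classes(rdf_xml):
    """Return the sorted IRIs of all owl:Class elements with rdf:about."""
    root = ET.fromstring(rdf_xml.strip())
    iris = set()
    for element in root.iter(f"{{{OWL}}}Class"):
        iri = element.get(f"{{{RDF}}}about")
        if iri:                      # skip anonymous class expressions
            iris.add(iri)
    return sorted(iris)


# Invented three-class ontology used only to exercise the function.
sample = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/onto#Nanoparticle"/>
  <owl:Class rdf:about="http://example.org/onto#Size"/>
  <owl:Class rdf:about="http://example.org/onto#SizeMeasurementAssay"/>
</rdf:RDF>
"""
print(len(named_classes(sample)))
```

Applied to the full ontology file, the length of the returned list gives a class count that can be compared with the figure quoted above, although imports and anonymous classes may make the numbers differ.<br />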
REFERENCES<br />
[1] Jeliazkova, N. et al. The eNanoMapper database for<br />
nanomaterial safety information. Beilstein Journal of<br />
Nanotechnology 6, 1609-1634, doi:10.3762/bjnano.6.165<br />
(2015).<br />
[2] Willighagen, E.; doi:10.6084/m9.figshare.1330208<br />
[3] Hastings, J. et al. eNanoMapper: harnessing ontologies to<br />
enable data integration for nanomaterial risk assessment. J<br />
Biomed Semantics 6, 10, doi:10.1186/s13326-015-0005-5<br />
(2015).<br />
P16. BIOMEDICAL TEXT MINING FOR DISEASE-GENE DISCOVERY:<br />
SOMETIMES LESS IS MORE<br />
Sarah ElShal 1,2* , Jesse Davis 3 & Yves Moreau 1,2 .<br />
Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data<br />
Analytics Department, KU Leuven 1 ; iMinds Future Health Department, KU Leuven 2 ; Department of Computer Science,<br />
KU Leuven 3 . * sarah.elshal@esat.kuleuven.be<br />
Biomedical text is increasingly being made available online in either abstract or full article formats. This goes hand in hand<br />
with the desire to extract information from such text (e.g. finding links between diseases and genes).<br />
Consequently text mining is very popular in the biomedical domain given that it provides the possibility to automatically<br />
analyze these texts in order to extract knowledge. One of the big challenges in text mining is recognizing named entities<br />
(e.g. disease and gene entities) inside a given text, which is widely known as Named Entity Recognition (NER). We<br />
studied two biomedical taggers that apply different NER methods on MEDLINE abstracts. Here, we compare the<br />
contribution of each of the two taggers in associating genes with diseases. We show that with fewer recognized entities<br />
we gain more knowledge and we better associate genes with diseases.<br />
INTRODUCTION<br />
MEDLINE currently has more than 25 million biomedical<br />
citations from different journals all over the world. With<br />
this vast amount of text available, it is increasingly<br />
important to mine such data and find the best ways to<br />
extract relevant knowledge out of it. One example of such<br />
knowledge is links between diseases and genes. However<br />
it is very challenging and time consuming to recognize<br />
biomedical entities inside a given text with the evolving<br />
number of dictionaries and tagging strategies. Different<br />
taggers exist that map MEDLINE abstracts to biomedical<br />
entities. Such tagged entities can be used to generate<br />
disease and gene profiles and by applying certain<br />
similarity measures, we can extract knowledge and<br />
generate disease-gene hypotheses.<br />
METHODS<br />
We compare two MEDLINE taggers that map the whole<br />
set of MEDLINE abstracts to biomedical entities (e.g.<br />
genes, diseases, GO and MeSH terms …). The first one is<br />
MetaMap (Aronson et al., 2010), and the second one has<br />
been used as a text mining pipeline in many resources,<br />
most recently in DISEASES (Pletscher-Frankild et al., 2015). For<br />
the sake of simplicity, we will refer to the second tagger as<br />
M_tagger throughout the rest of the abstract. For each<br />
MEDLINE abstract we could obtain two sets of mapped<br />
entities: (1) the metamap set, and (2) the m_tagger set. The<br />
metamap set (given all the abstracts) corresponds to<br />
78,298 distinct entities vs. 29,536 for M_tagger.<br />
In order to compare the contribution of each tagger to the<br />
disease-gene association process, we proceeded as follows.<br />
First, we generated a validation set from the OMIM<br />
database to acquire a list of experimentally-validated<br />
disease-gene pairs. Second, we generated an entity profile<br />
for every gene in our database and for every disease in our<br />
validation set. Each profile holds the TF-IDF<br />
score of every entity it contains, calculated<br />
over the set of abstracts found to be linked with the<br />
disease or gene. Then for every disease, we computed the<br />
cosine similarity between its profile and all the gene<br />
profiles. Hence we could have a similarity score for each<br />
disease and gene pair, which we used to rank the genes for<br />
a given disease. We computed the average recall at the top<br />
10, 25, 50, and 100 ranked genes. We ran this analysis<br />
once according to the metamap set and once according to<br />
the m_tagger set. We also tried another association<br />
measure where we filtered the profiles such that they only<br />
contain gene entities. Then we ranked the genes according<br />
to their TF-IDF scores in a given disease profile. This<br />
corresponds to 9,290 gene entities in the metamap set, and<br />
10,003 entities in the m_tagger set. Again we measured<br />
the average recall at the different rank thresholds, and we<br />
repeated the analysis using the metamap and m_tagger<br />
profiles.<br />
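The profile-based ranking described above can be sketched in a few functions. The abstract does not state which TF-IDF variant was used, so the raw-count term frequency and logarithmic inverse document frequency below are assumptions, as are the function names and the four toy abstracts with their tagged entities.<br />

```python
import math
from collections import Counter


def inverse_document_frequency(abstracts):
    """abstracts: list of entity lists, one per abstract."""
    n = len(abstracts)
    df = Counter(entity for entities in abstracts for entity in set(entities))
    return {entity: math.log(n / count) for entity, count in df.items()}


def tfidf_profile(linked_abstracts, idf):
    """Profile of a gene or disease from the abstracts linked to it."""
    tf = Counter(entity for entities in linked_abstracts for entity in entities)
    return {entity: count * idf.get(entity, 0.0) for entity, count in tf.items()}


def cosine(p, q):
    dot = sum(weight * q.get(entity, 0.0) for entity, weight in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * \
        math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


def rank_genes(disease_profile, gene_profiles):
    """Genes sorted by decreasing cosine similarity to the disease profile."""
    return sorted(gene_profiles,
                  key=lambda g: cosine(disease_profile, gene_profiles[g]),
                  reverse=True)


def recall_at_k(ranked, true_genes, k):
    """Fraction of the validated genes found among the top k ranked genes."""
    return len(set(ranked[:k]) & set(true_genes)) / len(true_genes)


# Toy corpus: each abstract is the list of entities a tagger assigned to it.
a1 = ["BRCA1", "breast cancer", "DNA repair"]
a2 = ["TP53", "apoptosis"]
a3 = ["breast cancer", "DNA repair"]
a4 = ["CFTR", "chloride channel"]
idf = inverse_document_frequency([a1, a2, a3, a4])
gene_profiles = {"BRCA1": tfidf_profile([a1], idf),
                 "TP53": tfidf_profile([a2], idf),
                 "CFTR": tfidf_profile([a4], idf)}
disease_profile = tfidf_profile([a1, a3], idf)
ranked = rank_genes(disease_profile, gene_profiles)
print(ranked[0], recall_at_k(ranked, {"BRCA1"}, k=1))
```

Averaging recall_at_k over all diseases in a validation set, for each cut-off, gives the kind of curve compared between taggers.<br />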
RESULTS & DISCUSSION<br />
Figure 1 presents the recall results on the OMIM<br />
validation set. We observe that MetaMap and M_tagger<br />
result in comparable recall when ranking the genes<br />
according to their cosine similarity with the disease<br />
profiles. We also observe that M_tagger results in the best<br />
recall when simply ranking the genes according to their<br />
TF-IDF scores inside the disease profile.<br />
FIGURE 1. Recall results on the OMIM validation set: comparing the<br />
contribution of MetaMap and M_tagger, once with cosine similarity and<br />
once with TF-IDF ranks.<br />
Even though the m_tagger set contains fewer<br />
entities than the metamap one, we could gain the same<br />
knowledge to associate genes with diseases. Moreover,<br />
when we further reduced this set of entities to only genes,<br />
we gained even more knowledge and better associated<br />
genes with diseases.<br />
REFERENCES<br />
Aronson A.R. et al. An overview of MetaMap: historical<br />
perspective and recent advances. J. Am. Med. Inform. Assoc. 17, 229-236 (2010).<br />
Pletscher-Frankild S. et al. DISEASES: text mining and data integration of disease-gene<br />
associations. Methods. 74, 83-89 (2015).<br />
P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS<br />
Bertrand Escaliere 1,2 , Nicolas Simonis 1,3 , Gianluca Bontempi 1,2 & Guillaume Smits 1,4 .<br />
Interuniversity Institute of Bioinformatics in Brussels 1 ; Machine Learning Group, Université Libre de Bruxelles 2 ; Institut<br />
de Pathologie et de Génétique 3 ; Hopital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles 4 .<br />
Optimizing NGS analysis software and pipelines is crucial to improve the discovery of (new) disease-causing<br />
variants. A better combination of existing tools and the right choice of parameters can lead to more specific and<br />
sensitive calling. Simulated datasets allow the step-by-step development of new alignment or calling software. A<br />
simulator able to insert known human variants at a realistic minor allele frequency, and artificial variants in a tunable, controlled<br />
way, would overcome three optimization limits: it gives complete knowledge of the input dataset, so that the<br />
exact calling sensitivity and accuracy can be determined; it allows optimization on the appropriate population; and it makes it possible to dynamically test a<br />
pipeline one variable at a time.<br />
INTRODUCTION<br />
Identifying the anomalies that cause genetic disorders is<br />
difficult. It can be limited by the rarity of the disorder<br />
concerned, by its genetic heterogeneity, or by the<br />
phenotypic pleiotropy associated with anomalies in a<br />
single gene. Exome and genome sequencing have allowed the<br />
identification of the causes of many genetic diseases whose<br />
origin had remained inaccessible to the usual<br />
techniques of genetic research (Ng et al., 2009;<br />
Gilissen et al., 2012; Yang et al., 2013; Gilissen et al.,<br />
2014). Exome and genome sequencing data analysis<br />
pipelines consist of several steps (roughly:<br />
alignment, quality filtering, variant calling), and several<br />
software tools are available for each step. Evaluating and<br />
comparing those tools is crucial to improve<br />
pipeline accuracy. Exome and genome sequencing<br />
simulations should make it possible to determine the veracity of<br />
called variants (false positives and false negatives).<br />
METHODS<br />
We implemented TuneSIM, a wrapper around the NGS<br />
read simulator dwgsim (http://sourceforge.net/projects/dnaa/)<br />
that inserts realistic mutations. Generated reads contain<br />
real mutations from the 1KG project and dbSNP138. Read<br />
generation itself is delegated to dwgsim. To<br />
generate data that are as realistic as possible, we decided to keep<br />
the haplotype block structure. We computed blocks from<br />
the VCF files of 1KG project phase 3 European individuals<br />
with Plink (Purcell et al., 2007). For each block, we<br />
obtained the frequency of each combination of variants, and<br />
we used these frequencies for block selection. We also<br />
insert variants independently, using their<br />
frequencies in dbSNP (Smigielski et al., 2000). Using 33<br />
in-house samples, we computed the global distributions of variant allele<br />
frequencies in coding and non-coding regions,<br />
and we select variants according to those frequencies.<br />
A similar operation has been performed for CNV insertion<br />
using 1KG data. We are developing a web interface<br />
that allows users to download existing generated datasets.<br />
After running their pipelines, they can upload their output<br />
and see the accuracy of their pipelines.<br />
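The two selection steps described above, frequency-weighted choice of a variant combination per haplotype block and independent insertion of single variants, could look roughly like the sketch below. It is an illustration under stated assumptions, not TuneSIM code: the data layout, the function names and the example frequencies are hypothetical, and each call draws one haplotype (a diploid sample would need two draws per block).<br />

```python
import random


def sample_block_haplotypes(blocks, rng):
    """Pick one variant combination per haplotype block.

    blocks: {block_id: {haplotype (tuple of alleles): population frequency}}
    Each combination is drawn with probability proportional to its frequency.
    """
    chosen = {}
    for block_id, freqs in blocks.items():
        haplotypes = list(freqs)
        weights = [freqs[h] for h in haplotypes]
        chosen[block_id] = rng.choices(haplotypes, weights=weights, k=1)[0]
    return chosen


def sample_independent_variants(variant_freqs, rng):
    """Insert each variant independently with its own allele frequency."""
    return [v for v, freq in variant_freqs.items() if rng.random() < freq]


# Hypothetical inputs; real frequencies would come from 1KG and dbSNP.
rng = random.Random(42)
blocks = {
    "block1": {("A", "T"): 0.7, ("G", "C"): 0.3},
    "block2": {("C",): 0.9, ("T",): 0.1},
}
variants = {"rs_example_1": 0.25, "rs_example_2": 0.01}
haplotype = sample_block_haplotypes(blocks, rng)
inserted = sample_independent_variants(variants, rng)
```

Passing an explicitly seeded `random.Random` keeps a simulated dataset reproducible, which matters when the same truth set is reused to compare pipelines.<br />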
RESULTS & DISCUSSION<br />
Simulations with different coverages and indel rates have<br />
been performed and analysed with different pipelines.<br />
Results will be presented.<br />
REFERENCES<br />
Gilissen, et al. (2012). Disease gene identification strategies for exome<br />
sequencing. Eur J Hum Genet, 20, 490–497.<br />
Gilissen, et al. (2014). Genome sequencing identifies major causes of<br />
severe intellectual disability. Nature, 511, 344–347.<br />
Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a<br />
mendelian disorder. Nature Genetics, 42, 30–35.<br />
Purcell, et al. (2007). PLINK: a tool set for whole-genome association<br />
and population-based linkage analyses. American journal of human<br />
genetics, 81, 559–575.<br />
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbsnp:<br />
a database of single nucleotide polymorphisms. Nucleic Acids<br />
Research, 28, 352–355.<br />
Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis<br />
of Mendelian Disorders. N Engl J Med, 369, 1502–1511.<br />
P18. RNA-SEQ REVEALS ALTERNATIVE SPLICING WITH<br />
ALTERNATIVE FUNCTIONALITY IN MUSHROOMS<br />
Thies Gehrmann 1 , Jordi F. Pelkmans 2 , Han Wösten 2 , Marcel J.T. Reinders 1 & Thomas Abeel 1* .<br />
Delft Bioinformatics Lab, Delft Technical University 1 ; Fungal Microbiology, Science Faculty, Utrecht University 2 ;<br />
* T.Abeel@tudelft.nl<br />
Alternative splicing is well studied in mammalian genomes, and alternative transcripts are often associated with disease<br />
and their role in regulation is gradually being unveiled. In fungi, the study of alternative splicing has only scratched the<br />
surface. Using RNA-Seq data, we predict alternative transcripts based on existing gene predictions in two mushroom<br />
forming fungi. We study the alternative functionality of genes through functional domains, developmental stages, tissue<br />
and time. This analysis reveals the amount of alternative functionality induced by alternative splicing which was<br />
previously unknown in fungi, and asserts the need for further research.<br />
INTRODUCTION<br />
Transcript reconstruction algorithms rely on the sparsity<br />
(intergenic regions) of the genome in order to distinguish<br />
between genes. In fungi, due to the density of the genome,<br />
transcripts overlap in the up- and downstream untranslated<br />
regions (UTRs), which prevents the use of existing tools for<br />
transcript prediction (Roberts et al. 2011). Previous<br />
studies (Xie et al. 2015, Zhao et al. 2013) were limited<br />
to the study of splice junctions and lacked more advanced functional<br />
analyses. We transform the genomes of S. commune and A.<br />
bisporus in order to enable the prediction of alternative<br />
transcripts by applying existing transcript reconstruction<br />
algorithms to RNA-Seq data from different tissue types<br />
and developmental stages. We present a functional<br />
analysis of the resulting transcripts.<br />
METHODS<br />
We apply a transformation on our fungal genomes in order<br />
to reduce the impact of overlapping UTRs which prevent<br />
the prediction of alternative transcripts. We split the<br />
genome into chunks, with each chunk being defined by<br />
existing gene annotations. Thus, the transformation<br />
essentially removes intergenic regions (which contain the<br />
UTRs). Each chunk is then analyzed separately by<br />
Cufflinks (Roberts et al. 2011). Predicted transcripts are<br />
filtered based on read information and ORF sanity. Protein<br />
domain annotations are predicted for each transcript using<br />
InterPro (Zdobnov & Apweiler 2001).<br />
For each gene with multiple alternative transcripts, we<br />
construct a consensus sequence which allows us to call<br />
specific splicing events without the influence of erroneous<br />
reference annotations.<br />
RESULTS & DISCUSSION<br />
For both fungi, we find that alternative splicing is<br />
prevalent and many genes have multiple alternative<br />
transcripts (see Table 1).<br />
              # Orig. genes   # Filt. genes   # Transcripts<br />
S. commune    16,319          14,615          20,077<br />
A. bisporus   10,438          9,612           14,320<br />
TABLE 1. The number of originally annotated genes in S. commune and<br />
A. bisporus decreases after filtering based on RNA-Seq data.<br />
The number of new transcripts predicted indicates that<br />
alternative splicing is not a rare event in these fungi.<br />
The frequencies of specific events in the two fungi are<br />
similar and match what is seen in humans (Sammeth<br />
et al. 2008). However, there are significant differences in<br />
event usage. While most transcripts in S. commune<br />
have only one event associated with them, most transcripts in<br />
A. bisporus have at least two events. We show that this is a<br />
result of co-operative events.<br />
As our dataset consists of multiple developmental time<br />
points and tissue types, we are able to observe the<br />
alternative use of transcripts through time. If a gene swaps<br />
transcript usage at a certain time point, this is indicative of<br />
a functional involvement of that particular transcript (Lees<br />
et al. 2015). We find multiple transcripts in both S.<br />
commune and A. bisporus which are activated in specific<br />
developmental stages of the mushroom. Furthermore, in A.<br />
bisporus, we are able to identify transcripts which are<br />
activated specifically for certain tissue types through<br />
development.<br />
Using protein domain predictions for each transcript in a<br />
gene, we can measure how gene functionality changes<br />
across its transcripts. Figure 1 shows that functional<br />
annotations are not always preserved across all transcripts,<br />
indicating alternative functionality.<br />
FIGURE 1. Many genes in S. commune demonstrate alternative<br />
functionality through alternative splicing<br />
This is the first genome-wide functional analysis of<br />
alternative splicing in fungi from RNA-Seq data. We find<br />
a wealth of alternative splicing events in two fungi,<br />
resulting in many newly discovered transcripts. Although<br />
their functional influence is not yet demonstrated, we<br />
present evidence to suggest that they are relevant to<br />
mushroom development.<br />
REFERENCES<br />
Lees, J. G., et al. BMC Genomics, 16:1 (2015).<br />
Roberts, A., et al. Bioinformatics 27:17, 2325–2329 (2011).<br />
Sammeth, M., et al. PLoS Computational Biology, 4:8 (2008).<br />
Xie, B.-B., et al. BMC Genomics, 16:54 (2015).<br />
Zdobnov, E. M., & Apweiler, R. Bioinformatics 17:9 (2001).<br />
Zhao, C., et al. BMC Genomics, 14:21 (2013).<br />
P19. MSQROB: AN R/BIOCONDUCTOR PACKAGE FOR ROBUST RELATIVE<br />
QUANTIFICATION IN LABEL-FREE MASS SPECTROMETRY-BASED<br />
QUANTITATIVE PROTEOMICS<br />
Ludger Goeminne 1,2,3* , Kris Gevaert 2,3 & Lieven Clement 1 .<br />
Department of Applied Mathematics, Computer Science and Statistics, Ghent University 1 ; VIB Medical Biotechnology<br />
Center 2 ; Department of Biochemistry, Ghent University 3 . * ludger.goeminne@UGent.be<br />
MSqRob is an R/Bioconductor package that uses robust ridge regression on peptide-level data for robust relative<br />
quantification of proteins in label-free data-dependent acquisition (DDA) mass spectrometry (MS)-based proteomic<br />
experiments. It has been shown that statistical methods inferring at the peptide-level outperform workflows that<br />
summarize peptide intensities prior to inference. MSqRob improves upon existing peptide-level methods by three<br />
modular extensions: (1) ridge regression, (2) empirical Bayes variance estimation and (3) M-estimation with Huber<br />
weights. The extensions make MSqRob less sensitive towards outliers and missing peptides, enabling more proteins to be<br />
processed. Our software provides streamlined data analysis pipelines for experiments with simple layouts as well as for<br />
more complex multi-factorial designs. Using a spike-in dataset, we illustrate that MSqRob grants more stable protein fold<br />
change estimates and improves the differential abundance (DA) ranking.<br />
INTRODUCTION<br />
In a typical label-free DDA LC-MS/MS-based proteomic<br />
workflow, proteins are digested to peptides, separated by<br />
RP-HPLC and analyzed by a mass spectrometer. However,<br />
several issues inherent to the protocol make data analysis<br />
non-trivial. Most of the common data analysis procedures<br />
use summarization-based workflows. We have previously<br />
shown that inference at the peptide level outperforms these<br />
summarization-based approaches (Goeminne et al., 2015).<br />
However, even these pipelines are sensitive to outliers and<br />
suffer from overfitting. Here, we present MSqRob, an<br />
R/Bioconductor package that starts from peptide-level data<br />
and provides robust inference on DA at the protein level.<br />
METHODS<br />
Dataset. To demonstrate the performance of our package,<br />
we use the CPTAC dataset, in which 48 known human<br />
proteins were spiked-in at different concentrations in a<br />
yeast proteome background. Ideally, when comparing<br />
different spike-in conditions, only the human proteins<br />
should be flagged as differentially abundant.<br />
Competing analytical methods. We compare against MaxLFQ+Perseus,<br />
which summarizes peptide data and then applies pairwise<br />
t-tests.<br />
LM model. Generally, peptide-based models are<br />
constructed as follows:<br />
$$y_{ijklmn} = \mathrm{treat}_{ij} + \mathrm{pep}_{ik} + \mathrm{biorep}_{il} + \mathrm{techrep}_{im} + \varepsilon_{ijklmn}$$<br />
with $y_{ijklmn}$ the $n$th $\log_2$-transformed normalized feature<br />
intensity for the $i$th protein under the $j$th treatment $\mathrm{treat}_{ij}$,<br />
the $k$th peptide sequence $\mathrm{pep}_{ik}$, the $l$th biological repeat<br />
$\mathrm{biorep}_{il}$ and the $m$th technical repeat $\mathrm{techrep}_{im}$, and<br />
$\varepsilon_{ijklmn}$ a normally distributed error term with mean zero<br />
and variance $\sigma_i^2$.<br />
MSqRob. MSqRob adds the following improvements to<br />
the LM model:<br />
1. Ridge regression: shrink parameter estimates<br />
towards 0 by adding a ridge penalty term to the<br />
loss function.<br />
2. Stabilize variance estimation by borrowing<br />
information across proteins with empirical<br />
Bayes (EB): shrink individual variances towards<br />
the pooled variance.<br />
3. M estimation with Huber weights: weigh down<br />
observations with large errors.<br />
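To make improvements 1 and 3 concrete, the sketch below combines a ridge penalty with Huber weights through iteratively reweighted least squares. It is a toy illustration in plain Python and not the MSqRob implementation: the function names, the tuning constant k = 1.345, the scale estimate (median absolute residual / 0.6745) and the choice to penalize every coefficient, including the intercept, are assumptions made for brevity. The empirical Bayes variance step is not shown.<br />

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def huber_weights(residuals, k=1.345):
    """Weight 1 for small residuals, k*scale/|r| for large ones."""
    abs_res = sorted(abs(r) for r in residuals)
    mid = len(abs_res) // 2
    if len(abs_res) % 2:
        median_abs = abs_res[mid]
    else:
        median_abs = 0.5 * (abs_res[mid - 1] + abs_res[mid])
    scale = median_abs / 0.6745 or 1e-8  # guard against a zero scale
    return [1.0 if abs(r) <= k * scale else k * scale / abs(r)
            for r in residuals]


def robust_ridge(X, y, lam, n_iter=20):
    """Iteratively solve (X'WX + lam*I) beta = X'Wy with Huber weights W."""
    n, p = len(X), len(X[0])
    w = [1.0] * n
    beta = [0.0] * p
    for _ in range(n_iter):
        lhs = [[sum(w[i] * X[i][a] * X[i][b] for i in range(n))
                + (lam if a == b else 0.0) for b in range(p)]
               for a in range(p)]
        rhs = [sum(w[i] * X[i][a] * y[i] for i in range(n)) for a in range(p)]
        beta = solve(lhs, rhs)
        residuals = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                     for i in range(n)]
        w = huber_weights(residuals)
    return beta, w


# Hypothetical log2 intensities: 4 control and 4 treated peptides, one outlier.
X = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [10.0, 10.1, 9.9, 10.0, 11.0, 11.1, 10.9, 15.0]
beta, weights = robust_ridge(X, y, lam=0.01)
```

In this toy example the outlying intensity receives a small weight, so the estimated treatment effect stays near the value suggested by the other three treated peptides, whereas an unweighted fit would be pulled towards the outlier. A larger `lam` shrinks the coefficients further towards zero.<br />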
RESULTS & DISCUSSION<br />
MSqRob uses MaxQuant or Mascot peptide-level data as<br />
input. It performs preprocessing and robust model fitting, and<br />
returns log2 fold change estimates and FDR-corrected<br />
p-values for all model parameters and/or (user-specified)<br />
contrasts. Advanced users have the flexibility to (a) adopt<br />
their own preprocessing pipeline (e.g. transformation,<br />
normalization, dropping contaminants) and (b) specify the<br />
appropriate model structure. Compared to competing<br />
methods, MSqRob returns more stable log2 fold change<br />
estimates, improves DA ranking (Figure 1) and is able to<br />
discern between consistently strong DA and an accidental<br />
hit caused by outliers or a small variance due to random<br />
chance in low-abundant proteins.<br />
FIGURE 1. Receiver operating characteristic (ROC) curves showing the<br />
superior performance of MSqRob compared to a simple linear model<br />
(LM) and a summarization-based approach (MaxLFQ+Perseus) when<br />
comparing the lowest spike-in concentration 6A with the second lowest<br />
spike-in concentration 6B. Stars denote the methods’ cut off at an<br />
estimated 5 % FDR.<br />
REFERENCES<br />
Goeminne LJE et al. Journal of Proteome Research 14, 2457-2465<br />
(2015).<br />
P20. A MIXTURE MODEL FOR THE OMICS BASED IDENTIFICATION OF<br />
MONOALLELICALLY EXPRESSED LOCI AND THEIR DEREGULATION IN<br />
CANCER<br />
Tine Goovaerts 1 , Sandra Steyaert 1 , Jeroen Galle 1 , Wim Van Criekinge 1 & Tim De Meyer 1* .<br />
BIOBIX lab of Bioinformatics and Computational Genomics, Department of Mathematical Modelling,<br />
Statistics and Bioinformatics, Ghent University 1 . * tim.demeyer@ugent.be<br />
Imprinting is a phenomenon characterized by parent-specific monoallelic gene expression. Its deregulation has been<br />
associated with non-Mendelian inherited genetic diseases but is also a common feature of cancer. As imprinting does not<br />
alter the genome yet is mitotically inherited, epigenetics is deemed to be a key regulator. Current knowledge in the field<br />
is particularly hampered by a lack of accurate computational techniques suitable for omics data. Here we introduce a<br />
mixture model for the identification of monoallelically expressed loci based on large-scale omics data that can also be<br />
exploited to identify samples and loci characterized by loss of imprinting / monoallelic expression.<br />
INTRODUCTION<br />
The genome-wide identification of mono-allelically<br />
expressed or epigenetically modified loci typically<br />
requires the presence of SNPs to discriminate both alleles.<br />
Current methods predominantly rely on genotyping for the<br />
identification of heterozygous loci in a limited sample set,<br />
followed by testing whether the expression/epigenetic<br />
modification levels for both alleles deviate from a 1:1 ratio<br />
for those loci (Wang et al., 2014). This approach is limited<br />
by the genotyping step and the required presence of<br />
heterozygous individuals. As large scale omics data is<br />
becoming increasingly available, an alternative strategy<br />
may be to screen larger numbers (e.g. hundreds) of<br />
samples, ensuring the presence of heterozygous<br />
individuals at predictable rates, thereby also avoiding the<br />
need for and limitations of a prior genotyping step.<br />
Based on this concept, a previous strategy (Steyaert et al.,<br />
2014) enabled us to identify and validate approximately 80<br />
loci featured by monoallelic DNA methylation, but had<br />
several drawbacks, such as computational inefficiency,<br />
heavy reliance on Hardy-Weinberg equilibrium (HWE),<br />
need for 100% imprinting and low power, which limited<br />
its practical use. Here we present a novel mixture model<br />
for the identification of monoallelically modified or<br />
expressed loci from large-scale omics data (without<br />
known genotypes) that largely circumvents previous<br />
drawbacks.<br />
METHODS<br />
The rationale of the methodology is that RNA-seq and<br />
ChIP-seq(-like) derived SNP data for monoallelic loci are<br />
characterized by a general lack of apparent heterozygosity.<br />
More specifically, under the null-hypothesis (no<br />
imprinting) the homozygous and heterozygous sample<br />
fractions can be modelled as a mixture of (beta-)binomial<br />
distributions, with weights according to HWE or<br />
empirically derived. For imprinted loci however, the<br />
heterozygous fraction is split and shifted towards the two<br />
homozygous fractions (Figure 1), which can be evaluated<br />
with a likelihood ratio test. The model does not require but<br />
can incorporate prior genotyping data, and allows for<br />
deviation from HWE, sequencing errors, efficiency<br />
differences and partial monoallelic events. Once loci<br />
characterized by monoallelic events have been identified in<br />
control data, a loss-of-imprinting index can be calculated<br />
for each non-normal sample based on the mixture model<br />
likelihoods, and loci generally characterized by loss of<br />
imprinting in the pathology under study can be identified.<br />
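The core idea, a binomial mixture with HWE weights in which the heterozygous component splits towards the homozygous ones under imprinting, can be sketched as below. This is a simplified illustration and not the authors' model: it uses plain binomials rather than beta-binomials, a fixed sequencing error rate, a grid search over a hypothetical imprinting degree i in [0, 1], and an allele frequency estimated from pooled read counts. The function names and example counts are invented for the sketch, and no p-value is derived because the null value i = 0 lies on the boundary of the parameter space.<br />

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def log_likelihood(samples, p, i, err=0.01):
    """Log-likelihood of (ref_reads, total_reads) pairs under the mixture.

    p: reference allele frequency, i: imprinting degree (0 = none, 1 = full).
    The heterozygous weight 2pq is split equally over two components whose
    expected reference fraction moves from 0.5 towards 1 - err and err.
    """
    q = 1.0 - p
    high = 0.5 + i * (0.5 - err)
    low = 1.0 - high
    total_ll = 0.0
    for ref, total in samples:
        lik = (p * p * binom_pmf(ref, total, 1.0 - err)
               + q * q * binom_pmf(ref, total, err)
               + p * q * binom_pmf(ref, total, high)
               + p * q * binom_pmf(ref, total, low))
        total_ll += math.log(lik)
    return total_ll


def imprinting_lrt(samples, grid=101):
    """Return (likelihood ratio statistic, best imprinting degree on the grid)."""
    p = sum(r for r, _ in samples) / sum(t for _, t in samples)
    ll_null = log_likelihood(samples, p, 0.0)
    best_i, best_ll = max(
        ((g / (grid - 1), log_likelihood(samples, p, g / (grid - 1)))
         for g in range(grid)),
        key=lambda pair: pair[1],
    )
    return 2.0 * (best_ll - ll_null), best_i


# Hypothetical read counts per sample: (reference reads, total reads).
imprinted = [(50, 50)] * 8 + [(0, 50)] * 8
biallelic = [(50, 50)] * 4 + [(25, 50)] * 8 + [(0, 50)] * 4
stat_imprinted, degree_imprinted = imprinting_lrt(imprinted)
stat_biallelic, degree_biallelic = imprinting_lrt(biallelic)
```

A locus with no apparent heterozygotes despite an intermediate allele frequency yields a large statistic and an estimated degree near 1, while a locus with the expected share of balanced heterozygotes yields a statistic near 0.<br />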
RESULTS & DISCUSSION<br />
We demonstrate the applicability of the novel mixture<br />
model with simulations and a proof of concept study using<br />
breast cancer and control RNA-seq data from The Cancer<br />
Genome Atlas (TCGA Research Network, 2008). Well<br />
known imprinted loci such as IGF2 (Figure 1) and H19<br />
were indeed identified. Ongoing efforts are directed<br />
towards artefact-free RNA/ChIP-seq data based allele<br />
frequency inference and the efficient implementation of a<br />
beta-binomial based mixture.<br />
FIGURE 1. Observed (red) and modelled (green) allele frequencies for a<br />
100% (right, no observable heterozygotes) and a partially imprinted<br />
(left) SNP of the IGF2 gene<br />
In conclusion, we introduce a novel mixture model for the<br />
identification of loci featured by monoallelic events which<br />
can subsequently be exploited to determine their<br />
deregulation in the pathology of interest.<br />
REFERENCES<br />
Steyaert S et al. Nucleic Acids Research 42, e157 (2014).<br />
TCGA Research Network. Nature 455, 1061-1068 (2008).<br />
Wang X & Clark AG. Heredity 113, 156-166 (2014).<br />
P21. GEVACT: GENOMIC VARIANT CLASSIFIER TOOL<br />
Isel Grau 1,4 , Dorien Daneels 2,3 , Sonia Van Dooren 2,3 , Maryse Bonduelle 2 ,<br />
Dewan Md. Farid 1,3 , Didier Croes 2,3 , Ann Nowé 1,3 & Dipankar Sengupta 1,3* .<br />
Como - Artificial Intelligence Lab, Vrije Universiteit Brussel 1 ; Centre for Medical Genetics, Reproduction and Genetics,<br />
Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel,UZ Brussel 2 ; Interuniversity Institute of<br />
Bioinformatics in Brussels, ULB-VUB 3 ; Department of Computer Sciences, Universidad Central de Las Villas 4 .<br />
* Dipankar.Sengupta@vub.ac.be<br />
High-throughput screening (HTS) techniques, such as genome or exome screening, are becoming the norm in conventional<br />
clinical analysis. However, classifying the identified variants as pathogenic, potentially pathogenic or non-pathogenic<br />
is still a manual, tedious and time-consuming process for clinicians and geneticists. To facilitate the<br />
variant classification process, we have developed GeVaCT, a Java-based tool built on an algorithm that draws on the<br />
existing literature and the knowledge of clinical geneticists. GeVaCT can classify variants annotated by Alamut Batch;<br />
support for input from other annotation software is planned.<br />
INTRODUCTION<br />
With the emergence of new screening techniques, targeted<br />
or whole exome and genome screening are becoming<br />
standard diagnostic practice in clinical settings to identify<br />
the variants underlying a genetic disease (Ng et al., 2010;<br />
Saunders et al., 2012). However, developing<br />
bioinformatics solutions for the pathogenicity classification of<br />
variants remains a big challenge, which<br />
makes the process cumbersome for geneticists and<br />
clinicians. In this work, we describe GeVaCT (Genomic<br />
Variant Classifier Tool), a tool for the classification of<br />
genomic single nucleotide and short insertion/deletion<br />
variants. The aim of this study was to design and<br />
implement a variant classification algorithm based on a<br />
literature review of cardiac arrhythmia syndromes<br />
(Hofman et al., 2013; Schulze-Bahr et al., 2000; Wilde &<br />
Tan, 2007) and the existing knowledge of clinical geneticists.<br />
METHODS<br />
The algorithm we propose for GeVaCT is based on a<br />
published variant classification schema for cardiac<br />
arrhythmia syndromes. This approach is based on the yield<br />
of DNA testing over a time span of 15 years (1996-2011),<br />
between probands with isolated/familial cases, and also<br />
between probands with or without clear disease-specific<br />
clinical characteristics (Hofman et al., 2013). It proposes<br />
two distinct approaches: one to classify missense variants<br />
and another to classify nonsense and frameshift variants.<br />
The algorithm is implemented in two phases: pre-processing<br />
and classification. In the pre-processing phase,<br />
the annotated tab-delimited variant file (vcf.ann) from<br />
Alamut Batch is restricted to the gene list for the<br />
disease of interest, so as to reduce the number of variants<br />
for the analysis. Filters are applied to look for variants that<br />
have already been reported in the Human Genome<br />
Mutation Database (Stenson et al., 2003) and in ClinVar<br />
(Landrum et al., 2014), or that have previously been<br />
detected and classified in an internal patient population.<br />
Lastly, the variants are filtered based on their location<br />
in the genome and their coding effect, followed by a<br />
check of the minor allele frequency of the variant in a control<br />
population (Sherry et al., 2001). Thereafter, in the<br />
classification phase, the filtered variants are classified as<br />
missense or nonsense and frameshift variants. For<br />
missense variants the classification is based on the<br />
parameters: amino acid substitution and its impact on<br />
protein function (Adzhubei et al., 2010; Kumar et al.,<br />
2009), biochemical variation (Mathe et al., 2006),<br />
conservation (Pollard et al., 2010), frequency of variant<br />
alleles in a control population (ExAC, 2015), effects on<br />
splicing (Desmet et al., 2009), family and phenotype<br />
information and functional analysis. For<br />
nonsense and frameshift variants, it is based on: effects on<br />
splicing, frequency of variant alleles in a control<br />
population, family and phenotype information and<br />
functional analysis. For each parameter, a score is assigned to<br />
the variant, and the scores are then summed.<br />
Finally, based on the cumulative score, each variant<br />
is classified into one of five categories: Class I - Non-Pathogenic;<br />
Class II - VUS1 (unlikely pathogenic); Class<br />
III - VUS2 (unclear); Class IV - VUS3 (likely<br />
pathogenic); Class V - Pathogenic (Sharon et al., 2008).<br />
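The final scoring step can be pictured as summing per-parameter scores and mapping the total onto the five classes. The sketch below is written in Python for brevity (the tool itself is Java-based) and is purely illustrative: the abstract does not give the per-parameter scores or the class cut-offs, so the parameter names, score values and cut-offs used here are hypothetical placeholders.<br />

```python
import bisect

CLASS_LABELS = [
    "Class I - Non-Pathogenic",
    "Class II - VUS1 (unlikely pathogenic)",
    "Class III - VUS2 (unclear)",
    "Class IV - VUS3 (likely pathogenic)",
    "Class V - Pathogenic",
]

# Hypothetical upper bounds of Classes I to IV; not taken from the abstract.
HYPOTHETICAL_CUTOFFS = (0, 3, 6, 9)


def classify_variant(parameter_scores, cutoffs=HYPOTHETICAL_CUTOFFS):
    """Sum the per-parameter scores and map the total to a class label.

    A total at or below cutoffs[0] is Class I, a total above cutoffs[-1]
    is Class V, and intermediate totals fall into Classes II to IV.
    """
    total = sum(parameter_scores.values())
    index = bisect.bisect_left(cutoffs, total)
    return total, CLASS_LABELS[index]


# Hypothetical scores for one missense variant.
example_scores = {
    "protein_function_impact": 2,
    "conservation": 1,
    "control_population_frequency": 1,
    "splicing_effect": 0,
}
total, label = classify_variant(example_scores)
```

Keeping the cut-offs in a single tuple makes it straightforward to swap in a different published schema without touching the classification logic.<br />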
RESULTS & DISCUSSION<br />
In this study, we report a Java-based tool called GeVaCT,<br />
developed for the classification of genomic variants. The input for<br />
the tool is an annotated VCF file, while the output gives<br />
the cumulative classification score along with the class<br />
label for each variant. The tool was tested on a dataset of 130<br />
cardiac arrhythmia syndrome patients, available at UZ<br />
Brussel. The variant classifications made by<br />
the tool were cross-validated by manual curation<br />
performed by a clinical geneticist. The<br />
study indicates that the tool is promising, but it needs to be<br />
further validated on datasets from other diseases. In<br />
addition, we are working on making the tool adaptable to<br />
file inputs from other annotation software.<br />
REFERENCES<br />
Adzhubei IA et al. Nat Methods 7(4), 248-249 (2010).<br />
Desmet et al. Nucleic Acids Res 37 (9): e67 (2009).<br />
Exome Aggregation Consortium (ExAC), Cambridge, MA (2015).<br />
Hofman N et al. Circulation 128(14),1513-21 (2013).<br />
Kumar P et al. Nat Protoc 4(7), 1073–1081 (2009).<br />
Landrum MJ et al. Nucleic Acids Res 42(1), D980-5 (2014).<br />
Mathe E et al. Nucleic Acids Res 34(5),1317-25 (2006).<br />
Ng SB et al. Nat Genetics 42, 30–35 (2010).<br />
Pollard K et al. Genome Res 20, 110-121 (2010).<br />
Saunders CJ et al. Sci Transl Med 4, 154ra135 (2012).<br />
Sharon EP et al. Hum Mutat. 29(11), 1282–1291 (2008).<br />
Sherry ST et al. Nucleic Acids Res 29(1),308-11 (2001).<br />
Schulze-Bahr E et al. Z Kardiol 89 Suppl 4:IV12-22 (2000).<br />
Stenson et al. Hum Mutat. 21:577-581 (2003).<br />
Wilde AA & Tan HL Circ J 71, Suppl A:A12-9 (2007).<br />
P22. MAPPI-DAT: MANAGEMENT AND ANALYSIS FOR HIGH<br />
THROUGHPUT INTERACTOMICS DATA FROM ARRAY-MAPPIT<br />
EXPERIMENTS<br />
Surya Gupta 1,2,3 , Jan Tavernier 1,2 & Lennart Martens 1,2,3 .<br />
Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ;<br />
Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 .<br />
INTRODUCTION<br />
Proteins are highly interesting objects of study, involved<br />
in different cellular and molecular functions. Identification<br />
and quantification of these proteins along with their<br />
interacting proteins, nucleic acids and molecules can<br />
provide insight into development and disease mechanisms<br />
at the systems level. Yet studying these interactions is not<br />
trivial. In vivo methods exist to determine these<br />
interactions, but these suffer from several drawbacks [4].<br />
To overcome existing problems, an innovative approach<br />
called MAPPIT (Mammalian Protein-Protein Interaction<br />
Trap) [2] has been established in the Cytokine Receptor<br />
Lab to determine interacting partners of proteins in<br />
mammalian cells. To allow screening of thousands of<br />
interactors simultaneously, MAPPIT has been parallelized<br />
in the array MAPPIT system [3].<br />
AIM<br />
However, no effective pipeline existed to process the high-throughput<br />
data generated by array MAPPIT. We<br />
therefore established an automated high-throughput data<br />
analysis system called MAPPI-DAT (MAPPIT Array<br />
Protein-Protein Interaction Database & Analysis Tool).<br />
METHODS<br />
In the array-MAPPIT platform, the interaction of two<br />
proteins (bait-prey) restores a mutated JAK-STAT<br />
signaling pathway, which leads to the expression of<br />
fluorescence-emitting genes. RankProd [1] is used to rank the positive<br />
interactions based on fluorescence intensity.<br />
This method was originally developed to<br />
determine differentially expressed genes in microarray<br />
experiments and is available as an R package. To minimize<br />
false positive hits in the RankProd output, quartile-based<br />
filtering was applied. MySQL was used to build<br />
the data management system for the array-MAPPIT<br />
system.<br />
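For readers unfamiliar with the rank product statistic [1], the sketch below shows its core computation: rank the preys within each replicate and take the geometric mean of the ranks. It is a Python illustration of the statistic only, not the RankProd R package and not the MAPPI-DAT code: the function name and the toy intensities are assumptions, ties are not handled, and the significance estimation and the quartile-based filtering mentioned above are not shown.<br />

```python
import math


def rank_product(intensities):
    """Geometric mean of per-replicate ranks (rank 1 = highest intensity).

    intensities: {prey: [intensity in replicate 1, replicate 2, ...]}
    A small rank product means the prey ranks consistently near the top.
    """
    preys = list(intensities)
    n_replicates = len(next(iter(intensities.values())))
    log_rank_sum = {prey: 0.0 for prey in preys}
    for rep in range(n_replicates):
        ordered = sorted(preys, key=lambda p: intensities[p][rep], reverse=True)
        for rank, prey in enumerate(ordered, start=1):
            log_rank_sum[prey] += math.log(rank)
    return {prey: math.exp(log_rank_sum[prey] / n_replicates) for prey in preys}


# Hypothetical fluorescence intensities for three preys in three replicates.
signals = {
    "preyA": [9.0, 8.0, 9.5],
    "preyB": [5.0, 9.0, 5.5],
    "preyC": [1.0, 1.5, 1.2],
}
rp = rank_product(signals)
candidates = sorted(rp, key=rp.get)
```

Working with the sum of log ranks rather than the raw product avoids overflow when thousands of preys are screened across many replicates.<br />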
RESULTS<br />
To extend and ease the usage of the analysis pipeline and<br />
database system, an interface called MAPPI-DAT has been developed.<br />
MAPPI-DAT is capable of processing<br />
many thousands of data points per experiment, and<br />
comprises a data storage system that stores the<br />
experimental data in a structured way for meta-analysis.<br />
REFERENCES<br />
[1] Breitling, R., Armengaud, P., Amtmann, A., & Herzyk, P. (2004).<br />
Rank products: A simple, yet powerful, new method to detect<br />
differentially regulated genes in replicated microarray experiments.<br />
FEBS Letters, 573(1-3), 83–92.<br />
[2] Lievens, S., Peelman, F., De Bosscher, K., Lemmens, I., &<br />
Tavernier, J. (2011). MAPPIT: A protein interaction toolbox built on<br />
insights in cytokine receptor signaling. Cytokine and Growth Factor<br />
Reviews, 22(5-6), 321–329.<br />
[3] Lievens, S., Vanderroost, N., Van der Heyden, J., Gesellchen, V.,<br />
Vidal, M., & Tavernier, J. (2009). Array MAPPIT:<br />
High-Throughput Interactome Analysis in Mammalian Cells,<br />
877–886.<br />
[4] S.Gopichandran and S.Ranganathan. (2013). Protein-protein<br />
Interactions and Prediction: A Comprehensive Overview. Protein and<br />
Peptide Letters, 779–789<br />
P23. HIGHLANDER: VARIANT FILTERING MADE EASIER<br />
Raphael Helaers 1* & Miikka Vikkula 1 .<br />
Human Molecular Genetics (GEHU), de Duve Institute, Université catholique de Louvain 1 .<br />
* Raphael.helaers@UCLouvain.be<br />
The field of human genetics is being revolutionized by exome and genome sequencing. A massive amount of data is<br />
being produced at ever-increasing rates. Targeted exome sequencing can be completed in a few days using NGS,<br />
allowing for new variant discovery in a matter of weeks. The technology generates considerable numbers of false<br />
positives, and the differentiation of sequencing errors from true mutations is not a straightforward task. Moreover, the<br />
identification of changes-of-interest from amongst tens of thousands of variants requires annotation drawn from various<br />
sources, as well as advanced filtering capabilities. We have developed Highlander, a Java application coupled to a MySQL<br />
database, in order to centralize all variant data and annotations from the lab, and to provide powerful filtering tools that<br />
are easily accessible to the biologist. Data can be generated by any NGS machine (such as Illumina’s HiSeq, or Life<br />
Technologies’ SOLiD or Ion Torrent) and most variant callers (such as Broad Institute’s GATK or Life Technologies’<br />
LifeScope). Variant calls are annotated using dbNSFP (providing predictions from 6 different programs, and MAF from<br />
1000G and ESP), GoNL and SnpEff, and subsequently imported into the database. The database is used to compute global<br />
statistics, allowing for the discrimination of variants based on their representation in the database. The Highlander GUI<br />
easily allows for complex queries to this database, using shortcuts for certain standard criteria, such as “sample-specific<br />
variants”, “variants common to specific samples” or “combined-heterozygous genes”. Users can browse through query<br />
results using sorting, masking and highlighting of information. Highlander also gives access to useful additional tools,<br />
including direct access to IGV, and an algorithm that checks all available alignments for allele-calls at specific positions.<br />
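Shortcut criteria such as “sample-specific variants” and “variants common to specific samples” amount to set operations over per-sample variant calls.<br />
The sketch below shows that idea in plain Python; it is not Highlander's implementation or database schema, and the sample names, variant keys and function names are hypothetical.<br />

```python
def sample_specific(calls, sample):
    """Variants seen in `sample` and in no other sample."""
    others = set().union(*(v for s, v in calls.items() if s != sample))
    return calls[sample] - others


def common_to(calls, samples):
    """Variants shared by every sample in `samples`."""
    return set.intersection(*(calls[s] for s in samples))


# Variants keyed as (chromosome, position, ref, alt); toy data only.
calls = {
    "patient1": {("1", 1000, "A", "G"), ("2", 500, "C", "T"), ("7", 42, "G", "A")},
    "patient2": {("1", 1000, "A", "G"), ("7", 42, "G", "A")},
    "control": {("7", 42, "G", "A")},
}
private_to_patient1 = sample_specific(calls, "patient1")
shared_by_patients = common_to(calls, ["patient1", "patient2"])
```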
P24. DOSE-TIME NETWORK IDENTIFICATION: A NEW METHOD FOR<br />
GENE REGULATORY NETWORK INFERENCE FROM GENE EXPRESSION<br />
DATA WITH MULTIPLE DOSES AND TIME POINTS<br />
Diana M Hendrickx 1* , Danyel G J Jennen 1 & Jos C S Kleinjans 1 .<br />
Department of Toxicogenomics, Maastricht University, The Netherlands 1 .<br />
*d.hendrickx@maastrichtuniversity.nl<br />
Toxicogenomics, the application of ‘omics’ technologies to toxicology, is a rapidly growing field due to the need for<br />
alternatives to animal experiments for toxicity testing of compounds. Identification of gene regulatory networks affected<br />
by compounds is important to gain more insight into the mode of action of a toxic compound. The response to a toxic<br />
compound is both time and dose dependent. Therefore, toxicogenomics data are often measured across several time<br />
points and doses. However, to our knowledge, there does not exist a method for gene regulatory network inference that<br />
takes into account both time and dose dependencies. Here we present Dose-Time Network Identification (DTNI), a novel<br />
gene regulatory network inference algorithm that takes into account both dose and time dependencies in the data. We<br />
show that DTNI can be used to infer gene regulatory networks affected by a group of compounds with the same mode of<br />
action. This is illustrated with gene expression (microarray) data from COX inhibitors, measured in human hepatocytes.<br />
INTRODUCTION<br />
Identifying and understanding gene regulatory networks<br />
(GRN) influenced by chemical compounds is one of the<br />
main challenges of systems toxicology. A GRN affected<br />
by one or more compounds evolves over time and with<br />
dose. The analysis of gene expression data measured at<br />
multiple time points and for multiple doses can provide<br />
more insight into the effects of compounds. Therefore, there<br />
is a need for mathematical approaches for GRN<br />
identification from this type of data.<br />
METHODS<br />
One of the mathematical approaches currently used for<br />
GRN inference is based on ordinary differential equations<br />
(ODE), where changes in gene expression over time are<br />
related to each other and to the external perturbation (i.e.<br />
the dose of the compound). Because gene expression data<br />
usually have fewer data points than variables (genes), ODE<br />
approaches are often combined with interpolation and/or<br />
dimension reduction techniques (e.g. PCA). A current method<br />
that combines ODE with both interpolation and dimension<br />
reduction techniques is Time Series Network<br />
Identification (TSNI) (Bansal et al., 2006).<br />
Here, we present Dose-Time Network Identification<br />
(DTNI), a method that extends TSNI by including ODE<br />
that describe changes in gene expression over dose in<br />
relation to each other and to time. We also adapted the<br />
original method so that it can include data from multiple<br />
perturbations (compounds).<br />
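As a rough illustration of the ODE view described above, the sketch below fits a linear model dx/dt = A x + B u to toy time series by least squares, where x holds gene expression levels and u is the dose.<br />
It is not the TSNI or DTNI implementation: interpolation, dimension reduction and the additional dose-direction equations are omitted, and all names and numbers are assumptions made for the example.<br />

```python
import numpy as np


def fit_linear_ode(X, U, dt):
    """Least-squares estimate of A, B in dx/dt = A x + B u.

    X: list of (genes x timepoints) arrays, one per experiment.
    U: list of scalar doses, one per experiment.
    """
    regressors, derivatives = [], []
    for x, u in zip(X, U):
        dxdt = (x[:, 1:] - x[:, :-1]) / dt               # forward differences
        z = np.vstack([x[:, :-1], np.full((1, x.shape[1] - 1), u)])
        regressors.append(z)
        derivatives.append(dxdt)
    Z = np.hstack(regressors)                            # (genes + 1) x samples
    D = np.hstack(derivatives)                           # genes x samples
    theta, *_ = np.linalg.lstsq(Z.T, D.T, rcond=None)    # solves Z.T @ theta = D.T
    n = D.shape[0]
    return theta[:n].T, theta[n:].T                      # A (n x n), B (n x 1)


def simulate(A, B, u, x0, dt, steps):
    """Euler simulation, used only to create toy data."""
    x = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x.append(x[-1] + dt * (A @ x[-1] + B[:, 0] * u))
    return np.array(x).T


# Two hypothetical genes, one compound measured at three doses.
A_true = np.array([[-1.0, 0.5], [0.8, -1.2]])
B_true = np.array([[1.0], [0.0]])
doses = [0.5, 1.0, 2.0]
X = [simulate(A_true, B_true, u, [0.1, 0.2], dt=0.1, steps=20) for u in doses]
A_hat, B_hat = fit_linear_ode(X, doses, dt=0.1)
```

Non-zero off-diagonal entries of the estimated A are read as regulatory edges between genes; with real data, where genes outnumber samples, this plain least-squares step would be underdetermined, which is the motivation for the dimension reduction mentioned in the text.<br />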
RESULTS & DISCUSSION<br />
By exploiting simulated data, we show that including<br />
ODE for expression changes over dose leads to improved<br />
GRN identification compared with including only ODE<br />
that describe changes over time. Furthermore, we show<br />
that DTNI performs better when including data from<br />
multiple perturbations (compounds) than when applying<br />
DTNI to data from a single perturbation. This suggests<br />
that the method is suitable to infer a GRN affected by<br />
compounds with the same mode of action. As an example,<br />
we infer the network affected by COX inhibitors from<br />
public microarray data of 6 COX inhibitors, measured in<br />
human hepatocytes, available from Open TG-GATEs<br />
(http://toxico.nibio.go.jp/english/index.html) (Noriyuki et<br />
al., 2012). The interactions in the inferred network were<br />
compared to interactions from ConsensusPathDB, a<br />
database including interactions from 32 different sources<br />
(Kamburov et al., 2013). The inferred network was<br />
validated by leave-one-out cross-validation (LOOCV). Six<br />
datasets were created from the original data by leaving out<br />
the data of one compound. The network constructed from<br />
the whole data set showed large overlap with the networks<br />
constructed from each of the LOOCV datasets. Edges in<br />
the network constructed from the whole data set, but not in<br />
the networks constructed from the LOOCV datasets were<br />
removed from the network. The remaining novel<br />
interactions, i.e. those that are not in ConsensusPathDB,<br />
have to be validated experimentally, e.g. by gene-knockdown<br />
experiments.<br />
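The LOOCV pruning and the comparison against a reference database can be sketched as simple set filtering.<br />
The sketch assumes the strict reading that an edge is kept only if it recurs in every leave-one-out network; the text does not say whether "all" or "any" was used, and the gene labels and function names are hypothetical.<br />

```python
def stable_edges(full_network, loocv_networks):
    """Keep edges of the full network that recur in every LOOCV network."""
    return {e for e in full_network if all(e in net for net in loocv_networks)}


def novel_edges(edges, known_interactions):
    """Edges without support in a reference database (order-insensitive)."""
    known = {frozenset(pair) for pair in known_interactions}
    return {e for e in edges if frozenset(e) not in known}


# Toy directed edges (regulator, target) between hypothetical genes.
full = {("G1", "G2"), ("G2", "G3"), ("G1", "G4")}
loocv = [
    {("G1", "G2"), ("G2", "G3")},
    {("G1", "G2"), ("G2", "G3"), ("G1", "G4")},
    {("G1", "G2"), ("G2", "G3")},
]
reference_db = [("G2", "G1")]            # stands in for a database lookup

kept = stable_edges(full, loocv)
to_validate = novel_edges(kept, reference_db)
```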
FIGURE 1. Workflow for identifying a gene regulatory network affected<br />
by a group of compounds with the same mode of action.<br />
REFERENCES<br />
Bansal M et al. Bioinformatics 22, 815-822 (2006).<br />
Noriyuki N et al. J Toxicol Sci 37,791-801 (2012).<br />
Kamburov A et al. Nucl Acids Res 41, D793-D800 (2013).<br />
P25. IDENTIFICATION OF NOVEL ALLOSTERIC DRUG TARGETS<br />
USING A “DUMMY” LIGAND APPROACH<br />
Susanne M.A. Hermans, Christopher Pfleger & Holger Gohlke * .<br />
Department of Mathematics and Natural Sciences, Institute for Pharmaceutical and Medicinal Chemistry, Heinrich-<br />
Heine-University, Düsseldorf, Germany. * gohlke@uni-duesseldorf.de<br />
Targeting allosteric sites is a promising strategy in drug discovery due to their regulatory role in almost all cellular<br />
processes. Currently, there is no standard method to identify novel pockets and to detect whether a pocket has a<br />
regulatory effect on the protein. Here, we present a new and efficient approach to probe information transfer through<br />
proteins in the context of dynamically dominated allostery that exploits “dummy” ligands as surrogates for allosteric<br />
modulators.<br />
INTRODUCTION<br />
Allosteric regulation is the coupling between separated<br />
sites in biomacromolecules such that an action at one site<br />
changes the function at a distant site. Allosteric drugs are<br />
attractive because they often have fewer side effects than orthosteric<br />
drugs, as allosteric sites are less conserved. The<br />
identification of novel allosteric pockets is complicated by<br />
the large variation in allosteric regulation, ranging from<br />
rigid body motions to disorder/order transitions, with<br />
dynamically dominated allostery in between (Motlagh et<br />
al., 2014). Here we focus on dynamically dominated<br />
allostery with minimal or no conformational changes.<br />
Because novel pockets do not have a known ligand, we<br />
generate “dummy” ligands to function as surrogates for<br />
allosteric ligands. We have developed an efficient<br />
approach to probe information transfer through proteins<br />
using “dummy” ligands and to detect whether allosteric coupling is<br />
present between the novel pocket and the orthosteric site.<br />
METHODS<br />
In a preliminary study to test general feasibility, the<br />
approach was applied to conformations extracted from an<br />
MD trajectory of the holo and apo structures of LFA1.<br />
The grid-based PocketAnalyzer program (Craig et al.,<br />
2011) was used to detect putative binding sites. “Dummy”<br />
ligands were generated for each detected pocket along the<br />
ensemble. Finally, the Constraint Network Analysis<br />
(CNA) software, which links biomacromolecular structure,<br />
(thermo-)stability, and function, was used to probe the<br />
allosteric response by monitoring altered stability<br />
characteristics of the protein due to the presence of the<br />
“dummy” ligand (Pfleger et al., 2013; Krüger et al., 2013;<br />
Pfleger, 2014). The results were compared to those of the<br />
holo structure with the bound allosteric ligand to validate<br />
the “dummy” ligand approach.<br />
RESULTS & DISCUSSION<br />
Remarkably, the use of “dummy” ligands almost<br />
perfectly reproduced the results obtained with the known<br />
allosteric effector, although the intrinsic<br />
rigidity of the “dummy” ligands over-stabilizes the LFA1<br />
structure. Even for<br />
the LFA1 apo structures, where the allosteric pocket is<br />
partially closed, the results are in agreement with known<br />
allosteric effectors. Overall, the results obtained from the<br />
validation of the “dummy” ligand approach are<br />
encouraging and suggest that the<br />
approach is a promising step towards characterizing unexplored allosteric<br />
pockets and identifying novel drug<br />
targets.<br />
REFERENCES<br />
Craig, I.R. et al. J. Chem. Inf. Model. 51 2666–2679 (2011).<br />
Krüger, D. M. et al. Nucleic Acids Res. 41 340–348 (2013).<br />
Motlagh, H.N. et al. Nature 508 7496 331–339 (2014).<br />
Pfleger, C. et al. J. Chem. Inf. Model. 53 1007–1015 (2013).<br />
Pfleger, C. Doctoral Thesis, Heinrich Heine University, Düsseldorf,<br />
Germany (2014).<br />
P26. PASSENGER MUTATIONS CONFOUND INTERPRETATION OF ALL<br />
GENETICALLY MODIFIED CONGENIC MICE<br />
Paco Hulpiau 1,2,3 *, Liesbet Martens 1,2,3 *, Yvan Saeys 1,2,3 , Peter Vandenabeele 1,2,4 & Tom Vanden Berghe 1,2 .<br />
Inflammation Research Center, VIB, Ghent, Belgium 1 ; Department of Biomedical Molecular Biology, Ghent University,<br />
Ghent, Belgium 2 ; Data Mining and Modelling for Biomedicine (DaMBi), Ghent, Belgium 3 ; Methusalem Program, Ghent<br />
University, Belgium 4 . *paco.hulpiau@irc.vib-ugent.be, liesbet.martens@irc.vib-ugent.be<br />
Targeted mutagenesis in mice is a powerful tool for functional analysis of genes. However, genetic variation between<br />
embryonic stem cells (ESCs) used for targeting (previously almost exclusively 129-derived) and recipient strains (often<br />
C57BL/6J) typically results in congenic mice in which the targeted gene is flanked by ESC-derived passenger DNA<br />
potentially containing mutations. Comparative genomic analysis of 129 and C57BL/6J mouse strains revealed indels and<br />
single nucleotide polymorphisms resulting in alternative or aberrant amino acid sequences in 1,084 genes in the 129-<br />
strain genome.<br />
INTRODUCTION<br />
Annotating the passenger mutations to the reported<br />
genetically modified congenic mice that were generated<br />
using 129-strain ESCs revealed that nearly all these mice<br />
possess multiple passenger mutations potentially<br />
influencing the phenotypic outcome. We illustrated this<br />
phenotypic interference of 129-derived passenger<br />
mutations with several case studies and developed the<br />
Me-PaMuFind-It web tool to estimate the number and possible<br />
effect of passenger mutations in transgenic mice of interest.<br />
METHODS<br />
We analyzed the SNP data release v3 from the Mouse<br />
Genome Project available at Sanger Institute (Keane et al.,<br />
2011). The data in the indel vcf file and SNP vcf file were<br />
filtered to retrieve indels and SNPs present in at least one<br />
of the three 129 strains (129P2/OlaH, 129S1/SvIm and<br />
129S5SvEvB) and affecting the protein coding sequence<br />
of the genes. These so-called protein coding variants are<br />
based on the following sequence ontology (SO) terms:<br />
stop gained, stop lost, inframe insertion, inframe deletion,<br />
frameshift variant, splice donor variant, splice acceptor<br />
variant, and coding sequence variant. In total, 949 indels<br />
and 446 SNPs affecting 1,084 mouse genes were retained.<br />
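The filtering rule described here (present in at least one 129 strain and annotated with a protein-coding consequence) can be sketched as a simple predicate.<br />
The record layout, the abbreviated strain labels and the function name are assumptions for illustration; this is not the authors' VCF-processing code.<br />

```python
# Sequence ontology terms treated as protein coding variants in the text.
CODING_SO_TERMS = {
    "stop_gained", "stop_lost", "inframe_insertion", "inframe_deletion",
    "frameshift_variant", "splice_donor_variant", "splice_acceptor_variant",
    "coding_sequence_variant",
}
STRAINS_129 = ("129P2", "129S1", "129S5")  # abbreviated labels, illustrative


def is_passenger_candidate(variant):
    """True if the variant is carried by any 129 strain and is protein coding."""
    in_129 = any(variant["genotypes"].get(s, 0) > 0 for s in STRAINS_129)
    coding = bool(CODING_SO_TERMS & set(variant["consequences"]))
    return in_129 and coding


# Toy records: genotype = number of alternate alleles per strain.
variants = [
    {"id": "v1", "genotypes": {"129P2": 1, "129S1": 0},
     "consequences": ["frameshift_variant"]},
    {"id": "v2", "genotypes": {"129P2": 0}, "consequences": ["stop_gained"]},
    {"id": "v3", "genotypes": {"129S5": 2}, "consequences": ["intron_variant"]},
]
retained = [v["id"] for v in variants if is_passenger_candidate(v)]
```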
We gathered chromosome and gene start and end positions<br />
for 1,084 genes covering 1,395 variations. The Ensembl<br />
gene ID was used to find the most upstream and<br />
downstream start and stop in all Ensembl transcripts for<br />
that gene. Next these genome coordinates were used to<br />
search for flanking genes within 2, 10, and 20 Mbp<br />
upstream and downstream. We then downloaded all mouse<br />
phenotypic allele data from the MGI resource and<br />
extracted the data of genetically modified mouse lines.<br />
Information on 5,322 genes (corresponding to 7,979 129-<br />
derived genetically modified mouse lines) was connected<br />
to genes with passenger mutations and affected genes.<br />
Additionally we filtered the data to identify putative<br />
regulatory variants. All data were stored in a MySQL<br />
database and can be queried using the publicly available<br />
web tool Me-PaMuFind-It:<br />
http://me-pamufind-it.org/<br />
Passenger genome mutations in gene-targeted mice (Nechanitzky and<br />
Mak, <strong>2015</strong>)<br />
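The flanking-gene search in the methods can be pictured as an interval query around each targeted gene for the three window sizes.<br />
The following sketch uses made-up coordinates and a hypothetical record layout; the actual tool queries a MySQL database, which is not reproduced here.<br />

```python
WINDOWS_MBP = (2, 10, 20)


def flanking_genes(target, genes, window_mbp):
    """Genes on the same chromosome within window_mbp of the target span."""
    span = window_mbp * 1_000_000
    lo, hi = target["start"] - span, target["end"] + span
    return [g["id"] for g in genes
            if g["id"] != target["id"] and g["chrom"] == target["chrom"]
            and g["end"] >= lo and g["start"] <= hi]


# Toy gene coordinates (base pairs); identifiers are placeholders.
genes = [
    {"id": "target", "chrom": "11", "start": 50_000_000, "end": 50_020_000},
    {"id": "near", "chrom": "11", "start": 51_000_000, "end": 51_010_000},
    {"id": "mid", "chrom": "11", "start": 58_000_000, "end": 58_030_000},
    {"id": "far", "chrom": "11", "start": 90_000_000, "end": 90_040_000},
    {"id": "other_chrom", "chrom": "4", "start": 50_000_000, "end": 50_010_000},
]
by_window = {w: flanking_genes(genes[0], genes, w) for w in WINDOWS_MBP}
```

Intersecting each window's gene list with the set of genes carrying passenger mutations then gives the per-line count of potentially confounding variants.<br />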
RESULTS & DISCUSSION<br />
The vast majority of existing and well-characterized<br />
genetically engineered congenic mice have been created<br />
using 129 ESCs. 99.5% of these mouse lines are affected<br />
by a median number of 20 passenger mutations within a<br />
10 cM flanking region. This implies that nearly all<br />
genetically modified congenic mice contain multiple<br />
passenger mutations despite intensive backcrossing.<br />
Consequently, the phenotypes observed in these mice<br />
might be due to flanking passenger mutations rather than a<br />
defect in the targeted gene (Vanden Berghe et al, <strong>2015</strong>).<br />
REFERENCES<br />
Keane, T.M., Goodstadt, L., Danecek, P., White, M.A., Wong, K., Yalcin,<br />
B., Heger, A., Agam, A., Slater, G., Goodson, M., et al. (2011).<br />
Mouse genomic variation and its effect on phenotypes and gene<br />
regulation. Nature 477, 289–294.<br />
Nechanitzky R, Mak TW (<strong>2015</strong>). Passenger Mutations Identified in the<br />
Blink of an Eye. Immunity 43(1), 9-11.<br />
Vanden Berghe, T., Hulpiau, P., Martens, L. et al (<strong>2015</strong>). Passenger<br />
Mutations Confound Interpretation of All Genetically Modified<br />
Congenic Mice. Immunity 43(1), 200-9.<br />
P27. DETECTING MIXED MYCOBACTERIUM TUBERCULOSIS INFECTION<br />
AND DIFFERENCES IN DRUG SUSCEPTIBILITY WITH WGS DATA<br />
Arlin Keo 1 & Thomas Abeel 1,2,* .<br />
Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands 1 ; Broad Institute of MIT<br />
and Harvard, Cambridge, MA, USA 2 . * t.abeel@tudelft.nl<br />
Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis and infects millions of people. When a<br />
person is infected with more than one distinct strain type of tuberculosis (TB), referred to as mixed infection, diagnosis<br />
and treatment are complicated. Due to the difficulty of diagnosis, the prevalence of mixed infections among TB patients<br />
remains uncertain. Whole genome sequencing (WGS) yields a great number of single nucleotide polymorphisms (SNPs)<br />
and offers increased resolution to distinguish distinct strains. Here, we present a tool that maps sample reads against 21<br />
bp cluster-specific SNP markers to detect putative mixed infections and estimate the frequencies of the present<br />
subpopulations.<br />
INTRODUCTION<br />
Mycobacterium tuberculosis is a clonal, bacterial pathogen<br />
that causes the pulmonary disease tuberculosis (TB), and it<br />
infects and kills millions of people worldwide [1]. The<br />
study of genetic diversity within the M. tuberculosis<br />
complex (MTBC) is complicated by mixed TB infections,<br />
which occur when a person is infected with more than<br />
one distinct strain type of MTBC. This often results in<br />
poor diagnosis and treatment of patients, as the bacterial<br />
subpopulations may have undetected differences in drug<br />
susceptibility [2]. A strain typing method should be able to<br />
distinguish closely related strains, to also allow the<br />
detection of a mixed infection at finer resolutions [3]. This<br />
study aims to detect a possible mixed TB infection at<br />
different levels in MTBC and to determine the frequencies<br />
of the present strains based on established tree paths in the<br />
MTBC phylogenetic tree.<br />
METHODS<br />
A global comprehensive dataset of 5992 MTBC strains<br />
was used for analysis, and 226570 SNPs were extracted<br />
from this set to construct a SNP-based phylogenetic tree<br />
with RAxML. In this bifurcating tree, each branch<br />
represents a cluster of strains and splits into two new<br />
monophyletic subclusters of genetically more closely<br />
related strains. These “splits” were used to define clusters<br />
and subclusters that contain more than 10 strains. Global SNP<br />
association was done for each cluster to obtain cluster-specific<br />
SNPs, i.e. those for which the true positive rate, true<br />
negative rate, positive predictive value, and negative<br />
predictive value were >0.95. Markers were generated from<br />
these SNPs by extending them with 10 bp sequence on<br />
each side based on reference genome H37Rv. Each<br />
hierarchical cluster now has a set of specific SNP markers.<br />
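The two steps just described, selecting cluster-specific SNPs by four classification metrics and padding each SNP to a 21 bp marker, can be sketched as follows.<br />
This is an illustrative reconstruction, not the authors' tool: the function names are hypothetical, positions are 0-based, and a toy string stands in for the H37Rv reference.<br />

```python
def is_cluster_specific(has_snp, in_cluster, threshold=0.95):
    """True if TPR, TNR, PPV and NPV of the SNP for the cluster all exceed threshold."""
    pairs = list(zip(has_snp, in_cluster))
    tp = sum(1 for s, c in pairs if s and c)
    fp = sum(1 for s, c in pairs if s and not c)
    fn = sum(1 for s, c in pairs if not s and c)
    tn = sum(1 for s, c in pairs if not s and not c)
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        return False                       # a metric would be undefined
    metrics = (tp / (tp + fn), tn / (tn + fp), tp / (tp + fp), tn / (tn + fn))
    return all(m > threshold for m in metrics)


def make_marker(reference, position, alt, flank=10):
    """Marker = alt allele with `flank` reference bases on each side."""
    left = reference[position - flank:position]
    right = reference[position + 1:position + 1 + flank]
    return left + alt + right


# 100 toy strains: the SNP is carried by exactly the 30 strains of the cluster.
perfect = is_cluster_specific([True] * 30 + [False] * 70,
                              [True] * 30 + [False] * 70)
# Same SNP tested against a cluster of only 20 strains: PPV drops to 20/30.
leaky = is_cluster_specific([True] * 30 + [False] * 70,
                            [True] * 20 + [False] * 80)

toy_reference = "ACGT" * 10                # placeholder, not H37Rv
marker = make_marker(toy_reference, 20, "T")
```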
By mapping sample reads against these 21 bp cluster-specific<br />
SNP markers, the tool determines the presence of<br />
paths in the phylogenetic tree that start at the MTBC root<br />
node. Paths that split indicate the presence of multiple<br />
strains and thus a mixed infection.<br />
The read depth at the root node represents a frequency of 1<br />
of the present MTBC species. If the path splits further in<br />
the tree, the total read depth is divided over the two<br />
subpaths and determines the frequencies of those present<br />
subclusters (Figure 1).<br />
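The frequency estimate can be read as propagating a share of 1 from the root and dividing it at every split in proportion to read depth.<br />
The sketch below follows that reading; the cluster labels, the depth values and the `min_depth` presence cut-off are assumptions, not parameters of the published tool.<br />

```python
def subpopulation_frequencies(tree, depth, root="MTBC", min_depth=1.0):
    """Propagate frequencies from the root; a split divides the parent's share by read depth."""
    freqs = {}

    def walk(node, share):
        present = [c for c in tree.get(node, []) if depth.get(c, 0.0) >= min_depth]
        if not present:
            freqs[node] = share            # path ends here: one detected subpopulation
            return
        total = sum(depth[c] for c in present)
        for child in present:
            walk(child, share * depth[child] / total)

    walk(root, 1.0)
    return freqs


# Toy tree: node -> child clusters; depth = mean read depth over a cluster's markers.
tree = {"MTBC": ["L2", "L4"], "L4": ["L4.1", "L4.2"]}
depth = {"MTBC": 100.0, "L2": 30.0, "L4": 70.0, "L4.1": 70.0, "L4.2": 0.0}

freqs = subpopulation_frequencies(tree, depth)
is_mixed = len(freqs) > 1                  # more than one path implies a mixed infection
```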
FIGURE 1. Detection of mixed TB infection with hierarchical clusters.<br />
The detected strains are combined with detected drug<br />
susceptibility profiles. A minimized reference genome<br />
consisting of drug resistance genes and 1000 bp flanking<br />
regions is used to map sample reads with BWA, and call<br />
variants with Pilon. Ambiguous variation calls may<br />
indicate that present strains in a mixed infection sample<br />
also have differences in drug susceptibility.<br />
RESULTS & DISCUSSION<br />
In the phylogenetic tree 308 clusters (MTBC root<br />
excluded) were defined and there are 14823 SNP markers<br />
in total that are specific to a cluster and unique within the<br />
cluster. The known MTBC lineages 1 to 6 have between<br />
355 and 614 markers.<br />
7661 TB samples were tested; present strain(s) and<br />
frequencies could be predicted for 7495 samples, of which<br />
914 (~12%) are mixed infections (Table 1).<br />
# of subpopulations | 1 | 2 | 3 | >3<br />
# of samples | 6581 | 798 | 95 | 21<br />
TABLE 1. 914 out of 7495 samples are mixed infections.<br />
REFERENCES<br />
1. World Health Organization. Global Tuberculosis Report. World<br />
Health Organization, Geneva, Switzerland, 2014.<br />
2. Zetola et al. Mixed Mycobacterium tuberculosis complex infections<br />
and false-negative results for rifampicin resistance by GeneXpert<br />
MTB/RIF are associated with poor clinical outcomes. Journal of<br />
Clin. Microb., 52:2422-2429, 2014.<br />
3. G. Plazzotta, T. Cohen, and C. Colijn. Magnitude and sources of bias<br />
in the detection of mixed strain M. tuberculosis infection. Journal of<br />
theoretical biology, 368:67–73, <strong>2015</strong>.<br />
P28. APPLICATION OF HIGH-THROUGHPUT SEQUENCING TO<br />
CIRCULATING MICRORNAS REVEALS NOVEL BIOMARKERS FOR DRUG-<br />
INDUCED LIVER INJURY<br />
Julian Krauskopf 1* , Florian Caiment 1 , Sandra Claessen 1 , Kent J. Johnson 2 , Roscoe L. Warner 2 , Shelli J. Schomaker 3 ,<br />
Deborah A. Burt 3 , Jiri Aubrecht 3 , Jos C. Kleinjans 1 .<br />
Department of Toxicogenomics, Maastricht University, Maastricht 6200 MD, The Netherlands 1 ; Pathology Department,<br />
University of Michigan, Ann Arbor, MI 48109, USA 2 ; Drug Safety Research and Development, Pfizer, Inc., Groton, CT<br />
06340, USA 3 . *j.krauskopf@maastrichtuniversity.nl<br />
Drug-induced liver-injury (DILI) is a leading cause of acute liver failure and the major reason for withdrawal of drugs<br />
from the market. Preclinical evaluation of drug candidates has failed to detect about 40% of potentially hepatotoxic<br />
compounds in humans. At the onset of liver injury in humans, currently used biomarkers have difficulty differentiating<br />
severe DILI from mild DILI and predicting the outcome of injury for individual subjects. Therefore, new biomarker<br />
approaches for predicting and diagnosing DILI in humans are urgently needed. Recently, circulating microRNAs<br />
(miRNAs) such as miR-122 and miR-192 have emerged as promising biomarkers of liver injury in preclinical species<br />
and in DILI patients. In this study, we focused on examining global circulating miRNA profiles in serum samples from<br />
subjects with liver injury caused by accidental acetaminophen (APAP)-overdose. Upon applying next generation high-throughput<br />
sequencing of small RNA libraries, we identified 36 miRNAs, including three novel miRNA-like small<br />
nuclear RNAs, which were enriched in serum of APAP overdosed subjects. The set comprised miRNAs that are<br />
functionally associated with liver-specific biological processes and relevant to APAP toxic mechanisms. Although more<br />
patients need to be investigated, our study suggests that profiles of circulating miRNAs in human serum might provide<br />
additional biomarker candidates and possibly mechanistic information relevant to liver injury.<br />
P29. INFORMATION THEORETIC MODEL FOR GENE PRIORITIZATION<br />
Ajay Anand Kumar 1,2 * , Geert Vandeweyer 1,2 , Lut Van Laer 1,2 & Bart Loeys 1,2 .<br />
Department of Medical Genetics, University of Antwerp 1 ; Biomedical informatics, Antwerp University Hospital 2 .<br />
*ajay.kumar@uantwerpen.be<br />
The identification of top candidate genes involved in human diseases from a list of candidate genes remains<br />
computationally challenging. Many tools exist for this computational prioritization, of which the core typically utilizes<br />
fusion or integration of various genomic annotation data sources. However, due to the rapid generation of novel data by<br />
high-throughput experiments, annotation sources often become outdated, leading to annotation errors. Hence, predictions<br />
based on these computational tools can be unreliable. To tackle this, we propose an information theoretic model that<br />
effectively fuses annotation sources with a regression model in a Bayesian framework to prioritize candidate genes. Our<br />
method is fast and performs better than four existing tools on their own benchmark dataset.<br />
INTRODUCTION<br />
Gene prioritization has become a central research problem<br />
in the bioinformatics domain. With the advent of exome<br />
sequencing in clinical genetics, it has become necessary to<br />
automate the identification of the genes most likely<br />
involved in a disease from a given pool of affected<br />
genes. Various annotation sources can be integrated or<br />
fused to learn the multiple functions of genes and then<br />
design a classifier or regressor for prioritization. We<br />
propose here an early data integration method that<br />
implements an information retrieval model to fuse the<br />
data at the functional feature level and then builds a<br />
discriminative regression model in a Bayesian framework to<br />
prioritize candidate genes.<br />
METHODS<br />
The principle behind our approach is guilt-by-association.<br />
Genes that are known to be disease-associated<br />
might also share similar functions. The idea is that a<br />
classifier or regressor can be trained on the linear<br />
mapping between functional proximity profiles of genes<br />
and their phenotypic proximity profiles. We implemented<br />
a Bayesian regressor to infer the degree of association of the<br />
test genes with the query disease. The workflow is<br />
shown in Figure 1. The details are:<br />
1. Functional annotation: Text, Ontologies (GO, MPO),<br />
Sequence similarity, Pathways, Interactions. Phenotype<br />
annotation: Human Phenotype Ontology (HPO), Disease<br />
Ontology (DO), HuGE/MeSH terms and GAD.<br />
2. The TF-IDF (Term Frequency – Inverse Document<br />
Frequency) methodology is used to assign statistical<br />
weights to the functional attributes of genes from these<br />
annotation sources. TF-IDF is a data-driven model<br />
traditionally used for information retrieval. We apply the same<br />
methodology for weighting features. Together, this gives<br />
gene-by-gene functional & phenotypic proximity profiles.<br />
3. Finally, for a given set of query disease or training genes,<br />
the Bayesian linear regression model learns the<br />
linear mapping between functional & phenotypic<br />
proximity profiles: Y = βX + η, where η is Gaussian<br />
distributed. We have incorporated traditional non-informative<br />
Normal-Inverse-Gamma (NIG) priors for<br />
estimating the unknowns, namely β and σ.<br />
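Steps 2 and 3 can be made concrete with a small sketch: TF-IDF weights over gene annotations, followed by the standard conjugate Normal-Inverse-Gamma update for a linear regression.<br />
This is a textbook formulation under assumed names and toy data, written as y = Xβ + noise with samples in rows; it is not the authors' implementation, and their exact prior settings are not given in the text.<br />

```python
import math
import numpy as np


def tfidf(annotations):
    """TF-IDF weight per gene and term; `annotations` maps gene -> list of terms."""
    n = len(annotations)
    df = {}
    for terms in annotations.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    return {g: {t: (terms.count(t) / len(terms)) * math.log(n / df[t])
                for t in set(terms)}
            for g, terms in annotations.items()}


def nig_posterior(X, y, m0, V0, a0, b0):
    """Conjugate update for beta | sigma^2 ~ N(m0, sigma^2 V0), sigma^2 ~ IG(a0, b0)."""
    V0_inv = np.linalg.inv(V0)
    precision = V0_inv + X.T @ X
    Vn = np.linalg.inv(precision)
    mn = Vn @ (V0_inv @ m0 + X.T @ y)      # posterior mean of beta
    an = a0 + len(y) / 2.0
    bn = b0 + 0.5 * (y @ y + m0 @ V0_inv @ m0 - mn @ precision @ mn)
    return mn, Vn, an, bn


# Toy annotations for three hypothetical genes.
annotations = {
    "g1": ["kinase", "membrane"],
    "g2": ["kinase"],
    "g3": ["kinase", "nucleus", "nucleus"],
}
weights = tfidf(annotations)

# Toy regression: noiseless targets generated with beta = (2, -1), vague prior.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
mn, Vn, an, bn = nig_posterior(X, y, np.zeros(2), 1e6 * np.eye(2), 1.0, 1.0)
```

A term shared by every gene (here "kinase") receives zero weight, so only discriminating annotations contribute to the proximity profiles that feed the regression.<br />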
RESULTS & DISCUSSION<br />
We performed a leave-one-out cross-validation experiment<br />
on the benchmark data set that was used to compare four<br />
other tools whose design principles are similar to those of our<br />
method [1]. The dataset consisted of 1040 disease genes<br />
categorized into 12 manually curated disease<br />
classes [2]. In our preliminary results for 1154<br />
prioritizations under the cut-offs of the top 5%, 10% and 30% of<br />
genes ranked in a random control dataset, we achieved an<br />
AUROC of 86.31% against their best achieved score of<br />
83.0%. This suggests that our method compares<br />
favorably with the other tools included in the comparative<br />
analysis.<br />
FIGURE 1. Workflow of Bayesian regression model for gene<br />
prioritization.<br />
Currently, we are conducting large-scale cross-validation<br />
with 6762 manually curated disease-gene associations and<br />
a larger number of tools and benchmark data [3].<br />
Additionally, we plan to develop a<br />
probabilistic generative approach to model co-occurrences and<br />
dependencies of features for effective data<br />
fusion that can help in finding novel disease-causing<br />
genes.<br />
REFERENCES<br />
1. Chen B et al. BMC Med Genomics. <strong>2015</strong>;8 Suppl 3:S2<br />
2. Goh et al. Proc Natl Acad Sci USA 2007, 104(21):8685-8690<br />
3. Börnigen, Daniela, et al. Bioinformatics 28.23 (2012): 3081-3088.<br />
P30. GALAHAD: A WEB SERVER FOR THE ANALYSIS OF DRUG EFFECTS<br />
FROM GENE EXPRESSION DATA<br />
Griet Laenen 1,2,* , Amin Ardeshirdavani 1,2 , Yves Moreau 1,2 & Lieven Thorrez 1,3 .<br />
Dept. of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics,<br />
KU Leuven 1 ; iMinds Medical IT Dept., KU Leuven 2 ; Dept. of Development and Regeneration @ Kulak, KU Leuven 3 .<br />
* griet.laenen@esat.kuleuven.be<br />
Galahad (https://galahad.esat.kuleuven.be) is a web-based application for the analysis of gene expression data from drug<br />
treatment versus control experiments, aimed at predicting a drug’s molecular targets and biological effects. Galahad<br />
provides data quality assessment and exploratory analysis, as well as computation of differential expression. Based on<br />
the obtained differential expression values, drug target prioritization and both pathway and disease enrichment can be<br />
calculated and visualized. Drug target prioritization is based on the integration of the gene expression data with a<br />
functional protein association network.<br />
INTRODUCTION<br />
Gene expression analysis is frequently employed to study<br />
the effects of drug compounds on cells. The observed<br />
transcriptional patterns can provide valuable information<br />
for identifying compound–protein interactions as well as<br />
resulting biological effects. To facilitate the analysis of<br />
this particular data type and enable an in-depth exploration<br />
of a drug’s mode of effect, we have developed Galahad 1 .<br />
INPUT<br />
The main input for Galahad is raw Affymetrix human,<br />
mouse or rat DNA microarray data derived from both<br />
untreated control samples and samples treated with a drug<br />
of interest. In addition, Galahad provides the possibility to<br />
start from differential expression data derived with other<br />
platforms to perform drug target prioritization and<br />
enrichment analysis.<br />
METHODS<br />
The different analyses are depicted in Figure 1 and<br />
include:<br />
• preprocessing of the raw data with RMA or<br />
MAS5.0, as indicated by the user;<br />
• quality assessment and exploratory analysis to<br />
ascertain data quality, uncover experimental<br />
issues, and help in deciding whether certain<br />
arrays need to be considered as outlying;<br />
• differential expression analysis to determine the<br />
significance of gene up- and downregulation<br />
following drug treatment;<br />
• genome-wide drug target prioritization by<br />
means of an in-house developed algorithm for<br />
network neighborhood analysis integrating the<br />
expression data with functional protein<br />
association information 2 ;<br />
• prediction of molecular pathways involved in the<br />
drug’s mode of effect;<br />
• identification of associated disease phenotypes<br />
enabling side effect prediction and drug<br />
repositioning.<br />
OUTPUT<br />
The output is displayed in a series of tabs corresponding to<br />
the different analyses selected by the user:<br />
• in the Quality Control and Data Exploration<br />
tabs, several diagnostic plots are displayed along<br />
with a short explanation;<br />
• the Differential Expression tab contains a sorted<br />
table listing all genes together with their log2<br />
ratios and P-values for differential expression, as<br />
well as links to the corresponding GeneCards<br />
sections;<br />
• in the Drug Target Prioritization tab, a ranked<br />
list of genes as potential targets of the drug can be<br />
found, together with the network diffusion-based<br />
scores and P-values for prioritization, and links to<br />
the corresponding GeneCards section; in addition,<br />
a network-based visualization is available for<br />
each gene, showing the 10 interaction partners<br />
contributing most to the gene’s ranking;<br />
• the tabs summarizing the results for Pathway<br />
and Disease Enrichment contain a sorted table<br />
with pathway or disease ontology IDs, names,<br />
and database links, together with the number of<br />
differentially expressed genes in the<br />
corresponding gene sets and the accompanying<br />
P-values; in addition, network graphs are available,<br />
consisting of the top 10 most significant<br />
pathways or disease phenotypes, along with their<br />
associated genes colored according to fold change.<br />
FIGURE 1. Overview of the Galahad analysis steps.<br />
REFERENCES<br />
1. Laenen G. et al. Nucl Acids Res 43, W208-W212 (2015).<br />
2. Laenen G. et al. Mol BioSyst 9, 1676-1685 (2013).<br />
P31. KMAD: KNOWLEDGE BASED MULTIPLE SEQUENCE ALIGNMENT<br />
FOR INTRINSICALLY DISORDERED PROTEINS<br />
Joanna Lange 1,2 , Lucjan S Wyrwicz 1 & Gert Vriend 2* .<br />
Laboratory of Bioinformatics and Biostatistics, M. Sklodowska-Curie Memorial Cancer Center;<br />
Institute of Oncology 1 , CMBI, Radboud University Nijmegen 2 . * vriend@cmbi.ru.nl<br />
INTRODUCTION<br />
Intrinsically disordered proteins (IDPs) lack tertiary<br />
structure and thus differ from globular proteins in terms of<br />
their sequence – structure – function relations. IDPs have a<br />
lower sequence conservation, different types of active<br />
sites, and a different distribution of functionally important<br />
regions, which altogether makes their multiple sequence<br />
alignment (MSA) difficult.<br />
Algorithms underlying existing MSA programs are<br />
directly or indirectly based on knowledge obtained from<br />
studying three-dimensional protein structures. Here we<br />
introduce a tool for Knowledge based Multiple sequence<br />
Alignment for intrinsically Disordered proteins, KMAD,<br />
that incorporates SLiM, domain, and PTM annotations to<br />
improve the alignments.<br />
The KMAD web server is accessible at<br />
http://www.cmbi.ru.nl/kmad/. A standalone version is<br />
freely available.<br />
METHODS<br />
A dataset of proteins experimentally proven to be disordered<br />
was obtained from DisProt (Sickmeier et al., 2007). For<br />
each IDP, all homologous sequences were extracted from<br />
SwissProt (The Uniprot Consortium, 2014) using BLAST.<br />
The sequence sets were aligned with several MSA tools.<br />
Apart from manual validation, we also performed a<br />
benchmark validation on reference sets from BAliBASE<br />
(Thompson et al., 2005) and PREFAB, which hold structure-based<br />
'gold standard' sequence alignments. For this<br />
purpose we used KMAD and a modified version of<br />
KMAD, which performs a 'refinement' of Clustal Omega<br />
(Sievers et al., 2011) alignments.<br />
RESULTS & DISCUSSION<br />
Manual validation showed that KMAD avoids many<br />
mistakes made by Clustal Omega. An example of an<br />
alignment mistake is shown in Figure 1.<br />
a) Clustal Omega<br />
b) KMAD<br />
FIGURE 1. Excerpts from Clustal Omega and KMAD alignments of<br />
human sialoprotein (SIAL_HUMAN) with four homologues. Different PTM<br />
types are highlighted in bright colours.<br />
In the field of sequence alignment research it is common<br />
practice to compare the sequence alignments obtained with<br />
MSA software with those that are obtained from structure<br />
superpositions. IDPs do not possess a static 3D structure,<br />
so this method is not applicable to KMAD alignments.<br />
Both of the validation methods that we used have their<br />
disadvantages, but so far there is no alternative. Validation<br />
on benchmark alignments of structured proteins is biased<br />
towards Clustal Omega, because it was optimized to work<br />
with structured proteins. On the other hand, manual<br />
inspection based on the same features that influence the<br />
alignment is not a very elegant method but, given the<br />
nature of IDPs, it is probably the best we can do.<br />
REFERENCES<br />
Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high<br />
accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–<br />
1797.<br />
Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W.,<br />
Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J.<br />
D., and Higgins, D. G. (2011). Fast, scalable generation of high-quality<br />
protein multiple sequence alignments using Clustal Omega.<br />
Molecular Systems Biology, 7(539), 539.<br />
Sickmeier, M., Hamilton, J. A., LeGall, T., Vacic, V., Cortese, M. S.,<br />
Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V. N.,<br />
Obradovic, Z., and Dunker, A. K. (2007). DisProt: the Database of<br />
Disordered Proteins. Nucleic Acids Research, 35(Database issue),<br />
D786–93.<br />
The Uniprot Consortium (2014). Activities at the Universal Protein<br />
Resource (UniProt). Nucleic Acids Research, 42(Database issue),<br />
D191–8.<br />
Thompson, J. D., Koehl, P., Ripp, R., and Poch, O. (2005). BAliBASE<br />
3.0: latest developments of the multiple sequence alignment<br />
benchmark. Proteins: Structure, Function, and Bioinformatics,<br />
61(1), 127–136.<br />
P32. ON THE LZ DISTANCE FOR DEREPLICATING<br />
REDUNDANT PROKARYOTIC GENOMES<br />
Raphaël R. Léonard 1,2* , Damien Sirjacobs², Eric Sauvage 1 , Frédéric Kerff 1 & Denis Baurain².<br />
Centre for Protein Engineering, University of Liège 1 ; PhytoSYSTEMS, University of Liège 2 . * rleonard@doct.ulg.ac.be<br />
The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem<br />
when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of<br />
the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to<br />
single species, such as Escherichia coli (almost 2000 genomes as of Sept 2015), while the continuous flow of newly<br />
discovered phyla calls for regular updates. This situation makes it difficult to maintain sets of representative genomes<br />
combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. A<br />
straightforward automated method is required, but none is publicly available. The LZ distance, in conjunction with the<br />
quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes<br />
without redundancy. We are planning to release this tool on a website that will be made publicly available.<br />
INTRODUCTION<br />
The LZ distance (Ziv and Lempel, 1977; Otu and Sayood,<br />
2003) is inspired by compression algorithms, such as those<br />
used by gzip or WinRAR. This distance, amongst others, has already<br />
been used in attempts to produce alignment-free<br />
phylogenetic trees (Bacha and Baurain, 2005; Höhl and Ragan,<br />
2007), though the results were disappointing in such a<br />
context (due to the heterogeneity of the substitution<br />
process at large evolutionary scales). However, the LZ<br />
distance is likely to provide enough resolving power to<br />
identify groups of redundant genomes and to keep only<br />
one representative for each group.<br />
METHODS<br />
For each pair of genomes A and B, the LZ distance is<br />
computed from the gzip-compressed file lengths of the<br />
corresponding nucleotide assemblies s(A) and s(B) and of<br />
their concatenations s(A+B) and s(B+A). These distances,<br />
along with taxonomic information, are stored in a<br />
database.<br />
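As a minimal sketch of the computation described above, the Python snippet below derives a<br />
compression-based distance from gzip-compressed lengths. The abstract does not state the exact<br />
formula: the normalisation used here, max(s(A+B) - s(A), s(B+A) - s(B)) / max(s(A), s(B)), is an<br />
assumption chosen for illustration, and the function names and toy sequences are hypothetical.<br />

```python
import gzip
import random


def gz_len(seq: bytes) -> int:
    """Length in bytes of the gzip-compressed sequence."""
    return len(gzip.compress(seq, compresslevel=9))


def lz_distance(a: bytes, b: bytes) -> float:
    """Compression-based distance from s(A), s(B), s(A+B) and s(B+A).

    Assumed normalisation (not given in the abstract):
    max(s(A+B) - s(A), s(B+A) - s(B)) / max(s(A), s(B)).
    """
    s_a, s_b = gz_len(a), gz_len(b)
    s_ab, s_ba = gz_len(a + b), gz_len(b + a)
    return max(s_ab - s_a, s_ba - s_b) / max(s_a, s_b)


# Toy data: two unrelated random "assemblies" (hypothetical, for illustration only).
rng = random.Random(0)
genome_a = bytes(rng.choice(b"ACGT") for _ in range(2000))
genome_b = bytes(rng.choice(b"ACGT") for _ in range(2000))

d_same = lz_distance(genome_a, genome_a)  # small: a fully redundant pair
d_diff = lz_distance(genome_a, genome_b)  # large: unrelated sequences
```

The toy sequences are kept short because DEFLATE, the algorithm behind gzip, only searches a<br />
32 KB sliding window for repeats.<br />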
A clustering method is then applied to group similar<br />
genomes into a user-specified number of groups. For each<br />
of these groups, a representative is chosen based on the<br />
quality of the genomic assemblies (chromosomes rather<br />
than scaffolds) and of the protein annotations (e.g., few<br />
rather than many “unknown proteins”).<br />
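The two steps above can be sketched as follows. The abstract does not name the clustering<br />
method, so the naive average-linkage agglomerative clustering used here is an assumption, as<br />
are the metadata fields (`level`, `unknown_fraction`), the genome names and all toy values.<br />

```python
from itertools import combinations


def cluster_genomes(names, dist, k):
    """Naive average-linkage agglomerative clustering down to k groups.

    `dist` maps frozenset({x, y}) to a distance; the linkage is an assumption.
    """
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the two closest clusters (i < j, so index i survives the pop).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters


def pick_representative(cluster, meta):
    """Prefer chromosome-level assemblies, then fewer 'unknown' proteins."""
    return min(cluster, key=lambda g: (meta[g]["level"] != "chromosome",
                                       meta[g]["unknown_fraction"]))


# Hypothetical toy input: four genomes forming two redundant pairs.
names = ["ecoli_1", "ecoli_2", "bsub_1", "bsub_2"]
dist = {frozenset(p): 0.9 for p in combinations(names, 2)}
dist[frozenset(("ecoli_1", "ecoli_2"))] = 0.05
dist[frozenset(("bsub_1", "bsub_2"))] = 0.08
meta = {
    "ecoli_1": {"level": "scaffold", "unknown_fraction": 0.10},
    "ecoli_2": {"level": "chromosome", "unknown_fraction": 0.20},
    "bsub_1": {"level": "chromosome", "unknown_fraction": 0.30},
    "bsub_2": {"level": "chromosome", "unknown_fraction": 0.15},
}

clusters = cluster_genomes(names, dist, k=2)
representatives = [pick_representative(c, meta) for c in clusters]
```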
RESULTS & DISCUSSION<br />
Our method using the LZ distance is currently under<br />
development using the genomes from release 28 of<br />
Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/bacteria/release-28/).<br />
This release contains 20,950 unique<br />
prokaryotic genomes, composed of 286 Archaea and<br />
20,664 Bacteria. The three most represented phyla are the<br />
Proteobacteria (8642, of which 1980 E. coli), the<br />
Firmicutes (7766) and the Actinobacteria (2673). These<br />
genomes are already the result of a pre-processing step<br />
designed to remove extra assemblies for strains present in<br />
multiple copies (due to parallel sequencing or<br />
resequencing in different labs).<br />
We are working on different approaches for validating our<br />
dereplication method, based on (1) current taxonomy, (2)<br />
16S rRNA phylogeny, and (3) clustering using genomic<br />
signatures (Moreno-Hagelsieb et al. 2013).<br />
First, we compute a central measure of the taxonomic<br />
“purity” of all genome clusters, which reflects the amount<br />
of “mixture” at different taxonomic levels (phylum, class,<br />
order, etc.). A good clustering should group different<br />
genera (or species) without amalgamating distinct classes<br />
(or phyla). Second, we cut the branches of a large 16S<br />
rRNA tree based on the same genome collection to<br />
produce an equal number of groups to compare with our<br />
clustering method. We then compute a statistic of the<br />
overlap between the 16S subtrees and the LZ clusters. A<br />
good clustering should have a reasonable overlap with the<br />
gold standard that is the 16S rRNA tree. Third, using the<br />
same overlap metric, we compare the LZ clusters to<br />
clusters obtained using the genomic signature.<br />
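The abstract does not specify which purity measure or overlap statistic is used. As a hedged<br />
illustration only, the sketch below uses majority-taxon purity and the Rand index as stand-ins;<br />
both choices, the function names and the toy labels are assumptions.<br />

```python
from collections import Counter
from itertools import combinations


def purity(cluster_of, taxon_of):
    """Fraction of genomes that carry the majority taxon of their cluster."""
    members = {}
    for genome, cluster in cluster_of.items():
        members.setdefault(cluster, []).append(taxon_of[genome])
    majority = sum(Counter(taxa).most_common(1)[0][1] for taxa in members.values())
    return majority / len(cluster_of)


def rand_index(part_a, part_b):
    """Share of genome pairs on which two partitions agree (same vs. different group)."""
    pairs = list(combinations(sorted(part_a), 2))
    agree = sum((part_a[x] == part_a[y]) == (part_b[x] == part_b[y]) for x, y in pairs)
    return agree / len(pairs)


# Hypothetical toy labels for four genomes.
lz_clusters = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
rrna_groups = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}
phylum = {"g1": "Proteobacteria", "g2": "Proteobacteria",
          "g3": "Firmicutes", "g4": "Proteobacteria"}

phylum_purity = purity(lz_clusters, phylum)      # one value per taxonomic level
overlap = rand_index(lz_clusters, rrna_groups)   # LZ clusters vs. 16S subtrees
```

The same `rand_index` call could be reused for the third comparison by passing the<br />
genomic-signature clusters instead of the 16S groups.<br />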
Finally, an interactive tool will be made available through<br />
a website. It will allow the users to download precomputed<br />
sets of representative genomes for either the<br />
complete database or for taxonomic subsets. We are also<br />
planning to allow users to upload their own genomes to<br />
cluster them with the LZ method.<br />
REFERENCES<br />
Ziv, J. and A. Lempel. 1977. ‘A Universal Algorithm for Sequential Data<br />
Compression.’ IEEE Transactions on Information Theory 23.3.<br />
doi:10.1109/TIT.1977.1055714.<br />
Otu, H. H. and K. Sayood. 2003. ‘A New Sequence Distance Measure for<br />
Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130.<br />
doi:10.1093/bioinformatics/btg295.<br />
Moreno-Hagelsieb, G., Z. Wang, S. Walsh and A. Elsherbiny. 2013.<br />
‘Phylogenomic Clustering for Selecting Non-Redundant Genomes<br />
for Comparative Genomics.’ Bioinformatics 29.7: 947–949.<br />
doi:10.1093/bioinformatics/btt064.<br />
Höhl, M. and M. A. Ragan. 2007. ‘Is Multiple-Sequence Alignment<br />
Required for Accurate Inference of Phylogeny?’ Systematic biology<br />
56.2: 206–221. doi:10.1080/10635150701294741.<br />
Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity<br />
to alignment-free sequence comparison of protein families’.<br />
Benelux Bioinformatics Conference 2005.<br />
http://hdl.handle.net/2268/80179<br />
P33. THE ROLE OF MIRNAS IN ALZHEIMER’S DISEASE<br />
Ashley Lu 1,2* , Annerieke Sierksma 1,2 , Bart De Strooper 1,2 & Mark Fiers 1,2 .<br />
VIB Center for the Biology of Disease 1 ; KU Leuven Center for Human Genetics 2 . * ashley.lu@cme.vib-kuleuven.be<br />
MicroRNAs (miRNA) play an important role in post-transcriptional regulation and were shown to be dysregulated in<br />
Alzheimer’s disease. By analysing the hippocampal miRNA and mRNA expression of two mouse models of Alzheimer’s<br />
disease, we identify a set of miRNAs that are dysregulated with the onset of cognitive impairments. Using GO<br />
enrichment analysis we aim to identify miRNAs that likely play a role in learning and memory.<br />
INTRODUCTION<br />
MiRNAs are small non-coding RNAs involved in post-transcriptional<br />
regulation through mRNA inhibition or<br />
degradation. Past studies have suggested that miRNAs play<br />
a direct role in Alzheimer’s disease (AD), e.g. by<br />
modulating the expression of genes involved in the<br />
formation of neuropathological protein aggregates (Lau P<br />
& De Strooper B, 2010). In this study, we investigated the<br />
changes in miRNA and mRNA expression in two AD<br />
mouse models: APPswe/PS1 L166P (Radde R, 2006) and<br />
Thy-Tau22 (Schindowski K, 2006), which have similar<br />
patterns of cognitive impairment, but different pathology.<br />
We aim to better understand the functional role of<br />
miRNAs in AD-related cognitive impairments.<br />
METHODS<br />
RNA was extracted from the left hippocampus of 96 mice.<br />
The experiment covers the two models (APPswe/PS1 L166P<br />
& Thy-Tau22), with wild type controls for each. All<br />
genotypes were tested at two ages (4 and 10 months), i.e. before<br />
and after the onset of cognitive impairment. This yields eight<br />
experimental groups with twelve mice each.<br />
Expression profiles of miRNAs and mRNAs were<br />
generated using Illumina single-end sequencing.<br />
Differential Expression (DE) analysis was performed<br />
using the limma package of R/Bioconductor with a linear<br />
model to test the effects of age, genotype and their<br />
interaction.<br />
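For a single gene, the linear model described above can be written as shown below. The 0/1<br />
coding of the two factors is an assumption, since the abstract does not give the design matrix:<br />
y is the log-expression value, x_age = 1 for 10-month-old mice and x_gen = 1 for transgenic mice.<br />

```latex
y = \beta_0
  + \beta_{\mathrm{age}}\, x_{\mathrm{age}}
  + \beta_{\mathrm{gen}}\, x_{\mathrm{gen}}
  + \beta_{\mathrm{age}\times\mathrm{gen}}\, x_{\mathrm{age}}\, x_{\mathrm{gen}}
  + \varepsilon
```

Under this coding, the interaction coefficient measures how much the age effect differs between<br />
transgenic and wild type animals.<br />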
Functional analyses of the mRNAs and miRNAs were<br />
conducted separately. For mRNAs, gene ontology analysis<br />
was applied to sets of the most up- and down-regulated<br />
genes.<br />
To determine the functional impact of dysregulated<br />
miRNAs we determined which mRNAs are the most likely<br />
direct targets of each miRNA using the following<br />
approach: 1) For each miRNA we calculated the Pearson’s<br />
correlation coefficient to each mRNA based on the<br />
miRNA and mRNA expression data. 2) For each miRNA<br />
we extracted the predicted set of targets from Targetscan<br />
(Lewis BP & Burge CB & Bartel DP, 2005), with Diana<br />
(Maragkakis M et al. 2011) as backup when Targetscan<br />
had no record. 3) We filtered the miRNA target genes by<br />
determining the leading edge set in a GSEA PreRanked<br />
analysis (Subramanian A. et al, 2005) using the predicted<br />
target mRNAs of each miRNA against the mRNAs ranked<br />
according to the Pearson’s scores generated in step 1. We<br />
additionally investigated target sets based on a Pearson’s<br />
correlation coefficient cut-off of -0.2, -0.3, and -0.4. 4)<br />
Gene-ontology analysis was then applied to these<br />
candidate target sets to infer the likely biological function<br />
of each miRNA.<br />
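The correlation cut-off variant of this approach can be sketched as follows. This is a minimal<br />
illustration, not the authors' pipeline: the GSEA leading-edge step is omitted, and the function<br />
names, gene names and expression values are hypothetical.<br />

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def candidate_targets(mirna_expr, mrna_expr, predicted, cutoff=-0.3):
    """Predicted targets whose expression is anti-correlated with the miRNA."""
    scores = {gene: pearson(mirna_expr, expr) for gene, expr in mrna_expr.items()}
    kept = (g for g in predicted if g in scores and scores[g] <= cutoff)
    return sorted(kept, key=scores.get)  # most negative correlation first


# Hypothetical expression values across five samples.
mirna = [1, 2, 3, 4, 5]
mrnas = {
    "gene_a": [5, 4, 3, 2, 1],   # strongly anti-correlated
    "gene_b": [1, 2, 3, 4, 5],   # positively correlated
    "gene_c": [2, 1, 2, 1, 2],   # uncorrelated
    "gene_d": [5, 4, 3, 2, 1],   # anti-correlated, but not a predicted target
}
predicted_targets = {"gene_a", "gene_b", "gene_c"}

targets = candidate_targets(mirna, mrnas, predicted_targets, cutoff=-0.3)
```

Only genes that are both predicted targets and sufficiently anti-correlated are kept, and the<br />
resulting set is what would be passed on to gene-ontology analysis.<br />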
RESULTS & DISCUSSION<br />
DE analysis showed that the direction of expression level<br />
changes in mRNAs is similar between APPswe/PS1 L166P<br />
and Thy-Tau22 in terms of age*genotype interaction<br />
effects. However, for the miRNAs the expression pattern<br />
is less obvious. Overall, the effect size is more pronounced<br />
in APPswe/PS1 L166P mice than in Thy-Tau22 mice for both<br />
miRNAs and mRNAs.<br />
Functional analyses of the down-regulated mRNAs show a<br />
clear enrichment in cognition and neural development<br />
related categories, whereas up-regulated genes show a<br />
clear inflammatory signature.<br />
Combining miRNA target prediction with miRNA/mRNA<br />
correlation analysis shows a marked increase of GO<br />
enrichment scores. This analysis strongly suggests a<br />
regulatory role for miRNAs in the down-regulation of<br />
genes involved in learning, cognition and related<br />
categories.<br />
This analysis workflow has allowed us to focus on a list of<br />
miRNAs that likely play a direct role in the observed<br />
learning and memory deficits in AD mouse models, and<br />
has been used to select candidate miRNAs for<br />
downstream in vivo experiments, which will hopefully<br />
provide a deeper understanding of the impact of AD on<br />
learning and cognition.<br />
REFERENCES<br />
Lau P & De Strooper B. Seminars in Cell & Developmental Biology,<br />
21(7), 768–773, (2010).<br />
Radde R. EMBO reports, 7(9), 940–946, (2006).<br />
Schindowski K. The American Journal of Pathology, 169(2),599–616,<br />
(2006).<br />
Lewis BP & Burge CB & Bartel DP. Cell, 120,15-20 (2005).<br />
Maragkakis M et al. Nucleic Acids Research (2011)<br />
Subramanian A. et al. Proceedings of the National Academy of Sciences<br />
of the United States of America, 102(43), 15545–15550, (2005)<br />
P34. FUNCTIONAL SUBGRAPH ENRICHMENTS<br />
FOR NODE SETS IN REGULATORY NETWORKS<br />
Pieter Meysman 1,2* , Yvan Saeys 3,4 , Ehsan Sabaghian 5,6 , Wout Bittremieux 1,2 ,<br />
Yves van de Peer 5,6 , Bart Goethals 1 & Kris Laukens 1,2 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />
Antwerpen (biomina) 2 ; VIB Inflammation Research Center 3 ; Department of Respiratory Medicine, Ghent University 4 ;<br />
Department of Plant Biotechnology and Bioinformatics, Ghent University 5 ; Department of Plant Systems Biology,<br />
VIB/Ghent University 6 . * pieter.meysman@uantwerpen.be<br />
We have developed a subgroup discovery algorithm to find subgraphs in a single graph that are associated with a given<br />
set of nodes. The association between a subgraph pattern and a set of vertices is defined by its significant enrichment<br />
based on a Bonferroni-corrected hypergeometric probability value, and can therefore be considered as a network-focused<br />
extension of traditional gene ontology enrichment analysis. We demonstrate the operation of this algorithm by applying it<br />
to two transcriptional regulatory networks and show that we can find relevant functional subgraphs enriched for the<br />
selected nodes.<br />
INTRODUCTION<br />
Frequent subgraph mining (FSM) is a common but<br />
complex problem within the data mining field that has<br />
gained in importance as more graph data has become<br />
available. However, traditional FSM finds all frequent<br />
subgraphs within the graph dataset, while often a more<br />
interesting query is to find the subgraphs that are most<br />
associated with a specific set of nodes. Nodes of interest<br />
might be those that are associated with a specific disease,<br />
or those that are differentially expressed in an omics<br />
experiment.<br />
METHODS<br />
To address this issue, we developed a novel subgraph<br />
mining algorithm that can efficiently construct, match and<br />
test candidate subgraphs against the given graph for<br />
enrichment within a specific set of nodes (Meysman et al.<br />
2015). To allow the enrichment testing, each candidate<br />
subgraph is built around a ‘source’ node. A subgraph<br />
match where the source node corresponds to a node of<br />
interest is counted as a ‘hit’. If the source node is not a<br />
node of interest, it is counted as a background hit. In this<br />
manner the problem of enrichment can be easily tested<br />
using a hypergeometric test. Furthermore, we show that<br />
this definition of enrichment allows us to drastically prune<br />
the search space that the algorithm must traverse to find all<br />
enriched subgraphs.<br />
An implementation of the algorithm is available at<br />
http://adrem.ua.ac.be/sigsubgraph.<br />
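The enrichment test for one candidate subgraph can be sketched as below. This is a minimal<br />
illustration of a one-sided hypergeometric test with Bonferroni correction, not the released<br />
implementation; the function names, the counts and the number of tested subgraphs are<br />
hypothetical.<br />

```python
from math import comb


def hypergeom_pvalue(hits, matches, interest, total):
    """P(X >= hits) when `matches` source nodes are drawn from `total` nodes,
    of which `interest` are nodes of interest."""
    upper = min(matches, interest)
    tail = sum(comb(interest, k) * comb(total - interest, matches - k)
               for k in range(hits, upper + 1))
    return tail / comb(total, matches)


def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts: a graph of 10 nodes with 5 nodes of interest; the
# candidate subgraph matches at 5 source nodes, all of them nodes of interest.
p = hypergeom_pvalue(hits=5, matches=5, interest=5, total=10)
p_adjusted = bonferroni(p, n_tests=100)  # assuming 100 candidate subgraphs were tested
```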
RESULTS & DISCUSSION<br />
The first data set concerned the yeast genes that have<br />
remained in duplicate following the most recent whole<br />
genome duplication. Within the yeast transcriptional<br />
network, we found that these duplicate genes were<br />
enriched for self-regulating motifs (e.g. feedback loops,<br />
self edges, etc.), which matches the duplicated nature of<br />
these genes (Figure 1).<br />
FIGURE 1. Enriched subgraphs for yeast duplicated genes<br />
The second data set concerned mining the subgraphs<br />
associated with the homologs of the PhoR transcription<br />
factor across seven different inferred bacterial regulatory<br />
networks from Colombos expression data (Meysman et al.<br />
2014). These PhoR homologs were found to be<br />
significantly associated with several complex regulatory<br />
motifs.<br />
REFERENCES<br />
Meysman P et al. Discovery of Significantly Enriched<br />
Subgraphs Associated with Selected Vertices in a<br />
Single Graph. Proceedings of the 14th International<br />
Workshop on Data Mining in Bioinformatics (2015).<br />
Meysman P et al. COLOMBOS v2.0: an ever expanding<br />
collection of bacterial expression compendia. Nucleic<br />
acids research 42 (D1), D649-D653 (2014).<br />
P35. HUMANS DROVE THE INTRODUCTION & SPREAD OF<br />
MYCOBACTERIUM ULCERANS IN AFRICA<br />
Koen Vandelannoote 1,2,* , Conor Meehan 1* , Miriam Eddyani 1 , Dissou Affolabi 3 , Delphin Mavinga Phanzu 4 , Sara<br />
Eyangoh 5 , Kurt Jordaens 6 , Françoise Portaels 1 , Kirstie Mangas 7 , Torsten Seemann 7 , Herwig Leirs 2 , Tim Stinear 7 &<br />
Bouke C. de Jong 1 .<br />
Institute of Tropical Medicine, Antwerp, Belgium 1 ; Evolutionary Ecology Group, University of Antwerp, Antwerp,<br />
Belgium 2 ; Laboratoire de Référence des Mycobactéries, Cotonou, Benin 3 ; Institut Médical Evangélique, Kimpese,<br />
Democratic Republic of Congo 4 ; Centre Pasteur du Cameroun, Yaoundé, Cameroun 5 ; Joint Experimental Molecular<br />
Unit, Royal Museum for Central Africa, Tervuren, Belgium 6 ; Department of Microbiology and Immunology, University<br />
of Melbourne, Melbourne, Australia 7 . *cmeehan@itg.be<br />
Buruli ulcer (BU) is an insidious neglected tropical disease. BU is reported around the world but the rural regions of<br />
West and Central Africa are most affected. How BU is transmitted and spreads has remained a mystery, even though the<br />
causative agent, Mycobacterium ulcerans, has been known for more than 70 years. Here, using the tools of population<br />
genomics, we reconstruct the evolutionary history of M. ulcerans by comparing 167 isolates spanning 48 years and<br />
representing 11 endemic countries across Africa. The genetic diversity of African M. ulcerans proved very limited<br />
because of its slow substitution rate coupled with its recent origin. We show for the first time how M. ulcerans has<br />
existed in Africa for several hundreds of years but was recently re-introduced during the period of Neo-imperialism. We<br />
also provide evidence of the role that the so-called “Scramble for Africa” played in the spread of the disease.<br />
INTRODUCTION<br />
The clonal population structure of M. ulcerans has meant<br />
that conventional genetic fingerprinting methods have<br />
largely failed to differentiate clinical disease isolates,<br />
complicating molecular analyses aimed at elucidating the<br />
population structure and evolutionary history of the<br />
pathogen. Whole genome sequencing (WGS) is currently<br />
replacing conventional genotyping methods for M.<br />
ulcerans.<br />
METHODS<br />
We analyzed a panel of 165 M. ulcerans disease isolates<br />
originating from disease foci in 11 different African<br />
countries that had been cultured between 1964 and 2012.<br />
Index-tagged paired-end sequencing-ready libraries were<br />
prepared from gDNA extracts. Genome sequencing was<br />
performed on the Illumina HiSeq 2000 DNA sequencer or<br />
the Illumina MiSeq sequencing platform with 2x150 bp<br />
and 2x250 bp paired-end sequencing chemistry, respectively.<br />
Read mapping and SNP detection were performed using<br />
the Snippy v.2.6 pipeline. Bayesian model-based inference<br />
of the genetic population structure was performed using<br />
BAPS v.6.0 1 . Evidence for recombination between<br />
different BAPS-clusters was assessed using BRAT-NextGen 2 .<br />
We used BEAST2 v2.2.1 3 to date evolutionary<br />
events, determine the substitution rate and produce a time-tree<br />
of African M. ulcerans. A permutation test was used<br />
to assess the validity of the temporal signal in the data. To<br />
assess the geospatial distribution of African M. ulcerans<br />
through time, an additional BEAST2 analysis was<br />
performed with a discrete BSSVS geospatial model 4 .<br />
RESULTS & DISCUSSION<br />
Resulting sequence reads were mapped to the Ghanaian M.<br />
ulcerans Agy99 reference genome and, after excluding<br />
mobile repetitive elements and small indels, we detected a<br />
total of 9,193 SNPs randomly distributed across the M.<br />
ulcerans chromosome with approximately 1 SNP per 613<br />
bp (0.15% nucleotide divergence). We explored the<br />
distribution of DNA chromosomal deletions and identified<br />
differential genome reduction that strongly supports the<br />
existence of two specific M. ulcerans lineages within the<br />
African continent, hereafter referred to as Lineage Africa I<br />
(Mu_A1) and Lineage Africa II (Mu_A2). Subsequent<br />
SNP-based exploration of the genetic population structure<br />
agreed with the above deletion analysis and subdivided the<br />
African M. ulcerans population into four major clusters.<br />
BRAT-NextGen did not detect any recombined segments<br />
in any isolate, supporting a strongly clonal population<br />
structure for M. ulcerans that is evolving by vertically<br />
inherited mutations. Within the phylogenetic tree, isolates<br />
formed tight, shallow-rooted phylogenetic clusters which<br />
are suggestive of contemporary dispersal. We estimated a<br />
very slow mean genome-wide substitution rate of 6.32E-8<br />
per site per year. The Bayesian analysis demonstrated that<br />
Mu_A1 has existed in Africa for several hundreds of years<br />
and that Mu_A2 was recently introduced on the continent.<br />
The re-introduction event coincides well with a historical<br />
event of particular interest: the period of Neo-imperialism<br />
(1881-1914). Since tMRCA(Mu_A2) did not predate<br />
colonization, it seems very likely that lineage Mu_A2 was<br />
introduced after the instigation of colonial rule through an<br />
influx of BU-infected humans. The time-tree of African M.<br />
ulcerans also reveals evidence of the likely role that the<br />
so-called “Scramble for Africa” played in the spread of<br />
endemic Mu_A1 clones in three hydrological basins<br />
(Congo, Oueme & Nyong) that are particularly well<br />
covered by our isolate panel.<br />
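To give a sense of scale for the substitution rate reported above, the back-of-the-envelope<br />
calculation below converts it into substitutions per genome per year. The number of analysed<br />
sites is not stated in the abstract; approximating it as the SNP count multiplied by the<br />
reported SNP spacing is an assumption made for illustration.<br />

```python
SNPS = 9193          # SNPs reported in the abstract
BP_PER_SNP = 613     # "approximately 1 SNP per 613 bp"
RATE = 6.32e-8       # substitutions per site per year

# Assumption: the number of analysed sites is approximated as SNPS * BP_PER_SNP.
sites = SNPS * BP_PER_SNP
subs_per_genome_per_year = RATE * sites
years_per_substitution = 1 / subs_per_genome_per_year
```

Under this assumption, the rate corresponds to roughly one substitution per genome every<br />
three years.<br />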
REFERENCES<br />
1. Corander, J., et al. (2008) BMC bioinformatics. 9: p. 539.<br />
2. Marttinen, P., et al. (2012) Nucleic acids research. 40(1): p. e6.<br />
3. Bouckaert, R., et al. (2014) PLoS computational biology. 10(4): p.<br />
e1003537.<br />
4. Lemey, P., et al., (2009) PLoS computational biology. 5(9): p.<br />
e1000520.<br />
P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA<br />
DETECTION AND CLASSIFICATION IN PLANTS<br />
Lionel Morgado 1* & Frank Johannes 2,3 .<br />
Groningen Bioinformatics Centre (GBiC), University of Groningen 1 ; Department of Plant Sciences, Center of Life and<br />
Food Sciences Weihenstephan, Technical University Munich 2 ; Institute of Advanced Studies, Technical University<br />
Munich 3 . * lionelmorgado@gmail.com<br />
Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional<br />
silencing or the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a<br />
given sRNA is tightly linked to the Argonaute protein (AGO) to which it binds. High throughput sequencing in<br />
combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNA that are<br />
bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing<br />
data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our<br />
SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and<br />
classification in plants.<br />
INTRODUCTION<br />
Small RNA molecules are known to play an important<br />
role in the control of gene expression. It is therefore of great<br />
interest to be able to detect them and determine the<br />
regulatory pathways in which they are involved. With<br />
current laboratory methods it is infeasible to test the large<br />
number of sRNA candidates, but computational<br />
methods can greatly narrow down the list.<br />
Nevertheless, sRNA activity is still far from being fully<br />
understood, and that is reflected in the very high false<br />
positive rate of the prediction tools currently available.<br />
High-throughput sequencing in combination with<br />
immunoprecipitation (IP) techniques now makes it<br />
possible to access sRNA sequences associated with<br />
specific AGOs. AGO-sRNA binding is a fundamental step<br />
in the activation of specific silencing pathways. Here,<br />
AGO-sRNA data acquired from A. thaliana are explored<br />
with SVM-based algorithms to learn which sequence<br />
features drive different AGO-sRNA associations. Using<br />
this knowledge, a framework for in silico sRNA detection<br />
and classification in plants is presented.<br />
METHODS<br />
A system with 3 layers of classifiers (see Figure 1) was<br />
designed to identify different kinds of sRNA: the 1st layer<br />
includes a binary SVM model that filters out sequences<br />
that do not bind to AGO and are therefore most probably<br />
inactive; the 2nd layer is composed of an ensemble of binary<br />
classifiers, each trained to distinguish sRNA<br />
bound to a specific AGO from all others; and finally, the<br />
3rd layer comprises a multiclass linear model that assigns the<br />
most likely AGO to a given sRNA, using the scores produced<br />
in the previous layer.<br />
Diverse AGO-sRNA libraries from A. thaliana were<br />
explored, namely for AGO1, 2, 4, 5, 6, 7, 9 and 10.<br />
After the typical RNA-seq library preprocessing, quality<br />
check and genome mapping, several features were<br />
extracted from the remaining sequences, namely:<br />
position-specific base composition, sequence length, k-mer<br />
composition and entropy scores. The different feature sets<br />
were explored separately and in different combinations.<br />
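The feature families listed above can be illustrated with a minimal sketch. It is not the authors' extraction code: the function names, the choice of k = 2, the restriction to the first three positions and the example sequence are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

BASES = "ACGU"


def kmer_composition(seq, k=2):
    """Relative frequency of every possible k-mer over the ACGU alphabet."""
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(p): counts["".join(p)] / n_windows
            for p in product(BASES, repeat=k)}


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition of one sequence."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def position_one_hot(seq, n_positions=3):
    """One-hot encoding of the first n_positions bases, starting at the 5' end."""
    feats = {}
    for pos in range(n_positions):
        for base in BASES:
            hit = pos < len(seq) and seq[pos] == base
            feats[f"pos{pos + 1}_{base}"] = 1 if hit else 0
    return feats


def extract_features(seq, k=2, n_positions=3):
    """Combine length, entropy, positional and k-mer features in one dict."""
    feats = {"length": len(seq), "entropy": shannon_entropy(seq)}
    feats.update(position_one_hot(seq, n_positions))
    feats.update({f"kmer_{m}": v for m, v in kmer_composition(seq, k).items()})
    return feats


# Hypothetical 20-nt sequence, used only to exercise the functions.
features = extract_features("UGACAGAAGAGAGUGAGCAC")
print(features["length"], features["pos1_U"], round(features["entropy"], 3))
```

Each sequence becomes one fixed-length numeric vector, which is the form an SVM expects; the separate feature families can be switched on or off to mimic exploring them separately and in combination.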
Initially, highly correlated features (Pearson correlation > 0.75)<br />
were removed, and the remaining ones were further<br />
subjected to selection using SVM-RFE (Guyon et al.,<br />
2002) with a linear kernel to handle the large data set size.<br />
A 10-fold cross-validation procedure was run to<br />
account for the variation in the data; the best features<br />
of each round were taken to be the ones with the highest<br />
average weight across the models with the best ROC-AUC<br />
score in each cross-validation subset. In each round, the<br />
worst-performing 1/3 of the remaining features were<br />
eliminated, and the process was repeated until no more<br />
features were available. The best features found were then<br />
used to train the final classifiers using RBF kernels with<br />
optimal parameters. This was repeated for all models in<br />
layers 1 and 2.<br />
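Two of the steps above, the correlation pre-filter and the elimination schedule, can be sketched without any SVM library. This is an illustration only: taking the absolute value of the correlation, keeping the first feature of each correlated group, and always dropping at least one feature per round are assumptions that the text does not specify.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.75):
    """Greedy filter: keep a feature only if |r| <= threshold with every kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def rfe_schedule(n_features):
    """Number of features entering each round when the worst third is dropped."""
    sizes = []
    n = n_features
    while n > 0:
        sizes.append(n)
        n -= max(1, n // 3)  # assumption: drop at least one feature per round
    return sizes


# Toy feature columns (hypothetical values).
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
print(drop_correlated(columns))
print(rfe_schedule(9))
```

In the full procedure each round of the schedule would retrain linear SVMs under cross-validation and rank the features by weight; only the bookkeeping around that ranking is shown here.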
FIGURE 1. Proposed architecture for the SVM-based framework. Layer 1: AGO vs noAGO; Layer 2: AGO1 vs otherAGO,<br />
AGO2 vs otherAGO, …, AGO10 vs otherAGO; Layer 3: final AGO prediction.<br />
RESULTS & DISCUSSION<br />
Although the classifiers are still being optimized,<br />
preliminary results from the 2nd layer of the framework<br />
(see Figure 1) show that the features ranked highest by<br />
SVM-RFE do reflect biologically significant patterns of<br />
AGO-sRNA association. Among others, the relevance of<br />
the 5’ terminal nucleotide was observed, in agreement<br />
with findings from previous work (Mi et al., 2008).<br />
Additionally, the accuracy of the trained models<br />
ranges from 71% to 86%, showing their<br />
capacity to recognize specific AGO-binding patterns.<br />
REFERENCES<br />
Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn<br />
46:389-422 (2002).<br />
Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the<br />
5’ terminal nucleotide. Cell 133(1): 116-27 (2008).<br />
Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5:<br />
413, 209-215 (2014).<br />
P37. ANALYSIS OF RELATIONSHIP PATTERNS<br />
IN UNASSIGNED MS/MS SPECTRA<br />
Aida Mrzic 1,2* , Wout Bittremieux 1,2 , Trung Nghia Vu 4 , Dirk Valkenborg 3,5,6 , Bart Goethals 1 & Kris Laukens 1,2 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />
Antwerpen (biomina) 2 ; Flemish Institute for Technological Research (VITO), Mol 3 ; Karolinska Institutet, Stockholm 4 ;<br />
CFP, University of Antwerp 5 ; I-BioStat, Hasselt University 6 . * aida.mrzic@uantwerpen.be<br />
Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of<br />
unexplained peaks, despite continuous improvements in search engines. Here we use a pattern mining technique to determine<br />
the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed<br />
post-translational modifications (PTMs).<br />
INTRODUCTION<br />
Despite being a rich source of information, mass<br />
spectra acquired in mass spectrometry proteomics<br />
experiments often contain a significant number of<br />
unexplained peaks, or even remain completely<br />
unidentified. The unexplained fraction of mass spectra<br />
may come from low-quality or chimeric MS/MS spectra,<br />
or from unexpected PTMs. To interpret the unexplained data,<br />
we propose a structured analysis of the peaks occurring in<br />
MS/MS spectra. We employ an unsupervised pattern<br />
mining technique (Naulaerts et al., 2015) to discover<br />
which peaks are associated with each other, and are therefore<br />
likely to have a common origin.<br />
METHODS<br />
Frequent itemset mining<br />
The technique we used to discover relationships between<br />
frequently co-occurring peaks in MS/MS data is frequent<br />
itemset mining, a class of data mining techniques that is<br />
specifically designed to discover co-occurring items in<br />
transactional datasets. The typical example of frequent<br />
itemset mining is the discovery of sets of products that are<br />
frequently bought together. Here, every set of products<br />
purchased together represents a single transaction, which<br />
results in a dataset consisting of a large number of<br />
supermarket basket transactions that can be mined for<br />
frequent patterns (Figure 1). In our approach a transaction<br />
consists of the mass differences between relevant peaks in<br />
the MS/MS spectrum.<br />
FIGURE 1. Frequent itemset mining principle.<br />
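The supermarket example can be made concrete with a small sketch. It counts itemsets by brute-force enumeration rather than with the candidate pruning that dedicated frequent itemset miners use, and the baskets, the function name and the 0.5 support threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets up to max_size items.

    Support is the fraction of transactions that contain the whole itemset.
    Brute-force enumeration: fine for toy data, too slow for real datasets.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical supermarket baskets, one list per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "bread"],
    ["butter", "milk"],
]

for itemset, support in sorted(frequent_itemsets(baskets, 0.5).items()):
    print(itemset, support)
```

Replacing the products with binned mass differences gives the setting used in this work, where each spectrum plays the role of one basket.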
Mass differences associations<br />
In order to detect relationships between different types of<br />
mass spectrometry peaks, a distinction is made between<br />
peaks that were relevant for spectrum identification<br />
(assigned peaks) and peaks that were not used for the<br />
identification (unassigned peaks) (Vu et al., 2014). The<br />
mass differences between peaks (either assigned,<br />
unassigned, or both) are then calculated so that for each<br />
MS/MS spectrum in the dataset there is a single<br />
transaction consisting of all its mass differences.<br />
After obtaining these transactions for all MS/MS spectra<br />
in the dataset, frequent itemset mining can be employed to<br />
detect relationship patterns (Figure 2). These patterns can<br />
indicate previously unknown characteristics of the spectra,<br />
or even detect novel PTMs.<br />
FIGURE 2. Outline of the approach.<br />
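How a spectrum becomes a transaction can be sketched as follows. The abstract does not say how mass differences are discretised, so the 0.01 bin width, the 300 upper limit and the treatment of peak positions as masses (that is, singly charged ions) are assumptions, as are the toy peak lists.

```python
from collections import Counter
from itertools import combinations


def mass_difference_items(peaks, bin_width=0.01, max_diff=300.0):
    """Turn one peak list into a set of binned pairwise mass differences.

    Each item is an integer bin index, so a difference of 79.97 with a
    bin width of 0.01 becomes the item 7997.
    """
    items = set()
    for low, high in combinations(sorted(peaks), 2):
        diff = high - low
        if diff <= max_diff:
            items.add(round(diff / bin_width))
    return items


def item_support(transactions):
    """Fraction of transactions in which each single item occurs."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


# Toy peak lists (hypothetical values), one per MS/MS spectrum.
spectra = [
    [100.0, 179.97, 300.0],
    [250.0, 329.97],
    [150.0, 400.0],
]
transactions = [mass_difference_items(s) for s in spectra]
support = item_support(transactions)
print(sorted(transactions[0]), round(support[7997], 3))
```

A mass difference that recurs across many spectra, like item 7997 here, is the kind of pattern that a frequent itemset miner would report and that could then be checked against known PTM masses.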
RESULTS & DISCUSSION<br />
In order to evaluate our approach, we used MS/MS<br />
datasets from the PRoteomics IDEntifications (PRIDE)<br />
database (Vizcaino et al., 2013). This database contains a<br />
large number of publicly available datasets from<br />
mass spectrometry-based proteomics experiments. However, the<br />
quality of the submitted datasets varies widely,<br />
which makes the database a suitable candidate for our<br />
pattern mining approach.<br />
Preliminary results show that the detected patterns are able<br />
to capture valid information in a spectrum. The obtained<br />
patterns indicate peaks originating from the same peptide<br />
in the case of chimeric spectra, and mass differences<br />
originating from common PTMs.<br />
REFERENCES<br />
Naulaerts et al. Brief Bioinform, 16(2): 216–231 (2015).<br />
Vizcaino et al. Nucleic Acids Res, 41(D1):D1063-9 (2013).<br />
Vu et al. Proteome Science, 12:54 (2014).<br />
P38. MINING ACROSS “OMICS” DATA FOR DRUG PRIORITIZATION<br />
Stefan Naulaerts 1,2* , Pieter Meysman 1,2 , Bart Goethals 1 , Wim Vanden Berghe 3 & Kris Laukens 1,2 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Biomedical informatics research center<br />
Antwerpen (biomina) 2 ; Department for Biomedical Sciences, University of Antwerp 3 . * stefan.naulaerts@uantwerpen.be<br />
Drug resistance and response have traditionally been investigated by means of case-by-case studies. The process of<br />
profiling drug compounds is time and resource intensive. Large-scale information on gene expression and protein<br />
abundance, protein interactions, and functional and pathway annotations exists nowadays, as do freely<br />
accessible repositories for drug targets. Structural evidence for selected drug compounds is also publicly available. These<br />
data offer an enormous opportunity for data integration and pattern mining efforts across each of these levels. Here, we<br />
apply frequent itemset mining to identify structurally similar compounds, and to detect patterns within the biological<br />
effect profiles of these chemical compound families. Next, we explore how we can link both types of patterns to meta-information<br />
(such as drug interactions) in a bid to identify promising compounds and speed up the drug discovery<br />
process by means of candidate prioritization.<br />
INTRODUCTION<br />
In the last decades, several widely used databases have<br />
emerged. These vary from gene expression data and mass-spectrometric<br />
protein identifications to resources covering<br />
interaction graphs or functional annotations of proteins<br />
and chemicals.<br />
The presence of these resources offers interesting<br />
opportunities to gain deeper insight into drug mode of action<br />
and to help reduce important bottlenecks in<br />
the speed of novel drug discovery or drug repurposing,<br />
by intelligently prioritizing potentially interesting<br />
compounds.<br />
METHODS<br />
To integrate the listed kinds of data, we use pattern mining<br />
methods that are collectively known as “frequent itemset<br />
mining”. This set of techniques uses clever heuristics to<br />
efficiently find sets of items that occur together more often than a<br />
minimum support threshold. In this work, we identified several<br />
pattern types based on their source:<br />
• Expression itemsets<br />
• Metadata itemsets<br />
• Graph patterns (protein-protein, protein-drug and<br />
chemical structures)<br />
For subgraph mining, we used GASTON [1]. All other data<br />
sources were analysed with Apriori [2].<br />
To deal with the extreme numbers of patterns that result<br />
from mining this kind of data, we used a filter that<br />
incorporates several quality measures, based both on objective<br />
data mining measures (e.g. lift) and on more<br />
biologically inspired methods (e.g. functional coherence in<br />
the Gene Ontology [3] tree).<br />
Simple classification based on the patterns was performed<br />
with CBA [4].<br />
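Lift, the objective quality measure named above, compares how often two itemsets co-occur with how often they would co-occur if they were independent. The sketch below uses the standard definition; the transactions and item names are hypothetical, and nothing here reproduces the actual filter or its cut-offs.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(wanted <= set(t) for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """lift(A -> B) = supp(A and B) / (supp(A) * supp(B)).

    A value above 1 means A and B co-occur more often than expected
    under independence; a value near 1 suggests an uninformative pattern.
    """
    s_a = support(transactions, antecedent)
    s_b = support(transactions, consequent)
    if s_a == 0 or s_b == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / (s_a * s_b)


# Hypothetical expression itemsets, one set per experiment.
experiments = [
    {"geneA_up", "geneB_up"},
    {"geneA_up", "geneB_up"},
    {"geneA_up"},
    {"geneC_up"},
]
print(round(lift(experiments, {"geneA_up"}, {"geneB_up"}), 3))
```

A filter of the kind described would keep only patterns whose lift exceeds some cut-off and then apply the biologically motivated measures to what remains.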
RESULTS & DISCUSSION<br />
We were able to identify several backbone patterns within<br />
the chemical structures studied and used these to define<br />
“chemical compound families”. Next, we used this<br />
classification as starting point to group experimental<br />
evidence (bio-assays, interactions and metadata). After<br />
applying cut-offs based on the quality measures, all<br />
patterns remaining were significant and made sense<br />
biologically.<br />
Unsurprisingly, structurally similar compound families<br />
show significant pattern overlaps in drug-drug interactions,<br />
gene expression, term co-occurrence and conserved<br />
protein-protein interactions. We found that specific<br />
patterns in the biological profile often correlate with<br />
specific discriminative structural patterns. Moreover, these<br />
collections of structural frequent subgraphs seemed highly<br />
relevant for the mode in which a compound connects to<br />
the “core” proteome. This central proteome performs<br />
essential functions of the cell (e.g. energy metabolism) and<br />
it is known to be conserved across cell types. Structurally<br />
distinct compound families converge much later (if at all)<br />
to the same “core proteins” than more similar chemicals<br />
do. This observation corresponds to currently known<br />
pathway knowledge and tissue biology.<br />
We were further able to associate previously unseen<br />
compounds to chemicals present in the database, based on<br />
the subgraph collection and by extension to the biological<br />
profile patterns. Manual survey of literature indicated that<br />
several compounds not covered by our database have<br />
recently been approved or are in testing as alternative<br />
drugs to the compounds we hypothesized as being<br />
substantially similar.<br />
FIGURE 1. Visualizing the dexamethasone environment. Both predictions<br />
and experimental evidence (drug-target and protein-protein interactions)<br />
are shown.<br />
REFERENCES<br />
1. Nijssen S & Kok J. ENTCS 127, 77-87 (2005).<br />
2. Agrawal R & Srikant R. Proc 20th Int Conf on Very Large Databases<br />
(1994).<br />
3. Ashburner M et al. Nat Genet 25, 25-29 (2000).<br />
4. Liu B et al. KDD (1998).<br />
P39. ABUNDANT TRANS-SPECIFIC POLYMORPHISM AND A COMPLEX<br />
HISTORY OF NON-BIFURCATING SPECIATION IN THE GENUS<br />
ARABIDOPSIS<br />
Polina Novikova 1 , Nora Hohmann 2 , Marcus Koch 2 & Magnus Nordborg 1 .<br />
Gregor Mendel Institute, Austrian Academy of Sciences, Vienna Biocenter (VBC), A-1030 Vienna, Austria 1 ; Centre for<br />
Organismal Studies Heidelberg, University of Heidelberg, D-69120 Heidelberg, Germany 2 .<br />
*magnus.nordborg@gmi.oeaw.ac.at<br />
The prevailing notion of species rests on the concept of reproductive isolation. Under this model, sister taxa should not<br />
share genetic variation unless they still hybridize, or diverged too recently for genetic drift to have eliminated shared<br />
ancestral polymorphism, and gene trees should generally agree with species trees. Advances in sequencing technology<br />
are finally making it possible to evaluate this model. We sequenced (Illumina 100bp paired reads) multiple individuals<br />
from 26 proposed taxa in the genus Arabidopsis. Cluster analysis identified seven distinct groups, corresponding to four<br />
common species — the model species A. thaliana, plus A. arenosa, A. halleri and A. lyrata — and three species with<br />
very limited geographical distribution. However, at the level of gene trees, only the separation of A. thaliana from the<br />
remaining taxa was universally supported, and even in this case there was abundant sharing of ancestral polymorphism<br />
with the other taxa, demonstrating that reproductive isolation must be fairly recent. By considering the distribution of<br />
derived alleles, we were also able to reject a bifurcating species tree because there is clear evidence for asymmetrical<br />
gene flow between taxa. Finally, we show that the pattern of sharing and divergence between taxa differs between gene<br />
ontologies, suggesting a role for selection.<br />
P40. RIBOSOME PROFILING ENABLES THE DISCOVERY OF SMALL OPEN<br />
READING FRAMES (SORFS), A NEW SOURCE OF BIOACTIVE PEPTIDES<br />
Volodimir Olexiouk 1,* , Jeroen Crappé 1 , Steven Verbruggen 1 & Gerben Menschaert 1,* .<br />
Lab of Bioinformatics and Computational Genomics (BioBix), Department of Mathematical Modelling, Statistics and<br />
Bioinformatics, Faculty of Bioscience Engineering, Ghent University 1 .<br />
INTRODUCTION<br />
Evidence for micropeptides, defined as translation<br />
products from small open reading frames (sORFs), has<br />
recently emerged. While limitations of<br />
sequencing technologies as well as proteomics<br />
long stalled the discovery of micropeptides, the advent of<br />
ribosome profiling (RIBO-SEQ), a next-generation<br />
sequencing technique that reveals the translation machinery<br />
at sub-codon resolution, provided evidence in favor<br />
of translated sORFs. RIBO-SEQ captures and<br />
subsequently sequences the ~30 nt mRNA fragments<br />
protected by ribosomes, providing a means to identify<br />
translated sORFs that possibly encode functional<br />
micropeptides. Since the advent of ribosome profiling,<br />
several micropeptides with important cellular<br />
functions have been described (e.g. Toddler, Pri-peptides,<br />
Sarcolipin and Myoregulin).<br />
METHODS<br />
RIBO-SEQ allows the identification of sORFs with<br />
ribosomal activity; however, in order to further assess the<br />
coding potential (the potential of sORFs to truly encode<br />
functional micropeptides), downstream analysis is<br />
necessary. Here we propose a pipeline that starts from<br />
RIBO-SEQ, implements state-of-the-art tools and metrics<br />
for assessing the coding potential of sORFs and creates a list<br />
of candidate sORFs for downstream analysis (e.g.<br />
proteomic identification). In summary, assessment of the<br />
coding potential includes: PhyloCSF (conservation<br />
analysis), FLOSS-score (ribosome-protected fragment<br />
(RPF) length distribution analysis), ORFscore (distribution<br />
analysis of RPFs towards the first frame of a coding<br />
sequence (CDS)), BLASTp (sequence similarity) and VarAn<br />
(genetic variation analysis). In an attempt to set a<br />
community standard and to make sORFs accessible<br />
to a larger audience, a public database (www.sorfs.org) is<br />
provided in which publicly available datasets were processed<br />
by this pipeline, allowing users to browse, query and<br />
export identified ORFs. Furthermore, a PRIDE-respin<br />
pipeline was developed in order to periodically search the<br />
PRIDE database for proteomic evidence.<br />
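The idea behind a frame-distribution metric such as the ORFscore can be illustrated with a short sketch. The abstract does not give a formula, so the log-scaled chi-square form below, the sign convention and the function names are assumptions about how such a score is commonly computed, not the pipeline's implementation.

```python
import math


def frame_counts(rpf_positions, cds_start):
    """Count RPF positions falling in each of the three reading frames."""
    counts = [0, 0, 0]
    for pos in rpf_positions:
        counts[(pos - cds_start) % 3] += 1
    return counts


def frame_preference_score(counts):
    """Log-scaled chi-square deviation from a uniform frame distribution.

    Assumed form: log2(chi2 + 1), made negative when the first frame is
    not the most populated one. Uniform counts give a score of 0.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    expected = total / 3
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    sign = -1 if counts[0] < max(counts[1], counts[2]) else 1
    return sign * math.log2(chi2 + 1)


# Hypothetical RPF positions for a candidate sORF starting at position 100.
counts = frame_counts([100, 103, 106, 107], cds_start=100)
print(counts, round(frame_preference_score(counts), 3))
```

Actively translated ORFs are expected to concentrate reads in the first frame and therefore score high, whereas background signal spread evenly over the three frames scores near zero.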
RESULTS & DISCUSSION<br />
The pipeline has been tested and curated on three different<br />
cell lines: HCT116 (human), E14<br />
mESC (mouse) and S2 (fruit fly). The results obtained<br />
were similar to those reported in recent<br />
literature, which supports its relevance. All metrics stated<br />
above have been carefully inspected for their biological<br />
relevance and contributed significantly to the detection of<br />
sORFs. The pipeline is currently being finalized but<br />
is available upon request. The public repository is<br />
accessible at http://www.sorfs.org and includes the<br />
datasets mentioned above, resulting in 263354 sORFs. Two<br />
querying interfaces were implemented: a default query<br />
interface intended for browsing sORFs and a BioMart<br />
query interface for advanced querying and export<br />
functions. Each sORF has its own detail page, which visualizes<br />
the metrics discussed above and the ribosome profiling data;<br />
a link to the UCSC browser is also provided for visualizing<br />
the RIBO-SEQ data.<br />
REFERENCES<br />
Pauli,A., Norris,M.L., Valen,E., Chew,G.-L., Gagnon,J. a,<br />
Zimmerman,S., Mitchell,A., Ma,J., Dubrulle,J., Reyon,D., et al.<br />
(2014) Toddler: an embryonic signal that promotes cell movement<br />
via Apelin receptors. Science, 343, 1248636.<br />
Crappé,J., Ndah,E., Koch,A., Steyaert,S., Gawron,D., De Keulenaer,S.,<br />
De Meester,E., De Meyer,T., Van Criekinge,W., Van Damme,P., et<br />
al. (2014) PROTEOFORMER: deep proteome coverage through<br />
ribosome profiling and MS integration. Nucleic Acids Res.,<br />
10.1093/nar/gku1283.<br />
Ingolia,N.T. (2014) Ribosome profiling: new views of translation, from<br />
single codons to genome scale. Nat. Rev. Genet., 15, 205–13.<br />
Crappé,J., Van Criekinge,W., Trooskens,G., Hayakawa,E., Luyten,W.,<br />
Baggerman,G. and Menschaert,G. (2013) Combining in silico<br />
prediction and ribosome profiling in a genome-wide search for novel<br />
putatively coding sORFs. BMC Genomics, 14, 648.<br />
Chanut-Delalande,H., Hashimoto,Y., Pelissier-Monier,A., Spokony,R.,<br />
Dib,A., Kondo,T., Bohère,J., Niimi,K., Latapie,Y., Inagaki,S., et al.<br />
(2014) Pri peptides are mediators of ecdysone for the temporal<br />
control of development. Nat. Cell Biol., 16<br />
P41. RIGAPOLLO, A HMM-SVM BASED APPROACH TO SEQUENCE<br />
ALIGNMENT<br />
Gabriele Orlando 1,2,3,4 , Wim Vranken 1,2,3 & Tom Lenaerts 1,4,5 .<br />
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, CP 263 1 ; Structural<br />
Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2 2 ; Structural Biology Research Center, VIB, 1050 Brussels,<br />
Belgium 3 ; Machine Learning group, Université Libre de Bruxelles, Brussels, 1050, Belgium 4 ; Artificial Intelligence<br />
lab, Vrije Universiteit Brussel, Brussels, 1050, Belgium 5 .<br />
INTRODUCTION<br />
Obtaining reliable protein alignments is a central problem for<br />
many bioinformatics applications, such as homology modelling.<br />
Over the years many different algorithms have been<br />
developed and different kinds of information have been<br />
used to align very divergent sequences [1]. Here we<br />
present a pairwise alignment tool, called Rigapollo, based<br />
on a pairwise HMM-SVM, which includes backbone<br />
dynamics predictions [2] in the alignment process: recent<br />
work suggests that protein backbone dynamics is often<br />
evolutionarily conserved and contains information<br />
orthogonal to amino acid conservation.<br />
METHODS<br />
Rigapollo uses a pairwise HMM-SVM alignment<br />
approach to infer the optimal alignment between two<br />
proteins, taking into consideration both sequence and<br />
dynamics information. The model (described in Figure 1) is<br />
composed of 3 states: M (match), G1 (gap in the first<br />
sequence) and G2 (gap in the second sequence). The<br />
transition probabilities are defined in the same way as in a<br />
standard HMM. The alignment tool is further<br />
designed in the following manner:<br />
Defining the N-dimensional feature vectors:<br />
Each amino acid in the sequences is described by an<br />
N-dimensional feature vector. That vector can be defined<br />
using any kind of information, ranging from evolutionary<br />
information (e.g. a PSSM calculated with HHblits [3]) to<br />
dynamics predictions (using the DynaMine predictor [2]).<br />
While standard pairwise HMMs require the definition of a<br />
finite and discrete alphabet of observable states, our model<br />
works directly on these feature vectors (which may or may<br />
not be orthonormal), evaluating the emission<br />
probability with a support vector machine (SVM).<br />
Definition of the emission probability:<br />
We define the emission probability using an SVM trained<br />
to discriminate matches from mismatches. We define as<br />
matches all the positions in the reference pairwise<br />
alignments that do not contain gaps and we use the<br />
concatenation of the previously defined feature vectors to<br />
describe them. These matches are considered positive hits.<br />
For the mismatches, we perform the same<br />
procedure, but couple positions that, in the reference<br />
alignment, are shifted by a number of amino acids varying<br />
between 5 and 10. After training, the predicted<br />
emission probability for the M state, given the<br />
concatenation of two feature vectors, is a function of<br />
the distance from the decision hyperplane of the SVM<br />
(called f(D)). The corresponding emission probabilities for<br />
the states G1 and G2 are modeled as 1-f(D).<br />
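The construction of training pairs and the mapping from SVM distance to emission probability can be sketched as follows. The abstract does not state the form of f(D), so the logistic function used here is an assumption, as are the random choice of shift direction, the function names and the toy alignment.

```python
import math
import random


def training_pairs(aligned_a, aligned_b, min_shift=5, max_shift=10, seed=0):
    """Build (i, j, label) residue-index pairs from one gapped reference alignment.

    label 1: residues i and j share a gap-free column (match).
    label 0: the same residue i coupled with a residue of the second
             sequence shifted by 5 to 10 positions (mismatch).
    """
    rng = random.Random(seed)
    len_b = sum(c != "-" for c in aligned_b)
    pairs = []
    i = j = 0  # residue indices in the ungapped sequences
    for ca, cb in zip(aligned_a, aligned_b):
        if ca != "-" and cb != "-":
            pairs.append((i, j, 1))
            shift = rng.randint(min_shift, max_shift) * rng.choice((-1, 1))
            if 0 <= j + shift < len_b:
                pairs.append((i, j + shift, 0))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs


def match_emission(distance, slope=1.0):
    """f(D): squash the signed SVM decision distance into (0, 1). Assumed logistic."""
    return 1.0 / (1.0 + math.exp(-slope * distance))


def gap_emission(distance, slope=1.0):
    """Emission for the gap states G1 and G2, modeled as 1 - f(D)."""
    return 1.0 - match_emission(distance, slope)


# Toy reference alignment with a single gap in the second sequence.
pairs = training_pairs("ACDEFGHIKLMNPQ", "ACD-FGHIKLMNPQ")
print(len(pairs), match_emission(0.0), round(gap_emission(2.0), 3))
```

In the full method the feature vectors of residues i and j would be concatenated and passed to the SVM, whose decision value supplies the distance D used above.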
RESULTS & DISCUSSION<br />
To evaluate the performance of Rigapollo, we<br />
adopted two publicly available subsets of the BAliBASE and<br />
SABmark alignment datasets that have already been used to evaluate<br />
other pairwise alignment tools [1]. All pairwise alignments<br />
were extracted from the MSAs, and those<br />
whose percentage of sequence identity equalled the median<br />
of the full database were put in the subset.<br />
The datasets consist of 38 and 123 manually<br />
curated, structure-based pairwise alignments, respectively, and they<br />
share very low sequence identity. Performance<br />
was evaluated with a randomized 10-fold cross-validation.<br />
Rigapollo increases the quality of low-sequence-identity<br />
pairwise alignments by 5 to 10% with respect to<br />
state-of-the-art methods, and it appears that the<br />
increase in performance is more marked for very<br />
divergent sequences, such as those in the<br />
SABmark dataset, where the dynamics information seems<br />
to significantly increase the quality of the alignment. This<br />
is probably because dynamics are often well<br />
conserved in functional patterns, even when the sequence<br />
is not preserved [2].<br />
Figure 1: Structure of the pairwise HMM-SVM model<br />
REFERENCES<br />
[1] Do, Chuong B., et al. Research in Computational Molecular Biology.<br />
Springer Berlin Heidelberg, 2006.<br />
[2] Cilia, Elisa, et al. Nucleic acids research 42.W1 (2014): W264-W270.<br />
[3] Remmert, Michael, et al. Nature methods 9.2 (2012): 173-175.<br />
P42. EARLY FOLDING AND LOCAL INTERACTIONS<br />
R. Pancsa 1 , M. Varadi 1 , E. Cilia 2,3 , D. Raimondi 1,2,3 & W. F. Vranken 1,3,* .<br />
Structural Biology Research Centre, VIB and Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium 1 ;<br />
Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium 2 ; Interuniversity Institute of Bioinformatics<br />
in Brussels (IB)², Brussels, Belgium 3 . * wvranken@vub.ac.be<br />
INTRODUCTION<br />
Protein folding is in its early stages largely determined by<br />
the protein sequence and complex local interactions<br />
between amino acids, resulting in the formation of foldons<br />
that provide the context for further folding into the native<br />
state. These early folding processes are therefore<br />
important to understand subsequent folding steps and their<br />
influence on, for example, aggregation, but they are<br />
difficult to study experimentally. We here address this<br />
issue computationally by assembling a<br />
dataset of early folding residues from hydrogen-deuterium<br />
exchange (HDX) data from NMR and MS, and analysing<br />
how they relate to the sequence-based backbone dynamics<br />
predictions from DynaMine (Cilia et al. 2013, 2014) and<br />
to evolutionary information from multiple sequence<br />
alignments.<br />
METHODS<br />
We assembled a dataset of HDX experimental data from<br />
NMR and MS from the literature for 57 proteins totalling<br />
4172 residues. The data were classified into early,<br />
intermediate and late classes depending on the folding<br />
time at which protection of the backbone NH was observed,<br />
and into strong, medium and weak classes depending on<br />
how long the amides remain protected upon unfolding the<br />
native state. This resulted in 219 residue sets that are<br />
organised in XML files and loaded into a database that is<br />
made available online via http://start2fold.eu.<br />
The DynaMine predictions were run locally with a new<br />
version of the software that handles C- and N-terminal<br />
effects. These original predictions were then normalised<br />
by shifting them so that the maximum prediction value for<br />
each protein is always 1.0. This does not affect the relative<br />
differences between the prediction values within each<br />
protein, but effectively normalises the values between<br />
different proteins. MSAs were generated for each<br />
sequence in the dataset using HHblits and Jackhmmer with<br />
3 iterations and an E-value threshold of 10^-4. All the retrieved<br />
homologs have at least 90% coverage of the query<br />
sequence. Using HHfilter, a post-processing tool<br />
provided in the HHblits package, we built two different<br />
sets of MSAs by varying the maximum pairwise sequence<br />
identity threshold between the collected homologs in each<br />
MSA. DynaMine predictions for the (ungapped) sequences in the MSAs were<br />
made without normalisation in order to preserve the<br />
differences within a protein family, and mapped back to<br />
the full (gapped) MSA.<br />
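The shift normalisation of the per-protein predictions described above amounts to adding one constant to every residue of a protein. A minimal sketch, with a hypothetical function name and made-up prediction values:

```python
def shift_normalise(predictions):
    """Shift per-residue values so that the protein's maximum becomes 1.0.

    Adding a single offset leaves all within-protein differences unchanged.
    """
    offset = 1.0 - max(predictions)
    return [p + offset for p in predictions]


# Hypothetical per-residue backbone rigidity predictions for two proteins.
protein_a = [0.60, 0.80, 0.90]
protein_b = [0.40, 0.55, 0.70]

for values in (protein_a, protein_b):
    print([round(v, 2) for v in shift_normalise(values)])
```

After the shift, the most rigid residue of every protein sits at 1.0, so values can be pooled across proteins while the shape of each profile is preserved.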
RESULTS & DISCUSSION<br />
Our analysis shows that the DynaMine-predicted rigidity<br />
of the protein backbone represents where the protein is<br />
likely to adopt specific lower free energy conformations<br />
based on sequence-encoded local interactions, as<br />
evidenced by the HDX data on early folding (Figure 1).<br />
This effect is also present on a per-residue basis.<br />
FIGURE 1. Distribution of DynaMine predictions for early folding<br />
residues (green) and non-early folding residues (brown) for the original<br />
(left) and normalized (right) values.<br />
When relating the secondary structure elements as<br />
observed in the native fold to the early folding residues,<br />
we observe that the ‘early folding’ secondary structure<br />
elements also tend to be more rigid overall. Finally, we<br />
examined whether early folding is conserved in evolution<br />
on the basis of multiple sequence alignments. Although<br />
there is no conservation of individual amino acids, the<br />
physical characteristic of a rigid backbone seems to be<br />
conserved.<br />
We therefore propose that the backbone dynamics of the<br />
protein is a fundamental physical feature conserved by<br />
proteins that can provide important insights into their<br />
folding mechanisms and stability.<br />
REFERENCES<br />
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013).<br />
From protein sequence to dynamics and disorder with DynaMine.<br />
Nature Communications, 4, 2741.<br />
http://doi.org/10.1038/ncomms3741<br />
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2014).<br />
The DynaMine webserver: predicting protein dynamics from<br />
sequence. Nucleic Acids Research, 12(Web Server), W264–W270.<br />
http://doi.org/10.1093/nar/gku270<br />
RESULTS & DISCUSSION<br />
P43. BINDING SITE SIMILARITY DRUG REPOSITIONING:<br />
A GENERAL AND SYSTEMATIC METHOD FOR DRUG DISCOVERY<br />
AND SIDE EFFECTS DETECTION<br />
Daniele Parisi & Yves Moreau.<br />
I developed a protocol based on the prediction of druggable cavities, the comparison of these putative binding sites, and<br />
cross-docking between bound ligands and the binding sites found to be similar to that of the complex, in order to study the<br />
cross-reactivity of known compounds. The method is general because it applies both to drug repositioning<br />
and to the study of adverse effects, and systematic because it consists of several consecutive steps. It indicates which<br />
ligands to screen, reducing the number of candidates and saving companies and universities the time and money<br />
spent on unnecessary tests.<br />
INTRODUCTION<br />
The ability of small molecules to interact with multiple<br />
proteins is referred to as polypharmacology [1], and the<br />
strategy that aims to exploit the positive aspects of<br />
polypharmacology is drug repositioning, whereby existing<br />
drugs are investigated for efficacy against targets for other<br />
indications. Existing drugs are privileged structures with<br />
verified bioavailability and compatibility. Furthermore,<br />
virtual screening makes it possible to reposition<br />
existing drugs against novel disease targets without the<br />
expense of purchasing thousands of compounds [2]. The<br />
combination of structure-based virtual screening (such as<br />
estimation of similarity of protein-ligand binding sites and<br />
consequent cross-docking) and drug repositioning<br />
represents a highly efficient and fast methodology for<br />
predicting cross-reactivity and putative side effects of drug<br />
candidates [3].<br />
METHODS<br />
Each step of my work relies on a bioinformatics<br />
technique or tool, so the protocol couples several<br />
software packages.<br />
1. Choice of the query (a single protein as a PDB file)<br />
and the templates (a set of PDB structures). At least<br />
one of the two must have a ligand bound in a<br />
cavity.<br />
2. Prediction of druggable cavities in all the protein<br />
structures using a geometry-based or an energy-based<br />
algorithm (here Fpocket, a geometry-based tool).<br />
3. Comparison of the query binding sites with the binding<br />
sites of the templates to assess their similarity, using<br />
an alignment-based or alignment-free<br />
algorithm (here Apoc, an alignment-based tool).<br />
4. Cross-docking of the ligand available in a pair of<br />
similar binding sites into the other cavity, in order to<br />
study its binding to a different target for toxicity or<br />
new therapeutic indications (AutoDock Vina).<br />
5. Fingerprinting of the new ligand-cavity complex to<br />
score the docking poses.<br />
I applied this protocol to two different queries (thrombin<br />
and dihydrofolate reductase), using a data set of 1067<br />
druggable proteins as templates (Druggable Cavity<br />
Directory).<br />
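The hand-over between steps 3 and 4 can be sketched as follows: given similarity scores between query and template cavities, keep the pairs above a threshold and list which ligand should be docked into which other cavity.<br />
This is only an illustration under stated assumptions. The `Pocket` record, the score dictionary, the threshold and all identifiers are hypothetical, and no Fpocket, Apoc or AutoDock Vina call is made.<br />

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Pocket:
    structure_id: str        # hypothetical structure identifier
    pocket_id: int           # index of the predicted cavity
    ligand: Optional[str]    # bound ligand name, or None for an empty cavity


def cross_docking_jobs(
    query_pockets: List[Pocket],
    template_pockets: List[Pocket],
    similarity: Dict[Tuple[Pocket, Pocket], float],
    threshold: float,
) -> List[Tuple[str, Pocket]]:
    """Return (ligand, target pocket) pairs for sufficiently similar sites."""
    jobs: List[Tuple[str, Pocket]] = []
    for q in query_pockets:
        for t in template_pockets:
            if similarity.get((q, t), 0.0) < threshold:
                continue
            # Dock each available ligand into the other cavity of the pair.
            if t.ligand is not None:
                jobs.append((t.ligand, q))
            if q.ligand is not None:
                jobs.append((q.ligand, t))
    return jobs


# Toy data: identifiers, ligand names and scores are made up.
q1 = Pocket("QUERY", 1, None)
t1 = Pocket("TMPL_A", 1, "LIG_A")
t2 = Pocket("TMPL_B", 3, "LIG_B")
scores = {(q1, t1): 0.82, (q1, t2): 0.31}
jobs = cross_docking_jobs([q1], [t1, t2], scores, threshold=0.5)
```

Only the pair above the threshold produces a docking job, which is how the protocol reduces the number of candidates to test.<br />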
RESULTS & DISCUSSION<br />
The method works well for repositioning ligands among<br />
proteins of the same family (intraprotein), but is not able<br />
to detect interprotein similarities (among unrelated<br />
proteins). This is due to the large size of the<br />
predicted cavities (larger than the space occupied by<br />
the ligand) combined with the alignment-based algorithm used,<br />
which makes it difficult to reach a sufficient similarity score<br />
and greatly increases the number of false negatives. In<br />
future work I will divide the cavity space into subpockets,<br />
decouple the similarity from the sequence by using<br />
pharmacophoric maps, and combine structure-based<br />
similarity with ligand-based and network-based similarity. All the<br />
information will be fused with data integration algorithms.<br />
REFERENCES<br />
[1] On the origins of drug polypharmacology, Xavier Jalencas and Jordi<br />
Mestres, Med. Chem. Commun., 2013, 4, 80.<br />
[2] Drug repositioning by structure-based virtual screening, Dik-Lung Ma,<br />
Daniel Shiu-Hin Chan and Chung-Hang Leung, Chem. Soc. Rev.,<br />
2013, 42, 2130.<br />
[3] Comparison and Druggability Prediction of Protein−Ligand Binding<br />
Sites from Pharmacophore-Annotated Cavity Shapes, Jérémy<br />
Desaphy, Karima Azdimousa, Esther Kellenberger, and Didier<br />
Rognan, J. Chem. Inf. Model. 2012, 52, 2287−2299.<br />
P44. ASSESSMENT OF THE CONTRIBUTION OF COCOA-DERIVED STRAINS<br />
OF ACETOBACTER GHANENSIS AND ACETOBACTER SENEGALENSIS TO<br />
THE COCOA BEAN FERMENTATION PROCESS THROUGH A GENOMIC<br />
APPROACH<br />
Rudy Pelicaen, Koen Illeghems, Luc De Vuyst, and Stefan Weckx * .<br />
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering<br />
Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB,<br />
Brussels, Belgium. *Stefan.Weckx@vub.ac.be<br />
Acetobacter ghanensis LMG 23848 T and Acetobacter senegalensis 108B are acetic acid bacteria strains that originate<br />
from a spontaneous cocoa bean heap fermentation process. Extensive metabolic and kinetic studies have indicated that these strains have interesting<br />
functionalities. Whole-genome sequencing of A. ghanensis LMG 23848 T<br />
and A. senegalensis 108B made it possible to unravel their genetic adaptations to the cocoa bean fermentation ecosystem.<br />
INTRODUCTION<br />
Fermented dry cocoa beans are the basic raw material for<br />
chocolate production. The cocoa pulp-bean mass contents<br />
of the cocoa pods undergo, once taken out of the pods, a<br />
spontaneous fermentation process that lasts four to six<br />
days. This process is characterised by a succession of<br />
yeasts, lactic acid bacteria (LAB), and acetic acid bacteria<br />
(AAB) coming from the environment (De Vuyst et al.,<br />
2015).<br />
METHODS<br />
Total genomic DNA isolation and purification of A.<br />
ghanensis LMG 23848 T and A. senegalensis 108B was<br />
followed by the construction of an 8-kb paired-end library,<br />
454 pyrosequencing, and assembly of the sequence reads<br />
using the GS De Novo Assembler version 2.5.3 with<br />
default parameters. Genome finishing was performed by<br />
PCR assays to close gaps in the draft assembly using<br />
CONSED 23.0. Automated gene prediction and annotation<br />
of the assembled genome sequences were carried out using<br />
the bacterial genome sequence annotation platform<br />
GenDB v2.2 (Meyer et al., 2003). The predicted genes<br />
were functionally characterised using searches in public<br />
databases and bioinformatics tools, and annotations were<br />
manually curated. Comparative analysis of the genome<br />
sequences of the cocoa-derived strains A. ghanensis LMG<br />
23848 T (this study), A. senegalensis 108B (this study), and<br />
A. pasteurianus 386B (Illeghems et al., 2013) was<br />
accomplished by the EDGAR framework (Blom et al.,<br />
2009).<br />
RESULTS & DISCUSSION<br />
The genomes of the strains investigated consisted of a<br />
circular chromosomal DNA sequence with a size of 2.7<br />
Mbp and two plasmids for A. ghanensis LMG 23848 T and<br />
a circular chromosomal DNA sequence with a size of 3.9<br />
Mbp and one plasmid for A. senegalensis 108B (Figure 1).<br />
Comparative analysis revealed that the order of<br />
orthologous genes was highly conserved between the<br />
genome sequences of A. pasteurianus 386B and A.<br />
ghanensis LMG 23848 T . Evidence was found that both<br />
species possessed the genetic ability to be involved in<br />
citrate assimilation and they displayed adaptations in their<br />
respiratory chain. As is the case for many AAB, the<br />
missing gene encoding phosphofructokinase in the<br />
genome sequences of both A. ghanensis LMG 23848 T and<br />
A. senegalensis 108B resulted in a non-functional upper<br />
part of the Embden–Meyerhof–Parnas pathway. However,<br />
the presence of genes coding for membrane-bound PQQ-dependent<br />
dehydrogenases enabled the AAB strains<br />
examined to rapidly oxidise ethanol into acetic acid.<br />
Furthermore, an alternative TCA cycle, characterised by<br />
genes coding for a succinyl-CoA:acetate-CoA transferase<br />
and a malate:quinone oxidoreductase, was present.<br />
In addition, evidence was found in both genome<br />
sequences that glycerol, mannitol and lactate could be<br />
used as energy sources. Thus, although both species<br />
displayed genetic adaptations to the cocoa bean<br />
fermentation process, their dependence on glycerol,<br />
mannitol and lactate may partly explain their low<br />
competitiveness during cocoa bean fermentation processes,<br />
as these substrates have to be formed through yeast or<br />
LAB activities, respectively.<br />
FIGURE 1. Graphical representation of the genomes of A. ghanensis<br />
LMG 23848 T (A) and A. senegalensis 108B (B).<br />
REFERENCES<br />
Blom, J., Albaum, S., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M.,<br />
Goesmann, A., 2009. EDGAR: a software framework for the comparative<br />
analysis of prokaryotic genomes. BMC Bioinformatics 10, 1-14.<br />
De Vuyst, L., Weckx, S., 2015. The functional role of lactic acid bacteria in cocoa<br />
bean fermentation. In: Mozzi, F., Raya, R.R., Vignolo, G.M. (Eds.).<br />
Biotechnology of Lactic Acid Bacteria: Novel Applications. Wiley-Blackwell,<br />
Ames, IA, USA. In press.<br />
Illeghems, K., De Vuyst, L., Weckx, S., 2013.<br />
Complete genome sequence and comparative analysis of Acetobacter<br />
pasteurianus 386B, a strain well-adapted to the cocoa bean fermentation<br />
ecosystem. BMC Genomics 14, 526.<br />
Meyer, F., Goesmann, A., McHardy, A. C., Bartels, D., Bekel, T., et al., 2003.<br />
GenDB - an open source genome annotation system for prokaryote genomes.<br />
Nucleic Acids Res. 31, 2187-2195.<br />
P45. REPRESENTATIONAL POWER OF GENE FEATURES<br />
FOR FUNCTION PREDICTION<br />
Konstantinos Pliakos 1* , Isaac Triguero 2,3 , Dragi Kocev 4 & Celine Vens 1 .<br />
Department of Public Health and Primary Care, KU Leuven Kulak 1 ; Department of Respiratory Medicine, Ghent<br />
University 2 ; Data Mining and Modelling for Biomedicine group, VIB Inflammation Research Center 3 ; Department of<br />
Knowledge Technologies, Jožef Stefan Institute 4 . * konstantinos.pliakos@kuleuven-kulak.be<br />
We present a short study on gene function prediction datasets, revealing an existing issue of non-unique feature<br />
representation, as well as the effect of this issue on hierarchical multi-label classification algorithms.<br />
INTRODUCTION<br />
This study focuses on hierarchical multi-label<br />
classification (HMC). HMC is a variant of classification<br />
where one sample can be assigned to several classes<br />
simultaneously. It differs from plain multi-label<br />
classification in that these classes are organized in a hierarchy.<br />
That means that a sample belonging to a class<br />
automatically belongs to all its super-classes. Typical<br />
HMC tasks include gene function prediction or text<br />
classification. Here, we focus on the former.<br />
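The hierarchy constraint can be illustrated with a short sketch that closes a label set under its super-classes.<br />
It assumes a tree-shaped hierarchy whose class codes are dot-separated paths (in the style of FunCat codes); the function name and the example codes are illustrative only, and a DAG-shaped hierarchy would need an explicit parent table instead.<br />

```python
from typing import Iterable, Set


def with_superclasses(labels: Iterable[str], sep: str = ".") -> Set[str]:
    """Close a label set under the hierarchy by adding every ancestor class."""
    closed: Set[str] = set()
    for label in labels:
        parts = label.split(sep)
        # Every prefix of the path is a super-class of the label.
        for depth in range(1, len(parts) + 1):
            closed.add(sep.join(parts[:depth]))
    return closed


# Illustrative class codes; a gene annotated with two leaf classes.
gene_labels = with_superclasses({"01.01.03", "16.19"})
```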
A typical characteristic of genes is that they can be<br />
described in several ways: using information about their<br />
sequence, homology to well-characterized genes,<br />
expression profiles, secondary structure of their derived<br />
proteins, etc. The HMC community has multiple research<br />
datasets at its disposal on gene functions (e.g., Vens et al.,<br />
2008; Schietgat et al., 2010), each representing genes<br />
by one type of features. Researchers should<br />
take advantage of this amount of data, but the question<br />
arises how “good” these datasets are. How discriminant<br />
are the features describing a gene? This short study<br />
highlights existing data-related problems and addresses<br />
these questions.<br />
DATA STUDY & RESULTS<br />
After careful experimentation on various publicly<br />
available datasets, we noted that some of them suffer<br />
from a large number of duplicate feature vectors: there<br />
are genes which, despite having different functions, have exactly the<br />
same feature representation. Table 1 illustrates this<br />
problem for several of the 20 gene function<br />
prediction datasets described in (Vens et al., 2008) and<br />
(Schietgat et al., 2010).<br />
Organism | Dataset | Nb of genes | Nb of unique gene representations<br />
S. cerevisiae | church | 3755 | 2352<br />
S. cerevisiae | pheno | 1591 | 514<br />
S. cerevisiae | hom | 3854 | 3646<br />
S. cerevisiae | seq | 3919 | 3913<br />
S. cerevisiae | struc | 3838 | 3785<br />
A. thaliana | scop | 9843 | 9415<br />
A. thaliana | struc | 11763 | 11689<br />
TABLE 1. Datasets, the number of genes and their unique representations.<br />
As shown, the church (micro-array expression) and<br />
the pheno (phenotype features) datasets suffer the most.<br />
More specifically, in the pheno dataset 67.7% of the gene<br />
representations are duplicates. The most frequent feature<br />
vector appears 315 times, 197 times in the training set and<br />
118 times in the test set. As a result, 20% of the 582 test<br />
examples present the same feature vector as input for<br />
prediction. In a decision tree model, for example, these<br />
genes will end up in the same leaf and receive the same<br />
prediction (the average class vector of the 197 training<br />
examples), but a different error term, as they are a<br />
priori associated with different class label sets. In the<br />
training phase, there may still be a lot of variation in the<br />
class vectors of the 197 genes, but no split exists to<br />
separate them. In the church dataset, the 3755 genes<br />
correspond to only 2352 unique feature descriptors. In<br />
the hom and struc datasets the number of duplicates is<br />
lower but still notable, considering the enormous size<br />
of the feature vectors in these datasets.<br />
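The duplicate counts discussed above can be reproduced for any feature matrix with a few lines of code. The sketch below is an illustration on made-up toy data, not the script used for Table 1; the function name and the returned keys are assumptions.<br />

```python
from collections import Counter
from typing import Dict, Sequence


def duplicate_stats(features: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Count identical feature vectors in a dataset (one row per gene)."""
    counts = Counter(tuple(row) for row in features)
    n_genes = len(features)
    n_unique = len(counts)
    return {
        "n_genes": n_genes,
        "n_unique": n_unique,
        # Fraction of genes beyond the unique representations.
        "duplicate_fraction": 1.0 - n_unique / n_genes,
        # How often the most frequent feature vector occurs.
        "max_multiplicity": max(counts.values()),
    }


# Toy dataset: five genes, three distinct feature vectors.
toy = [(0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0), (1, 0, 1)]
stats = duplicate_stats(toy)

# The same formula applied to the pheno row of Table 1 (1591 genes, 514 unique).
pheno_fraction = 1.0 - 514 / 1591
```

With the pheno counts from Table 1 the formula gives about 0.677, matching the 67.7% quoted in the text.<br />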
For evaluation purposes, ML-KNN (Zhang & Zhou,<br />
2007) was employed to demonstrate the effect of the<br />
studied problem on the average precision for the FunCat<br />
annotated datasets. Here, “unique” refers to the datasets<br />
obtained after removing all the duplicates, so that any<br />
feature vector can be included only once in a gene’s<br />
neighbour set. We report the average of 10 “unique”<br />
versions, each one using a different gene’s class label as<br />
ground truth for the feature vector.<br />
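One way to build such “unique” versions is sketched below: identical feature vectors are grouped, and each group keeps the label set of one randomly chosen gene.<br />
The random choice with a per-version seed is an assumption about how the 10 versions were drawn; the function name and the toy data are hypothetical.<br />

```python
import random
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence, Tuple

Features = Tuple[float, ...]
Labels = FrozenSet[str]


def unique_version(features: Sequence[Features], labels: Sequence[Labels],
                   seed: int) -> Dict[Features, Labels]:
    """Keep each feature vector once, labelled by one randomly chosen gene."""
    groups: Dict[Features, List[Labels]] = defaultdict(list)
    for x, y in zip(features, labels):
        groups[x].append(y)
    rng = random.Random(seed)
    return {x: rng.choice(ys) for x, ys in groups.items()}


# Toy data: the first two genes share a feature vector but not their labels.
X = [(0.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
Y = [frozenset({"01"}), frozenset({"02"}), frozenset({"01", "01.01"})]
versions = [unique_version(X, Y, seed=s) for s in range(10)]
```

Averaging an evaluation measure over the versions then smooths out the arbitrary choice of ground truth for duplicated vectors.<br />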
Dataset | Version | K = 1 Train | K = 1 Test (5cv) | K = 5 Train | K = 5 Test (5cv) | K = 17 Train | K = 17 Test (5cv)<br />
pheno | initial | 51.59 | 23.62 | 39.55 | 24.14 | 32.76 | 23.59<br />
pheno | unique | 100 | 24.21 | 55.62 | 24.90 | 39.70 | 25.01<br />
hom | initial | 98.30 | 39.32 | 63.64 | 39.45 | 48.96 | 37.28<br />
hom | unique | 100 | 39.14 | 64.64 | 39.67 | 49.28 | 37.53<br />
TABLE 2. Average Precision rates (%) using ML-KNN.<br />
Table 2 shows that a less discriminant feature<br />
representation can affect ML-KNN and decrease the<br />
average precision of multi-label classification: removing<br />
duplicates raises the test precision in every setting except<br />
hom with K = 1. The same problem is likely to be more<br />
pronounced, or even severe, for two-class or<br />
multi-class classification problems.<br />
CONCLUSION<br />
The major point of this study was to inform the research<br />
community of the relatively low representational power of<br />
the features present in some widely used gene function<br />
prediction datasets, making them even more difficult and<br />
challenging datasets from a machine learning perspective.<br />
We observed the same issue in datasets of other HMC<br />
application domains like text categorization.<br />
REFERENCES<br />
Zhang M. L. & Zhou Z. H. ML-KNN: A lazy learning approach to multi-label learning, Pattern<br />
recognition 40, 2038-2048, (2007).<br />
Vens C. et al. Decision trees for hierarchical multi-label classification, Machine Learning 73, 185-214,<br />
(2008).<br />
Schietgat L. et al. Predicting gene function using hierarchical multi-label decision tree ensembles, BMC<br />
Bioinformatics 11, (2010).<br />
P46. ANALYSIS OF BIAS AND ASYMMETRY IN THE PROTEIN STABILITY<br />
PREDICTION<br />
Fabrizio Pucci 1,* , Katrien Bernaerts 1,2 , Fabian Teheux 1 , Dimitri Gilis 1 & Marianne Rooman 1 .<br />
Department of BioModeling, BioInformatics & BioProcesses 1 , Université Libre de Bruxelles, 1050 Brussels, Belgium;<br />
BioBased Materials, Faculty of Humanities and Sciences 2 , Maastricht University, 6200 Maastricht, The Netherlands.<br />
* fapucci@ulb.ac.be<br />
In many bioinformatics analyses, avoiding biases towards the training dataset is one of the most intricate issues. Here we<br />
focus on the specific case of the prediction of protein thermodynamic stability changes upon point mutations (ΔΔG).<br />
First, we measure the bias towards destabilizing mutations of some widely used ΔΔG-prediction algorithms<br />
described in the literature. Then we show how important the use of symmetry in the model is to avoid biasing.<br />
Finally, we briefly discuss the distribution of the ΔΔG values for all possible point mutations in a series of proteins, with<br />
the aim of understanding whether the distribution is universal and how much it is biased towards the training dataset.<br />
INTRODUCTION<br />
The accurate prediction of the stability changes on a large<br />
scale is still a challenge in protein science. Despite the<br />
large amount of work done in the last years, the results<br />
frequently suffer from hidden biases towards the training<br />
dataset and this makes the evaluation of the real<br />
performances a difficult task.<br />
Here we study the “bias problem” in the case of the<br />
prediction of protein thermodynamic stability changes<br />
upon point mutations, and more precisely of its best<br />
descriptor ΔΔG, the change in folding free energy<br />
upon mutation from the wild-type protein W to the mutant<br />
M. In principle, the predicted ΔΔG value of the inverse<br />
mutation (M to W) has to be exactly equal to minus the<br />
ΔΔG of the direct mutation (W to M), since the free energy<br />
is a state function.<br />
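Written out, the antisymmetry follows directly from the definition. The symbol ΔG_fold(X) for the folding free energy of protein X is notation introduced here for illustration, and the sign convention (mutant minus wild type) is an assumption.<br />

```latex
\begin{align*}
\Delta\Delta G(W \to M) &= \Delta G_{\mathrm{fold}}(M) - \Delta G_{\mathrm{fold}}(W), \\
\Delta\Delta G(M \to W) &= \Delta G_{\mathrm{fold}}(W) - \Delta G_{\mathrm{fold}}(M)
                         = -\,\Delta\Delta G(W \to M).
\end{align*}
```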
Unfortunately the asymmetry of the training dataset<br />
towards the destabilizing mutations (reflecting the<br />
evolutionary optimization of protein stability) makes the<br />
prediction of inverse mutations less accurate than that<br />
of direct ones. This introduces a series of distortions in<br />
the prediction model that we will analyze here.<br />
METHODS<br />
We computed the ΔΔG value for a set of almost 200<br />
mutations for which the structures of both the wild-type<br />
protein and the mutant are known, using a series of prediction<br />
tools, i.e. PoPMuSiC [1], I-Mutant, FoldX, Duet,<br />
AutoMute, CupSat, Eris and ProSMS. We then computed<br />
the Ratio (RID) of the standard deviation between the<br />
predicted and the experimental values of ΔΔG for the<br />
Inverse mutations to that for the Direct mutations (which<br />
should be one in the case of a perfectly symmetric<br />
prediction) and compared the results of the different<br />
programs.<br />
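A minimal sketch of the RID computation, under the assumption that the “standard deviation between the predicted and the experimental values” is the root-mean-square deviation of the prediction errors, and that the experimental value of an inverse mutation is minus that of the direct one. Function names and all numbers are made up for illustration.<br />

```python
import math
from typing import Sequence


def rms_deviation(predicted: Sequence[float],
                  experimental: Sequence[float]) -> float:
    """Root-mean-square deviation between predictions and experiment."""
    if len(predicted) != len(experimental):
        raise ValueError("sequences must have the same length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def rid(pred_direct: Sequence[float], pred_inverse: Sequence[float],
        exp_direct: Sequence[float]) -> float:
    """Ratio of the inverse-mutation error to the direct-mutation error."""
    exp_inverse = [-x for x in exp_direct]   # antisymmetry of the experiment
    return (rms_deviation(pred_inverse, exp_inverse)
            / rms_deviation(pred_direct, exp_direct))


# Made-up values (arbitrary energy units) for four mutations.
exp_direct = [1.2, 0.4, 2.0, -0.3]
pred_direct = [1.0, 0.6, 1.7, 0.0]

# A perfectly antisymmetric predictor versus one with a constant offset
# on the inverse mutations.
symmetric_inverse = [-p for p in pred_direct]
biased_inverse = [1.0 - p for p in pred_direct]

rid_symmetric = rid(pred_direct, symmetric_inverse, exp_direct)
rid_biased = rid(pred_direct, biased_inverse, exp_direct)
```

The antisymmetric predictor gives a ratio of one, while the offset inflates the inverse-mutation error and pushes the ratio above one.<br />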
If the functional structure of the model is known, as in the<br />
case of the artificial neural network of PoPMuSiC, one<br />
can further understand which terms contribute most<br />
to the deviation of the RID from unity and thus propose new<br />
model structures in which the biases are correctly avoided<br />
[2].<br />
In more black-box machine learning approaches (such as<br />
methods based on Random Forests or Support Vector<br />
Machines), in which the functional form is not explicitly<br />
known, the asymmetry correction is less obvious.<br />
In a second part, we investigated how the symmetry of the<br />
ΔΔG value distribution in the training dataset influences<br />
the prediction of the ΔΔG distribution for all possible<br />
mutations in a series of proteins with known structures.<br />
RESULTS & DISCUSSION<br />
The estimation of the asymmetry computed for a<br />
series of available prediction methods gives RID<br />
values between 1 for bias-corrected methods and<br />
about 3 for the most biased programs. These<br />
results show that the correct use of<br />
symmetry in setting up the model structure helps to<br />
avoid unwanted biases towards destabilizing<br />
mutations.<br />
Furthermore, the distribution of the ΔΔG values for all<br />
point mutations in some proteins has been analyzed<br />
and shows a dependence on the ΔΔG distribution<br />
of the training dataset when the RID deviates<br />
significantly from one. Understanding the<br />
relation between the two distributions is an<br />
important step towards comprehending the universality of the<br />
distribution [3] and how much proteins are<br />
optimized to minimize the impact of single-site<br />
amino acid substitutions.<br />
REFERENCES<br />
[1] Y. Dehouck, J. M. Kwasigroch, D. Gilis, M. Rooman (2011),<br />
PoPMuSiC 2.1: a web server for the estimation of the protein<br />
stability changes upon mutation and sequence optimality. BMC<br />
Bioinformatics, 12, 151.<br />
[2] F. Pucci, K. Bernaerts, F. Teheux, D. Gilis, M. Rooman, Symmetry<br />
Principles in Optimization Problems: an application to Protein<br />
Stability Prediction (2015), IFAC-PapersOnLine 48-1, 458-463.<br />
[3] Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS, The<br />
stability effects of protein mutations appear to be universally<br />
distributed (2007), J Mol Biol, 356, 1318-1332.<br />
P47. MULTI-LEVEL BIOLOGICAL CHARACTERIZATION OF EXOMIC<br />
VARIANTS AT THE PROTEIN LEVEL IMPROVES THE IDENTIFICATION OF<br />
THEIR DELETERIOUS EFFECTS<br />
Daniele Raimondi 1,2,3,4 , Andrea Gazzo 1,2 , Marianne Rooman 1,6 , Tom Lenaerts 1,2,5 & Wim Vranken 1,2,3,4 .<br />
Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium 1 ; Machine Learning group,<br />
Université Libre de Bruxelles, Brussels, 1050, Belgium 2 ; Structural Biology Brussels, Vrije Universiteit Brussel,<br />
Brussels, 1050, Belgium 3 ; Structural Biology Research Centre, VIB, Brussels, 1050, Belgium 4 ; Artificial Intelligence lab,<br />
Vrije Universiteit Brussel, Brussels, 1050 Belgium 5 ; 3BIO-BioInfo group, Université Libre de Bruxelles, Brussels, 1050,<br />
Belgium 6 . * daniele.raimondi@vub.ac.be<br />
The increasing availability of genome sequence data led to the development of predictors that are capable of identifying<br />
the likely phenotypic effects of Single Nucleotide Variants (SNVs) or short inframe Insertions or Deletions (INDELs).<br />
Most of these predictors focus on SNVs and use a combination of features related to sequence conservation, biophysical<br />
and/or structural properties to link the observed variant to either a neutral or a disease phenotype. Despite notable<br />
successes, the mapping between genetic alterations and phenotypic effects is riddled with levels of complexity that are<br />
not yet fully understood and that are often not taken into account in the predictions. A better multi-level molecular and<br />
functional contextualization of both the variant and the protein may therefore significantly improve the predictive quality<br />
of variant-effect predictors.<br />
INTRODUCTION<br />
The phenotypical interpretation at the organism level of<br />
protein-level alterations is the ultimate goal of the variant-effect<br />
prediction field. This causal relationship is still far<br />
from being completely understood and is confounded by<br />
many aspects related to the intrinsic complexity of cell life. A<br />
crucial restriction of variant-effect prediction is that an<br />
alteration of the protein’s molecular phenotype, even if it is a<br />
sine qua non condition for the disease phenotype in the<br />
carrier individual, may not constitute in itself a sufficient<br />
cause for the disease: this also depends on the particular role<br />
that the affected protein plays in the well-being of the<br />
organism. Even the most commonly used features, which<br />
relate evolutionary constraints with likely functional damage,<br />
offer only a partial correlation with the pathogenicity of the<br />
variant. Consequently, additional information that bridges the<br />
variant-phenotype gap is crucial to improve variant-effect<br />
predictions.<br />
METHODS<br />
We address the inherently complex variant-effect prediction<br />
problem through the integration of different sources of<br />
information. By describing each (protein, variant) pair from<br />
different perspectives corresponding to different levels of<br />
contextualisation, we assembled the most relevant and<br />
accessible pieces of information that are currently available,<br />
with the aim to elucidate the fuzzy and complex mapping<br />
between molecular-level alterations and the individual-level<br />
phenotypic outcome. We use three variant-oriented features<br />
with different characteristics: the log-odd ratio (LOR) score<br />
and Conservation index (CI) [1], which are column-wise<br />
measures of the conservation of a mutated column within a<br />
multiple-sequence alignment (MSA), and the PROVEAN [2]<br />
predictions (PROV), which provide a sequence-wide measure<br />
of the change in evolutionary distance between the mutated<br />
target protein and close functional homologs that correlates<br />
with the deleteriousness of variants. The protein-oriented<br />
features use pathway [4] and protein-protein interaction<br />
networks information [5] (DGR) as well as genetic and<br />
clinical information, for instance an evaluation of how<br />
tolerant the affected genes are to homozygous loss-of-function<br />
mutations (REC) [3].<br />
RESULTS & DISCUSSION<br />
DEOGEN is our novel variant effect predictor that can<br />
natively handle both SNVs and inframe INDELs. By<br />
integrating information from different biological scales and<br />
mimicking the complex mixture of effects that lead from the<br />
variant to the phenotype, we obtain significant improvements<br />
in the variant-effect prediction results. Next to the typical<br />
variant-oriented features based on the evolutionary<br />
conservation of the mutated positions, we added a collection<br />
of protein-oriented features that are based on functional<br />
aspects of the gene affected. We cross-validated DEOGEN on<br />
36825 polymorphisms, 20821 deleterious SNVs and 1038<br />
INDELs from SwissProt.<br />
Method | Missing SNVs | Sen | Spe | Pre | Bac | MCC<br />
PROVEAN | 0.0 | 78 | 79 | 68 | 79 | 56<br />
SIFT | 2.0 | 85 | 69 | 61 | 77 | 52<br />
Mutation Assessor | 0.6 | 85 | 71 | 63 | 78 | 54<br />
PolyPhen2 (HumDiv) | 4.0 | 89 | 63 | 57 | 76 | 50<br />
CADD | 7.0 | 82 | 75 | 66 | 78 | 55<br />
EFIN | 0.0 | 86 | 80 | 87 | 83 | 64<br />
MutationTaster | 20.7 | 86 | 75 | 69 | 81 | 60<br />
GERP++ | 20.7 | 97 | 24 | 45 | 61 | 28<br />
DEOGEN | 4.4 | 77 | 92 | 85 | 84 | 71<br />
FIGURE 1. Comparison of the performances of 8 variant-effect predictors<br />
with DEOGEN on the Humsavar 2013 dataset.<br />
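The column headings are not expanded in the abstract. Assuming they stand for sensitivity (Sen), specificity (Spe), precision (Pre), balanced accuracy (Bac) and Matthews correlation coefficient (MCC), the sketch below shows how such scores are obtained from a confusion matrix. The counts used are made up and unrelated to the comparison above.<br />

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification scores from confusion-matrix counts."""
    sen = tp / (tp + fn)                 # sensitivity (recall)
    spe = tn / (tn + fp)                 # specificity
    pre = tp / (tp + fp)                 # precision
    bac = (sen + spe) / 2                # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Bac": bac, "MCC": mcc}


# Made-up counts: 100 deleterious and 100 neutral variants.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```

Balanced accuracy and MCC are less sensitive than plain accuracy to the imbalance between neutral and deleterious variants, which may be why they are reported here.<br />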
REFERENCES<br />
[1] Calabrese, R. et al. Functional annotations improve the predictive<br />
score of human disease-related mutations in proteins. Hum. Mutat.<br />
30, 1237-44 (2009).<br />
[2] Choi, Y. et al. Predicting the functional effect of amino acid<br />
substitutions and indels. PLoS One 7, e46688 (2012).<br />
[3] Daniel G. MacArthur et al. A Systematic Survey of Loss-of-Function<br />
Variants in Human Protein-Coding Genes. Science 17 February<br />
2012: 335 (6070), 823-828.<br />
[4] Atanas Kamburov et al. (2011) ConsensusPathDB: toward a more<br />
complete picture of cell biology. Nucleic Acids Research 39:D712-<br />
717.<br />
P48. NGOME: PREDICTION OF NON-ENZYMATIC PROTEIN<br />
DEAMIDATION FROM SEQUENCE-DERIVED SECONDARY STRUCTURE AND<br />
INTRINSIC DISORDER<br />
J. Ramiro Lorenzo 1 , Leonardo G. Alonso 2 & Ignacio E. Sánchez 1* .<br />
Protein Physiology Laboratory, Facultad de Ciencias Exactas y Naturales and IQUIBICEN - CONICET, Universidad de<br />
Buenos Aires, Argentina 1 ; Protein Structure-Function and Engineering Laboratory, Fundación Instituto Leloir and<br />
IIBBA - CONICET, Buenos Aires, Argentina 2 . *isanchez@qb.fcen.uba.ar<br />
Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a<br />
molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein<br />
local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic<br />
deamidation of internal asparagine residues in proteins, in the absence of structural data, from sequence-based<br />
predictions of secondary structure and intrinsic disorder. NGOME may help the user identify deamidation-prone<br />
asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological<br />
processes.<br />
INTRODUCTION<br />
Protein deamidation is a post-translational modification in<br />
which the side chain amide group of a glutamine or<br />
asparagine (Asn) residue is transformed into an acidic<br />
carboxylate group. Deamidation often, but not always,<br />
leads to loss of protein function 1,2 . Deamidation rates in<br />
proteins vary widely, with halftimes for particular Asn<br />
residues ranging from several days to years. In contrast<br />
with the ubiquity and importance of Asn deamidation,<br />
there is currently no publicly available algorithm for the<br />
prediction of Asn deamidation. A structure-based<br />
algorithm was published 3 , but is no longer available online<br />
and is not useful for proteins of unknown structure or<br />
those that are intrinsically disordered.<br />
METHODS<br />
Dataset. We collected from the literature experimental<br />
reports of deamidation of Asn residues in proteins using<br />
mass spectrometry or Edman sequencing. Since<br />
deamidation rates depend strongly on pH and temperature,<br />
we only included experiments at neutral or slightly basic<br />
pH and up to 313K. An Asn residue was considered a<br />
positive if unequivocal change to aspartic or isoaspartic<br />
residue was observed. Asn residues for which direct<br />
experimental evidence was not obtained were not taken<br />
into account.<br />
NGOME training. We trained the algorithm by randomly<br />
splitting the dataset into training and test sets 100 times,<br />
while keeping a similar number of positive and negative<br />
Asn-Xaa dipeptides in the two sets. For each splitting, we<br />
selected the weights for disorder 4 and alpha helix<br />
prediction 5 in the NGOME algorithm to maximize the area<br />
under the ROC curve for the training set. For the test set,<br />
the area under the ROC curve for NGOME was larger than<br />
for sequence-based prediction 97 out of 100 times. Finally,<br />
we selected the average values of weights for NGOME.<br />
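The repeated-split procedure described above can be sketched as follows. This is a minimal illustration, not the NGOME implementation: the linear `score` function, the grid of candidate weights and the toy data are all assumptions made for the example, since the abstract does not state how the disorder and helix predictions enter the final score.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def score(site, w_dis, w_helix):
    # Hypothetical combination: a sequence-based propensity, raised by predicted
    # disorder and lowered by predicted helix. The real NGOME formula may differ.
    seq, disorder, helix = site
    return seq + w_dis * disorder - w_helix * helix

def stratified_split(labels, rng, frac=0.5):
    """Random split that keeps a similar positive/negative ratio in both sets."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * frac)
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def fit_weights(sites, labels, grid, n_splits=100, seed=0):
    rng = random.Random(seed)
    chosen = []
    for _ in range(n_splits):
        train, _test = stratified_split(labels, rng)
        tr_sites = [sites[i] for i in train]
        tr_labels = [labels[i] for i in train]
        # Pick the weight pair with the largest training-set AUC.
        best = max(
            ((wd, wh) for wd in grid for wh in grid),
            key=lambda w: auc([score(s, *w) for s in tr_sites], tr_labels),
        )
        chosen.append(best)
    # Final weights: average over all splits, as described in the text.
    return mean(w[0] for w in chosen), mean(w[1] for w in chosen)

# Toy data (made up): each site is (sequence propensity, disorder, helix).
toy_rng = random.Random(1)
labels = [1] * 20 + [0] * 40
sites = [
    (toy_rng.random(), toy_rng.random() + 0.3 * y, toy_rng.random() - 0.3 * y)
    for y in labels
]
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
w_dis, w_helix = fit_weights(sites, labels, grid, n_splits=20)
```

Averaging the per-split optima, rather than keeping the single best split, reduces the dependence of the final weights on any one partition of a small dataset.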
RESULTS & DISCUSSION<br />
Both protein sequence and structure can influence Asn<br />
deamidation kinetics. In the absence of secondary and<br />
tertiary structure, Asn deamidation rates are governed by<br />
the identity of the N+1 amino acid 3 . In model peptides, the<br />
Asn-Gly dipeptide is by far the fastest to deamidate, with<br />
bulky N+1 side chains generally slowing down the<br />
reaction. Several structural features decreasing Asn<br />
deamidation rates have also been identified, including<br />
alpha helix formation and hydrogen bond formation by the<br />
Asn side chain, the N+1 backbone amide and the<br />
neighbouring residues 3 .<br />
We compiled a database of 281 Asn residues (67 positives<br />
and 214 negatives) in 39 proteins to train NGOME. We<br />
computed t50 for all Asn in the dataset and generated a<br />
ROC curve by classifying Asn residues as positive at<br />
different threshold values of t50. The area under the ROC curve is<br />
larger for the NGOME predictions (0.9640) than for the<br />
sequence-based predictions (0.9270) (p-value 6×10⁻³).<br />
NGOME also performs better for threshold values<br />
yielding few false positives. NGOME can also<br />
discriminate between positive and negative Asn-Gly<br />
dipeptides, whereas sequence-based prediction cannot.<br />
For these dipeptides, the area under the ROC curve is 0.7051 for the NGOME<br />
predictions, larger than the random value of 0.5 for<br />
sequence-based prediction (p-value 9×10⁻³). Since<br />
NGOME requires only a protein sequence as an input and<br />
not a three-dimensional structure, we envision that<br />
NGOME will be useful to systematically evaluate whole<br />
proteome data and in the study of intrinsically disordered<br />
proteins for which structural data are scarce. NGOME is<br />
freely available as a webserver at the National EMBnet<br />
node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/<br />
in the subpage “Protein and nucleic acid structure and<br />
sequence analysis”.<br />
REFERENCES<br />
1. Curnis, F., et al. J Biol Chem 281:36466-36476 (2006).<br />
2. Reissner, K.J. and Aswad, D.W. Cell Mol Life Sci 60:1281-1295<br />
(2003).<br />
3. Robinson, N.E. and Robinson, A.B. Proc Natl Acad Sci U S A<br />
98:4367-4372 (2001).<br />
4. Dosztanyi, Z., et al. Bioinformatics 21:3433-3434 (2005).<br />
5. Cole, C., et al. Nucleic Acids Res 36:W197-201 (2008).<br />
P49. OPTIMAL DESIGN OF SRM ASSAYS USING MODULAR EMPIRICAL<br />
MODELS<br />
Jérôme Renaux 1,* , Alexandros Sarafianos 1 , Kurt De Grave 1 & Jan Ramon 1 .<br />
Department of Computer Science, KU Leuven. 1 * Jerome.renaux@cs.kuleuven.be<br />
Targeted proteomics techniques such as Selected Reaction Monitoring (SRM) have become very popular for protein<br />
quantification due to their high sensitivity and reproducibility. However, these rely on the selection of optimal transitions,<br />
which are not always known in advance and may require expensive and time-consuming discovery experiments to<br />
identify. We propose a computer program for the automated identification of optimal transitions using machine learning<br />
and show encouraging results when compared to a widely used spectral library.<br />
INTRODUCTION<br />
A major issue with SRM is to know which transitions<br />
to monitor in order to maximally detect a specific protein,<br />
these being different from one protein to another. Good<br />
candidates are transitions whose chemical properties will<br />
make them likely to occur and easy to detect by the mass<br />
spectrometer, while being sufficiently specific indicators<br />
of their parent protein.<br />
Traditionally, targeted proteomics assays, which consist of<br />
lists of ions or transitions to monitor, are designed through<br />
costly exploratory experiments. Recently, attempts have<br />
been made to produce software to help design optimal<br />
assays. These efforts rely to some extent on collaborative<br />
databases of mass spectra which are mined to identify the<br />
best possible peptides to include in the assays. While<br />
successful, these approaches still depend on past<br />
exploratory analyses and on the coverage of the exploited<br />
databases. Therefore, their performance decreases in cases<br />
where such databases cannot be leveraged, such as when<br />
dealing with little-studied organisms or rare, low-abundance<br />
proteins.<br />
We propose an approach called SIMPOPE (Sequence of<br />
Inductive Models for the Prediction and Optimization of<br />
Proteomics Experiments) that models all the steps of the<br />
typical tandem mass spectrometry (MS/MS) workflow in<br />
order to accurately predict the properties of peptide and<br />
fragment ions within a given proteome, and subsequently<br />
identify optimal assays among them.<br />
METHODS<br />
SIMPOPE consists of a sequential suite of predictive<br />
models for each step of the MS/MS workflow. It exploits<br />
knowledge from public databases and combines it with the<br />
generalizing power of machine learning models to<br />
compensate for noisy or missing data. All models are<br />
probabilistic, allowing us to keep track of the inherent<br />
uncertainty of the successive predictions and to weight the<br />
results accordingly for the assay prediction.<br />
Enzymatic cleavage is modelled using CP-DT (Fannes et<br />
al., 2013), which models the behaviour of the trypsin<br />
enzyme using random forests. Retention time prediction is<br />
achieved using the Elude tool from the Percolator suite<br />
(Moruz et al., 2010). The charge distribution of<br />
electrospray precursor ions is also modelled using random<br />
forests trained on experimental data mined from PRIDE<br />
(Vizcaino et al., 2013). Fragmentation patterns and<br />
product ion intensity are predicted with the help of random<br />
forest models trained on MS-LIMS data (Degroeve &<br />
Martens 2013; De Grave et al., 2014). Finally, prior<br />
knowledge about the abundance of proteins within a given<br />
proteome is incorporated as prior probabilities, obtained<br />
when available from PaxDB.<br />
On the human proteome, these steps yield a total of 321<br />
000 000 transitions together with their relevant chemical<br />
properties. We then compute a score for every single<br />
transition, based on these properties and on their aliasing<br />
with other transitions in terms of Q1 and Q3 m/z.<br />
RESULTS & DISCUSSION<br />
We validated our approach by computing scores for 2000<br />
reference transitions from the SRMAtlas database (Picotti<br />
et al., 2014). Based on these scores, we can rank the<br />
reference transitions among all possible transitions.<br />
Intuitively, reference transitions should rank near the top,<br />
i.e. have a numerically low rank (ideally, in the top five). Based<br />
on the average number of transitions per protein in our<br />
reference set, a perfect median rank would be 3.2, while a<br />
totally random scoring system should yield a median rank<br />
of 151. The approach we propose achieved a median rank<br />
of 15, signifying that using our scoring method, 50% of<br />
the reference transitions are ranked in the top 15. This<br />
result is encouraging as it shows that the scores predicted<br />
by SIMPOPE do correlate with the quality of the<br />
transitions. We can subsequently use that score as a<br />
feature to train an additional model on top of the ones<br />
described here to refine the assay prediction process<br />
(further results on the poster).<br />
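The median-rank evaluation described above can be illustrated with a short sketch. It assumes that reference transitions are ranked among the candidate transitions of the same protein (which the reported random-baseline rank suggests but the text does not state outright); the protein and transition identifiers and the scores are made up.

```python
from statistics import median

def reference_ranks(scores_by_protein, reference):
    """Rank (1 = best) of each reference transition among all candidate
    transitions of its protein, ordered by decreasing score."""
    ranks = []
    for protein, scores in scores_by_protein.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        position = {t: i + 1 for i, t in enumerate(ordered)}
        ranks += [position[t] for t in reference.get(protein, []) if t in position]
    return ranks

# Toy input: {protein: {transition_id: predicted score}} (hypothetical values).
scores_by_protein = {
    "P1": {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.1},
    "P2": {"u1": 0.2, "u2": 0.8, "u3": 0.5},
}
# Transitions listed in the reference library for each protein.
reference = {"P1": ["t1", "t3"], "P2": ["u3"]}

ranks = reference_ranks(scores_by_protein, reference)
median_rank = median(ranks)
```

A median rank close to the number of reference transitions per protein indicates that the score places library transitions near the top; a median near half the number of candidates indicates no better than random ordering.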
REFERENCES<br />
Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak<br />
intensity prediction. Bioinformatics, 29, pp.3199–203 (2013).<br />
Fannes, T. et al. Journal of Proteome Research, 12(5), pp.2253–2259<br />
(2013).<br />
De Grave, K. et al. Prediction of peptide fragment ion intensity: a<br />
priori partitioning reconsidered. International Mass Spectrometry<br />
Conference 2014, (2014).<br />
Moruz, L., Tomazela, D. & Käll, L. Training, selection, and robust<br />
calibration of retention time models for targeted proteomics. Journal<br />
of Proteome Research, 9(10), pp.5209–5216 (2010).<br />
Picotti, P. et al. A complete mass-spectrometric map of the yeast<br />
proteome applied to quantitative trait analysis. Nature, 494(7436),<br />
pp.266–270 (2014).<br />
Vizcaino, J. A. et al. The Proteomics Identifications (PRIDE) database<br />
and associated tools: status in 2013. Nucleic Acids Research, 41(D1),<br />
pp.D1063–D1069 (2013).<br />
P50. EVALUATING THE ROBUSTNESS OF LARGE INDEL IDENTIFICATION<br />
ACROSS MULTIPLE MICROBIAL GENOMES<br />
Alex Salazar 1,2 & Thomas Abeel 1,2* .<br />
Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands 1 ; Genome Sequencing and Analysis<br />
Program, Broad Institute of MIT and Harvard 2 . * T.Abeel@tudelft.nl<br />
Comparing large structural variants—such as large insertions and deletions (indels)—across multiple genomes can reveal<br />
important insights in microbial organisms. Unfortunately, most studies that compare sequence variants only focus on<br />
single nucleotide variants and small indels. In this study, we investigated whether currently available variant callers are<br />
robust when identifying the same large indel across multiple genomes, an important criterion for accurately associating<br />
large variants. By simulating over 8,000 large indels of various sizes across 161 bacterial strains, we found that<br />
breakpoint detection is precise when identifying both deletions and insertions. We suggest that left-most-overlap<br />
normalization across all samples will ensure uniform breakpoint coordinates of identical large variants which can then be<br />
incorporated into existing association pipelines.<br />
INTRODUCTION<br />
Structural sequence variants, such as large insertions and<br />
deletions (indels), along with small sequence variants (e.g.<br />
single nucleotide variants and small indels) can enable more<br />
robust comparisons of microbial populations. Unfortunately,<br />
limitations in variant calling methods restrict investigations to<br />
compare only small variants across multiple microbial<br />
genomes—thereby ignoring larger variants (e.g. indels of size<br />
greater than 50nt). The recent development of structural<br />
variant detecting tools now provides an opportunity to<br />
compare and associate large indels with phenotype and<br />
population structure across a collection of samples. However,<br />
these tools have only been benchmarked against a single<br />
genome and their ability to consistently call large events<br />
across multiple genomes remains uncharacterized.<br />
METHODS<br />
In this study, we systematically benchmarked the robustness<br />
of large indel identification across multiple genomes using<br />
four recently developed structural variant detection tools:<br />
Pilon (Walker et al., 2014), Breseq (Barrick et al., 2014),<br />
BreakSeek (Zhao et al., <strong>2015</strong>), and MindTheGap (Rizk et al.,<br />
2014). Using a manually-curated reference genome for<br />
M. tuberculosis (H37Rv), we simulated nearly 10,000<br />
deletions and 8,000 insertions, ranging from 50nt<br />
to 550nt. Overall, the simulation experiment resulted in a<br />
total of 1.6 million expected deletions and 1.3 million expected<br />
insertions when we aligned short-reads from a data set of 161<br />
clinical strains of M. tuberculosis (Zhang et al., 2013).<br />
After identifying the simulated indels using the variant<br />
detecting tools, we used a distance test to investigate each<br />
tool’s robustness in breakpoint and genotype prediction. For<br />
each simulated indel prediction, we computed the distance of<br />
the predicted breakpoint coordinate to the expected<br />
breakpoint coordinate. We also calculated a genotype<br />
similarity score using the Damerau-Levenshtein distance.<br />
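The two measurements above can be sketched as follows. The abstract does not say how the Damerau-Levenshtein distance is turned into a similarity score, so the normalisation by the longer sequence is an assumption. The distance shown is the optimal string alignment variant (the restricted form of Damerau-Levenshtein), chosen here for brevity.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions, substitutions
    and transpositions of adjacent characters, each substring edited once."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]

def genotype_similarity(predicted, expected):
    # Assumed normalisation: 1 minus distance over the longer sequence length.
    longest = max(len(predicted), len(expected))
    return 1.0 if longest == 0 else 1.0 - osa_distance(predicted, expected) / longest

def breakpoint_distance(predicted_pos, expected_pos):
    """Absolute offset (in nt) between predicted and expected breakpoints."""
    return abs(predicted_pos - expected_pos)
```

For example, `breakpoint_distance(1005, 1000)` gives an offset of 5 nt, and two inserted sequences of length 10 that differ at one position have a similarity of 0.9 under this normalisation.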
RESULTS & DISCUSSION<br />
We found that all tools are able to precisely predict the<br />
breakpoint coordinate of the same large event present across<br />
multiple genomes. For deletions, Breseq and BreakSeek<br />
consistently identified more than 96% of all simulated<br />
deletions regardless of size. This number ranged from 87% to<br />
93% in Pilon and correlated with decreasing deletion size.<br />
Breseq and Pilon correctly predicted the exact breakpoint<br />
coordinate for about two-thirds of all identified simulated<br />
indels. This number ranged from 1% to 7% in BreakSeek calls<br />
and decreased with increasing deletion size.<br />
For insertions, MindTheGap consistently identified<br />
approximately 97% of all simulated insertions, whereas Pilon<br />
performed worse, identifying between 69% and 93% of them;<br />
again, the number of missed calls increased with insertion<br />
size. Both tools correctly predicted the exact breakpoint<br />
coordinate for about two-thirds of all identified simulated<br />
indels. Nevertheless, we found 99% of the predicted<br />
breakpoint coordinates made by the four tools were within<br />
10nt of the expected breakpoint coordinate.<br />
Our results also indicate that Pilon, Breseq, BreakSeek, and<br />
MindTheGap are robust when predicting the genotype of<br />
large indels across multiple samples. The large majority of<br />
identified simulated deletions had a size and genotype<br />
similarity of more than 98%. For insertions, the size similarity<br />
varied widely in both MindTheGap and Pilon<br />
calls, indicating that both tools have difficulty<br />
determining the exact length of an insertion sequence.<br />
Overall, these results show that breakpoint detection is<br />
precise when identifying deletions and insertions of any size.<br />
Therefore, a simple normalization procedure—such as leftmost-overlap<br />
normalization across samples—will ensure<br />
consistent breakpoint location for identical large events. This<br />
will enable researchers to incorporate large variants into<br />
existing association pipelines, opening novel opportunities to<br />
associate large variants with phenotype and population<br />
structure.<br />
REFERENCES<br />
Barrick,J.E. et al. (2014) Identifying structural variation in haploid<br />
microbial genomes from short-read resequencing data using breseq.<br />
BMC Genomics, 15, 1039.<br />
Rizk,G. et al. (2014) MindTheGap: integrated detection and assembly of<br />
short and long insertions. Bioinformatics, 30, 1–7.<br />
Walker,B.J. et al. (2014) Pilon: an integrated tool for comprehensive<br />
microbial variant detection and genome assembly improvement.<br />
PLoS One, 9, e112963.<br />
Zhang,H. et al. (2013) Genome sequencing of 161 Mycobacterium<br />
tuberculosis isolates from China identifies genes and intergenic<br />
regions associated with drug resistance. Nat. Genet., 45, 1255–60.<br />
Zhao,H. and Zhao,F. (<strong>2015</strong>) BreakSeek: a breakpoint-based algorithm for<br />
full spectral range INDEL detection. Nucleic Acids Res., 1–13.<br />
P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES<br />
FOR PREDICTING CLINICAL CODES<br />
Elyne Scheurwegs 1,3* , Kim Luyckx 2 , Léon Luyten 2 , Walter Daelemans 3 & Tim Van den Bulcke 1 .<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Antwerp University Hospital 2 ; Center<br />
for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp 3 ; * elyne.scheurwegs@uantwerpen.be<br />
Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to<br />
various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both<br />
in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple<br />
days during the stay). This work studies the complementarity of information derived from these different sources to<br />
enhance clinical code prediction.<br />
INTRODUCTION<br />
The increased accessibility of healthcare data through the<br />
large-scale adoption of electronic health records stimulates<br />
the development of algorithms that monitor hospital<br />
activities, such as clinical coding applications.<br />
Clinical coding consists of translating the information<br />
found in a patient file into diagnostic and procedural codes<br />
originating from a medical ontology.<br />
In our work, we investigate if unstructured (textual) and<br />
structured data sources, present in electronic health<br />
records, can be combined to assign clinical diagnostic and<br />
procedural codes (specifically ICD-9-CM) to patient stays.<br />
Our main objective is to evaluate if integrating these<br />
heterogeneous data types improves prediction strength<br />
compared to using the data types in isolation.<br />
METHODS<br />
Several datasets were collected from the clinical data<br />
warehouse of the Antwerp University Hospital (UZA).<br />
The resulting dataset consists of a randomized subset of<br />
anonymized data of patient stays, in 14 different medical<br />
specialties. Two separate data integration approaches were<br />
evaluated on each dataset from a medical specialty.<br />
With early data integration, multiple sources are combined<br />
prior to training a model. This is achieved by using a<br />
single bag of features that are given to the prediction<br />
pipeline. Feature selection is performed with tf-idf for<br />
unstructured sources and gain ratio and minimum<br />
redundancy, maximum relevance (mRMR) for structured<br />
source filtering.<br />
The late data integration method trains a separate model<br />
on each data source, and then combines the prediction<br />
output for each code in a meta-learner. This meta-learner<br />
is mainly used to find which sources perform best for a<br />
certain code.<br />
The prediction task in both approaches was cast as a multiclass<br />
classification task, in which an array of binary<br />
predictions was made (one for each clinical code).<br />
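The late integration scheme can be illustrated with a deliberately simple stand-in for the meta-learner. The abstract does not specify the type of meta-learner, so the F-measure-weighted vote below is an assumption, and the source names and predictions are made up.

```python
def f_measure(pred, gold):
    """F1 score for binary predictions (lists of 0/1)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def fit_meta(val_preds, val_gold):
    """val_preds: {source: [0/1 per patient stay]} for ONE clinical code.
    Each source is weighted by its F-measure on held-out stays."""
    return {src: f_measure(p, val_gold) for src, p in val_preds.items()}

def combine(weights, preds, threshold=0.5):
    """Weighted vote over the per-source predictions for one patient stay."""
    total = sum(weights.values())
    if total == 0:
        return 0
    vote = sum(weights[s] * preds[s] for s in weights) / total
    return int(vote >= threshold)

# Hypothetical per-source predictions for one code on six validation stays.
val_gold = [1, 0, 1, 1, 0, 0]
val_preds = {
    "discharge_letter": [1, 0, 1, 0, 0, 0],
    "lab_results": [1, 1, 0, 0, 1, 0],
    "medication": [0, 0, 0, 0, 0, 0],
}
weights = fit_meta(val_preds, val_gold)
decision = combine(weights, {"discharge_letter": 1, "lab_results": 0, "medication": 0})
```

In this toy case the source that never predicts the code receives a weight of zero, so adding it does not change the combined decision.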
RESULTS & DISCUSSION<br />
Late data integration improves the prediction of ICD-9-CM<br />
diagnostic codes compared to the best<br />
individual prediction source (the overall F-measure<br />
increased from 30.6% to 38.3%). Early data integration<br />
does not show this trend and only performs well with a<br />
limited number of combinations of sources. ICD-9-CM<br />
procedure codes also show this trend, with the exception<br />
of the RIZIV data source, which shows a better prediction<br />
when used individually. The predictive strength of the<br />
models varies strongly between different medical<br />
specialties.<br />
The results show that the data sources, independent of<br />
their structured or unstructured nature, are able to provide<br />
complementary information when predicting ICD-9-CM<br />
codes, particularly when combined within the late data<br />
integration approach. This approach also allows for<br />
including as many sources as possible, as<br />
including a source that does not contain any additional<br />
information barely influences the end result. This is an<br />
advantage when the information content of a data source is<br />
not previously known. A disadvantage is the loss of<br />
information due to the strong generalisation as each data<br />
source is effectively reduced to a single feature for the<br />
meta-learner.<br />
Early data integration seems to suffer when combining<br />
sources that have features with a largely differing<br />
information content and different numbers of features. An<br />
unstructured data source typically renders 30,000<br />
different, weak features, while a structured source often<br />
contains only 500 different features.<br />
CONCLUSIONS<br />
Models using multiple electronic health record data<br />
sources systematically outperform models using data<br />
sources in isolation in the task of predicting ICD-9-CM<br />
codes over a broad range of medical specialties.<br />
ACKNOWLEDGEMENT<br />
This work is supported by a doctoral research grant (nr.<br />
131137) by the Agency for Innovation by Science and<br />
Technology in Flanders (IWT). The datasets used in this<br />
research were made available by the Antwerp University<br />
Hospital (UZA) for restricted use.<br />
REFERENCES<br />
Scheurwegs, E et al. Data integration of structured and unstructured<br />
sources for assigning clinical codes to patient stays. Journal of the<br />
American Medical Informatics Association (<strong>2015</strong>): ocv115.<br />
P52. SUPERVISED TEXT MINING FOR DISEASE AND GENE LINKS<br />
Jaak Simm 1,2,3* , Adam Arany 1,2 , Sarah ElShal 1,2 & Yves Moreau 1,2 .<br />
Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing, and Data<br />
Analytics, KU Leuven, Kasteelpark Arenberg 10, box 2446, 3001 Leuven, Belgium 1 ; iMinds Medical IT, Kasteelpark<br />
Arenberg 10, box 2446, 3001 Leuven, Belgium 2 ; Institute of Gene Technology, Tallinn University of Technology,<br />
Akadeemia tee 15A, Estonia 3 . * jaak.simm@esat.kuleuven.be<br />
Scientific publications contain rich information about genetic disorders. Text mining these publications provides an<br />
automatic way to quickly query and summarize the information. We propose a supervised learning approach that takes<br />
advantage of the well-known unsupervised approach TF-IDF (term frequency–inverse document frequency) and<br />
integrates it with a supervised approach using the logistic loss as error metric. The preliminary results on an OMIM dataset look<br />
promising.<br />
INTRODUCTION<br />
Scientific publications contain rich information about<br />
genetic disorders. Text mining these publications provides<br />
an automatic way to quickly query and summarize the<br />
information.<br />
The traditional approaches employ unsupervised text<br />
mining approaches like TF-IDF (term frequency–inverse<br />
document frequency) or Latent Dirichlet Allocation<br />
(LDA) by Blei et al. (2003) for linking terms to genes and<br />
diseases. Beegle (ElShal et<br />
al., <strong>2015</strong>), a recent text mining tool developed for linking diseases and genes, has<br />
taken this approach, using TF-IDF as its similarity metric.<br />
PROPOSED METHOD<br />
Our work proposes a supervised learning of the<br />
importance of the textual terms, which can automatically<br />
filter out many terms that are unnecessary for the task at<br />
hand. We formulate it as a prediction of supervised values<br />
y given the terms for all genes g and all diseases d where i<br />
is the index of the term:<br />
and w_i is the weight for term i and σ is the sigmoid<br />
function. The main idea is to learn the weight vector w that<br />
minimizes the difference between known values y and<br />
predictions. The minimization can be transformed into a<br />
logistic regression.<br />
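A minimal sketch of such a supervised term weighting is given below. It assumes that each gene-disease pair is represented by the elementwise product of the gene's and the disease's TF-IDF vectors and that the prediction is the sigmoid of the weighted sum over terms; the abstract does not spell out this representation, so both the feature construction and the toy vectors are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(gene_tfidf, disease_tfidf):
    # Assumed pair representation: elementwise product of the two TF-IDF vectors.
    return [g * d for g, d in zip(gene_tfidf, disease_tfidf)]

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(pairs, labels, n_terms, lr=0.5, epochs=500):
    """Fit the term weights w by gradient descent on the mean logistic loss."""
    w = [0.0] * n_terms
    for _ in range(epochs):
        grad = [0.0] * n_terms
        for x, y in zip(pairs, labels):
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

# Toy vocabulary of three terms; term 2 occurs everywhere and is uninformative.
g1, g2 = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
d1, d2 = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
pairs = [features(g1, d1), features(g2, d2), features(g1, d2), features(g2, d1)]
labels = [1, 1, 0, 0]  # 1 = known gene-disease link, 0 = sampled negative

w = train(pairs, labels, n_terms=3)
```

In this toy setting the terms shared only by linked pairs receive positive weights, while the term common to all pairs is pushed below zero, which is the filtering effect described in the text.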
For the supervised values we use the OMIM database<br />
(Hamosh et al., 2005). More specifically, y corresponds to<br />
1 if there is a link between the given gene-disease pair and<br />
0 if there is no link. Intuitively, in this setup the text<br />
mining is transformed into a classification problem. We<br />
use a dataset of 330 OMIM terms and their linked genes and<br />
randomly sample genes as negatives for each disease.<br />
For the textual terms we use MEDLINE abstracts as the<br />
source of biomedical text. We employ MetaMap (Aronson<br />
& Lang, 2010) to link terms with abstracts. We use GeneRIF<br />
to link genes with abstracts, and PubMed to link diseases<br />
with abstracts. We apply a TF-IDF transformation to score<br />
a term with a given disease or gene based on the abstracts<br />
linked to each entity. We only use the terms linked to<br />
abstracts that belong to genes. Hence our vocabulary<br />
consists of 66,883 terms.<br />
RESULTS & DISCUSSION<br />
The preliminary results show that supervised learning<br />
automatically picks up the keywords that are<br />
informative, improving the recall of the genes that are<br />
related to genetic disorders. We will present more detailed<br />
results in the poster.<br />
We are also investigating how to integrate the supervised<br />
approach into answering the online queries handled by<br />
Beegle.<br />
REFERENCES<br />
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet<br />
allocation. Journal of Machine Learning Research, 3, 993-1022.<br />
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., & McKusick,<br />
V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a<br />
knowledgebase of human genes and genetic disorders. Nucleic acids<br />
research, 33(suppl 1), D514-D517.<br />
ElShal, S., Tranchevent L.C., Sifrim A., Ardeshirdavani A., Davis J.,<br />
Moreau Y. (<strong>2015</strong>). Beegle: from literature mining to disease-gene<br />
discovery. Nucleic Acids Res, gkv905.<br />
Aronson, A. R., & Lang, F. M. (2010). An overview of MetaMap:<br />
historical perspective and recent advances. Journal of the American<br />
Medical Informatics Association, 17(3), 229-236.<br />
P53. FLOWSOM WEB: A SCALABLE ALGORITHM TO VISUALIZE AND<br />
COMPARE CYTOMETRY DATA IN THE BROWSER<br />
Arne Soete 2 , Sofie Van Gassen 1,2,3 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 .<br />
Department of Information Technology, Ghent University-iMinds, Ghent, Belgium 1 ; Inflammation Research Center, VIB,<br />
Ghent, Belgium 2 ; Department of Respiratory Medicine, Ghent University Hospital, Ghent, Belgium 3 .<br />
We developed FlowSOM Web, a web-tool which visualizes cytometry data based on Self-Organizing Maps. Similar cells<br />
are clustered and visualized via star charts. This allows us to process and display millions of cells efficiently.<br />
Additionally, different biological samples (e.g. healthy versus diseased mice) can be compared.<br />
INTRODUCTION<br />
Cytometry data describes cell characteristics in<br />
biological samples. Cells are labeled with fluorescent<br />
antibodies and a flow cytometer measures the properties<br />
of millions of cells one by one. Biologists use this<br />
information to get more insight into diseases and to<br />
diagnose patients. Most of them still analyse this data<br />
manually to differentiate between the different cell types<br />
present. This is done by plotting the data in 2D scatter<br />
plots and selecting groups of cells in a hierarchical way.<br />
This process is called `gating'. Recently, the number of<br />
properties that can be measured simultaneously has<br />
strongly increased. As the number of possible 2D scatter<br />
plots grows quadratically with the number of<br />
properties measured, it becomes infeasible to analyze<br />
them all and relevant information that is present in the<br />
data might be missed.<br />
METHODS<br />
We present FlowSOM, a new algorithm for the<br />
visualization and interpretation of cytometry data (Van<br />
Gassen et al., <strong>2015</strong>). Using a two-level clustering and<br />
star charts, our algorithm helps to obtain a clear<br />
overview of how all markers are behaving on all cells,<br />
and to detect subsets that might be missed otherwise.<br />
Our algorithm consists of 4 steps: pre-processing the<br />
data, building a self-organizing map, building a minimal<br />
spanning tree and computing a meta-clustering result.<br />
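Two of the four steps listed above, the self-organizing map and the minimal spanning tree, can be sketched in plain Python. This is an illustrative toy, not the FlowSOM code: pre-processing and meta-clustering are omitted, and the grid size, learning-rate schedule and simulated two-marker data are assumptions.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(nodes, cell):
    """Index of the best-matching node for one cell."""
    return min(range(len(nodes)), key=lambda k: dist(nodes[k], cell))

def train_som(cells, rows, cols, epochs=20, seed=0):
    """Online SOM: pull the best-matching node and its grid neighbours
    towards each cell, with a shrinking learning rate and radius."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(cells)) for _ in range(rows * cols)]
    order = list(cells)
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.05 + 0.45 * frac
        radius = 0.5 + (max(rows, cols) / 2) * frac
        rng.shuffle(order)
        for cell in order:
            br, bc = divmod(nearest(nodes, cell), cols)
            for k, node in enumerate(nodes):
                r, c = divmod(k, cols)
                grid_d = math.hypot(r - br, c - bc)
                if grid_d <= radius:
                    h = lr * math.exp(-grid_d ** 2 / (2 * radius ** 2))
                    for j in range(len(node)):
                        node[j] += h * (cell[j] - node[j])
    return nodes

def minimal_spanning_tree(nodes):
    """Prim's algorithm on the complete graph of node prototypes."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(nodes):
        a, b = min(
            ((i, j) for i in in_tree for j in range(len(nodes)) if j not in in_tree),
            key=lambda e: dist(nodes[e[0]], nodes[e[1]]),
        )
        in_tree.add(b)
        edges.append((a, b))
    return edges

# Simulated cells: two populations measured on two markers (made-up data).
rng = random.Random(42)
cells = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
cells += [(rng.gauss(3, 0.3), rng.gauss(3, 0.3)) for _ in range(100)]

nodes = train_som(cells, rows=3, cols=3)
assignment = [nearest(nodes, c) for c in cells]  # cluster index per cell
tree = minimal_spanning_tree(nodes)              # edges between SOM nodes
```

Because every cell is summarised by its nearest node, the tree is built over a handful of prototypes rather than over millions of cells, which is what keeps this kind of visualization cheap.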
RESULTS & DISCUSSION<br />
Although our results are quite similar to SPADE, another<br />
state-of-the-art algorithm for the visualization of<br />
cytometry data, our results can be computed much faster<br />
and use less memory. By providing star-charts and an<br />
automatic meta-clustering step, much more information<br />
can be visualised in a single tree than is done by the<br />
SPADE algorithm.<br />
Additionally, multiple states can be compared (e.g.<br />
healthy versus diseased mice) with one another and the<br />
differences between the two states can be visualized via<br />
star-charts.<br />
At this conference, we would like to demonstrate a<br />
recently developed web interface to the underlying R<br />
functionality. This interface allows users to upload cytometry<br />
data, run the aforementioned analysis, compare different<br />
cell states and explore the results, via interactive<br />
visualizations, all from the comfort of the browser.<br />
FIGURE 1. Example of a FlowSOM star chart.<br />
REFERENCES<br />
Van Gassen, et al. (<strong>2015</strong>), FlowSOM: Using self-organizing maps for<br />
visualization and interpretation of cytometry data. Cytometry Part A,<br />
87: 636–645<br />
P54. TOWARDS A BELGIAN REFERENCE SET<br />
Erika Souche 1* , Amin Ardeshirdavani 2 , Yves Moreau 2 , Gert Matthijs 1 & Joris Vermeesch 1 .<br />
Department of Human Genetics, KU Leuven 1 ; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and<br />
Data Analytic, KU Leuven 2 . * Erika.souche@uzleuven.be<br />
Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous<br />
sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved<br />
from sequencing to variant interpretation and classification. Although publicly available databases of variant<br />
frequencies help distinguish causative mutations from common variants, they often lack population-specific variant<br />
frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence<br />
data of unrelated and unaffected individuals to filter out common local variants. However, the<br />
files/databases are rarely shared and they are mainly based on whole exome data. In this project we demonstrate the<br />
utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of<br />
information between genetic centers and mine low coverage whole genome data for common variants.<br />
INTRODUCTION<br />
Next-Generation Sequencing (NGS) is increasingly used<br />
to study and diagnose human disorders. Because the simultaneous<br />
sequencing of a large number of genes leads to the<br />
detection of a large number of variants, the bottleneck has<br />
moved from sequencing to variant interpretation and<br />
classification. Publicly available databases of variant<br />
frequencies provided by, among others, the Exome<br />
Sequencing Project (ESP), the 1000 Genomes Project<br />
(McVean et al., 2012) or dbSNP (Sherry et al., 2001) help<br />
distinguish causative mutations from common variants,<br />
identifying up to 78% of variants as common for a Belgian<br />
exome. However, these data sets often lack population-specific<br />
variant frequencies and are outperformed by<br />
databases of local variants. For example, using GoNL<br />
(The Genome of the Netherlands Consortium, 2014) alone<br />
allowed the identification of up to 85% of variants as<br />
common for the same Belgian exome. The fact that the<br />
GoNL is based on only 498 individuals further highlights<br />
the importance of building and using population-specific<br />
databases.<br />
Such population-specific data can be retrieved from locally<br />
sequenced individuals that underwent Whole Exome<br />
Sequencing (WES) or Whole Genome Sequencing (WGS).<br />
Storing only the frequencies and genotype counts of the<br />
variants provides a valuable tool for variant classification<br />
while no sensitive information on the individuals is<br />
included.<br />
METHODS<br />
WES data of 350 unrelated and unaffected individuals<br />
were parsed. All samples were analysed in a similar<br />
way, i.e. reads were aligned to the reference genome with<br />
BWA (Li & Durbin, 2009) and genotyping was performed<br />
according to GATK best practices (McKenna et al., 2010;<br />
DePristo et al., 2011). All samples were genotyped at all<br />
polymorphic positions using GATK HaplotypeCaller and<br />
GenotypeGVCFs. For each position, samples with a low-quality<br />
genotype were considered as not genotyped and<br />
excluded from the genotype counts. The number of<br />
alternate alleles, allele counts and genotypes were<br />
compiled in a population VCF file, in which individual<br />
genotypes are not accessible.<br />
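The aggregation described above can be sketched as follows. This is an illustration, not the authors' pipeline: the genotype-quality cut-off of 20 and the field names are assumptions, since the abstract does not state the threshold or the exact fields stored.<br />

```python
from collections import Counter

MIN_GQ = 20  # assumed quality cut-off; the abstract does not give the value used

def summarise_site(genotypes):
    """Aggregate per-sample calls at one site into anonymous counts.

    genotypes: list of (alt_allele_count, genotype_quality) per sample, where
    alt_allele_count is 0 (hom-ref), 1 (het) or 2 (hom-alt).
    """
    # Low-quality genotypes are treated as not genotyped and dropped.
    kept = [alt for alt, gq in genotypes if gq >= MIN_GQ]
    counts = Counter(kept)
    allele_number = 2 * len(kept)   # chromosomes actually genotyped
    allele_count = sum(kept)        # alternate alleles observed
    frequency = allele_count / allele_number if allele_number else None
    return {
        "hom_ref": counts[0], "het": counts[1], "hom_alt": counts[2],
        "AC": allele_count, "AN": allele_number, "AF": frequency,
    }

# Five hypothetical samples; the fourth is dropped for low quality.
site = summarise_site([(0, 99), (1, 45), (2, 60), (1, 5), (0, 30)])
```

Only the summary dictionary would be written out, so no individual genotype can be recovered from the shared file.<br />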
Variant frequencies can also be extracted from low<br />
coverage WGS. As a pilot, we processed the data of<br />
chromosome 21 from about 4,000 WGS samples. The mapping was<br />
performed with BWA (Li & Durbin, 2009) and the BAM<br />
files were merged per 200 samples. All positions were<br />
genotyped using freebayes (Garrison & Marth, 2012).<br />
Genotype information for all locations outside low<br />
complexity regions was then compiled for all samples<br />
using the integration of Apache Hadoop, HBase and Hive<br />
(see poster “Big data solutions for variant discovery from<br />
low coverage sequencing data, by integration of Hadoop,<br />
Hbase and Hive”). Several models were then used to<br />
distinguish real variants from sequencing errors: the Minor<br />
Allele Frequency (MAF), the transition/transversion ratio,<br />
the expected number of loci with a MAF of 5%, etc.<br />
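As an illustration of one of these quality indicators, the transition/transversion ratio of a call set can be computed as below. This is a generic sketch with made-up variants, not the code used in the pilot.<br />

```python
BASES = set("ACGT")
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Return the transition/transversion ratio for (ref, alt) single-base pairs.

    Non-SNV records (indels, unknown bases, ref == alt) are skipped.
    Returns None when no transversion is present.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None

# Hypothetical call set: four transitions and two transversions.
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("T", "C")]
ratio = ti_tv_ratio(calls)
```

A call set dominated by sequencing errors tends to show a lower ratio than one made of real variants, which is why the ratio is useful as a filter.<br />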
RESULTS & DISCUSSION<br />
We demonstrated the effect of our reference set on several<br />
exomes. The inclusion of only 350 individuals allowed the<br />
identification of about 3% additional common variants,<br />
not listed as common by ESP, dbSNP (Sherry et al., 2001),<br />
1000 Genomes (McVean et al., 2012) and GoNL (The<br />
Genome of the Netherlands Consortium, 2014). Since only<br />
the frequencies of the variants in the screened populations<br />
are reported, this file can easily be shared between<br />
laboratories. Besides, the procedure used to generate the<br />
population VCF file can easily be applied to several<br />
genetic centers in order to generate a common population<br />
VCF file, as planned within the BeMGI project.<br />
Finally, we expect that the data from WGS will further<br />
increase the performance of our reference set. A genome-wide<br />
variant frequency file from the local population will<br />
become worthwhile when WGS is routinely used in<br />
diagnostics.<br />
REFERENCES<br />
DePristo M et al. Nature Genetics 43, 491-498 (2011).<br />
Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle,<br />
WA (URL: http://evs.gs.washington.edu/EVS/).<br />
Garrison E & Marth G http://arxiv.org/abs/1207.3907 (2012).<br />
Li H & Durbin R Bioinformatics 25, 1754-60 (2009).<br />
McKenna A et al. Genome Research 20, 1297-303 (2010).<br />
McVean et al. Nature 491, 56–65 (2012).<br />
Sherry ST, et al. Nucleic Acids Res. 29, 308-11 (2001).<br />
The Genome of the Netherlands Consortium. Nature Genetics 46,<br />
818–825 (2014).<br />
P55. MANAGING BIG IMAGING DATA FROM MICROSCOPY:<br />
A DEPARTMENTAL-WIDE APPROACH<br />
Yves Sucaet 1* , Silke Smeets 1 , Stijn Piessens 1 , Sabrina D’Haese 1 , Chris Groven 1 , Wim Waelput 1 & Peter In’t Veld 1 .<br />
Department of Pathology 1 , Faculty of Medicine, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium.<br />
* yves.sucaet@usa.net<br />
With recent breakthroughs in whole slide imaging (WSI), almost any microscopic material can be digitized in an<br />
efficient manner. In order to mine these data efficiently, a top-down approach was employed to manage various imaging<br />
platforms. At Brussels Free University (VUB), we built a centralized infrastructure that integrates a variety of imaging<br />
platforms (brightfield, fluorescence, multi-vendor formats). With the help of the Pathomation software platform for<br />
digital microscopy, various datastores and image repositories were integrated. Custom coding was used to interact with<br />
various vendor-software and server applications, where needed. The end-result is an interconnected network of<br />
heterogeneous scalable information silos. We currently have two main use cases for WSI: education and biobanking.<br />
These applications are available to the public via http://www.diabetesbiobank.org.<br />
INTRODUCTION<br />
Too often, image analysis and data/image mining projects<br />
remain stuck in micro-environments because they are<br />
limited by vendor-specific solutions that neither scale nor<br />
interact with material from other departments or<br />
institutions. Successful roll-out of digital histopathology<br />
therefore requires more than a whole slide scanner.<br />
If the goal is for an imaging facility to allow a researcher<br />
to conduct a (microscopic) experiment, then that<br />
researcher should not be hindered by the imaging platform<br />
used. Similarly, an instructor integrating digital content<br />
into a course should be able to make the<br />
materials as accessible as possible to as many students as<br />
possible.<br />
At Brussels Free University (VUB), we currently have two<br />
main use cases for whole slide imaging: education and<br />
biobanking. We have set these up in such a way that they<br />
are both scalable and expandable.<br />
METHODS<br />
Whole slide imaging (WSI) has recently provided a boost<br />
to digital capturing of microscopic content (and an<br />
explosion of data, resulting in a veritable digital treasure<br />
trove waiting to be explored by bioinformatics). But<br />
researchers have been digitizing content for a long time<br />
already through various technologies (mounted cameras,<br />
inverted fluorescent microscopes with low magnification,<br />
…).<br />
We envisioned an environment whereby a researcher can<br />
manage and view all of the material related to an<br />
experiment or observation from a single interface,<br />
irrespective of origin or technology used.<br />
The following steps were taken to accomplish this:<br />
• Set up a central server (50 TB storage)<br />
• Centrally store all imaging data and provide mapped<br />
drives on the individual workstations to facilitate<br />
a smooth transition for end-users<br />
• Install the Pathomation platform for digital<br />
microscopy (PMA.core, PMA.view, PMA.zui)<br />
for universal viewing of digital content and to<br />
provide a uniform end-user experience<br />
• Install Pydio (open source) for easy sharing of<br />
digital imaging content (integrated with<br />
Pathomation’s PMA.core so no duplicate user<br />
directories need to be maintained)<br />
• Build custom portals to highlight specific<br />
collections of microscopic content and/or serve<br />
specific target audiences<br />
RESULTS & DISCUSSION<br />
The centralized digital imaging infrastructure is used by<br />
various researchers and graduate students. Recently over<br />
3,000 images were processed and hosted in the course of<br />
one month.<br />
Two use cases are worth highlighting:<br />
• For undergraduate students (Medicine, BMS) we<br />
built custom portal websites to supplement their<br />
courses in histology and pathology. These sites<br />
are available at http://histology.vub.ac.be and<br />
http://pathology.vub.ac.be and provide students<br />
with (guided) virtual microscopy without the<br />
need to install any additional software.<br />
• We also provide access portals to different<br />
specialized biobanks. The Willy Gepts collection<br />
represents a historic milestone in diabetes<br />
research (http://gepts.vub.ac.be) and is<br />
complementary to the Alan Foulis collection<br />
(http://foulis.vub.ac.be). Furthermore, the clinical<br />
diabetes biobank can now be consulted online,<br />
too, via http://www.diabetesbiobank.org.<br />
CONCLUSION<br />
Digital histopathology has been around for some time now,<br />
but often results in heterogeneous data collections. It is<br />
only now that we are starting to look at integrated approaches to<br />
how these varied data can best be handled. Digital pathology<br />
involves much more than the acquisition of a slide scanner.<br />
We have integrated five different imaging platforms into a<br />
single architecture. We are storing data from all modalities<br />
in a single storage facility, and manage them through a single<br />
access point. The resulting environment assists in<br />
rendering content to any type of display device, without<br />
the need for extra software or background information<br />
concerning the content’s origin.<br />
P56. ESTIMATING THE IMPACT OF CIS-REGULATORY VARIATION IN<br />
CANCER GENOMES USING ENHANCER PREDICTION MODELS AND<br />
MATCHED GENOME-EPIGENOME-TRANSCRIPTOME DATA<br />
Dmitry Svetlichnyy 1* , Hana Imrichova 1 , Zeynep Kalender Atak 1 & Stein Aerts 1 .<br />
Laboratory of Computational Biology, University of Leuven 1 . *dmitry.svetlichnyy@med.kuleuven.be<br />
The prioritization of candidate driver mutations in the non-coding part of the genome is a key challenge in cancer<br />
genomics. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on<br />
their recurrence, non-coding mutations are usually not recurrent at the same position. We aim to tackle this problem<br />
using machine-learning methods to predict regulatory regions, together with cancer genome sequences and sample-specific<br />
chromatin profiles obtained using ChIP-seq against H3K27Ac.<br />
INTRODUCTION<br />
Perturbations of gene regulatory networks in cancer cells<br />
can arise from mutations in transcription factors or cofactors,<br />
but also from mutations in regulatory regions.<br />
Prioritizing candidate driver mutations that have a<br />
significant impact on the activity of a regulatory region is<br />
a key challenge in cancer genomics.<br />
METHODS<br />
We have developed enhancer prediction methods using<br />
Random Forest classifiers to estimate the Predicted<br />
Regulatory Impact of a Mutation in an Enhancer<br />
(PRIME). We find that the recently identified driver<br />
mutation in the TAL1 enhancer has a high PRIME score,<br />
representing a “gain-of-target” for the oncogenic<br />
transcription factor MYB [1]. We trained enhancer models<br />
for 45 cancer-related transcription factors, and used these<br />
to score somatic mutations across more than five hundred<br />
breast cancer genomes. Next, we re-sequenced the genome<br />
of ten cancer cell lines representing six different cancer<br />
types (breast, lung, melanoma, ovarian, and colon) and<br />
profiled their active chromatin by ChIP-seq against<br />
H3K27Ac.<br />
RESULTS & DISCUSSION<br />
Then we integrated these data with matched expression<br />
data and with the Random Forest model predictions for<br />
sets of oncogenic transcription factors per cancer type.<br />
This resulted in surprisingly few high-impact mutations<br />
that generate de novo regulatory (oncogenic) activity at<br />
the chromatin and gene expression level. Our framework<br />
can be applied to identify candidate cis-regulatory<br />
mutations using sequence information alone, and to<br />
samples with combined genome-epigenome-transcriptome<br />
data. Our results suggest the presence of only a few cis-regulatory<br />
driver mutations per cancer genome<br />
that may alter the expression levels of specific oncogenes<br />
and tumor suppressor genes.<br />
REFERENCES<br />
1. Mansour MR, Abraham BJ, Anders L, Berezovskaya A, Gutierrez A,<br />
Durbin AD, et al. An oncogenic super-enhancer formed through somatic<br />
mutation of a noncoding intergenic element. Science. 2014;346: 1373–<br />
1377. doi:10.1126/science.1259037<br />
P57. I-PV: A CIRCOS MODULE FOR INTERACTIVE PROTEIN<br />
SEQUENCE VISUALIZATION<br />
Ibrahim Tanyalcin 1,2* , Carla Al Assaf 3 , Alexander Gheldof 1 , Katrien Stouffs 1,4 , Willy Lissens 1,4 & Anna C. Jansen 5,2 .<br />
Center for Medical Genetics, UZ Brussel, Brussels, Belgium 1 ; Neurogenetics Research Group, Vrije Universiteit Brussel,<br />
Brussels, Belgium 2 ; Center for Human Genetics, KU Leuven and University Hospitals Leuven, 3000 Leuven, Belgium 3 ;<br />
Reproduction, Genetics and Regenerative Medicine, Vrije Universiteit Brussel, Brussels, Belgium 4 ; Pediatric Neurology<br />
Unit, Department of Pediatrics, UZ Brussel, Brussels, Belgium 5 . *ibrahim.tanyalcin@i-pv.org or itanyalc@vub.ac.be<br />
Summary: Today’s genome browsers and protein databanks supply vast amounts of information about proteins. The<br />
challenge is to concisely bring together this information in an interactive and easy-to-generate format.<br />
Availability and Implementation: We have developed an interactive CIRCOS module called i-PV to visualize user-supplied<br />
protein sequence, conservation and SNV data in a live presentable format. i-PV can be downloaded from<br />
http://www.i-pv.org.<br />
INTRODUCTION<br />
Today’s genome browsers and protein databanks supply<br />
vast amounts of information about both the structural<br />
annotation and the single nucleotide variants (SNV) in<br />
genes. The challenge is to concisely bring together this<br />
information in an interactive and easy-to-generate format.<br />
Thus, we have developed an interactive CIRCOS<br />
(Krzywinski et al., 2009) module combined with D3 (Bostock et<br />
al., 2011) and plain JavaScript called i-PV to visualize user-supplied<br />
protein sequence, conservation and SNV data<br />
while significantly easing and automating input file<br />
requirements and generation.<br />
METHODS<br />
To use i-PV, only 4 text files (with “.txt” extension) have<br />
to be supplied to the software: conservation scores,<br />
protein and cDNA sequences, and SNVs/Indels files.<br />
Protein and cDNA (or mRNA) sequence files are supplied<br />
in FASTA format whereas SNV/Indel files are provided as<br />
annotated VCF (Variant Call Format) files. The conservation<br />
scores are simply an array of numbers separated by newline<br />
characters. Once the input files are supplied to i-PV, the data are<br />
automatically checked for errors or duplicates and<br />
matched against the user-provided FASTA files, and then an<br />
interactive HTML file containing the graph is automatically<br />
generated as shown in Fig. 1.<br />
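The kind of consistency check described above can be sketched as follows. This is not i-PV's own code: the function names are hypothetical, and the rule that there is exactly one conservation score per residue is an assumption made for the illustration.<br />

```python
def read_fasta_sequence(text):
    """Return the sequence of the first record in FASTA-formatted text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    seq = []
    for ln in lines[1:]:
        if ln.startswith(">"):   # stop at the next record
            break
        seq.append(ln)
    return "".join(seq)

def read_conservation_scores(text):
    """Parse newline-separated numeric conservation scores."""
    return [float(ln) for ln in text.splitlines() if ln.strip()]

def check_inputs(protein_fasta, scores_text):
    """Assumed rule: one conservation score per amino acid of the protein."""
    seq = read_fasta_sequence(protein_fasta)
    scores = read_conservation_scores(scores_text)
    if len(scores) != len(seq):
        raise ValueError(
            f"{len(scores)} scores supplied for {len(seq)} residues")
    return seq, scores

# Hypothetical four-residue protein with four scores.
seq, scores = check_inputs(">example\nMK\nTA\n", "0.9\n0.1\n0.5\n1\n")
```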
RESULTS & DISCUSSION<br />
Many sequence visualization tools focus on certain aspects<br />
of proteins such as conservation, variations, sequence<br />
alignments or topology. While all these tools are very<br />
useful in their own right, we pursued a more interactivity-based<br />
design. Therefore, i-PV is designed not solely for<br />
visualization but also for live presentable graphs and<br />
information that can selectively be displayed and<br />
customized. i-PV combines major sources of information<br />
under one HTML file that is easy to generate and share on<br />
both desktop and mobile environments.<br />
Last but not least, many visualization tools are based on<br />
a rectangular, scroll-based representation of information,<br />
which, unlike circular visualization, does not deliver a<br />
“wide angle” view of the sequence data. However, as<br />
with all other types of visualization, there are also<br />
limitations for circular graphs when it comes to<br />
conveniently zooming in to a particular region or visually<br />
aligning tracks with different radii. We intend to further<br />
develop this software with several other features based on<br />
end user needs. The current version of i-PV can be<br />
downloaded from http://www.i-pv.org.<br />
FIGURE 1. Overview of i-PV features. (A) SNVs with mouse over<br />
explanation and automatically generated dbSNP links (red: Nonsynonymous,<br />
green: Synonymous, gray: Not validated). (B) Console can<br />
be hidden for publication quality image. (C) Domains are colored based<br />
on user preference. (D) Conservation data from user generated<br />
alignment with mouse over information. (E) The user can define which<br />
amino acids to be shown on the sequence track. (F) Switch the color of<br />
the background to black. (G) Amino acids are plotted and split into 5<br />
main categories (nonpolar: gray circle, polar: magenta circle, negative:<br />
blue triangle, positive: red triangle, aromatic: green hexagon). (H)<br />
Adjustable conservation score threshold to display regions above a<br />
certain percentage of maximum conservation score. (I) Font-size of<br />
chosen amino acids can be adjusted. (J) User selectable amino acids to<br />
be displayed. (K) Up to 17 different amino acid properties can be chosen<br />
to be displayed from a drop-down menu. (L) Tile track showing SNVs and<br />
indels (red: SNVs, magenta: Indels, gray stroke: Not validated, black:<br />
collapsed due to over display). (M) Gene Name. (N) Buttons for mass<br />
selection of amino acids. (O) User defined regions are marked with<br />
custom name tag and mouse over information. (P) Meta-analysis of<br />
amino acid distributions. This information is only displayed in case of<br />
single amino acid comparisons. The log2 ratios are capped between -3<br />
and 3. The minimum and the maximum BLOSUM62 scores are -4 and 11.<br />
Since the BLOSUM62 matrix is diagonally symmetric, the absolute values of<br />
the log ratios are mapped to this range and a p-value is indicated based<br />
on how close the two scores are.<br />
REFERENCES<br />
Bostock, M., et al. (2011), 'D3: Data-Driven Documents', IEEE Trans.<br />
Visualization & Comp. Graphics (Proc. InfoVis).<br />
Krzywinski, M., et al. (2009), 'Circos: an information aesthetic for<br />
comparative genomics', Genome Res, 19 (9), 1639-45.<br />
P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY<br />
PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS<br />
Kevin Titeca 1,2 , Pieter Meysman 3,4 , Kris Gevaert 1,2 , Jan Tavernier 1,2 ,<br />
Kris Laukens 3,4 , Lennart Martens 1,2 & Sven Eyckerman 1,2* .<br />
Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, B-9000<br />
Ghent, Belgium 2 ; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium 3 ; Biomedical<br />
informatics research center Antwerpen (biomina), Belgium 4 . sven.eyckerman@vib-ugent.be<br />
Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-protein<br />
interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because<br />
of the many false positives. The ideal filter technique for these data is highly accurate, fast and user friendly without the<br />
need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased.<br />
Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward<br />
Filtering INdeX.<br />
We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX<br />
shows superior performance over the other approaches with accuracy increases of up to 20%, and is extremely fast. It<br />
does not require parameter optimization, and is absolutely independent of external resources. Both the algorithm and its<br />
website interface are highly intuitive with limited need for user input and the possibility of immediate network<br />
visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist<br />
interested in user-friendly and highly accurate filtering of AP-MS data.<br />
P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION:<br />
AN EXTREMELY IMBALANCED BIG DATA PROBLEM<br />
Isaac Triguero 1,2* , Sara del Río 3 , Victoria López 3 , Jaume Bacardit 4 , José M. Benítez 3 & Francisco Herrera 3 .<br />
VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Department of Computer<br />
Science and Artificial Intelligence 3 ; School of Computing Science, Newcastle University 4 .<br />
* Isaac.Triguero@irc.vib-Ugent.be<br />
The application of data mining and machine learning techniques to biological and biomedical data continues to be a<br />
ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and<br />
store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these<br />
problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these<br />
circumstances, known as imbalanced big data classification, may not be straightforward for most of the standard machine<br />
learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was<br />
concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches to deal<br />
with large amounts of data. The results show that this model is very suitable to tackle large-scale bioinformatics<br />
classification problems.<br />
INTRODUCTION<br />
The prediction of a protein’s contact map is a crucial step<br />
for the prediction of the complete 3D structure of a protein.<br />
This is one of the most challenging bioinformatics tasks<br />
within the field of protein structure prediction because of<br />
the sparseness of the contacts (i.e. few positive examples)<br />
and the great amount of data extracted (i.e. millions of<br />
instances, GBs of disk space) from a few thousand<br />
proteins.<br />
This problem is an imbalanced bioinformatics big<br />
data application, in which traditional machine learning<br />
techniques become ineffective and inefficient due to<br />
the large dimension of the problem. However, with the use of<br />
emerging cloud-based technologies, these techniques<br />
can be redesigned to extract valuable knowledge from<br />
such amounts of data.<br />
The ECBDL’14 competition (http://cruncher.ncl.ac.uk/bdcomp/)<br />
brought up a data set that modeled the contact<br />
map prediction problem as a classification task.<br />
Concretely, the training data set consisted of<br />
32 million instances, 631 attributes and 2 classes, with 98%<br />
negative examples, and it occupies about 56GB of disk<br />
space.<br />
In this work we describe the methodology with which we<br />
participated, under the name 'Efdamis', ranking as the<br />
winning algorithm (Triguero et al., <strong>2015</strong>).<br />
METHODS<br />
In the proposed methodology, we focused on the<br />
MapReduce (Dean & Ghemawat, 2008) paradigm in order to<br />
manage this voluminous data set. We extended the<br />
applicability of some pre-processing and classification<br />
models to deal with large-scale problems. The methodology is<br />
composed of four main parts:<br />
• An oversampling approach: The goal is to balance the<br />
highly skewed class distribution of the problem by<br />
randomly replicating the instances of the minority<br />
class (del Río et al., 2014).<br />
• An evolutionary feature weighting method: Due to the<br />
relatively high number of features of the given problem,<br />
we developed a feature selection scheme for large-scale<br />
problems that improves the classification<br />
performance by detecting the most significant features<br />
(Triguero et al., 2012).<br />
• Building a learning model: As classifier, we focused<br />
on a scalable RandomForest algorithm.<br />
• Testing the model: Even the test data can be<br />
considered big data (2.9 million instances), so<br />
the testing phase was also deployed within a parallel<br />
approach.<br />
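The oversampling idea in the first part can be illustrated with a small in-memory sketch. It is not the MapReduce implementation used in the competition: the function name is hypothetical, and replicating until both classes have the same size is an assumption about the target balance.<br />

```python
import random

def random_oversample(instances, labels, minority_label, seed=0):
    """Replicate randomly chosen minority instances until classes are balanced."""
    rng = random.Random(seed)
    minority = [x for x, y in zip(instances, labels) if y == minority_label]
    if not minority:
        raise ValueError("no instance carries the minority label")
    majority_size = sum(1 for y in labels if y != minority_label)
    n_extra = max(0, majority_size - len(minority))
    # Sampling with replacement: the same instance may be copied several times.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(instances) + extra, list(labels) + [minority_label] * n_extra

# Hypothetical toy data: 2 positive (contact) and 8 negative examples.
X = list(range(10))
y = [1, 1] + [0] * 8
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

In a MapReduce setting the same replication would be applied per data partition, which is what makes the step scalable.<br />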
RESULTS & DISCUSSION<br />
Table 1 presents the final results of the top 5 participants<br />
in terms of True Positive Rate (TPR) and True Negative<br />
Rate (TNR). In this particular problem, the necessity of<br />
balancing the TPR and TNR ratios emerged as a difficult<br />
challenge for most of the participants of the competition.<br />
In this sense, the use of scalable preprocessing techniques<br />
played an important role in improving the results of the<br />
RandomForest classifier. First, the designed oversampling<br />
approach allowed us to prevent RandomForest from being<br />
biased towards the negative class. Second, our feature weighting<br />
approach gave us the possibility of reducing the<br />
dimensionality of the problem by selecting the most<br />
relevant features. This resulted in better performance<br />
as well as a notable reduction of the time requirements.<br />
Team TPR TNR TPR * TNR<br />
Efdamis 0.73043 0.73018 0.53335<br />
ICOS 0.70321 0.73016 0.51345<br />
UNSW 0.69916 0.72763 0.50873<br />
HyperEns 0.64003 0.76338 0.48858<br />
PUC-Rio_ICA 0.65709 0.71460 0.46956<br />
TABLE 1: Comparison with the top 5 of the competition.<br />
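The last column of Table 1 is the product of the two rates, which rewards a balanced classifier. The short check below recomputes it from the published TPR and TNR values. Small differences in the fifth decimal are expected because the inputs are themselves rounded.<br />

```python
# Values copied from Table 1: team -> (TPR, TNR, reported TPR * TNR)
results = {
    "Efdamis": (0.73043, 0.73018, 0.53335),
    "ICOS": (0.70321, 0.73016, 0.51345),
    "UNSW": (0.69916, 0.72763, 0.50873),
    "HyperEns": (0.64003, 0.76338, 0.48858),
    "PUC-Rio_ICA": (0.65709, 0.71460, 0.46956),
}

def score(tpr, tnr):
    """Competition score: product of true positive and true negative rates."""
    return tpr * tnr

# Largest gap between the recomputed and the reported score.
max_deviation = max(abs(score(tpr, tnr) - reported)
                    for tpr, tnr, reported in results.values())

# Team with the highest recomputed score.
best_team = max(results, key=lambda t: score(results[t][0], results[t][1]))
```

HyperEns has the highest TNR of the five but a much lower TPR, which is why its product falls below that of the more balanced entries.<br />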
REFERENCES<br />
Dean J., Ghemawat S., MapReduce: simplified data processing on large<br />
clusters, Commun. ACM 51 (1), 107–113 (2008).<br />
del Río S., et al., On the use of MapReduce for imbalanced big data using<br />
random forest, Inf. Sci. 285 (2014) 112–137.<br />
Triguero I. et al., Integrating a differential evolution feature weighting<br />
scheme into prototype generation, Neurocomputing 97 (2012) 332–<br />
343.<br />
P60. COEXPNETVIZ: THE CONSTRUCTION AND VISUALISATION OF<br />
CO-EXPRESSION NETWORKS<br />
Oren Tzfadia 1,2 , Tim Diels 1,2,4 , Sam De Meyer 1,2 , Klaas Vandepoele 1,2 , Yves Van de Peer 1,2,3,5,* & Asaph Aharoni 6 .<br />
Department of Plant Systems Biology, VIB, 9052 Ghent, Belgium 1 ; Department of Plant Biotechnology and<br />
Bioinformatics, Ghent University, 9052 Ghent, Belgium 2 ; Genomics Research Institute (GRI), University of Pretoria,<br />
0028 Pretoria, South Africa 3 ; Department of Mathematics and Computer Science, University of Antwerp, Antwerp,<br />
Belgium 4 ; Bioinformatics Institute Ghent, Ghent University, 9052 Ghent, Belgium 5 ; Department of Plant Sciences and<br />
the Environment, Weizmann Institute of Science, Rehovot 6 .<br />
INTRODUCTION<br />
Comparative transcriptomics is a common approach in<br />
functional gene discovery efforts. It allows for finding<br />
conserved co-expression patterns between orthologous<br />
genes in closely related plant species, suggesting that these<br />
genes potentially share similar function and regulation.<br />
Several efficient co-expression-based tools have been<br />
commonly used in plant research but most of these<br />
pipelines are limited to data from model systems, which<br />
greatly limits their utility. Moreover, none of<br />
the existing pipelines allow plant researchers to make use<br />
of their own unpublished gene expression data for<br />
performing a comparative co-expression analysis and<br />
generating multi-species co-expression networks.<br />
RESULTS<br />
We introduce CoExpNetViz, a computational tool that<br />
takes as input a set of bait genes (chosen by the user)<br />
and at least one pre-processed gene expression<br />
dataset. The CoExpNetViz algorithm proceeds in three<br />
main steps: (i) for every bait gene submitted, co-expression<br />
values are calculated using Pearson correlation<br />
coefficients, (ii) non-bait (or target) genes are grouped<br />
based on cross-species orthology, and (iii) output files are<br />
generated and results can be visualized as network graphs<br />
in Cytoscape.<br />
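Step (i) can be illustrated with a minimal Python sketch. It is not the CoExpNetViz implementation: the gene names, the toy expression values and the fixed correlation cutoff of 0.8 are all assumptions made for illustration.<br />

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpressed_targets(expression, baits, cutoff=0.8):
    """Map each bait gene to the non-bait genes whose |r| reaches the cutoff.

    The cutoff value is an assumption of this sketch, not a tool default.
    """
    result = {}
    for bait in baits:
        result[bait] = {}
        for gene, profile in expression.items():
            if gene in baits:
                continue
            r = pearson(expression[bait], profile)
            if abs(r) >= cutoff:
                result[bait][gene] = round(r, 3)
    return result


# Hypothetical toy expression matrix: gene -> values across four samples.
expression = {
    "baitA": [1.0, 2.0, 3.0, 4.0],
    "geneX": [2.0, 4.0, 6.0, 8.0],  # rises with baitA
    "geneY": [4.0, 3.0, 2.0, 1.0],  # falls as baitA rises
    "geneZ": [1.0, 3.0, 1.0, 3.0],  # only weakly related to baitA
}
hits = coexpressed_targets(expression, baits={"baitA"})
```

Here geneX and geneY pass the cutoff (positive and negative co-expression respectively), while geneZ does not.<br />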
AVAILABILITY AND IMPLEMENTATION<br />
The CoExpNetViz tool is freely available both as a PHP<br />
web server<br />
(http://bioinformatics.psb.ugent.be/webtools/coexpr/;<br />
implemented in C++) and as a Cytoscape plugin<br />
(implemented in Java). Both versions of the CoExpNetViz<br />
tool support Linux and Windows platforms.<br />
P61. THE DETECTION OF PURIFYING SELECTION DURING TUMOUR<br />
EVOLUTION UNVEILS CANCER VULNERABILITIES<br />
Jimmy Van den Eynden 1* & Erik Larsson 1 .<br />
Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, The Sahlgrenska Academy, University<br />
of Gothenburg, Sweden. * jimmy.van.den.eynden@gu.se<br />
Identification of somatic mutation patterns indicative of positive selection arguably has become the major goal of cancer<br />
genomics. This is motivated by a search for cancer driver genes and pathways that are recurrently activated in tumours<br />
but not normal cells, thus providing possible therapeutic windows. However, cancer cells additionally depend on a large<br />
number of basic cellular processes, and elevated sensitivity to inhibition of certain essential non-driver genes has been<br />
demonstrated in some cases. While such vulnerability genes should in theory be identifiable based on strong purifying<br />
(negative) selection in tumors, these patterns have been elusive and purifying selection remains underexplored in cancer.<br />
We established a new methodology and, using mutational data from 25 TCGA tumor types, we show for the first time<br />
that negative selection in candidate vulnerability genes can be detected.<br />
INTRODUCTION<br />
Recently it was shown that a hemizygous deletion of the<br />
well-known tumour suppressor gene TP53 creates<br />
therapeutic vulnerability in colorectal cancer due to<br />
concomitant loss of the neighbouring gene POLR2A (Liu<br />
et al., <strong>2015</strong>).<br />
As any damaging mutation occurring in the single allele of<br />
a hemizygously deleted essential gene, like POLR2A, is<br />
expected to lead to cell death, we hypothesized that<br />
purifying selection in these genes could be unveiled by<br />
demonstrating a lower number of damaging mutations<br />
than would be expected in the absence of any selection.<br />
We therefore used the POLR2A case as a proof-of-concept<br />
to develop a methodology to detect purifying<br />
selection in large genome sequencing datasets.<br />
METHODS<br />
Mutation and copy number data from 25 different cancer<br />
types and 7,871 samples were downloaded from the<br />
TCGA data portal and pooled together in a large pan-cancer<br />
dataset. Different mutational functional impact<br />
scores were calculated using ANNOVAR. Copy number data<br />
were analyzed using GISTIC 2.0 to differentiate POLR2A<br />
copy number neutral from hemizygously deleted samples.<br />
RESULTS & DISCUSSION<br />
POLR2A was found to be hemizygously deleted in 29% of<br />
all samples. As expected, in over 99% of these cases the deletion was<br />
part of the TP53 (driving) deletion on chromosome 17.<br />
POLR2A was mutated 228 times in 2.3% of all samples.<br />
While 14 nonsense mutations and small out-of-frame<br />
insertions or deletions occurred in the copy number<br />
neutral group, none of these damaging mutations were<br />
found in the deletion group (p=0.03, Fisher's exact test),<br />
suggesting purifying selection against this type of<br />
mutation.<br />
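The kind of comparison described above can be sketched as a one-sided Fisher's exact test on a 2x2 table. The abstract does not report the full table, so the counts below are hypothetical and serve only to show the calculation; the function name is likewise an assumption of this sketch.<br />

```python
from math import comb


def fisher_exact_less(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X <= a) under the hypergeometric null, where X is the count in
    the top-left cell with all row and column totals held fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - total)  # smallest feasible top-left count
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(lowest, a + 1)
    )
    return numerator / comb(total, row1)


# Hypothetical counts, NOT taken from the abstract:
# rows    = hemizygously deleted group, copy number neutral group
# columns = truncating mutations, all other mutations
p_value = fisher_exact_less(0, 50, 14, 164)
```

A small p-value indicates that truncating mutations are rarer in the deleted group than expected if mutation type and deletion status were independent.<br />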
In addition to these truncating mutations, missense<br />
mutations that damage the function of the encoded<br />
protein are also expected to be selected against.<br />
Therefore we predicted the functional impact of all<br />
mutations using different functional impact scores. The<br />
median (PolyPhen-2) functional impact score was found<br />
to be significantly lower in the deletion group than in<br />
the copy number neutral group (p=0.002, Wilcoxon test,<br />
Figure 1), further confirming that purifying selection has<br />
taken place in POLR2A during tumour evolution.<br />
These preliminary findings confirm that purifying<br />
selection is detectable in vulnerability genes like POLR2A<br />
and this approach could be used to detect other, new<br />
candidate vulnerability genes.<br />
FIGURE 1. Negative selection against POLR2A high impact mutations in<br />
hemizygously deleted tumour samples.<br />
REFERENCES<br />
Liu, Y., Zhang, X., Han, C., Wan, G., Huang, X., Ivan, C., … Lu, X.<br />
(<strong>2015</strong>). TP53 loss creates therapeutic vulnerability in colorectal<br />
cancer. Nature, 520(7549), 697–701.<br />
http://doi.org/10.1038/nature14418<br />
P62. FLOREMI: SURVIVAL TIME PREDICTION<br />
BASED ON FLOW CYTOMETRY DATA<br />
Sofie Van Gassen 1,2,3* , Celine Vens 2,3,4 , Tom Dhaene 1 , Bart N. Lambrecht 2,3 & Yvan Saeys 2,3 .<br />
Department of Information Technology, Ghent University—iMinds 1 ; VIB Inflammation Research Center 2 ; Department of<br />
Respiratory Medicine, Ghent University 3 ; Department of Public Health and Primary Care, KU Leuven Kulak 4 .<br />
* sofie.vangassen@irc.vib-ugent.be<br />
Flow cytometry is a high-throughput technique for single cell analysis. It enables researchers and pathologists to study<br />
blood and tissue samples by measuring several cell properties, such as cell size, granularity and the presence of cellular<br />
markers. While this technique provides a wealth of information, it becomes hard to analyze all data manually. To<br />
investigate alternative automatic analysis methods, the FlowCAP challenges were organized. We will present an<br />
algorithm that obtained the best results on the FlowCAP IV challenge, predicting the time of progression to AIDS for<br />
HIV patients.<br />
INTRODUCTION<br />
The main task of the most recent FlowCAP IV challenge<br />
was a survival modeling challenge: participants had to<br />
predict the time of progression to AIDS for HIV patients,<br />
based on flow cytometry data of an unstimulated and a<br />
stimulated blood sample. Additionally, a secondary task<br />
was the identification of cell populations that could be<br />
indicative of this progression rate. Several challenges<br />
needed to be taken into account: the raw dataset was about<br />
20 GB in size and about eighty percent of the survival times<br />
were censored.<br />
METHODS<br />
We developed a new algorithm, FloReMi, which<br />
combined several preprocessing steps with a density-based<br />
clustering algorithm, a feature selection step and a random<br />
survival forest (Van Gassen et al., <strong>2015</strong>).<br />
The input for our algorithm consisted of 2 flow cytometry<br />
samples for each patient: one unstimulated PBMC sample<br />
and one PBMC sample stimulated with HIV antigens. For<br />
each of these samples, 16 parameters were measured for<br />
hundreds of thousands of cells.<br />
First, we included quality control to remove erroneous<br />
measurements from the samples. We also made an<br />
automatic selection of live T cells to focus on the cells of<br />
interest in this specific flow cytometry staining.<br />
Once the dataset was cleaned up, we extracted features for<br />
each patient. This was done by clustering the cells using<br />
the flowDensity (Malek et al., <strong>2015</strong>) and flowType<br />
algorithms (Aghaeepour et al., 2012). These algorithms<br />
divide the values for each feature into either “high” or<br />
“low” and use all combinatorial options of “high”, “low”<br />
or “neutral” marker values to group the cells. This resulted<br />
in 3^10 (59,049) different cell subsets.<br />
For each of these subsets, we computed the number of<br />
cells assigned to it and the mean fluorescence intensity for<br />
13 markers. Per patient, we collected these numbers for<br />
both samples and also computed the differences between<br />
the two. This resulted in a total of 2,480,058 features per<br />
patient.<br />
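The reported feature count can be reproduced with a short calculation. The decomposition below is our reading of the text (ten three-state markers, one cell count plus 13 intensities per subset, and three views per patient); the variable names are illustrative.<br />

```python
# Each of the 10 gating markers is "high", "low" or "neutral".
n_gating_markers = 10
n_subsets = 3 ** n_gating_markers  # 59,049 cell subsets

# Per subset: one cell count plus 13 mean fluorescence intensities.
values_per_subset = 1 + 13

# Per patient: unstimulated sample, stimulated sample, and their difference.
views_per_patient = 3

n_features = n_subsets * values_per_subset * views_per_patient
print(n_features)  # 2480058
```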
Because traditional machine learning algorithms cannot<br />
handle this number of features, we then applied a feature<br />
selection step. To estimate the usefulness of a feature, we<br />
applied a Cox proportional hazards model on each feature.<br />
The resulting p-value indicates how well the feature<br />
corresponds with the known survival times for the training<br />
set. We ordered the features based on these scores, and<br />
picked only those that were uncorrelated with the others.<br />
This resulted in a final selection of 13 features, on which<br />
we applied several machine learning techniques. We<br />
compared the results of the Cox Proportional Hazards<br />
model, the Additive Hazards model and the Random<br />
Survival Forest.<br />
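The ranking and filtering step can be sketched as a greedy selection: features are visited from best to worst score and kept only if they are weakly correlated with every feature kept so far. This is an illustration, not the FloReMi code; the correlation threshold of 0.2, the toy feature values and the per-feature p-values (which in the abstract come from Cox models) are assumptions.<br />

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_uncorrelated(features, p_values, max_abs_r=0.2, limit=None):
    """Greedy filter: best p-value first, skip features correlated with kept ones."""
    ranked = sorted(features, key=lambda name: p_values[name])
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= max_abs_r for k in kept):
            kept.append(name)
        if limit is not None and len(kept) == limit:
            break
    return kept


# Hypothetical feature values over five patients, with hypothetical p-values.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],    # nearly a rescaled copy of f1
    "f3": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with f1
}
p_values = {"f1": 0.001, "f2": 0.002, "f3": 0.010}
selected = select_uncorrelated(features, p_values)
```

In this toy case f2 is dropped because it duplicates the information in f1, while f3 is retained.<br />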
RESULTS & DISCUSSION<br />
All three methods performed well on the training dataset.<br />
However, on the test dataset, both the Cox Proportional<br />
Hazards model and the Additive Hazards model performed<br />
poorly, probably due to overfitting on the training data.<br />
Only the Random Survival Forest obtained good results on<br />
the test dataset (Figure 1). This method outperformed all<br />
other methods submitted to the challenge.<br />
FIGURE 1. On the training dataset, there was a strong correlation<br />
between the scores and the actual survival times for all models. On the<br />
test dataset, only the Random Survival Forest performed well.<br />
One important challenge remains: the biological<br />
interpretation of our final features. Although they correlate<br />
with the transition times from HIV to AIDS, it is hard to<br />
interpret them as known cell types, due to our<br />
unsupervised feature extraction. Our method delivers a<br />
first step towards new insights into the progression from<br />
HIV infection to AIDS.<br />
REFERENCES<br />
Malek M et al. Bioinformatics 31.4, 606-607 (<strong>2015</strong>).<br />
Aghaeepour N et al. Bioinformatics 28, 1009-1016 (2012).<br />
Van Gassen S et al. Cytometry A, DOI 10.1002/cyto.a.22734<br />
P63. STUDYING BET PROTEIN-CHROMATIN OCCUPATION TO<br />
UNDERSTAND GENOTOXICITY OF MLV-BASED GENE THERAPY VECTORS<br />
Sebastiaan Vanuytven 1* , Jonas Demeulemeester 1 , Zeger Debyser 1 & Rik Gijsbers 1,2 .<br />
Laboratory for Molecular Virology and Gene Therapy, KU Leuven 1 ; Leuven Viral Vector Core, KU Leuven 2 .<br />
* Sebastiaan.vanuytven@student.kuleuven.be<br />
Integrating retroviral vectors are used to treat genetic and acquired disorders that, theoretically, can be cured by<br />
introducing specific gene expression cassettes into patient cells. Clinical trials held over the past two decades have<br />
proven that this approach is effective in curing genetic disorders and can produce better results than the standard therapy<br />
(Touzot F et al., <strong>2015</strong>). Nevertheless, adverse events in a limited number of patients treated with gamma-retroviral<br />
vectors have deterred their widespread application. Specifically, vector integration occurring in proximity of proto-oncogenes<br />
resulted in insertional mutagenesis and clonal expansion of the cells (Hacein-Bey-Abina S et al., 2003).<br />
INTRODUCTION<br />
Retroviruses and their derived viral vectors do not<br />
integrate at random. Their overall integration pattern is<br />
dictated by cellular cofactors that are co-opted by the<br />
invading viral complex. For gammaretroviral vectors<br />
(prototype MLV) the cellular bromo- and extraterminal<br />
domain (BET) family of proteins (BRD2, BRD3 and<br />
BRD4) tethers the viral integrase to the host cell<br />
chromatin (De Rijck J et al., 2013). At the moment the<br />
only available ChIP-seq data derives from HEK-293T<br />
cells exogenously overexpressing FLAG-tagged versions<br />
of the BET proteins (LeRoy G et al., 2012). Yet, the<br />
detailed chromatin binding profile of endogenous BET<br />
proteins in human cells is currently unknown. Here we<br />
report on the chromatin occupation of the endogenous<br />
BET proteins in K562 and human primary CD4+ T cells.<br />
METHODS<br />
Following fixation, all three BET proteins were pulled down<br />
with specific antibodies (Bethyl Laboratories, α-BRD2:<br />
A302-583A; α-BRD3: A302-368A; α-BRD4: A301-985A or<br />
Abcam ab84776). Subsequently, 1×10^7 cells per sample<br />
were processed for ChIP as previously<br />
described (Pradeepa MM et al., 2012). ChIPed DNA was<br />
amplified with WGA2 using the manufacturer's protocol<br />
(Sigma Aldrich). All ChIP experiments were done with at<br />
least two biological replicates in K562 and CD4+ T cells.<br />
After processing of the ChIP-seq data, we compared the<br />
obtained BET protein-binding sites with MLV integration<br />
sites, histone modifications and other genetic features.<br />
Furthermore, we used motif discovery in the<br />
neighbourhood of BET binding sites and MLV integration<br />
sites to try and discover potential new players in the MLV<br />
integration process.<br />
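The comparison of integration sites with binding sites amounts to an interval-overlap count, sketched below in Python. This is not the authors' pipeline: the coordinates are hypothetical, integration sites are treated as single positions, and peaks on one chromosome are assumed not to overlap each other.<br />

```python
from bisect import bisect_right


def fraction_overlapping(sites, peaks):
    """Fraction of point sites (chrom, pos) lying inside any peak.

    Peaks are (chrom, start, end) with inclusive ends and are assumed to be
    non-overlapping within a chromosome.
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    hits = 0
    for chrom, pos in sites:
        intervals = by_chrom.get(chrom, [])
        # Index of the last peak whose start is <= pos (or -1 if none).
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][1] >= pos:
            hits += 1
    return hits / len(sites)


# Hypothetical binding-site peaks and integration sites.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
sites = [("chr1", 150), ("chr1", 300), ("chr2", 80), ("chr3", 10)]
overlap = fraction_overlapping(sites, peaks)  # 0.5
```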
RESULTS & DISCUSSION<br />
Analysis showed that 24% of the MLV integration sites<br />
overlap with a BET-binding site in K562 cells, the<br />
majority of which are BRD4 sites. In addition, BET<br />
binding sites located in promoter and enhancer regions are<br />
preferred for MLV integration. Further evaluation<br />
demonstrated a strong correlation between MLV integration<br />
in these sites and the occurrence of the<br />
transcription factor recognition motifs for MAX, GATA2,<br />
EGR1, GABPA and YY1, suggesting a role for these<br />
proteins or the underlying chromatin structures in<br />
targeting integration of MLV to these locations in the<br />
genome via interaction with BET proteins and/or the MLV<br />
long terminal repeat sequences. Recently, we generated<br />
MLV-based vectors that no longer recognize BET-proteins,<br />
BET independent MLV-based (BinMLV) vectors (El<br />
Ashkar S et al., 2014). Integration preferences of BinMLV<br />
vectors are shifted away from epigenetic marks associated<br />
with enhancers and promoters, as shown by PCA,<br />
and they also associate less with BET and MAX binding<br />
sites. Even though BinMLV vectors still did not integrate<br />
at random, their distribution can overall be described as<br />
safer, with 3% more integration sites in so-called<br />
genomic "safe-harbor" regions (Sadelain M et al., 2012).<br />
REFERENCES<br />
De Rijck J et al. The BET family of proteins targets Moloney murine<br />
leukemia virus integration near transcription start sites, Cell Rep, 5,<br />
886-894, (2013).<br />
El Ashkar S et al. BET-independent MLV-based Vectors Target Away<br />
From Promoters and Regulatory Elements, Mol Ther Nucleic Acids,<br />
3, e179, (2014).<br />
Hacein-Bey-Abina S et al. LMO2-associated clonal T cell proliferation in<br />
two patients after gene therapy for SCID-X1, Science, 302, 415-419,<br />
(2003).<br />
LeRoy G et al. Proteogenomic characterization and mapping of<br />
nucleosomes decoded by Brd and HP1 proteins, Genome Biol, 13,<br />
R68, (2012).<br />
Pradeepa MM et al. Psip1/Ledgf p52 binds methylated histone H3K36<br />
and splicing factors and contributes to the regulation of alternative<br />
splicing, PLoS Genet, 8, e1002717, (2012).<br />
Sadelain M, Papapetrou EP and Bushman FD. Safe harbours for the<br />
integration of new DNA in the human genome, Nat Rev Cancer, 12,<br />
51-58, (2012).<br />
Touzot F et al. Faster T-cell development following gene therapy<br />
compared with haploidentical HSCT in the treatment of SCID-X1,<br />
Blood, 125, 3563-3569, (<strong>2015</strong>).<br />
P64. THE COMPLETE GENOME SEQUENCE OF LACTOBACILLUS<br />
FERMENTUM IMDO 130101 AND ITS METABOLIC TRAITS RELATED TO<br />
THE SOURDOUGH FERMENTATION PROCESS<br />
Marko Verce, Koen Illeghems, Luc De Vuyst & Stefan Weckx * .<br />
Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Faculty of Sciences and Bioengineering<br />
Sciences, Vrije Universiteit Brussel, Brussels, Belgium. * stefan.weckx@vub.ac.be<br />
The genome of the lactic acid bacterium species Lactobacillus fermentum IMDO 130101, capable of dominating<br />
sourdough fermentation processes, was sequenced, annotated, and curated. Further, this genome sequence of 2.09 Mbp<br />
was compared to other complete genomes of different strains of L. fermentum to elucidate the potential of L. fermentum<br />
IMDO 130101 as a sourdough starter culture strain. As opposed to the other strains, L. fermentum IMDO 130101<br />
contained unique genes related to carbohydrate import and metabolism as well as a gene coding for a phenolic acid<br />
decarboxylase and a gene encoding a 4,6-α-glucanotransferase. The latter enzyme activity may result in the production<br />
of isomalto/malto-polysaccharides. All these features make L. fermentum IMDO 130101 attractive for further study as a<br />
candidate sourdough starter culture strain.<br />
INTRODUCTION<br />
Lactobacillus fermentum is a heterofermentative lactic<br />
acid bacterium often found in fermented food products,<br />
including sourdough. Strain L. fermentum IMDO 130101,<br />
a dominant sourdough strain originally isolated from a rye<br />
sourdough (Weckx et al., 2010) and extensively described<br />
previously (e.g., Vrancken et al., 2008), was sequenced<br />
and compared to other L. fermentum strains with<br />
completed genomes to elucidate unique adaptations of the<br />
strain studied to the sourdough environment.<br />
METHODS<br />
High-quality genomic DNA was used to construct an 8-kb<br />
paired-end library for 454 pyrosequencing. The<br />
pyrosequencing reads were assembled using the GS De<br />
Novo Assembler version 2.5.3 with default parameters.<br />
Primers for gap closure were designed using CONSED<br />
23.0, the gaps amplified with polymerase chain reaction<br />
(PCR) assays and the amplicons sequenced using Sanger<br />
sequencing. The sequences were imported into CONSED<br />
23.0 and used to close the gaps. The genome was<br />
annotated using the automated genome annotation<br />
platform GenDB v2.2 (Meyer et al., 2003), followed by<br />
extensive manual curation. Publicly available genome<br />
sequences of L. fermentum F-6 (Sun et al., <strong>2015</strong>), L.<br />
fermentum IFO 3956 (Morita et al., 2008), and L.<br />
fermentum CECT 5716 (Jiménez et al., 2010) were<br />
acquired from RefSeq. Whole-genome comparisons with<br />
the other three L. fermentum strains and ortholog findings<br />
were performed using the progressiveMauve algorithm<br />
(Darling et al., 2010).<br />
RESULTS & DISCUSSION<br />
The 2.09 Mbp genome was assembled from 403,466 reads,<br />
resulting in 74 contigs. No plasmids were found. The<br />
comparative genome analysis with other strains showed<br />
that 477 coding sequences were found solely in L. fermentum<br />
IMDO 130101 (Figure 1).<br />
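Counting strain-specific and shared coding sequences reduces to set operations once each strain is represented by its set of ortholog groups. The sketch below assumes such sets are already available; the group identifiers are hypothetical and the real comparison relied on progressiveMauve.<br />

```python
def strain_specific(gene_sets, strain):
    """Ortholog groups present in `strain` and in none of the other strains."""
    others = set().union(
        *(genes for name, genes in gene_sets.items() if name != strain)
    )
    return gene_sets[strain] - others


def core_genome(gene_sets):
    """Ortholog groups shared by every strain."""
    return set.intersection(*gene_sets.values())


# Hypothetical ortholog-group identifiers per strain.
gene_sets = {
    "IMDO 130101": {"og1", "og2", "og3", "og7"},
    "F-6": {"og1", "og2", "og4"},
    "IFO 3956": {"og1", "og2", "og3"},
    "CECT 5716": {"og1", "og5"},
}
unique_to_imdo = strain_specific(gene_sets, "IMDO 130101")  # {"og7"}
shared_by_all = core_genome(gene_sets)                      # {"og1"}
```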
L. fermentum IMDO 130101 was predicted to be able to<br />
import and utilise glucose, fructose, xylose, mannose, N-<br />
acetylglucosamine, maltose, sucrose, lactose and gluconic<br />
acid via the heterolactic fermentation pathway. Also, the<br />
ability to degrade raffinose and arabinose was predicted.<br />
Consumption of glucose, fructose, maltose and sucrose<br />
was shown in previous research, although growth with<br />
sucrose as the sole energy source was impaired (Vrancken<br />
et al., 2008). The strain possibly imports isomaltose and<br />
maltodextrins, hence elaborating glucose subunits. The<br />
α-glucosidase-encoding gene was not found in the<br />
genomes of the other three strains considered, and neither<br />
were the putative maltodextrin import-related genes, the<br />
trehalose-6-phosphate phosphorylase-encoding gene and a<br />
putative -glucanase-encoding gene, which all may be<br />
adaptations of L. fermentum IMDO 130101 to the<br />
sourdough environment. The presence of the arginine<br />
deiminase gene cluster was confirmed. Also, L. fermentum<br />
IMDO 130101 contained a gene for a phenolic acid<br />
decarboxylase, which may have an impact on sourdough<br />
aroma. Further, a 4,6-α-glucanotransferase-encoding gene<br />
was present in strain IMDO 130101 solely, which could<br />
result in the production of isomalto/malto-polysaccharides,<br />
soluble dietary fibres with prebiotic properties.<br />
Overall, comparative genome analysis revealed metabolic<br />
traits that are of interest for the use of L. fermentum IMDO<br />
130101 as a functional starter culture for sourdough<br />
fermentation processes.<br />
FIGURE 1. Venn diagram of shared coding sequences between four<br />
different strains of Lactobacillus fermentum.<br />
REFERENCES<br />
Darling et al. PLoS ONE 5, e11147 (2010).<br />
Jiménez et al. J. Bacteriol. 192, 4800-4800 (2010).<br />
Meyer et al. Nucleic Acids Res. 31, 2187-2195 (2003).<br />
Morita et al. DNA Res. 15, 151-161 (2008).<br />
Sun et al. J. Biotechnol. 194, 110-111 (<strong>2015</strong>).<br />
Vrancken et al. Int. J. Food Microbiol. 128, 58-66 (2008).<br />
Weckx et al. Food Microbiol. 27, 1000-1008 (2010).<br />
P65. ORTHOLOGICAL ANALYSIS OF AN EBOLA VIRUS – HUMAN PPIN<br />
SUGGESTS REDUCED INTERFERENCE OF EBOLA VIRUS WITH EPIGENETIC<br />
PROCESSES IN ITS SUSPECTED BAT RESERVOIR HOST<br />
Ben Verhees 1* , Kris Laukens 1,2 , Stefan Naulaerts 1,2 , Pieter Meysman 1,2 & Xaveer Van Ostade 3 .<br />
Biomedical informatics research center Antwerpen (biomina) 1 ; Advanced Database Research and Modeling (ADReM),<br />
University of Antwerp 2 ; Laboratory of Protein Science, Proteomics and Epigenetic Signalling (PPES) and Centre for<br />
Proteomics and Mass spectrometry (CFP-CeProMa), University of Antwerp 3 . * ben.verhees@student.uantwerpen.be<br />
Ebola virus is a zoonotic pathogen, but its reservoir host has not yet been identified. Recent findings suggest, however, that Mops<br />
condylurus, an insect-eating bat, is a likely candidate. Studying the interactions between Ebola virus and its reservoir<br />
host could prove highly informative, as reservoir hosts of zoonotic pathogens often appear to tolerate infections with<br />
these pathogens with little evidence of disease. In this study, a protein-protein interaction network (PPIN) was created<br />
between Ebola virus and human proteins. Orthology data in Myotis lucifugus – a model organism often used for bat<br />
studies – was employed to determine which of the human first neighbors of Ebola virus proteins do not possess an<br />
orthologue in M. lucifugus. Subsequent GO enrichment analysis suggested that these proteins are mostly involved in<br />
epigenetic processes, and thus we hypothesize that Ebola virus displays reduced interference with epigenetic processes in<br />
its reservoir host.<br />
INTRODUCTION<br />
The idea that bats serve as reservoirs for a wide range of<br />
zoonotic pathogens has been the topic of much recent<br />
research. Previous studies on human and bat orthology in<br />
this context have mainly focused on specific genes,<br />
important in fighting off viral infection.<br />
Our study differs in that it focuses on<br />
proteins the Ebola virus immediately interacts with in<br />
humans, and the existence of orthologues of these proteins<br />
in bats.<br />
METHODS<br />
Construction of an Ebola virus – human PPIN<br />
An Ebola virus – human PPIN was constructed from in<br />
silico data. All network analysis was done using<br />
Cytoscape v. 3.2.1.<br />
Orthology analysis<br />
Identification of orthologues was performed using the<br />
OMA orthology database, release: September <strong>2015</strong>.<br />
Statistics<br />
For the statistical analysis, the hypergeometric test was<br />
performed.<br />
GO enrichment<br />
GO enrichment analysis was performed using ClueGO v.<br />
1.2.7, a Cytoscape plug-in. Default settings were used, and<br />
all ontologies/pathways were examined.<br />
RESULTS & DISCUSSION<br />
Myotis lucifugus as a model for Mops condylurus<br />
In this study, Myotis lucifugus was used as a model to<br />
study interactions between Ebola virus and Mops<br />
condylurus, its suspected reservoir.<br />
Ebola virus – human PPIN and orthology in M.<br />
lucifugus<br />
An Ebola virus – human PPIN was created, and human<br />
first neighbors of Ebola virus proteins were examined for<br />
existence of orthologues in M. lucifugus. Statistical<br />
analysis revealed an over-representation of human<br />
proteins with orthologues in M. lucifugus (p=0.019).<br />
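A hypergeometric test of this kind asks how likely it is to see at least the observed number of proteins with an orthologue among the first neighbours, given the proportion in the background. The sketch below shows the upper-tail calculation; the abstract does not report the underlying counts, so all numbers are hypothetical.<br />

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from
    `population` items, of which `successes` carry the property of interest."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(population, draws)


# Hypothetical counts, NOT taken from the abstract:
# 20,000 background human proteins, 15,000 of them with a bat orthologue,
# 100 first neighbours of viral proteins, 85 of them with a bat orthologue.
p_value = hypergeom_upper_tail(85, 20000, 15000, 100)
```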
GO enrichment suggests reduced interference of Ebola<br />
virus with epigenetic processes in its reservoir host<br />
Gene ontology (GO) enrichment analysis was performed<br />
on the human first neighbors of Ebola virus proteins which<br />
do not possess an orthologue in M. lucifugus. The analysis<br />
revealed that these proteins are mostly involved in<br />
epigenetic processes (Figure 1).<br />
FIGURE 1. GO enrichment analysis of human first neighbors of Ebola<br />
virus proteins which do not possess an orthologue in M. lucifugus.<br />
Discussion<br />
Using this novel approach, we have shown that Ebola<br />
virus is likely able to interfere with epigenetic processes in<br />
humans. Moreover, Ebola virus’ ability to interfere with<br />
host epigenetics is likely reduced or altered in its reservoir<br />
host.<br />
While the idea that viruses are able to interact with host<br />
epigenetic mechanisms is fairly recent, over the past few<br />
years significant research has been done exploring this<br />
topic. In a comprehensive review, Li et al. (2014) describe<br />
how specific viral proteins are able to modulate the<br />
activity of chromatin modification complexes, e.g. HATs,<br />
HDACs, HMTs, and HDMTs, and even directly bind<br />
histone proteins. These findings lend support to the results<br />
of our study, as these suggest that Ebola virus is also able<br />
to interact with HDACs, HMTs and several histone<br />
proteins in humans.<br />
REFERENCES<br />
Li S et al. Rev Med Virol 24, 223-241 (2014).<br />
P66. PLADIPUS EMPOWERS UNIVERSAL DISTRIBUTED COMPUTING<br />
Kenneth Verheggen 1,2,3* , Harald Barsnes 4,5 , Lennart Martens 1,2,3 & Marc Vaudel 4 .<br />
Medical Biotechnology Center, VIB, Ghent, Belgium 1 ; Department of Biochemistry, Ghent University, Ghent, Belgium 2 ;<br />
Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium 3 ; Proteomics Unit, Department of<br />
Biomedicine, University of Bergen, Norway 4 ; KG Jebsen Center for Diabetes Research, Department of Clinical Science,<br />
University of Bergen, Norway 5 . *kenneth.verheggen@vib-ugent.be<br />
The use of proteomics bioinformatics substantially contributes to an improved understanding of proteomes, but this novel<br />
and in-depth knowledge comes at the cost of increased computational complexity. Parallelization across multiple<br />
computers, a strategy termed distributed computing, can be used to handle this increased complexity. However, setting<br />
up and maintaining a distributed computing infrastructure requires resources and skills that are not readily available to<br />
most research groups.<br />
Here, we propose a free and open source framework named Pladipus that greatly facilitates the establishment of<br />
distributed computing networks for proteomics bioinformatics tools.<br />
INTRODUCTION<br />
Various modern-day bioinformatics-related fields have a<br />
growing focus on large scale data processing. This<br />
inevitably leads to an increased complexity, as is<br />
illustrated by the recent efforts to elaborate a<br />
comprehensive MS-based human proteome<br />
characterization (Kim et al., 2014; Wilhelm et al., 2014).<br />
Such high-throughput, complex studies are becoming<br />
increasingly popular, but require high performance<br />
computational setups in order to be analyzed swiftly.<br />
METHODS<br />
Here, we present a generic platform for distributed<br />
proteomics software, called Pladipus. It provides an<br />
end-user-oriented solution to distribute<br />
bioinformatics tasks over a network of computers,<br />
managed through an intuitive graphical user interface<br />
(GUI).<br />
Pladipus comes with several modules that work out<br />
of the box. They include SearchGUI (Vaudel et al.,<br />
2011), PeptideShaker (Vaudel et al., <strong>2015</strong>),<br />
DeNovoGUI (Muth et al., 2014), MsConvert (part of<br />
Proteowizard (Kessner et al., 2008)) and three<br />
common forms of the BLAST (Altschul et al., 1990)<br />
algorithm (blastn, blastp and blastx). It is possible to<br />
link these together to set up tailored pipelines for<br />
specific needs, including custom, in-house<br />
algorithms and execute the whole on an inexpensive,<br />
scalable cluster infrastructure without additional cost<br />
or expert maintenance requirement. It can even be set<br />
up to allow existing (idle) hardware to hook into the<br />
network and participate in the processing.<br />
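The general idea of handing independent tasks to a pool of workers can be sketched in Python. This is not the Pladipus API: threads on one machine stand in for networked worker computers, and the task format, file names and `run_search` placeholder are assumptions made for illustration.<br />

```python
from concurrent.futures import ThreadPoolExecutor


def run_search(task):
    """Placeholder for one unit of work, e.g. one file searched by one engine."""
    spectrum_file, engine = task
    return f"{engine}:{spectrum_file}:done"


def process_all(tasks, n_workers):
    """Distribute tasks over a pool of workers; results keep the task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_search, tasks))


# Hypothetical job list: every file is searched with every engine.
files = [f"run_{i:02d}.mgf" for i in range(4)]
engines = ["X!Tandem", "Tide", "MS-GF+"]
tasks = [(f, e) for f in files for e in engines]
results = process_all(tasks, n_workers=3)
```

Because the tasks do not depend on each other, adding workers shortens the wall time without changing the results.<br />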
RESULTS & DISCUSSION<br />
To numerically assess the benefits of using a distributed<br />
computing framework, 52 CPTAC experiments (LTQ-Study6:<br />
Orbitrap@86) (Paulovich et al., 2010) were<br />
searched three times against a protein sequence database<br />
(UniProtKB/SwissProt (release-<strong>2015</strong>_05)) on Pladipus<br />
networks of various sizes. A selection of three search engines<br />
was applied: X!Tandem, Tide and MS-GF+. As expected<br />
for a distributed system, the wall time was highly reproducible<br />
and decreased nearly exponentially with the number of<br />
workers.<br />
FIGURE 1. Benchmarking of a Pladipus network (16 GB RAM, 12 cores,<br />
250 GB disk space, Ubuntu Precise).<br />
Pladipus is freely available as open source under the permissive<br />
Apache2 license. Documentation, including example files, an installer<br />
and a video tutorial, can be found at<br />
https://compomics.github.io/projects/pladipus.html.<br />
REFERENCES<br />
Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol.<br />
Biol., 215, 403–10.<br />
Kessner,D. et al. (2008) ProteoWizard: open source software for rapid<br />
proteomics tools development. Bioinformatics, 24, 2534–6.<br />
Kim,M.-S. et al. (2014) A draft map of the human proteome. Nature,<br />
509, 575–81.<br />
Muth,T. et al. (2014) DeNovoGUI: an open source graphical user<br />
interface for de novo sequencing of tandem mass spectra. J.<br />
Proteome Res., 13, 1143–6.<br />
Paulovich,A.G. et al. (2010) Interlaboratory study characterizing a yeast<br />
performance standard for benchmarking LC-MS platform<br />
performance. Mol. Cell. Proteomics, 9, 242–54.<br />
Vaudel,M. et al. (<strong>2015</strong>) PeptideShaker enables reanalysis of MS-derived<br />
proteomics data sets. Nat. Biotechnol., 33, 22–24.<br />
Vaudel,M. et al. (2011) SearchGUI: An open-source graphical user<br />
interface for simultaneous OMSSA and X!Tandem searches.<br />
Proteomics, 11, 996–9.<br />
Wilhelm,M. et al. (2014) Mass-spectrometry-based draft of the human<br />
proteome. Nature, 509, 582–7.<br />
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: P<br />
Poster<br />
P67. IDENTIFICATION OF ANTIBIOTIC RESISTANCE MECHANISMS USING<br />
A NETWORK-BASED APPROACH<br />
Bram Weytjens 1,2,3,4 , Dries De Maeyer 1,2,3,4 & Kathleen Marchal 1,2,4 *.<br />
Dept. of Information Technology (INTEC, iMINDS), UGent, Ghent, 9052, Belgium 1 ; Dept. of Plant Biotechnology and<br />
Bioinformatics, Ghent University, Technologiepark 927, 9052 Gent, Belgium 2 ; Dept. of Microbial and Molecular<br />
Systems, KU Leuven, Kasteelpark Arenberg 20, B-3001 Leuven, Belgium 3 , Bioinformatics Institute Ghent, Ghent<br />
University, Ghent B-9000, Belgium 4 . * kathleen.marchal@intec.ugent.be<br />
Antibiotic resistance is a growing public health concern as the effectiveness of multiple types of antibiotics is decreasing.<br />
To prevent and combat the further spread of antibiotic resistance in bacteria there is a need to better understand the<br />
relationship between genetic alterations and the (molecular) phenotype of antibiotic-resistant strains. As several (-omics)<br />
experiments regarding the attainment of antibiotic resistance by bacteria have already been performed and are publicly<br />
available, we re-analysed a laboratory evolution experiment by Suzuki et al. (2014) in order to demonstrate the<br />
power of a network-based approach in identifying mutations and molecular pathways driving the resistance phenotype.<br />
INTRODUCTION<br />
While network-based approaches are no longer new in<br />
high-throughput (-omics) analysis, they are not yet widely<br />
used in standard analysis pipelines. We analysed a dataset<br />
consisting of multiple E. coli MDS42 strains, each<br />
independently evolved in the presence of a specific<br />
antibiotic (10 in total). By adapting PheNetic (De Maeyer et al.,<br />
2013), an algorithm which connects genetic alterations to<br />
their differentially expressed genes over a genome-wide<br />
interaction network, we were able to automatically<br />
identify mutations in genes which are known to induce<br />
antibiotic resistance.<br />
METHODS<br />
For every strain, whole-genome sequencing data and<br />
microarray data (eQTL data) were available. By finding the<br />
most probable connections between the mutations of every<br />
strain and the strain’s respective expression data over a<br />
biological network, PheNetic was able to not only uncover<br />
potential driver genes and molecular pathways for the<br />
resistance phenotype but also to prioritize the identified<br />
mutations based on the likelihood that they are truly<br />
driving the resistance phenotype. Such a network-based<br />
approach has the following advantages:<br />
• Integration of interactomics (network), genomics<br />
and transcriptomics data<br />
• Multiple related datasets can be analyzed together<br />
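The idea of finding the most probable connection between a mutated gene and a differentially expressed gene over a network can be sketched as a shortest-path search. This is a simplified illustration, not PheNetic's actual algorithm: it assumes each edge carries an independent probability, and the node names and probabilities in the example are hypothetical.<br />

```python
import heapq
import math


def most_probable_path(edges, source, target):
    """Return (probability, path) of the most probable source -> target path.

    edges maps a node to a list of (neighbour, probability) pairs with
    0 < probability <= 1. Maximising a product of probabilities equals
    minimising the sum of -log(probability), so Dijkstra's algorithm applies.
    """
    best = {source: 0.0}          # lowest -log(probability) found so far
    prev = {}                     # predecessor on the best path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, math.inf):
            continue              # stale heap entry
        if node == target:
            break
        for nxt, prob in edges.get(node, []):
            new_cost = cost - math.log(prob)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path


# Hypothetical toy network: one mutated gene, two routes to an expressed gene.
toy_edges = {
    "mutated_gene": [("regulator", 0.9), ("enzyme", 0.5)],
    "regulator": [("de_gene", 0.8)],
    "enzyme": [("de_gene", 0.9)],
}

if __name__ == "__main__":
    print(most_probable_path(toy_edges, "mutated_gene", "de_gene"))
```

Mutations could then be ranked by the probability of their best connection to the expression data, which is one simple way to read the prioritisation described above.<br />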
FIGURE 1: Part of Amikacin resistance network.<br />
RESULTS & DISCUSSION<br />
In the case of amikacin resistance (Figure 1), we uncovered,<br />
in two of the four strains, a gain-of-function mutation in cpxA,<br />
a gene of a two-component signal transduction system that is<br />
known to be involved in amikacin resistance. In the other two<br />
strains, deleterious cyoB mutations were found, which are<br />
known to lead to intracellular oxidized copper and eventually<br />
multidrug resistance. These genes were furthermore ranked<br />
highest by PheNetic.<br />
REFERENCES<br />
Suzuki S et al. Nat Commun 5, 5792 (2014).<br />
De Maeyer D et al. Mol Biosyst 9: 1594-1603 (2013).<br />
Abstract ID: P<br />
Poster<br />
P68. DEFINING THE MICROBIAL COMMUNITY OF DIFFERENT<br />
LACTOBACILLUS NICHES USING METAGENOMIC SEQUENCING<br />
Sander Wuyts 1,2* , Eline Oerlemans 1 , Ilke De Boeck 1 , Wenke Smets 1 , Dieter Vandenheuvel, Ingmar Claes 1 & Sarah<br />
Lebeer 1 .<br />
Laboratory of Applied Microbiology and Biotechnology, University of Antwerp 1 ; Research Group of Industrial<br />
Microbiology and Food Biotechnology (IMDO), Vrije Universiteit Brussel 2 * Sander.Wuyts@UAntwerp.be<br />
Next-Generation Sequencing (NGS) has revolutionized the field of microbial community analysis. Thanks to these high-throughput<br />
DNA technologies, microbiologists are now able to perform more in-depth analyses of various microbial<br />
communities than with culture-dependent methods. In our lab, we have successfully deployed 16S rDNA amplicon<br />
sequencing using MiSeq-sequencing (Illumina). A bioinformatic pipeline has been built based on mothur (Schloss et al.<br />
2009), UPARSE (Edgar 2013) and Phyloseq (McMurdie & Holmes 2013) to analyse different microbial community<br />
datasets. The focus is on functional analysis of lactobacilli and other lactic acid bacteria in different ecological niches:<br />
ranging from the human upper respiratory tract to naturally fermented plant-based foods.<br />
INTRODUCTION<br />
16S metagenomics is a technique that makes use of the<br />
highly conserved bacterial 16S rRNA gene. This gene<br />
codes for an RNA-molecule which is a component of the<br />
30S small subunit of bacterial ribosomes. It consists of 9<br />
hypervariable regions, flanked by conserved regions for<br />
which primer pairs for PCR/sequencing can be designed.<br />
Due to these characteristics and due to the slow rate of<br />
evolution, this gene has been widely used in bacterial<br />
phylogeny and taxonomy. NGS technologies like Illumina<br />
MiSeq have made it possible to study all the different<br />
16S rRNA gene copies from an environmental sample and<br />
use these to identify the bacteria present in the sample. However,<br />
the use of these high-throughput technologies comes at<br />
a cost: the need for a more in-depth bioinformatic analysis.<br />
METHODS<br />
Wetlab:<br />
DNA is extracted using sample-dependent extraction<br />
protocols. A barcoded PCR is performed on the V4 region<br />
of the 16S rRNA gene as described in Kozich et al. 2013.<br />
For each sample a different set of primers is used; each<br />
primer set contains a unique combination of barcodes. The<br />
PCR products are cleaned using AMPure XP (Agencourt)<br />
bead purification and quantified using Qubit (Life<br />
Technologies). All samples are pooled equimolarly into a<br />
single library. A negative control (= “empty” DNA extraction)<br />
and a positive control (= “Mock” communities<br />
HM-276D and HM-782D) are always processed together<br />
with the samples. The library is sequenced using a dual<br />
index sequencing strategy (Kozich et al. 2013) and a<br />
2 x 250 bp kit on the Illumina MiSeq.<br />
Bio-informatic analysis:<br />
Samples are demultiplexed on the MiSeq itself, allowing 1<br />
bp difference in the barcodes. The general quality of the<br />
reads is checked using FastQC (Babraham Bioinformatics).<br />
The paired end reads are merged using mothur’s<br />
make.contigs command. Quality control in mothur is<br />
performed using screen.seqs, alignment to the SILVA<br />
database and removal of sequences that do not map to the<br />
database, removal of chimeras using chimera.uchime and<br />
removal of sequences that classify to the lineages<br />
“Mitochondria” and “Chloroplast”.<br />
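Demultiplexing with a tolerance of 1 bp in the barcode can be sketched as follows. This is an illustration of the principle only, not the MiSeq's own implementation; the sample names and barcode sequences are hypothetical.<br />

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_barcode, barcodes, max_mismatch=1):
    """Return the sample whose barcode is within max_mismatch of the read.

    Returns None when no barcode matches or when the match is ambiguous
    (more than one sample within the allowed distance).
    """
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if len(barcode) == len(read_barcode)
        and hamming(read_barcode, barcode) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes, for illustration only.
example_barcodes = {"sample_A": "ACGTACGT", "sample_B": "TTGCAAGC"}

if __name__ == "__main__":
    print(assign_sample("ACGTACGA", example_barcodes))
```

Rejecting ambiguous matches is a design choice of this sketch; it avoids assigning a read to the wrong sample when two barcodes are too similar.<br />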
The distances between sequences are calculated using<br />
mothur’s dist.seqs command and the sequences are clustered at 97 %<br />
sequence similarity using mothur’s cluster command.<br />
Alternatively the UPARSE clustering algorithm can be<br />
used for these last two steps. Sequences are classified<br />
using the RDP database and the complete dataset is<br />
exported as a .biom file.<br />
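The distance and clustering steps can be illustrated with a small sketch. It uses a plain mismatch fraction on aligned sequences and single-linkage clustering at a distance cutoff of 0.03 (97 % similarity); this is a deliberately simple stand-in and not the algorithm used by mothur's cluster command or by UPARSE. The toy sequences are hypothetical.<br />

```python
def pairwise_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_otus(seqs, cutoff=0.03):
    """Single-linkage clustering of named sequences into OTUs.

    Two sequences end up in the same OTU when a chain of pairwise
    distances <= cutoff connects them (union-find implementation).
    """
    names = list(seqs)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]   # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pairwise_distance(seqs[a], seqs[b]) <= cutoff:
                parent[find(a)] = find(b)

    otus = {}
    for name in names:
        otus.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in otus.values())


# Hypothetical aligned sequences of length 100.
toy_seqs = {
    "seq1": "A" * 100,
    "seq2": "A" * 98 + "CC",        # 2 % different from seq1
    "seq3": "A" * 90 + "G" * 10,    # 10 % different from seq1
}

if __name__ == "__main__":
    print(cluster_otus(toy_seqs))
```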
Visualisation and statistical analysis are performed using<br />
the R-package Phyloseq. This analysis depends on the<br />
experimental design but generally consists of a<br />
normalisation step (either using rarefying, proportions or a<br />
statistical mixture model (McMurdie & Holmes 2014)), a<br />
calculation of alpha diversity measurements and a<br />
calculation and visualisation of beta diversity.<br />
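The quantities named above can be made concrete with a short sketch: normalisation to proportions, the Shannon index as one alpha diversity measure, and Bray-Curtis dissimilarity as one beta diversity measure. This is plain Python for illustration and does not use or reproduce the Phyloseq API; the choice of these two indices is an assumption, since the text does not say which measures are used.<br />

```python
import math


def to_proportions(counts):
    """Normalise OTU counts of one sample to relative abundances."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def shannon(counts):
    """Shannon diversity index (natural logarithm) of one sample."""
    return -sum(p * math.log(p) for p in to_proportions(counts) if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples on proportions.

    Both proportion vectors sum to 1, so the usual denominator
    (sum of all abundances in both samples) equals 2.
    """
    pa, pb = to_proportions(counts_a), to_proportions(counts_b)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / 2.0


if __name__ == "__main__":
    print(shannon([10, 10]), bray_curtis([8, 2, 0], [1, 1, 8]))
```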
RESULTS & DISCUSSION<br />
The method described above was optimised and shown to<br />
work. We successfully used this technique to obtain<br />
better insights into the role of lactobacilli in different<br />
ecological niches, e.g. in the murine gastrointestinal tract,<br />
vegetable fermentations and the human upper respiratory<br />
tract.<br />
REFERENCES<br />
Edgar, R.C., 2013. UPARSE: highly accurate OTU sequences from<br />
microbial amplicon reads. Nature methods, 10(10), pp.996–8.<br />
Kozich, J.J. et al., 2013. Development of a dual-index sequencing<br />
strategy and curation pipeline for analyzing amplicon sequence<br />
data on the MiSeq Illumina sequencing platform. Applied and<br />
environmental microbiology, 79(17), pp.5112–20.<br />
McMurdie, P.J. & Holmes, S., 2013. Phyloseq: An R Package for<br />
Reproducible Interactive Analysis and Graphics of Microbiome<br />
Census Data. PLoS ONE, 8(4).<br />
McMurdie, P.J. & Holmes, S., 2014. Waste not, want not: why rarefying<br />
microbiome data is inadmissible. PLoS computational biology,<br />
10(4), p.e1003531.<br />
Schloss, P.D. et al., 2009. Introducing mothur: Open-source, platform-independent,<br />
community-supported software for describing and<br />
comparing microbial communities. Applied and Environmental<br />
Microbiology, 75(23), pp.7537–7541.<br />
Abstract ID: P<br />
Poster<br />
P69. HUNTING HUMAN PHENOTYPE-ASSOCIATED GENES<br />
USING MATRIX FACTORIZATION<br />
Pooya Zakeri 1,2,* , Jaak Simm 1,2 , Adam Arany 1,2 , Sarah Elshal 1,2 & Yves Moreau 1,2 .<br />
Department of Electrical Engineering, STADIUS, KU Leuven, Leuven 3001, Belgium 1 ; iMinds Medical IT, Leuven 3001,<br />
Belgium 2 . * pooya.zakeri@esat.kuleuven.be<br />
In the last decade, the identification of phenotype-associated genes has received growing attention, yet it remains one of the most<br />
challenging problems in biology. In particular, determining disease-associated genes is a demanding process and plays a<br />
crucial role in understanding the relationship between disease phenotypes and genes. Typical approaches to gene<br />
prioritization often model each disease individually, which fails to capture the common patterns in the data. This<br />
motivates us to formulate the hunt for phenotype-associated genes as the factorization of an incompletely filled<br />
gene-phenotype matrix, where the objective is to predict the unknown values. Experimental results on the updated version of the<br />
Endeavour benchmark demonstrate that our proposed model can effectively improve on the accuracy of the state-of-the-art<br />
gene prioritization model.<br />
INTRODUCTION<br />
In biology, there is often a need to identify the most<br />
promising genes in a large list of candidate genes for<br />
further investigation. While a single data source might not<br />
be effective enough, fusing several complementary<br />
genomic data sources results in more accurate predictions.<br />
Moreover, fusing the phenotypic similarity of diseases and<br />
sharing information about known disease genes across<br />
both diseases and genes through a multi-task approach<br />
enables us to handle gene prioritization for diseases with<br />
very few known genes and for genes with limited available<br />
information. Typical strategies for hunting phenotype-associated<br />
genes often model each phenotype<br />
individually [1, 2, 3, 4], which fails to capture the common<br />
patterns in the data. This motivates us to formulate the<br />
task of hunting phenotype-associated genes as the factorization<br />
of an incompletely filled gene-phenotype matrix, where the<br />
objective is to predict the unknown values.<br />
METHODS<br />
We consider the OMIM database, which is a human<br />
disease-phenotype association database. OMIM focuses on<br />
the relationship between human genotypes and associated<br />
diseases. The OMIM database can be seen as an incomplete<br />
matrix where each row is a gene and each column is a<br />
phenotype (disease).<br />
The idea behind factorizing the N×M OMIM matrix (N genes,<br />
M phenotypes) is to represent each row and each column by a<br />
latent vector of size D. The OMIM matrix can then be modeled<br />
as the product of an N×D gene matrix G and the transpose of<br />
an M×D disease matrix P.<br />
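A minimal sketch of factorizing an incompletely filled matrix is given below. It uses plain regularised stochastic gradient descent on the observed cells only; it is not the Bayesian model described in this abstract and uses no side information. All function names, hyperparameters and the toy data are assumptions made for illustration.<br />

```python
import random


def factorize(observed, n_genes, n_phenotypes, d=2,
              epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Factorize a partially observed gene x phenotype matrix.

    observed maps (gene_index, phenotype_index) to a known value.
    Returns G (n_genes x d) and P (n_phenotypes x d) such that the dot
    product of G[i] and P[j] approximates the value of cell (i, j).
    """
    rng = random.Random(seed)
    G = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n_genes)]
    P = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(n_phenotypes)]
    cells = list(observed.items())
    for _ in range(epochs):
        for (i, j), value in cells:
            err = value - predict(G, P, i, j)
            for k in range(d):
                g, p = G[i][k], P[j][k]
                # Gradient step on squared error with L2 regularisation.
                G[i][k] += lr * (err * p - reg * g)
                P[j][k] += lr * (err * g - reg * p)
    return G, P


def predict(G, P, i, j):
    """Predicted association score for gene i and phenotype j."""
    return sum(g * p for g, p in zip(G[i], P[j]))


# Hypothetical toy data: 4 genes x 2 phenotypes, cell (3, 0) is unknown.
toy_observed = {
    (0, 0): 1.0, (0, 1): 0.0,
    (1, 0): 1.0, (1, 1): 0.0,
    (2, 0): 0.0, (2, 1): 1.0,
    (3, 1): 1.0,
}

if __name__ == "__main__":
    G, P = factorize(toy_observed, n_genes=4, n_phenotypes=2)
    print(predict(G, P, 3, 0))   # score for the unobserved cell
```

Scores for unobserved cells can then be sorted per phenotype to obtain a gene ranking.<br />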
Bayesian probabilistic matrix factorization (BPMF) [5] is a<br />
well-known method for completing such an incomplete matrix. However, BPMF uses<br />
no side information, which results in an inaccurate completion of the<br />
gene-phenotype matrix.<br />
We propose an extended version of BPMF that can<br />
work with multiple sources of side information when<br />
completing the gene-phenotype matrix [6], which also makes it<br />
possible to rank outside the gene-phenotype matrix. In our<br />
proposed framework we are also able to integrate both<br />
genomic data sources and phenotypic information,<br />
whereas earlier approaches for hunting phenotype-associated<br />
genes are limited to fusing genomic<br />
information only. This modification is made by adding genomic<br />
and phenotypic features to the corresponding latent<br />
variables [6]. In this study, we consider several genomic<br />
data sources, including annotation-based data sources such<br />
as UniProt annotation and literature-based data sources on<br />
each gene, as well as literature-based phenotypic<br />
information on each disease, as in [1, 4, 9]. The<br />
framework of our Bayesian data fusion model for gene<br />
prioritization is illustrated in Figure 1.<br />
FIGURE 1. The framework of our Bayesian data fusion model for gene<br />
prioritization.<br />
RESULTS & DISCUSSION<br />
We report the average true positive rate (TPR) when considering the<br />
top 1%, 5%, 10%, and 30% of the ranked genes.<br />
Experimental results on the updated version of the Endeavour<br />
[3] benchmark demonstrate that our proposed model can<br />
effectively improve on the accuracy of the state-of-the-art<br />
gene prioritization model.<br />
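One possible reading of "TPR in the top k %" is sketched below: the fraction of held-out true genes whose rank falls within the top k % of the candidate list. The abstract does not define the measure precisely, so this definition, the function name and the example ranks are assumptions.<br />

```python
def tpr_at_top(ranks, list_size, percent):
    """Fraction of held-out true genes ranked within the top `percent` %.

    ranks holds the 1-based rank of each held-out true gene in a
    candidate list of `list_size` genes.
    """
    if not ranks:
        raise ValueError("at least one rank is required")
    cutoff = list_size * percent / 100.0
    hits = sum(1 for rank in ranks if rank <= cutoff)
    return hits / len(ranks)


# Hypothetical ranks of five held-out genes among 100 candidates.
example_ranks = [1, 3, 12, 40, 90]

if __name__ == "__main__":
    for percent in (1, 5, 10, 30):
        print(percent, tpr_at_top(example_ranks, 100, percent))
```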
REFERENCES<br />
[1] Aerts S, et al. Nat Biotech, 24(5), 537–544 (2006).<br />
[2] De Bie T, Tranchevent LC, van Oeffelen LMM, Moreau Y.<br />
Bioinformatics, 23(13), i125–i132 (2007).<br />
[3] Tranchevent LC, et al. NAR, (35) W377–W384 (2008).<br />
[4] ElShal S, et al. Davis J. Moreau Y. NAR, (<strong>2015</strong>).<br />
[5] Salakhutdinov R, Mnih A. 25th ICML, 880–887. ACM (2008).<br />
[6] Simm J, et al. arXiv:1509.04610 [stat.ML] (2015).<br />
Abstract ID: P<br />
Poster<br />
P70. THE IMPACT OF HMGA PROTEINS ON REPLICATION ORIGINS<br />
DISTRIBUTION<br />
A. Zouaoui 1 , M. Kahli 2 , E. Besnard 3 , R. Desprat 1 , N. Kirsten 4 , P. Ben-sadoun 1 & J.M. Lemaitre 1 .<br />
Institute for Regenerative Medicine and Biotherapy, France 1 ; Institut de Biologie de l’École Normale Supérieure (ENS),<br />
France 2 ; The Gladstone Institutes, University of California San Francisco (UCSF), United States 3 ; Helmholtz Zentrum<br />
München, Research Unit Gene Vectors, Munich, Germany 4 .<br />
Proliferating cells can undergo an irreversible arrest of the cell<br />
cycle called cellular senescence, which can induce<br />
the development of cancer and ageing. Senescence is<br />
characterized by the development of Senescence-Associated<br />
Heterochromatic Foci (SAHF) and the decline of DNA<br />
replication. When overexpressed, High-Mobility Group A (HMGA) proteins promote<br />
SAHF formation and proliferative arrest, and stabilize<br />
senescence.<br />
In a cell, DNA replication is regulated at several<br />
genomic sites called replication origins (« Oris »). A pre-replication<br />
protein complex is required for DNA<br />
replication to occur. In the pre-replication complex, the<br />
ORC1 protein is involved in recognition of the origin of<br />
replication. DNA autoradiography of eukaryotic cells<br />
showed that human replication origins are<br />
bidirectional and spaced at 20-400 kb intervals (Huberman<br />
and Riggs, 1968). At each origin, replication forks are<br />
formed and new short nascent strands are synthesized. A<br />
popular method to map replication origins is the<br />
purification of Short Nascent Strands (SNS). Several<br />
laboratories have identified up to 50 000 origins using<br />
microarray and sequencing techniques. Our laboratory has<br />
developed an origin mapping method and applied it to four cell<br />
types: IMR90, H9, iPSC and HeLa (Besnard et al., 2012).<br />
The short nascent strands were isolated, sequenced and<br />
analyzed. 250 000 origin peaks were identified with a<br />
peak detection tool named Sole-Search (Blahnik et al., 2010).<br />
The objective is to find the most sensitive method to<br />
analyze the origin distribution in proliferative and<br />
senescent cells, in order to observe whether senescence has an impact on<br />
the origin distribution. The implication of HMGA proteins<br />
in DNA replication is also investigated. Two new methods<br />
are in development to analyze replication origins with<br />
two more sensitive tools. In the first method, we search for<br />
origin peaks with the MACS2 tool (Zhang et al., 2008), which<br />
uses a new statistical model and algorithm. In the second method,<br />
origin enrichment is assessed with the Homer tool (Heinz et<br />
al., 2010).<br />
Two methods are currently in development to identify<br />
replication origin sites by Illumina GAII sequencing of short<br />
nascent strands. Human SNS-seq reads of 36 bp were<br />
mapped to human genome build GRCh38 with the BWA tool<br />
(ref). Origin peaks were called with MACS2 and origin<br />
enrichment was assessed with Homer. To compare the two methods,<br />
active origins in HeLa cells were detected with each<br />
method. The correlation between ORC1 peaks and the origins<br />
identified is calculated to choose the most sensitive<br />
method. The impact of pre-senescence is observed by<br />
comparing the origin distributions observed in proliferative<br />
and senescent cells. The origin distribution is compared<br />
before and after induction of HMGA proteins to<br />
investigate the implication of these proteins in DNA<br />
replication during senescence.<br />
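One simple way to relate called origins to ORC1 peaks is the fraction of origins that overlap at least one ORC1 peak. The sketch below shows this as an assumed proxy; the abstract does not specify how the correlation is computed, and the interval coordinates are hypothetical. Intervals are half-open (start included, end excluded), as in BED files.<br />

```python
from collections import defaultdict


def fraction_overlapping(origins, orc1_peaks):
    """Fraction of origin intervals overlapping at least one ORC1 peak.

    Both arguments are lists of (chromosome, start, end) tuples with
    half-open coordinates.
    """
    if not origins:
        raise ValueError("at least one origin is required")
    peaks_by_chrom = defaultdict(list)
    for chrom, start, end in orc1_peaks:
        peaks_by_chrom[chrom].append((start, end))

    overlapping = 0
    for chrom, start, end in origins:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < p_end and p_start < end
               for p_start, p_end in peaks_by_chrom[chrom]):
            overlapping += 1
    return overlapping / len(origins)


# Hypothetical intervals, for illustration only.
toy_origins = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
toy_orc1 = [("chr1", 150, 160), ("chr2", 80, 120)]

if __name__ == "__main__":
    print(fraction_overlapping(toy_origins, toy_orc1))
```

Computing this fraction for the MACS2 and Homer call sets on the same HeLa data would give one number per method to compare.<br />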
REFERENCES<br />
Besnard et al. Best practices for mapping replication origins in<br />
eukaryotic chromosomes. Curr Protoc Cell Biol. 2014 Sep 2;<br />
64:22.18.1-22.18.13<br />
Besnard et al. Unraveling cell type-specific and reprogrammable human<br />
replication origin signatures associated with G-quadruplex consensus<br />
motifs. Nat Struct Mol Biol. 2012 Aug; 19, 837-44<br />
Blahnik KR, Dou L, O’Geen H, et al. Sole-Search: an integrated analysis<br />
program for peak detection and functional annotation using ChIP-seq<br />
data. Nucleic Acids Res. 2010; 38:e13<br />
Fu H et al. Mapping replication origin sequences in eukaryotic<br />
chromosomes. Curr Protoc Cell Biol. 2014 Dec 1; 65:22.20.1-<br />
22.20.17<br />
Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of<br />
Lineage-Determining Transcription Factors Prime cis-Regulatory<br />
Elements Required for Macrophage and B Cell Identities. Mol Cell<br />
2010 May 28; 38, 576-589<br />
Huberman JA et al. On the mechanism of DNA replication in<br />
mammalian chromosomes. J Mol Biol 1968 Mar 14; 32, 327-41<br />
Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol<br />
(2008) 9 pp. R13<br />