
Exploring the 3D Protein Landscape - Biomedical Computation ...


Structural GENOMICS
Exploring the 3D Protein Landscape
By Denise Chen

When the human genome was completely sequenced in 2003, researchers were already pondering how biomedicine could make use of it. One hope was that the sequences would lead to a greater understanding of how genes and their encoded proteins function. From there, researchers envisioned that they would be steps closer to a better understanding of disease and the development of appropriate treatments.

Published by Simbios, the NIH National Center for Physics-Based Simulation of Biological Structures


Dimensions of the Protein Universe. Protein structures are displayed here along axes signifying secondary protein structure elements: strictly α helices or β sheets, both α and β, or combinations of α and β. The more complex and highly structured proteins reside at the extreme ends of the axes. In 1992, researchers estimated the number of protein families at around 1000, but the size of the protein universe has turned out to be much larger than predicted, as exemplified by the 23rd release of the Pfam database listing over 10,000 protein families. Source: NIGMS image gallery: http://images.nigms.nih.gov/index.cfm?event=viewDetail&imageID=2367. Courtesy of Berkeley Structural Genomics Center, PSI.

But a fuller understanding of proteins’ functions within the human body depends on determining those proteins’ structures. And as the number of known gene sequences grew, many scientists realized they could not catch up simply by determining protein structures one by one. So a group of scientists embarked on a strategic plan to uncover the three-dimensional structures of all the proteins that these genes encode.

This endeavor is called structural genomics. “The original question for which structural genomics came into being was: ‘Can we translate the sequence of everything into the structure of everything?’” says Peter Preusch, PhD, acting director of the Protein Structure Initiative at the National Institute for General Medical Sciences (NIGMS).

The primary motivator of structural genomics is the sheer speed with which genomic sequence data is accumulating. Structural determination in traditional structural biology laboratories can’t possibly keep up, researchers say. Fortunately, unlike sequences, which are nearly infinite in number, “there may be a finite number of different shapes that proteins actually adopt to perform their functions in the cell,” says Ian Wilson, DPhil, professor of structural biology at The Scripps Research Institute and director of the Joint Center for Structural Genomics (JCSG).

In fact, a 1992 Nature paper estimated that the majority of proteins belong to no more than 1,000 families. Thus, researchers reasoned that it might be possible to unveil the universe of protein structures through a combination of experimental structure determination and computational structure prediction. And although upwards of 10,000 protein families have now been identified, uncovering the protein structure universe remains feasible.

In pursuit of this goal, ten years ago, NIGMS made a major investment to fund and spearhead a coordinated public structural genomics effort in the United States. Known as the Protein Structure Initiative (PSI), the program established four research centers and several specialized centers. The plan: to determine structures faster and cheaper; improve computational methods for predicting protein models; and ultimately develop innovative strategies for delivering useful structural information to the greater biological community.

In each of these areas, the PSI has made great strides. Before the PSI launched, determining the structure of a relatively complex protein was a major task, requiring the efforts of a graduate student for several months or even years, says Keith Hodgson, PhD, professor of chemistry at Stanford and head of the JCSG structure determination unit. Today, at each of the four main PSI centers, “a structure is turned out every few days,” he says. At the same time, protein structure prediction helps fill in the gaps between known and unknown structures, bringing us closer to knowing the “structure of everything.” This increased coverage of the structure space is transforming the field of biology, making it possible to assemble all of the structures in a particular pathway and visualize the interplay between them; or screen multiple structures to determine what they will bind; or carefully study the structures of proteins involved in disease.

PROGRESSING THROUGH THE PIPELINE: FROM SEQUENCE TO STRUCTURE

Structure determination consists of multiple steps including cloning, expressing, and purifying a protein, finding appropriate conditions for crystallizing the protein, performing structural analy

BIOMEDICAL COMPUTATION REVIEW Winter 2009/10 www.biomedicalcomputationreview.org


“knowledge-based” approach which gathers hints from known structures used as templates or a “physics-based” approach which starts from scratch using first principles to explore the possibilities of protein folds. The knowledge-based approach, also called homology modeling, is essentially “designing new buildings as better old buildings,” says Michael Levitt, PhD, professor and chair of computational structural biology at Stanford. “The idea is that it has worked, so you can reuse it in a different combination.”

High-throughput structure determination efforts have increased the number of known protein folds in sequence alignment databases, making it more likely that a protein with unknown structure will produce matches with sequences of known proteins that can serve as a template to then predict higher quality structures. Thus, structural genomics efforts contribute to homology modeling. “You’re running on the same computers, same codes, but the database on which it runs is much larger now,” says Nir Kalisman, PhD, a postdoctoral researcher of structural biology and computer science at Stanford.

In turn, structural genomics has benefited from structure prediction efforts, which leverage known structural information to fill in gaps in the structure space. “The main idea is that we really can get large scale coverage of all the structure space by sampling strategically, getting experimental structures of particular representatives, and then modeling around that using homology modeling techniques,” says John Moult, DPhil, professor at the University of Maryland Biotechnology Institute.

Currently, scientists are able to learn from structures generated from the best of both the structure prediction and the structure determination worlds. “For any structure that’s determined using X-ray crystallography or NMR, the model that you get is very highly reliable, the gold standard,” says Helen Berman, PhD, professor of chemistry at Rutgers University and director of the PDB. Homology modeling, on the other hand, might be less certain, but still provides useful information, she says. Moult agrees: “A rough structural

“To make a real impact, you’ve got to pick the right targets and then use modeling to expand the structural information to many more sequences,” Norvell says.

Illuminating Protein Function via Structure. The Pfam database currently contains 2,247 families of “hypothetical proteins”: proteins with unknown functions or that are uncharacterized. In a 2009 PLOS Biology paper, researchers looked at 248 of these families that were solved by the PSI to better understand regions of the yet unexplored protein universe that these families represent. The top pie chart breaks down the hypothetical proteins into subgroups based on their structural similarity and homology to known structures, ranging from proteins composed of new folds (red slice) to proteins with recognizable homology to known structures (dark blue slice). Within each of the five slices are mini pie charts showing the percentage of structures within each category for which hypotheses about their functions exist (white). What emerges is a relationship between structural similarity and homology and hypotheses about function: the greater the degree of structural similarity and homology to known structures, the more likely a functional hypothesis can be formed for that protein family. The lower pie chart further demonstrates that known structural information can facilitate inferences of function. From Jaroszewski L, Li Z, Krishna SS, Bakolitsa C, Wooley J, et al. 2009 Exploration of Uncharted Regions of the Protein Universe. PLoS Biology 7(9): e1000205. doi:10.1371/journal.pbio.1000205.
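As a rough illustration of the knowledge-based idea described in this section (finding a protein of known structure whose sequence matches the query, so that it can serve as a template), here is a minimal Python sketch. It is not how any PSI center or modeling package actually selects templates. The function names, the toy sequences, and the 30 percent identity cutoff are all assumptions made for illustration, and the sequences are assumed to be already aligned to equal length with "-" marking gaps.

```python
from typing import Optional


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned, gap-free columns where two pre-aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    # Ignore any column where either sequence has a gap.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def pick_template(
    query: str,
    templates: dict[str, str],
    min_identity: float = 30.0,  # hypothetical cutoff, chosen only for this example
) -> Optional[str]:
    """Return the name of the most similar template, or None if none reaches the cutoff."""
    best_name: Optional[str] = None
    best_score = min_identity
    for name, seq in templates.items():
        score = percent_identity(query, seq)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # Toy, made-up sequences; real searches run against large structure databases.
    query = "MKTAYIAKQR"
    known = {"tmplA": "MKTAYLAKQR", "tmplB": "MSSAY-GHQR"}
    print(pick_template(query, known))
```

The sketch also mirrors Kalisman's point: the selection logic stays the same, and a larger `known` collection simply makes it more likely that some template clears the cutoff.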


things to a new level altogether.

“Classical structural biology focuses on individual proteins, so it’s sort of looking at each tree separately,” Godzik says. “Through changes in scale, what this becomes is looking at a forest: you suddenly see all the structures together and you start analyzing and comparing large groups of structures.”

In recent work published in the September 18, 2009, issue of Science, Godzik and colleagues took the first leap in this direction. They constructed a comprehensive model of the metabolic network of thermophilic bacterium T. maritima that includes all the three-dimensional protein structures. For the first time, Godzik says, “we have a huge biological network which can be simulated and viewed as a mini cell in silico.”

To build the model, Godzik’s team had to first identify all the proteins in the metabolic network by extracting relevant information from more than 150 publications, and then subjecting that list to in silico analyses to identify gaps or redundancies that had to be resolved manually. Of the complete set of 478 proteins in the T. maritima metabolic network, 120 structures had been experimentally determined in part by the PSI JCSG. Using homology modeling, among other computational techniques, the researchers predicted the structures of the remaining 358 proteins. Standard methods produced the structures of 95 percent of these proteins, but the last few percent took a lot of effort, Godzik says. “Getting to 100 percent coverage was a huge challenge.” And the quality of the predicted structures varied. For example, about 190 were comparable to low-resolution, experimental structures, while about 52 were merely approximate and others were somewhere in between.

But that effort proved worthwhile, Godzik says, because it led to a number of insights, perhaps most significantly, into the evolution of protein structures and organisms. The model they had constructed demonstrated that a small number of folds are represented in a majority of the proteins involved in the metabolic reactions of T. maritima. In fact, of the 478 proteins, including a total of 714 domains, there were only 182 distinct folds. And proteins involved in similar biochemical reactions have a higher probability of adopting similar folds. All of this supports the idea of structural conservation in nature, and to a much larger degree than researchers expected.

With this project, researchers also challenged the conventional thinking that accompanies structure determination. “When we first submitted our paper, the first question that came from the editor was ‘If this is a structural biology paper, what is the main structure you’re talking about?’ And we said, ‘Well, there’s no main structure; there are 478 main structures,’” Godzik says. “Both technological and conceptual changes are what structural genomics has brought to the table.”

THE NEXT CHAPTER OF STRUCTURAL GENOMICS: STEPPING OUT INTO THE PUBLIC

In 2008, the Structural Genomics Knowledgebase (PSI SGKB) (http://kb.psi-structuralgenomics.org) was launched to integrate all the results from the PSI and make them available to the public along with an array of technology, protocols, and software. “The PDB has the structures. The SGKB has the structures and the sequences and the functional annotation and different technologies that allow you to get the structures,” Berman says. “You have everything where you can find it in order to begin making new hypotheses and gaining new understanding.”

The launching of SGKB signifies an important shift in the evolution of the PSI, says Emily Carlson of the NIGMS Office of Communications and Public Liaison. “It’s gone from being a group of grants to being an actual research network where the researchers are sharing information and they’re collaborating in ways that hadn’t been done before. Not just within the PSI, but within the field and community in general.”

By encouraging public access to solved protein structures and providing over 150 different resources at the PSI SGKB, the structural genomics community is showing its commitment to transforming structural data into meaningful information of use to the greater biological community, says Michael Sykes, PhD, postdoctoral researcher at The Scripps Research Institute. “It is not sufficient to determine structure for structure’s sake. The scientific community needs to use these structures to make inroads into understanding the fundamental principles of biology.”

The coverage of “structure space” will continue to be an aim of structural genomics, but the next phase, called PSI Biology instead of PSI 3, is shifting directions. The aim: To bring structure and function studies back together again and to connect biologists with the PSI effort, Preusch says. “The new thing is partnerships. We want to bring in people who have a biological problem of significant scope for which solving a large number of protein structures is necessary to really move the problem forward.” ■
