13.07.2015 Views

Combining Information about Epidemic Threats from Multiple Sources

Combining Information about Epidemic Threats from Multiple Sources

Combining Information about Epidemic Threats from Multiple Sources

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Fig. 1: Medisys main page (restricted site).2.3 Multilingual multi-document trenddetectionThe alert definitions in MedISys are multilingual,so that the mention of a disease or symptom in thenews in any of the languages can be identified. TheMedISys software keeps a running count of all diseasealerts for any country of the world, i.e., it maintainsa count of all documents mentioning both a certaincountry and a given disease over a fixed timewindow—a period of two weeks. An alerting functiondetects a sudden increase in the number of reportsfor a given disease and country by comparing thestatistics for the last 24 hours with the two-week dailyrolling average. It uses the Poisson distribution, whichis a discrete probability distribution that expresses theprobability of a number of events occurring in a fixedperiod of time if these events occur with a known averagerate.Figure 2 shows how the intersection of countryalerts with disease alerts in combination with the trendanalysis can be used to alert users to a potential healththreat. This screenshot, <strong>from</strong> the public MedISyssite (http://medusa.jrc.it/), shows increasing reportingon dengue fever related to several South-East Asiancountries (highlighted on map).2.4 Users and usage of MedISysCustomers of MedISys can use the Web interface toview the latest trends and access articles <strong>about</strong> diseasesand countries. However, they can also opt to receiveinstant email reports, or daily summaries regardingpre-selected diseases or countries, for their ownchoice of languages. Specific registered users can alsobe granted access to the JRCs Rapid News Service—RNS, which additionally allows to filter news <strong>from</strong> selectedsources or countries, and which provides functionalityto quickly edit and publish newsletters andto distribute them via email or to mobile phones.MedISys displays the title and the first few words ofeach article, plus a link to the URL containing the fulltext.MedISys users include the European Commission,the World Health Organisation (WHO), the CanadianGlobal Public Health Intelligence Network (GPHIN),the European Centre for Disease Control (ECDC) andthe US CDC, the French Institut de Veille Sanitaire(INVS), the Spanish Instituto de Salud Carlos III, andother national authorities.3 Extracting Facts <strong>about</strong> <strong>Epidemic</strong>sMedISys has proved to be a useful and an effectivetool, with thousands of users accessing it daily. Inconsidering possible extensions that would add furthervalue, a natural choice falls on IE technology:• IE could deliver information concerning specificincidents of the diseases tracked by MedISys,whereas IR is able to return entire matched documents(along with an indication which alertsfired within the document).• IE could boost precision, since keyword-basedqueries may trigger on documents which are off-


turns the most recent 50 incidents, filtering out duplicates:if multiple incidents of the same disease inthe same location are reported, PULS returns only themost recent one. 9On the MedISys side, the returned events are displayedin two views. The main MedISys page displaysthe five most recently published events—thesecorrespond to the most urgent news. For more detail,this box has a link to the entire batch of 50 most recentincidents. For the full view, the recent list has a linkto the complete PULS database.4.2 PerformanceWe now discuss some of the on-going evaluations ofthe currently deployed systems.The number of documents PULS receives <strong>from</strong>MedISys is approximately 10,000 per month. From2,700 of these, PULS extracts approximately 6,000incidents per month, on average. (It is quite commonfor a relevant document to contain more than one incident.)The remaining 6,300 documents fed to PULSby MedISys produce no incidents. That is to be expected,since MedISys does not explicitly search foroutbreaks, but for any mentions of disease names, andmany documents may mention the crucial diseases inthe context of new vaccines or treatments, eradicationcampaigns, etc.To determine what proportion of these 6,300 documentsactually do contain events—and are thereforefalse negatives <strong>from</strong> the perspective of PULS—werandomly selected and checked 100 MedISys documentsthat produced no events. Of these, 15% containedan event that the IE system missed.From the perspective of MedISys, this roughly indicatesthat at least 64% of the documents fed to PULSon average contain no events. This confirms that theIE component indeed serves its purpose by helpingto distinguish reports <strong>about</strong> epidemic outbreaks <strong>from</strong>other discussions concerning diseases.About 20% of all extracted incidents are rated asconfident. We tried to estimate the accuracy of theconfidence heuristics. We selected 100 confident incidentsat random, and checked their correctness byhand. Without employing a rigorous (e.g., MUCstyle)evaluation, we consider an incident to be correctonly if all of its principal attributes are correct (no partialcredit). This evaluation yielded: 72% of the confidentincidents are correct; in 14% of the cases, the informationextraction is erroneous, i.e., PULS extracts9 Note that under this arrangement, a recent event that was lastreported more than 10 days ago, will not appear in the resultlist, while an event <strong>from</strong> several months ago may appear—if itis mentioned in a very recently published report. This is a designdecision that aims to balance the tension between recencyof publication vs. recency of actual occurrence of an incident:both may be important to the user. Note also that in any case allevents are always available in the PULS database for browsing.an incident where there should be none; in 14% of thecases, the confident incident is incorrect—for at leastone attribute, the top-ranked value is not correct. Thelatter category of error is difficult to correct, since it isusually due to an inherent complexity in the text. Theformer type of error is simpler to correct, as it usuallyentails some tuning of the knowledge bases. Thus, ifwe could correct the erroneous cases with some tuninglabor, we might expect the confidence measure to becorrect in just under 84% of the confident incidents.Since outbreak aggregation is our primary means ofreducing redundant information in the flow of news, itis important to have an estimate of the accuracy of theoutbreak calculation. We analyzed a randomly chosenset of medium-sized outbreaks, 20 outbreaks, <strong>about</strong>10 incidents each. For each incident we tried to determinewhether it was appropriately included in theoutbreak. We found that 68% of the incidents werecorrectly identified with their outbreaks. Three of theoutbreaks (<strong>about</strong> 15%) were erroneous, i.e., based onincorrect confident incidents. 1022.5% of the examined incidents were confident(i.e., on average, the outbreaks contained only 2–3confident incidents).5 Conclusion and Future WorkThe public and the restricted MedISys applicationsare currently independent of each other, and they providedifferent functionality. The medical event informationis only available on the public site. The twosystems will soon be integrated in order to allow asingle entry point and visual presentation. Registeredusers will receive access to more functionality andmore alert definitions. Depending on the users’ accessrights, they may get access to newswires and tocommercial sources as well.We further plan to integrate a tool that automaticallyextracts terms <strong>from</strong> the comprehensive medicalthesaurus MeSH (Medical Subject Headings), 11 andto allow users to select articles by browsing anddrilling down in the multilingual MeSH hierarchy.This will give the user an alternative entry point tothe same information.We need to resolve some technical problems to improvethe quality of the input data. One problem relatesto the way MedISys extracts textual content <strong>from</strong>10 It was interesting to observe that aggregation is often usefuleven when the outbreak consists entirely of incorrectly analyzedincidents. E.g., in high-profile cases picked up by mainnews agencies, reports are re-circulated through multiple sitesworldwide. Because the text is very similar to the original report,the IE system extracts similar incidents <strong>from</strong> all reports,and correctly groups them together. Although some attributeis always analyzed incorrectly, the error is consistent, and thegrouping is still useful: it helps reduce the load on the user byaggregating related facts.11 www.nlm.nih.gov/mesh


source sites. Because the original focus of MedISyswas on the keywords contained in the text, it ignoreddocument layout information (such as headings,sub-headings, by- and date-lines, paragraph breaks,etc.), which provides important cues when detailedtext analysis is required. The lack of this informationis known to confuse the IE process, and needs to beaddressed to improve IE accuracy. 12We are currently investigating methods for extendingthe measure of local confidence to global confidence,across multiple documents and sources.We also plan to develop methods for MedISys toexploit the information returned <strong>from</strong> PULS in novelways. One problem that needs to be studied is to whatextent the outbreaks extracted by MedISys based onkeyword frequencies agree with outbreaks extractedby PULS, and how they can be best integrated. Anotherpath under consideration is to incorporate thePULS confidence as a criterion for the urgency ofMedISys alerts. The current scheme, based on cumulativestatistics, assumes that if something is newlyprominent in many news sources, it is urgent or interestingnews. However, in some cases, news thatappears everywhere is “dated” news—it is alreadyhighly publicized. For timely surveillance, it is alsointeresting to detect outlier reports—those that havenot yet achieved wide publicity, but in this case it iscrucial that the system be certain that it was correctlyidentified. Here, a high score on the PULS confidencescale may serve as a complementary criterion for theurgency of an event.[4] B. Pouliquen, M. Kimler, R. Steinberger, C. Ignat,T. Oellinger, K. Blackler, F. Fuart, W. Zaghouani,A. Widiger, A. Forslund, and C. Best. Geocodingmultilingual texts: Recognition, disambiguationand visualisation. In Proceedings of LREC-2006,Genova, Italy, 2006.[5] R. Steinberger, B. Pouliquen, and C. Ignat. Navigatingmultilingual news collections using automaticallyextracted information. Journal CIT,13(4), 2005.[6] R. Yangarber. Counter-training in discovery ofsemantic patterns. In Proc. ACL-2003, Sapporo,Japan, 2003.[7] R. Yangarber, L. Jokipii, A. Rauramo, and S. Huttunen.Extracting information <strong>about</strong> outbreaksof infectious epidemics. In Proc. HLT-EMNLP2005, Vancouver, Canada, 2005.AcknowledgementsThis work was in part supported by the Academy ofFinland. We wish to thank the anonymous reviewersfor their thorough and helpful comments.References[1] C. Best, E. van der Goot, K. Blackler, T. Garcia,and D. Horby. Europe media monitor—systemdescription. Technical Report 22173 EN, EUR,2005.[2] R. Grishman, S. Huttunen, and R. Yangarber. <strong>Information</strong>extraction for enhanced access to diseaseoutbreak reports. J. of Biomed. Informatics,35(4), 2003.[3] W. Lin, R. Yangarber, and R. Grishman. Bootstrappedlearning of semantic classes <strong>from</strong> positiveand negative examples. In Proc. ICML Workshop,Washington, DC, 2003.12 Extracting document layout accurately is a highly non-trivialproblem, since source sites are completely unstandardized, andin general the layout is hard to infer automatically.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!