Zentralblatt

Guarding your Searches: Data Protection at zbMATH

Jens Holzkämper (FIZ Karlsruhe, Germany) and Olaf Teschke (FIZ Karlsruhe, Germany)

This year, as in the quadrennial cycles before, the mathematical community eagerly awaits the announcement of the Fields Medallists during the opening ceremony of the ICM on 13 August. Though there are, as usual, many speculations and educated guesses, we can be pretty optimistic that the names of the prize winners will once again remain secret until then; indeed, discretion has worked well so far and this has been a great part of the fun.

Of course, perfect secrecy is hard to maintain and is probably harder than ever nowadays due to technical developments. It is not necessary to kidnap and interrogate the Chair of the Fields Committee; surveillance of the communication of the people involved would do (including not just possible committee members but, at least at a later stage, press officers of mainstream media), since at least some of them would most likely not refrain from using insecure channels. Alternatively, the analysis of search queries as employed in trend mining could yield a pretty clear picture of the ongoing discussions if restricted to a sufficiently adapted user pool, which might be difficult for Google or Bing (where mathematics is just tiny noise in a large data stream) but would surely be feasible for MathSciNet or zbMATH requests.

Fortunately, this Orwellian scenario is unlikely, but only because it appears that no one capable of carrying it out would be interested in spoiling the party at the ICM. However, there is much more sensitive information around. Hiring decisions are often connected to an evaluation through scientific databases. On a general level, knowing in advance what mathematical research is going on is definitely of interest both inside and outside of mathematics.
Frequently, allegations of plagiarism have been made, involving claims that new results might have been copied from a colleague and (pre-)published first by someone else; on the other hand, knowing developments in many applications ahead of publication could lead to significant advantages. Cryptography is a well-known example, which directly pertains to organisations that are capable of large-scale surveillance. However, mathematical results are part of so many applications that it wouldn’t make sense to restrict attention to this area (or to extend it only to network algorithms, data mining, pattern recognition, etc.), as it is not certain whether more impact might come from theoretical foundations of number theory or faster matrix multiplication. Mathematics is interesting as a whole, an interest that is actually pursued on a rather transparent level by the NSA (which publicly spends a lot of money on grants, and even more on recruiting mathematicians). No doubt similar activities occur, if less prominently, in other services and regions [1]. With the information that became public from the Snowden files, one would be surprised if there were no algorithms that keep track of research activities and persons in areas like, for example, cryptography; and the uneasy feeling that these algorithms may raise an alarm due to connections unknown even to the researchers themselves, or just due to false positive signals, certainly does not make for a good environment for independent research.

While the solution of this dilemma obviously requires efforts of society as a whole, we can try to improve things in our small area. More than 20 million search queries every year in zbMATH are only a very tiny fraction of the world’s web traffic but possibly large enough to derive sensitive information in our subject, especially when queries could be personalised (which happens, for example, when EMS member accounts are used). This confronts us with the task of taking measures for data protection, at least as far as can be done from the zbMATH supplier side.
Concerning the data connection, an SSL certificate (on a server not affected by Heartbleed [2]) has been set up over the last month to protect zbMATH queries (soon to be upgraded further with software that allows perfect forward secrecy [3]), encrypting all data exchange between your browser and our servers.

While this closes the most obvious vector of attack, the question of handling the information on our servers remains. The level of possible access by the secret services to user data stored by providers has been a central topic in recent discussions. While connection data are elusive, the only secure way to protect search logs at supply servers is their permanent deletion. On the other hand, erasure is in conflict with requests from librarians, and also possibly cripples functions of the interface. Hence there are decisions to be taken, which will be outlined below.

From the librarians’ side, there is an ongoing demand for access data. At the moment, the most common standard is described by the COUNTER Code of Practice for e-Resources [4]. While the desire to evaluate the usage of the resources is highly understandable, the 90 pages (including appendices) of the recent 4th release indicate an increasing demand for the granularity of information about the usage of journals and databases, information which can only be generated from access logs. Details in various library requests from the past included, for example, differentiation between searches, clicks, long and short views, and sessions, together with overall access numbers. The need for each single number makes sense on its own (since, for example, the total number of views or downloads for articles, reviews or profiles is not always meaningful) but there is a prospective danger that, along with the legitimate wish to evaluate the relevance of a resource, a framework evolves around quantitative measures that monitors user behaviour in too detailed a way. What is lost is the awareness that such statistics simply cannot grasp fundamental aspects from the side of mathematical content. This might be best illustrated with an example. At the climax of the El Naschie scandal, one of the authors asked a librarian whether they had succeeded in eliminating Chaos, Solitons and Fractals from their Elsevier bundle. The surprising answer was: “Why should we? The access numbers skyrocketed over the last few months – this journal is obviously of highest importance for our mathematicians!” The conclusion is not just the old insight that ideally mathematicians should have the final decision on which resources they need but also that they should not overly rely on possibly treacherous statistics. What we would like to add is that the creation of detailed statistics may evolve into a privacy problem itself. This problem reaches even beyond library licensing: in the area of Open Access, the trend to justify relevance by download figures is even more pronounced; on the other hand, when such statistics are used for ranking purposes (“most popular article”, etc.), the threat of manipulation is immanent. Detection and levelling of manipulation attempts would again require an overhead of user surveillance which doesn’t seem desirable; hence, the preferable alternative seems to be to refrain from an overuse of quantitative data.

This also concerns the second issue: availability of functions based on usage data. Nowadays, shopping platforms have accustomed us to seeing options like “most popular items” or “users interested in this also viewed…”. Wouldn’t it make sense to implement something similar in scientific databases? The barrier would be, again, the willingness to exploit data from users at a large scale. Though not personalised at the level of such applications, sensitive data may become available implicitly. As a very basic example, it is known (and plausible) that researchers often search for themselves to ensure the correctness of their data. Therefore, publishing “popular searches” is not fully independent of information on who uses the database to what extent, something which is certainly not of public interest. On a more sophisticated level, let us consider the example of hiring mentioned above. Institutions that have special hiring seasons tend to have significantly larger access numbers to zbMATH in these months, which may indicate that the evaluation by profiles and reviews contributes much to the database usage. A seemingly innocent function like “people searching for this person also looked for…” would be prone to inadvertently revealing competing applications in the hiring process, which is certainly not desirable. Again, even the attempt would require long-term storage and data mining of usage data, with all the known (and, most likely, many yet unknown) problems involved.

Our approach to the topic is rather straightforward. Traditionally, FIZ Karlsruhe (as provider of zbMATH) has very high standards for data protection [5], which also derive from its history of supplying information not only on scientific progress but also, for example, on patents, where usage data would reveal business-relevant development strategies. Beyond these standards (which forbid, for example, monitoring individual actions) and, of course, the requirements of German data protection law (which is generally considered to meet the highest levels in international comparison), we would assess the value of user data protection much higher than the potential benefit of applications derived from them. Therefore, our decision is to delete user logs permanently after the finalisation of the general access statistics required by the libraries and to refrain from further analysis. The deletion currently takes place at the latest a year after access.

We hope that this policy finds the acceptance of the zbMATH user community.

[1] While there has been a discussion about the impact of mathematical finance on the banking crisis, it seems that the ethical dimension of mathematicians’ contribution to global surveillance infrastructure is yet unexplored.
[2] The crypto-apocalypse of April 2014: http://heartbleed.com/, https://xkcd.com/1354/.
[3] https://en.wikipedia.org/wiki/Forward_secrecy.
[4] http://www.projectcounter.org/code_practice.html.
[5] Privacy policy available at http://www.fiz-karlsruhe.de/fiz_privacy_policy.html?&L=1

EMS Newsletter June 2014