Use of Data Mining in the field of Library and Information Science: An Overview

Roopesh K Dwivedi, R P Bajpai

2nd International CALIBER-2004, New Delhi, 11-13 February, 2004 © INFLIBNET Centre, Ahmedabad

Abstract

Data Mining refers to the extraction or “mining” of knowledge from large amounts of data or a Data Warehouse. To do this extraction, data mining combines artificial intelligence, statistical analysis and database management systems to attempt to pull knowledge from stored data. This paper gives an overview of this emerging technology, which provides a road map to the next generation of libraries. At the end, it explores how data mining can be used effectively and efficiently in the field of library and information science, and its direct and indirect impact on library administration and services.

Keywords: Data Mining, Data Warehouse, OLAP, KDD, e-Library

0. Introduction

An area of research that has seen a recent surge in commercial development is data mining, or knowledge discovery in databases (KDD). Knowledge discovery has been defined as “the non-trivial extraction of implicit, previously unknown, and potentially useful information from data” [1]. To do this extraction, data mining combines many different technologies. In addition to artificial intelligence, statistics, and database management systems, these technologies include data warehousing and on-line analytical processing (OLAP), human-computer interaction and data visualization, machine learning (especially inductive learning techniques), knowledge representation, pattern recognition, and intelligent agents.

One may distinguish between data and knowledge by defining data as corresponding to real-world observations, being dynamic and quite detailed, whereas knowledge is less precise, more static, and deals with generalizations or abstractions of the data [2]. A number of terms have been used in place of data mining, including information harvesting, data archaeology, knowledge mining, and knowledge extraction. The knowledge is stored in a data warehouse, which is the central storehouse of data that has been extracted from operational data over time into a separate database. The information in a data warehouse is subject-oriented, non-volatile and historic in nature, so data warehouses contain extremely large datasets [3].

Libraries also hold large collections of information, and an e-Library holds an organized collection of information that serves as a rich resource for its user communities. An e-Library includes all the processes and services offered by traditional libraries, though these processes have to be revised to accommodate the differences between digital and paper media. Today’s e-Libraries are built around Internet and Web technologies, with electronic books and journals as their basic building blocks. Here the Internet serves as a carrier and provides the content delivery mechanism, and Web technology provides the tools and techniques for content publishing, hosting and accessing. The availability of computing power that allows parallel processing, multitasking and parallel knowledge navigation, together with the increasing popularity of the Internet and developments in Web technologies, are the main catalysts for the concept of the e-Library.

Data Mining is a relatively new term in the world of library and information science, though it has long been used by both commercial and scientific communities. There are three main reasons for its adoption in those communities. First, both the number and size of databases in many organizations are growing at a staggering rate. Terabyte and even petabyte databases, once unthinkable, are now becoming a reality in a variety of domains, including marketing, sales, finance, healthcare, earth science, molecular biology (e.g. the human genome project), and various government applications. Second, organizations have realized that there is valuable knowledge buried in the data which, if discovered, could provide those organizations with a competitive advantage. Third, some of the enabling technologies have only recently become mature enough to make data mining possible on large datasets.

1. Data Mining: The Concept

“Data mining is the exploration and analysis, by automatic and semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules” [5]. It takes data and business opportunities and produces actionable results.

1.1 The Knowledge Discovery in Databases (KDD) Process

Data mining is actually a step in a larger KDD process. The KDD process employs data mining methods or algorithms to extract or identify knowledge according to some criteria or measure of interestingness, but it also includes steps that prepare the data, such as preprocessing, sub-sampling, and transformations of the database [6].

Figure 1. The KDD Process: Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge
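
The flow in Figure 1 can be sketched as a chain of small functions, one per stage. The following Python sketch is illustrative only: the loan records, the field layout, the rule for telling the two date formats apart, and the threshold used as a measure of interestingness are all hypothetical and do not come from the paper. The mining step is a plain frequency count standing in for a real data mining algorithm.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw loan records (not from the paper):
# (borrower id, subject, loan date written in one of two forms).
RAW = [
    ("u1", "physics", "3/15/97"),
    ("u2", "physics", "15-3-1997"),
    ("u3", "history", "4/2/97"),
    ("u4", "", "5/1/97"),           # missing subject: a data-entry error
    ("u5", "physics", "20-4-1997"),
]

def select(records, limit):
    """Selection: take a subset of the available data as the target data."""
    return records[:limit]

def preprocess(records):
    """Preprocessing: drop records that have an empty field."""
    return [r for r in records if all(r)]

def to_date(text):
    """Assumption: '/' means month/day/2-digit-year, '-' means day-month-year."""
    fmt = "%m/%d/%y" if "/" in text else "%d-%m-%Y"
    return datetime.strptime(text, fmt).date()

def transform(records):
    """Transformation: bring dates written in two forms into one format."""
    return [(user, subject, to_date(day)) for user, subject, day in records]

def mine(records):
    """Data mining (stand-in): count loans per subject."""
    return Counter(subject for _, subject, _ in records)

def evaluate(patterns, min_count):
    """Interpretation/evaluation: keep patterns that pass a simple threshold."""
    return {subject: n for subject, n in patterns.items() if n >= min_count}

def kdd(records, limit=5, min_count=2):
    """Run the stages in the order shown in Figure 1."""
    target = select(records, limit)
    clean = preprocess(target)
    transformed = transform(clean)
    patterns = mine(transformed)
    return evaluate(patterns, min_count)

knowledge = kdd(RAW)
print(knowledge)
```

With these sample records, the record with the missing subject is dropped during preprocessing, and only the subject borrowed at least twice survives the evaluation step. In a real setting each stage would be revisited iteratively rather than run once.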


The first step in the KDD process is to select the data to be analyzed from the set of all available data. In many cases, the data is stored in transaction databases. These databases are quite large and extremely dynamic. A subset of the data must therefore be selected from those databases, since it is unnecessary in the early stages to attempt to analyze all of the data.

Target data is then moved to a cache or another database for further preprocessing. Preprocessing is an extremely important step in the KDD process. Often, data have errors introduced during the input process, either from a data entry clerk entering data incorrectly or from a faulty data collection device. If target data are being extracted from several source databases, the databases can often be inconsistent with each other in terms of their data models, the semantics of the attributes, or the way the data is represented in the database. If two databases were built at different times and following different guidelines, it is entirely possible that they use two different data models (e.g. relational and object-oriented) and two different representations of the entities or objects and their relationships to each other. The preprocessing step should identify these differences and make the data consistent and clean.

The data can often be transformed for use with different analysis techniques. A number of separate tables can be joined into one table, and vice versa. An attribute that may be represented in two different forms (a date written as 3/15/97 versus 15-3-1997) should be transformed into a common format. If the data is represented as text but the intended data mining technique requires numerical data, the data must be transformed accordingly.

At this point, data mining algorithms can be used to discover knowledge, e.g., trends, patterns, characteristics, or anomalies. The appropriate discovery or data mining algorithms should be identified, as they should be pertinent to the purpose of the analysis and to the type of data to be analyzed. Often, data mining algorithms work more effectively if they have some domain information available: information on attributes that have higher priority than others, attributes that are not important at all, or established relationships that are already known. Domain information is often collected in a knowledge base, a storage mechanism similar to a database but used to store domain information and other knowledge.

When a pattern is identified, it should be examined to determine whether it is new, relevant and “correct” by some standard of measure. The interpretation and evaluation step may involve more interaction with a user or with some agent of the user who can make relevancy determinations. When the pattern is deemed relevant and useful, it can be deemed knowledge. The knowledge should be placed in the knowledge base for use in subsequent iterations. Note that the entire KDD process is iterative; at many of the steps there may be a need to go back to a previous step, because no patterns were discovered, new data should be selected for additional analysis, or the patterns that were discovered are not relevant.

In many steps of the KDD process, it is essential to provide good visualization support to the user. This is important for two reasons. First, without such visualizations, it may be difficult for users to determine the usefulness of discovered knowledge; often a picture is worth a thousand words. Second, given good visualization tools, the user can discover things that automated data mining tools may be unable to discover. Working as a team, the user and automated discovery tools provide far more powerful data mining capabilities than either can provide alone.

1.2 The Virtuous Cycle of Data Mining

The promise of data mining is to find the interesting patterns in large amounts of data. But merely finding the patterns is not enough. You must be able to respond to the patterns and act on them, ultimately turning the data into information, the information into action, and the action into value. This is the virtuous cycle of data mining in a nutshell.


There are four stages in the virtuous cycle of data mining:

i. Identify the business problem
ii. Use data mining techniques to transform the data into actionable information
iii. Act on the information
iv. Measure the results

These steps are highly interdependent; the results of one stage are the inputs to the next, much like the steps in a multi-step manufacturing process. The whole approach is driven by results.

2. Use of Data Mining

Data mining derives its name from the similarity between mining and the search for valuable information in a large database. A Gartner Group advanced technology research note listed data mining and artificial intelligence at the top of the five key technology areas that “will clearly have a major impact across a wide range of industries within the next three to five years.” Gartner also listed parallel architectures and data mining as two of the top ten new technologies in which companies will invest during the next five years [3].

In practice, data mining can accomplish only a limited set of tasks, and only under limited circumstances. What matters is that many problems of intellectual, economic, and business interest can be phrased in terms of the following six tasks.

i. Classification - classification consists of examining the features of a newly presented object and assigning it to one of a predefined set of classes.
ii. Estimation - classification deals with discrete outcomes, while estimation deals with continuously valued outcomes. Given some input data, we use estimation to come up with a value for some unknown continuous variable such as income, height, or credit card balance.
iii. Prediction - prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. In a prediction task, the only way to check the accuracy of the classification is to wait and see.
iv. Affinity grouping - the task of affinity grouping is to determine which things go together. Affinity grouping can also be used to identify cross-selling opportunities and to design attractive packages or groupings of products and services. It is one simple approach to generating rules from data. If two items, say cat food and kitty litter, occur together frequently enough, we can generate two association rules:
   • People who buy cat food also buy kitty litter with probability P1.
   • People who buy kitty litter also buy cat food with probability P2.
v. Clustering - clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters. Clustering differs from classification in that it does not rely on predefined classes or on examples. The records are grouped together on the basis of self-similarity.
vi. Description - sometimes the purpose of data mining is simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products, or processes that produced the data in the first place. A good enough description of a behavior will often suggest an explanation for it as well.
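
The two association rules in the affinity grouping item can be made concrete by estimating P1 and P2 from transaction data. The sketch below is a minimal illustration, assuming a hypothetical list of market baskets that is not taken from the paper. Estimating each probability as a conditional frequency (often called the confidence of a rule) is one common choice, not something the paper specifies.

```python
# Hypothetical baskets (not from the paper); each set is one purchase.
BASKETS = [
    {"cat food", "kitty litter"},
    {"cat food", "kitty litter", "milk"},
    {"cat food"},
    {"cat food", "milk"},
    {"kitty litter", "bread"},
]

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a fraction of baskets."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0  # the rule cannot be estimated without any support
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

# Rule 1: people who buy cat food also buy kitty litter.
p1 = confidence(BASKETS, "cat food", "kitty litter")
# Rule 2: people who buy kitty litter also buy cat food.
p2 = confidence(BASKETS, "kitty litter", "cat food")
print(f"P1 = {p1:.2f}, P2 = {p2:.2f}")
```

The two estimates generally differ because each is conditioned on a different item: here four baskets contain cat food but only three contain kitty litter, while the same two baskets contain both.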


No single data mining tool or technique is equally applicable to all the tasks. In commercial applications, data mining is usually employed on very large databases. The reasons for this are twofold.

• In small databases, it is possible to find interesting patterns and relationships by simple inspection of results from familiar tools such as spreadsheets and multidimensional query tools.
• Most data mining techniques require large amounts of training data containing many examples in order to generate classification rules, association rules, clusters, or predictions. Small datasets lead to unreliable conclusions based on chance patterns.

3. Data Mining Methodology

The data mining process can be divided into four stages:

i. Identifying the problem
ii. Analyzing the data
iii. Taking action
iv. Measuring the outcome

The first and third stages raise mainly business issues. For data mining to be successful, these business issues must, of course, be properly addressed.

There are two basic styles of data mining:

• Hypothesis testing - a top-down approach that attempts to substantiate or disprove preconceived ideas.
• Knowledge discovery - a bottom-up approach that starts with the data and tries to get it to tell us something we didn’t know.

3.1 Hypothesis Testing

Hypothesis testing is what scientists and statisticians spend their lives doing. The validity of a hypothesis is tested by analyzing data that may simply be collected by observation or generated through an experiment, such as a test mailing. The hypothesis testing method has several steps:

i. Generate good ideas (hypotheses)
ii. Determine what data would allow these hypotheses to be tested
iii. Locate the data
iv. Prepare the data for analysis
v. Build computer models based on the data
vi. Evaluate the computer models to confirm or reject the hypotheses

3.2 Knowledge Discovery

Knowledge discovery can be either directed or undirected. Directed knowledge discovery is goal-oriented: there is a specific field whose value we want to predict, a fixed set of classes to be assigned to each record, or a specific relationship we want to explore. The directed knowledge discovery method has several steps:


i. Identify sources of pre-classified data
ii. Prepare the data for analysis
iii. Build and train the computer model
iv. Evaluate the computer model
v. Apply the computer model to new data

In undirected knowledge discovery, the data mining tool is simply let loose on the data in the hope that it will discover meaningful structure. One common use of undirected knowledge discovery is market basket analysis. Another application is clustering, where records are assigned to the same cluster if they have something in common. The undirected knowledge discovery method has several steps:

i. Identify sources of data
ii. Prepare the data for analysis
iii. Build and train a computer model
iv. Evaluate the computer model
v. Apply the computer model to new data
vi. Identify potential targets for directed knowledge discovery
vii. Generate new hypotheses to test

Note that steps i through v are the same as for directed knowledge discovery. The two additional steps reflect the fact that undirected knowledge discovery is usually a prelude to further investigation via more directed techniques.

4. Data Mining and the Libraries

So far we have discussed data mining and how it works. We now explore how data mining can be useful in the field of library and information science. According to the fifth law of library science, “Library is a growing organism” [9], so the volume of library data is also growing at an enormous rate. Library automation and the e-Library arose from the need to administer libraries and extend library services efficiently and effectively. But simply automating the library or developing an e-Library is not a complete solution unless we are able to uncover the hidden information in the large amount of data. This can be done by applying data mining to the library data.

We now take a look at the possibilities that data mining opens up in the field of library and information science.

i. Classification - Using data mining, we can develop a computer program that replaces the manual classification of library contents with automatic classification. Classification mimics library cataloging procedures by grouping structured and unstructured data according to criteria such as source (e.g., government bodies), document type (e.g., maps), language, subject, or a number of other criteria [3].
ii. Link analysis - As with paper materials, where similar documents tend to have similar bibliographical references and frequency of citation is often considered to reflect the quality or importance of a document, link analysis assumes that higher-quality or otherwise more desirable documents will generally be linked to more frequently than other documents, and that the links in a document reveal something about its content. Link analysis can place frequently linked-to documents at the top of a list or identify documents that are associated with each other [3].


iii. Sequence analysis - Sequence analysis applies statistical analysis to the paths that users follow when searching for information, in order to identify unlinked documents that users are likely to want to read together [3].

iv. Summarization - Though machine-generated abstracts are inferior to human-generated ones in terms of readability and content, they can be very useful for helping users decide which items they need. Abstract-generating software typically works by identifying significant words or phrases based on their position within documents or their association with critical phrases [3].

v. Clustering - Clustering is similar to classification, except that the classes are determined by finding natural groupings in the data items based on probability analyses rather than by predetermined groupings. Clustering and classification are often used as a starting point for exploring further relationships in data. For example, many search engines (such as Northern Light) break down sites by location, subject, or language before sub-arranging data [3].

5. Future of data mining in library work

In future, data mining can provide a new road map for the next generation of libraries when it is applied to the following library activities.

i. Searching of Information (Reference Service) - Library data is growing continuously at an exponential rate, and the main problem is how to retrieve the required information from the large amount of redundant information held by the library. Data mining techniques can make this possible, so data mining can be regarded as the future of reference service.

ii. Classification - Data mining will replace manual classification of library content with computer-assisted classification, so that the classification task can be accomplished quickly and efficiently by less skilled staff. This will simplify the classification work of the library.

iii. Acquisition - The third law of library science states, “Every book its reader” [9]. By applying data mining to library data, the contents that need to be acquired next can easily be identified. This will reduce the acquisition-related work of library staff and lead to more efficient use of the budget allocated to the library.

6. Conclusion

It can be concluded that there is a need for data mining techniques that will redesign and simplify library operations such as classification, acquisition, circulation and referencing. The main use of data mining is in referencing, but it can be used for other library work as well. Systematic efforts are therefore urgently needed to develop data mining techniques and algorithms for library databases.

7. References

1. Frawley, W., Piatetsky-Shapiro, G., Matheus, C. J., “Knowledge Discovery in Databases: An Overview,” in Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. Frawley (Eds.), MIT Press, 1991.
2. Wiederhold, G., “Knowledge Versus Data,” in On Knowledge Base Management Systems: Integrating Artificial Intelligence and Database Technologies, Brodie, M. and Mylopoulos, J. (Eds.), Springer-Verlag, 1986.
3. Dhiman, Anil K., Data Mining and its Use in Libraries, CALIBER-2003.
4. Carbone, Patricia L., Data Mining or “Knowledge Discovery in Databases”: An Overview, MITRE Corporation, 1997.


5. Berry, Michael J.A., Linoff, G., Data Mining Techniques for Marketing, Sales and Customer Support, Wiley Computer Publishing.
6. Fayyad, U., Uthurusamy, R. (Eds.), Proceedings of the First International Conference on Knowledge Discovery and Data Mining, The AAAI Press, Menlo Park, CA, 1995.
7. Han, J., Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2002.
8. Adriaans, P., Zantinge, D., Data Mining, Pearson Education, 2003.
9. Ranganathan, S.R., Five Laws of Library Science, Sarada Ranganathan Endowment for Library Science, Bangalore, 1993.

About Authors

Mr. Roopesh K. Dwivedi is Information Scientist at MGCGV, Chitrakoot, Satna, India.
E-mail: rkdwivedi_mgcgv@rediffmail.com

Mr. R.P. Bajpai is In-charge Librarian at MGCGV, Chitrakoot, Satna, India.
E-mail: rpbajpai_mgcgv@rediffmail.com
