OAIS in the Lifecycle Management of Records - Digital Library ...
OAIS in the Lifecycle Management of Records - Digital Library ...
OAIS in the Lifecycle Management of Records - Digital Library ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Performance Study <strong>of</strong> <strong>Digital</strong>Object Format Identification &Validation ToolsQuyen NguyenERA Systems Eng<strong>in</strong>eer<strong>in</strong>gNational Archives & <strong>Records</strong>Adm<strong>in</strong>istration
Agenda• Background• Format Identification Tools• Experiments• Analysis• Related Work• Summary2
<strong>OAIS</strong> Model for ERAPreservation Plann<strong>in</strong>gPRODUCERSIPDescriptiveInfoIngestAIPData<strong>Management</strong>ArchivalStorageDescriptiveInfoAIPAccess4-1.3queriesresult setsordersDIPCONSUMERAdm<strong>in</strong>istrationMANAGEMENT3
Challenges and RequirementsExtensibilityAvailabilityScalabilityINGESTPreserveAccessEvolvabilitySecurity1. Complexity<strong>Records</strong> <strong>in</strong>different formats,which may beobsolete.2. VolumeEnormousamounts <strong>of</strong>records•Format Identification•Ingest Verification4
Ingest Process OrchestrationInput FileIntegrity SealVirus ScanProcess VirusInfected FileYesVirus FoundNoAccessRestriction ScanMetadataCollectionFile Identification&ValidationSTOP5
• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary6
File Format• Real issue: file extensionunreliable to determ<strong>in</strong>e <strong>the</strong>format <strong>of</strong> a digital object– depends on end-user orapplication.• Format identification. . Micros<strong>of</strong>tWord 2003, Acrobat 8 PDF, etc.• Format validation. . Once aformat f has been identified for adigital object X, does X reallyconform to format f. . Forexample, an XML documentmay be well-formed or not.ASCIITXT XML VRMLBINARYMS WORD TIFF PDF7
Identification & Validation Tool• Several <strong>in</strong>stitutions have developed such tools.• A tool performs follow<strong>in</strong>g task:– File Input.– F<strong>in</strong>d matched signature.– Output Metadata:• File format: PDF, JPEG, Micros<strong>of</strong>t Word, etc.• Version number <strong>of</strong> application used to create digitalobject.• Sounds simple yet difficult8
JHOVE• JSTOR/ STOR/HarvardObjectValidationEnvironment, developed by JStor(Journal STORage) and HarvardUniversity <strong>Library</strong>.• Set <strong>of</strong> modules called “handlers”,each <strong>of</strong> which is responsible for afile type• Traverse set <strong>of</strong> “handlers” untilone is found that can positivelyidentify <strong>the</strong> type <strong>of</strong> <strong>in</strong>put file.• JHOVE can output rich metadata.Technical metadata such as MIXdata elements for image files part<strong>of</strong> output.9
DROID• <strong>Digital</strong>RecordObjectIdentificationdeveloped by United K<strong>in</strong>gdomNational Archives.• Based on PRONOM registry <strong>of</strong> filesignatures specific to file types.• At runtime, <strong>the</strong> content <strong>of</strong> <strong>the</strong> registrycan be downloaded as an XML file,and cached <strong>in</strong> <strong>the</strong> DROID process.• Traverse signature file conta<strong>in</strong><strong>in</strong>gcached content <strong>of</strong> PRONOM.• DROID process will try to match oneby one <strong>the</strong> signatures <strong>in</strong> <strong>the</strong> signaturefile aga<strong>in</strong>st <strong>the</strong> one <strong>in</strong> <strong>the</strong> <strong>in</strong>put file10
JHOVE Screenshot11
DROID Screenshot12
• BackgroundAgenda• Format Identification Tools• Experiments• Related Work• Analysis• Summary13
Experimentation• Environment– Intel ® CPU T2500 @ 2.0 GHz, 2.0 GHz, 2.0 GB <strong>of</strong> RAM.– Micros<strong>of</strong>t W<strong>in</strong>dow XP Pr<strong>of</strong>essional Version 2002 ServicePak 2.– Runtime JVM comes with Sun JDK 1.6.0_01-b01.– java –Xms1024m–Xmx1024m– Jhove version 1.1 2006-02-13– DROID v1.1• Inserted simple trac<strong>in</strong>g code– System.currentTimeMillis()– Runtime.totalMemory() – Runtime.freeMemory()• Metrics– Execution Time (ms): time-jhove, time-droid.– Heap Size (KB): heap-jhove, heap-droid.• 50 measurements per collection or file.• Statistical tools: Micros<strong>of</strong>t Excel and Stats4U.14
Data Corpus• Corpus C1: examples shipped with JHOVE.– 112 files whose size ranges from 1 KB to 22 MB– most <strong>of</strong> <strong>the</strong> files are less than 100 KB.– Files are grouped <strong>in</strong>to subdirectories accord<strong>in</strong>g to<strong>the</strong>ir document types: ASCII, GIF, HTML, JPEG,PDF, TIFF, WAV, and XML.– HTML subdirectory also conta<strong>in</strong>s GIF and JPEGimages <strong>in</strong> <strong>the</strong> HTML pages.15
Data Corpus (2)• Corpus C2: 24 collections <strong>of</strong> documents used <strong>in</strong>NARA research lab.• Typical documents com<strong>in</strong>g to <strong>the</strong> public archives– Photos from National Park Service– Documents related to Katr<strong>in</strong>a– Case files <strong>of</strong> U.S. District Courts– White House press releases,– Environmental maps from EPA– 1280 files whose sizes range from 1 KB to 136MB.– Notably, document types are more varied.– In addition to <strong>the</strong> types found <strong>in</strong> C1 set, one canf<strong>in</strong>d audio, video clips files, geospatial files,statistical files, etc.16
Statistical Analysis• Perform T-Test Test us<strong>in</strong>g Micros<strong>of</strong>t Excel.• Execution Time.– Null Hypo<strong>the</strong>sis H0: time-droid = time-jhove.– Alternative Hypo<strong>the</strong>sis H1: time-droid > time-jhove.• Heap Size– Null Hypo<strong>the</strong>sis H0: heap-droid = heap-jhove.– Alternative Hypo<strong>the</strong>sis H1: heap -droid < heap -jhove.• To conclude H0 with confidence, we want t-Stat be small,and P(T
Experiment 1• Corpus C1.Execution Time:• From T-test, 99% confidence level, time-droid issignificantly greater than time-jhove– t Stat = 1487.29;– P(T
Experiment 2• Corpus C2.Execution Time:• From T-test, 95% confidence level, time-jhove issignificantly greater than time-droid– t Stat = -5.48;– P(T
Data Type Impact• Two collections --around 865th and1000th data po<strong>in</strong>tscaused a dramatic<strong>in</strong>crease <strong>in</strong> time-jhove• Conta<strong>in</strong> mostly VRML(Virtual RealityModel<strong>in</strong>g Language)files, which areessentially <strong>in</strong> ASCIItext, but can be<strong>in</strong>terpreted for display.20
Experiment 3• Tie.• Corpus: C2b = C2 – {2 VRML collections}• From T-test, 95% confidence level, time-droid issignificantly greater than time-droid– t Stat = 0.057– P(T
Experiment 4• Corpus C2 re-arranged by types and sizes.• Use Stat4U• 3-way ANOVA with factors: A=Tool; B=Type;C=Size (2 levels only)– All 3 factors and <strong>in</strong>teractions are significant with 95%confidence level.– Tool factor expla<strong>in</strong>s only 0.9 % <strong>of</strong> <strong>the</strong> variation.22
L<strong>in</strong>ear Regression: Size-Timet i m e ( m s )DROID - Time vs. File Size3000y = 0.1325x + 9.524R 2 = 0.9965250020001500100050000 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000size (kb)droid-timeL<strong>in</strong>ear (droid-time)• Corpus C2.• Only f<strong>in</strong>d l<strong>in</strong>earregression for time-droid:time-droid = 0.13 * size + 9.52• 100 TB ~ 5 months.• Information for siz<strong>in</strong>g:– Comput<strong>in</strong>g resources– Parallelism23
• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary24
Analysis• Statistically, JHOVE and DROID perform equally wellfor format types that JHOVE can identify.– Qualitatively, JHOVE generated metadata is richer.• For types that JHOVE cannot validate, <strong>the</strong>performance decreases drastically compared toDROID.– Easy case: if JHOVE f<strong>in</strong>ds that a record is b<strong>in</strong>ary, itjust responds with a general identification, e.g.ByteStream.– But some ASCII cases such as VRML may throw it<strong>of</strong>f.25
Integrated Approach• Two-phase approach for File Identification andValidation:– Pass a file through DROID to quickly identify itstype.– If <strong>the</strong> type is found to be on <strong>the</strong> known list <strong>of</strong>JHOVE, <strong>the</strong>n pass through JHOVE to extracttechnical metadata.• These extracted technical metadata useful forautomatic verification purposes.• Examples <strong>in</strong>clude image resolution, format versionnumbers, creation dates, font <strong>in</strong>formation,etc.26
• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary27
Related Work• GDFR: Global <strong>Digital</strong> Format Registry– Distributed and replicated registry <strong>of</strong> format <strong>in</strong>formation– Allow <strong>the</strong> registration and discovery <strong>of</strong> digital formats for<strong>the</strong> long term– Collaboration <strong>of</strong> Harvard University <strong>Library</strong> and OCLC– http://www.gdfr.<strong>in</strong>fo• FOCUS: Format Curation Service at <strong>the</strong> University <strong>of</strong>Maryland– Ma<strong>in</strong> component is Format Identifier (Fider)– Registry Global <strong>Digital</strong> Format Registry (GDFR)implemented us<strong>in</strong>g LDAP– https://wiki.umiacs.umd.edu/adapt/<strong>in</strong>dex.php/Focus:Ma<strong>in</strong>28
Related Work (2)• PERPOS: Presidential Electronic <strong>Records</strong> Pilot System at GeorgiaTechnology University– s<strong>of</strong>tware tools to support <strong>the</strong> <strong>OAIS</strong> functionalities– http://perpos.gtri.gatech.edu• Metadata Extract Tool from National <strong>Library</strong> <strong>of</strong> New Zealand– http://www.natlib.govt.nz/services/get-advice/digital-libraries/metadata-extraction-tooltool• AIHT: Automated Preservation Assessment <strong>of</strong> Heterogeneous <strong>Digital</strong>Collections bys Stanford University• The University <strong>of</strong> London Computer Centre issued a report to compareDROID, JHOVE, and AIHT:– Assessment <strong>of</strong> File Format Test<strong>in</strong>g Tool.http://www.ulcc.ac.uk/uploads/media/DAAT_file_format_tools_report.pdf.– Very good qualitative and functional analysis• O<strong>the</strong>r Projects?29
Summary• File Identification and Validation important step <strong>in</strong>Ingest Process.• Performance study <strong>of</strong> Jhove and DROID.• Optimal approach leverag<strong>in</strong>g both tools.• Monitor future progress <strong>of</strong> o<strong>the</strong>r tools.• Look<strong>in</strong>g forward to Jhove 2.30
Thank Youhttp://www.archives.gov/eraFor any comments or questions, please mailto:quyen.nguyen@nara.gov31