11.07.2015 Views

OAIS in the Lifecycle Management of Records - Digital Library ...

OAIS in the Lifecycle Management of Records - Digital Library ...

OAIS in the Lifecycle Management of Records - Digital Library ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Performance Study <strong>of</strong> <strong>Digital</strong>Object Format Identification &Validation ToolsQuyen NguyenERA Systems Eng<strong>in</strong>eer<strong>in</strong>gNational Archives & <strong>Records</strong>Adm<strong>in</strong>istration


Agenda• Background• Format Identification Tools• Experiments• Analysis• Related Work• Summary2


<strong>OAIS</strong> Model for ERAPreservation Plann<strong>in</strong>gPRODUCERSIPDescriptiveInfoIngestAIPData<strong>Management</strong>ArchivalStorageDescriptiveInfoAIPAccess4-1.3queriesresult setsordersDIPCONSUMERAdm<strong>in</strong>istrationMANAGEMENT3


Challenges and RequirementsExtensibilityAvailabilityScalabilityINGESTPreserveAccessEvolvabilitySecurity1. Complexity<strong>Records</strong> <strong>in</strong>different formats,which may beobsolete.2. VolumeEnormousamounts <strong>of</strong>records•Format Identification•Ingest Verification4


Ingest Process OrchestrationInput FileIntegrity SealVirus ScanProcess VirusInfected FileYesVirus FoundNoAccessRestriction ScanMetadataCollectionFile Identification&ValidationSTOP5


• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary6


File Format• Real issue: file extensionunreliable to determ<strong>in</strong>e <strong>the</strong>format <strong>of</strong> a digital object– depends on end-user orapplication.• Format identification. . Micros<strong>of</strong>tWord 2003, Acrobat 8 PDF, etc.• Format validation. . Once aformat f has been identified for adigital object X, does X reallyconform to format f. . Forexample, an XML documentmay be well-formed or not.ASCIITXT XML VRMLBINARYMS WORD TIFF PDF7


Identification & Validation Tool• Several <strong>in</strong>stitutions have developed such tools.• A tool performs follow<strong>in</strong>g task:– File Input.– F<strong>in</strong>d matched signature.– Output Metadata:• File format: PDF, JPEG, Micros<strong>of</strong>t Word, etc.• Version number <strong>of</strong> application used to create digitalobject.• Sounds simple yet difficult8


JHOVE• JSTOR/ STOR/HarvardObjectValidationEnvironment, developed by JStor(Journal STORage) and HarvardUniversity <strong>Library</strong>.• Set <strong>of</strong> modules called “handlers”,each <strong>of</strong> which is responsible for afile type• Traverse set <strong>of</strong> “handlers” untilone is found that can positivelyidentify <strong>the</strong> type <strong>of</strong> <strong>in</strong>put file.• JHOVE can output rich metadata.Technical metadata such as MIXdata elements for image files part<strong>of</strong> output.9


DROID• <strong>Digital</strong>RecordObjectIdentificationdeveloped by United K<strong>in</strong>gdomNational Archives.• Based on PRONOM registry <strong>of</strong> filesignatures specific to file types.• At runtime, <strong>the</strong> content <strong>of</strong> <strong>the</strong> registrycan be downloaded as an XML file,and cached <strong>in</strong> <strong>the</strong> DROID process.• Traverse signature file conta<strong>in</strong><strong>in</strong>gcached content <strong>of</strong> PRONOM.• DROID process will try to match oneby one <strong>the</strong> signatures <strong>in</strong> <strong>the</strong> signaturefile aga<strong>in</strong>st <strong>the</strong> one <strong>in</strong> <strong>the</strong> <strong>in</strong>put file10


JHOVE Screenshot11


DROID Screenshot12


• BackgroundAgenda• Format Identification Tools• Experiments• Related Work• Analysis• Summary13


Experimentation• Environment– Intel ® CPU T2500 @ 2.0 GHz, 2.0 GHz, 2.0 GB <strong>of</strong> RAM.– Micros<strong>of</strong>t W<strong>in</strong>dow XP Pr<strong>of</strong>essional Version 2002 ServicePak 2.– Runtime JVM comes with Sun JDK 1.6.0_01-b01.– java –Xms1024m–Xmx1024m– Jhove version 1.1 2006-02-13– DROID v1.1• Inserted simple trac<strong>in</strong>g code– System.currentTimeMillis()– Runtime.totalMemory() – Runtime.freeMemory()• Metrics– Execution Time (ms): time-jhove, time-droid.– Heap Size (KB): heap-jhove, heap-droid.• 50 measurements per collection or file.• Statistical tools: Micros<strong>of</strong>t Excel and Stats4U.14


Data Corpus• Corpus C1: examples shipped with JHOVE.– 112 files whose size ranges from 1 KB to 22 MB– most <strong>of</strong> <strong>the</strong> files are less than 100 KB.– Files are grouped <strong>in</strong>to subdirectories accord<strong>in</strong>g to<strong>the</strong>ir document types: ASCII, GIF, HTML, JPEG,PDF, TIFF, WAV, and XML.– HTML subdirectory also conta<strong>in</strong>s GIF and JPEGimages <strong>in</strong> <strong>the</strong> HTML pages.15


Data Corpus (2)• Corpus C2: 24 collections <strong>of</strong> documents used <strong>in</strong>NARA research lab.• Typical documents com<strong>in</strong>g to <strong>the</strong> public archives– Photos from National Park Service– Documents related to Katr<strong>in</strong>a– Case files <strong>of</strong> U.S. District Courts– White House press releases,– Environmental maps from EPA– 1280 files whose sizes range from 1 KB to 136MB.– Notably, document types are more varied.– In addition to <strong>the</strong> types found <strong>in</strong> C1 set, one canf<strong>in</strong>d audio, video clips files, geospatial files,statistical files, etc.16


Statistical Analysis• Perform T-Test Test us<strong>in</strong>g Micros<strong>of</strong>t Excel.• Execution Time.– Null Hypo<strong>the</strong>sis H0: time-droid = time-jhove.– Alternative Hypo<strong>the</strong>sis H1: time-droid > time-jhove.• Heap Size– Null Hypo<strong>the</strong>sis H0: heap-droid = heap-jhove.– Alternative Hypo<strong>the</strong>sis H1: heap -droid < heap -jhove.• To conclude H0 with confidence, we want t-Stat be small,and P(T


Experiment 1• Corpus C1.Execution Time:• From T-test, 99% confidence level, time-droid issignificantly greater than time-jhove– t Stat = 1487.29;– P(T


Experiment 2• Corpus C2.Execution Time:• From T-test, 95% confidence level, time-jhove issignificantly greater than time-droid– t Stat = -5.48;– P(T


Data Type Impact• Two collections --around 865th and1000th data po<strong>in</strong>tscaused a dramatic<strong>in</strong>crease <strong>in</strong> time-jhove• Conta<strong>in</strong> mostly VRML(Virtual RealityModel<strong>in</strong>g Language)files, which areessentially <strong>in</strong> ASCIItext, but can be<strong>in</strong>terpreted for display.20


Experiment 3• Tie.• Corpus: C2b = C2 – {2 VRML collections}• From T-test, 95% confidence level, time-droid issignificantly greater than time-droid– t Stat = 0.057– P(T


Experiment 4• Corpus C2 re-arranged by types and sizes.• Use Stat4U• 3-way ANOVA with factors: A=Tool; B=Type;C=Size (2 levels only)– All 3 factors and <strong>in</strong>teractions are significant with 95%confidence level.– Tool factor expla<strong>in</strong>s only 0.9 % <strong>of</strong> <strong>the</strong> variation.22


L<strong>in</strong>ear Regression: Size-Timet i m e ( m s )DROID - Time vs. File Size3000y = 0.1325x + 9.524R 2 = 0.9965250020001500100050000 2000 4000 6000 8000 10000 12000 14000 16000 18000 20000size (kb)droid-timeL<strong>in</strong>ear (droid-time)• Corpus C2.• Only f<strong>in</strong>d l<strong>in</strong>earregression for time-droid:time-droid = 0.13 * size + 9.52• 100 TB ~ 5 months.• Information for siz<strong>in</strong>g:– Comput<strong>in</strong>g resources– Parallelism23


• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary24


Analysis• Statistically, JHOVE and DROID perform equally wellfor format types that JHOVE can identify.– Qualitatively, JHOVE generated metadata is richer.• For types that JHOVE cannot validate, <strong>the</strong>performance decreases drastically compared toDROID.– Easy case: if JHOVE f<strong>in</strong>ds that a record is b<strong>in</strong>ary, itjust responds with a general identification, e.g.ByteStream.– But some ASCII cases such as VRML may throw it<strong>of</strong>f.25


Integrated Approach• Two-phase approach for File Identification andValidation:– Pass a file through DROID to quickly identify itstype.– If <strong>the</strong> type is found to be on <strong>the</strong> known list <strong>of</strong>JHOVE, <strong>the</strong>n pass through JHOVE to extracttechnical metadata.• These extracted technical metadata useful forautomatic verification purposes.• Examples <strong>in</strong>clude image resolution, format versionnumbers, creation dates, font <strong>in</strong>formation,etc.26


• BackgroundAgenda• Format Identification Tools• Experiments• Analysis• Related Work• Summary27


Related Work• GDFR: Global <strong>Digital</strong> Format Registry– Distributed and replicated registry <strong>of</strong> format <strong>in</strong>formation– Allow <strong>the</strong> registration and discovery <strong>of</strong> digital formats for<strong>the</strong> long term– Collaboration <strong>of</strong> Harvard University <strong>Library</strong> and OCLC– http://www.gdfr.<strong>in</strong>fo• FOCUS: Format Curation Service at <strong>the</strong> University <strong>of</strong>Maryland– Ma<strong>in</strong> component is Format Identifier (Fider)– Registry Global <strong>Digital</strong> Format Registry (GDFR)implemented us<strong>in</strong>g LDAP– https://wiki.umiacs.umd.edu/adapt/<strong>in</strong>dex.php/Focus:Ma<strong>in</strong>28


Related Work (2)• PERPOS: Presidential Electronic <strong>Records</strong> Pilot System at GeorgiaTechnology University– s<strong>of</strong>tware tools to support <strong>the</strong> <strong>OAIS</strong> functionalities– http://perpos.gtri.gatech.edu• Metadata Extract Tool from National <strong>Library</strong> <strong>of</strong> New Zealand– http://www.natlib.govt.nz/services/get-advice/digital-libraries/metadata-extraction-tooltool• AIHT: Automated Preservation Assessment <strong>of</strong> Heterogeneous <strong>Digital</strong>Collections bys Stanford University• The University <strong>of</strong> London Computer Centre issued a report to compareDROID, JHOVE, and AIHT:– Assessment <strong>of</strong> File Format Test<strong>in</strong>g Tool.http://www.ulcc.ac.uk/uploads/media/DAAT_file_format_tools_report.pdf.– Very good qualitative and functional analysis• O<strong>the</strong>r Projects?29


Summary• File Identification and Validation important step <strong>in</strong>Ingest Process.• Performance study <strong>of</strong> Jhove and DROID.• Optimal approach leverag<strong>in</strong>g both tools.• Monitor future progress <strong>of</strong> o<strong>the</strong>r tools.• Look<strong>in</strong>g forward to Jhove 2.30


Thank Youhttp://www.archives.gov/eraFor any comments or questions, please mailto:quyen.nguyen@nara.gov31

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!