Integrity Checking in Digital Preservation: The Essential Basics Guide 🔍
Integrity checking is a crucial step in ensuring the usability and accessibility of your digital archives over long periods. Not only can integrity checking confirm that your content has made the journey into your preservation workflow free of damage, it also lets you continuously identify any later corruption to your files and take action to repair or replace them as needed. Without a way to verify the stability of your files, such as integrity checking, corruption in your collections can easily go undetected.
Actually implementing an integrity checking policy can be challenging; there are several decisions about your tools and processes that have to be carefully considered for their feasibility, efficacy and ease of integration into your existing setup.
This essential guide will help you understand the very basics of integrity checking, and equip you with the knowledge to begin building an effective integrity checking system.
Checksums, checksums, checksums
Checksums are an essential component of an integrity checking system. In very simple terms, a checksum is a value calculated from the raw data of a file, which forms a compact representation of the file's bit-level makeup. When a file is changed in any way, no matter how minor the change or what caused it, the checksum generated for that file will almost certainly change too. By comparing the checksum of a file at different points in time, you can detect any changes or damage to the file. There are several algorithms for generating checksums, including MD5, SHA-1, and SHA-256, each with different strengths and weaknesses.
Dr Matthew Addis produced an excellent paper on behalf of the DPC detailing the process of selecting an appropriate algorithm. I'd recommend exploring it for a much closer analysis of checksum algorithms and the related factors that might affect your decision-making process:
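As a rough illustration of the idea above, here is a minimal Python sketch that calculates a checksum using the standard library's hashlib module. The function name file_checksum and its parameters are assumptions made for this example, not part of any particular preservation tool.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex checksum of a file.

    The file is read in chunks so that large files do not have to fit
    in memory. `algorithm` is any name hashlib accepts, for example
    "md5", "sha1" or "sha256".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_checksum on the same file before and after any change to its bytes, even a single character, should give two different values, and that difference is what an integrity check looks for.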
https://www.dpconline.org/docs/technology-watch-reports/2399-twgn-checksums-addis/file
How?
To generate checksum values for your files, you'll need to get your hands on some appropriate software. It's wise to choose tools based on how smoothly they will fit into your workflow, not just on how powerful or feature-rich they are. The excellent Community Owned digital Preservation Tool Registry (COPTR) project, maintained by the Open Preservation Foundation and the Digital Preservation Coalition, lists a broad set of checksum tools with differing features, algorithms and workflows:
https://coptr.digipres.org/index.php/Fixity
You'll also need a way to record past and future checksum values of files and compare them. If a file's checksum has changed, you can investigate further to determine if the file has been damaged.
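One possible way to record and compare checksums in a DIY workflow is a simple manifest file. The sketch below assumes SHA-256 and a JSON manifest kept outside the collection folder; the function names (build_manifest, save_manifest, load_manifest, changed_files) are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its current checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def save_manifest(manifest, manifest_path):
    """Write the manifest as JSON. Keep it outside the collection folder
    so that it is not checksummed as part of the collection itself."""
    text = json.dumps(manifest, indent=2, sort_keys=True)
    Path(manifest_path).write_text(text, encoding="utf-8")


def load_manifest(manifest_path):
    """Read a manifest previously written by save_manifest."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))


def changed_files(recorded, current):
    """Return paths found in both manifests whose checksums differ."""
    return sorted(
        path
        for path, checksum in recorded.items()
        if path in current and current[path] != checksum
    )
```

A typical run would save a manifest when content is first ingested, then later rebuild the manifest and pass both to changed_files. Any path it returns is a file worth investigating.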
While a DIY workflow is a great option for smaller collections, it can quickly become time-consuming and cumbersome to manage as your collection grows.
When?
Common best-practice guidelines frequently recommend that you perform integrity checks at several critical stages of your preservation workflow:
And for an even more secure workflow, you can perform checks in response to specific events.
Implementing and carrying out an effective integrity checking policy can become considerably more challenging when your target content is spread across different storage environments and stages of your workflow. You'll need to ensure you can carry out your periodic checks consistently, that you keep records to maintain the trail of checksums for each object, and that you thoroughly verify your entire collection to avoid files slipping through the net.
If you're working with a small collection of personal files for your own archives, or only need to preserve a small slice of content, you could maintain a schedule with your calendar. Unfortunately, this can come at a significant cost to the time you have available to manage your collections. Our software platform, Curate, automatically manages integrity checking for you in the background and silently repairs any damage from its redundant storage services. You can try it out right now with our interactive demo.
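To illustrate the point about files slipping through the net, the following Python sketch compares the paths you have checksum records for against what is actually present in a storage location. The function name coverage_report and the shape of its result are assumptions made for this example.

```python
from pathlib import Path


def coverage_report(recorded_paths, root):
    """Compare recorded paths against the files found under root.

    recorded_paths: relative paths (with forward slashes) that already
    have a checksum on record.
    Returns a dict with two sorted lists:
      "untracked": files on disk that have no recorded checksum
      "missing":   recorded files that can no longer be found on disk
    """
    root = Path(root)
    on_disk = {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }
    recorded = set(recorded_paths)
    return {
        "untracked": sorted(on_disk - recorded),
        "missing": sorted(recorded - on_disk),
    }
```

Running a check like this against each storage environment, alongside the checksum comparison itself, is one way to confirm that every file is covered by your policy and that nothing has silently disappeared.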
Monitor and Review
Regardless of the integrity checking system you choose, it's essential to monitor the results of your process to ensure that your files remain intact and accessible. You will also need to make sure your backup and recovery strategy is solid and can provide the resources to repair any compromised files identified by your integrity checking system.
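As a hedged sketch of how a backup copy can be used to repair a compromised file, the Python example below restores a file only when the backup itself still matches the recorded checksum. The function repair_from_backup and its three status strings are hypothetical; a real recovery process will depend on your own storage and backup setup.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repair_from_backup(primary, backup, expected_checksum):
    """Check a file against its recorded checksum and repair it if needed.

    Returns "ok" if the primary copy is intact, "repaired" if it was
    replaced by a verified backup copy, or "unrecoverable" if neither
    copy matches the recorded checksum.
    """
    primary, backup = Path(primary), Path(backup)
    if primary.is_file() and sha256_of(primary) == expected_checksum:
        return "ok"
    # Only trust the backup if it still matches the recorded checksum.
    if backup.is_file() and sha256_of(backup) == expected_checksum:
        primary.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup, primary)
        # Re-verify the restored copy before reporting success.
        if sha256_of(primary) == expected_checksum:
            return "repaired"
    return "unrecoverable"
```

Logging each returned status over time gives you a simple record to monitor, and any "unrecoverable" result is a signal that your backup strategy needs attention.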
Whether you choose to implement a DIY workflow or use a comprehensive platform like Curate, regularly checking the integrity of your files is essential to maintaining their usability and accessibility over time.
For more information on best practices for integrity checking and digital preservation, get in touch with your local Digital Preservation guidance authority or explore the resources available at https://www.dpconline.org/, https://www.nationalarchives.gov.uk/, or https://ndsa.org/.
Curate, Preserve, Protect.
Preservica and the Preservica logo are trademarks of Preservica Ltd.
Artefactual, Archivematica and the ‘a’ logo are trademarks of Artefactual Systems Inc.
Curate™ is a trademark of Penwern Limited.