Some years ago (around 2010) I decided to start archiving my documents (invoices, letters, certificates, manuals, and much more) in electronic form. At first this was mostly for convenience, but by now the electronic copies also serve as the main backup of the paper versions.
In this post I describe my approach and the thoughts behind some of the decisions.
Goals
Before going into the details, here are the goals or requirements related to archiving my documents.
- Use open-source tools and open standards for reading, writing, handling, managing, and processing the documents. This applies both to each single document file and to the document archive itself.
  I think this helps to make sure that the archive stays readable for a long time and that I am not locked in to proprietary tools or file formats which might no longer be supported in the future.
- Documents should be stored as simple files.
  If the data structure of the archiving software breaks down, the individual files shall still be available as files.
- I want to prevent modification and deletion of documents.
  It is the purpose of an archive to preserve the state of a document. It should mostly be write-once and then read-only.
- It should be easy to back up the whole archive by copying or synchronizing it to another computer.
  I keep a backup of the archive on at least one other computer. Regularly copying the whole archive takes too much time, so there must be a way to easily synchronize changes to the backup system.
- I would like to be able to trace any change or modification to archive files, i.e. get a log of modifications.
File Format
Almost all documents are stored as PDF files. If I receive PDF files, e.g. from a company by mail, I directly archive them as is. If it is a paper document, I scan it and then convert it to PDF, trying to comply with PDF/A, with one embedded image per page, usually in JPEG format.
PDF/A is a restricted PDF variant specifically intended for archiving. It does not allow, for example, interactive features (embedded JavaScript), references to external resources (often used for fonts, etc.), or forms, so it is very likely that the document can still be read without issues in the future.
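The conversion step can look something like the following; this is a sketch using img2pdf and Ghostscript rather than my exact recipe, and the flags are illustrative:

```sh
# Wrap the scanned JPEG pages into a PDF without re-encoding the images
img2pdf page1.jpg page2.jpg -o document.pdf

# Approximate PDF/A conversion with Ghostscript
gs -dPDFA=2 -dBATCH -dNOPAUSE \
   -sDEVICE=pdfwrite \
   -dPDFACompatibilityPolicy=1 \
   -sColorConversionStrategy=RGB \
   -sOutputFile=document-pdfa.pdf document.pdf
```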
Currently I do not use OCR to make the PDF contents searchable. This is on my todo list, but not a high priority yet.
File Naming
I try to restrict filenames to ASCII characters, as far as possible the subset `a-z A-Z 0-9 - _`, so that I do not run into any trouble with character-set encoding issues or tools failing to handle special characters.
Almost all filenames start with a date in ISO format, i.e. `yyyy-mm-dd`, so they can easily be sorted chronologically.
The next part of the filename is usually the creator, e.g. the company creating or sending the document. After that follows some description of the content.
Here's a simple example:
`2019-07-03 Amazon AWS Invoice.pdf`
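A quick way to spot files that stray from the convention; this is a hypothetical check, not part of my actual workflow (the allowed set here also includes `.` and the space used in the example above):

```sh
# List files whose names contain characters outside the allowed set
LC_ALL=C find . -name '*[!a-zA-Z0-9._ -]*'
```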
Directory Structure
This is the directory tree for sorting the documents. The top-level directories are the main categories of documents, like education, finances, insurance, invoices, identdata (scans of passports, ID cards, driver's license, etc.), work, etc.
Below each of these top-level folders I have further, more or less complex directory structures. But over time I realized that the more complex the structure gets, the more difficult it is to find the right place to store a document.
In any case, I usually search documents using `find` and `grep` on the command line. So now I try to keep the directory structure as simple as possible.
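Typical searches look something like this (the paths and keywords are made up for illustration):

```sh
# Find all Amazon invoices from July 2019 by filename
find . -iname '2019-07*amazon*invoice*'

# Search all file and directory names for a keyword
find . | grep -i insurance
```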
Scanner Hardware
The scanners are not specifically related to archiving documents. It's just the hardware I currently have and use.
The hardware for scanning is quite basic. I use an old Canon Lide 110 flatbed scanner and a Brother ADW-1500 document scanner. Both are supported well under Linux (currently Xubuntu 19.10).
The Brother scanner has integrated WiFi and a small touchscreen display for configuration. But it seems the humid climate in Shanghai damaged the touchscreen, so now I can only use it connected via USB.
Software for Scanning
For a long time, I experimented with using the `sane` command-line tools for scanning documents. From time to time I re-created or reworked some shell scripts to improve the scanning process, adjust image processing with ImageMagick, and handle the conversion to PDF. But I never reached a state where I had a really well-working script. At some point I was just annoyed that for almost every document to scan, I first had to do some shell scripting and some retries until the document was successfully scanned.
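For reference, the core of such a script looked roughly like this; the device name and processing parameters below are placeholders, not my actual values:

```sh
#!/bin/sh
# Scan a single page (device name and settings are placeholders)
scanimage -d 'brother4:bus1;dev1' --resolution 300 \
          --mode Gray --format=tiff > page.tiff

# Clean up the scan with ImageMagick and save as JPEG
convert page.tiff -level 10%,90% -quality 85 page.jpg

# Wrap the image into a PDF
img2pdf page.jpg -o document.pdf
```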
So I had a look at other available tools and found a really simple one: `simple-scan` is a very basic GUI-based scanning application. It scans pages, adjusts brightness and contrast, uses either color or grayscale mode, and saves the result as a PDF or a set of JPEG files.
Archiving
Over the last years, the system to archive documents has changed a bit. The first approach, used for many years, was based on `encfs`. Some time ago I changed to `git`.
EncFS
`encfs` is a FUSE (filesystem in userspace) filesystem driver which encrypts a directory. This means your data is stored in a directory structure with encrypted files; even file and directory names are encrypted. When mounting the filesystem, you enter a password and the data becomes accessible, decrypted, in another directory. Any modification of this data is immediately encrypted and reflected in the encrypted directory tree.
When my computer is running, the data is not mounted, so I cannot access any files in the archive. I wrote two simple scripts called `encm` and `encu` to mount and unmount the archive data.
So during normal operation, archive data is not easily accessible. When I need access, I mount the filesystem with `encm`, access or modify the files, and later run `encu` to unmount it again.
I back up the encrypted files to another computer by running `rsync`.
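Such a backup run looks roughly like this; the host name and paths are placeholders:

```sh
# Mirror the encrypted tree to the backup machine; note that without
# --delete, renamed or removed files linger in the backup
rsync -av ~/.archive.enc/ backup-host:archive.enc/
```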
This setup worked fine for me for many years. Still, roughly one year ago I stopped using it, because it has a few disadvantages:
- It seems that `encfs` is no longer maintained and has a few security issues.
- Backup is difficult. I do not want to use the `--delete` flag with `rsync`. This means that whenever I rename or remove a file, the old file stays in the backup. So over time the backup became out of sync with the main data and I could not easily resolve this anymore.
- I'm not protected against modification of files and I cannot trace changes at all.
Git
After I extracted all files from the encrypted archive, I kept them as a normal directory structure on my computer for some time and started searching for another solution. And along came `git`.
I had already used `git` for many years for software development, but never considered it for handling my document archive. But it seems a quite good fit, as almost all my requirements are fulfilled.
I have a working copy of the archive on my computer, so all the files are available as plain files. If something goes wrong with the git data structures, at least the files themselves are still accessible.
When I commit files, the committed version cannot be changed anymore and I can even go back to any older version of every file. I can also trace all modifications of the archive in the `git log`.
Creating backups is fun and easy, as I now just have to push the repository to another computer. All changes are automatically synchronized and I no longer have any hassle with duplicate files, etc.
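The whole setup boils down to a few standard git commands; the host name and paths here are placeholders:

```sh
# One-time setup: a bare repository on the backup machine as push target
ssh backup-host git init --bare archive.git

# In the local archive directory
git init
git remote add backup backup-host:archive.git

# Archiving a new document
git add "invoices/2019-07-03 Amazon AWS Invoice.pdf"
git commit -m "Add AWS invoice 2019-07"
git push backup master

# Trace the full history of a single file later
git log --follow -- "invoices/2019-07-03 Amazon AWS Invoice.pdf"
```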
As I run a Gitea server, I even have a nice web interface for browsing the archive.
So far, I'm really happy with this solution.
Statistics
Currently the archive has ca. 2500 files. Its size (including the `.git` folder) is 4.1 GB.
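Numbers like these are easy to obtain from the command line (run from the archive root):

```sh
# Count the archived files, excluding git's own data
find . -type f -not -path './.git/*' | wc -l

# Total size including the .git folder
du -sh .
```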