Thinking about how your master dataset is stored and managed is arguably the most important part of architecting a data-driven system. Your master dataset is your source of truth. An issue here, be it hardware failure, human error, or something else, could easily result in irreparable damage to some or all of your data.
I would suggest there is only one hard requirement for your master datastore: Be correct. Always.
Making your master datastore immutable, i.e. append-only, really helps with this.
I’m sure anyone who has worked with databases for long enough has accidentally deleted data, or performed an incorrect update, losing some data forever. I certainly have - more than a few times.
With an immutable datastore, you can’t lose data to a bad update or delete, because existing records are never changed or removed. A mistake might cause bad data to be written, but the good data is still there, and you can always handle the bad data at load time.
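To make that concrete, here is a minimal sketch of the idea, not a production datastore. The `AppendOnlyStore` class, the JSON-lines file and the `is_bad` filter are all names I have made up for illustration. Writes only ever add to the end of the file, and bad records are skipped when reading rather than removed.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Hypothetical append-only store backed by a JSON-lines file."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # "a" mode only ever adds to the end of the file; this class
        # deliberately offers no update or delete method.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def load(self, is_bad=lambda record: False):
        # Bad records stay on disk; they are simply filtered out at load time.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        return [r for r in records if not is_bad(r)]


store = AppendOnlyStore(os.path.join(tempfile.mkdtemp(), "master.jsonl"))
store.append({"user": "alice", "balance": 100})
store.append({"user": "alice", "balance": -999999})  # written by a buggy job
store.append({"user": "alice", "balance": 120})

everything = store.load()                              # all three records
good = store.load(is_bad=lambda r: r["balance"] < 0)   # bad record skipped
```

The buggy write is still in the file, which is the point: nothing was overwritten to make room for it, and the rule for what counts as bad can be changed later without touching the stored data.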
No matter how careful you are, people will make mistakes, and you must do what you can to limit their impact.
Even if you are the most careful person in the world, and you are the only person allowed to interact with the data, you probably still write code, and occasionally that code will have bugs.
Again, if you have code that does a bad update or delete, you’ve lost your data. For good.
What about backups?
Good, regular backups of your master datastore are essential. But backups are always a little out of date, and any problems with your data will soon make their way into the backup. So if you’re relying on backups, they had better be really good, and you had better notice any mistakes really quickly. Even that is not a complete guarantee.
Disadvantages of immutable datastores
Of course, there are disadvantages to immutable datastores. You might be storing data over and over again that you no longer care about (although I would argue the history of a record will have some value - you’re just not using it yet). Encoding and/or compressing your data can help with this. Using relatively cheap datastores (HDFS, S3, etc.) may also make this less of a concern, though make sure they are reliable.
A bigger concern might be having to apply all the updates and deletes at view time, which is probably going to be slow. In that case, maybe you don’t use this datastore for time-critical views. Maybe you have another dataset, in a mutable datastore, that is used for these views. A “hot store”, if you like, optimised for these views.
It’s OK for this hot store to lose data, through bugs, human error or hardware failure, because you know you can always regenerate it from your immutable, always correct, master datastore.
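As a rough sketch of that regeneration step, assume the master datastore holds an ordered log of events, each with an `op` of `set` or `delete`. That event shape and the `rebuild_hot_store` function are my own illustrative assumptions, and the hot store here is just an in-memory dict.

```python
def rebuild_hot_store(events):
    """Replay the append-only event log, in order, into a mutable dict."""
    hot = {}
    for event in events:
        if event["op"] == "set":
            # A later "set" for the same key overwrites the earlier value.
            hot[event["key"]] = event["value"]
        elif event["op"] == "delete":
            # The delete only affects the view; the log keeps every event.
            hot.pop(event["key"], None)
    return hot


# Hypothetical master log: nothing is ever updated or removed here.
master_log = [
    {"op": "set", "key": "alice", "value": {"email": "alice@example.com"}},
    {"op": "set", "key": "bob", "value": {"email": "bob@example.com"}},
    {"op": "set", "key": "alice", "value": {"email": "alice@new.example.com"}},
    {"op": "delete", "key": "bob"},
]

hot_store = rebuild_hot_store(master_log)

hot_store.clear()                          # simulate losing the hot store
hot_store = rebuild_hot_store(master_log)  # regenerate it from the master log
```

Replaying the full log gets slower as the log grows, which is exactly why you would keep the result in a hot store for fast views rather than recompute it on every read.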