Data-intensive science consists of three basic activities: capture, curation, and analysis.
We must think carefully about which data should be able to live forever and what additional metadata should be captured to make this feasible.
Jim Gray’s recipe for designing a database for a given discipline is that it must be able to answer the 20 key questions that the scientist wants to ask of it.
A decade ago, rereading the data was just barely feasible. In 2010, disks are 1,000 times larger, yet disk record access time has improved by only a factor of two.
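The arithmetic behind that comparison can be sketched in a few lines. This is only a rough illustration: it assumes the time to reread a full disk record by record scales with capacity divided by access speed, and the function name `relative_rescan_time` is hypothetical. The two growth factors are the ones quoted above.

```python
def relative_rescan_time(capacity_growth: float, access_speedup: float) -> float:
    """Return the factor by which a full record-by-record reread slows down.

    Assumption (not from the source): the number of records grows in
    proportion to capacity, and each record costs one access.
    """
    if capacity_growth <= 0 or access_speedup <= 0:
        raise ValueError("growth factors must be positive")
    return capacity_growth / access_speedup


# Factors quoted in the text: disks are 1,000 times larger,
# while record access time has improved by only a factor of two.
CAPACITY_GROWTH = 1_000
ACCESS_SPEEDUP = 2

slowdown = relative_rescan_time(CAPACITY_GROWTH, ACCESS_SPEEDUP)
print(f"Rereading a full disk takes roughly {slowdown:.0f}x longer than a decade ago")
```

Under these assumptions, a reread that was just barely feasible a decade ago now takes on the order of 500 times longer, which is why rereading whole datasets stops being a practical option.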