Cassette tapes are the future of big data storage

mistercow · on Oct 23, 2012

This is sort of like saying that RAM will be replaced by HDDs because HDDs are so much bigger. It's comparing apples to oranges. It's fascinating that tape drives are making a comeback, but to suggest that they can replace HDDs is just a fantasy

This quote:

The downside of tapes is that they are slower to access than hard discs because they have to be fetched by a robotic mechanism, inserted in a reader and spooled to the right point. But the Linear Tape File System, which is being developed, expedites this process to make it comparable to disc drives, Eleftheriou says.

indicates that the reporter did not understand what he was being told. LTFS offers an abstraction layer so that software can treat a tape drive like a hard drive. That makes programming easier, but it will not have any effect on performance. A file system cannot, after all, magically bestow random access unto a sequential access medium.

So if tape is to make a comeback in storing data for, say, the web (as implied by the article's opening paragraph, then it's still going to have a HDD based system in front of it as a cache. It would be an interesting future where viewing a post that hadn't been accessed in a few years took several minutes as you waited for that old data to be recached from tape.

astine · on Oct 23, 2012

> LTFS offers an abstraction layer so that software can treat a tape drive like a hard drive. That makes programming easier, but it will not have any effect on performance.

Ugh. You're more right than you know. This is a misconception that I sometimes have to deal with at my work. Tape systems are not harddrive systems and can't be treated as such even if you use LTFS or an HSM. The problem is that even though you can delete things by removing them from the (disk based) inode table, you can't actually remove them from tape or overwrite them without a special effort, which means that you can't simply reclaim tape space.

Being able to access a tapes with the standard API used for HDDs is convenient but you can't lose awareness of the underlying system or you'll screw things up mightily.

acdha · on Oct 23, 2012

I have severe misgivings about HSMs for this reason: too often, they're simply making an API promise which can't be kept. The vendors have an incentive to low-ball the expensive spinning disk component of an HSM to maximize the apparent cost savings but without extensive application suppor that will be a disaster.

astine · on Oct 23, 2012

Think of HSM as a more convenient alternative to a traditional tape system, rather than a cheaper alternative to a disk system. I think the benefits outweight the costs, but only if you understand the limitations. Have a guy on staff who understands tape and near-line storage and you should be in a good position.

Also, stay away from DiskXttender. I've used that product and it's terrible.

acdha · on Oct 24, 2012

The main problem is that you need to add proprietary HSM interfaces to every application which uses much data. I'd be happier if that was standardized.

brudgers · on Oct 23, 2012

If tape is cheap enough, multiple tapes could store the same data in a different sequences, e.g. beginning at 0%, 25%, 50%, and 75% of the run length. Initial access could then be quicker, and predictive caching used for the remaining data on the tape.

astine · on Oct 23, 2012

It would still be too slow simulate a HDD system.

You generally have two bottlenecks in tape seek performance. The first is the fetch from the robot, which can take seconds. The other is spindle rate on the tape drive itself which is limited by the durability of the tape. Also, in order to make this work, you'd need many tape drives, with many robots accessing the same racks of tapes. There's a physical limit to the number of drives and robots that can generally fit in a cage and have access to the same tapes so you'll have a lot of trouble pulling this off.

All this and you still haven't addressed the difficulties involved in overwriting tape. Nor does it address the normal redundancy you already need to deal with corrupted and broken tapes, or the fact that you'll need to expand your setup even further if you want multiple users of the filesystem at once.

Current systems already do caching and concurrent reads off multiple tapes. Tape is still not a drop in replacement for disk. Any application you write for it has to have it's limitations in mind.

johngalt · on Oct 23, 2012

If screw drivers are cheaper than hammers, we can just tie a bunch of screw drivers together and use them to pound in nails.

Considering the seek time of the robot arm. Striping data across four tapes just means you need to wait for the robot to grab four tapes.

mistercow · on Oct 23, 2012

I think brudgers was suggesting something akin to RAID 1 (mirroring, but no striping), which would, for sufficiently random access patterns, improve read performance roughly linearly as you added tape drives. The streaming speed of modern tape drives is comparable to hard drives, so striping isn't that useful anyway (and one a single tape with tapes switched out, you'd lose the physical parallelism that makes striping useful). But seek time is what really kills tape drives, and that would benefit from mirroring (like RAID 0), which is what brudgers was suggesting.

Although brudgers talked about writing at different offsets, that's actually pointless if you don't rewind each tape between accesses (and why would you?). Assuming random access patterns, your read speed on the array will, on average, increase linearly with the number of tapes. This will come at the cost of write latency, the magnitude of the cost depending on how you implement the array.

If the array has independent drives (which would be very expensive, but let's assume for the moment), then your write latency is always dependent on the longest seek (you have to wait for the write to occur on every tape). That means that as the number of tapes increases, your average writing latency will approach the worst case for a single tape. In other words, with an infinite number of full drives, you would always have to wait for one of the drives to seek across an entire tape. That's pretty bad news, but it's acceptable for some cases.

On the other hand, if you had one drive and a tape library, your write times would really be completely unacceptable. Your linear read seek speed increase would come at the cost of a linear write seek time increase, plus a linear increase in writing throughput time.

SnorkelTan · on Oct 23, 2012

The important thing to determine that isn't specified (but implied) in the article is whether or not random access is a hard requirement. If yes then it would probably be cheaper to go with a disk-based solution. From the sounds of the article the radio telescope array they mention faces largely archival problems and doesn't necessarily need random access to the data.

mistercow · on Oct 23, 2012

In other words, a tape drive RAID. But since you can use related techniques to improve performance with hard drives, I don't think that will make up much ground for tape drives.

brudgers · on Oct 23, 2012

There are two separate goals: improving the speed with which data may be accessed on a tape and improving that speed to the point where it will compete with hard disk drives - or to drive home how important it is to keep these goals distinct, solid state drives.

I'm not saying that future tape technologies will be as fast as the hard drives of the equivalent future - linear access has inherent limitations. However, there are relatively simple ways in which data access could be improved and such improvements (and some tuning) could make tape adequate for some applications.

bduerst · on Oct 23, 2012

Yes, but the point was to get read time with tapes up to that of HDDs.

mistercow · on Oct 23, 2012

Which a RAID style system probably won't achieve. HDDs start ahead of tape drives in performance. So it's reasonable to assume that HDD + RAID is also going to perform better than tape drive + RAID.

bduerst · on Oct 23, 2012

They weren't comparing RAID tapes against RAID HDDs, just RAID Tapes against normal HDD

jdbernard · on Oct 23, 2012

No, the discussion was whether they would replace HDDs. From the initial comment:

> It's fascinating that tape drives are making a comeback, but to suggest that they can replace HDDs is just a fantasy

Tape + RAID vs. HDD with no optimizations is a useless comparison. Why would anyone even care about that?

bduerst · on Oct 23, 2012

Because brudgers talking about speed, specifically?

mistercow · on Oct 24, 2012

But it still doesn't make sense to compare HDDs without optimization to tape drives with optimization. Data centers are going to use every affordable optimization at their disposal, so if we want to know "will tape drives be competitive against HDDs" we can't handicap HDDs by comparing mirrored arrays of tape drives to straight, un-optimized HDDs.

With that said, I outlined in another comment[1] why a mirrored array of tapes would have pretty huge drawbacks regardless. Even if you have an expensive array of tape drives (as opposed to one drive with many tapes), a mirrored array will drag your write seek performance toward a constant worst case as the array grows. The same is true of RAID 1 for HDDs, of course, but the worst case seek time of an HDD is orders of magnitude less than that of a tape drive, so HDDs still win.

In any case, if you look at the details of the situation, the answer you come to is pretty boring: tape drives can compete with HDDs in a small slice of real world cases (often when used in conjunction with HDDs), and higher density tape drives will slightly widen that slice.

If you have a situation where you are writing a ton of data, but almost never reading or erasing, then high density tapes will play well to that situation (sans mirroring).

If you have a situation where you are writing a relatively small amount of data over a long period of time, reading it back comparatively frequently, and never erasing anything so that your total storage needs become very large, then maybe a mirrored array of tape drives would make sense, if it were behind a sizable HDD array. But it's a stretch.

[1]http://news.ycombinator.com/item?id=4690877

brudgers · on Oct 23, 2012

I'm not sure whose point that was. As is typical, the point is to get tape performance to the point where it is an alternative to HDD. As the article mentions, power consumption may be a viable consideration and at some point it may offset an idealized version of access speed.

When the data set is so large that maintaining it all online at once is impractical and the nature of the data makes continuous and immediate access unnecessary, then tape may solve the problem.

stephengillie · on Oct 23, 2012

Researchers at Fuji Film in Japan and IBM in Zurich, Switzerland, have already built prototypes that can store 35 terabytes of data - or about 35 million books' worth of information - on a cartridge that measures just 10cm x 10cm x 2cm. This is achieved using magnetic tape coated in particles of barium ferrite.

Where does the gain in storage density come from, New Scientist? This article is so breathless to capture the linkbait of using old technology that just about any useful info has been left out. The prototype, at about 4" x 4" x 0.75", is just larger than a desktop HDD, and can hold about 8.5 times as much data as a 4TB 3.5" HDD.

Does barium ferrite allow for tighter magnetic fields, which would allow for a higher data density? If so, is that material being used in spinning disk HDDs? And how does this extremely physically complex system stack up against SSDs, as they lower in cost?

johngalt · on Oct 23, 2012

Tape has always been denser geometrically than HDDs. More surface area on a reel vs a platter.

SSDs would be the opposite of tape. Big$/GB low latency vs. Low$/GB high latency. Obviously SSDs would replace almost all other media if it could beat tape on $/GB.

stephengillie · on Oct 23, 2012

When you consider the $/GB, do you consider the cost of operating these as mechanical devices - mechanical wear and fatigue, increased electricity usage, administrative costs (tape storage, space for robot to work, robot repairs, etc), and possibility of damage (dropped tapes)?

johngalt · on Oct 23, 2012

Yes of course those are considered and the results may surprise you. One of the beautiful things about tapes is that you can store them unpowered. All you power is the robot(s) and the drive(s). 100 tapes sitting idle in a silo will last much longer than 100 hard drives spinning at 7200rpm. SSD's failure rates are even higher than HDDs.

ibmzrl · on Oct 24, 2012

Here is more detail on the demonstration with Fuji. http://www.zurich.ibm.com/news/10/storage.html

And here is a scientific paper on Scaling tape- recording areal densities to 100 Gb/in2 http://signallake.com/innovation/argumedo.pdf

arb99 · on Oct 23, 2012

"Current projections by the trade body Information Storage Industry Consortium show that although hard drives will be able to store 3 terabytes a piece in a decade's time, that still amounts to at least 120,000 drives a year."

huh?

aristidb · on Oct 23, 2012

I think that's for each individual platter, of which a modern 3 TB hard drive has typically 3 or 4.

arb99 · on Oct 23, 2012

think its a typo tbh

hollerith · on Oct 23, 2012

"3 terabytes" should be "3 petabytes".

ADDED. Withdrawn: see child comment.

schiffern · on Oct 23, 2012

That would be 120 drives/year, not 120,000.

hollerith · on Oct 23, 2012

Good point. I withdraw grandparent.

bitwize · on Oct 23, 2012

True, but cassette tapes are also the past and present of big data storage. If you have gobs of data you need to stash somewhere, tape is still massively more cost-effective than disk, and this is no secret.

zupreme · on Oct 23, 2012

I shudder to think of all the guys who will use this article as ammunition the next time their wives ask them to toss out the old cassette collection once and for all.

Seriously, though, I think that this is great news. If this really gains any traction this could represent a resurgence for some Japanese companies which hold patents on cassette tape technology. After the recent Fukushima debacle, they could certainly use it.

jlarocco · on Oct 23, 2012

Off on a tangent, but does anybody know why they store the data from the radio telescope array for such a long time?

Wouldn't it be more efficient to analyze it, save the interesting data, and get rid of the rest?

schiffern · on Oct 23, 2012

Lots of discoveries in astronomy are now made by re-analyzing data from previous observations.

The data storage costs are a pittance compared to the cost of telescope time. By saving the data you're getting the most out of each observation.

bithive123 · on Oct 23, 2012

I think you answered your own question. In many cases finding out which data is interesting _is_ the analysis.

acdha · on Oct 23, 2012

Science is all about reproducibility: if you discard the original data and someone raises a question about your processing techniques, you face the prospect of having to completely retract everything which depends on that data. Entire careers are potentially at risk and the failure mode might not even be “spend a year or two repeating an experiment” (i.e. falling behind everyone else in your field) but “hope desperately that some rare event occurs again”.

There's also a long history of data being mined later for things which the original experimenters didn't even think to collect - e.g. http://www.nasa.gov/mission_pages/hubble/science/elusive-pla... - or types of analysis which was considered too expensive to perform in the past (e.g. biology went from a pre- to post- informatics era when the unit of work went from “single grad student” to “many-thousand node cluster”). Given how expensive the data was to collect, it almost always makes sense to see if it will lead to any other value.

andrewcooke · on Oct 23, 2012

am not sure re-analysis is that important myself, but it looks like the SKA will be used in a pretty traditional way, with groups of astronomers applying to do particular observations. that means that the processing of data will likely be different for each group, and will probably take place over time (some poor grad student will likely spend most of their thesis - several years - on it).

the other approach - taken with something like the SDSS - is to make a telescope for just one task (typically a survey). then you can process the data and throw it away. i am pretty sure that is what SDSS did, and what, say, the LSST will probably do too (if it ever gets finished).

the advantage of the first approach is that you are much more flexible, which means more likelihood of making a big discovery (particularly when, as with this telescope, collecting area is larger than ever before, which means that you can see fainter and further back in time, making it ideal to study isolated, unusual objects - in contrast, survey telescopes make different technical compromises so that they cover a wider field of view), and also more chances to make and exploit upgrades over time. the downside is dealing with issues like this (another issue is data transport - traditionally you go to a telescope and then take the data home with you; i imagine we're now getting to the point where instead you will process data local to the data storage).

disclaimer: i was an optical astronomer, not radio, and that was years ago, so this may already be old news / incorrect in details. but the general idea should be ok.

http://en.wikipedia.org/wiki/Square_Kilometre_Array http://www.sdss.org/ http://www.lsst.org/lsst/

ps often telescopes do make data public after a certain time. but the idea is not so much to allow reanalysis as to make sure the people who originally took the data reduce and publish it. it's easy to postpone that kind of work, but the idea that someone else might do it and publish first is quite a motivator.

schiffern · on Oct 26, 2012

SDSS raw data is available online: http://www.sdss.org/dr7/algorithms/dataProcessing.html

Evbn · on Oct 23, 2012

Hard drives use more power because they are usually turned on for use, while tape drives use less because they are usually turned off and not used?

Thanks for the insight, New Scientist.

The most interesting item in that story was the claim (probably incorrect, based on the article's overall sloppiness) that LTFS will be "comparable" in speed to hard drive access.