Saturday, November 26, 2005

Building a Massive Single Volume Storage Solution?

An anonymous reader asks: "I've been asked to build a massive storage solution that scales from an initial threshold of 25TB to 1PB, primarily on commodity hardware and software. Based on my past experience and research, the commercial offerings for such a solution become cost prohibitive, and the budget is fairly small. Some of the technologies I've been scoping out are iSCSI, AoE, and plain clustered/grid computers with JBOD (just a bunch of disks). Personally, I'm more inclined toward a grid cluster with a 1Gb interface, where each node has about 1-2TB of disk space and is based on a low-power-consumption architecture. The next issue to tackle is finding a file system that can span all the nodes yet appear as a single volume to the application servers. At this point data redundancy is not a priority; however, it will have to be addressed. My research has not yielded any viable open source alternative (unless Google releases GoogleFS), and I've looked into Lustre, xFS, and PVFS. There are some interesting commercial products, such as the File Director from NeoPath Networks and a few others; however, the cost is astronomical. I would like to know if any Slashdot readers have experience building out such a solution? Any help/ideas would be greatly appreciated!"
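A quick back-of-the-envelope sizing of the grid the poster describes. This is only a sketch; the ~1.5TB-per-node figure is an assumed midpoint of the 1-2TB range given in the question:

```python
import math

# Rough node count for a commodity grid scaling from 25TB to 1PB,
# assuming ~1.5TB of raw disk per node (midpoint of the 1-2TB range).
TB_PER_NODE = 1.5

def nodes_needed(capacity_tb, tb_per_node=TB_PER_NODE):
    """Nodes required to hold capacity_tb of raw (unreplicated) storage."""
    return math.ceil(capacity_tb / tb_per_node)

print(nodes_needed(25))    # initial threshold: 17 nodes
print(nodes_needed(1000))  # 1PB target: 667 nodes
```

Several hundred nodes at the 1PB end is why the power, cooling, and failure-rate comments below matter as much as the filesystem choice.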
gmail (Score:4, Funny) by Adult film producer on Tuesday October 25, @03:24PM
Register a few thousand Gmail accounts and write the interface that will make writing data to Gmail inboxes invisible to the app.

Re:gmail (Score:5, Funny) by Stuart Gibson on Tuesday October 25, @03:44PM
That would have been my second answer. The first, and presumably the reason this was posted to /., is simple... Imagine a Beowulf cluster... Stuart

GFS? (Score:5, Informative) by fifirebel on Tuesday October 25, @03:24PM
Have you checked out GFS [redhat.com] from Red Hat (formerly Sistina)?
Re:Oracle, also (Score:4, Funny) by menkhaura on Tuesday October 25, @07:20PM
What on Earth kind of data do they anticipate will take a petabyte of contiguous storage? I know. They don't know I know, but I do. It's data gathered by the black helicopters, by Echelon, by Carnivore, by our very own printers, by RFID, about every movement of every single one of us... *They* do it. They.

Apple Xserve?
by mozumder (Score:3) Tuesday October 25, @03:24PM

Re:Apple Xserve? (Score:4, Informative) by Jeff DeMaagd on Tuesday October 25, @03:29PM
Apple Xserve may be the cheapest storage of that kind, but it probably doesn't fit the original idea of commodity hardware. Scaling to petabytes means spanning storage across multiple systems.

Re:Apple Xserve? (Score:4, Informative) by stang7423 on Tuesday October 25, @03:55PM
Apple has a solution for this. Xsan [apple.com] is a distributed filesystem based on ADIC's StorNext filesystem. Apple states on that page that it will scale into the range of petabytes.

Re:Apple Xserve? (Score:5, Interesting) by medazinol on Tuesday October 25, @03:30PM
My first thought as well. However, he is asking for a single-volume solution, so Xsan from Apple would have to be implemented. Good thing it's compatible with ADIC's solution for cross-platform support. Probably the least expensive option overall and the simplest to implement. Don't take my word for it, go look for yourself.

Re:Apple Xserve?
(Score:4, Informative) by TRRosen on Tuesday October 25, @04:19PM
To do this would cost around $50,000 with Xserve RAIDs and Xsan... $2000/TB is probably the best price you're going to get. You could do this with generic hardware, but the cost of assembly, the extra room, extra power consumption, and the maintenance and engineering costs will certainly wipe out what you might save. The Xserve RAID solution could be up in a day and fit in one (actually half a) rack. I do remember some college building a nearline backup storage system using 1U servers with 2 or 3 RAID cards each, connected to something like 12 drives per machine in homemade brackets, but it was hardly ideal. It did work, though. Anybody remember where that was?

How about a PetaBox? (Score:5, Interesting) by McSpew on Tuesday October 25, @04:44PM
The folks at the Internet Archive [archive.org] have already done the hard work of figuring out how to create a petabyte storage system [archive.org] using commodity hardware. The system works so well they started a company to sell PetaBoxes [capricorn-tech.com] to others. Why reinvent the wheel?

Andrew File System (Score:5, Informative) by mroch on Tuesday October 25, @03:25PM
Check out AFS [openafs.org].
AFS Rocks - Now stop (Score:5, Insightful) by sirket on Tuesday October 25, @05:01PM
Stop what you are doing right now. If your architecture requires you to have one huge volume, then you have architected things wrong. Imagine trying to fsck this damned thing! What about file system corruption? What the hell are you going to do when you lose a petabyte of data because of some file system corruption? Small, sensible, easily managed partitions are the way to go. Use a database to track where given files are stored. Do something that makes sense. I have a client now who just lost a bunch of data because they used a system like this. Having said all this: if you are still intent on finding a good file system, then use AFS. It's probably your best free solution.
If you want to sleep at night, call EMC. -sirket

PetaBox (Score:4, Informative) by Anonymous Coward on Tuesday October 25, @03:26PM
How about the PetaBox [capricorn-tech.com], used by the Internet Archive [archive.org]?

Re:PetaBox (Score:5, Funny) by sycodon on Tuesday October 25, @03:45PM
Just don't call it PetaFile.

Re:PetaBox (Score:4, Informative) by MikeFM on Tuesday October 25, @03:52PM
I priced one of those and decided I'd have to work my way up to that kind of toy. Instead I started with Buffalo's TeraStations [buffalotech.co.uk], which are affordable and have built-in RAID support. You can mount them in Linux and use LVM to span a single filesystem across several of them, or just mount them normally, depending on your needs. $1-$2 per GB for external RAID storage isn't bad at all.

Go the Easy Route (Score:4, Funny) by Evil W1zard on Tuesday October 25, @03:27PM
I know a certain recently discovered zombie network that collectively had quite a few PBs of storage... Of course I wouldn't recommend going down that road, as it leads to, you know... jail.
Re:Petabox (Score:5, Insightful) by afidel on Tuesday October 25, @03:56PM
This guy is worried about budget, yet even with the "low power" usage of the PetaBox it would still use 50kW for one petabyte of storage! When you combine the cooling for that with the cost of electricity, you are talking some serious money. If you have trouble getting the capital funds for something like this, how are you ever going to pay the operating costs?

GPFS from IBM (Score:5, Informative) by LuckyStarr on Tuesday October 25, @03:29PM
May or may not be what you're looking for.
Quite expensive, but an impressive feature list. http://www-03.ibm.com/servers/eserver/clusters/software/gpfs.html [ibm.com]
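The operating-cost concern in the Re:Petabox comment above is easy to put a number on. A minimal sketch, assuming the 50kW figure from that comment and an illustrative $0.10/kWh utility rate (the rate is an assumption, and cooling would add substantially on top):

```python
# Annual electricity cost of a 50kW storage cluster, before cooling.
# The $0.10/kWh rate is an assumed figure for illustration only.
POWER_KW = 50.0
RATE_PER_KWH = 0.10
HOURS_PER_YEAR = 24 * 365

annual_kwh = POWER_KW * HOURS_PER_YEAR          # 438,000 kWh/year
annual_cost = annual_kwh * RATE_PER_KWH
print(f"{annual_kwh:.0f} kWh/year -> ${annual_cost:,.0f}/year")
```

Tens of thousands of dollars per year in power alone, before cooling, floor space, or replacement drives, which is the commenter's point about capital versus operating costs.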
Wow (Score:5, Funny) by DingerX on Tuesday October 25, @03:33PM
I never thought I'd see the day when sites were boasting a petabyte of porn. That's over 3 million hours of .avis -- if you sat down and watched them end-to-end, you'd have 348 years of "backdoor sliders", "dribblers to short", "pop flies", and "long balls". We live in an enlightened age.

Re:Wow (Score:5, Funny) by spuke4000 on Tuesday October 25, @03:57PM
I'm not really sure I need 348 years of porn. I usually find porn really interesting for the first 3 minutes or so, then for some reason it's not so interesting anymore. But maybe that's just me.

Re:Wow (Score:4, Funny) by rco3 on Tuesday October 25, @04:13PM
Three minutes? You wish! Come to think of it, so do I.
Data redundancy REQUIRED (Score:5, Informative) by cheesedog on Tuesday October 25, @03:34PM
One thing to think about when building such a system from a large number of hard disks is that disks will fail, all the time. The argument is fairly convincing: suppose each disk has an MTBF (mean time between failures) of 500,000 hours. That means the average disk is expected to fail about every 57 years. Sounds good, right? Now suppose you have 1000 disks. How long before the first one fails? Chances are, not 57 years. If you assume the failures are spread out evenly across time, a 1000-disk system will have a failure every 500 hours, or about every 3 weeks! Of course, the failures won't actually be spread out evenly, which makes this even trickier. Some of your disks will be dead on arrival, or fail within the first few hundred hours, while others run for a long time without failure. The failure pattern will likely be bursty: long stretches with few failures, and then a bunch of failures in a short period, seemingly all at once. You absolutely must plan on using some redundancy or erasure coding to store data on such a system. Some of the filesystems you mentioned do this. It allows the system to keep working under X number of failures.
Redundancy/coding lets you plan on scheduled maintenance, where you simply go in and swap out drives that have gone bad after the fact, rather than running around like a chicken with its head cut off every time a drive goes belly up.

Re:Data redundancy REQUIRED (Score:4, Insightful) by OrangeSpyderMan on Tuesday October 25, @03:42PM
Agreed. We have around 50TB of data in one of our datacenters and it's great, but the number of disks that fail when you have to restart the systems (SAN fabric firmware install) is just scary. Even on the system disks of the Wintel servers (around 400), which are DAS, around 10% fail on datacenter powerdowns. That's where you pray that statistics are kind and you have no more failures on any one box than you have hot spares plus tolerance :-) Last time, one server didn't make it back up because of this... though strictly speaking it appears it was actually the PSUs that let go.
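The arithmetic in the "Data redundancy REQUIRED" comment checks out. A minimal sketch using the same figures, under the simplifying assumption of independent failures at a constant rate:

```python
# Expected time to first failure when running many disks in parallel,
# assuming independent failures at a constant rate. With n disks,
# failures arrive n times as often as with one.
MTBF_HOURS = 500_000

def mean_time_to_first_failure(n_disks, mtbf_hours=MTBF_HOURS):
    return mtbf_hours / n_disks

single = mean_time_to_first_failure(1)
print(single / (24 * 365))       # ~57 years for one disk
cluster = mean_time_to_first_failure(1000)
print(cluster, cluster / (24 * 7))  # 500 hours, ~3 weeks for 1000 disks
```

At PetaBox-era densities a 1PB system needs on the order of a thousand spindles, so a failed drive every few weeks is the steady state, not an anomaly.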
That's not MTBF, this is.. (Score:5, Informative) by beldraen on Tuesday October 25, @05:54PM
Just a comment about MTBF. It's often not understood, and that is one of my little pet peeves with tech producers, because they don't try to correct it. MTBF is a reliability rating tied to the warranty period. Say you have a drive rated at 500,000 hours MTBF. Suppose you bought it and ran it at its rated duty (drives are normally rated to run 100% of the time, though many other devices have a duty cycle), ran it until its warranty was up, threw the still-working drive out the window, and replaced it. If you kept up this pattern, then on average about once per 500,000 hours a drive would fail before its warranty period was up. This is why it is important to look not only at the MTBF but also at the warranty period. As a side note: in theory, you should be throwing drives out on a periodic basis. One way around this is to not buy all the same drive type and manufacturer. By having a pool of drive types, you distribute, and thus minimize, the risk of correlated drive failures. Additionally, you may want a standard period for drive replacement, so that your downtime is scheduled rather than all unexpected.
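The warranty-period point above can be made concrete under the same constant-failure-rate assumption: the chance a given drive fails inside its warranty is small, but across a large fleet those failures add up. The 3-year warranty figure here is an assumption for illustration:

```python
import math

# Probability a single drive fails before its warranty is up,
# assuming a constant failure rate (exponential lifetime model).
MTBF_HOURS = 500_000

def p_fail_within(hours, mtbf_hours=MTBF_HOURS):
    return 1.0 - math.exp(-hours / mtbf_hours)

warranty_hours = 3 * 365 * 24       # assumed 3-year warranty
p = p_fail_within(warranty_hours)
print(f"{p:.1%} chance per drive over the warranty period")
```

At roughly 5% per drive, a 1000-drive fleet would expect dozens of in-warranty failures, which is the argument for mixing drive models and scheduling replacements rather than reacting to each failure.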
I just have to ask... (Score:5, Informative) by jcdick1
