Massive Data Storage (Tutorials)
Monday, May 6th, 2013
Renaissance Computing Institute
Abstract: In this tutorial, we discuss how to implement end-to-end reliability and resilience into very large storage systems. We start with the basics, including how applications interface to the operating system and local and global namespace semantics and implementation. Then we cover low-level hardware issues, including storage interfaces like SAS and SATA, RAID and parity checking, checksums and ECC, comparing and contrasting different approaches for achieving resilience in current systems. Then we show specific examples from today's large storage systems where low-level bits can change in a way that remains undetected in the system. Finally, we discuss current efforts in standards like T10 for new interface design and new approaches to system design to address these issues at scale.
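As a minimal illustration of the end-to-end checksum idea the tutorial covers (a sketch using a generic CRC32, not any particular system's scheme), a checksum computed when data enters the stack and re-verified at the application layer catches a silent bit flip introduced anywhere along the I/O path:

```python
import zlib

def write_block(data: bytes) -> tuple[bytes, int]:
    """Attach a CRC32 checksum when data enters the storage stack."""
    return data, zlib.crc32(data)

def verify_block(data: bytes, checksum: int) -> bool:
    """Re-verify at the application layer, after every intermediate hop."""
    return zlib.crc32(data) == checksum

block, crc = write_block(b"critical payload")
# A single silent bit flip somewhere along the I/O path...
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

assert verify_block(block, crc)          # intact data passes
assert not verify_block(corrupted, crc)  # the flip is detected end to end
```

Hardware ECC and T10 protection information play the same role at lower layers; the tutorial's point is that only end-to-end verification catches corruption introduced between layers.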
3:30–5:30 Using Solr and Cassandra to Implement Big Data Analytics on the Web (Presentation)
Abstract: In this talk, we discuss the latest releases of Solr and Cassandra and how they can be used to build fast, web-enabled search and analytics software.
Massive Data Storage
Tuesday, May 7th, 2013
The 1000x Rule: How to Design for Scalability at Internet Scale (Presentation)
Fellow, Storage Architecture, Shutterfly
Abstract: Shutterfly has built a nearly 100-petabyte digital photo archive, and plans to scale it aggressively in the future. In his keynote, Justin will outline the techniques, based on his experiences at Shutterfly, Facebook, eBay and PayPal, that he and his team used to accomplish this, and he will describe the general rules for building scalable web storage systems.
Chair: Matthew O'Keefe,
University of Minnesota
Abstract: In July of last year, Intel acquired Whamcloud, the company created by the community to preserve Lustre. In the past three years, the community has come together to ensure Lustre survives, and today there is a vibrant ecosystem of developers, maintainers and vendors offering it. As a result, there has been a regular cadence of feature and maintenance releases and growing momentum among users of the technology. We will discuss the current state and future direction of this work.
Los Alamos National Laboratory
Abstract: The US DOE Office of Science and National Nuclear Security Administration Exascale activities leading to Exascale class computing in the next decade involve a number of initiatives, including the Fast Forward industry technology concepts funding activity. The current Exascale activities and coordination mechanisms for those activities will be explained, including the Fast Forward initiative. The Storage and IO Fast Forward project will also be described, including schedules, project management, and technical aspects.
Optical Media Technical Roadmap: The Revival of Optical Storage (Presentation)
Abstract: Optical storage has been seeing a resurgence in many industry verticals for its unique preservation and environmental qualities. Recent developments have increased capacities and functionality while maintaining decades of backwards compatibility. This is due to the wide range of industries and markets that support this medium.
Optical Library System with Extended Error-Correction Coding for Long-Term Preservation (Presentation)
Abstract: Hitachi has developed an archive system with long-term preservation capability, storing data on optical disks. Extended parity mounting technology improves durability against scratches while maintaining compatibility with optical disk specifications.
Achieving 1000-year Data Persistence: "Engraved in Stone" (Presentation)
Abstract: Proper choices in materials coupled with the flexibility of Optical Data Storage hardware enables the implementation of truly persistent digital data on a DVD or Blu-ray disc. Recently completed accelerated lifetime studies conducted in accordance with the ISO 10995 test standard demonstrate that a lifetime on the order of 1,000 years is achievable in a mass-market-priced product.
House of Moves
Issues in Large-Scale Storage and Computing Systems for Film Production
CTO, Walt Disney Animation Studios
Data Architect, House of Moves
CTO, Method Studios
Storage Networking Industry Association Long Term Retention Technical Working Group (Presentation)
Hitachi Data Systems
Wednesday, May 8th, 2013
Analytics Drives Big Data Drives Infrastructure - Confessions of Storage turned Analytics Geeks (Presentation)
Abstract: This talk is about how "form follows function", but in iterative steps: how the infrastructure of big data analytics has evolved, from the early days preceding Hadoop and MapReduce, and continues to evolve even today. With new and emerging data sources, data types and diverse analytics on them, there are different and growing demands on processing, the storage of persistent data and its subsequent access. We believe the data processing and the associated data and storage architectures needed to support today's and future analytics will become even more demanding, and simple Hadoop processing with local storage will not suffice. We will illustrate our learning from our own experiences with two analytics services that evolved from batch-mode analytics processing in 2008 to today's hybrid of real-time processing and batch-mode querying on stored data used at Cruxly.
Advanced Tape Technologies for Future Archive Storage Systems (Presentation)
Building Blocks Required for Long Term Retention and Access to Enormous Quantities of Data (Presentation)
CTO, Spectra Logic
Abstract: This session is designed to give an introduction to the newest file management technologies that have emerged in 2013 for Active Archive. Anyone managing a mass storage infrastructure for HPC, Big Data, Cloud, research, etc., is painfully aware that the growth, access requirements and retention needs for data are relentless, and show no sign of letting up. This growth directly reflects the increase in storage infrastructure for business, new file-oriented applications used in both enterprise and technical computing, and expanded server and desktop virtualization projects.
The result is a flood of data that needs to be readily available, on the appropriate media in an active state at all times, even though the bulk of that data is seldom accessed. And at the heart of that problem is the need to (1) rationalize the way that data is managed; and (2) create online access to all that data without maintaining it in a continuous, power-consuming state.
The solution lies in creating an active archive that enables straight-from-the-desktop access to files stored at any tier for rapid data access. Active archive software technologies allow existing file systems to expand over flash, disk and tape library storage technologies. Long term planning is also a key factor when considering an active archive approach. Storage hardware is by its nature short term, while data longevity is long term. A true active archive environment should contemplate and provide for a seamless upgrade to future technologies across any or all performance tiers.
The Economics of Tape, Disk, and Flash for Petabyte Storage (Presentation)
Abstract: Increases in annual petabyte (PB) shipments for storage class memories (SCM) are driven by both increases in areal density and increases in manufacturing capacity. Increases in areal density tend to reduce cost per bit while increases in manufacturing capacity are cost neutral or slightly increase cost per bit. This paper surveys the last five years of PB shipments, areal density, revenue, and cost per bit for magnetic tape (TAPE), hard disk drives (HDD), and NAND flash to study manufacturing and cost trends for storage class memories.
Chair: Sean Roberts,
Piston Cloud Computing
Chair: Matthew O'Keefe,
University of Minnesota
Abstract: Low power, high bandwidth main memory systems and storage will be a major architectural focus in the next three to five years. Chip stacking with through-silicon vias (TSVs) opens a door of innovation not available to computer architects in the past 25 years. The NAND roadmap is providing new opportunities to SSD developers. This presentation will cover both volatile and non-volatile memory trends and roadmaps. The Hybrid Memory Cube and Micron's PCIe SSD architecture will be introduced.
Hard Drives: Obstacles and Opportunities in the Next Three Years (Presentation)
Abstract: This talk will share data on recent changes in the areal density growth rate and describe techniques under development to enable resumption of higher rates in capacity growth in the future.
Abstract: The trends of technology are rocking the storage industry. Fundamental changes in basic technology, combined with massive scale, new paradigms, and fundamental economics, lead to predictions of a new storage programming paradigm. The growth of low-cost-per-GB disk is continuing with technologies such as Shingled Magnetic Recording. Flash and RAM are continuing to scale, with roadmaps, some argue, down to the atom scale. These technologies do not come without a cost. It is time to reevaluate the interfaces we use for all kinds of storage: RAM, flash, and disk. The discussion starts with the unique economics of storage (as compared to processing and networking), discusses technology changes, posits a set of open questions and ends with predictions of fundamental shifts across the entire storage hierarchy.
Dr. Ken Anderson,
CEO, Akonia Holographics
Massive Data Storage (Research Track)
(Presenter names are in bold font)
Thursday, May 9th, 2013
Chair: Dr. Theodore Wong,
The Impact of Areal Density and Millions of Square Inches (MSI) of Produced Memory on
Petabyte Shipments of TAPE, NAND Flash, and HDD Storage Class Memories (Presentation)
, Gary Decad, and Steven Hetzler,
Abstract: Increases in annual petabyte (PB) shipments for storage class memories (SCM) are driven by both increases in areal density and increases in manufacturing capacity. Increases in areal density tend to reduce cost per bit while increases in manufacturing capacity are cost neutral or slightly increase cost per bit. This paper surveys the last five years of PB shipments, areal density, revenue, and cost per bit for magnetic tape (TAPE), hard disk drives (HDD), and NAND flash to study manufacturing and cost trends for storage class memories. First, using the five-year data for PB shipments and areal density values for TAPE, HDD and NAND flash, this paper applies a manufacturing measure used by semiconductors, millions of square inches (MSI) of produced memory, to TAPE, HDD, and NAND flash in order to compare manufacturing requirements for these three SCM technologies. The MSI calculations show that, with slowing areal density increases, HDD and NAND will require manufacturing investments to sustain PB shipment growth, while TAPE requires only modest investment in manufacturing capacity. The MSI calculations also show that the cost of NAND replacing HDD is prohibitive based simply on present-day manufacturing capacity, and that adopting the processing requirements of patterned media, the next proposed areal density improvement for HDD, would require significant manufacturing investments. Second, using the five-year data for PB shipments and revenue for TAPE, HDD, and NAND flash, trends in cost per bit for the SCM technologies can be determined and related to both technology innovations, i.e. lithography for NAND flash and predictable areal density increases for TAPE, and to external market factors, i.e. industry consolidation for HDD and mobile computing for NAND flash.
Lastly, while 2012 PB shipments for TAPE, HDD, and NAND flash totaled 430,000 PB, dominated by HDD with 380,000 PB, perceived information creation in 2012 was over 1,300,000 PB, posing the question to SCM manufacturers of how information is stored in today's environment.
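The paper's two core quantities can be sketched with simple arithmetic. The figures below are hypothetical stand-ins, not the paper's survey data; the conversion assumes areal density in Gbit/in² and 1 PB = 8×10⁶ Gbit:

```python
# All figures below are illustrative only; the paper's actual five-year
# survey data is not reproduced here.
def msi(pb_shipped, areal_density_gbit_in2):
    """Millions of square inches (MSI) of media needed to ship
    pb_shipped petabytes: 1 PB = 8e6 Gbit, divided by areal density
    (Gbit/in^2), expressed in millions of square inches."""
    return pb_shipped * 8e6 / areal_density_gbit_in2 / 1e6

def cost_per_gb(revenue_busd, pb_shipped):
    """Industry-average cost per gigabyte from revenue ($B) and PB shipped."""
    return revenue_busd * 1e9 / (pb_shipped * 1e6)

# Illustrative 2012-scale inputs (PB shipped, Gbit/in^2, $B revenue):
hdd_msi = msi(380_000, 750)   # media area scales inversely with areal density
tape_msi = msi(10_000, 2)     # low density => far more media area per PB
print(f"HDD: {hdd_msi:,.0f} MSI, TAPE: {tape_msi:,.0f} MSI")
print(f"HDD $/GB ~ {cost_per_gb(38.0, 380_000):.2f}")
```

This is why slowing areal density growth forces capacity (MSI) investment: if density stops rising, every additional PB shipped requires proportionally more produced media area.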
DROP: Facilitating Distributed Metadata Management in EB-scale Storage Systems (Presentation)
, Rajesh Vellore Arumugam, Khai Leong Yong and Sridhar Mahadevan,
Data Storage Institute, A*STAR
Abstract: Efficient and scalable distributed metadata management is critically important to overall system performance in large-scale distributed storage systems, especially in the EB era. Traditional state-of-the-art distributed metadata management schemes include hash-based mapping and subtree partitioning. The former evenly distributes workload among metadata servers, but it eliminates all hierarchical locality of metadata, so it cannot efficiently handle operations such as renaming or moving a directory, which require metadata to be migrated among metadata servers. The latter does not uniformly distribute workload among metadata servers, and metadata must be migrated to keep the load roughly balanced. In this paper, we present a ring-based metadata management scheme called Dynamic Ring Online Partitioning (DROP). It preserves metadata locality using locality-preserving hashing, and dynamically distributes metadata among the metadata server cluster to keep the load balanced. Experimental results from our performance evaluation demonstrate the effectiveness and scalability of DROP.
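The locality-preserving idea can be shown with a toy placement function (a sketch only; DROP's actual hash and ring maintenance are more sophisticated, and the names here are hypothetical). Keying the ring by the pathname itself keeps a directory's entries on adjacent servers, so a rename touches few servers, while boundaries can still be moved to rebalance load:

```python
import bisect

def place(paths, boundaries):
    """Order-preserving placement on a ring: server i owns all keys up to
    boundaries[i] (sorted upper bounds). Entries under one directory share
    a prefix, so they land on one or a few adjacent servers."""
    return {p: bisect.bisect_left(boundaries, p) for p in paths}

paths = ["/a/1", "/a/2", "/a/3", "/b/1", "/b/2"]
# Two servers; the boundary is chosen so load is roughly balanced (3 vs 2).
mapping = place(paths, boundaries=["/a/3\xff", "\xff"])
assert mapping["/a/1"] == mapping["/a/2"] == mapping["/a/3"] == 0
assert mapping["/b/1"] == mapping["/b/2"] == 1
```

A plain hash would scatter `/a/*` across all servers, which is exactly the renaming problem the abstract describes.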
Zettabyte Reliability with Flexible End-to-end Data Integrity (Presentation)
, Daniel Myers, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,
University of Wisconsin - Madison
Abstract: We introduce flexible end-to-end data integrity for storage systems, which enables each component along the I/O path (e.g., memory, disk) to alter its protection scheme to meet the performance and reliability demands of the system. We apply this new concept to the Zettabyte File System (ZFS) and build Zettabyte-Reliable ZFS (Z2FS). Z2FS provides dynamic tradeoffs between performance and protection and offers Zettabyte Reliability, which is one undetected corruption per zettabyte of data read. We develop an analytical framework to evaluate reliability; the protection approaches in Z2FS are built upon the foundations of the framework. For comparison, we implement a straightforward End-to-End ZFS (E2ZFS) with the same protection scheme for all components. Through analysis and experiment, we show that Z2FS is able to achieve better overall performance than E2ZFS, while still offering Zettabyte Reliability.
Chair: Prof. Darrell Long,
University of California, Santa Cruz
Secure Logical Isolation for Multi-tenancy in Cloud Storage (Presentation)
Michael Factor, David Hadas, Aner Hamama, Nadav Har'El, Hillel Kolodner
, Anil Kurmus, Eran Rom,
Alexandra Shulman-Peleg, and Alessandro Sorniotti,
Abstract: Storage cloud systems achieve economies of scale by serving multiple tenants from a shared pool of servers and disks. This leads to the commingling of data from different tenants on the same devices. Typically, a request is processed by an application running with sufficient privileges to access any tenant’s data; this application authenticates the user and authorizes the request prior to carrying it out. Since the only protection is at the application level, a single vulnerability threatens the data of all tenants, and could lead to cross-tenant data leakage, making the cloud much less secure than dedicated physical resources. To provide security close to physical isolation while allowing complete resource pooling, we propose Secure Logical Isolation for Multi-tenancy (SLIM). SLIM incorporates the first complete security model and set of principles for the safe logical isolation between tenant resources in a cloud storage system, as well as a set of mechanisms for implementing the model. We show how to implement SLIM for OpenStack Swift and present initial performance results.
Hybrid Solid State Drives for Improved Performance and Enhanced Lifetime (Presentation)
, Eunjae Lee, and Donghee Lee,
University of Seoul
Sam H. Noh,
Abstract: As the market becomes more competitive, SSD manufacturers are moving from SLC (Single-Level Cell) to MLC (Multi-Level Cell) flash memory chips that store two bits per cell as building blocks for SSDs. Recently, TLC chips, which store three bits per cell, are being considered as a viable solution due to their low cost. However, the performance and lifetime of TLC chips are considerably limited, and thus pure TLC-based SSDs may not be viable as general storage devices. In this paper, we propose a hybrid SSD solution, namely HySSD, where SLC and TLC chips are used together to form an SSD performing on par with SLC-based products. Based on an analytical model, we propose a near-optimal data distribution scheme that distributes data among the SLC and TLC chips for a given workload such that performance or lifetime may be optimized. Experiments with two types of SSDs, both based on DiskSim with SSD Extension, show that the analytic model approach can dynamically adjust data distribution as workloads evolve to enhance performance or lifetime.
A Novel I/O Scheduler for SSD with Improved Performance and Lifetime (Presentation)
Hua Wang, Ping Huang
, Shuang He, Ke Zhou, and Chunhua Li,
HuaZhong University of Science and Technology
Virginia Commonwealth University
Abstract: This paper presents a novel block I/O scheduler specifically for SSDs. The scheduler leverages the rich internal parallelism resulting from the SSD's highly parallelized architecture. It speculatively divides the entire SSD space into subregions and dispatches requests to those subregions in a round-robin fashion at the Linux kernel block layer. Meanwhile, to reduce the severe read-write interference problem associated with SSDs, the scheduler dispatches only a batch of unidirectional requests to the disk driver at each subregion's scheduling opportunity. Furthermore, to take advantage of the SSD's better sequential performance over random patterns, the scheduler sorts pending requests while they wait in the dispatch queues, as HDD-oriented schedulers do. Experimental results with a variety of workloads demonstrate that the new I/O scheduler not only improves user-perceived performance, but also enhances the underlying SSD's lifetime by reducing block erase operations.
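The three mechanisms the abstract describes (subregion round-robin, unidirectional batches, LBA sorting) can be sketched in a toy user-space model. This is an illustration only; the actual scheduler lives in the Linux kernel block layer, and the class and parameter names here are hypothetical:

```python
from collections import deque

class SubregionScheduler:
    """Toy model: requests are binned into subregions by LBA, subregions
    take round-robin turns, and each turn dispatches one LBA-sorted,
    unidirectional (all-read or all-write) batch to cut read-write
    interference."""
    def __init__(self, num_subregions, region_size):
        self.queues = [deque() for _ in range(num_subregions)]
        self.region_size = region_size
        self.turn = 0

    def submit(self, lba, op):
        idx = (lba // self.region_size) % len(self.queues)
        self.queues[idx].append((lba, op))

    def dispatch(self):
        q = self.queues[self.turn]
        self.turn = (self.turn + 1) % len(self.queues)
        if not q:
            return []
        direction = q[0][1]                       # batch follows the head request
        batch = [r for r in q if r[1] == direction]
        for r in batch:
            q.remove(r)
        return sorted(batch)                      # LBA-sorted, as HDD schedulers do

sched = SubregionScheduler(num_subregions=2, region_size=100)
for lba, op in [(5, "R"), (150, "W"), (30, "R"), (110, "W")]:
    sched.submit(lba, op)
assert sched.dispatch() == [(5, "R"), (30, "R")]      # subregion 0: reads only
assert sched.dispatch() == [(110, "W"), (150, "W")]   # subregion 1: writes only
```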
Proactive Drive Failure Prediction for Large Scale Storage Systems (Presentation)
Bingpeng Zhu, Gang Wang, Xiaoguang Liu, and Jingwei Ma
Tianjin University of Technology
Abstract: Most modern hard disk drives support Self-Monitoring, Analysis and Reporting Technology (SMART), which can monitor internal attributes of individual drives and predict impending drive failures by a thresholding method. As the prediction performance of the thresholding algorithm is disappointing, some researchers have explored various statistical and machine learning methods for predicting drive failures based on SMART attributes. However, the failure detection rates of these methods are only up to 50% ~ 60% with low false alarm rates (FARs). We explore the ability of a Backpropagation (BP) neural network model to predict drive failures based on SMART attributes. We also develop an improved Support Vector Machine (SVM) model. A real-world dataset of 23,395 drives is used to verify these models. Experimental results show that the prediction accuracy of both models is far higher than in previous work. Although the SVM model achieves the lowest FAR (0.03%), the BP neural network model is considerably better in failure detection rate, which reaches 95% while keeping a reasonably low FAR.
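The baseline thresholding method the abstract criticizes is simple to state (the attribute names and threshold values below are illustrative, not vendor values): a drive is flagged as failing if any normalized SMART attribute drops to or below its threshold. The learning-based models replace this rule with classifiers trained on the same attributes.

```python
def threshold_predict(smart, thresholds):
    """Baseline SMART thresholding: predict failure if any normalized
    attribute value has dropped to or below its threshold. (Normalized
    SMART values decrease as drive health degrades.)"""
    return any(smart[attr] <= limit for attr, limit in thresholds.items())

thresholds = {"reallocated_sectors": 36, "seek_error_rate": 30}  # illustrative
healthy = {"reallocated_sectors": 100, "seek_error_rate": 85}
failing = {"reallocated_sectors": 20,  "seek_error_rate": 85}

assert not threshold_predict(healthy, thresholds)
assert threshold_predict(failing, thresholds)
```

Because each attribute is tested in isolation against a conservative limit, this rule misses drives whose attributes degrade jointly but never individually cross a threshold, which is the gap the BP and SVM models aim to close.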
CORE: Augmenting Regenerating-Coding-Based Recovery for Single and
Concurrent Failures in Distributed Storage Systems (Presentation)
Runhui Li, Jian Lin
, and Patrick P. C. Lee,
The Chinese University of Hong Kong
Abstract: Data availability is critical in distributed storage systems, especially when node failures are prevalent in real life. A key requirement is to minimize the amount of data transferred among nodes when recovering the lost or unavailable data of failed nodes. This paper explores recovery solutions based on regenerating codes, which are shown to provide fault-tolerant storage and minimum recovery bandwidth. Existing optimal regenerating codes are designed for single node failures. We build a system called CORE, which augments existing optimal regenerating codes to support a general number of failures including single and concurrent failures. We theoretically show that CORE achieves the minimum possible recovery bandwidth for most cases. We implement CORE and evaluate our prototype atop a Hadoop HDFS cluster testbed with up to 20 storage nodes. We demonstrate that our CORE prototype conforms to our theoretical findings and achieves recovery bandwidth saving when compared to the conventional recovery approach based on erasure codes.
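The bandwidth saving the abstract refers to can be quantified with the standard single-failure repair-bandwidth formula for minimum-storage regenerating (MSR) codes (this is the generic formula from the regenerating-codes literature, not CORE's specific multi-failure construction; the function name and example parameters are ours):

```python
def msr_repair_bandwidth(M, k, d):
    """Single-failure repair bandwidth of an MSR regenerating code:
    each of d helper nodes sends M/(k*(d-k+1)) units, so the total
    transfer is d*M/(k*(d-k+1)), versus M for conventional
    erasure-code repair (read k blocks of size M/k each)."""
    return d * M / (k * (d - k + 1))

M = 100.0                       # units of original data per stripe
conventional = M                # erasure-code repair must reconstruct all M
msr = msr_repair_bandwidth(M, k=5, d=9)
assert msr == 36.0              # 9*100/(5*5): well under the baseline of 100
```

CORE's contribution is extending this kind of saving from the single-failure case, which existing optimal codes handle, to concurrent failures.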
TIGER: Thermal-Aware File Assignment in Storage Clusters (Presentation)
, Xunfei Jiang, and Xiao Qin,
Abstract: In this paper, we present a thermal-aware file assignment technique called TIGER for reducing the cooling cost of storage clusters in data centers. TIGER first calculates thresholds for the disks in each node based on the node's contribution to heat recirculation in the data center. Next, TIGER assigns files to data nodes according to the calculated thresholds. We evaluated the performance of TIGER in terms of both cooling energy conservation and storage cluster response time. Our results confirm that TIGER reduces cooling-power requirements for clusters, offering about 10 to 15 percent cooling energy savings without significantly degrading I/O performance.
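The two-step structure (per-node thermal thresholds, then assignment against them) can be sketched with a greedy placement. This is our simplification for illustration, not TIGER's actual algorithm; node names, loads, and budgets are hypothetical:

```python
def assign_files(files, thermal_budgets):
    """Sketch of threshold-driven placement: each node has a heat budget
    (its threshold from the recirculation analysis); place each file's
    expected I/O load, largest first, on the node with the most headroom."""
    placement = {}
    headroom = dict(thermal_budgets)
    for name, load in sorted(files.items(), key=lambda kv: -kv[1]):
        node = max(headroom, key=headroom.get)   # most remaining budget
        placement[name] = node
        headroom[node] -= load
    return placement

files = {"a": 5, "b": 3, "c": 2}                 # expected load per file
placement = assign_files(files, {"n0": 6, "n1": 6})
assert placement == {"a": "n0", "b": "n1", "c": "n1"}
```

The effect is that nodes contributing most to heat recirculation receive smaller budgets and therefore less hot data, which is how cooling demand is reduced without idling any node entirely.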
Georgia Institute of Technology
Abstract: Smartphone applications are becoming more sophisticated and require high storage performance. Unfortunately, the OS storage software stack is not well engineered to support the flash-based storage used in smartphones. On top of that, the storage software stack is configured too conservatively for fear of sudden power failures. We believe that this conservatism with respect to data reliability is misplaced, considering that many of the popular apps (e.g., Web browsing, Facebook, Gmail) that run on today's smartphones are cloud-backed, and the local storage on the smartphone is often used as a cache for cloud data.
In this paper, we propose an informed storage management framework, named Fjord, for mobile platforms. The key insight is to use system-wide dynamic context information to improve storage performance on mobile platforms. We implement a set of mechanisms (write buffering, logging, and fine-grained reliability control), and through judicious use of these mechanisms based on system context, we show how we can achieve significant improvement in storage performance. As proof of concept, we implement Fjord on two Android smartphones and experimentally validate the performance advantage of informed storage management with multiple smartphone applications.
HCTrie: A Structure for Indexing Hundreds of Dimensions for Use in File Systems Search (Presentation)
University of California, Santa Cruz
Abstract: Data management in large-scale storage systems involves indexing and search for data objects (e.g., files). There are hundreds of types of metadata attributed to the data objects: examples include environmental settings of photograph files and simulation configurations for simulation output files. To provide intelligent file search that uses file metadata, we introduce a novel search structure called Hyper-Cube Trie (HCTrie) that can handle a few hundred dimensions of data attributes. HCTrie can utilize the differences in many dimensions effectively: candidates can be pruned based on differences in all dimensions. To the best of our knowledge, this is the first approach to keep memory growth linear in the number of dimensions when multiple dimensions are indexed at the same time. Our prototype has successfully indexed five million data entries with one hundred attributes in a single data structure. We show that HCTrie can outperform MySQL in range search where ranges for fewer than 100 dimensions are specified in the search query.
Dynamic I/O Congestion Control in Scalable Lustre File System (Presentation)
Yingjin Qian and Ruihai Yi,
Satellite Marine Tracking & Control Department of China
, Nong Xiao, Shiyao Jin,
State Key Laboratory of High Performance Computing, National University of Defense Technology
Abstract: This paper introduces a scalable I/O model for the Lustre file system and proposes a dynamic I/O congestion control mechanism to support incoming exascale HPC systems. Under its control, clients are allowed to issue more concurrent I/O requests to servers when the servers are lightly loaded, which optimizes the utilization of network and server resources and improves I/O throughput; on the other hand, when a server is overloaded, it can throttle the clients' I/O and limit the number of I/O requests queued on the server to control I/O latency and avoid congestive collapse. The results of a series of experiments demonstrate the effectiveness of our congestion control mechanism: it prevents congestive collapse and, on this premise, maximizes I/O throughput for the scalable Lustre file system.
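The widen-when-light, throttle-when-overloaded behavior can be sketched as an AIMD-style window adjustment (a generic sketch, not Lustre's actual mechanism; the queue-depth thresholds and window limits are hypothetical):

```python
def adjust_window(window, queue_depth, low=16, high=64, max_window=256):
    """Toy dynamic I/O congestion control: grow the client's in-flight
    request window while the server queue is short, and back off
    multiplicatively once the server is overloaded."""
    if queue_depth < low:
        return min(window + 1, max_window)   # additive increase: server is idle
    if queue_depth > high:
        return max(window // 2, 1)           # multiplicative decrease: overload
    return window                            # steady state

w = 32
w = adjust_window(w, queue_depth=4);   assert w == 33   # light load: widen
w = adjust_window(w, queue_depth=40);  assert w == 33   # moderate: hold
w = adjust_window(w, queue_depth=100); assert w == 16   # overload: throttle
```

The multiplicative back-off is what prevents congestive collapse: when many clients see a long server queue simultaneously, the aggregate offered load halves quickly instead of continuing to pile up.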
SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs (Presentation)
Sangwook Shane Hahn
, Sungjin Lee, and Jihong Kim,
Seoul National University
Abstract: We propose an efficient software-based out-of-order scheduling technique, called SOS, for high-performance NAND flash-based SSDs. Unlike an existing hardware-based out-of-order technique, our proposed software-based solution, SOS, can make more efficient out-of-order scheduling decisions by exploiting various mapping information and I/O access characteristics obtained from the flash translation layer (FTL) software. Furthermore, SOS can avoid unnecessary hardware-level operations and manage I/O request rearrangements more efficiently, thus maximizing the multiple-chip parallelism of SSDs. Experimental results on a prototype SSD show that SOS is effective in improving the overall SSD performance, lowering the average I/O response time by up to 42% over a hardware-based out-of-order flash controller.
NVMFS: A Hybrid File System for Improving Random Write in NAND-flash SSD (Presentation)
and A. L. Narasimha Reddy,
Texas A&M University
Abstract: In this paper, we design a storage system consisting of Nonvolatile DIMMs (as NVRAM) and NAND-flash SSD. We propose a file system, NVMFS, to exploit the unique characteristics of these devices, which simplifies and speeds up file system operations. We use the higher-performance NVRAM as both a cache and permanent space for data. Hot data can be permanently stored on NVRAM without writing back to SSD, while relatively cold data can be temporarily cached by NVRAM with another copy on SSD. We also reduce the erase overhead of SSD by reorganizing writes on NVRAM before flushing to SSD.
We have implemented a prototype of NVMFS within the Linux kernel and compared it with several modern file systems such as ext3, btrfs and NILFS2. We also compared it with another hybrid file system, Conquest, which was originally designed for NVRAM and HDD. The experimental results show that NVMFS improves I/O throughput by an average of 98.9% when segment cleaning is not active, and by an average of 19.6% under high disk utilization (over 85%), compared to other file systems. We also show that our file system reduces erase operations and overheads on the SSD.
Chair: Dr. Theodore Wong,
Proteus: A Flexible Simulation Tool for Estimating Data Loss Risks in Storage Arrays (Presentation)
Hsu-Wan Kao and Jehan-Francois Paris,
University of Houston
University of California, Santa Cruz
Universidad Católica del Uruguay
Abstract: Proteus is an open-source simulation program that can predict the risk of data loss in many disk array configurations, including mirrored disks, all levels of RAID arrays, and various two-dimensional RAID arrays. It characterizes each array by five numbers: the size n of the array; the number nf of simultaneous disk failures the array will always tolerate without data loss; and the respective fractions f1, f2 and f3 of simultaneous failures of nf + 1, nf + 2 and nf + 3 disks that will not result in data loss. As with any simulation tool, Proteus imposes no restriction on the distributions of failure and repair events. Our measurements have shown a surprisingly good agreement with the results obtained through analytical techniques and no measurable difference between values obtained assuming deterministic repair times and those assuming exponential repair times.
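The five-number characterization maps directly to a small loss-probability function (a sketch of the characterization only; Proteus itself is a full event-driven simulator, and the example fractions below are made up):

```python
def loss_probability(failures, nf, f1, f2, f3):
    """Proteus-style array characterization: up to nf simultaneous
    failures are always tolerated; f1, f2, f3 are the fractions of
    (nf+1)-, (nf+2)- and (nf+3)-failure patterns causing no data loss;
    beyond nf+3 failures, loss is assumed certain."""
    survive = {1: f1, 2: f2, 3: f3}
    extra = failures - nf
    if extra <= 0:
        return 0.0
    return 1.0 - survive.get(extra, 0.0)

# Hypothetical RAID-6-like array: any 2 failures tolerated,
# 40% of 3-failure patterns and 10% of 4-failure patterns survive.
assert loss_probability(2, nf=2, f1=0.4, f2=0.1, f3=0.0) == 0.0
assert abs(loss_probability(3, nf=2, f1=0.4, f2=0.1, f3=0.0) - 0.6) < 1e-12
assert loss_probability(6, nf=2, f1=0.4, f2=0.1, f3=0.0) == 1.0
```

A simulator then only needs to draw failure and repair events from arbitrary distributions, count how many disks are simultaneously down, and apply this function, which is why the five numbers suffice to describe so many array geometries.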
and Kaladhar Voruganti,
Abstract: Designers of storage and file systems use I/O traces to emulate application workloads while designing new algorithms and for testing bug fixes. However, since traces are large, they are hard to store and inflexible to manipulate. Thus, researchers have proposed techniques to create trace models in order to alleviate these concerns. However, prior trace modeling approaches are limited with respect to 1) the number of trace parameters they can model, and hence the accuracy of the model, and 2) manipulating the trace model in both temporal and spatial domains (that is, changing the burstiness of a workload, or scaling the size of the data supporting the workload). In this paper we present a new algorithm/tool called Paragone that addresses the above-mentioned problems by fundamentally rethinking how traces should be modeled and replayed.
A Deduplication Study for Host-side Caches in Virtualized Data Center Environments (Presentation)
and Jiri Schindler,
Abstract: Flash memory-based caches inside VM hypervisors can reduce I/O latencies and offload much of the I/O traffic from network-attached storage systems deployed in virtualized data centers. This paper explores the effectiveness of content deduplication in these large (typically 100s of GB) host-side caches. Previous deduplication studies focused on data mostly at rest in backup and archive applications. This study focuses on cached data and dynamic workloads within the shared VM infrastructure. We analyze I/O traces from six virtual desktop infrastructure (VDI) I/O storms and two long-term CIFS studies and show that deduplication can reduce the data footprint inside host-side caches by as much as 67%. This in turn allows for caching a larger portion of the data set and improves the effective cache hit rate. More importantly, such increased caching efficiency can alleviate load from networked storage systems during I/O storms when most VM instances perform the same operation such as virus scans, OS patch installs, and reboots.
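The mechanism behind the footprint reduction is generic content-addressed deduplication (a minimal sketch, not the paper's cache implementation): identical blocks, such as the same OS file read by many VM clones during an I/O storm, are stored once, keyed by a hash of their contents.

```python
import hashlib

def dedup_footprint(blocks):
    """Content-addressed dedup: count unique block contents (by SHA-256)
    versus total blocks cached."""
    unique = {hashlib.sha256(b).hexdigest() for b in blocks}
    return len(unique), len(blocks)

# Ten VDI clones reading the same boot block, plus two distinct user blocks.
blocks = [b"boot-block"] * 10 + [b"user-a", b"user-b"]
unique, total = dedup_footprint(blocks)
assert (unique, total) == (3, 12)
print(f"footprint reduced by {1 - unique / total:.0%}")  # 75% in this toy case
```

The more homogeneous the VM population, the closer the cache's working set collapses toward a single golden image, which is why VDI storms show such large reductions.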
On the Design and Implementation of a Simulator for Parallel File System Research (Presentation)
, Renato Figueiredo,
University of Florida
Yiqi Xu and Ming Zhao,
Florida International University
Abstract: Due to the popularity and importance of Parallel File Systems (PFSs) in modern High Performance Computing (HPC) centers, PFS designs and I/O optimizations are active research topics. However, the research process is often time consuming and faces cost and complexity challenges in deploying experiments in real HPC systems. This paper describes PFSsim, a trace-driven simulator of distributed storage systems that allows the evaluation of PFS designs, I/O schedulers, network structures, and workloads. PFSsim differentiates itself from related work in that it provides a powerful platform featuring a modular design with high flexibility in the modeling of subsystems including the network, clients, data servers and I/O schedulers. It does so by designing the simulator to capture abstractions found in common PFSs. PFSsim also exposes script-based interfaces for detailed configurations. Experiments and validation against real systems, covering both individual sub-modules and the entire simulator, show that PFSsim is capable of simulating a representative PFS (PVFS2) and of modeling different I/O scheduler algorithms with good fidelity. In addition, the simulation speed is shown to be acceptable.
Friday, May 10th, 2013
Chair: Prof. Dr. André Brinkmann,
Johannes Gutenberg-Universität Mainz
DRepl: Optimizing Access to Application Data for Analysis and Visualization (Presentation)
and Michael Lang,
Los Alamos National Laboratory
University of California, Santa Cruz
Abstract: Until recently, most scientific applications produced data that was saved, analyzed, and visualized at a later time. In recent years, with the large increase in the amount of data and computational power available, there is demand for applications to support data access in situ, or close to the simulation, to provide application steering, analytics, and visualization. The data access patterns required for these activities usually differ from the data layout produced by the application. In most large HPC clusters, scientific data is stored in parallel file systems instead of locally on the cluster nodes. To increase reliability, the data is replicated using standard RAID schemes. Parallel file server nodes usually have more processing power than they need, so it is feasible to offload some of the data-intensive processing to them. DRepl replaces the standard methods of data replication with replicas having different layouts, optimized for the most commonly used access patterns. Replicas can be complete (i.e., any other replica can be reconstructed from it) or incomplete. DRepl consists of a language to describe the dataset and the necessary data layouts, and tools to create a user-space file server that keeps the data consistent and up to date in all optimized layouts. DRepl decouples the data producers and consumers, and the data layouts they use, from the way the data is stored on the storage system. DRepl has shown up to a 2x improvement in cumulative performance when data is accessed using optimized replicas.
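The layout-optimized replica idea can be sketched with a toy example: the same dataset kept in a row-oriented copy for the simulation's writes and a column-oriented copy for the analysis tool's reads. This is our own illustration, not DRepl's actual description language or tooling.

```python
def make_replicas(rows):
    """Illustrative sketch of maintaining one dataset in two layouts: a
    row-major replica matching the producer's write pattern and a
    column-major replica matching the consumer's read pattern. Either
    complete replica can reconstruct the other."""
    row_major = [list(r) for r in rows]
    col_major = [list(c) for c in zip(*rows)]   # transpose once at write time
    return row_major, col_major

def read_column(col_major, j):
    # One contiguous run in the column-optimized replica, instead of a
    # strided scan across every row of the row-major copy.
    return col_major[j]
```

The point of paying the transpose cost at write time is that the analysis/visualization side then sees sequential I/O for its dominant access pattern.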
FSMAC: A File System Metadata Accelerator with Non-Volatile Memory (Presentation)
Qingsong Wei, Cheng Chen, and Lingkun Wu,
Data Storage Institute, A*STAR
Abstract: File system performance is dominated by metadata access, because metadata is small and frequently accessed. Metadata is stored in blocks in the file system, so a partial metadata update results in a whole-block read and write, which amplifies disk I/O. The huge performance gap between CPU and disk aggravates this problem.
In this paper, a file system metadata accelerator (referred to as FSMAC) is proposed to optimize metadata access by efficiently exploiting the advantages of Non-volatile Memory (NVM). FSMAC decouples the data and metadata I/O paths, putting data on disk and metadata in NVM at runtime. Thus, data is accessed in blocks over the I/O bus, while metadata is accessed in a byte-addressable manner over the memory bus. Metadata access is significantly accelerated, and metadata I/O is eliminated because metadata in NVM is no longer flushed back to disk periodically. A lightweight consistency mechanism combining fine-grained versioning and transactions is introduced in FSMAC. FSMAC is implemented on the basis of the Linux Ext4 file system and intensively evaluated under different workloads. Evaluation results show that FSMAC accelerates the file system by up to 49.2x for synchronous I/O and 7.22x for asynchronous I/O.
A Lightweight I/O Scheme to Facilitate Spatial and Temporal Queries of Scientific Data Analytics (Presentation)
Zhuo Liu, Bin Wang, and Weikuan Yu,
Scott Klasky, Hasan Abbasi, and Norbert Podhorszki,
Oak Ridge National Laboratory
Northrop Grumman Corporation
NASA Goddard Space Flight Center
University of Tennessee, Knoxville
Abstract: In the era of petascale computing, more scientific applications are being deployed on leadership-scale computing platforms to enhance scientific productivity. Many I/O techniques have been designed to address the growing I/O bottleneck on large-scale systems by handling massive scientific data in a holistic manner. While such techniques have been leveraged in a wide range of applications, they have not been shown to be adequate for many mission-critical applications, particularly in the data post-processing stage. For example, some scientific applications generate datasets composed of a vast number of small data elements that are organized along many spatial and temporal dimensions but require sophisticated data analytics on one or more dimensions. Including such dimensional knowledge in the data organization can benefit the efficiency of data post-processing, but it is often missing from existing I/O techniques. In this study, we propose a novel I/O scheme named STAR (Spatial and Temporal AggRegation) to enable high-performance data queries for scientific analytics. STAR is able to dive into the massive data, identify the spatial and temporal relationships among data variables, and accordingly organize them into an optimized multi-dimensional data structure before writing them to storage. This technique not only facilitates the common access patterns of data analytics, but also further reduces the application turnaround time. In particular, STAR enables efficient data queries along the time dimension, a practice common in scientific analytics but not yet supported by existing I/O techniques. In our case study with GEOS-5, a critical climate modeling application, experimental results on the Jaguar supercomputer demonstrate an improvement of up to 73x in read performance compared to the original I/O method.
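The temporal-aggregation idea can be illustrated with a small sketch: instead of leaving each variable's values scattered across per-timestep records, group them so a time-range query over one variable touches a single contiguous run. The function names and data shapes here are our own illustration, not STAR's implementation.

```python
from collections import defaultdict

def aggregate_by_variable(timestep_records):
    """Regroup per-timestep output (one dict of variables per timestep)
    so that each variable's values are contiguous along the time axis.
    A time-dimension range query then becomes one sequential read rather
    than a probe into every timestep's output."""
    layout = defaultdict(list)
    for t, record in enumerate(timestep_records):
        for var, value in record.items():
            layout[var].append((t, value))
    return dict(layout)

def query_time_range(layout, var, t0, t1):
    # Scan one contiguous per-variable run instead of all timesteps.
    return [v for t, v in layout[var] if t0 <= t <= t1]
```

In a real system the reorganization would happen at aggregation time before data reaches storage, which is where the turnaround-time savings in the abstract come from.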
Chair: Dr. Zvonimir Bandic,
, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau,
University of Wisconsin - Madison
Abstract: Flash-based devices are cost-competitive with traditional hard disks in both personal and industrial environments and offer the potential for large performance gains. However, as flash-based devices have a high bit-error rate and a relatively short lifetime, reliability issues remain a major problem. One possible solution is redundancy; using techniques such as mirroring, data reliability and availability can be greatly enhanced. All standard RAID approaches assume that devices do not wear out, and hence distribute work equally among them. Unfortunately, for flash this approach is not appropriate, as the life of a flash cell depends on the number of times it is written and cleaned. Hence, identical write patterns to mirrored flash drives introduce a failure dependency in the storage system, increasing the probability of concurrent device failure and hence data loss.
We propose Warped Mirrors as a solution to this endurance problem for mirrored flash devices. By carefully inducing a slight imbalance into write traffic across devices, we intentionally increase the workload of one device in the mirror pair, and thus increase the odds that it will fail first. Thus, with our approach, device failure independence is preserved. Our simulation results show that across both synthetic and traced workloads, little performance overhead is induced.
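The induced-imbalance idea can be sketched in a few lines: both mirrors still receive every user write (preserving redundancy), but one device periodically absorbs an extra background write so the two devices' wear counts diverge. The mechanism and parameter here are our assumptions for illustration, not the paper's actual scheme.

```python
class WarpedMirror:
    """Sketch of deliberately skewing wear across a flash mirror pair so
    the devices stop aging in lockstep and one predictably fails first."""

    def __init__(self, extra_every=20):
        self.writes = [0, 0]           # per-device write counts (wear proxy)
        self.extra_every = extra_every  # 1-in-20 ~= 5% extra wear on device 0
        self.count = 0

    def write(self, block):
        # Both mirrors receive the user data, so redundancy is preserved.
        self.writes[0] += 1
        self.writes[1] += 1
        self.count += 1
        # Periodically add one extra write to device 0 only, so the two
        # devices' erase counts (and hence expected failure times) diverge.
        if self.count % self.extra_every == 0:
            self.writes[0] += 1
```

The skew only needs to be large enough to break the failure correlation; the abstract's simulation results suggest the resulting performance overhead is small.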
Youjip Won, Joongwoo Hwang, Sooyong Kang, and Jaehyuk Cha,
Seoul National University
Abstract: In this paper, we present VSSIM (Virtual SSD Simulator), a virtual machine based SSD simulator. VSSIM addresses the issues of trace-driven simulation, e.g., trace re-scaling and accurate replay. VSSIM operates on top of QEMU/KVM with a software-based SSD module. It runs in real time and allows the user to measure both host performance and SSD behavior under various design choices. VSSIM can flexibly model various hardware components, e.g., the number of channels, the number of ways, block size, page size, planes per chip, the program, erase, and read latencies of NAND cells, channel switch delay, and way switch delay. VSSIM can also facilitate the implementation of SSD firmware algorithms. To demonstrate the capability of VSSIM, we performed a number of case studies. The results of the simulation study deliver important guidelines for the firmware and hardware design of future NAND-based storage devices. The following are some of the findings: (i) as the page size increases, the performance benefit of increasing channel parallelism over increasing way parallelism becomes less significant; (ii) due to the bi-modality in I/O size distribution, the FTL should be designed to handle multiple mapping granularities; (iii) hybrid mapping does not work in SSDs with four or more ways due to severe log block fragmentation; (iv) as a performance metric, the Write Amplification Factor can be misleading; (v) compared to sequential writes, random writes benefit more from channel-level parallelism, and therefore in a multi-channel environment it is beneficial to categorize a larger fraction of I/O as random. VSSIM is validated against a commodity SSD, the Intel X25-M, and models its sequential I/O performance to within 3%.
and Ethan L. Miller,
University of California, Santa Cruz
Samsung Semiconductor Inc.
Samsung Electronics Co.
Abstract: This paper explores the benefits and limitations of in-storage processing on current Solid-State Disk (SSD) architectures. While disk-based in-storage processing has not been widely adopted, due to the characteristics of hard disks, modern SSDs provide high performance on concurrent random writes, and have powerful processors, memory, and multiple I/O channels to flash memory, enabling in-storage processing with almost no hardware changes. In addition, offloading I/O tasks allows a host system to fully utilize devices’ internal parallelism without knowing the details of their hardware configurations.
To leverage the enhanced data processing capabilities of modern SSDs, we introduce the Smart SSD model, which pairs in-device processing with a powerful host system capable of handling data-oriented tasks without modifying operating system code. By isolating the data traffic within the device, this model promises low energy consumption, high parallelism, low host memory footprint and better performance. To demonstrate these capabilities, we constructed a prototype implementing this model on a real SATA-based SSD. Our system uses an object-based protocol for low-level communication with the host, and extends the Hadoop MapReduce framework to support a Smart SSD. Our experiments show that total energy consumption is reduced by 50% due to the low-power processing inside a Smart SSD. Moreover, a system with a Smart SSD can outperform host-side processing by a factor of two or three by efficiently utilizing internal parallelism when applications have light traffic to the device DRAM under the current architecture.
Chair: Prof. Ahmed Amer,
Santa Clara University
Cache, Cache Everywhere, Flushing All Hits Down The Sink: On Exclusivity in Multilevel, Hybrid Caches (Presentation)
David C. van Moolenbroek, and Andrew S. Tanenbaum,
Abstract: Several multilevel storage systems have been designed over the past few years that utilize RAM and flash-based SSDs in concert to cache data resident in HDD-based primary storage. The low cost/GB and non-volatility of SSDs relative to RAM have encouraged storage system designers to adopt inclusivity (between RAM and SSD) in the caching hierarchy. However, in light of recent changes in the hardware landscape, we believe that future multilevel caches will invariably be hybrid caches in which 1) all or most levels are physically collocated, 2) the levels differ substantially only with respect to performance, not storage density, and 3) all levels are persistent. In this paper, we investigate the design tradeoffs involved in building exclusive, persistent, direct-attached, multilevel storage caches. We first present a comparative evaluation of various techniques that have been proposed to achieve exclusivity in distributed storage caches in the context of a direct-attached, hybrid cache, and show the potential performance benefits of maintaining exclusivity. We then investigate extensions to these demand-based, read-only data caching algorithms to address two issues specific to direct-attached hybrid caches, namely handling writes and managing SSD lifetime.
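The exclusivity property discussed above can be sketched with a two-level LRU cache: a block lives in at most one level at a time, so the combined capacity equals the sum of the levels. This is a generic illustration of the design point, not the paper's algorithms; all names are ours.

```python
from collections import OrderedDict

class ExclusiveHybridCache:
    """Sketch of an exclusive RAM+SSD cache: RAM evictions are demoted to
    SSD, and SSD hits are promoted back to RAM and removed from SSD."""

    def __init__(self, ram_size, ssd_size):
        self.ram = OrderedDict()   # LRU order: first item is coldest
        self.ssd = OrderedDict()
        self.ram_size, self.ssd_size = ram_size, ssd_size

    def _admit_ram(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_size:
            # Demote the coldest RAM block to SSD instead of discarding it.
            old_key, old_val = self.ram.popitem(last=False)
            self.ssd[old_key] = old_val
            if len(self.ssd) > self.ssd_size:
                self.ssd.popitem(last=False)  # falls out of the hierarchy

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.ssd:
            # Promote on SSD hit and *remove* from SSD: exclusivity means
            # the hierarchy holds up to ram_size + ssd_size distinct blocks.
            value = self.ssd.pop(key)
            self._admit_ram(key, value)
            return value
        return None

    def put(self, key, value):
        self.ssd.pop(key, None)    # never keep two copies across levels
        self._admit_ram(key, value)
```

An inclusive design would instead keep the promoted block in both levels, wasting SSD capacity on data already in RAM, which is the overhead exclusivity avoids.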
and Assaf Schuster,
IBM Research - Haifa
Abstract: Large scale consolidation of distributed systems introduces data sharing between consumers which are not centrally managed, but may be physically adjacent. For example, shared global data sets can be jointly used by different services of the same organization, possibly running on different virtual machines in the same data center. Similarly, neighboring CDNs provide fast access to the same content from the Internet. Cooperative caching, in which data are fetched from a neighboring cache instead of from the disk or from the Internet, can significantly improve resource utilization and performance in such scenarios.
However, existing cooperative caching approaches fail to address the selfish nature of cache owners and their conflicting objectives. This calls for a new storage model that explicitly considers the cost of cooperation, and provides a framework for calculating the utility each owner derives from its cache and from cooperating with others. We define such a model, and construct four representative cooperation approaches to demonstrate how (and when) cooperative caching can be successfully employed in such large scale systems. We present principal guidelines for cooperative caching derived from our experimental analysis. We show that choosing the best cooperative approach can decrease the system’s I/O delay by as much as 87%, while imposing cooperation when unwarranted might increase it by as much as 92%.
Improving Flash-based Disk Cache with Lazy Adaptive Replacement (Presentation)
Sai Huang and Dan Feng,
Wuhan National Lab for Optoelectronics, Huazhong University of Science and Technology
Qingsong Wei, Jianxi Chen, and Cheng Chen,
Data Storage Institute, A*STAR
Abstract: The increasing popularity of flash memory has changed storage systems. Flash-based solid state drives (SSDs) are now widely deployed as caches for magnetic hard disk drives (HDDs) to speed up data-intensive applications. However, existing cache algorithms focus exclusively on performance improvements and ignore the write endurance of SSDs. In this paper, we propose a novel cache management algorithm for flash-based disk caches, named Lazy Adaptive Replacement Cache (LARC). LARC filters out seldom-accessed blocks and prevents them from entering the cache. This avoids cache pollution and keeps popular blocks in the cache for a longer period of time, leading to a higher hit rate. Meanwhile, LARC reduces the number of cache replacements and thus incurs less write traffic to the SSD, especially for read-dominant workloads. In this way, LARC improves performance and extends SSD lifetime at the same time. LARC is self-tuning and low-overhead. It has been extensively evaluated by both trace-driven simulations and a prototype implementation in flashcache. Our experiments show that LARC outperforms state-of-the-art algorithms and reduces write traffic to the SSD by up to 94.5% for read-dominant workloads and by 11.2-40.8% for write-dominant workloads.
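The lazy admission idea can be sketched as follows: a block's address is first recorded in a ghost queue (addresses only, no data), and the block is admitted to the SSD cache only on a second recent access. This is a simplified sketch in the spirit of the abstract; the paper's adaptive sizing of the ghost queue is omitted and the names are ours.

```python
from collections import OrderedDict

class LazyAdmissionCache:
    """Sketch of lazy admission for an SSD disk cache: one-touch blocks
    never cause an SSD write, which both avoids cache pollution and
    reduces write traffic to the flash device."""

    def __init__(self, cache_size, ghost_size):
        self.cache = OrderedDict()   # LRU cache of real blocks (on SSD)
        self.ghost = OrderedDict()   # addresses seen once, data not cached
        self.cache_size = cache_size
        self.ghost_size = ghost_size

    def access(self, lba, read_block):
        if lba in self.cache:                 # SSD cache hit
            self.cache.move_to_end(lba)
            return self.cache[lba]
        block = read_block(lba)               # miss: fetch from the HDD
        if lba in self.ghost:
            # Second recent access: now worth an SSD write to admit it.
            del self.ghost[lba]
            self.cache[lba] = block
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)
        else:
            # First access: remember the address only; no SSD write yet.
            self.ghost[lba] = None
            if len(self.ghost) > self.ghost_size:
                self.ghost.popitem(last=False)
        return block
```

For read-dominant workloads with many one-hit blocks, this filter is where most of the reported SSD write savings would come from.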
Chair: Prof. Ethan Miller,
University of California, Santa Cruz
Sang-Hoon Kim, and Seungryoul Maeng,
Jaesoo Lee and Chanik Park,
Samsung Electronics Co.
Abstract: The notion of object-based storage devices (OSDs) has been proposed to overcome the limitations of the traditional block-level interface, which hinders the development of intelligent storage devices. The main idea of OSD is to virtualize the physical storage into a pool of objects and offload the burden of space management onto the storage device. We explore the possibility of adopting this idea for solid state drives (SSDs).
The proposed object-based SSDs (OSSDs) allow more efficient management of the underlying flash storage, by utilizing object-aware data placement, hot/cold data separation, and QoS support for prioritized objects. We propose the software stack of OSSDs and implement an OSSD prototype using an iSCSI-based embedded storage device. Our evaluations with various scenarios show the potential benefits of the OSSD architecture.
Optimizing a hybrid SSD/HDD HPC storage system based on file size distributions (Presentation)
and Geoffrey Noer,
Abstract: We studied file size distributions from 65 customer installations and a total of nearly 600 million files. We found that between 25% and 90% of all files are 64 Kbytes or less in size, yet these files account for less than 3% of the capacity in most cases. In extreme cases, 5% to 15% of capacity is occupied by small files. We used this information to size the ratio of SSD to HDD capacity on our latest HPC storage system. Our goal is to automatically place all of the block-level and file-level metadata, and all of the small files, on SSD, and use the much cheaper HDD storage for large file extents. The unique storage blade architecture of the Panasas system, which couples SSD, HDD, processor, memory, and networking into a scalable building block, makes this approach very effective. Response time measured by metadata-intensive benchmarks is several times better in our systems that couple SSD and HDD. The paper describes the measurement methodology, the results from our customer survey, and the performance benefits of our approach.
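The sizing argument above amounts to a simple back-of-the-envelope calculation over a file-size distribution. The 64 KB cutoff comes from the abstract; the per-file metadata cost here is an assumed illustrative value, not a Panasas figure.

```python
def ssd_fraction(file_sizes, small_cutoff=64 * 1024, metadata_per_file=4096):
    """Estimate the fraction of total capacity that should live on SSD if
    all metadata and all files <= small_cutoff bytes are placed there,
    with large file extents on HDD. metadata_per_file is an assumption."""
    ssd_bytes = metadata_per_file * len(file_sizes)            # all metadata
    ssd_bytes += sum(s for s in file_sizes if s <= small_cutoff)  # small files
    hdd_bytes = sum(s for s in file_sizes if s > small_cutoff)    # large extents
    total = ssd_bytes + hdd_bytes
    return ssd_bytes / total if total else 0.0
```

Run over a distribution like the one in the abstract (most files small, most bytes in large files), the SSD fraction comes out tiny, which is why a modest amount of SSD per storage blade can absorb all metadata and small-file traffic.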