Media Wars: Disk versus FLASH in the Struggle for Capacity and Performance
The conference, as is our custom, will dedicate a full week to computer-storage technology, including three days of tutorials, two days of invited papers, two days of peer-reviewed research papers, and a vendor exposition. The conference will be held, once again, on the beautiful campus of Santa Clara University, in the heart of Silicon Valley.
Many Thanks to Our Sponsors!
8:30 Mirantis OpenStack Tutorial
Stephen Nemeth, Technical Instructor, Mirantis
Abstract: OpenStack continues to gain momentum and deployments. This half-day tutorial, led by Mirantis Training instructors, will focus on the purpose of OpenStack, its ecosystem, OpenStack terminology and architecture, providing you with a high-level understanding of how to operate an OpenStack cluster.
1:30 Akanda Network Tutorial
Sean Roberts, David Lenwell, Akanda
Abstract: Akanda is an open source suite of software, services, orchestration, and tools for providing L3+ services in OpenStack. It builds on top of OpenBSD, Packet Filter, and OpenStack Neutron, and is used in production to power DreamCompute's networking capabilities. Using Akanda, an OpenStack provider can provide tenants with a rich, powerful set of L3+ services, including routing, port forwarding, firewalling, and more. Using a hands-on approach, attendees will have the opportunity to deploy and test their own system during the tutorial.
8:30 Upstream Tutorial
Sean Roberts, Rama Puranam, David Lenwell, Akanda
With over 1,000 developers from 130 different companies worldwide, OpenStack is one of the largest collaborative software-development projects. Because of its size, it is characterized by a huge diversity in social norms and technical conventions. These can significantly slow down the speed at which changes by newcomers are integrated into the OpenStack project.
We've designed a training program to accelerate the speed at which new OpenStack developers succeed at integrating their own roadmap into that of the OpenStack project. We have taken a slice of the two-day OpenStack Upstream Training program and broken out the session dealing with development interaction. This four-hour live class teaches students to navigate the intricacies of a project's technical teams and social interactions using Legos. It is a lot of fun and very informative about the way upstream development teams, companies, and individual technical contributors behave and react to milestones.
1:30 Cosmos OpenSSD Tutorial
Yong Ho Song, Associate Professor, Electronic Engineering, Hanyang University
Abstract: The Cosmos OpenSSD is an open-source SSD system which can be freely used by students, researchers, and engineers. It may be used to develop and evaluate software and hardware technology related to memory-based storage solutions. The tutorial gives a brief introduction to the SSD system, and explains individual software and hardware components to help such development and evaluation.
Attendees may attend any track or move between them during the day.
8:30—5:00 Red Hat Ceph Tutorial
Abstract: Red Hat Ceph Storage is a massively scalable, open source, software-defined storage system that gives you unified storage for your cloud environment. In this tutorial, basics of Ceph design and operating principles will be covered. Attendees will also have the opportunity to deploy and test their own Ceph system during the tutorial.
8:30—5:00 SwiftStack Swift Tutorial
Manzoor Brar, John Dickinson, SwiftStack
Abstract: In this full-day workshop, you'll learn why the largest public clouds today, including IBM, Rackspace, and HP, are using OpenStack Swift to deliver massively scalable, multi-geographically distributed storage for their customers. You'll also learn how to get up and running with a Swift cluster.
In the hands-on lab portion, we'll cover:
An overview on Swift Fundamentals
How easy it is to consume storage at mega-scale while improving throughput
Use Cases and Integration
Selecting and Optimizing Hardware
Hands-on deploying and interacting with a Swift cluster
More specifically, you'll leave understanding how to deploy and manage a Swift cluster, which use cases are the best fit for OpenStack Swift's object storage, and how easy it is to build apps that use Swift to store and serve their assets (videos, images, docs, PDFs, etc.).
Attendees will be given access to a remote Linux machine for the purposes of this tutorial. Please make sure you have the following for the tutorial session:
- Laptop with SSH capabilities
- Chrome, Safari, or Firefox installed
8:30—5:00 Lustre Tutorial
Keith Mannthey, Intel
Abstract: Learn about Lustre, the open-source, high-performance parallel file system that powers some of today's largest HPC systems. The tutorial will be a mix of lecture and hands-on training, with a focus on the use and administration of Lustre. The session will cover a Lustre overview and architecture, LNet, file striping, Lustre HSM, and basic Lustre telemetry.
Data Intensive Research: Enabling and Optimising Flexible "Big Data" Workflows (Presentation)
Commonwealth Scientific and Industrial Research Organisation, Australia
Abstract: Data storage, data management and research workflows need an integrated approach. Consideration of one without the others leaves data at risk of becoming orphaned. Risks to 'big data' must be considered and treated at all levels. No component of research workflow, application development or infrastructure can be undertaken without consideration of the implications to the data. This presentation looks at what can be achieved today, based on existing technologies and real scientific computing workflows. It provides a scalable framework to survive and prosper, as datasets and workflows continue to grow.
9:30 Scaling Storage for Big Science
Session Chair: Ellen Salmon, NASA Goddard
LHC Exascale Storage Systems — Design and Implementation (Presentation)
Dirk Duellmann, CERN
Abstract: CERN's storage systems capture data from the world's large particle accelerators. In this talk, CERN's existing infrastructure and plans for the future, including object storage and FLASH, will be presented.
European Centre for Medium-Range Weather Forecasts (ECMWF), United Kingdom
Abstract: ECMWF is a scientific organisation that uses large supercomputers to predict the weather. Over the years, it has built an archive of active data which has now grown beyond 100PB. This information is kept in a typical HSM environment, with less than 5% of the data residing on direct-access devices, the rest being stored on tape. This presentation provides a broad overview of the archive structure and access, explaining how some of the inconveniences typically linked to the use of tape storage have been minimised, and looks at some possible alternative solutions that ECMWF may want to implement in the future.
Migrating NASA Archives to Disk: Challenges and Opportunities (Presentation)
Chris Harris, NASA Langley
Abstract: NASA Langley's Atmospheric Science Data Center (ASDC) is making several large transitions in its storage infrastructure to support growth, including an emphasis on disk and new backup technologies. This talk will outline these changes and how they can support future system growth.
Data-Driven Science: Advanced Storage for Genomics Workflows (Presentation)
Abstract: Dr. Reaney gives a brief perspective of computational solutions for genomics analysis with an eye towards how the generation and manipulation of genomics data has both enabled and constrained the science. An overview of a few SGI customers and their workflows in the genomics research space is presented. With ever-expanding genomics workflows in mind, Dr. Reaney introduces the SGI UV system with NVMe storage as a tool capable of addressing both present and especially future workflows, enabling the science in ways not possible with other architectures.
1:30 Panel: Can Disks Replace Tape for Long-Term Storage Applications?
Moderator: Matthew O’Keefe, Hitachi Data Systems
Abstract: Due to the widespread use of disks in massive-scale systems, they are de facto replacing tape for long-term archives in some installations. In this panel, participants will discuss power consumption, space, migration, and tape/disk archive-management challenges as disks strive for dominance in long-term storage and archival applications.
Ethan Miller, University of California, Santa Cruz / Pure Storage
3:30 Panel: Security at Massive Scale
Henry Newman, Instrumental, a Seagate Company
Abstract: In the IT industry, there are reports almost weekly of security breaches impacting PII, HIPAA, and other data. Combined with critical-infrastructure impacts to US Government systems, this clearly shows that the current, mostly network-centric security methods are inadequate to address today's complex cyber environment. This panel on security will define how SELinux, a security ecosystem, is being used to create a secure system meeting the needs of complex application environments.
New Architectures for Storage Scalability
Session Chair: Dan Duffy, NASA
Sam Fineberg, HP
Abstract: The massive data explosion emerging over the next decade will be so big that today’s computing infrastructure won’t be able to keep up. To address this challenge, HP is developing "The Machine", diverging from today’s computing paradigm by fusing memory and storage, flattening memory hierarchies, leveraging system on chip architectures, and using photonic interconnects. To make this work, many advances in operating systems and programming models will be required. This talk presents some of the software design challenges being addressed by researchers at HP Labs to address tomorrow’s computing challenges.
5:30 Lightning Talks
Session Chair: Dan Duffy, NASA
(Short talks, with time allotted depending on the number of speakers. Speakers may request slots before or at the conference.)
Jim Gerry, IBM
Sharad Mehrotra, Saratoga Speed
The Requirement for Automation in Large Scale Data Management (Presentation)
Dave Fellinger, Data Direct Networks
Extreme Data Virtualization: Leveraging Gluster, ZFS, and Open Source in Large-Scale Data Virtualization (Presentation)
Scott Sinno, NCCS/NASA
End to End Life Cycle Management for Scientific Research Data (Presentation)
Jacob Farmer, Starfish Storage and Cambridge Computer
Abstract: Starfish (*FS) associates metadata with files and directories in conventional file systems and object stores to create a virtual global namespace that is governed by a rules engine. Starfish captures metadata throughout the scientific pipeline. Metadata are used for reporting, annotation, and rules-based management.
Adrian Palmer, Seagate
Can Smart Storage be the Future of Machine Learning?
Abstract: The field of Machine Learning is currently undergoing a significant transformation. From a niche methodology comprehensible only to experts, it is becoming a mainstream tool available to the masses. From modeling thousands of data instances, it now needs to make sense of billions. From extending sophisticated intractable models, it now has to focus on parallelizing conceptually simple optimization procedures. From prototyping in Matlab, Machine Learning practitioners are now moving to programming GPUs and FPGAs, but they still use generic storage. In this talk, I will outline requirements for designing the smart storage systems of the future, to build the foundation for the next generation of intelligent data processing.
Storage for Exascale
Session Chair: Gary Grider, Los Alamos National Laboratory
Abstract: Three emerging trends must be considered when assessing how HPC storage will operate at exascale. First, exascale simulation workflows will greatly expand the volume and complexity of data and metadata to be stored and analysed. Second, ever increasing core and node counts will require corresponding scaling of application concurrency while simultaneously increasing the frequency of hardware failure. Third, new NVRAM technologies will allow storage, accessible at extremely fine grain and low latency, to be distributed across the entire cluster fabric to exploit full cross-sectional bandwidth. This talk describes Distributed Application Object Storage (DAOS)—a new storage architecture that Intel is developing to address the scalability and resilience issues and exploit the performance opportunities presented by these emerging trends.
Wrangler: A New Generation of Data-Intensive Supercomputing (Presentation)
Chris Jordan, Texas Advanced Computing Center
Abstract: In this talk, we will introduce Wrangler, a new class of data-intensive system designed from the ground up for a new generation of data-intensive applications. Comprising over 100 Intel Haswell-based compute nodes attached to over half a petabyte of NAND-flash-based storage and 10PB of geographically replicated mass storage, Wrangler's unique characteristics will enable many new applications for researchers with I/O-intensive workloads. Wrangler can provide up to 270 million IOPS and an overall bandwidth of 1 TB/s, more than 200x the per-node bandwidth of the Blue Waters system. Wrangler will provide not just valuable data-intensive compute capabilities but also new modes of utilization, including persistent provision of services for data collections coupled with data-management and analysis features. In this talk, we will discuss how Wrangler will enable not just scientists but also technologists, industry partners, and educators to use TACC's cloud gateways and portal interfaces to tackle the biggest scientific challenges that we face today.
Wrangler is part of XSEDE, the world’s most advanced, powerful, and robust collection of integrated advanced digital resources and services to allow scientists to interactively share computing and data expertise. We will briefly discuss how Wrangler fits into the larger ecosystem at TACC and at the national scale, enabling complex workflows with diverse needs.
11:00 Panel: Leveraging FLASH in Integrated, Scalable Systems
Gary Grider, Los Alamos National Laboratory
Abstract: FLASH storage continues to make inroads in large system designs. In this panel, participants will state their views on how FLASH can best be used to improve system performance, power usage, density, and other parameters in large-scale systems.
Chris Jordan, Texas Advanced Computing Center
Brian Van Essen, Lawrence Livermore National Laboratory
Challenges in Enterprise Storage Scalability
Session Chair: André Brinkmann, Johannes Gutenberg-University Mainz
Scalability in Large-Volume Enterprise Array Environments (Presentation)
Randy Olinger, Optum/United Health Group
Abstract: Optum/UHG is a $130 billion health insurance company with a large and growing storage infrastructure. This talk will outline solutions for managing hundreds of enterprise storage arrays while preparing for growth in large-scale object storage.
1:30 Panel: Leveraging Flash in Large-Scale Enterprise and Web Systems
Moderator: Randy Olinger, Optum/United Health Group
Abstract: FLASH storage continues to make inroads in large system designs. In this panel, participants will state their views on how FLASH can best be used to improve performance, power usage, and density in large-scale systems involving 100s of Petabytes of storage and potentially thousands of storage arrays in one system.
Michael Hay, Hitachi Data Systems
Panel: Scalable Object Stores
Abstract: HPC, web, and enterprise storage systems are emphasizing scalable object stores (both in-house and in-cloud) for a wide variety of new applications. In this panel, we will discuss and debate challenges in scaling, programming, securing, and maintaining large object stores. Alternative designs and implementations and their respective strengths and weaknesses will be discussed and debated.
Michael Declerck, Amazon
Lessons Learned from Distributed Systems and Applied to Modern Storage Platforms
Abstract: Modern distributed systems have been around for a decade or more. But the convergence of several trends (reliable, powerful commodity infrastructure; flash and high-capacity drives; 10 and 40 Gigabit Ethernet) makes distributed systems a reality for enterprises, not just web giants. This presentation will discuss Avinash Lakshman's experience in inventing and operating two notable distributed systems: Amazon Dynamo and Apache Cassandra. Avinash will discuss lessons learned from these NoSQL systems and how they can be applied to building more scalable and flexible storage platforms.
Panel: Limits of Scalability for Software-Defined Storage Systems
Moderator: Dan Duffy, NASA
Abstract: Software-defined storage systems are gaining traction in web, enterprise, and HPC data centers. This panel will address scalability challenges and opportunities in building and deploying software-defined storage systems.
5:30 Lightning Talks
Session Chair: Matthew O’Keefe, Hitachi Data Systems
(Short talks, with time allotted depending on the number of speakers. Speakers may request slots before or at the conference.)
André Brinkmann, Johannes Gutenberg-University Mainz
Emerging Storage Technologies in the Data Recovery Lab (Presentation)
Chris Bross, DriveSavers Data Recovery
Eric Carter, Hedvig
MarFS: A Near-POSIX Scalable File System Using Distributed Object Stores (Presentation)
Gary Grider, Los Alamos National Laboratory
7:00 Vendor Reception
(* Indicates Presenter)
8:30 New Memory Technologies
Youyou Lu*, Jiwu Shu, Long Sun, Tsinghua University
Abstract: Persistent memory provides data persistence at main memory level and enables memory-level storage systems. To ensure consistency of the storage systems, memory writes need to be transactional and are carefully moved across the boundary between the volatile CPU cache and the persistent memory. Unfortunately, the CPU cache is hardware-controlled, and it incurs high overhead for programs to track and move data blocks from being volatile to persistent.
In this paper, we propose a software-based mechanism, Blurred Persistence, to blur the volatility-persistence boundary, so as to reduce the overhead in transaction support. Blurred Persistence consists of two techniques. First, Execution in Log executes a transaction in the log to eliminate duplicated data copies for execution. It allows the persistence of volatile uncommitted data, which can be detected by reorganizing the log structure. Second, Volatile Checkpoint with Bulk Persistence allows the committed data to aggressively stay volatile by leveraging the data durability in the log, as long as the commit order across threads is kept. By doing so, it reduces the frequency of forced persistence and improves cache efficiency. Evaluations show that our mechanism improves system performance by 56.3% to 143.7% for a variety of workloads.
A Study of Application Performance with Non-Volatile Main Memory (Paper)
Yiying Zhang*, Steven Swanson, University of California, San Diego
Abstract: Attaching next-generation non-volatile memories (NVMs) to the main memory bus provides low-latency, byte-addressable access to persistent data that should significantly improve performance for a wide range of storage-intensive workloads. We present an analysis of storage application performance with non-volatile main memory (NVMM) using a hardware NVMM emulator that allows fine-grain tuning of NVMM performance parameters. Our evaluation results show that NVMM improves storage application performance significantly over flash-based SSDs and HDDs. We also compare the performance of applications running on realistic NVMM with the performance of the same applications running on idealized NVMM with the same performance as DRAM. We find that although NVMM is projected to have higher latency and lower bandwidth than DRAM, these differences have only a modest impact on application performance. A much larger drag on NVMM performance is the cost of ensuring data resides safely in the NVMM (rather than the volatile caches) so that applications can make strong guarantees about persistence and consistency. In response, we propose an optimized approach to flushing data from CPU caches that minimizes this cost. Our evaluation shows that this technique significantly improves performance for applications that require strict durability and consistency guarantees over large regions of memory.
SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory (Paper)
Ellis Giles*, Peter Varman, Rice University
Kshitij Doshi, Intel
Abstract: In-memory computing is gaining popularity as a means of sidestepping the performance bottlenecks of block storage operations. However, the volatile nature of DRAM makes these systems vulnerable to system crashes, while the need to continuously refresh massive amounts of passive memory-resident data increases power consumption.
Emerging storage-class memory (SCM) technologies combine fast DRAM-like cache-line access granularity with the persistence of storage devices like disks or SSDs, resulting in potential 10x - 100x performance gains, and low passive power consumption.
This unification of storage and memory into a single directly accessible persistent tier raises significant reliability and programmability challenges. In this paper, we present SoftWrAP, an open-source framework for Software based Write-Aside Persistence. SoftWrAP provides lightweight atomicity and durability for SCM storage transactions, while ensuring fast paths to data in processor caches, DRAM, and persistent memory tiers. We use our framework to evaluate both handcrafted SCM-based micro-benchmarks as well as existing applications, specifically the STX B+Tree library and SQLite database, backed by emulated SCM.
Our results show the ease of using the API to create atomic persistent regions and the significant benefits of SoftWrAP over existing methods such as undo logging and shadow copying; SoftWrAP can match non-atomic durable writes to SCM, thereby gaining atomic consistency almost for free.
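The write-aside idea described above can be illustrated with a toy sketch (plain Python, with dictionaries standing in for SCM and the alias table; all class and method names here are illustrative, not the SoftWrAP API):

```python
# Toy sketch of write-aside persistence: transactional writes are
# recorded in an append-only log (the "aside") and an alias table;
# reads consult the alias table before home locations. On commit the
# log would be flushed and fenced; a background retirement step later
# copies committed values to their home locations.
class WriteAsideStore:
    def __init__(self):
        self.home = {}      # stands in for home SCM locations
        self.log = []       # stands in for the persistent write-aside log
        self.alias = {}     # addr -> latest unretired value

    def tx_write(self, addr, value):
        self.log.append((addr, value))   # append-only, cheap to persist
        self.alias[addr] = value

    def read(self, addr):
        # fast path: aliased (recently written) data wins
        return self.alias.get(addr, self.home.get(addr))

    def commit(self):
        # on real SCM: flush log entries from CPU caches and fence here
        pass

    def retire(self):
        # background step: copy committed values home, then drop aliases
        for addr, value in self.log:
            self.home[addr] = value
        self.log.clear()
        self.alias.clear()

store = WriteAsideStore()
store.tx_write("x", 1)
store.tx_write("x", 2)
assert store.read("x") == 2   # served from the alias table
store.commit()
store.retire()
assert store.read("x") == 2   # now served from the home location
```

The design point this sketches is that writes reach durable storage once (in the log) on the critical path, while in-place updates happen lazily off the critical path.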
10:30 Flash, Flash, Flash
Reducing MLC Flash Memory Retention Errors through Programming Initial Step Only (Paper)
Wei Wang*, Tao Xie, San Diego State University
Antoine Khoueir, Youngpil Kim, Seagate Technology
Abstract: Since retention errors have been recognized as the most dominant errors in MLC (multi-level cell) flash, in this paper we propose a new approach called PISO (Programming Initial Step Only) to reduce their number. Unlike a normal programming operation, a PISO operation carries out only the first programming-and-verifying step on a programmed cell. As a result, a number of electrons are injected into the cell to compensate for its charge loss over time without disturbing its existing data. Further, we build a model to understand the relationship between the number of PISOs and the number of errors avoided. Experimental results from 1y-nm MLC chips show that PISO can efficiently reduce the number of retention errors with minimal overhead. On average, applying 10 PISO operations each month to a one-year-old MLC chip that has experienced 4K P/E cycles can reduce its retention errors by 21.5% after 3 months.
Improving MLC Flash Performance and Endurance with Extended P/E Cycles (Paper)
Fabio Margaglia*, André Brinkmann, University of Mainz
Abstract: The traditional usage pattern for NAND flash memory is the program/erase (P/E) cycle: the flash pages that make a flash block are all programmed in order and then the whole flash block needs to be erased before the pages can be programmed again. The erase operations are slow, wear out the medium, and require costly garbage collection procedures. Reducing their number is therefore beneficial both in terms of performance and endurance. The physical structure of flash cells unfortunately limits the number of opportunities to overcome the 1 to 1 ratio between programming and erasing pages: a bit storing a logical 0 cannot be reprogrammed to a logical 1 before the end of the P/E cycle.
This paper presents a technique called the extended P/E cycle that minimizes the number of erase operations. With extended P/E cycles, flash pages can be programmed many times before the whole flash block needs to be erased, dramatically reducing the number of erase operations. The paper includes the design of an FTL based on Multi-Level Cell (MLC) NAND flash chips, and its implementation on the OpenSSD prototyping board. The evaluation of our prototype shows that this technique can reduce erase operations by as much as 85%, with latency speedups of up to 67%, with respect to an FTL with traditional P/E cycles and a naive greedy garbage-collection strategy. Our evaluation leads to valuable insights into how extended P/E cycles can be exploited by future applications.
Incremental Redundancy to Reduce Data Retention Errors in Flash-based SSDs (Paper)
Heejin Park*, Donghee Lee, University of Seoul
Jaeho Kim, Sam H. Noh, Hongik University
Jongmoo Choi, Dankook University
Abstract: As the market becomes competitive, SSD manufacturers are making use of multi-bit-cell flash memory such as MLC and TLC chips in their SSDs. However, these chips have shorter data retention periods and lower endurance than SLC chips. With the reduced data retention period and endurance level, retention errors occur more frequently. One solution to these retention errors is to employ strong ECC to increase error-correction strength. However, employing strong ECC may waste resources during the early stages of flash memory lifetime, when reliability is high and data retention errors are rare. The other solution is to employ data scrubbing, which periodically refreshes data by reading and then writing the data to new locations after correcting errors through ECC. Though it is a viable solution to the retention-error problem, data scrubbing hurts the performance and lifetime of SSDs as it incurs extra read and write requests. Targeting data retention errors, we propose incremental redundancy (IR), which incrementally reinforces error-correction capabilities when the data retention error rate exceeds a certain threshold. This extends the time before data scrubbing must occur, providing a grace period in which the block may be garbage collected. We develop mathematical analyses that project the lifetime and performance of IR as well as of conventional data scrubbing. Through mathematical analyses and experiments with both synthetic and real workloads, we compare the lifetime and performance of the two schemes. Results suggest that IR can be a promising solution to overcome the data retention errors of contemporary multi-bit-cell flash memory. In particular, our study shows that IR can extend the maximum data retention period by 5 to 10 times. Additionally, we show that IR can reduce the write amplification factor by half under real workloads.
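The threshold-driven policy described in the abstract can be sketched as a small decision function (an illustrative sketch in the spirit of IR; the parameter names and discrete ECC levels are assumptions, not the paper's model):

```python
def maintain_block(error_rate, ecc_level, max_ecc_level, threshold):
    """Toy maintenance policy in the spirit of incremental redundancy:
    when the retention error rate exceeds the threshold, first add a
    redundancy increment (strengthening ECC in place); only fall back
    to scrubbing (read, correct, rewrite elsewhere) once no further
    increments remain. Returns (action, new_ecc_level)."""
    if error_rate <= threshold:
        return "none", ecc_level
    if ecc_level < max_ecc_level:
        return "add_redundancy", ecc_level + 1
    return "scrub", 0   # the rewrite resets the block to base redundancy

assert maintain_block(0.001, 0, 3, 0.01) == ("none", 0)
assert maintain_block(0.02, 1, 3, 0.01) == ("add_redundancy", 2)
assert maintain_block(0.02, 3, 3, 0.01) == ("scrub", 0)
```

The grace period mentioned in the abstract falls out naturally: each "add_redundancy" step defers the expensive rewrite, during which garbage collection may reclaim the block anyway.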
1:30 Flash/NVM and File Systems
Removing the Costs and Retaining the Benefits of Flash-Based SSD Virtualization with FSDV (Paper)
Yiying Zhang*, University of California, San Diego
Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, University of Wisconsin, Madison
Abstract: We present the design, implementation, and evaluation of the File System De-Virtualizer (FSDV), a system that dynamically removes a layer of indirection common in modern storage stacks and decreases indirection space and performance costs. FSDV is a flexible, light-weight tool that de-virtualizes data by changing file system pointers to use device physical addresses. When FSDV is not running, the file system and the device both maintain their virtualization layers and perform normal I/O operations. We implement FSDV with the ext3 file system and an emulated flash-based SSD. Our evaluation results show that FSDV can significantly reduce indirection mapping table space in a dynamic way while preserving the foreground I/O performance. We also demonstrate that FSDV only requires small changes to existing storage systems.
Luis Cavazos Quero*, Jin-Soo Kim, Sungkyunkwan University
Young-Sik Lee, KAIST
Abstract: Nowadays, solid-state drives (SSDs) are gaining popularity and are replacing magnetic hard disk drives (HDDs) in enterprise storage systems. As a result, extracting the maximum performance from SSDs is becoming crucial to dealing with increasing storage volume and performance needs. Active disks were introduced as a way to offload data-processing tasks from the host into disks, freeing system resources and achieving better performance.
In this work, we present an active SSD architecture called Self-Sorting SSD that offloads sorting operations, which are commonly used in data-intensive and database environments and require heavy data transfer. Processing sorting operations directly on the SSD reduces data transfer from/to the storage devices, increasing system performance and the SSD's lifetime. Experiments on a real SSD platform reveal that our proposed architecture can outperform traditional external merge sort by more than 60%, reduce energy consumption by up to 58%, and eliminate all the data transfer overhead of computing sorted results.
Chris Dragga*, Douglas Santry, NetApp, Inc.
Abstract: File-system snapshots have been a key component of enterprise storage management since their inception. Creating and managing them efficiently, while maintaining flexibility and low overhead, has been a constant struggle. Although the current state-of-the-art mechanism, hierarchical reference counting, performs reasonably well for traditional small-file workloads, these workloads are increasingly vanishing from the enterprise data center, replaced instead with virtual machine and database workloads. These workloads center around a few very large files, violating the assumptions that allow hierarchical reference counting to operate efficiently. To better cope with these workloads, we introduce GCTrees, a novel method of space management that uses concepts of block lineage across snapshots rather than explicit reference counting. As a proof of concept, we create a prototype file system, gcext4, a modified version of ext4 that uses GCTrees as a basis for snapshots and copy-on-write. In evaluating this prototype analytically, we find that, though they have a somewhat higher overhead for traditional workloads, GCTrees have dramatically lower overhead than hierarchical reference counting for large-file workloads, improving by a factor of 34 or more in some cases. Furthermore, gcext4 performs comparably to ext4 across all workloads, showing that GCTrees impose only a minor cost for their benefits.
Priya Sehgal, Sourav Basu*, Kiran Srinivasan, Kaladhar Voruganti, NetApp Inc.
Abstract: Emerging byte-addressable, non-volatile memory such as phase-change memory and STT-MRAM brings persistence at latencies within an order of magnitude of DRAM, thereby motivating its inclusion on the memory bus. According to some recent work on NVM, traditional file systems are ineffective and sub-optimal at accessing data on this low-latency medium. However, no systematic performance study across different file systems and their various configurations has validated this point. In this work, we evaluate the performance of various legacy Linux file systems under various real-world workloads on non-volatile memory (NVM) simulated using a ramdisk and compare it against the NVM-optimized file system PMFS. Our results show that while the default file system configurations are mostly sub-optimal for NVM, these legacy file systems can be tuned using mount and format options to achieve performance comparable to an NVM-aware file system such as PMFS. Our experiments show that the performance difference between PMFS and ext2/ext3 with the execute-in-place (XIP) option is around 5% for many workloads (TPCC and YCSB). Furthermore, based on the learning from our performance study, we present a few key file system features, such as an in-place update layout with XIP and parallel metadata and data allocations, that could be leveraged by file system designers to improve the performance of both legacy and new file systems for NVM.
3:30 Hot and Cold Data
WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management
Yixin Luo*, Yu Cai, Saugata Ghose, Onur Mutlu, Carnegie Mellon University
Jongmoo Choi, Dankook University
Abstract: Increased NAND flash density has come at the cost of lifetime reductions. This lifetime can be extended by relaxing internal retention time, the duration for which a flash cell correctly holds data. Such relaxation cannot be exposed externally, to avoid altering the expected non-volatility. Reliability mechanisms, most prominently refresh, restore the non-volatility but greatly reduce the lifetime improvements by performing a greater number of write operations. We find that retention time relaxation can be achieved more efficiently by exploiting heterogeneity in write-hotness, the frequency at which each page is written.
We propose WARM, a write-hotness-aware retention management policy for flash memory that identifies and physically groups write-hot data within the flash device, allowing the flash controller to selectively perform retention time relaxation with little cost. When applied alone, WARM improves overall flash lifetime by an average of 3.2x over a conventional management policy without refresh, across a variety of real file system traces. When WARM is applied together with refresh, the average lifetime improves by 12.9x over conventional management, and by 1.2x over naive refresh alone.
Classifying Data to Reduce Long Term Data Movement in Shingled Write Disks
Stephanie Jones*, Ethan Miller, Darrell Long, Rekha Pitchumani, Christina Strong, University of California, Santa Cruz
Ahmed Amer, Santa Clara University
Abstract: Shingled Magnetic Recording (SMR) is a means of increasing the density of hard drives that brings a new set of challenges. Due to the nature of SMR disks, updating in place is not an option. Holes left by invalidated data can only be filled if the entire band is reclaimed, and a poor band compaction algorithm could result in spending a lot of time moving blocks over the lifetime of the device. We propose using write frequency to separate blocks to reduce data movement and develop a band compaction algorithm that implements this heuristic. We demonstrate how our algorithm results in improved data management, resulting in an up to 47% reduction in required data movements when compared to naive approaches to band management.
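The write-frequency separation the abstract describes can be sketched in a few lines. The threshold policy and band packing below are illustrative assumptions, not the authors' actual algorithm:

```python
def assign_bands(write_counts, band_size, hot_threshold):
    """Split blocks into hot and cold groups by write frequency, then
    pack each group into bands of band_size blocks. Cold bands rarely
    accumulate invalidated holes, so they seldom need compaction."""
    hot = [b for b, c in write_counts.items() if c > hot_threshold]
    cold = [b for b, c in write_counts.items() if c <= hot_threshold]

    def pack(blocks):
        return [blocks[i:i + band_size] for i in range(0, len(blocks), band_size)]

    return pack(hot), pack(cold)
```

Grouping by write frequency means a compaction pass over a cold band moves data that is unlikely to be invalidated again soon, which is the source of the reduction in long-term data movement the paper reports.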
(* Indicates Presenter)
8:30 Big Systems
Rekha Pitchumani*, Shayna Frank, Ethan L. Miller, University of California, Santa Cruz
Abstract: Benchmarks are widely used to perform apples-to-apples comparisons in a controlled and reliable fashion, and they must model real-world workload behavior. In recent years, to meet web-scale demands, key-value (KV) stores have emerged as a vital component of cloud serving systems. The Yahoo! Cloud Serving Benchmark (YCSB) has emerged as the standard benchmark for evaluating key-value systems and is preferred by both industry and academia. Though YCSB provides a variety of options to generate realistic workloads, like most benchmarks it ignores the temporal characteristics of the generated workloads. YCSB's constant-rate request arrival process is unrealistic and fails to capture real-world arrival patterns.
Existing workload studies of disk, file system, key-value system, network, and web traffic show that they all exhibit common temporal properties such as burstiness, self-similarity, long-range dependence, and diurnal activity. In this work, we show that the commonly observed traffic patterns can be modeled using three categories of arrival processes: (a) Poisson, (b) self-similar, and (c) envelope-guided. These three categories form a necessary and sufficient set of request arrival models that all storage benchmarks should provide. To demonstrate the ease of incorporating these models in benchmarks, we have modified YCSB to generate workloads based on all three, and we show the effect of realistic request arrivals through an example database evaluation.
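To illustrate the gap between a constant-rate generator and the simplest of the proposed models, a Poisson arrival process can be sketched as follows (function name and parameters are illustrative, not YCSB's API):

```python
import random

def poisson_arrivals(rate, duration, seed=42):
    """Generate request arrival times for a Poisson process: inter-arrival
    gaps are exponentially distributed with mean 1/rate. A constant-rate
    generator would instead space requests exactly 1/rate apart,
    producing none of the burstiness seen in real traces."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= duration:
            return times
        times.append(t)
```

The self-similar and envelope-guided models require more machinery (e.g., heavy-tailed on/off sources), but they plug into a benchmark the same way: by replacing the inter-arrival gap distribution.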
ASCAR: Automating Contention Management for High-Performance Storage Systems
Yan Li*, Xiaoyuan Lu, Ethan Miller, Darrell Long, University of California, Santa Cruz
Abstract: High-performance parallel storage systems, such as those used by supercomputers and data centers, can suffer performance degradation when a large number of clients contend for limited resources, such as bandwidth. These contentions lower the efficiency of the system and cause unwanted speed variance. We present the Automatic Storage Contention Alleviation and Reduction system (ASCAR), a storage traffic management system for improving bandwidth utilization and the fairness of resource allocation. ASCAR regulates I/O traffic from the clients using a rule-based algorithm that controls the congestion window and rate limit. The rule-based client controllers respond quickly to bursty I/O because no runtime coordination between clients or with a central coordinator is needed; they are also autonomous, so the system has no scale-out bottleneck. Finding optimal rules can be a challenging task that requires expertise and numerous experiments. ASCAR includes a SHAred-nothing Rule Producer (SHARP) that produces rules in an unsupervised manner by systematically exploring the solution space of possible rule designs and evaluating the target workload under candidate rule sets. Evaluation shows that our ASCAR prototype can improve the throughput of all tested workloads, some by as much as 35%. ASCAR improves the throughput of a NASA NPB BTIO checkpoint workload by 33.5% while reducing its speed variance by 55.4%. The optimization time and controller overhead are unrelated to the scale of the system; thus, it has the potential to support future large-scale systems with millions of clients and thousands of servers. As a pure client-side solution, ASCAR requires no change to either the hardware or the server software.
Chunbo Lai, Shiding Lin, Zhenyu Hou, Can Cui, Baidu Inc.
Song Jiang*, Wayne State University
Liqiong Yang, Guangyu Sun, Peking University
Jason Cong, University of California, Los Angeles
Abstract: Users store a rapidly increasing amount of data in the cloud. Cloud storage service is often characterized by a large data set and few deletes. Hosting the service on a conventional system that uses servers with powerful CPUs, managed by either a key-value (KV) system or a file system, is not efficient. First, as demand for storage capacity grows much faster than demand for CPU power, the existing server configuration can lead to CPU under-utilization and inadequate storage. Second, as data durability is of paramount importance and storage capacity can be limited, a data protection scheme relying on data replication is not space efficient. Third, because of the unique data size distribution (mostly a few KBytes), hard disks may suffer from an unnecessarily large request rate (when data are stored as KV pairs and need constant re-organization) or too many random writes (when data are stored as relatively small files).
At Baidu this inefficiency has become an urgent concern, as data are uploaded into the storage at an increasingly higher rate and both the user base and the system are rapidly expanding. To address this challenge, we adopt a customized compact server design based on ARM processors and replace three-copy replication for data protection with erasure coding, enabling low-power, high-density storage. Furthermore, a huge number of objects are stored in the system, such as photos, MP3 music, and documents, but their sizes do not allow efficient operations in conventional KV systems. To this end we propose an innovative architecture separating metadata and data management to enable efficient data coding and storage. The resulting production system, named Atlas, is a highly scalable, reliable, and cost-effective KV store supporting Baidu's cloud storage service.
10:30 New Algorithms
GreenCHT: A Power-Proportional Replication Scheme for Consistent Hashing Based Key Value Storage Systems
Nannan Zhao*, Jiguang Wan, Changsheng Xie, Huazhong University of Science and Technology
Jun Wang, University of Central Florida
Abstract: Distributed key-value storage systems are widely used by many popular networking corporations. Nevertheless, server power consumption has become a growing concern for key-value storage system designers, since the power consumption of servers contributes substantially to a data center's power bills. In this paper, we propose GreenCHT, a power-proportional replication scheme for consistent-hashing-based key-value storage systems. GreenCHT consists of a power-aware replication strategy (a multi-tier replication strategy) and a centralized power control service (a predictive power mode scheduler). The multi-tier replication provides power proportionality and ensures data availability, reliability, consistency, and fault tolerance for the whole system. The predictive power mode scheduler predicts workloads and schedules servers to be powered up and down without compromising the performance of the system. GreenCHT is implemented on Sheepdog, a distributed key-value system that uses consistent hashing as an underlying distributed hash table. By replaying twelve typical real workload traces collected from Microsoft, we observed that GreenCHT can reduce power consumption by 35%-61% while maintaining acceptable performance.
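A minimal sketch of the consistent-hashing placement that underlies schemes like GreenCHT: keys map to the first node clockwise on a hash ring, and replicas go to the next distinct successor nodes, a structure a tiered replication policy can power down selectively. The virtual-node count, hash choice, and replica rule here are illustrative assumptions, not Sheepdog's or GreenCHT's actual implementation:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring with virtual nodes. replicas() returns
    the r distinct nodes clockwise from the key's hash position."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((self._h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def replicas(self, key, r=3):
        idx = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        out = []
        for _, node in self.ring[idx:] + self.ring[:idx]:
            if node not in out:
                out.append(node)          # next distinct successor node
            if len(out) == r:
                return out
        return out
```

Because the successor list for a key is stable, a power scheduler knows exactly which replica servers cover which keys when deciding what to power down.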
Leap-based Content Defined Chunking — Theory and Implementation
Chuanshuai Yu, Chengwei Zhang, Yiping Mao*, Fulu Li, Huawei Technologies Co., Ltd.
Abstract: Content Defined Chunking (CDC) has been one of the key technologies in data deduplication, affecting both deduplication efficiency (the deduplication ratio) and deduplication performance (the computing speed). The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios, since they must compute the judgment function once per byte for almost half of the whole data stream. In this paper we present a leap-based CDC algorithm that significantly improves deduplication performance, thanks to the novel leap technique it introduces, while still providing the same deduplication efficiency. Compared to the sliding-window-based CDC algorithm, our algorithm shows about a 1X improvement in performance.
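A minimal sliding-window-style CDC sketch illustrates the per-byte judgment function whose cost the leap-based algorithm reduces. The window size, divisor, size limits, and use of CRC32 in place of a Rabin fingerprint are illustrative assumptions:

```python
import zlib

def chunk_boundaries(data, window=16, divisor=64, min_size=64, max_size=1024):
    """Declare a chunk boundary wherever a hash of the trailing `window`
    bytes satisfies a divisor test, subject to min/max chunk sizes. The
    judgment function runs once per byte between min_size and max_size."""
    boundaries, start, i = [], 0, min_size
    while i <= len(data):
        if i - start >= max_size:              # force a boundary at max_size
            boundaries.append(i)
            start, i = i, i + min_size
            continue
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            boundaries.append(i)               # content-defined boundary
            start, i = i, i + min_size
        else:
            i += 1                             # judgment function, per byte
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries
```

Because boundaries depend on content rather than offsets, an insertion early in a file shifts only nearby chunks, which is what lets unchanged regions deduplicate.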
Dongwoo Kang*, Seungjae Baek, Jongmoo Choi, Dankook University
Donghee Lee, University of Seoul
Sam H. Noh, Hongik University
Onur Mutlu, Carnegie Mellon University
Abstract: One characteristic of non-volatile memory (NVM) is that, even though it supports non-volatility, its retention capability is limited. To handle this issue, previous studies have focused on refreshing or advanced error correction code (ECC). In this paper, we take a different approach that makes use of the limited retention capability to our advantage. Specifically, we employ NVM as a file cache and devise a new scheme called amnesic cache management (ACM). The scheme is motivated by our observation that most data in a cache are evicted within a short time period after they have been entered into the cache, implying that they can be written with the relaxed retention capability. This retention relaxation can enhance the overall cache performance in terms of latency and energy since the data retention capability is proportional to the write latency. In addition, to prevent the retention relaxation from degrading the hit ratio, we estimate the future reference intervals based on the inter-reference gap (IRG) model and manage data adaptively. Experimental results with real-world workloads show that our scheme can reduce write latency by up to 40% (30% on average) and save energy consumption by up to 49% (37% on average) compared with the conventional LRU based cache management scheme.
MinCounter: An Efficient Cuckoo Hashing Scheme for Cloud Storage Systems
Yuanyuan Sun*, Yu Hua, Dan Feng, Ling Yang, Pengfei Zuo, Shunde Cao, Huazhong University of Science and Technology
Abstract: With the rapid growth in the amount of information, cloud computing servers need to process and analyze large amounts of high-dimensional and unstructured data in a timely and accurate manner, which usually requires many query operations. Due to their simplicity and ease of use, cuckoo hashing schemes have been widely used in real-world cloud-related applications. However, due to potential hash collisions, cuckoo hashing suffers from endless loops and high insertion latency, and even a high risk of reconstruction of the entire hash table. To address this problem, we propose a cost-efficient cuckoo hashing scheme called MinCounter. The idea behind MinCounter is to alleviate endless loops during data insertion: MinCounter selects "cold" (infrequently accessed) buckets to handle hash collisions rather than random buckets. MinCounter offers efficient insertion and query services and obtains performance improvements in cloud servers, as well as enhancing the experience for cloud users. We have implemented MinCounter in a large-scale cloud testbed and examined its performance using three real-world traces. Extensive experimental results demonstrate the efficacy and efficiency of MinCounter.
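The "cold bucket" heuristic can be sketched on a toy two-choice cuckoo table. The counter policy below only approximates MinCounter's idea, and the hash count, kick limit, and overflow stash are illustrative conveniences, not the paper's design:

```python
class MinCounterCuckoo:
    """Toy two-choice cuckoo hash table. On a collision, the key in the
    candidate bucket with the smallest eviction counter (the "coldest"
    bucket) is displaced, rather than a random one. The stash exists so
    this sketch never silently drops a displaced key."""

    def __init__(self, size=128, max_kicks=32):
        self.size = size
        self.slots = [None] * size
        self.counter = [0] * size       # per-bucket eviction counts
        self.max_kicks = max_kicks
        self.stash = []                 # overflow for failed insertions

    def _buckets(self, key):
        return [hash((key, i)) % self.size for i in range(2)]

    def insert(self, key):
        for _ in range(self.max_kicks):
            candidates = self._buckets(key)
            for b in candidates:
                if self.slots[b] is None:
                    self.slots[b] = key
                    return True
            b = min(candidates, key=lambda i: self.counter[i])  # coldest
            self.counter[b] += 1
            self.slots[b], key = key, self.slots[b]             # displace
        self.stash.append(key)          # give up after max_kicks
        return False

    def lookup(self, key):
        return key in self.stash or any(
            self.slots[b] == key for b in self._buckets(key))
```

Steering evictions toward rarely evicted buckets biases the random walk away from the loops that repeated evictions of the same hot buckets tend to create.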
1:30 Systems and Performance
Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet
Pilar Gonzalez-Ferez*, ICS-FORTH and University of Murcia
Angelos Bilas, FORTH-ICS and University of Crete
Abstract: Small I/O requests are important for a large number of modern workloads in the data center. Traditionally, storage systems have only needed to achieve low I/O rates for small I/O operations because of hard disk drive (HDD) limitations: HDDs are capable of only about 100-150 IOPS (I/O operations per second) per spindle. Therefore, host CPU processing capacity and network link throughput have been relatively abundant for providing these low rates. With new storage device technologies, such as NAND flash solid state drives (SSDs) and non-volatile memory (NVM), it is becoming common to design storage systems that are able to support millions of small IOPS. At these rates, however, both the server CPU and the network protocol are emerging as the main bottlenecks to achieving high rates for small I/O requests.
Most storage systems in data centers deliver I/O operations over some network protocol. Although there has been extensive work on low-latency, high-throughput networks such as InfiniBand, Ethernet has dominated the data center. In this work we examine how networked storage protocols over raw Ethernet can achieve low host CPU overhead and increased network link efficiency for small I/O requests. We first analyze in detail the latency and overhead of a networked storage protocol run directly over Ethernet and point out the main inefficiencies. Then, we examine how storage protocols can take advantage of context-switch elimination and adaptive batching to reduce CPU and network overhead.
Our results show that raw Ethernet is appropriate for supporting fast storage systems. For 4kB requests we reduce server CPU overhead by up to 45% and improve link utilization by up to 56%, achieving more than 88% of the theoretical link throughput. Effectively, our techniques serve 56% more I/O operations over a 10Gbit/s link than a baseline protocol without our optimizations, at the same CPU utilization. Overall, to the best of our knowledge, this is the first work to present a system that achieves 14us host CPU overhead on both initiator and target for small networked I/Os over raw Ethernet without hardware support. In addition, our approach achieves 287K 4kB IOPS out of the 315K IOPS that are theoretically possible over a 1.2GByte/s link.
Wei Zhang*, Pure Storage
Daniel Agun, Tao Yang, Rich Wolski, University of California, Santa Barbara
Hong Tang, Alibaba Inc.
Abstract: Data deduplication is important for snapshot backup of virtual machines (VMs) because of excessive redundant content. Fingerprint search for source-side duplicate detection is resource intensive when the backup service for VMs is co-located with other cloud services. This paper presents the design and analysis of a fast VM-centric backup service with a tradeoff for a competitive deduplication efficiency while using small computing resources, suitable for running on a converged cloud architecture that cohosts many other services. The design consideration includes VM-centric file system block management for the increased VM snapshot availability. This paper describes an evaluation of this VM-centric scheme to assess its deduplication efficiency, resource usage, and fault tolerance.
Improving Performance by Bridging the Semantic Gap between Multi-queue SSD and I/O Virtualization Framework
Tae Yong Kim*, Dong Hyun Kang, Dongwoo Lee, Young Ik Eom, Sungkyunkwan University
Abstract: Virtualization has become one of the most helpful techniques, and today it is prevalent in several computing environments, including desktops, data centers, and enterprises. However, an I/O scalability issue in virtualized environments still needs to be addressed, because the I/O layers are implemented to be oblivious to the I/O behaviors of virtual machines (VMs). In particular, when a multi-queue solid state drive (SSD) is used as secondary storage, each VM reveals a semantic gap that degrades the overall performance of the VM by up to 74%. This is due to two key problems. First, the multi-queue SSD increases the possibility of lock contention. Second, even though both the host machine and the multi-queue SSD provide multiple I/O queues for I/O parallelism, the existing Virtio-Blk-Data-Plane supports only a single I/O queue, served by one I/O thread, for submitting all I/O requests. In this paper, we propose a novel approach, including the design of virtual CPU (vCPU)-dedicated queues and I/O threads, which efficiently distributes the lock contention and addresses the parallelism issue of Virtio-Blk-Data-Plane in virtualized environments. We design our approach on this principle, allocating a dedicated queue and an I/O thread for each vCPU to reduce the semantic gap. We implement our approach on Linux 3.17, modifying both the Virtio-Blk frontend driver of the guest OS and the Virtio-Blk backend driver of Quick Emulator (QEMU) 2.1.2. Our experimental results with various I/O traces clearly show that our design improves I/O operations per second (IOPS) in virtualized environments by up to 167% over existing QEMU.
Joel Frank*, Shayna Frank, Lincoln Thurlow, Ethan Miller, Darrell Long, University of California, Santa Cruz
Thomas Kroeger, Sandia National Laboratories
Abstract: Maintaining information privacy is challenging when sharing data across a distributed long-term datastore. In such applications, secret-splitting the data across independent sites has been shown to be a superior alternative to fixed-key encryption; it improves reliability, reduces the risk of insider threat, and removes the issues surrounding key management. However, the inherent security of such a datastore normally precludes it from being searched directly without reassembling the data, which is neither computationally feasible nor without risk, since reassembly introduces a single point of compromise. As a result, the secret-split data must be pre-indexed in some way to facilitate searching. Previously, fixed-key encryption has also been used to securely pre-index the data, but in addition to key-management issues, it is not well suited for long-term applications.
To meet these needs, we have developed Percival: a novel system that enables searching a secret-split datastore while maintaining information privacy. We leverage salted hashing, performed within hardware security modules, to access pre-recorded queries that have been secret split and stored in a distributed environment; this keeps the bulk of the work on each client, and the data custodians blinded to both the contents of a query as well as its results. Furthermore, Percival does not rely on the datastore's exact implementation. The result is a flexible design that can be applied to both new and existing secret-split datastores. When testing Percival on a corpus of approximately one million files, it was found that the average search operation completed in less than one second.
SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management
Yukun Zhou*, Dan Feng, Wen Xia, Min Fu, Fangting Huang,
Yucheng Zhang, Chunguang Li, Huazhong University of Science and Technology
Abstract: Nowadays, many customers and enterprises back up their data to cloud storage that performs deduplication to save storage space and network bandwidth. Hence, how to perform secure deduplication has become a critical challenge for cloud storage. According to our analysis, the state-of-the-art secure deduplication methods are not suitable for cross-user fine-grained data deduplication: they either suffer from brute-force attacks that can recover files falling into a known set, or incur large computation (time) overheads. Moreover, existing approaches to convergent key management incur large space overheads because of the huge number of chunks shared among users.
Our observation that cross-user redundant data come mainly from duplicate files motivates us to propose an efficient secure deduplication scheme, SecDep. SecDep employs User-Aware Convergent Encryption (UACE) and Multi-Level Key management (MLK). (1) UACE combines cross-user file-level and intra-user chunk-level deduplication, and exploits different security policies across and within users to minimize computation overheads. Specifically, both file-level and chunk-level deduplication use variants of Convergent Encryption (CE) to resist brute-force attacks. The major difference is that the file-level CE keys are generated using a server-aided method to ensure the security of cross-user deduplication, while the chunk-level keys are generated using a user-aided method with lower computation overheads. (2) To reduce key space overheads, MLK uses a file-level key to encrypt the chunk-level keys, so that the key space does not grow with the number of sharing users. Furthermore, MLK splits the file-level keys into share-level keys and distributes them to multiple key servers to ensure the security and reliability of the file-level keys.
Our security analysis demonstrates that SecDep ensures data confidentiality and key security. Our experimental results, based on several large real-world datasets, show that SecDep is more time-efficient and key-space-efficient than state-of-the-art secure deduplication approaches.
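The convergent-encryption property that makes cross-user deduplication of ciphertext possible can be shown in a few lines. The XOR keystream cipher below is an insecure stand-in for illustration only; SecDep itself builds on CE variants with server- or user-aided key generation precisely because plain CE is vulnerable to brute-force attacks:

```python
import hashlib

def convergent_key(chunk: bytes) -> bytes:
    # Derive the key from the content: identical chunks yield identical
    # keys, so their ciphertexts deduplicate even across users.
    return hashlib.sha256(chunk).digest()

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Toy keystream cipher (hash in counter mode); NOT secure. A real
    # system would use AES. Encryption and decryption are the same op.
    stream = b""
    ctr = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(d ^ s for d, s in zip(data, stream))

chunk = b"a data chunk shared by two users"
ct_alice = xor_cipher(chunk, convergent_key(chunk))
ct_bob = xor_cipher(chunk, convergent_key(chunk))
assert ct_alice == ct_bob                                    # dedupable
assert xor_cipher(ct_alice, convergent_key(chunk)) == chunk  # round-trip
```

MLK's contribution sits on top of this: the per-chunk keys are themselves encrypted under a single file-level key, so key storage does not grow with the number of users sharing the chunks.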
Dr. Sam Coleman
Sean Roberts
Dr. Matthew O'Keefe

Research General Chair
Dr. Ahmed Amer

Research Program Chairs
James Hughes, Dr. Peter Desnoyers

Research Program Committee
Dr. Ahmed Amer
Dr. JoAnne Holliday, Yi Fang