
What is AWS S3?

S3 stands for Simple Storage Service.

Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, and inexpensive data storage infrastructure.

Unlike storage systems such as the Unix file system or HDFS (i.e. Hadoop Distributed File System), which are organized as folders and files, S3 is based on the concepts of a “key” and an “object”. Amazon S3 stores data as objects within a bucket, which is a logical unit of storage. An object consists of a file and, optionally, metadata that describes that file.

To store an object in Amazon S3, you upload the file you want to store to a bucket. When you upload a file, you can set permissions on the object as well as any metadata. Buckets are the containers: you control access per bucket, can view access logs for the bucket and its objects, and choose the geographical region where Amazon S3 stores the bucket and its contents. Customers are not charged for creating buckets, but are charged for storing objects in a bucket and for transferring objects in and out of buckets.

The Amazon S3 data model is a flat structure: there is no hierarchy of sub-buckets or sub-folders. You can, however, infer a logical hierarchy using key name prefixes and delimiters, which is how the Amazon S3 console supports a concept of folders. For example, the following key implies a documents/csv/ folder path:

documents/csv/datafeed.csv
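The folder inference described above can be sketched in plain Python. This is a minimal illustration of how a console-style “folder” view is derived from S3’s flat key namespace using a prefix and a delimiter (mirroring the Prefix and Delimiter parameters of S3’s list-objects API); it is not the AWS SDK itself.

```python
def list_keys(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) the way an S3 listing would."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter becomes a "folder"
            # (a "common prefix" in S3 terms).
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

keys = [
    "documents/csv/datafeed.csv",
    "documents/readme.txt",
    "images/logo.png",
]

# Listing the top level shows two "folders" even though S3 stores no folders:
print(list_keys(keys))                       # ([], ['documents/', 'images/'])
# Listing under documents/ shows one object and one sub-"folder":
print(list_keys(keys, prefix="documents/"))  # (['documents/readme.txt'], ['documents/csv/'])
```

The point of the sketch: the “folders” exist only in how the listing groups keys, not in how the data is stored.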

Each Amazon S3 object has data (e.g. a file), a key, and metadata (e.g. the object creation date, or a privacy classification such as protected, sensitive, or public). A key uniquely identifies the object within a bucket. Object metadata is a set of name-value pairs that you can set when you upload the object. Metadata cannot be modified in place after uploading, but you can make a copy of the object and set new metadata on the copy.
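The key/object model and the copy-to-change-metadata behaviour can be sketched with an in-memory bucket. This is a toy model for illustration only, not the boto3 API (in real S3 the equivalent is a copy request that replaces the metadata):

```python
bucket = {}  # a bucket maps keys to objects

def put_object(key, data, metadata=None):
    """Store an object: its data plus a set of name-value metadata pairs."""
    bucket[key] = {"data": data, "metadata": metadata or {}}

def copy_object(src_key, dst_key, new_metadata):
    """S3 has no in-place metadata update; copying the object with a
    replacement metadata set is how new metadata is applied."""
    src = bucket[src_key]
    bucket[dst_key] = {"data": src["data"], "metadata": new_metadata}

put_object("documents/csv/datafeed.csv", b"a,b,c",
           {"classification": "protected"})

# "Changing" the metadata = copying the object (here onto the same key)
# with the new metadata set:
copy_object("documents/csv/datafeed.csv", "documents/csv/datafeed.csv",
            {"classification": "public"})

print(bucket["documents/csv/datafeed.csv"]["metadata"])  # {'classification': 'public'}
```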

Advantages of S3

  • Elasticity

If you were to use HDFS on Amazon EC2 (i.e. Elastic Compute Cloud) infrastructure and your storage requirements grew, you would need to add AWS EBS (i.e. Elastic Block Store) volumes and other EC2 resources to scale up. You would also need to take additional steps for monitoring, backups, and disaster recovery.

S3 decouples compute from storage. This decoupling allows you to easily (i.e. elastically) scale your storage up or down.

S3’s opt-in versioning feature automatically retains previous versions of modified or deleted objects, making it easy to recover from accidental data deletion.
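The versioning behaviour can be sketched with a small in-memory model. This is an illustration of the semantics only, not the boto3 API: each PUT appends a new version, a DELETE appends a “delete marker” rather than erasing data, and earlier versions stay recoverable.

```python
versions = {}  # key -> list of versions, newest last

def put(key, data):
    versions.setdefault(key, []).append({"data": data, "delete_marker": False})

def delete(key):
    # Deleting a versioned object adds a delete marker; nothing is erased.
    versions.setdefault(key, []).append({"data": None, "delete_marker": True})

def get(key, version_index=-1):
    """Fetch a version (the latest by default, as a plain GET would)."""
    v = versions[key][version_index]
    if v["delete_marker"]:
        raise KeyError(f"{key} is deleted at that version")
    return v["data"]

put("report.csv", b"v1")
put("report.csv", b"v2")
delete("report.csv")          # the latest "version" is now a delete marker

print(get("report.csv", -2))  # b'v2' -- the older version is still recoverable
```

Recovery from an accidental delete is then just reading (or re-copying) an earlier version.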

  • Cost

S3 is 3 to 5 times cheaper than AWS EBS (i.e. Elastic Block Store), which HDFS on EC2 uses.

  • Performance

S3 consumers don’t have the data locally, so all reads must transfer data across the network, and S3 performance tuning itself is a black box. Since HDFS data is local to the compute nodes, it is much faster (e.g. 3 to 5 times) than S3. S3 also has higher read/write latency than HDFS.

  • Availability & Durability & Security

Availability guarantees system uptime, and durability guarantees that data, once written, survives permanently. S3 claims 99.999999999% (11 nines) durability and 99.99% availability, whereas HDFS on EBS offers an availability of around 99.9%.

S3’s cross-region replication feature can be used for disaster recovery, and it strengthens availability because the data can survive the complete outage of an AWS region.

S3 has easy-to-configure audit logging and access-control capabilities. These features, along with multiple types of encryption, make it easier to meet regulatory compliance needs such as PCI (i.e. Payment Card Industry) or HIPAA (i.e. Health Insurance Portability and Accountability Act) compliance.
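The 11-nines durability figure above can be made concrete with simple arithmetic, using AWS’s own illustration: if the annual probability of losing any given object is 1 − 0.99999999999, then storing 10,000,000 objects means expecting to lose a single object roughly once every 10,000 years on average.

```python
# 99.999999999% durability => annual loss probability of ~1e-11 per object.
annual_loss_probability = 1 - 0.99999999999
objects_stored = 10_000_000

expected_losses_per_year = objects_stored * annual_loss_probability
years_per_single_loss = 1 / expected_losses_per_year

print(expected_losses_per_year)  # ~1e-4 objects lost per year
print(years_per_single_loss)     # ~10,000 years per single object lost
```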

  • Multipart Upload

You can break your larger objects (e.g. > 100 MB) into chunks and upload a number of chunks in parallel. If the upload of a chunk fails, you can simply restart that chunk.
You can improve your overall upload speed by taking advantage of this parallelism.

For example, you can break a 10 GB file into as many as 1024 separate parts and upload each one independently, as long as each part has a size of 5 MB or more (the final part may be smaller).
If the upload of a part fails, it can be restarted without affecting any of the other parts.
S3 returns an ETag in response to each uploaded part. Once you have uploaded all of the parts, you ask S3 to assemble the full object with one more call.
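The part-size arithmetic behind the example above can be sketched as follows. This is a planning helper for illustration (the numbered parts would then be uploaded via the multipart-upload API, each returning an ETag), not an AWS SDK call:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3's 5 MB minimum for all but the last part

def plan_parts(object_size, part_size):
    """Split an object into numbered (part_number, offset, length) chunks."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below the 5 MB S3 minimum")
    parts = []
    offset = 0
    part_number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts

# A 10 GB object in 10 MB parts -> 1024 parts, matching the example above;
# each part can be uploaded (and, on failure, retried) independently.
parts = plan_parts(10 * 1024**3, 10 * 1024**2)
print(len(parts))  # 1024
```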

 
