s3.md

# Simple Storage Service (S3)

S3 is private by default.

- Only the account root user that created the bucket has initial access to the bucket.

S3 security is controlled via a combination of Identity Policies, Bucket Policies (Resource Policies) and legacy Bucket and Object ACLs.

## Considerations

- Default limit of the number of S3 buckets in an AWS account: 100
- Maximum object size is 5TB, and there is no limit on the number of objects in a bucket.

---

## S3 Bucket Policies

- A Type of **Resource policy**
- Just like identity policies, but attached to a bucket instead of to identities
- One bucket can have only one bucket policy
- One bucket policy can have multiple statements

### Identity policy limitation

- With an identity policy you define what that identity can access.
- Identity policies can only be attached to identities in your own account, so they can control security only inside your account.
- There is no way to grant access to identities outside your own account.

### Resource Perspective permission

- With a resource policy you define who can access that resource.
- You can ALLOW/DENY access to the resource from the same account or from a different account.
- A resource policy can define access no matter what the source of the access is.
- A resource policy can allow or deny ANONYMOUS principals.
- A resource policy can therefore allow access without any authentication to AWS.

### Principal

- Resource policies differ from identity policies by the presence of an explicit `Principal` in the bucket policy.
- `Principal` defines which principals are affected by the bucket policy.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": ["arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"]
    }
  ]
}
```

- In an identity policy, `Principal` is not defined, as it's implied that the policy applies to the principal it's attached to.
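
A policy document like the one above is attached to the bucket via the API. A minimal boto3 sketch, assuming the placeholder bucket name from the example:

```python
import json
import boto3

s3 = boto3.client("s3")

bucket = "DOC-EXAMPLE-BUCKET"  # hypothetical bucket name used for illustration

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:GetObjectVersion"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        }
    ],
}

# Attach (or replace) the single bucket policy
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```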

### Effective Permission

- If an identity is accessing a bucket within the same AWS account, the effective permission is the combination of all applicable identity policies and the bucket policy.
- If access to the bucket is by an anonymous identity, only the bucket policy applies; no identity policy applies.
- If an identity in an external AWS account tries to access the bucket in your account, your bucket policy applies, as well as all applicable identity policies in their account.
  - It's a two-step process: the external identity must be allowed to access S3 and your bucket, and your bucket must allow access from the external account.

---

## S3 Access Control Lists (ACL)

This is another form of S3 security. It has largely been replaced by bucket policies.

- ACLs are a way to apply security to objects and buckets
- They are inflexible and support only very simple permissions

---

## Block Public Access

S3 now blocks public access (access by anonymous principals) by default.

- These settings apply only to anonymous (public) principals
- These settings can be set when you create the bucket or afterwards

Options:

- Block Public Access `(Blocks any public access, no matter what the bucket policy is; see the sketch after this list)`
  - Block public access to buckets and objects granted through _new_ ACLs `(Any existing public access granted by existing ACLs is allowed, but access granted by new ACLs is blocked)`
  - Block public access to buckets and objects granted through _any_ access control lists `(Any public access via ACLs is denied, whether granted before or after Block Public Access was enabled)`
  - Block public access to buckets and objects granted through _new_ public bucket or access point policies `(Any existing public access granted by existing policies is allowed, but access granted by new policies is blocked)`
  - Block public and cross-account access to buckets and objects through _any_ public bucket or access point policies `(Blocks existing and new bucket or access point policies from granting public or cross-account access)`
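
These four options map onto the `PublicAccessBlockConfiguration` settings in the API. A minimal boto3 sketch, assuming a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

s3.put_public_access_block(
    Bucket="DOC-EXAMPLE-BUCKET",  # hypothetical bucket name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,        # block NEW public ACLs
        "IgnorePublicAcls": True,       # ignore ANY public ACLs, existing or new
        "BlockPublicPolicy": True,      # block NEW public bucket/access point policies
        "RestrictPublicBuckets": True,  # restrict ANY public/cross-account policy access
    },
)
```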

---

## Choosing between Identity policy, Bucket Policy and ACL

- If you are granting or denying permissions on lots of different resources across an AWS account, use Identity Policies `(Not every resource type supports resource policies, and you would need a resource policy on every service if you used resource policies instead.)`
- If you prefer to manage all permissions in one single place, that has to be IAM, so identity policies make sense here. `(You may use resource policies some of the time, but you use identity policies all the time.)`
- If you are only working with one single account, IAM can manage all the policies. `(IAM can only work with identities that you control in your account.)`

- If you want to directly allow external identities or anonymous identities, use Resource Policies.
- Never use ACLs, unless you must.

---

# S3 Static Hosting

- Normal access to S3 is via the AWS APIs, over HTTP(S) calls.
- Static website hosting allows the objects in a bucket to be accessed via standard HTTP, e.g. for websites.
- Static hosting needs to be enabled, setting an `Index` and an `Error` document (see the sketch below); both need to be HTML documents.
- When accessing a specific page, that specific page is delivered.
- When you don't specify a page, the `Index` page is delivered.
- When something goes wrong, the `Error` page is delivered.
- AWS creates a `Website Endpoint` from which the assets in the bucket can be accessed over HTTP.
- The endpoint name is chosen based on the:
  - bucket name and
  - region name
- You can use custom domains only if the bucket name matches the domain name.
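
Enabling static hosting is a single API call that sets the index and error documents. A minimal boto3 sketch, assuming the bucket and document names:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; index.html and error.html are assumed object keys
s3.put_bucket_website(
    Bucket="www.example.com",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)

# The website endpoint is then derived from the bucket name and region, roughly:
# http://<bucket>.s3-website-<region>.amazonaws.com (exact format varies by region)
```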

For Static Hosting to work:

- Versioning is not required
- Block Public Access has to be disabled
- That alone is not sufficient, as S3 is private by default
- In addition, you can either select all the objects in the bucket and choose `Make Public` (this adds ACLs to the objects and is not the recommended way)
- Or, preferably, add a Bucket Policy like the one below

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::www.example.com/*"]
    }
  ]
}
```

Static hosting use cases:

- Offloading
- Out-of-band pages

---

## Offloading

Use S3 to deliver any media instead of the compute service.

Delivery via S3 is much cheaper when compared to compute service.

So the compute service can return an HTML page, which references static assets hosted in S3.

---

## Out-of-band pages

If the compute service delivering the HTML pages is under maintenance, it cannot show any page.

So, during such maintenance windows, we use something called out-of-band pages. These are generally used to show a status page or a support page.

---

# S3 Pricing

- Storage cost
  - Pricing is based on every GB of data stored
  - And on every month it's stored in S3
- Data Transfer cost
  - Transfer of data into S3 is always free.
  - For every GB of data that you transfer out of S3, there is a cost.
  - And a price per 1,000 requests
- Static hosting charges
  - Every PUT, COPY, POST and LIST is charged per 1000 requests

## Free Tier

As part of the Free Tier:

- 5GB of standard storage is provided
- 20,000 GET requests
- 2,000 PUT requests

---

# S3 Versioning

Versioning is off by default. Once it is turned on, it cannot be turned off.

- You can only suspend versioning, it cannot go back to disabled state.
- When suspended, old versions still exist, and you will still be billed for them.
- So to save cost you can move the latest version to a new bucket or purge the older versions from the existing bucket.

```

Disabled (Default)  -->  Enabled  <-->  Suspended

(Enabled and Suspended can switch back and forth;
 once enabled, you can never return to Disabled)

```
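
The versioning state is changed with a single call, and the same call is used to suspend it later. A minimal boto3 sketch, assuming a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")
bucket = "DOC-EXAMPLE-BUCKET"  # hypothetical bucket name

# Disabled -> Enabled (one-way; you can never return to Disabled)
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Enabled -> Suspended (existing versions remain and are still billed)
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Suspended"},
)
```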

Without versioning an object is solely identified by its key.

```

key = image.jpg
id  = null

```

Versioning lets you store multiple versions of objects within a bucket. Operations which would modify objects, generate a new version.

```

Before new version                 After new version (PUT image.jpg)

key = image.jpg       ----->       key = image.jpg   (current)
id  = 111111                       id  = 222222

                                   key = image.jpg
                                   id  = 111111

```

- Latest Version or Current Version will be returned if no id is specified in the request.
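
A minimal boto3 sketch of reading versions, assuming placeholder names; the latest version is returned when no `VersionId` is supplied:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "DOC-EXAMPLE-BUCKET", "image.jpg"  # hypothetical names

# List every version of the object
for v in s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", []):
    print(v["VersionId"], v["IsLatest"])

# No VersionId -> the current (latest) version is returned
latest = s3.get_object(Bucket=bucket, Key=key)

# Explicit VersionId -> that specific version ("111111" is a placeholder id)
older = s3.get_object(Bucket=bucket, Key=key, VersionId="111111")
```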

## Deletion with Versioning

When an object is deleted without specifying a version id, AWS adds a delete marker and hides all previous versions. Deleting this marker un-hides the object again.

- To fully delete an object, you must delete all of its versions by specifying their version ids (see the sketch after the diagram below).

```

Before delete                       After delete (no version id)

key = image.jpg                     key = image.jpg   {Delete Marker}
id  = 222222           Delete
                       ----->       key = image.jpg   (hidden)
key = image.jpg                     id  = 222222
id  = 111111
                                    key = image.jpg   (hidden)
                                    id  = 111111

```
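
A minimal boto3 sketch of the two delete paths, assuming placeholder names:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "DOC-EXAMPLE-BUCKET", "image.jpg"  # hypothetical names

# Delete with no VersionId: S3 only adds a delete marker and hides the versions
resp = s3.delete_object(Bucket=bucket, Key=key)
marker_id = resp["VersionId"]  # the delete marker's own version id

# Removing the delete marker "undeletes" the object
s3.delete_object(Bucket=bucket, Key=key, VersionId=marker_id)

# Deleting a specific version id removes that version permanently
s3.delete_object(Bucket=bucket, Key=key, VersionId="111111")
```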

## MFA Delete

Enabled within the versioning configuration of a bucket. This means:

- MFA is required to change the bucket versioning state.
- MFA is required to delete versions.
- You pass the MFA device serial number and the current code with the API calls (see the sketch below).
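
The serial number and code are passed together in the `MFA` parameter of the versioning call. A minimal boto3 sketch, where the bucket, account id, device and code are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# MFA Delete changes are typically made with the bucket owner's credentials
s3.put_bucket_versioning(
    Bucket="DOC-EXAMPLE-BUCKET",                               # hypothetical bucket
    MFA="arn:aws:iam::111122223333:mfa/root-device 123456",    # "<serial> <code>"
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)
```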

---

# S3 Performance Optimisation

## Single PUT Upload

By default, an object uploaded to S3 `(s3:PutObject)` is sent as a single stream of data.

- If the stream fails, the whole upload fails.
- This requires a full restart of the data transfer.

With a Single PUT Upload, you are limited to 5GB of data.

## Multipart Upload

Data is broken up into smaller parts.

- We start by breaking the original blob of data into parts.
- The original data should be at least 100MB to use multipart upload.

The original blob can be split into a maximum of 10,000 parts.

- Each part can be between 5MB and 5GB
- The last part can be smaller than 5MB

Parts can fail in isolation and be restarted in isolation.
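
The high-level transfer manager in boto3 handles the splitting, parallel part uploads, and per-part retries automatically. A minimal sketch, assuming the file, bucket and key names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above 100 MB
    multipart_chunksize=64 * 1024 * 1024,   # each part is 64 MB (5 MB - 5 GB allowed)
    max_concurrency=10,                     # parts upload in parallel, retry in isolation
)

# Hypothetical local file, bucket and key
s3.upload_file("backup.tar", "DOC-EXAMPLE-BUCKET", "backups/backup.tar", Config=config)
```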

## S3 Transfer Acceleration

While transferring data to a bucket in a distant region (across geographies), the data has to travel over the public internet before it reaches the AWS network, and the public internet is not the optimal path for moving data between distant locations.

- To solve this issue we can use S3 Transfer Acceleration.
- This uses AWS Edge Locations. It transfers the data from upload location to the nearest best performing AWS Edge Location.
- The data is then transferred on AWS global network.
- S3 Transfer Acceleration is disabled by default for a bucket.

Restrictions:

- The bucket name cannot contain periods
- The bucket name must be DNS compatible

Transfer acceleration for the S3 bucket needs to be enabled.

- Once enabled, use the `Accelerated Endpoint` for faster data transfers.
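
A minimal boto3 sketch of enabling acceleration and then using the accelerated endpoint, assuming placeholder bucket and file names:

```python
import boto3
from botocore.config import Config

# Enable acceleration on the bucket (bucket name is a placeholder)
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="example-accelerated-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that sends requests to the accelerate endpoint
s3_fast = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_fast.upload_file("video.mp4", "example-accelerated-bucket", "uploads/video.mp4")
```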

---

# S3 Storage Classes

S3 is a **region resilient** service, which means it can tolerate the failure of an availability zone.

This is done by replicating objects across at least 3 AZs when they are uploaded.

---

## S3 Standard

- The default AWS storage class that's used in S3.
- This has 99.999999999% (11, 9's) for Object Durability and 99.99% (4, 9's) for availability.
- Content-MD5 checksums and Cyclic Redundancy Checks are used to detect and fix any data corruption.
- Objects are replicated across at least 3 AZs in an AWS region.
- All of the other storage classes trade off some of these properties for lower cost.
- Low latency (first byte in milliseconds)

S3 Standard Pricing

- Storage cost
  - Pricing is based on every GB of data stored
  - And on every month it's stored in S3
- Data Transfer cost
  - Transfer of data into S3 is always free.
  - For every GB of data that you transfer out of S3, there is a cost.
  - And a price per 1,000 requests

---

## S3 Standard-IA

Same as S3 Standard

Differences:

- 99.9% (3, 9's) availability, slightly lower than standard S3.
- This is approximately 54% cheaper for the base rate.
- Storage cost
  - Cheaper than S3 Standard
- Retrieval cost
  - A per-GB retrieval fee applies, so overall cost increases with frequent data access
- Minimum duration charge of 30 days applies
- Minimum object size charged is 128KB (costly if you have lots of tiny objects)

---

## S3 One Zone-IA

Same as S3 Standard-IA

- Doesn't provide multi-AZ resilience model of Standard or Standard-IA.
- Only one AZ is used within the region.
- This has 99.999999999% (11, 9's) for Object Durability unless the AZ fails
- Provides 99.5% availability.
- 80% of the base cost of Standard-IA.

---

## S3 Glacier

Similar to S3 Standard it provides the following:

- This has 99.999999999% (11, 9's) for Object Durability and 99.99% (4, 9's) for availability.
- Content-MD5 checksums and Cyclic Redundancy Checks are used to detect and fix any data corruption.
- Objects are replicated across at least 3 AZs in an AWS region.

Costs 1/5 of that of S3 Standard.

You can see the objects in the S3 bucket, but they are just pointers to the actual data. Accessing the data requires a retrieval process.

- During the retrieval process, objects are stored temporarily in the S3 Standard-IA class.
- The temporary copies are removed after the retrieval.
- Objects can be retained permanently by changing the storage class.

90 days minimum billable storage duration charge.

Types of retrieval:

- Expedited `(1 - 5 minutes, most expensive)`
- Standard `(3 - 5 hours)`
- Bulk retrievals `(5 - 12 hours, cheapest)`

Objects cannot be made publicly available.

---

## S3 Glacier Deep Archive

Similar to S3 Glacier and offers:

- 1/4th the price of S3 Glacier
- 1/20th the price of S3 Standard

Designed for long term backups and as a **tape-drive** replacement.

40KB minimum billable object size.

180 days minimum storage duration charge.

Types of retrieval:

- Standard `(12 hours)`
- Bulk retrievals `(Up to 48 hours)`

Objects cannot be made publicly available.

---

## S3 Intelligent-Tiering

Tiers:

- Frequent Access (Similar to S3 Standard)
- Infrequent Access (Similar to S3 Standard-IA)
- Archive (Similar to S3 Glacier)
- Deep Archive (Similar to S3 Glacier Deep Archive)

This monitors objects and automatically moves any object not accessed for 30 days to a low-cost infrequent access tier, and eventually to the archive or deep archive tiers.

Cost is the same as the tier each stored object is mapped to.

- The only additional charge is a monitoring and automation fee per 1,000 objects.

Only move data to Archive or Deep Archive if the data isn't required on an immediate basis.

30 days minimum billable period.

---

# S3 Lifecycle Configuration

- Lifecycle configuration is a set of rules and these rules consist of actions.
- Actions can apply to the whole bucket or groups of objects.

A rule's scope can be chosen between:

- Limiting the scope using one or more filters
- Or applying the rule to all objects in the bucket

Types of Actions:

- Transition (change the storage class of objects)
- Expiration (delete objects after a certain amount of time)

Rules can't be based on access patterns; only Intelligent-Tiering covers that use case (a configuration sketch follows).
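
A minimal boto3 sketch of one rule with a transition action and an expiration action; the bucket name, prefix and day counts are assumptions:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="DOC-EXAMPLE-BUCKET",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},  # limit scope; use {} for the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete objects after a year
            }
        ]
    },
)
```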

## Transitions

Think of lifecycle transitions as a `waterfall`:

```
S3 Standard

    S3 Standard-IA

    S3 Intelligent Tiering

    S3 One Zone-IA

        S3 Glacier

        S3 Glacier Deep Archive
```

Objects can only step down through the storage classes; they can't step back up.

## Considerations

- Smaller objects can cost more, due to minimum object size billing
- An object needs to remain on S3 Standard for a minimum of 30 days before an automatic transition
  - You can instead directly upload to a different storage class
  - The minimum duration only applies to automatic transitions
- Also, within **a single rule**, you have to wait a minimum of 30 days to transition from any of `S3 Standard-IA, S3 Intelligent Tiering or S3 One Zone-IA` into any of `S3 Glacier or S3 Glacier Deep Archive`.
  - You can work around this by defining two different rules, but you will still be charged the minimum duration for each transition.

---

# S3 Replication

S3 has two replication features which allow objects to be replicated between a SOURCE and a DESTINATION bucket in the same or different AWS accounts.

- Replication is done over SSL

Two types of replication supported by S3:

- Cross-Region Replication (CRR) is the process used when Source and Destination are in different AWS regions

- Same-Region Replication (SRR) is used when the buckets are in the same region.

---

## Role based replication

In both cases the replication configuration is applied to the SOURCE bucket. This config specifies:

- the DESTINATION bucket to use
- the IAM role to be used; the role's trust policy allows S3 to assume it
  - The role's permissions policy gives it permission to read objects in the source bucket and replicate them to the destination bucket (see the sketch below)
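
A minimal boto3 sketch of applying a replication configuration to the source bucket; the bucket names and role ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="example-source-bucket",  # hypothetical source (versioning must be enabled)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",  # role S3 assumes
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::example-destination-bucket",
                    "StorageClass": "STANDARD_IA",  # optional: cheaper class at destination
                },
            }
        ],
    },
)
```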

### Source and Destination in same account

When replication happens within the same AWS account, both buckets are owned by, and trust, the same AWS account, so they both trust the same IAM role.

- Here the same role will have access to both source and destination, as long as the role's permissions policy grants that access

### Source and Destination in different account

When replicating between different accounts, the IAM role configured on the source bucket isn't trusted by the destination account.

- In this case, you need to add a bucket policy on the destination bucket that allows the role in the source account to replicate objects into it.
- The bucket policy here states that the role in the other account is allowed to write replicated content into this bucket.

---

## S3 Replication Options

- All objects or a subset
- Storage Class
  - Default is to use the same class as the source data
  - You could also use a cheaper storage class for the destination
- Ownership
  - If both buckets are in the same account, objects in both buckets are owned by that same account
  - If the buckets are in different AWS accounts, by default objects replicated into the destination bucket are still owned by the source account; the destination account then has no access to those replicated objects
  - Using the ownership override, you can make the destination account the owner of any object replicated into its bucket
- Replication Time Control (RTC)
  - Without this option, replication is best effort
  - This option adds an SLA for sync between source and destination (within 15 minutes)

---

## S3 Replication Considerations

- Objects existing prior to replication being enabled on the source bucket will not be replicated to destination bucket. Only objects added afterwards will be part of replication.
- `Versioning` must be enabled on both the source and destination buckets.
- `One-way replication` from source to destination. If you add any objects on the destination bucket, they will not be replicated to source bucket.
- Replication can handle objects that are unencrypted.
- Replication can also handle objects that are encrypted using SSE-S3 & SSE-KMS (this requires extra config).
- Replication cannot handle objects that were encrypted using SSE-C, as S3 does not store keys for this encryption.
- No deletes are replicated by default. Enable `Delete Marker Replication` to replicate deletions.

### Ownership

If you create a bucket in an account and add objects to it, that same AWS account owns those objects.

If you grant cross-account access to a bucket, it's possible the source account will not own some of those objects.

- Only objects owned by the source account will be replicated

### Limitation

- Any changes made by S3 lifecycle management will not be replicated.
  - Only user events are replicated
  - No system events are replicated
- Can't be used with Glacier or Glacier Deep Archive
- Delete markers created by lifecycle rules will not be replicated even with `Delete Marker Replication` enabled

---

## Use Cases

- SRR - Log Aggregation `(Aggregate logs from multiple buckets into one bucket)`
- SRR - Sync production and test accounts `(Replication between environments)`
- SRR - Resilience with strict sovereignty requirements `(To meet the region and account related compliance requirements)`
- CRR - Global resilience improvements `(Cross Region Replication to backup data between different aws regions)`
- CRR - Latency reduction `(Replicate data to another region to reduce latency for users in that region)`

---

# Presigned URLs

Consider an S3 bucket that doesn't have any public access configured and has the default private configuration.

In order to access an object in this bucket, we have the following options:

- An IAM admin can make a request. The credentials are then used to authenticate with AWS and access the object/bucket.
- An unauthenticated user doesn't have any way to specify credentials. To overcome this you could:
  - Give an AWS identity to the unauthenticated user
  - Give AWS credentials (username/password) to the unauthenticated user
  - Make the bucket public

For short-term access, none of the above options are recommended.

## Generating Presigned URLs

`iamadmin` can make a request to S3 to **generate a presigned URL**.

The user must provide:

- security credentials
- bucket name
- object key
- expiry date and time
- an indication of how the object or bucket will be accessed

S3 will respond with a custom URL with all the details encoded including
the expiration of the URL.
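
A minimal boto3 sketch; `generate_presigned_url` signs a request with the caller's current credentials, and the bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; the URL is valid for one hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "DOC-EXAMPLE-BUCKET", "Key": "private/report.pdf"},
    ExpiresIn=3600,
)

# Anyone with this URL can attempt the GET until it expires; access still
# depends on the permissions of the identity that generated it
print(url)
```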

## Operations supported

Presigned URLs support both GET (download) and PUT (upload) to the S3 Bucket.

## Considerations

- You can create a URL for an object you have no access to
  - The URL will not grant access, because your identity does not have it.
  - But if your identity later gains access, the same URL will become functional.
- When the URL is used, it matches the current permissions of the identity that generated it.
  - If that identity's IAM permissions change, the behaviour of the presigned URL changes with them.
  - If you get an access denied, it means the identity never had access, or its permissions have since changed.
- Don't generate presigned URLs with a role.
  - Role credentials are temporary; when they expire, any URL generated with them stops working.

## Generating through CLI

```
aws s3 presign <s3-uri> --expires-in <seconds>
```

---

## S3 Select and Glacier Select

This provides a way to retrieve parts of objects rather than the entire object.

If you retrieve a 5TB object, it takes time and consumes 5TB of data.
Filtering on the client side doesn't reduce that time or cost.

S3 Select and Glacier Select let you use SQL-like statements.

The filtering happens at the S3 source, before any data is transferred.

File formats supported under this:

- CSV
- JSON
- Parquet

It can also use BZIP2 compression for CSV and JSON.
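
A minimal boto3 sketch of `select_object_content` against a CSV object; the bucket, key and column names are assumptions:

```python
import boto3

s3 = boto3.client("s3")

# Only the matching rows leave S3, not the whole object
resp = s3.select_object_content(
    Bucket="DOC-EXAMPLE-BUCKET",   # hypothetical bucket
    Key="data/records.csv",        # hypothetical CSV object
    ExpressionType="SQL",
    Expression="SELECT s.name FROM S3Object s WHERE s.status = 'active'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the filtered bytes
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```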

---

# Cross-Origin Resource Sharing (CORS)

Consider a scenario where a user is accessing a website `catagram.io`, statically hosted on a bucket named `catagram.io`.

While the resources requested by the website are being fetched from `catagram.io`, such requests are called same-origin requests.

## Cross Origin Requests

Say the website also uses images hosted in a `catagram-img.io` bucket. `catagram.io` requesting images from `catagram-img.io` is a cross-origin request.

In this case, the `catagram-img.io` bucket needs cross-origin resource sharing (CORS) enabled so that requests from origins other than its own (such as `catagram.io`) can be served.

## CORS configuration

- CORS rules are evaluated in order
- The first matching rule is used

```json
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["PUT", "POST", "DELETE"],
    "AllowedOrigins": ["http://catagram.io"],
    "ExposeHeaders": []
  },
  {
    "AllowedHeaders": [],
    "AllowedMethods": ["GET"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": []
  }
]
```

## Types of requests

- Simple Requests `(Some requests don't trigger a CORS preflight. Those are called simple requests)`
- Preflight and Preflighted requests `(The browser will make a request using the OPTIONS method to the resource on the other origin, in order to determine if the actual request is safe to send)`

---

# S3 Event Notifications

Notifications generated when events occur in a bucket can be delivered to:

- SNS
- SQS
- Lambda functions

Events:

- Object created (Put, Post, Copy, CompleteMultiPartUpload)
- Object delete (Delete, DeleteMarkerCreated)
- Object restore (Post Initiated, Post Completed)
- Replication (OperationMissedThreshold, OperationReplicatedAfterThreshold, OperationNotTracked, OperationFailedReplication)

To enable notifications, you must first add an `event notification configuration` that identifies the events you want Amazon S3 to publish and the destinations where you want Amazon S3 to send the notifications.

- You store this configuration in the notification subresource that is associated with a bucket

## Permission

Since events are generated by S3, we must add a resource policy on the destination (SNS topic, SQS queue or Lambda function) to allow the S3 service to interact with it.

## Example

The following notification configuration contains a queue configuration identifying an Amazon SQS queue for Amazon S3 to publish events of the `s3:ObjectCreated:Put` type. The events are published whenever an object that has a prefix of `images/` and a `jpg` suffix is PUT to a bucket.

```xml
<NotificationConfiguration>
  <QueueConfiguration>
      <Id>1</Id>
      <Filter>
          <S3Key>
              <FilterRule>
                  <Name>prefix</Name>
                  <Value>images/</Value>
              </FilterRule>
              <FilterRule>
                  <Name>suffix</Name>
                  <Value>jpg</Value>
              </FilterRule>
          </S3Key>
     </Filter>
     <Queue>arn:aws:sqs:us-west-2:444455556666:s3notificationqueue</Queue>
     <Event>s3:ObjectCreated:Put</Event>
  </QueueConfiguration>
</NotificationConfiguration>
```

EventBridge is an alternative which supports more event types and more services, and is the recommended option over S3 Event Notifications.

---

# S3 Access Logs

S3 access logging provides detailed records for the requests that are made to a bucket.

- A target bucket is used to capture the access logs
- Access logging of the source bucket can be enabled through the console UI or via PUT Bucket logging using the CLI or API (see the sketch below)
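
A minimal boto3 sketch of the PUT Bucket logging call; the bucket names and prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# The target bucket must allow the S3 logging service to write to it
s3.put_bucket_logging(
    Bucket="example-source-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-log-bucket",
            "TargetPrefix": "access-logs/example-source-bucket/",
        }
    },
)
```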

Logging is managed by S3 Log Delivery Group, which reads the logging configuration set on the source bucket.

Enabling access logs and having logs appear in the target bucket can take a few hours, and delivery is done on a `best effort log delivery` basis.

---

# S3 Requester Pays

In general, bucket owners pay for all Amazon S3 storage and data transfer costs associated with their bucket.

- A bucket owner, however, can configure a bucket to be a Requester Pays bucket.
- With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download (Transfer OUT of S3).
- The bucket owner always pays the cost of storing data.

For Requester Pays to work:

- Unauthenticated requests aren't supported
- Authenticated identities are required for billing
- Requesters must include the `x-amz-request-payer` header to confirm payment responsibility (see the sketch below)
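
A minimal boto3 sketch of the owner enabling Requester Pays and of a requester acknowledging the charge; the bucket and key names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Bucket owner enables Requester Pays on the bucket
s3.put_bucket_request_payment(
    Bucket="example-dataset-bucket",
    RequestPaymentConfiguration={"Payer": "Requester"},
)

# An authenticated requester must explicitly accept the charges
# (this sets the x-amz-request-payer header on the request)
obj = s3.get_object(
    Bucket="example-dataset-bucket",
    Key="data/large-file.bin",
    RequestPayer="requester",
)
```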