# Amazon S3

Amazon S3 is a highly scalable and reliable object storage service. Its flexibility makes it a good place to store customer-related data such as marketing and sales logs, event records, and customer attributes. HockeyStack integrates with S3 to ingest this data so you can unify and analyze all of your customer insights in one place.

## Data Ingestion Approach <a href="#block-3ddf20eae53e4c15aa49008d28902ddc" id="block-3ddf20eae53e4c15aa49008d28902ddc"></a>

We recommend using CSV or Parquet files to store the data you want HockeyStack to process. This ensures efficient ingestion and parsing of large datasets.

HockeyStack maintains an internal Last Sync Date to manage incremental updates. On each sync, we pull only new or updated objects whose Timestamp is later than the Last Sync Date. If you expect to add historical data after the initial sync, include an **Added or Updated At** column so that backdated records are still loaded incrementally.

**Initial Backfill:**\
We start by pulling your historical data (often the past few years, depending on your requirements) to create a comprehensive baseline of your datasets.

**Incremental Syncs:**\
After the initial backfill, HockeyStack retrieves only the daily deltas (differences) so your analytics stay up-to-date without unnecessary overhead.
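
The incremental rule described above can be sketched in a few lines of Python. This is only an illustration of the comparison against the Last Sync Date, not HockeyStack's actual implementation; the `S3Object` class, the function names, and the sample keys are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for one entry in a bucket listing."""
    key: str
    timestamp: datetime


def select_for_sync(objects, last_sync_date):
    """Keep only objects whose timestamp is later than the last sync date."""
    return [obj for obj in objects if obj.timestamp > last_sync_date]


def advance_last_sync(selected, last_sync_date):
    """Move the sync date forward to the newest timestamp just processed.

    If nothing was selected, the previous sync date is kept.
    """
    return max((obj.timestamp for obj in selected), default=last_sync_date)


# Illustrative data: one object older than the last sync, two newer.
last_sync = datetime(2024, 5, 1, tzinfo=timezone.utc)
listing = [
    S3Object("events/2024-04-30.parquet", datetime(2024, 4, 30, 23, 0, tzinfo=timezone.utc)),
    S3Object("events/2024-05-01.parquet", datetime(2024, 5, 1, 23, 0, tzinfo=timezone.utc)),
    S3Object("events/2024-05-02.parquet", datetime(2024, 5, 2, 23, 0, tzinfo=timezone.utc)),
]

to_pull = select_for_sync(listing, last_sync)
new_last_sync = advance_last_sync(to_pull, last_sync)
```

In this sketch, a record that is backdated but written late would only be picked up if the compared timestamp reflects when it was added or updated, which is why the **Added or Updated At** column matters for historical data.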

## Methods for Pulling Data from S3

Depending on your technical environment and preferences, we offer multiple ways to integrate with S3.

### 1. Direct S3 Bucket Access via ClickHouse <a href="#block-39b8489a6e05475e9e6d995cb1ad6181" id="block-39b8489a6e05475e9e6d995cb1ad6181"></a>

**Method Overview:**\
Expose your S3 buckets directly for ingestion through the [ClickHouse S3 connector](https://clickhouse.com/docs/en/integrations/s3). ClickHouse is our primary analytical database, which stores every datapoint about your customers for high-performance querying.

**Table Requirements:**

* **Timestamp:** A column indicating when the event or record occurred.
* **Identity (Email):** A unique identifier (e.g., email) to link records to individual customers or entities.
* **Action Data:** Additional columns representing activities, attributes, or metrics you want to analyze.
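
Before exposing a bucket, it can help to confirm that your files carry the required columns. The following is a minimal sketch for CSV files using only the Python standard library. The column names `timestamp` and `email` are assumptions for illustration; use whatever names you agree on with HockeyStack.

```python
import csv
import io

# Assumed column names for the required Timestamp and Identity fields.
REQUIRED_COLUMNS = {"timestamp", "email"}


def missing_required_columns(csv_text):
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip().lower() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


sample = (
    "timestamp,email,page_view_count\n"
    "2024-05-01T12:00:00Z,ada@example.com,7\n"
)
print(missing_required_columns(sample))  # prints []
```

Any additional columns, such as `page_view_count` here, are treated as action data and do not affect the check.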

### 2. Using Amazon Athena <a href="#block-04cb9d6fdc774cd1910ca3a6f368a887" id="block-04cb9d6fdc774cd1910ca3a6f368a887"></a>

**Method Overview:**\
[Amazon Athena](https://aws.amazon.com/athena/) provides a serverless, SQL-based interface to query your S3 data. HockeyStack uses the [athena-express](https://github.com/ghdna/athena-express) NPM package to handle the connection and querying process.

**Considerations:**

* **Cost Control:** Athena charges per query and data scanned, so we’ll work with you to set query limits or schedules that keep costs predictable.
* **Schema & Structure:** Ensure the queried data contains the required Timestamp, Identity, and action-related columns.
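
To make the considerations above concrete, here is a sketch of what an incremental Athena query over the required columns might look like. It only builds the SQL string; it does not reflect the queries HockeyStack actually runs. The database, table, and column names are hypothetical, and the sketch assumes the `updated_at` column has a timestamp type and that all identifiers come from trusted configuration.

```python
from datetime import datetime


def build_incremental_query(database, table, columns, since,
                            watermark_column="updated_at"):
    """Build a SELECT that returns only rows changed after `since`.

    Identifiers are double-quoted, as Athena expects in SELECT statements.
    """
    column_list = ", ".join(f'"{c}"' for c in columns)
    since_literal = since.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f'SELECT {column_list} FROM "{database}"."{table}" '
        f"WHERE \"{watermark_column}\" > TIMESTAMP '{since_literal}'"
    )


query = build_incremental_query(
    database="analytics",       # hypothetical database name
    table="customer_events",    # hypothetical table name
    columns=["timestamp", "email", "page_view_count"],
    since=datetime(2024, 5, 1),
)
print(query)
```

Naming only the columns you need, rather than using `SELECT *`, generally reduces the data Athena scans when the underlying files are in a columnar format such as Parquet.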

### IAM & Permissions Setup

For both direct S3 access and Athena-based ingestion, you’ll need to:

* **Create an IAM User:**\
  Create a dedicated IAM user and provide HockeyStack with its `AccessKeyID` and `SecretAccessKey` for secure programmatic access.
* **Permissions:**\
  Grant the IAM user `AmazonS3FullAccess` or, preferably, a more restricted bucket-level read policy that still allows HockeyStack to retrieve the necessary files.\
  If using Athena, also include `AmazonAthenaFullAccess` or equivalent permissions so HockeyStack can run queries against your S3 data.
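
As an illustration of the "more restricted, bucket-level read policy" option, the sketch below generates an IAM policy document that allows listing one bucket and reading objects under one prefix. The bucket name and prefix are hypothetical, and the exact set of actions HockeyStack needs should be confirmed with our team. Athena-based ingestion needs additional permissions that are not shown here.

```python
import json


def read_only_bucket_policy(bucket, prefix=""):
    """Return an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to objects under the given prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Hypothetical bucket and prefix; replace with your own.
policy = read_only_bucket_policy("example-hockeystack-exports", "events/")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the IAM user as an inline or customer managed policy in place of `AmazonS3FullAccess`.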

### Data Schema Documentation

Once you’ve chosen a method, tested the connection, and confirmed access, provide HockeyStack with a short description of each object type, file structure, and any notable fields within your S3 data. For example:

* **Column Descriptions:**
  * `timestamp`: Event occurrence time
  * `email`: User’s email address
  * `page_view_count`: Number of pages viewed in a session
  * `added_at`/`updated_at`: When the record was inserted or modified

This information ensures HockeyStack can accurately interpret, map, and utilize your S3 datasets.

## Advanced Option: Custom Data Pipeline to ClickHouse

For customers seeking even more control, there’s a third approach:

* **Custom Data Pipeline:**\
  Build a custom pipeline that pushes data directly into a dedicated ClickHouse cluster managed by HockeyStack.
  * **Who’s It For?** Technical teams comfortable with custom development.
  * **Benefits:** Fine-grained control over data ingestion frequency, batch sizes, and data handling logic.
  * **Considerations:** Requires additional engineering resources on your end and a call with our team to discuss schema and requirements.

***

**Whether you choose direct S3 access, Athena, or a custom pipeline, we’re here to guide you through the integration.** Once set up, HockeyStack will pull historical and incremental data from S3, enabling advanced analytics and deeper insights into your customer journeys. If you have any questions or need assistance, reach out to our support team for tailored guidance.

