Amazon S3

Amazon S3 is a highly scalable and reliable object storage service. Its flexibility makes it a perfect source for storing customer-related data, such as marketing and sales logs, event records, or customer attributes. HockeyStack integrates seamlessly with S3 to ingest your data, offering the flexibility you need to unify and analyze all of your customer insights in one place.

Data Ingestion Approach

We recommend using CSV or Parquet files to store the data you want HockeyStack to process. This ensures efficient ingestion and parsing of large datasets.

HockeyStack maintains an internal Last Sync Date to manage incremental updates. On each sync, we pull only records whose Timestamp is later than the Last Sync Date. If you anticipate backfilling historical data after the initial sync, include an Added At or Updated At column so those older records still register as new and are loaded incrementally.

Initial Backfill: We start by pulling your historical data — often from the past few years, depending on your requirements — to create a comprehensive baseline of your datasets.

Incremental Syncs: After the initial backfill, HockeyStack retrieves only the daily deltas (differences) so your analytics stay up-to-date without unnecessary overhead.
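
To make the incremental rule above concrete, here is a minimal TypeScript sketch of the filter. The field names, the hard-coded checkpoint date, and the sample rows are illustrative assumptions, not HockeyStack's internal implementation.

    // Minimal sketch of the incremental-sync rule: keep a row only if its timestamp
    // (or updated_at, when present) is later than the last sync checkpoint.
    interface RawRow {
      timestamp: string;    // when the event or record occurred
      email: string;        // identity column
      updated_at?: string;  // optional: when a backfilled record was added or modified
    }

    const lastSyncDate = new Date("2024-01-01T00:00:00Z"); // assumed checkpoint

    function isDelta(row: RawRow): boolean {
      const effective = new Date(row.updated_at ?? row.timestamp);
      return effective > lastSyncDate;
    }

    const rows: RawRow[] = [
      { timestamp: "2023-12-30T10:00:00Z", email: "old@example.com" },
      { timestamp: "2023-06-01T09:00:00Z", email: "backfill@example.com", updated_at: "2024-01-05T08:00:00Z" },
    ];

    console.log(rows.filter(isDelta)); // only the backfilled/updated row is pulled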

Methods for Pulling Data from S3

Depending on your technical environment and preferences, we offer multiple ways to integrate with S3.

1. Direct S3 Bucket Access via ClickHouse

Method Overview: Expose your S3 buckets directly for ingestion through the ClickHouse S3 connector. ClickHouse is our primary analytical database; it stores every data point about your customers for high-performance querying. An illustrative query follows the requirements below.

Table Requirements:

  • Timestamp: A column indicating when the event or record occurred.

  • Identity (Email): A unique identifier (e.g., email) to link records to individual customers or entities.

  • Action Data: Additional columns representing activities, attributes, or metrics you want to analyze.
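
For illustration only, the snippet below shows roughly how a ClickHouse client can read Parquet files from S3 with the s3 table function, and why the Timestamp and Identity columns matter. HockeyStack runs this side of the integration; the host, credentials, bucket path, and column names are placeholders, and a recent @clickhouse/client is assumed.

    import { createClient } from "@clickhouse/client";

    const client = createClient({
      url: "https://your-clickhouse-host:8443", // placeholder host
      username: "default",
      password: process.env.CLICKHOUSE_PASSWORD,
    });

    async function main() {
      // Read Parquet objects directly from the bucket and keep only rows newer than the last sync.
      const resultSet = await client.query({
        query: `
          SELECT timestamp, email, action
          FROM s3(
            'https://your-bucket.s3.amazonaws.com/events/*.parquet',
            'ACCESS_KEY_PLACEHOLDER', 'SECRET_KEY_PLACEHOLDER',
            'Parquet'
          )
          WHERE timestamp > toDateTime('2024-01-01 00:00:00')`,
        format: "JSONEachRow",
      });
      console.log(await resultSet.json());
    }

    main().catch(console.error);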

2. Using Amazon Athena

Method Overview: Amazon Athena provides a serverless, SQL-based interface for querying your S3 data. HockeyStack uses the athena-express NPM package to handle the connection and querying; a sketch of this flow follows the considerations below.

Considerations:

  • Cost Control: Athena bills by the amount of data scanned per query, so we’ll work with you to set query limits or schedules that keep costs predictable.

  • Schema & Structure: Ensure the queried data contains the required Timestamp, Identity, and action-related columns.
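
As a rough sketch of this flow with athena-express (the results bucket, database, table, and column names are placeholders, and configured aws-sdk v2 credentials are assumed):

    import * as aws from "aws-sdk";                  // athena-express expects an initialized aws-sdk v2 object
    const AthenaExpress = require("athena-express");

    aws.config.update({ region: "us-east-1" });

    const athena = new AthenaExpress({
      aws,
      s3: "s3://your-athena-query-results/",         // bucket where Athena writes query output
    });

    async function main() {
      // Filtering on the sync window limits the data scanned per query, which keeps cost down.
      const result = await athena.query({
        sql: `SELECT "timestamp", email, page_view_count
              FROM events
              WHERE "timestamp" > TIMESTAMP '2024-01-01 00:00:00'`,
        db: "your_database",
      });
      console.log(result.Items.length, "new rows");
    }

    main().catch(console.error);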

IAM & Permissions Setup

For both direct S3 access and Athena-based ingestion, you’ll need to:

  • Create an IAM User: Create a dedicated IAM user and provide HockeyStack with its Access Key ID and Secret Access Key for secure programmatic access.

  • Permissions: Grant the IAM user AmazonS3FullAccess or a more restricted, bucket-level read policy (see the example below) that still allows HockeyStack to retrieve the necessary files. If using Athena, also include AmazonAthenaFullAccess or equivalent permissions so HockeyStack can run queries against your S3 data.
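
If you prefer the scoped option, a minimal bucket-level read policy looks roughly like the one below. It is shown as a TypeScript constant so it stays in one language with the other examples; the object body is standard IAM policy JSON, and the bucket name is a placeholder.

    // Least-privilege read access to a single bucket (replace "your-bucket").
    const hockeystackReadOnlyPolicy = {
      Version: "2012-10-17",
      Statement: [
        {
          Sid: "ListBucket",
          Effect: "Allow",
          Action: ["s3:ListBucket", "s3:GetBucketLocation"],
          Resource: "arn:aws:s3:::your-bucket",
        },
        {
          Sid: "ReadObjects",
          Effect: "Allow",
          Action: ["s3:GetObject"],
          Resource: "arn:aws:s3:::your-bucket/*",
        },
      ],
    };

    // Paste the printed JSON into the IAM console as an inline or managed policy.
    console.log(JSON.stringify(hockeystackReadOnlyPolicy, null, 2));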

Data Schema Documentation

Once you’ve chosen a method, tested the connection, and confirmed access, provide HockeyStack with a short description of each object type, file structure, and any notable fields within your S3 data. For example:

  • Column Descriptions:

    • timestamp: Event occurrence time

    • email: User’s email address

    • page_view_count: Number of pages viewed in a session

    • added_at/updated_at: When the record was inserted or modified

This information ensures HockeyStack can accurately interpret, map, and utilize your S3 datasets.
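
Purely as an illustration, the example columns above could be summarized as a small row type; your actual files may use different names and types.

    // Illustrative row shape for the example columns listed above.
    interface S3CustomerRow {
      timestamp: string;        // event occurrence time (ISO 8601 recommended)
      email: string;            // user's email address, used as the identity key
      page_view_count: number;  // number of pages viewed in a session
      added_at?: string;        // when the record was inserted (optional)
      updated_at?: string;      // when the record was last modified (optional)
    }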

Advanced Option: Custom Data Pipeline to ClickHouse

For customers seeking even more control, there’s a third approach:

  • Custom Data Pipeline: Build a custom pipeline that pushes data directly into a dedicated ClickHouse cluster managed by HockeyStack; a brief sketch follows this list.

    • Who It’s For: Technical teams comfortable with custom development.

    • Benefits: Fine-grained control over data ingestion frequency, batch sizes, and data handling logic.

    • Considerations: Requires additional engineering resources on your end and a call with our team to discuss schema and requirements.
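
A hedged sketch of what such a pipeline can look like, assuming a recent @clickhouse/client: your code batches rows and inserts them into the dedicated cluster. The host, credentials, table, and column names below are placeholders that would be settled on the schema call.

    import { createClient } from "@clickhouse/client";

    const client = createClient({
      url: "https://your-dedicated-cluster.example.com:8443", // provided by HockeyStack (placeholder)
      username: "ingest_user",
      password: process.env.CLICKHOUSE_PASSWORD,
    });

    async function pushBatch(rows: Array<{ timestamp: string; email: string; action: string }>) {
      // JSONEachRow keeps the payload simple: one JSON object per row.
      await client.insert({
        table: "customer_events",
        values: rows,
        format: "JSONEachRow",
      });
    }

    pushBatch([
      { timestamp: "2024-01-05 08:00:00", email: "jane@example.com", action: "demo_requested" },
    ])
      .then(() => client.close())
      .catch(console.error);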


Whether you choose direct S3 access, Athena, or a custom pipeline, we’re here to guide you through the integration. Once set up, HockeyStack will pull historical and incremental data from S3, enabling advanced analytics and deeper insights into your customer journeys. If you have any questions or need assistance, reach out to our support team for tailored guidance.
