Amazon S3
Amazon S3 is a highly scalable and reliable object storage service. Its flexibility makes it a good fit for storing customer-related data, such as marketing and sales logs, event records, or customer attributes. HockeyStack integrates with S3 to ingest this data so you can unify and analyze your customer insights in one place.
Data Ingestion Approach
We recommend using CSV or Parquet files to store the data you want HockeyStack to process. This ensures efficient ingestion and parsing of large datasets.
HockeyStack maintains an internal Last Sync Date to manage incremental updates. On each sync, we pull only new or updated objects whose Timestamp is later than the Last Sync Date. If you anticipate adding historical data after the initial sync, include an Added At or Updated At column so that past records are still picked up by incremental loads.
Initial Backfill: We start by pulling your historical data — often from the past few years, depending on your requirements — to create a comprehensive baseline of your datasets.
Incremental Syncs: After the initial backfill, HockeyStack retrieves only the daily deltas (differences) so your analytics stay up-to-date without unnecessary overhead.
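The comparison against the Last Sync Date can be sketched as follows. This is a minimal illustration rather than HockeyStack's actual implementation: the S3Object class, the function names, and the sample keys are hypothetical, and the object's last-modified time is assumed to be the Timestamp that gets compared.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for an S3 listing entry (key plus last-modified time)."""
    key: str
    last_modified: datetime


def select_objects_to_sync(objects, last_sync_date):
    """Return the objects whose timestamp is later than the last sync date."""
    return [obj for obj in objects if obj.last_modified > last_sync_date]


def advance_last_sync_date(synced, last_sync_date):
    """Move the cursor to the newest timestamp just synced (unchanged if nothing synced)."""
    return max((obj.last_modified for obj in synced), default=last_sync_date)


# Illustrative data only: one object predates the last sync, two are newer.
last_sync = datetime(2024, 1, 2, tzinfo=timezone.utc)
listing = [
    S3Object("events/2024-01-01.csv", datetime(2024, 1, 1, 6, tzinfo=timezone.utc)),
    S3Object("events/2024-01-02.csv", datetime(2024, 1, 2, 6, tzinfo=timezone.utc)),
    S3Object("events/2024-01-03.csv", datetime(2024, 1, 3, 6, tzinfo=timezone.utc)),
]
pending = select_objects_to_sync(listing, last_sync)
print([obj.key for obj in pending])
# prints ['events/2024-01-02.csv', 'events/2024-01-03.csv']
```

Under this model, a file of historical records uploaded after the initial backfill is still picked up, because its timestamp is newer than the cursor even though the events inside it are old. That is the case the Added At or Updated At column is meant to cover at the record level.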
Methods for Pulling Data from S3
Depending on your technical environment and preferences, we offer multiple ways to integrate with S3.
1. Direct S3 Bucket Access via ClickHouse
Method Overview: Expose your S3 buckets directly for ingestion through the ClickHouse S3 connector. ClickHouse is our primary analytical database, which stores every datapoint about your customers for high-performance querying.
Table Requirements:
Timestamp: A column indicating when the event or record occurred.
Identity (Email): A unique identifier (e.g., email) to link records to individual customers or entities.
Action Data: Additional columns representing activities, attributes, or metrics you want to analyze.
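Before exposing a bucket, it may help to check that your files carry the required columns. The sketch below inspects a CSV header using only the Python standard library. The column names timestamp and email are assumptions chosen for illustration; use whatever names you agree on with HockeyStack.

```python
import csv
import io

# Assumed column names: the requirements above call for a timestamp and an
# identity column, but do not mandate these exact names.
REQUIRED_COLUMNS = frozenset({"timestamp", "email"})


def check_csv_header(csv_text, required=REQUIRED_COLUMNS):
    """Return (missing required columns, remaining action columns) for a CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip().lower() for name in next(reader, [])]
    missing = sorted(set(required) - set(header))
    action_columns = [name for name in header if name not in required]
    return missing, action_columns


sample = "timestamp,email,page_view_count\n2024-01-03T06:00:00Z,a@example.com,4\n"
missing, actions = check_csv_header(sample)
print("missing:", missing)
print("action columns:", actions)
```

An empty missing list means the file has both required columns; everything else in the header is treated as action data.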
2. Using Amazon Athena
Method Overview: Amazon Athena provides a serverless, SQL-based interface to query your S3 data. HockeyStack uses the athena-express NPM package to handle the connection and querying process.
Considerations:
Cost Control: Athena charges per query and data scanned, so we’ll work with you to set query limits or schedules that keep costs predictable.
Schema & Structure: Ensure the queried data contains the required Timestamp, Identity, and action-related columns.
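To get a feel for how scan volume drives Athena cost, a rough estimate can be sketched as below. The price per terabyte, the per-query minimum, and the terabyte size are placeholder assumptions, not figures from this page; Athena pricing varies by region and changes over time, so replace them with values from the current AWS pricing page.

```python
def estimate_scan_cost(
    bytes_scanned,
    price_per_tb=5.0,          # assumed placeholder price in USD per TB scanned
    min_bytes=10 * 1024**2,    # assumed per-query billing minimum
    bytes_per_tb=1024**4,      # assumed definition of a terabyte
):
    """Estimate the cost of one query from the number of bytes it scans."""
    billable = max(bytes_scanned, min_bytes)
    return billable / bytes_per_tb * price_per_tb


def estimate_monthly_cost(bytes_per_sync, syncs_per_day=1, days=30, **pricing):
    """Estimate the monthly cost of a recurring sync that scans a fixed volume."""
    return estimate_scan_cost(bytes_per_sync, **pricing) * syncs_per_day * days


# Illustrative numbers only: a daily sync that scans 200 GiB per run.
daily_scan = 200 * 1024**3
print(f"per sync:  ${estimate_scan_cost(daily_scan):.2f}")
print(f"per month: ${estimate_monthly_cost(daily_scan):.2f}")
```

Because the estimate scales with bytes scanned, anything that narrows the scan, such as columnar Parquet files or querying only the daily delta, lowers the cost of each sync.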
IAM & Permissions Setup
For both direct S3 access and Athena-based ingestion, you’ll need to:
Create an IAM User: Provide HockeyStack with AccessKeyID and SecretAccessKey for secure programmatic access.
Permissions: Grant the IAM user AmazonS3FullAccess or a more restricted, bucket-level read policy that still allows HockeyStack to retrieve the necessary files. If using Athena, also include AmazonAthenaFullAccess or equivalent permissions so HockeyStack can run queries against your S3 data.
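A restricted, bucket-level read policy might look like the document generated below. This is a sketch under assumptions: the bucket name and prefix are hypothetical, and the exact set of actions HockeyStack needs is not stated here, so confirm it with the HockeyStack team before applying the policy. Athena-based ingestion needs additional permissions that are not shown.

```python
import json


def build_read_only_policy(bucket, prefix=""):
    """Build an IAM policy document granting list and read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reads are granted on objects, optionally limited to a key prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# "example-customer-data" and "events/" are placeholder names.
policy = build_read_only_policy("example-customer-data", prefix="events/")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy for the user. Scoping the object statement to a prefix keeps the rest of the bucket out of reach.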
Data Schema Documentation
Once you’ve chosen a method, tested the connection, and confirmed access, provide HockeyStack with a short description of each object type, file structure, and any notable fields within your S3 data. For example:
Column Descriptions:
timestamp: Event occurrence time
email: User’s email address
page_view_count: Number of pages viewed in a session
added_at / updated_at: When the record was inserted or modified
This information ensures HockeyStack can accurately interpret, map, and utilize your S3 datasets.
Advanced Option: Custom Data Pipeline to ClickHouse
For customers seeking even more control, there’s a third approach:
Custom Data Pipeline: Build a custom pipeline that pushes data directly into a dedicated ClickHouse cluster managed by HockeyStack.
Who’s It For?: Technical teams comfortable with custom development.
Benefits: Fine-grained control over data ingestion frequency, batch sizes, and data handling logic.
Considerations: Requires additional engineering resources on your end and a call with our team to discuss schema and requirements.
Whether you choose direct S3 access, Athena, or a custom pipeline, we’re here to guide you through the integration. Once set up, HockeyStack will pull historical and incremental data from S3, enabling advanced analytics and deeper insights into your customer journeys. If you have any questions or need assistance, reach out to our support team for tailored guidance.