AEP: Play with schema and dataset
In this post I will cover a few of the basics of datasets in AEP. To start with, here is the schema I created.
This is based on a custom class with Record behavior (this class comes with one attribute, Identifier, shown as locked in the schema).
I set the Identifier field as the primary identity.
Then I created a dataset with the same name as the class.
Data Loading
1. Delimited File
I created a simple delimited file as below and set up a workflow to load it into the dataset.
Sample Data:
id,dataset_name,last_snapshot_id,process_timestamp,process_status,failure_reason
journey_step_events_27878,journey_step_events,27878,2024-09-12T19:19:50.036Z,SUCCESSFUL,Not applicable
aa_stitched_events_23451,aa_stitched_events,23451,2024-09-17T19:19:50.036Z,FAILED,There was no snapshot available
journey_step_events_27891,journey_step_events,27891,2024-09-16T19:19:50.036Z,In Process,Not applicable
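Before building the load workflow, a delimited file like the one above can be sanity-checked locally. A minimal sketch using Python's standard csv module (the embedded sample and the checks are illustrative, not part of AEP):

```python
import csv
import io

# Two data rows from the sample file above, embedded for the sketch.
SAMPLE = """\
id,dataset_name,last_snapshot_id,process_timestamp,process_status,failure_reason
journey_step_events_27878,journey_step_events,27878,2024-09-12T19:19:50.036Z,SUCCESSFUL,Not applicable
aa_stitched_events_23451,aa_stitched_events,23451,2024-09-17T19:19:50.036Z,FAILED,There was no snapshot available
"""

def check_rows(text):
    """Parse the delimited text and confirm each row has an id and a numeric snapshot id."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        assert row["id"], "id must not be empty"
        assert row["last_snapshot_id"].isdigit(), "last_snapshot_id must be numeric"
    return rows

rows = check_rows(SAMPLE)
```

This catches malformed rows before the workflow rejects the whole batch.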
2. Using JSON
Download the sample JSON from the schema page and create a JSON file using that format. For multiple records it needs to be an array, as below.
[{
  "_mytechnologys": {
    "dataset_name": "aa_stitched_events",
    "failure_reason": "Not applicable JSON",
    "last_snapshot_id": 23049,
    "process_status": "WIP",
    "process_timestamp": "2018-11-12T20:20:39+00:00"
  },
  "_id": "aa_stitched_events_23047"
}]
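The same payload can be generated programmatically instead of hand-editing JSON. A sketch with Python's json module, nesting the fields under the tenant namespace `_mytechnologys` as in the format above (the helper function name is my own):

```python
import json

TENANT = "_mytechnologys"  # tenant namespace from the downloaded schema format

def make_record(record_id, dataset_name, snapshot_id, status,
                timestamp, failure_reason="Not applicable"):
    """Build one record in the shape of the schema's sample JSON."""
    return {
        TENANT: {
            "dataset_name": dataset_name,
            "failure_reason": failure_reason,
            "last_snapshot_id": snapshot_id,
            "process_status": status,
            "process_timestamp": timestamp,
        },
        "_id": record_id,
    }

# Multiple records must go into a JSON array.
records = [make_record("aa_stitched_events_23047", "aa_stitched_events",
                       23049, "WIP", "2018-11-12T20:20:39+00:00")]
payload = json.dumps(records, indent=2)
```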
3. Using SQL
You can insert records using SQL as well. Here is the sample SQL I used for this schema:
INSERT INTO tg_checkpoint_log
SELECT
  'journey_step_events_27880' AS _id,
  struct(
    'journey_step_events' AS dataset_name,
    27880 AS last_snapshot_id,
    cast(CURRENT_TIMESTAMP AS TIMESTAMP) AS process_timestamp,
    'WIP' AS process_status
  ) AS _mytechnologys;
Note: steps 1 and 2 can also be done using the API. For file loading, Parquet is the recommended format. I didn't try loading a CSV file via the API, though.
Note that data is always inserted, never updated. Even after making the _id field the primary identity, inserting a record with the same _id does not update the existing record. If you query the dataset, you will find duplicate records.
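Because inserts never update in place, reads have to pick a single record per _id themselves. A sketch of keep-the-latest-row logic in Python (in AEP you would typically do the equivalent in Query Service, e.g. with a window function; the helper and sample rows below are illustrative):

```python
def latest_per_id(records):
    """Keep only the most recent record for each _id, by process_timestamp.

    Assumes timestamps share one ISO-8601 format, so string comparison orders them.
    """
    latest = {}
    for rec in records:
        rid = rec["_id"]
        ts = rec["_mytechnologys"]["process_timestamp"]
        if rid not in latest or ts > latest[rid]["_mytechnologys"]["process_timestamp"]:
            latest[rid] = rec
    return list(latest.values())

# Two inserts with the same _id: only the newer one survives the read.
rows = [
    {"_id": "a_1", "_mytechnologys": {"process_timestamp": "2024-09-12T00:00:00Z",
                                      "process_status": "WIP"}},
    {"_id": "a_1", "_mytechnologys": {"process_timestamp": "2024-09-13T00:00:00Z",
                                      "process_status": "SUCCESSFUL"}},
]
deduped = latest_per_id(rows)
```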