Table of Contents
- Examples of long-sequence features
- 1. Event-level features
- 2. Sequence-level features
- Aggregation Features
- Session-based Features
- Temporal Order Features
- 3. User-level features
- 4. Interaction features (between user and item/context)
- How to store long-term user behavior sequence features in offline data lake storage?
- How to efficiently update the behavior_sequence field when the same user has new behavior?
- References
Examples of long-sequence features
For example, a user’s sequence could look like this:
[ Electronics, Clothing, Books, Home & Kitchen, Electronics, Books, Electronics, Sports & Outdoors, Electronics ]
The interactions could be further refined by adding the type of behavior (e.g., [Electronics:view, Clothing:click, Books:purchase, Home & Kitchen:add_to_cart, …] ).
1. Event-level features
Categorical Encoding: Convert event types (e.g., “click”, “add to cart”, “purchase”, “view”) or item categories into numerical representations using techniques like one-hot encoding or embedding methods.
Temporal Features: Extract time-based features from timestamps, such as hour of day, day of the week, month, and time elapsed since previous interaction.
Interaction-Specific Features: Capture attributes specific to each interaction, like product price, rating, duration of video watched, etc.
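The event-level features above can be sketched in plain Python. This is a minimal illustration, not a production encoder: the `ACTION_TYPES` vocabulary, the event dict layout, and the helper names are assumptions made for the example.

```python
from datetime import datetime
from typing import Optional

# Hypothetical vocabulary of action types; a real system would derive it from data.
ACTION_TYPES = ["view", "click", "add_to_cart", "purchase"]


def one_hot(action: str) -> list:
    """Return a one-hot vector for an action type (all zeros if unknown)."""
    return [1 if action == known else 0 for known in ACTION_TYPES]


def temporal_features(ts: datetime, prev_ts: Optional[datetime]) -> dict:
    """Extract time-based features from an event timestamp."""
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday is 0
        "month": ts.month,
        # Seconds since the previous interaction, or None for the first event.
        "seconds_since_prev": (ts - prev_ts).total_seconds() if prev_ts else None,
    }


# Two made-up events for one user, sorted by time.
events = [
    {"ts": datetime(2024, 5, 6, 9, 30), "action": "view"},
    {"ts": datetime(2024, 5, 6, 9, 45), "action": "purchase"},
]

rows = []
prev = None
for e in events:
    rows.append({"action_onehot": one_hot(e["action"]), **temporal_features(e["ts"], prev)})
    prev = e["ts"]
```

One-hot vectors are fine for a handful of action types; for large vocabularies such as item or category IDs, a learned embedding table is usually the better fit.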
2. Sequence-level features
Aggregation Features
Count of specific events: Number of clicks, purchases, or searches in the past week.
Average value of numerical features: Average product price of items viewed or purchased.
Time-based statistics: Maximum, minimum, or average time between consecutive interactions.
Frequency of interactions: Number of interactions per hour or day.
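As a rough sketch, the aggregation features listed above could be computed as follows. The event layout (`ts`, `action`, `price`), the 7-day window, and the function name are assumptions for illustration; the input is assumed to be sorted by time.

```python
from datetime import datetime, timedelta
from statistics import mean


def aggregation_features(events, now, window=timedelta(days=7)):
    """Summarize a time-sorted list of event dicts into sequence-level features."""
    # Events that fall inside the look-back window ending at `now`.
    recent = [e for e in events if timedelta(0) <= now - e["ts"] <= window]
    # Gaps in seconds between consecutive interactions.
    gaps = [(b["ts"] - a["ts"]).total_seconds() for a, b in zip(events, events[1:])]
    prices = [e["price"] for e in events if e.get("price") is not None]
    # Observed span in days, floored at one day to avoid dividing by a tiny number.
    span_days = 1.0
    if len(events) > 1:
        span_days = max((events[-1]["ts"] - events[0]["ts"]).total_seconds() / 86400, 1.0)
    return {
        "clicks_last_7d": sum(1 for e in recent if e["action"] == "click"),
        "purchases_last_7d": sum(1 for e in recent if e["action"] == "purchase"),
        "avg_price": mean(prices) if prices else None,
        "max_gap_s": max(gaps) if gaps else None,
        "min_gap_s": min(gaps) if gaps else None,
        "avg_gap_s": mean(gaps) if gaps else None,
        "events_per_day": len(events) / span_days,
    }


# Made-up history for one user.
history = [
    {"ts": datetime(2024, 5, 1, 10, 0), "action": "view", "price": 20.0},
    {"ts": datetime(2024, 5, 3, 10, 0), "action": "click", "price": 40.0},
    {"ts": datetime(2024, 5, 9, 10, 0), "action": "purchase", "price": 60.0},
    {"ts": datetime(2024, 5, 9, 12, 0), "action": "purchase", "price": 80.0},
]
features = aggregation_features(history, now=datetime(2024, 5, 10))
```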
Session-based Features
Session length: Number of events or duration of the session.
Session activity type: Percentage of clicks, purchases, or searches within the session.
Sequence of items/events within a session: Representing the order of actions taken by the user, for example, viewing product A, then B, then adding B to the cart.
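Session features first require splitting the event stream into sessions. The sketch below assumes a session ends after 30 minutes of inactivity, which is a common convention but not something the text above prescribes; the event fields and function names are likewise illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group time-sorted events into sessions separated by inactivity gaps."""
    sessions = []
    for e in events:
        if sessions and e["ts"] - sessions[-1][-1]["ts"] <= max_gap:
            sessions[-1].append(e)  # still within the current session
        else:
            sessions.append([e])  # gap too large: start a new session
    return sessions


def session_features(session):
    """Compute length, activity mix, and ordered steps for one session."""
    counts = Counter(e["action"] for e in session)
    n = len(session)
    return {
        "num_events": n,
        "duration_s": (session[-1]["ts"] - session[0]["ts"]).total_seconds(),
        "action_share": {action: c / n for action, c in counts.items()},
        "ordered_steps": [(e["item"], e["action"]) for e in session],
    }


# Made-up events: view A, view B, add B to cart, then a much later view of C.
events = [
    {"ts": datetime(2024, 5, 6, 10, 0), "item": "A", "action": "view"},
    {"ts": datetime(2024, 5, 6, 10, 5), "item": "B", "action": "view"},
    {"ts": datetime(2024, 5, 6, 10, 7), "item": "B", "action": "add_to_cart"},
    {"ts": datetime(2024, 5, 6, 14, 0), "item": "C", "action": "view"},
]
sessions = split_sessions(events)
first = session_features(sessions[0])
```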
Temporal Order Features
Lag features: Features from previous interactions (e.g., the last item viewed, the type of the second-to-last event). GeeksforGeeks notes that lag features are a fundamental technique for time-series data.
Positional embeddings: Add positional information to sequence embeddings to capture the order of events.
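Both ideas can be illustrated in a few lines. The lag helper simply reads the tail of the sequence. For positions, the sketch uses the fixed sinusoidal encoding from the Transformer literature as one possible choice (learned positional embeddings are an equally valid alternative); the function names and the toy embedding are assumptions.

```python
import math


def lag_features(sequence, num_lags=2, pad=None):
    """Return the last `num_lags` elements, most recent first, padded if too short."""
    lags = list(reversed(sequence[-num_lags:])) if num_lags > 0 else []
    return lags + [pad] * (num_lags - len(lags))


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding: even indices use sin, odd indices use cos."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_position(embedding, pos):
    """Add the positional encoding element-wise to an event embedding."""
    return [x + p for x, p in zip(embedding, sinusoidal_position(pos, len(embedding)))]


categories = ["Electronics", "Clothing", "Books"]
last_two = lag_features(categories)           # lag-1 and lag-2 categories
short = lag_features(["Books"], num_lags=2)   # padded when history is short
encoded = add_position([0.5, 0.5, 0.5, 0.5], pos=0)
```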
3. User-level features
Long-Term Preference Features: Summarize user preferences over a long period:
Most frequent categories: The top categories a user interacts with or purchases from.
Overall spending patterns: Average purchase value, total purchases, etc.
Average interaction count: Average number of interactions per day or week.
User Embeddings
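A minimal sketch of the long-term preference summaries, assuming each event carries a `category`, an `action`, an optional `price`, and an ISO `date` string (all hypothetical field names). Interaction frequency is computed per active day here, which is only one of several reasonable definitions.

```python
from collections import Counter
from statistics import mean


def user_level_features(events, top_k=2):
    """Summarize a user's full history into long-term preference features."""
    purchases = [e for e in events if e["action"] == "purchase"]
    category_counts = Counter(e["category"] for e in purchases)
    active_days = {e["date"] for e in events}
    return {
        # Categories with the most purchases, most frequent first.
        "top_purchased_categories": [c for c, _ in category_counts.most_common(top_k)],
        "total_purchases": len(purchases),
        "avg_purchase_value": mean(e["price"] for e in purchases) if purchases else None,
        # Average number of interactions on days with at least one event.
        "interactions_per_active_day": len(events) / len(active_days) if active_days else 0.0,
    }


# Made-up history for one user.
history = [
    {"date": "2024-05-01", "category": "Electronics", "action": "purchase", "price": 100.0},
    {"date": "2024-05-01", "category": "Books", "action": "view", "price": None},
    {"date": "2024-05-02", "category": "Electronics", "action": "purchase", "price": 300.0},
    {"date": "2024-05-02", "category": "Books", "action": "purchase", "price": 20.0},
    {"date": "2024-05-03", "category": "Clothing", "action": "view", "price": None},
    {"date": "2024-05-03", "category": "Electronics", "action": "view", "price": None},
]
profile = user_level_features(history)
```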
4. Interaction features (between user and item/context)
User-Item Similarity: Calculate the similarity between the current item and previous items the user interacted with.
Time Since Last Interaction with Item: Capturing recency of interest in a particular item.
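The two interaction features might look like this in code. Cosine similarity averaged over the history is an assumption chosen for simplicity (max-pooling or attention weights are other options), and the toy 2-dimensional item vectors stand in for real item embeddings.

```python
import math
from datetime import datetime


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def user_item_similarity(candidate_vec, history_vecs):
    """Mean cosine similarity between the candidate item and each past item."""
    if not history_vecs:
        return 0.0
    return sum(cosine_similarity(candidate_vec, h) for h in history_vecs) / len(history_vecs)


def seconds_since_last_interaction(history, item_id, now):
    """Seconds since the user last touched `item_id`, or None if never seen."""
    times = [e["ts"] for e in history if e["item"] == item_id]
    return (now - max(times)).total_seconds() if times else None


history = [
    {"item": "A", "ts": datetime(2024, 5, 6, 9, 0)},
    {"item": "B", "ts": datetime(2024, 5, 6, 9, 30)},
    {"item": "A", "ts": datetime(2024, 5, 6, 10, 0)},
]
similarity = user_item_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
recency = seconds_since_last_interaction(history, "A", now=datetime(2024, 5, 6, 11, 0))
```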
How to store long-term user behavior sequence features in offline data lake storage?
- Schema design: see the schema below
- File formats: columnar formats (Parquet or ORC)
- Partitioning strategies: date-based partitioning, user ID/device ID partitioning
- Data ingestion and processing: batch ingestion, data enrichment and transformation
- Lifecycle management and cost optimization: retention policies
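To make the partitioning and retention items concrete, here is a small sketch of a Hive-style directory layout for the raw event log. Everything in it is an assumption for illustration: the `dt=`/`bucket=` naming, the hypothetical `s3://lake/events` root, hashing user IDs into 64 buckets (partitioning on raw user IDs would typically create far too many small partitions), and the 365-day retention period.

```python
import hashlib
from datetime import date


def user_bucket(user_id: str, num_buckets: int = 64) -> int:
    """Map a user ID to a stable bucket; md5 is used because hash() is salted per process."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def partition_path(root: str, event_date: date, user_id: str, num_buckets: int = 64) -> str:
    """Build a date-then-bucket partition directory for one event."""
    bucket = user_bucket(user_id, num_buckets)
    return f"{root}/dt={event_date.isoformat()}/bucket={bucket:03d}"


def is_expired(partition_date: date, today: date, retention_days: int = 365) -> bool:
    """Retention policy check: True if the partition is older than the retention period."""
    return (today - partition_date).days > retention_days


path = partition_path("s3://lake/events", date(2024, 5, 6), "u1")
```

Date-first partitioning keeps retention cheap, since expiring old data becomes a matter of dropping whole `dt=` directories.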
Schema design:
user_id: string
name: string
gender: string
behavior_sequence: array<struct<timestamp: timestamp, category_id: int, action_type: string, product_id: string, price: double>>
How to efficiently update the behavior_sequence field when the same user has new behavior?
Merge operations (upserts via MERGE SQL): A single MERGE statement updates existing records (matching on user_id and appending to behavior_sequence) and inserts records for users that are not yet in the table.
MERGE INTO target_delta_table AS target
USING source_data AS source
ON target.user_id = source.user_id
WHEN MATCHED THEN
  UPDATE SET target.behavior_sequence = array_append(target.behavior_sequence, source.new_behavior)
WHEN NOT MATCHED THEN
  INSERT (user_id, name, behavior_sequence)
  VALUES (source.user_id, source.name, array(source.new_behavior));
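One practical caveat: a MERGE of this shape generally expects at most one source row per user_id, and Delta Lake typically rejects a merge in which several source rows match the same target row. A common workaround is to group each user's new behaviors into one array before merging. The pure-Python sketch below mimics that upsert logic on an in-memory dict standing in for the table; the function names and the row layout are assumptions for illustration.

```python
from collections import defaultdict


def group_new_behaviors(source_rows):
    """Collapse source rows to one entry per user, with behaviors sorted by timestamp."""
    grouped = defaultdict(lambda: {"name": None, "new_behaviors": []})
    for row in source_rows:
        entry = grouped[row["user_id"]]
        entry["name"] = row["name"]
        entry["new_behaviors"].append(row["new_behavior"])
    for entry in grouped.values():
        # ISO-8601 timestamp strings sort chronologically.
        entry["new_behaviors"].sort(key=lambda b: b["timestamp"])
    return dict(grouped)


def merge_behaviors(target, source_rows):
    """Upsert into `target`, a dict keyed by user_id that stands in for the table."""
    for user_id, entry in group_new_behaviors(source_rows).items():
        if user_id in target:
            # WHEN MATCHED: append the new behaviors to the existing sequence.
            target[user_id]["behavior_sequence"].extend(entry["new_behaviors"])
        else:
            # WHEN NOT MATCHED: insert a new user row.
            target[user_id] = {"name": entry["name"], "behavior_sequence": entry["new_behaviors"]}
    return target


table = {
    "u1": {"name": "Ann", "behavior_sequence": [
        {"timestamp": "2024-05-01T10:00:00", "action_type": "view"},
    ]},
}
new_rows = [
    {"user_id": "u1", "name": "Ann", "new_behavior": {"timestamp": "2024-05-02T09:00:00", "action_type": "purchase"}},
    {"user_id": "u1", "name": "Ann", "new_behavior": {"timestamp": "2024-05-02T08:00:00", "action_type": "click"}},
    {"user_id": "u2", "name": "Bob", "new_behavior": {"timestamp": "2024-05-02T07:00:00", "action_type": "view"}},
]
merge_behaviors(table, new_rows)
```

In SQL, the same idea would mean aggregating the source by user_id first and then concatenating arrays in the MATCHED branch instead of appending a single element.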
References
Conversation log with Google