PHP-ETL - Understand the ETL
Item types
ETL Item Types
This ETL framework processes data through various “operations,” each receiving and returning “items” in a chain. Each item type serves a specific function in the ETL process, allowing for control over data extraction, transformation, and loading, as well as file handling and flow control.
Legend
- 📥 - Can be received as input
- 📤 - Can be returned as output
1. DataItem 📥|📤
Encapsulates the data passed between operations. A DataItem
contains data to be transformed or processed within each step in the chain.
Usage: Allows data to flow from one operation to the next.
2. FileExtractedItem 📥|📤
Generated after an operation has finished reading a file. This item is essential for post-read operations like file archival or deletion. Each line or data entry in the file generates a DataItem
before the final FileExtractedItem
is returned.
Usage: Signals the completion of file reading, enabling downstream operations to handle the file.
3. FileLoadedItem 📥|📤
Returned once all DataItems
have been written to a file and the file handler is closed. This allows post-write operations like moving the file to external storage (e.g., SFTP, FTP, cloud storage) or archiving it.
Usage: Marks file load completion, enabling file transfers or other final transformations.
Complex Items
These items are either generated or consumed by the ETL’s internal ChainProcessor
, adding further control to data flow and item management within the chain.
4. GroupedItem ❌|📤
Contains an iterator, allowing data to be processed incrementally without loading the entire dataset into memory. Although an operation can return a GroupedItem
, it cannot receive one as input.
Usage: Allows iterative data extraction, keeping memory usage low by handling data as individual items downstream.
5. ChainBreak ❌|📤
Used to stop the chain for a specific item. When an operation returns a ChainBreak
, the ChainProcessor
halts further processing for the associated DataItem
.
Usage: Prevents specific DataItems
from proceeding further in the ETL chain.
6. MixItem ❌|📤
Enables returning multiple item types simultaneously. For instance, if an operation needs to return a GroupedItem
along with a FileExtractedItem
, it can use a MixItem
. Although it can be returned, it is not a valid input type for operations.
Usage: Supports complex outputs by encapsulating various item types together.
7. StopItem 📥|📤
Signifies the end of data in the ETL chain and cannot be newly instantiated. A StopItem
is sent through the chain when the ETL input is exhausted. Operations like file loading handle the StopItem
by performing cleanup tasks (e.g., closing file handles). Once processed, a StopItem
is returned either directly or within a MixItem
to signal the ETL chain’s end.
Usage: Marks the end of data processing, triggering cleanup or finalization steps within operations.
8. AsynchronousItem ❌|📤
The AsynchronousItem
is returned by an operation when processing occurs asynchronously, enabling non-blocking, background tasks within the ETL chain. This item type allows an operation to initiate a process that completes outside the main processing thread, letting the chain continue to execute other operations in parallel.
The ChainProcessor
monitors each AsynchronousItem
periodically and, once an asynchronous task is completed, the chain resumes with the item encapsulated within the AsynchronousItem
. By default, the ChainProcessor
can handle up to 10 asynchronous jobs concurrently; once this limit is reached, further processing pauses until at least one asynchronous task completes.
Usage: Facilitates background processing, enabling parallel execution of time-intensive operations without holding up the ETL chain.
Example: HttpClient Operation
The native HttpClient
operation is a prime example of an asynchronous process that returns an AsynchronousItem
. When HttpClient
is used to make external API requests or perform network-related tasks, it returns an AsynchronousItem
so the ETL process can proceed without waiting for the network response to complete. The ChainProcessor
will monitor the HttpClient
task and continue processing other items. Once the network response is received, the chain resumes processing with the data received from the HttpClient
operation.