Welcome to the first lesson of "Building an Async CLI Tool for ETL Pipelines in Python", and congratulations on reaching this milestone! So far, you've completed four comprehensive courses in this learning path, building expertise in advanced Python techniques such as dunder methods, dataclasses, descriptors, metaclasses, functional patterns, pattern matching, and async I/O. That's a remarkable achievement, and you've earned the right to tackle this capstone project, where we'll integrate everything you've learned into a complete, production-ready command-line application.
Over the next five lessons, we'll build LedgerLift, a small but robust asynchronous ETL (Extract, Transform, Load) tool designed for processing financial transactions. This isn't a toy example; we'll implement industrial-strength patterns: self-validating domain models, streaming parsers, structural pattern matching for routing, async pipelines with natural backpressure, and a polished CLI interface. By the end, you'll have a portfolio piece that demonstrates advanced Python engineering skills.
Today's lesson focuses on the foundation: Domain Model & Validation. We'll design a robust, self-validating domain model using frozen, slotted dataclasses that enforce business rules at initialization time. You'll implement a Money value type with automatic currency normalization and decimal precision, create a reusable Range descriptor for constrained fields, and build a Transaction dataclass that integrates both. These components will validate every field, normalize inputs, and reject invalid data with clear error messages, ensuring that only well-formed transactions enter our pipeline. This validation layer is the bedrock of data quality; once it's in place, downstream components can operate with confidence, knowing the data is always correct. Let's begin by understanding why domain models matter in data pipelines.
ETL systems move data between systems: extracting from sources, transforming it, and loading it into destinations. The transformation step is where business logic lives, and that logic operates on a domain model: a structured representation of your business entities with their rules and constraints. Without a strong domain model, you're processing raw dictionaries or tuples, relying on scattered validation checks and hoping you didn't miss edge cases.
A well-designed domain model brings three critical benefits. First, centralized validation ensures that business rules are enforced in one place rather than scattered across parsing, transformation, and loading code. If a transaction requires a positive amount, that rule lives in the Transaction class itself; every code path that creates a transaction automatically enforces this constraint. Second, type safety gives you confidence that once a Transaction object exists, it's valid and complete. You don't need defensive checks throughout your codebase because the model guarantees correctness. Third, clear error boundaries mean that invalid data is rejected early with specific error messages. Instead of a cryptic failure deep in the pipeline, users see "amount must be > 0.00" at the entry point, where they can fix it.
In this lesson, we'll build domain models using frozen dataclasses with slots. Frozen dataclasses are immutable, which is perfect for data pipelines where records shouldn't change once created; this immutability prevents accidental modifications and makes the code easier to reason about. Slots reduce memory overhead by storing attributes in a fixed structure rather than a dictionary; when processing thousands of transactions, this efficiency adds up. The combination of frozen and slots gives us robust, efficient value types that feel native to Python.
Financial applications require precise decimal arithmetic and currency awareness. Floating-point numbers introduce rounding errors that compound over thousands of transactions; the Decimal type solves this by providing arbitrary-precision arithmetic. Currency codes ensure we never mix incompatible units; adding USD to EUR without conversion is a business logic error that should be caught immediately.
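A two-line demonstration makes the difference concrete:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so errors creep in:
print(0.1 + 0.2)                        # 0.30000000000000004

# Decimal arithmetic on string inputs is exact:
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```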
Let's implement a Money dataclass that encapsulates an amount and currency with automatic validation and normalization:
The frozen=True parameter makes instances immutable after creation; slots=True stores attributes efficiently; kw_only=True requires keyword arguments at construction time for clarity. These parameters create a robust value type that cannot be modified accidentally. The __post_init__ method runs immediately after __init__, providing a hook for validation and normalization. Since the dataclass is frozen, we cannot assign attributes directly; instead, we use object.__setattr__(self, name, value) to bypass the immutability restriction during initialization.
The validation logic normalizes the currency by converting it to uppercase (accepting "usd", "USD", or "Usd" as equivalent), then checks that it's exactly three alphabetic characters. This catches common errors like empty strings, numeric codes, or typos. The amount normalization converts any input (string, int, or float) to a Decimal, then quantizes it to exactly two decimal places using banker's rounding (ROUND_HALF_EVEN). This rounding mode is standard in financial systems because it minimizes cumulative bias: ties round to the nearest even digit, so a trailing 5 in the third decimal place sends 2.125 down to 2.12 but 2.135 up to 2.14.
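You can watch banker's rounding break ties at the cent boundary:

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")

for raw in ("2.125", "2.135", "2.145"):
    # A trailing 5 is exactly halfway; the result keeps an even last digit.
    print(Decimal(raw).quantize(CENT, rounding=ROUND_HALF_EVEN))
# 2.12, 2.14, 2.14
```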
To make Money objects readable in logs and output, we implement __str__:
This simple method returns a formatted string like "10.50 USD" or "5.00 EUR", making it easy to include money values in print statements and error messages. The __repr__ method (automatically generated by the dataclass) still shows the full constructor form for debugging, while __str__ provides a human-friendly representation for user-facing output.
Descriptors are Python's mechanism for reusable attribute logic. You built a Range validator in a previous course; now we'll apply it to constrain transaction quantities. A descriptor implements __get__, __set__, and __delete__ to intercept attribute access and enforce rules centrally.
Let's examine the Range descriptor structure:
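The constructor portion might look like this (the attribute names `_name` and `_storage` follow the lesson; the `None` placeholders are an assumption until `__set_name__` fills them in):

```python
class Range:
    """Descriptor that constrains a numeric attribute to optional [min, max] bounds."""

    def __init__(self, *, min=None, max=None, coerce=int):
        self._coerce = coerce
        # Coerce the bounds up front so later comparisons are type-consistent.
        self._min = None if min is None else coerce(min)
        self._max = None if max is None else coerce(max)
        # Filled in later by __set_name__.
        self._name = None
        self._storage = None
```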
The Range descriptor accepts optional min and max bounds plus a coercion function (defaulting to int). The coercion function normalizes incoming values to the expected type; for example, if someone passes a string "5", it converts to the integer 5. The min and max values are also coerced at initialization to ensure type consistency. The _name and _storage attributes will be set by __set_name__ to track the descriptor's name in the owner class and the private attribute where the value is stored.
The descriptor protocol consists of __get__, __set__, and __delete__. We won't need __delete__ for this demonstration, since our frozen dataclasses don't support attribute deletion. We also implement __set_name__, a helper method that Python calls automatically when the descriptor is assigned to a class attribute, which lets us discover the attribute name:
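Those two methods can be sketched like this, with a condensed constructor so the snippet stands on its own:

```python
class Range:
    def __init__(self, *, min=None, max=None, coerce=int):
        self._coerce = coerce
        self._min = None if min is None else coerce(min)
        self._max = None if max is None else coerce(max)
        self._name = self._storage = None

    def __set_name__(self, owner, name):
        # Called automatically when the owner class is created:
        # remember the public name and derive the private storage name.
        self._name = name
        self._storage = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access returns the descriptor itself
        return getattr(instance, self._storage)


class Demo:
    qty = Range(min=1)


print(Demo.qty._name, Demo.qty._storage)  # qty _qty
```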
The __set_name__ method is called when the class is created, receiving the owner class and the attribute name. We store the name for error messages and create a storage attribute name by prefixing with an underscore; if the attribute is qty, the storage is _qty. The __get__ method retrieves the value from the private storage attribute; when accessed on the class itself (instance is None), it returns the descriptor object for introspection.
The __set__ method enforces validation when a value is assigned. It coerces the value to the expected type, checks if it violates the minimum bound (if set), checks the maximum bound (if set), and raises a descriptive ValueError for violations. If validation passes, it stores the value using object.__setattr__ to bypass frozen dataclass restrictions. This pattern allows descriptors to work seamlessly with immutable classes by using the lower-level setattr mechanism during controlled initialization.
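Putting the whole descriptor together, here it is exercised on an ordinary class; because object.__setattr__ works on normal instances too, the same code serves frozen dataclasses unchanged:

```python
class Range:
    def __init__(self, *, min=None, max=None, coerce=int):
        self._coerce = coerce
        self._min = None if min is None else coerce(min)
        self._max = None if max is None else coerce(max)
        self._name = self._storage = None

    def __set_name__(self, owner, name):
        self._name = name
        self._storage = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self._storage)

    def __set__(self, instance, value):
        v = self._coerce(value)  # e.g. "5" -> 5
        if self._min is not None and v < self._min:
            raise ValueError(f"{self._name} must be >= {self._min}")
        if self._max is not None and v > self._max:
            raise ValueError(f"{self._name} must be <= {self._max}")
        # Low-level write that also works inside frozen dataclasses.
        object.__setattr__(instance, self._storage, v)


class Order:
    qty = Range(min=1, max=100)


order = Order()
order.qty = "5"     # coerced to int, then validated
print(order.qty)    # 5
```

Assigning `order.qty = 0` would raise `ValueError: qty must be >= 1`.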
Now we integrate Money and Range into a Transaction dataclass representing a single ledger entry:
The Transaction class uses the same frozen and slots parameters as Money for immutability and efficiency. Each transaction has a unique id, an operation type (op), an account name, a Money amount, and a quantity. The quantity management is subtle: qty_in is the input parameter (with repr=False to hide it from the string representation), _qty is the private storage (with init=False so it's not a constructor parameter), and qty is the descriptor that manages validation.
This structure might seem complex, but it solves a specific problem with frozen dataclasses. The descriptor needs to validate and store the quantity, but in a frozen dataclass, we cannot assign to attributes directly. By declaring qty as a class attribute (the descriptor), qty_in as the initialization parameter, and _qty as the storage, we create a clean interface where users pass qty_in at construction, the descriptor validates it, and accessing qty returns the validated value.
The __post_init__ method enforces all business rules:
The validation unfolds in three phases. First, we normalize inputs: the operation is converted to lowercase and stripped of whitespace, ensuring that "Add", " add ", and "ADD" all become "add". The account name is collapsed to single spaces between words and stripped, so " Sales North " becomes "Sales North". Second, we validate business rules: the operation must be either "add" or "refund" (catching typos like "ad" or "remove"), the account must not be empty after normalization, and the amount must be positive (zero or negative amounts are business logic errors).
Third, we assign the normalized values using object.__setattr__ to bypass frozen restrictions. The final line invokes the descriptor's __set__ method explicitly by accessing it on the class (type(self).qty) and calling it with the instance and input value. This triggers the Range validation, which checks that qty_in is at least 1 and stores the validated value in _qty. This explicit invocation is necessary because descriptors are designed for normal attribute assignment (self.qty = value), which frozen dataclasses disallow; by calling __set__ directly, we leverage descriptor validation during the initialization window when modification is still permitted.
The qty_total_amount method multiplies the amount by quantity and quantizes the result to two decimal places using banker's rounding. This ensures that the total has the same precision as the original amount, preventing rounding errors from accumulating. The total method returns a new Money object with the computed amount and the original currency, providing a clean, type-safe interface for calculating totals.
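The arithmetic at the heart of those methods, extracted into a standalone helper for illustration:

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def qty_total_amount(amount: Decimal, qty: int) -> Decimal:
    # Multiply, then re-quantize so the total keeps two-decimal precision.
    return (amount * qty).quantize(CENT, rounding=ROUND_HALF_EVEN)


print(qty_total_amount(Decimal("10.50"), 3))  # 31.50
print(qty_total_amount(Decimal("2.50"), 2))   # 5.00
```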
With our domain model complete, let's see it in action parsing CSV data. The main script opens a CSV file and attempts to create Transaction objects from each row:
The script reads the data file path from an environment variable for flexibility, opens it with UTF-8 encoding, and creates a DictReader that yields each row as a dictionary. For each row, we attempt to construct a Transaction object inside a try block. If construction succeeds, we print the normalized operation, account, amount, quantity, and computed total. If any validation fails, the except block catches the exception and prints an error message with the row ID and the specific validation error.
This error handling strategy demonstrates fail-fast validation with clear diagnostics. Invalid rows are reported immediately without crashing the entire process, and the error messages come directly from the domain model's validation logic, making it easy to identify and fix data quality issues. The domain model acts as a gatekeeper: only valid transactions proceed through the pipeline, while invalid data is caught at the entry point.
Let's examine the output to see how validation and normalization work:
The first eight lines show successful parsing and validation. Notice how operations are normalized to lowercase ("add", "refund") and account names maintain proper spacing. The amounts are formatted to two decimal places, and totals are computed correctly: transaction 1 with 10.50 USD at quantity 3 yields 31.50 USD. Transaction 5 with 2.50 EUR at quantity 2 yields 5.00 EUR, demonstrating that multiplication maintains currency precision.
The error lines reveal validation failures with clear diagnostics. Transaction 9 fails with "amount must be > 0.00", indicating a zero or negative amount in the input. Transaction 10 has an invalid operation type (perhaps "debit" or "credit" instead of "add" or "refund"). Transaction 11 has an empty account name after normalization. Transaction 12 encounters a decimal conversion error, likely a non-numeric amount like "abc" that cannot be parsed as a Decimal. The final line shows that even unusual currency codes like "XYZ" are accepted as long as they're three alphabetic characters, demonstrating that our validation is precise but not overly restrictive.
Congratulations on completing the first lesson of this capstone course! You've built a production-grade domain model that validates every field, normalizes inputs, and enforces business rules at initialization time. The Money value type provides currency-aware decimal arithmetic; the Range descriptor offers reusable field validation that works seamlessly with frozen dataclasses; and the Transaction dataclass integrates both while maintaining immutability and efficiency through slots.
This validation layer is the foundation of data quality in LedgerLift. By catching errors at the entry point with clear, specific messages, we prevent invalid data from flowing through the pipeline and causing subtle bugs downstream. The domain model also serves as living documentation: reading the Transaction class definition immediately reveals the business rules (operations must be 'add' or 'refund', accounts cannot be empty, amounts must be positive, quantities must be at least 1).
Looking ahead, the next lesson will build parsers that stream CSV and JSON Lines formats, converting raw records into validated Transaction objects using the domain model we built today. You'll implement robust parsing routines that handle malformed input gracefully, reporting errors without crashing and producing a stream of clean, strongly-typed transactions ready for processing. The combination of streaming parsers and validated models will form a resilient ingestion layer for our ETL tool.
Now it's time to put your knowledge into practice! The upcoming exercises will challenge you to implement these patterns yourself: you'll complete the Money validation logic, build the Range descriptor's core validation, implement Transaction business rules, integrate the descriptor properly, and wire up the error-handling loop in the main script. These hands-on tasks will solidify your understanding and prepare you to apply these techniques confidently in your own projects. Let's build something robust together!
