DataSinkMergeCDCOptions

`laktory.models.DataSinkMergeCDCOptions`

Bases: BaseModel

Options for merging change data capture (CDC) events into a target table.

These options are also used to build the target with the `apply_changes` method when running on Databricks DLT.

Examples:

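from pyspark.sql import SparkSession

from laktory import models

# The example needs an active Spark session.
spark = SparkSession.builder.getOrCreate()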

df = spark.createDataFrame(
    [
        {"id": 1, "value": 3.0},
        {"id": 2, "value": 2.3},
        {"id": 3, "value": 7.7},
    ]
)

sink = models.FileDataSink(
    path="./my_table/",
    format="DELTA",
    mode="MERGE",
    merge_cdc_options={
        "scd_type": 1,
        "primary_keys": ["id"],
    },
)
# sink.write(df)
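The sink above uses SCD type 1, which overwrites matched rows. For comparison, the following pure-Python sketch shows the kind of bookkeeping SCD type 2 involves: every version of a row is kept, and validity is tracked with the default `__start_at` and `__end_at` column names. The `merge_scd2` helper, the `seq` ordering column, and the in-memory list are assumptions made for illustration. This is not Laktory's implementation, and it assumes events arrive already ordered.

```python
from typing import Any

Row = dict[str, Any]


def merge_scd2(
    history: list[Row], event: Row, primary_keys: list[str], order_by: str
) -> list[Row]:
    """Append a new version of a row and close the previously active one."""
    start_col, end_col = "__start_at", "__end_at"
    for row in history:
        same_key = all(row[k] == event[k] for k in primary_keys)
        if same_key and row[end_col] is None:
            # Close the currently active version at the new event's sequence value.
            row[end_col] = event[order_by]
    new_row = dict(event)
    new_row[start_col] = event[order_by]
    new_row[end_col] = None  # None marks the active version in this sketch
    history.append(new_row)
    return history


history: list[Row] = []
merge_scd2(history, {"id": 1, "value": 3.0, "seq": 1}, ["id"], "seq")
merge_scd2(history, {"id": 1, "value": 4.5, "seq": 2}, ["id"], "seq")
for row in history:
    print(row)
```

After the second event, the first version is closed (`__end_at` equals 2) and the second version is the active one.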


PARAMETER	DESCRIPTION
`variables`	Dict of variables to be injected in the model at runtime TYPE: `dict[str, Any]` DEFAULT: `{}`
`delete_where`	Specifies when a CDC event should be treated as a DELETE rather than an upsert. TYPE: `str \| VariableType` DEFAULT: `None`
`end_at_column_name`	When using SCD type 2, name of the column storing the end time (or sequencing index) during which a row is active. This attribute is not used when using Databricks DLT which does not allow column rename. TYPE: `str \| VariableType` DEFAULT: `'__end_at'`
`exclude_columns`	A subset of columns to exclude in the target table. TYPE: `list[Union[str, VariableType]] \| VariableType` DEFAULT: `None`
`ignore_null_updates`	Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and ignore_null_updates is `True`, columns with a null will retain their existing values in the target. This also applies to nested columns with a value of null. When ignore_null_updates is `False`, existing values will be overwritten with null values. TYPE: `bool \| VariableType` DEFAULT: `False`
`include_columns`	A subset of columns to include in the target table. Use `include_columns` to specify the complete list of columns to include. TYPE: `list[Union[str, VariableType]] \| VariableType` DEFAULT: `None`
`order_by`	The column name specifying the logical order of CDC events in the source data. Used to handle change events that arrive out of order. TYPE: `str \| VariableType` DEFAULT: `None`
`primary_keys`	The column or combination of columns that uniquely identify a row in the source data. This is used to identify which CDC events apply to specific records in the target table. TYPE: `list[Union[str, VariableType]] \| VariableType` DEFAULT: `None`
`scd_type`	Whether to store records as SCD type 1 or SCD type 2. TYPE: `Literal[1, 2] \| VariableType` DEFAULT: `1`
`start_at_column_name`	When using SCD type 2, name of the column storing the start time (or sequencing index) during which a row is active. This attribute is not used when using Databricks DLT which does not allow column rename. TYPE: `str \| VariableType` DEFAULT: `'__start_at'`
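To make the interaction between `primary_keys`, `order_by`, `delete_where` and `ignore_null_updates` more concrete, here is a minimal pure-Python sketch of SCD type 1 semantics. It is a simplification under stated assumptions: `merge_scd1` is a hypothetical helper, the target is an in-memory dict, and `delete_where` is modeled as a Python callable, whereas the actual option is a string. It does not reflect how Laktory or Delta performs the merge.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def merge_scd1(
    target: dict[tuple, Row],
    events: list[Row],
    primary_keys: list[str],
    order_by: str,
    delete_where: Optional[Callable[[Row], bool]] = None,
    ignore_null_updates: bool = False,
) -> dict[tuple, Row]:
    """Apply CDC events to an in-memory target keyed by primary key."""
    # Apply events in their logical order, whatever their arrival order.
    for event in sorted(events, key=lambda e: e[order_by]):
        key = tuple(event[k] for k in primary_keys)
        if delete_where is not None and delete_where(event):
            target.pop(key, None)
            continue
        existing = target.get(key)
        if existing is None:
            target[key] = dict(event)
            continue
        for col, val in event.items():
            if val is None and ignore_null_updates:
                continue  # keep the existing value
            existing[col] = val
    return target


events = [
    {"id": 2, "value": None, "op": "delete", "seq": 4},  # arrives out of order
    {"id": 1, "value": 3.0, "op": "upsert", "seq": 1},
    {"id": 2, "value": 2.3, "op": "upsert", "seq": 2},
    {"id": 1, "value": None, "op": "upsert", "seq": 3},
]

result = merge_scd1(
    target={},
    events=events,
    primary_keys=["id"],
    order_by="seq",
    delete_where=lambda e: e["op"] == "delete",
    ignore_null_updates=True,
)
print(result)
```

Row 1 keeps its value of 3.0 because the later null update is ignored, and row 2 is removed by the delete event even though that event was listed first.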

METHOD	DESCRIPTION
`execute`	Merge the source DataFrame into the sink's target Delta table.
`inject_vars`	Inject model variable values into the model attributes.
`inject_vars_into_dump`	Inject model variable values into a model dump.
`model_validate_json_file`	Load a model from a JSON file object.
`model_validate_yaml`	Load a model from a YAML file object using laktory.yaml.RecursiveLoader. Supports references to external YAML and SQL files using the `!use`, `!extend` and `!update` tags.
`push_vars`	Push variable values to all child models recursively.
`validate_assignment_disabled`	Updating a model attribute inside a model validator when `validate_assignment` is `True` causes an infinite recursion by design, so it must be turned off temporarily.

`execute(source)`

Merge the source DataFrame into the sink's target Delta table.

PARAMETER	DESCRIPTION
`source`	Source DataFrame to merge into target (sink). TYPE: `AnyFrame`

Source code in laktory/models/datasinks/mergecdcoptions.py

def execute(self, source: AnyFrame):
    """
    Merge source into target delta from sink

    Parameters
    ----------
    source:
        Source DataFrame to merge into target (sink).
    """

    dataframe_backend = DataFrameBackends.from_nw_implementation(
        source.implementation
    )
    if dataframe_backend not in SUPPORTED_BACKENDS:
        raise NotImplementedError(
            f"DataFrame provided is of {dataframe_backend} backend, which is not currently implemented for merge operations."
        )

    source = source.to_native()

    from delta.tables import DeltaTable

    self._source_schema = source.schema
    spark = source.sparkSession

    if self.target_path:
        if not DeltaTable.isDeltaTable(spark, self.target_path):
            self._init_target(source)
    else:
        try:
            spark.catalog.getTable(self.target_name)
        except Exception:
            self._init_target(source)

    if source.isStreaming:
        if self.sink is None:
            raise ValueError("Sink value required to fetch checkpoint location.")

        if self.sink and self.sink.checkpoint_path is None:
            raise ValueError(
                f"Checkpoint location not specified for sink '{self.sink}'"
            )

        query = (
            source.writeStream.foreachBatch(
                lambda batch_df, batch_id: self._execute(source=batch_df)
            )
            .trigger(availableNow=True)
            .options(
                checkpointLocation=self.sink.checkpoint_path,
            )
            .start()
        )
        query.awaitTermination()

    else:
        self._execute(source=source)
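The control flow of the listing above can be summarized as: a static source is merged once, while a streaming source requires a checkpoint location and is merged one micro-batch at a time. The sketch below mimics that dispatch in plain Python. `dispatch_merge` and its arguments are hypothetical names, and Spark's `foreachBatch` is replaced by a simple loop over lists, so this only illustrates the branching and is not the library code.

```python
from typing import Callable, Optional

Batch = list[dict]


def dispatch_merge(
    batches: list[Batch],
    is_streaming: bool,
    checkpoint_path: Optional[str],
    merge_batch: Callable[[Batch], None],
) -> int:
    """Call `merge_batch` once for a static source, or once per micro-batch."""
    if not is_streaming:
        # A static source is handled as one single batch.
        merge_batch([row for batch in batches for row in batch])
        return 1
    if checkpoint_path is None:
        # Streaming progress cannot be tracked without a checkpoint.
        raise ValueError("Checkpoint location not specified")
    for batch in batches:
        merge_batch(batch)
    return len(batches)


merged: list[dict] = []
calls = dispatch_merge(
    batches=[[{"id": 1}], [{"id": 2}, {"id": 3}]],
    is_streaming=True,
    checkpoint_path="/tmp/checkpoints/my_table",  # hypothetical path
    merge_batch=merged.extend,
)
print(calls, merged)
```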

`inject_vars(inplace=False, vars=None)`

Inject model variable values into the model attributes.

PARAMETER	DESCRIPTION
`inplace`	If `True`, the model is modified in place and nothing is returned. Otherwise, a new model instance is returned. TYPE: `bool` DEFAULT: `False`
`vars`	A dictionary of variables to be injected in addition to the model internal variables. TYPE: `dict` DEFAULT: `None`

RETURNS	DESCRIPTION
	Model instance.

Examples:

from typing import Union

from laktory import models


class Cluster(models.BaseModel):
    name: str = None
    size: Union[int, str] = None


c = Cluster(
    name="cluster-${vars.my_cluster}",
    size="${{ 4 if vars.env == 'prod' else 2 }}",
    variables={
        "env": "dev",
    },
).inject_vars()
print(c)
# > variables={'env': 'dev'} name='cluster-${vars.my_cluster}' size=2

References

variables
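In the example output above, `${vars.my_cluster}` is left as-is because no variable with that name is defined. A minimal sketch of that behavior for simple `${vars.<name>}` references, using only the standard library, could look as follows. The regular expression and the `resolve_string` helper are assumptions made for this illustration; they are not Laktory's resolver and do not cover `${{ ... }}` expressions.

```python
import re
from typing import Any

# Matches references such as ${vars.env}; the pattern is an assumption of this sketch.
_VAR_PATTERN = re.compile(r"\$\{vars\.(\w+)\}")


def resolve_string(value: str, variables: dict[str, Any]) -> str:
    """Replace each known ${vars.<name>} reference; leave unknown ones untouched."""

    def _replace(match: re.Match) -> str:
        name = match.group(1)
        if name in variables:
            return str(variables[name])
        return match.group(0)  # unknown variable: keep the reference

    return _VAR_PATTERN.sub(_replace, value)


print(resolve_string("cluster-${vars.my_cluster}", {"env": "dev"}))
print(resolve_string("catalog_${vars.env}", {"env": "dev"}))
```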

Source code in laktory/models/basemodel.py

def inject_vars(self, inplace: bool = False, vars: dict = None):
    """
    Inject model variables values into a model attributes.

    Parameters
    ----------
    inplace:
        If `True` model is modified in place. Otherwise, a new model
        instance is returned.
    vars:
        A dictionary of variables to be injected in addition to the
        model internal variables.


    Returns
    -------
    :
        Model instance.

    Examples
    --------
    ```py
    from typing import Union

    from laktory import models


    class Cluster(models.BaseModel):
        name: str = None
        size: Union[int, str] = None


    c = Cluster(
        name="cluster-${vars.my_cluster}",
        size="${{ 4 if vars.env == 'prod' else 2 }}",
        variables={
            "env": "dev",
        },
    ).inject_vars()
    print(c)
    # > variables={'env': 'dev'} name='cluster-${vars.my_cluster}' size=2
    ```

    References
    ----------
    * [variables](https://www.laktory.ai/concepts/variables/)
    """

    # Fetching vars
    if vars is None:
        vars = {}

    vars = deepcopy(vars)
    vars.update(self.variables)

    # TODO: Review implementation as it results in serious performance hits
    # from laktory.models.pipeline import Pipeline
    # from laktory.models.pipeline import PipelineNode
    #
    # if isinstance(self, Pipeline):
    #     vars["_pl"] = self
    #
    # if isinstance(self, PipelineNode):
    #     vars["_pl_node"] = self

    # Create copy
    if not inplace:
        self = self.model_copy(deep=True)

    # Inject into field values
    for k in list(self.model_fields_set):
        if k == "variables":
            continue
        o = getattr(self, k)

        if isinstance(o, BaseModel) or isinstance(o, dict) or isinstance(o, list):
            # Mutable objects will be updated in place
            _resolve_values(o, vars)
        else:
            # Simple objects must be updated explicitly
            setattr(self, k, _resolve_value(o, vars))

    # Inject into child resources
    if hasattr(self, "core_resources"):
        for r in self.core_resources:
            if r == self:
                continue
            r.inject_vars(vars=vars, inplace=True)

    if not inplace:
        return self
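One consequence of the `deepcopy` followed by `vars.update(self.variables)` at the top of the listing is that the model's own variables take precedence over those passed through `vars`, and the caller's dictionary is not mutated. The following standalone sketch reproduces just that step; `merge_vars` is a hypothetical helper name.

```python
from copy import deepcopy
from typing import Any


def merge_vars(extra_vars: dict[str, Any], model_variables: dict[str, Any]) -> dict[str, Any]:
    """Combine both sets of variables the same way the listing does."""
    merged = deepcopy(extra_vars)   # work on a copy of the caller's dict
    merged.update(model_variables)  # model variables win on key conflicts
    return merged


extra_vars = {"env": "prod", "region": "eastus"}
model_variables = {"env": "dev"}

merged = merge_vars(extra_vars, model_variables)
print(merged)      # {'env': 'dev', 'region': 'eastus'}
print(extra_vars)  # {'env': 'prod', 'region': 'eastus'}
```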

`inject_vars_into_dump(dump, inplace=False, vars=None)`

Inject model variable values into a model dump.

PARAMETER	DESCRIPTION
`dump`	Model dump (or any other general purpose mutable object) TYPE: `dict[str, Any]`
`inplace`	If `True`, the dump is modified in place and nothing is returned. Otherwise, a modified copy of the dump is returned. TYPE: `bool` DEFAULT: `False`
`vars`	A dictionary of variables to be injected in addition to the model internal variables. TYPE: `dict[str, Any]` DEFAULT: `None`

RETURNS	DESCRIPTION
	Model dump with injected variables.

Examples:

from laktory import models

m = models.BaseModel(
    variables={
        "env": "dev",
    },
)
data = {
    "name": "cluster-${vars.my_cluster}",
    "size": "${{ 4 if vars.env == 'prod' else 2 }}",
}
print(m.inject_vars_into_dump(data))
# > {'name': 'cluster-${vars.my_cluster}', 'size': 2}

References

variables
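Because a dump can be nested, injection has to walk dictionaries and lists recursively, and the `inplace` flag decides whether the caller's object is touched. The sketch below illustrates both points with the standard library only. `resolve` and `inject_into_dump` are hypothetical helpers that handle simple `${vars.<name>}` references; they are not the library's `_resolve_values`.

```python
import copy
import re
from typing import Any

_VAR = re.compile(r"\$\{vars\.(\w+)\}")


def resolve(obj: Any, variables: dict[str, Any]) -> Any:
    """Recursively resolve references, updating dicts and lists in place."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            obj[key] = resolve(val, variables)
        return obj
    if isinstance(obj, list):
        for i, val in enumerate(obj):
            obj[i] = resolve(val, variables)
        return obj
    if isinstance(obj, str):
        # Unknown variables keep their original ${vars.<name>} text.
        return _VAR.sub(lambda m: str(variables.get(m.group(1), m.group(0))), obj)
    return obj


def inject_into_dump(dump: dict, variables: dict[str, Any], inplace: bool = False):
    """Return a resolved copy, or resolve `dump` itself when inplace is True."""
    if not inplace:
        dump = copy.deepcopy(dump)
    resolve(dump, variables)
    if not inplace:
        return dump


data = {"name": "cluster-${vars.env}", "tags": ["${vars.env}", "static"]}
out = inject_into_dump(data, {"env": "dev"})
print(out)   # {'name': 'cluster-dev', 'tags': ['dev', 'static']}
print(data)  # the original dump is unchanged
```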

Source code in laktory/models/basemodel.py

def inject_vars_into_dump(
    self, dump: dict[str, Any], inplace: bool = False, vars: dict[str, Any] = None
):
    """
    Inject model variables values into a model dump.

    Parameters
    ----------
    dump:
        Model dump (or any other general purpose mutable object)
    inplace:
        If `True` model is modified in place. Otherwise, a new model
        instance is returned.
    vars:
        A dictionary of variables to be injected in addition to the
        model internal variables.


    Returns
    -------
    :
        Model dump with injected variables.


    Examples
    --------
    ```py
    from laktory import models

    m = models.BaseModel(
        variables={
            "env": "dev",
        },
    )
    data = {
        "name": "cluster-${vars.my_cluster}",
        "size": "${{ 4 if vars.env == 'prod' else 2 }}",
    }
    print(m.inject_vars_into_dump(data))
    # > {'name': 'cluster-${vars.my_cluster}', 'size': 2}
    ```

    References
    ----------
    * [variables](https://www.laktory.ai/concepts/variables/)
    """

    # Setting vars
    if vars is None:
        vars = {}
    vars = deepcopy(vars)
    vars.update(self.variables)

    # Create copy
    if not inplace:
        dump = copy.deepcopy(dump)

    # Inject into field values
    _resolve_values(dump, vars)

    if not inplace:
        return dump

`model_validate_json_file(fp)` `classmethod`

Load model from json file object

PARAMETER	DESCRIPTION
`fp`	file object structured as a json file TYPE: `TextIO`

RETURNS	DESCRIPTION
`Model`	Model instance

Source code in laktory/models/basemodel.py

@classmethod
def model_validate_json_file(cls: Type[Model], fp: TextIO) -> Model:
    """
    Load model from json file object

    Parameters
    ----------
    fp:
        file object structured as a json file

    Returns
    -------
    :
        Model instance
    """
    data = json.load(fp)
    return cls.model_validate(data)
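Note that the method takes an open file object, not a path. The pattern is simply to parse the stream with `json.load` and then build the object from the resulting dictionary. A standard-library sketch of the same pattern is shown below, where the `MergeOptions` dataclass and its `from_json_file` method are hypothetical stand-ins for a model class and are not part of Laktory.

```python
import io
import json
from dataclasses import dataclass
from typing import Optional, TextIO


@dataclass
class MergeOptions:
    """Hypothetical stand-in for a model class."""

    scd_type: int = 1
    primary_keys: Optional[list] = None

    @classmethod
    def from_json_file(cls, fp: TextIO) -> "MergeOptions":
        # Parse the file object first, then build the object from the dict.
        data = json.load(fp)
        return cls(**data)


# Any text file object works; StringIO avoids touching the file system.
fp = io.StringIO('{"scd_type": 2, "primary_keys": ["id"]}')
options = MergeOptions.from_json_file(fp)
print(options)
```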

`model_validate_yaml(fp, vars=None)` `classmethod`

Load a model from a YAML file object using laktory.yaml.RecursiveLoader. Supports references to external YAML and SQL files using the `!use`, `!extend` and `!update` tags. Paths to external files can be defined using model or environment variables.

Referenced paths should always be relative to the file they are referenced from.

Custom Tags

- `!use {filepath}`: Directly inject the content of the file at `filepath`.
- `- !extend {filepath}`: Extend the current list with the elements found in the file at `filepath`. Similar to the Python `list.extend` method.
- `<<: !update {filepath}`: Merge the current dictionary with the content of the dictionary defined at `filepath`. Similar to the Python `dict.update` method.
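For readers less familiar with the two Python methods the tags are compared to, here is what `list.extend` and `dict.update` do on plain objects. The values standing in for the content of the referenced files are hypothetical, and the sketch does not claim anything about how the loader resolves key conflicts.

```python
# Hypothetical content of a referenced emails file.
shared_emails = ["info@example.com", "support@example.com"]
emails = ["jane.doe@example.com"]
emails.extend(shared_emails)  # appends each element, not the list itself
print(emails)

# Hypothetical content of a referenced common file.
common = {"country": "US", "currency": "USD"}
business = {"symbol": "aapl"}
business.update(common)  # adds the keys of `common` to `business`
print(business)
```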

PARAMETER	DESCRIPTION
`fp`	file object structured as a yaml file TYPE: `TextIO`
`vars`	Dict of variables available when parsing file path references in YAML files, e.g. `!use catalog_${vars.env}.yaml` DEFAULT: `None`

RETURNS	DESCRIPTION
`Model`	Model instance

Examples:

businesses:
  apple:
    symbol: aapl
    address: !use addresses.yaml
    <<: !update common.yaml
    emails:
      - jane.doe@apple.com
      - !extend emails.yaml
  amazon:
    symbol: amzn
    address: !use addresses.yaml
    <<: !update common.yaml
    emails:
      - john.doe@amazon.com
      - !extend emails.yaml

Source code in laktory/models/basemodel.py

@classmethod
def model_validate_yaml(cls: Type[Model], fp: TextIO, vars=None) -> Model:
    """
    Load model from yaml file object using laktory.yaml.RecursiveLoader. Supports
    reference to external yaml and sql files using `!use`, `!extend` and `!update` tags.
    Path to external files can be defined using model or environment variables.

    Referenced path should always be relative to the file they are referenced from.

    Custom Tags
    -----------
    - `!use {filepath}`:
        Directly inject the content of the file at `filepath`

    - `- !extend {filepath}`:
        Extend the current list with the elements found in the file at `filepath`.
        Similar to python list.extend method.

    - `<<: !update {filepath}`:
        Merge the current dictionary with the content of the dictionary defined at
        `filepath`. Similar to python dict.update method.

    Parameters
    ----------
    fp:
        file object structured as a yaml file
    vars:
        Dict of variables available when parsing filepaths references in yaml files
        i.e. `!use catalog_${vars.env}.yaml`

    Returns
    -------
    :
        Model instance

    Examples
    --------
    ```yaml
    businesses:
      apple:
        symbol: aapl
        address: !use addresses.yaml
        <<: !update common.yaml
        emails:
          - jane.doe@apple.com
          - !extend emails.yaml
      amazon:
        symbol: amzn
        address: !use addresses.yaml
        <<: !update common.yaml
        emails:
          - john.doe@amazon.com
          - !extend emails.yaml
    ```
    """

    data = RecursiveLoader.load(fp, vars=vars)
    return cls.model_validate(data)

`push_vars(update_core_resources=False)`

Push variable values to all child models recursively.

Source code in laktory/models/basemodel.py

def push_vars(self, update_core_resources=False) -> Any:
    """Push variable values to all child recursively"""

    def _update_model(m):
        if not isinstance(m, BaseModel):
            return
        for k, v in self.variables.items():
            m.variables[k] = m.variables.get(k, v)
        m.push_vars()

    def _push_vars(o):
        if isinstance(o, list):
            for _o in o:
                _push_vars(_o)
        elif isinstance(o, dict):
            for _o in o.values():
                _push_vars(_o)
        else:
            _update_model(o)

    for k in self.model_fields.keys():
        _push_vars(getattr(self, k))

    if update_core_resources and hasattr(self, "core_resources"):
        for r in self.core_resources:
            if r != self:
                _push_vars(r)

    return None
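The line `m.variables[k] = m.variables.get(k, v)` in the listing means a child keeps its own value when it already defines a variable, and only inherits the parent's value otherwise. The toy example below isolates that rule. The `Node` class is a hypothetical stand-in for a model with child models and is not part of Laktory.

```python
class Node:
    """Hypothetical stand-in for a model holding variables and child models."""

    def __init__(self, variables=None, children=None):
        self.variables = dict(variables or {})
        self.children = list(children or [])

    def push_vars(self):
        for child in self.children:
            for k, v in self.variables.items():
                # Keep the child's own value when the key is already defined.
                child.variables[k] = child.variables.get(k, v)
            child.push_vars()


leaf = Node(variables={"env": "prod"})
middle = Node(children=[leaf])
root = Node(variables={"env": "dev", "region": "eastus"}, children=[middle])
root.push_vars()

print(middle.variables)  # {'env': 'dev', 'region': 'eastus'}
print(leaf.variables)    # {'env': 'prod', 'region': 'eastus'}
```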

`validate_assignment_disabled()`

Updating a model attribute inside a model validator when validate_assignment is True causes an infinite recursion by design and must be turned off temporarily.

Source code in laktory/models/basemodel.py

@contextmanager
def validate_assignment_disabled(self):
    """
    Updating a model attribute inside a model validator when `validate_assignment`
    is `True` causes an infinite recursion by design and must be turned off
    temporarily.
    """
    original_state = self.model_config["validate_assignment"]
    self.model_config["validate_assignment"] = False
    try:
        yield
    finally:
        self.model_config["validate_assignment"] = original_state
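The `try`/`finally` around `yield` is what guarantees the original setting is restored, even if the code inside the `with` block raises. A self-contained sketch of the same pattern follows. The `Settings` class and its `config` dict are hypothetical and only stand in for a model and its `model_config`.

```python
from contextlib import contextmanager


class Settings:
    """Hypothetical object with a config dict, standing in for a model."""

    def __init__(self):
        self.config = {"validate_assignment": True}

    @contextmanager
    def validate_assignment_disabled(self):
        original = self.config["validate_assignment"]
        self.config["validate_assignment"] = False
        try:
            yield
        finally:
            # Runs on normal exit and when the body raises.
            self.config["validate_assignment"] = original


s = Settings()
with s.validate_assignment_disabled():
    inside = s.config["validate_assignment"]
after = s.config["validate_assignment"]
print(inside, after)  # False True
```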
