TableDataSink

`laktory.models.datasinks.TableDataSink`

Bases: BaseDataSink

PARAMETER	DESCRIPTION
`catalog_name`	Sink table catalog name TYPE: `str \| None \| VariableType` DEFAULT: `None`
`checkpoint_path_`	Path to which the checkpoint file for a streaming dataframe should be written. TYPE: `str \| Path \| VariableType` DEFAULT: `None`
`custom_writer`	Custom writer that fully replaces Laktory's built-in write logic. For streaming dataframes, Laktory still manages the streaming query lifecycle (foreachBatch, trigger, checkpoint, start/await). Can be set as a plain string (`func_name` only) or as a full `CustomWriter` object with `func_name`, `func_args`, and `func_kwargs`. Mutually exclusive with `mode` and `merge_cdc_options`. TYPE: `CustomWriter \| None \| VariableType` DEFAULT: `None`
`databricks_quality_monitor`	Databricks Quality Monitor TYPE: `Literal[None] \| VariableType` DEFAULT: `None`
`format`	Storage format for data table. TYPE: `Literal['PARQUET', 'DELTA'] \| VariableType` DEFAULT: `'DELTA'`
`is_quarantine`	Sink used to store quarantined results from a pipeline node's expectations. TYPE: `bool \| VariableType` DEFAULT: `False`
`merge_cdc_options`	Merge options to handle input DataFrames that are Change Data Capture (CDC). Only used when `MERGE` mode is selected. TYPE: `DataSinkMergeCDCOptions \| VariableType` DEFAULT: `None`
`metadata`	Table and columns metadata. TYPE: `TableDataSinkMetadata \| VariableType` DEFAULT: `None`
`mode`	Write mode. Spark: `OVERWRITE` (overwrite existing data), `APPEND` (append contents of this DataFrame to existing data), `ERROR` (throw an exception if data already exists), `IGNORE` (silently ignore this operation if data already exists). Spark Streaming: `APPEND` (only the new rows in the streaming DataFrame/Dataset are written to the sink), `COMPLETE` (all the rows in the streaming DataFrame/Dataset are written to the sink every time there are updates), `UPDATE` (only the rows that were updated in the streaming DataFrame/Dataset are written to the sink every time there are updates). Polars Delta: `OVERWRITE`, `APPEND`, `ERROR`, `IGNORE` (same meanings as for Spark). Laktory: `MERGE` (append, update and optionally delete records; only supported for DELTA format; requires a CDC specification, see `merge_cdc_options`). TYPE: `Literal['UPDATE', 'ERRORIFEXISTS', 'OVERWRITE', 'COMPLETE', 'APPEND', 'MERGE', 'ERROR', 'IGNORE'] \| None \| VariableType` DEFAULT: `None`
`schema_definition`	Explicit table schema used when creating the table. If not set, schema is inferred from the transformer output DataFrame. TYPE: `DataFrameSchema \| VariableType` DEFAULT: `None`
`schema_name`	Sink table schema name TYPE: `str \| None \| VariableType` DEFAULT: `None`
`table_name`	Sink table name. Also supports fully qualified name (`{catalog}.{schema}.{table}`). In this case, `catalog_name` and `schema_name` arguments are ignored. TYPE: `str \| VariableType`
`table_type`	Type of table. 'TABLE' and 'VIEW' are currently supported. TYPE: `Literal['TABLE', 'VIEW'] \| VariableType` DEFAULT: `'TABLE'`
`type`	Name of the data sink type TYPE: `Literal['FILE', 'HIVE_METASTORE', 'UNITY_CATALOG'] \| VariableType`
`writer_kwargs`	Keyword arguments passed directly to the dataframe backend writer. Passed to the `.options()` method when using PySpark. TYPE: `dict[str \| VariableType, Any \| VariableType] \| VariableType` DEFAULT: `{}`
`writer_methods`	DataFrame backend writer methods. TYPE: `list[ReaderWriterMethod \| VariableType] \| VariableType` DEFAULT: `[]`
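
The `mode` description above groups the accepted values by writer. As a rough illustration, that grouping can be expressed as a lookup table with a small check function. This is a minimal sketch built only from the description: `SUPPORTED_MODES`, `check_mode`, and the writer keys are hypothetical names, and the code is not Laktory's internal validation.

```python
# Hypothetical lookup table transcribed from the `mode` description;
# the keys ("spark", "spark_streaming", "polars_delta") are illustrative labels.
SUPPORTED_MODES = {
    "spark": {"OVERWRITE", "APPEND", "ERROR", "IGNORE"},
    "spark_streaming": {"APPEND", "COMPLETE", "UPDATE"},
    "polars_delta": {"OVERWRITE", "APPEND", "ERROR", "IGNORE"},
}


def check_mode(mode, writer, fmt="DELTA", has_cdc_options=False):
    """Return None if the mode looks acceptable, else a short reason string."""
    if mode is None:
        # No mode given: nothing to check in this sketch.
        return None
    mode = mode.upper()
    if mode == "MERGE":
        # MERGE is described as DELTA-only and as requiring a CDC specification.
        if fmt.upper() != "DELTA":
            return "MERGE is only supported for DELTA format"
        if not has_cdc_options:
            return "MERGE requires merge_cdc_options"
        return None
    if mode not in SUPPORTED_MODES[writer]:
        return f"{mode} is not listed for {writer}"
    return None


print(check_mode("append", "spark_streaming"))
print(check_mode("OVERWRITE", "spark_streaming"))
print(check_mode("merge", "spark", fmt="PARQUET", has_cdc_options=True))
```

Returning a reason string instead of raising keeps the sketch easy to test; a real validator would more likely raise an exception.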

METHOD	DESCRIPTION
`as_source`	Generate a table data source with the same properties as the sink.
`create`	Creates an empty table with the expected schema if it does not already exist.
`purge`	Delete sink data and checkpoints
`read`	Read dataframe from sink.
`write`	Write dataframe into sink.

ATTRIBUTE	DESCRIPTION
`dlt_apply_changes_kwargs`	Keyword arguments for the `dlt.apply_changes` function. TYPE: `dict[str, str]`
`dlt_pre_merge_view_name`	DLT view that applies the node transformer before CDC changes are applied.
`full_name`	Full table name: `{catalog_name}.{schema_name}.{table_name}`. TYPE: `str`

`dlt_apply_changes_kwargs` `property`

Keyword arguments for the `dlt.apply_changes` function.

`dlt_pre_merge_view_name` `property`

DLT view that applies the node transformer before CDC changes are applied.

`full_name` `property`

Full table name: `{catalog_name}.{schema_name}.{table_name}`
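
To make the naming rule concrete, here is a minimal sketch of how a full name could be assembled from the three parts, including the documented case where `table_name` is already fully qualified and the other two arguments are ignored. `resolve_full_name` is a hypothetical helper; how missing parts are treated (skipped here) is an assumption, not confirmed Laktory behavior.

```python
def resolve_full_name(table_name, catalog_name=None, schema_name=None):
    """Assemble `{catalog}.{schema}.{table}` from its parts (illustrative only)."""
    # A three-part dotted name is treated as fully qualified, so
    # catalog_name and schema_name are ignored in that case.
    if len(table_name.split(".")) == 3:
        return table_name
    # Assumption: parts that are not set are simply left out of the name.
    # Other dotted forms (e.g. two-part names) are not handled by this sketch.
    parts = [catalog_name, schema_name, table_name]
    return ".".join(p for p in parts if p)


print(resolve_full_name("orders", catalog_name="dev", schema_name="sales"))
print(resolve_full_name("prod.finance.orders", catalog_name="dev", schema_name="sales"))
```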

`as_source(as_stream=None, reader_kwargs=None, reader_methods=None)`

Generate a table data source with the same properties as the sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, the sink will be read as a stream. DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`TableDataSource`	Table Data Source

Source code in laktory/models/datasinks/tabledatasink.py

def as_source(
    self, as_stream=None, reader_kwargs=None, reader_methods=None
) -> TableDataSource:
    """
    Generate a table data source with the same properties as the sink.

    Parameters
    ----------
    as_stream:
        If `True`, sink will be read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    :
        Table Data Source
    """
    source = TableDataSource(
        catalog_name=self.catalog_name,
        table_name=self.table_name,
        schema_name=self.schema_name,
        type=self.type,
        dataframe_backend=self.dataframe_backend,
    )

    if as_stream:
        source.as_stream = as_stream
    if reader_kwargs:
        source.reader_kwargs.update(reader_kwargs)
    if reader_methods:
        source.reader_methods.extend(reader_methods)

    if self.dataframe_backend_:
        source.dataframe_backend_ = self.dataframe_backend_
    source.parent = self.parent

    return source
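
One detail of the listing above is that every override is guarded by a truthiness check, so `None`, `False`, and empty containers all leave the source's own defaults untouched, while keyword arguments are merged and methods are appended. The sketch below reproduces only that pattern with a stand-in dataclass; `SketchSource` and `apply_overrides` are hypothetical names, and the default values are assumptions rather than `TableDataSource` defaults.

```python
from dataclasses import dataclass, field


@dataclass
class SketchSource:
    """Stand-in for a data source; field defaults are assumed for illustration."""
    table_name: str
    as_stream: bool = False
    reader_kwargs: dict = field(default_factory=dict)
    reader_methods: list = field(default_factory=list)


def apply_overrides(source, as_stream=None, reader_kwargs=None, reader_methods=None):
    # Same guards as in `as_source`: only truthy values change the source.
    if as_stream:
        source.as_stream = as_stream
    if reader_kwargs:
        source.reader_kwargs.update(reader_kwargs)  # merge, do not replace
    if reader_methods:
        source.reader_methods.extend(reader_methods)  # append, do not replace
    return source


src = SketchSource("orders", reader_kwargs={"versionAsOf": 3})
apply_overrides(src, as_stream=True, reader_kwargs={"ignoreDeletes": "true"})
print(src.as_stream, src.reader_kwargs)

# Passing False is falsy, so it does not switch an existing stream setting off.
apply_overrides(src, as_stream=False)
print(src.as_stream)
```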

`create(df=None)`

Creates an empty table with the expected schema if it does not already exist.

Returns `True` if the table was created, `False` otherwise. Schema is taken from `schema_definition` if set, otherwise inferred from `df`.

Source code in laktory/models/datasinks/tabledatasink.py

def create(self, df=None) -> bool:
    """
    Creates an empty table with the expected schema if it does not already exist.

    Returns True if the table was created, False otherwise.
    Schema is taken from `schema_definition` if set, otherwise inferred from `df`.
    """
    logger.info(f"Table '{self.full_name}' creation initiated.")

    # Skip for views
    if self.table_type == "VIEW":
        logger.info("Table is view. Skipping.")
        return False

    if self.exists():
        logger.info("Table exists. Skipping.")
        return False

    self._update_backend_from_df(df)
    schema = self._get_create_schema(df)

    if schema is None:
        logger.info("Schema is empty and `df` is None. Skipping table.")
        return False

    logger.info(f"Creating empty table '{self.full_name}'.")
    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        from laktory import get_spark_session

        spark = get_spark_session()

        kwargs = {}
        path = self.writer_kwargs.get("path", None)
        if path:
            kwargs["path"] = path

        df_empty = spark.createDataFrame(data=[], schema=schema)
        df_empty.write.format(self.format.lower()).mode("ignore").options(
            **kwargs
        ).saveAsTable(self.full_name)

    else:
        raise NotImplementedError(
            f"Table Data Sink for '{self.dataframe_backend}' is not yet supported."
        )

    logger.info(f"Table '{self.full_name}' creation completed.")

    return True
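
The early returns in `create` can be summarized as an ordered list of checks. The following sketch mirrors that order with plain values in place of the sink, so it can be run without Spark. `pick_schema` and `create_decision` are hypothetical helpers, and `pick_schema` only restates the documented rule (explicit `schema_definition` first, otherwise the schema of `df`); it does not reproduce Laktory's private schema logic.

```python
def pick_schema(schema_definition=None, df_schema=None):
    """Prefer the explicit schema; otherwise fall back to the DataFrame schema."""
    return schema_definition if schema_definition is not None else df_schema


def create_decision(table_type, exists, schema):
    """Return the outcome of the checks in the same order as `create`."""
    if table_type == "VIEW":
        return "skip: views are not created here"
    if exists:
        return "skip: table already exists"
    if schema is None:
        return "skip: no schema_definition and no df"
    return "create empty table"


schema = pick_schema(schema_definition=None, df_schema={"id": "bigint"})
print(create_decision("TABLE", exists=False, schema=schema))
print(create_decision("TABLE", exists=False, schema=pick_schema()))
```

Only the last outcome corresponds to `create` returning `True`; every skip corresponds to `False`.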

`purge()`

Delete sink data and checkpoints.

Source code in laktory/models/datasinks/tabledatasink.py

def purge(self):
    """
    Delete sink data and checkpoints
    """

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        from laktory import get_spark_session

        spark = get_spark_session()

        # Remove Data
        logger.info(
            f"Dropping {self.table_type} {self.full_name}",
        )
        spark.sql(f"DROP {self.table_type} IF EXISTS {self.full_name}")

        path = self.writer_kwargs.get("path", None)
        if path:
            path = Path(path)
            if path.exists():
                is_dir = path.is_dir()
                if is_dir:
                    logger.info(f"Deleting data dir {path}")
                    shutil.rmtree(path)
                else:
                    logger.info(f"Deleting data file {path}")
                    os.remove(path)

        # Remove Checkpoint
        self._purge_checkpoint()

    else:
        raise TypeError(
            f"DataFrame backend {self.dataframe_backend} is not supported."
        )
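
Besides dropping the table, `purge` removes whatever sits at the `path` writer option, choosing between `shutil.rmtree` and `os.remove` depending on whether the path is a directory or a file. The file-system part can be tried in isolation on throwaway paths, as in this sketch. `remove_path` is a hypothetical helper and the file names are made up for the demonstration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def remove_path(path):
    """Delete a directory tree or a single file; report what was removed."""
    path = Path(path)
    if not path.exists():
        return "missing"  # nothing to delete
    if path.is_dir():
        shutil.rmtree(path)  # directories are removed recursively
        return "dir"
    os.remove(path)  # plain files are removed directly
    return "file"


# Demonstration on temporary paths only.
root = Path(tempfile.mkdtemp())
data_dir = root / "table_data"
data_dir.mkdir()
(data_dir / "part-0000.parquet").write_bytes(b"")
data_file = root / "single.parquet"
data_file.write_bytes(b"")

results = [remove_path(data_dir), remove_path(data_file), remove_path(data_dir)]
print(results)

shutil.rmtree(root)  # clean up the temporary root
```

Because of the existence check, calling the helper again on an already removed path is a no-op, which matches the `IF EXISTS` style of the table drop.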

`read(as_stream=None, reader_kwargs=None, reader_methods=None)`

Read dataframe from sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, the dataframe is read as a stream. DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`AnyFrame`	DataFrame

Source code in laktory/models/datasinks/basedatasink.py

def read(self, as_stream=None, reader_kwargs=None, reader_methods=None):
    """
    Read dataframe from sink.

    Parameters
    ----------
    as_stream:
        If `True`, dataframe read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    AnyFrame
        DataFrame
    """
    return self.as_source(
        as_stream=as_stream,
        reader_kwargs=reader_kwargs,
        reader_methods=reader_methods,
    ).read()

`write(df=None, view_definition=None, mode=None)`

Write dataframe into sink.

PARAMETER	DESCRIPTION
`df`	Input dataframe. TYPE: `AnyFrame` DEFAULT: `None`
`mode`	Write mode that overrides the sink's default mode. TYPE: `str` DEFAULT: `None`
`view_definition`	View definition for table data sinks of `VIEW` type TYPE: `str` DEFAULT: `None`

Source code in laktory/models/datasinks/basedatasink.py

def write(
    self,
    df: AnyFrame = None,
    view_definition: str = None,
    mode: str = None,
) -> None:
    """
    Write dataframe into sink.

    Parameters
    ----------
    df:
        Input dataframe.
    mode:
        Write mode that overrides the sink's default mode.
    view_definition:
        View definition for table data sinks of `VIEW` type
    """

    logger.info("Write initiated.")

    if getattr(self, "table_type", None) == "VIEW":
        if view_definition is None:
            raise ValueError(f"`view_definition` for '{self._id}' is `None`")

        from laktory.models.dataframe.dataframeexpr import DataFrameExpr

        if not isinstance(view_definition, DataFrameExpr):
            view_definition = DataFrameExpr(expr=view_definition)

        if self.dataframe_backend == DataFrameBackends.PYSPARK:
            self._write_spark_view(view_definition)
        elif self.dataframe_backend == DataFrameBackends.POLARS:
            self._write_polars_view(view_definition)
        else:
            raise ValueError(
                f"DataFrame backend '{self.dataframe_backend}' is not supported"
            )
        return

    if not isinstance(df, (nw.DataFrame, nw.LazyFrame)):
        df = nw.from_native(df)
    self._update_backend_from_df(df)

    # Custom Writer
    if self.custom_writer:
        df_native = df.to_native()

        # Special Treatment for Spark Streaming
        if (
            self.dataframe_backend == DataFrameBackends.PYSPARK
            and df_native.isStreaming
        ):
            if self.checkpoint_path is None:
                raise ValueError(
                    f"Checkpoint location not specified for sink '{self._id}'"
                )
            # Build context before the foreachBatch lambda so that _parent
            # references are captured while intact. Inside foreachBatch on
            # Databricks, the lambda closure is serialized via cloudpickle
            # and _parent attributes may not survive the round-trip.
            from laktory.models.laktorycontext import LaktoryContext

            _context = LaktoryContext(
                node=self.parent_pipeline_node,
                pipeline=self.parent_pipeline,
                sink=self,
            )
            query = (
                df_native.writeStream.foreachBatch(
                    lambda batch_df, _: self.custom_writer.execute(
                        batch_df, context=_context
                    )
                )
                .trigger(availableNow=True)
                .options(checkpointLocation=self.checkpoint_path)
                .start()
            )
            query.awaitTermination()

        else:
            self.custom_writer.execute(df)

        logger.info("Write completed.")
        return

    if mode is None:
        mode = self.mode

    self._validate_mode(mode, df)
    self._validate_format()

    if mode and mode.lower() == "merge":
        self.merge_cdc_options.execute(source=df)
        logger.info("Write completed.")
        return

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        self._write_spark(df=df, mode=mode)
    elif self.dataframe_backend == DataFrameBackends.POLARS:
        self._write_polars(df=df, mode=mode)
    else:
        raise ValueError(
            f"DataFrame backend '{self.dataframe_backend}' is not supported"
        )

    logger.info("Write completed.")
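
Read top to bottom, `write` picks exactly one route, and the order of the checks matters: the view branch comes first, then the custom writer (which returns before `mode` is even resolved), then `MERGE`, and only then the backend writers. The sketch below captures that precedence with plain arguments. `write_route` is a hypothetical function, and the backend strings `"PYSPARK"` and `"POLARS"` are assumed stand-ins for the `DataFrameBackends` members.

```python
def write_route(table_type=None, custom_writer=False, mode=None, backend="PYSPARK"):
    """Return which branch of `write` would handle the call (illustrative only)."""
    if table_type == "VIEW":
        return "view"  # views are created from `view_definition`, not from `df`
    if custom_writer:
        return "custom_writer"  # checked before `mode`, so `mode` is not consulted
    if mode and mode.lower() == "merge":
        return "merge"  # delegated to the CDC merge options
    if backend in ("PYSPARK", "POLARS"):
        return f"{backend.lower()} writer"
    raise ValueError(f"DataFrame backend '{backend}' is not supported")


print(write_route(table_type="VIEW", custom_writer=True, mode="merge"))
print(write_route(custom_writer=True, mode="merge"))
print(write_route(mode="MERGE"))
print(write_route(mode="append", backend="POLARS"))
```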
