HiveMetastoreDataSink

`laktory.models.datasinks.HiveMetastoreDataSink`

Bases: TableDataSink

Data sink writing to a Hive Metastore data table.

Examples:

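import laktory as lk

# Assumes an active SparkSession is available as `spark`.
df = spark.createDataFrame([{"x": 1}, {"x": 2}, {"x": 3}])

sink = lk.models.HiveMetastoreDataSink(
    schema_name="default",
    table_name="my_table",
    mode="APPEND",
)
# Uncomment to write the DataFrame to the table.
# sink.write(df)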

References

Data Sources and Sinks

PARAMETER	DESCRIPTION
`as_stream`	If `True`, the output DataFrame is written as a streaming DataFrame. If `None`, the write mode is derived from the DataFrame. TYPE: `bool \| None \| VariableType` DEFAULT: `None`
`catalog_name`	Sink table catalog name TYPE: `str \| None \| VariableType` DEFAULT: `None`
`checkpoint_path_`	Path to which the checkpoint file for a streaming DataFrame should be written. TYPE: `str \| Path \| VariableType` DEFAULT: `None`
`custom_writer`	Custom writer that fully replaces Laktory's built-in write logic. For streaming DataFrames, Laktory still manages the streaming query lifecycle (foreachBatch, trigger, checkpoint, start/await). Can be set as a plain string (`func_name` only) or as a full `CustomWriter` object with `func_name`, `func_args`, and `func_kwargs`. Mutually exclusive with `mode` and `merge_cdc_options`. TYPE: `CustomWriter \| None \| VariableType` DEFAULT: `None`
`databricks_data_profiling_config`	Databricks Data Quality Monitor data profiling configuration TYPE: `Literal[None] \| VariableType` DEFAULT: `None`
`format`	Storage format for data table. TYPE: `Literal['PARQUET', 'DELTA', 'ORC', 'AVRO'] \| VariableType` DEFAULT: `'DELTA'`
`is_quarantine`	Sink used to store quarantined results from a pipeline node's expectations. TYPE: `bool \| VariableType` DEFAULT: `False`
`merge_cdc_options`	Merge options to handle input DataFrames that are Change Data Capture (CDC). Only used when `MERGE` mode is selected. TYPE: `DataSinkMergeCDCOptions \| VariableType` DEFAULT: `None`
`metadata`	Table and columns metadata. TYPE: `TableDataSinkMetadata \| VariableType` DEFAULT: `None`
`mode`	Write mode. Spark: `OVERWRITE` (overwrite existing data), `APPEND` (append contents of this DataFrame to existing data), `ERROR` (throw an exception if data already exists), `IGNORE` (silently ignore this operation if data already exists). Spark Streaming: `APPEND` (only the new rows in the streaming DataFrame/Dataset are written to the sink), `COMPLETE` (all the rows in the streaming DataFrame/Dataset are written to the sink every time there are updates), `UPDATE` (only the rows that were updated in the streaming DataFrame/Dataset are written to the sink every time there are updates). Polars Delta: `OVERWRITE` (overwrite existing data), `APPEND` (append contents of this DataFrame to existing data), `ERROR` (throw an exception if data already exists), `IGNORE` (silently ignore this operation if data already exists). Laktory: `MERGE` (append, update and optionally delete records; only supported for DELTA format; requires a CDC specification). TYPE: `Literal['OVERWRITE', 'APPEND', 'MERGE', 'IGNORE', 'COMPLETE', 'ERROR', 'UPDATE', 'ERRORIFEXISTS'] \| None \| VariableType` DEFAULT: `None`
`schema_definition`	Explicit table schema used when creating the table. If not set, schema is inferred from the transformer output DataFrame. TYPE: `DataFrameSchema \| VariableType` DEFAULT: `None`
`schema_name`	Sink table schema name TYPE: `str \| None \| VariableType` DEFAULT: `None`
`table_name`	Sink table name. Also supports fully qualified name (`{catalog}.{schema}.{table}`). In this case, `catalog_name` and `schema_name` arguments are ignored. TYPE: `str \| VariableType`
`table_type`	Type of table. 'TABLE' and 'VIEW' are currently supported. TYPE: `Literal['TABLE', 'VIEW'] \| VariableType` DEFAULT: `'TABLE'`
`type`	Sink Type TYPE: `Literal['HIVE_METASTORE']` DEFAULT: `'HIVE_METASTORE'`
`writer_kwargs`	Keyword arguments passed directly to dataframe backend writer. Passed to `.options()` method when using PySpark. TYPE: `dict[str \| VariableType, Any \| VariableType] \| VariableType` DEFAULT: `{}`
`writer_methods`	DataFrame backend writer methods. TYPE: `list[ReaderWriterMethod \| VariableType] \| VariableType` DEFAULT: `[]`
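
As a rough illustration of the `mode` groups listed in the table above, the sketch below records which backend group describes each mode. The dictionary and the `backends_supporting` helper are hypothetical names used only for this example; they are not part of Laktory. The type also lists `ERRORIFEXISTS`, which the description does not assign to a group, so the sketch leaves it out.

```python
# Write modes grouped by backend, copied from the `mode` description.
# MODES_BY_BACKEND and backends_supporting are illustrative names only.
MODES_BY_BACKEND = {
    "spark": {"OVERWRITE", "APPEND", "ERROR", "IGNORE"},
    "spark_streaming": {"APPEND", "COMPLETE", "UPDATE"},
    "polars_delta": {"OVERWRITE", "APPEND", "ERROR", "IGNORE"},
    "laktory": {"MERGE"},
}


def backends_supporting(mode: str) -> list:
    """Return the sorted backend groups whose description lists `mode`."""
    mode = mode.upper()
    return sorted(
        name for name, modes in MODES_BY_BACKEND.items() if mode in modes
    )


print(backends_supporting("append"))
print(backends_supporting("merge"))
```

A lookup like this can help when choosing a mode for a sink that may receive either static or streaming DataFrames, since only `APPEND` appears in both Spark groups.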

METHOD	DESCRIPTION
`as_source`	Generate a table data source with the same properties as the sink.
`create`	Creates an empty table with the expected schema if it does not already exist.
`is_streaming`	Return `True` if the write should use Spark Structured Streaming.
`purge`	Delete sink data and checkpoints
`read`	Read dataframe from sink.
`write`	Write dataframe into sink.

ATTRIBUTE	DESCRIPTION
`full_name`	Table full name {catalog_name}.{schema_name}.{table_name} TYPE: `str`
`ldp_auto_cdc_flow_kwargs`	Keyword arguments for dp.create_auto_cdc_flow function TYPE: `dict[str, str]`
`sdp_pre_merge_view_name`	SDP view applying the node transformer prior to applying CDC changes.

`full_name` `property`

Table full name {catalog_name}.{schema_name}.{table_name}
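
The naming rule described for `table_name` (a fully qualified name takes precedence over `catalog_name` and `schema_name`) might be sketched as follows. `resolve_full_name` is a hypothetical helper written for this example, and dropping unset parts when joining is an assumption rather than confirmed Laktory behavior.

```python
def resolve_full_name(table_name, schema_name=None, catalog_name=None):
    """Hypothetical helper mirroring the documented naming rule."""
    # A three-part name is already fully qualified: use it as-is and
    # ignore the separate catalog and schema arguments.
    if table_name.count(".") == 2:
        return table_name
    # Assumption: otherwise join whichever parts are set, in
    # catalog.schema.table order.
    parts = [catalog_name, schema_name, table_name]
    return ".".join(p for p in parts if p)


print(resolve_full_name("my_table", schema_name="default"))
print(resolve_full_name("dev.sales.orders", schema_name="ignored"))
```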

`ldp_auto_cdc_flow_kwargs` `property`

Keyword arguments for dp.create_auto_cdc_flow function

`sdp_pre_merge_view_name` `property`

SDP view applying the node transformer prior to applying CDC changes.

`as_source(as_stream=None, reader_kwargs=None, reader_methods=None)`

Generate a table data source with the same properties as the sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, sink will be read as stream. DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`TableDataSource`	Table Data Source

Source code in laktory/models/datasinks/tabledatasink.py

def as_source(
    self, as_stream=None, reader_kwargs=None, reader_methods=None
) -> TableDataSource:
    """
    Generate a table data source with the same properties as the sink.

    Parameters
    ----------
    as_stream:
        If `True`, sink will be read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    :
        Table Data Source
    """
    source = TableDataSource(
        catalog_name=self.catalog_name,
        table_name=self.table_name,
        schema_name=self.schema_name,
        type=self.type,
        dataframe_backend=self.dataframe_backend,
    )

    if as_stream:
        source.as_stream = as_stream
    if reader_kwargs:
        source.reader_kwargs.update(reader_kwargs)
    if reader_methods:
        source.reader_methods.extend(reader_methods)

    if self.dataframe_backend_:
        source.dataframe_backend_ = self.dataframe_backend_
    source.parent = self.parent

    return source

`create(df=None)`

Creates an empty table with the expected schema if it does not already exist.

Returns `True` if the table was created, `False` otherwise. Schema is taken from `schema_definition` if set, otherwise inferred from `df`.

Source code in laktory/models/datasinks/tabledatasink.py

def create(self, df=None) -> bool:
    """
    Creates an empty table with the expected schema if it does not already exist.

    Returns True if the table was created, False otherwise.
    Schema is taken from `schema_definition` if set, otherwise inferred from `df`.
    """
    logger.info(f"Table '{self.full_name}' creation initiated.")

    # Skip for views
    if self.table_type == "VIEW":
        logger.info("Table is view. Skipping.")
        return False

    if self.exists():
        logger.info("Table exists. Skipping.")
        return False

    self._update_backend_from_df(df)

    # For merge CDC sinks, delegate to _init_target() which builds the correct
    # schema (including SCD2 extra columns). Using the raw df schema here would
    # produce a table without those columns, causing the first merge to fail.
    if self.merge_cdc_options is not None:
        if df is None:
            logger.info("Schema is empty and `df` is None. Skipping table.")
            return False
        native_df = df.to_native()
        self.merge_cdc_options._source_schema = native_df.schema
        self.merge_cdc_options._init_target(native_df)
        return True

    schema = self._get_create_schema(df)

    if schema is None:
        logger.info("Schema is empty and `df` is None. Skipping table.")
        return False

    logger.info(f"Creating empty table '{self.full_name}'.")
    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        from laktory import get_spark_session

        spark = get_spark_session()

        kwargs = {}
        path = self.writer_kwargs.get("path", None)
        if path:
            kwargs["path"] = path

        df_empty = spark.createDataFrame(data=[], schema=schema)
        df_empty.write.format(self.format.lower()).mode("ignore").options(
            **kwargs
        ).saveAsTable(self.full_name)

    else:
        raise NotImplementedError(
            f"Table Data Sink for '{self.dataframe_backend}' is not yet supported."
        )

    logger.info(f"Table '{self.full_name}' creation completed.")

    return True
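
The branches in the `create` listing above can be condensed into a small decision function. This is only a reading aid under stated assumptions: `create_outcome` and its boolean arguments are hypothetical stand-ins for the sink state, and the sketch does not touch Spark or any table.

```python
def create_outcome(table_type, exists, has_merge_cdc, df_given, schema_available):
    """Hypothetical summary of when `create` reports a newly created table."""
    if table_type == "VIEW":
        return False  # views are skipped
    if exists:
        return False  # the table is already there
    if has_merge_cdc:
        # Merge CDC sinks need a DataFrame to build the target schema.
        return df_given
    # Otherwise a schema must come from `schema_definition` or from `df`.
    return schema_available


print(create_outcome("TABLE", False, False, True, True))
print(create_outcome("VIEW", False, False, True, True))
```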

`is_streaming(df=None)`

Return `True` if the write should use Spark Structured Streaming.

Resolution order:

1. If a Narwhals-wrapped PySpark DataFrame is provided, read its native `isStreaming` attribute.
2. Fall back to `self.as_stream` (explicit sink configuration).
3. Fall back to the parent node's source `as_stream` flag.
4. Default to `False` (static write).

If both the DataFrame state and the configuration are set and they disagree, a `TypeError` is raised to surface the misconfiguration early.

PARAMETER	DESCRIPTION
`df`	Optional Narwhals DataFrame or LazyFrame. Must be passed before calling `.to_native()` so that the Narwhals `implementation` attribute is still available. DEFAULT: `None`
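
The resolution order for `is_streaming` can be illustrated without Spark by passing the three inputs as plain values. `resolve_is_streaming` and its argument names are hypothetical; each argument is `True`, `False`, or `None` (meaning "not known"), and the sketch only mirrors the documented precedence and conflict check.

```python
def resolve_is_streaming(
    df_is_streaming=None, sink_as_stream=None, source_is_streaming=None
):
    """Hypothetical stand-in for the documented resolution order."""
    # Explicit sink configuration wins over the parent node's source flag.
    configured = sink_as_stream
    if configured is None:
        configured = source_is_streaming

    # A disagreement between DataFrame state and configuration is an error.
    if df_is_streaming is not None and configured is not None:
        if df_is_streaming != configured:
            raise TypeError("DataFrame state and sink configuration disagree.")

    # DataFrame state first, then configuration, then the static default.
    return bool(df_is_streaming or configured or False)


print(resolve_is_streaming())
print(resolve_is_streaming(source_is_streaming=True))
```

Raising on disagreement, rather than silently preferring one side, surfaces a misconfigured sink before any data is written.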

Source code in laktory/models/datasinks/basedatasink.py

def is_streaming(self, df=None) -> bool:
    """
    Return `True` if the write should use Spark Structured Streaming.

    Resolution order:
    1. If a Narwhals-wrapped PySpark DataFrame is provided, read its native
       ``isStreaming`` attribute.
    2. Fall back to ``self.as_stream`` (explicit sink configuration).
    3. Fall back to the parent node's source ``as_stream`` flag.
    4. Default to ``False`` (static write).

    If both the DataFrame state and the configuration are set and they
    disagree, a ``TypeError`` is raised to surface the misconfiguration
    early.

    Parameters
    ----------
    df:
        Optional Narwhals DataFrame or LazyFrame. Must be passed before
        calling ``.to_native()`` so that the Narwhals ``implementation``
        attribute is still available.
    """
    # Check if DataFrame is streaming
    df_is_streaming = None
    if df is not None:
        df = nw.from_native(df)
        dataframe_backend = DataFrameBackends(df.implementation)
        if dataframe_backend == DataFrameBackends.PYSPARK:
            df_is_streaming = df.to_native().isStreaming

    # Check if configured as stream from writer or source
    configured_as_stream = self.as_stream
    if configured_as_stream is None:
        node = self.parent_pipeline_node
        if node is not None and node.sources:
            configured_as_stream = node.has_streaming_source

    # Resolve conflict
    if df_is_streaming is not None and configured_as_stream is not None:
        if df_is_streaming != configured_as_stream:
            if df_is_streaming:
                raise TypeError(
                    "Sink configured as static, but received dataframe is streaming."
                )
            else:
                raise TypeError(
                    "Sink configured as stream, but received dataframe is not streaming."
                )

    is_streaming = df_is_streaming or configured_as_stream or False

    return is_streaming

`purge()`

Delete sink data and checkpoints

Source code in laktory/models/datasinks/tabledatasink.py

def purge(self):
    """
    Delete sink data and checkpoints
    """

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        from laktory import get_spark_session

        spark = get_spark_session()

        # Remove Data
        logger.info(
            f"Dropping {self.table_type} {self.full_name}",
        )
        spark.sql(f"DROP {self.table_type} IF EXISTS {self.full_name}")

        path = self.writer_kwargs.get("path", None)
        if path:
            path = Path(path)
            if path.exists():
                is_dir = path.is_dir()
                if is_dir:
                    logger.info(f"Deleting data dir {path}")
                    shutil.rmtree(path)
                else:
                    logger.info(f"Deleting data file {path}")
                    os.remove(path)

        # Remove Checkpoint
        self._purge_checkpoint()

    else:
        raise TypeError(
            f"DataFrame backend {self.dataframe_backend} is not supported."
        )
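
The path cleanup step in the `purge` listing above (remove a directory tree or a single file, depending on what the `path` writer option points at) can be tried in isolation. `remove_path` is a hypothetical helper written for this example, and the demonstration only touches throwaway paths under a temporary directory.

```python
import os
import shutil
import tempfile
from pathlib import Path


def remove_path(path):
    """Hypothetical helper: delete a directory tree or a single file.

    Returns "dir", "file", or None when the path does not exist.
    """
    path = Path(path)
    if not path.exists():
        return None
    if path.is_dir():
        shutil.rmtree(path)
        return "dir"
    os.remove(path)
    return "file"


# Demonstration on temporary paths only.
root = Path(tempfile.mkdtemp())
data_dir = root / "table_data"
data_dir.mkdir()
(data_dir / "part-0.parquet").write_bytes(b"")
data_file = root / "single.parquet"
data_file.write_bytes(b"")

results = [
    remove_path(data_dir),
    remove_path(data_file),
    remove_path(root / "missing"),
]
shutil.rmtree(root)
print(results)
```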

`read(as_stream=None, reader_kwargs=None, reader_methods=None)`

Read dataframe from sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, dataframe read as stream. DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`AnyFrame`	DataFrame

Source code in laktory/models/datasinks/basedatasink.py

def read(self, as_stream=None, reader_kwargs=None, reader_methods=None):
    """
    Read dataframe from sink.

    Parameters
    ----------
    as_stream:
        If `True`, dataframe read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    AnyFrame
        DataFrame
    """
    return self.as_source(
        as_stream=as_stream,
        reader_kwargs=reader_kwargs,
        reader_methods=reader_methods,
    ).read()

`write(df=None, view_definition=None, mode=None)`

Write dataframe into sink.

PARAMETER	DESCRIPTION
`df`	Input dataframe. TYPE: `AnyFrame` DEFAULT: `None`
`mode`	Write mode overriding the sink's default mode. TYPE: `str` DEFAULT: `None`
`view_definition`	View definition for table data sinks of `VIEW` type TYPE: `str` DEFAULT: `None`

Source code in laktory/models/datasinks/basedatasink.py

def write(
    self,
    df: AnyFrame = None,
    view_definition: str = None,
    mode: str = None,
) -> None:
    """
    Write dataframe into sink.

    Parameters
    ----------
    df:
        Input dataframe.
    mode:
        Write mode overwrite of the sink default mode.
    view_definition:
        View definition for table data sinks of `VIEW` type
    """

    logger.info("Write initiated.")

    if getattr(self, "table_type", None) == "VIEW":
        if view_definition is None:
            raise ValueError(f"`view_definition` for '{self._id}' is `None`")

        from laktory.models.dataframe.dataframeexpr import DataFrameExpr

        if not isinstance(view_definition, DataFrameExpr):
            view_definition = DataFrameExpr(expr=view_definition)

        if self.dataframe_backend == DataFrameBackends.PYSPARK:
            self._write_spark_view(view_definition)
        elif self.dataframe_backend == DataFrameBackends.POLARS:
            self._write_polars_view(view_definition)
        else:
            raise ValueError(
                f"DataFrame backend '{self.dataframe_backend}' is not supported"
            )
        return

    if not isinstance(df, (nw.DataFrame, nw.LazyFrame)):
        df = nw.from_native(df)
    self._update_backend_from_df(df)

    # Custom Writer
    if self.custom_writer:
        df_native = df.to_native()

        # Special Treatment for Spark Streaming
        if (
            self.dataframe_backend == DataFrameBackends.PYSPARK
            and self.is_streaming(df=df)
        ):
            if self.checkpoint_path is None:
                raise ValueError(
                    f"Checkpoint location not specified for sink '{self._id}'"
                )
            # Build context before the foreachBatch lambda so that _parent
            # references are captured while intact. Inside foreachBatch on
            # Databricks, the lambda closure is serialized via cloudpickle
            # and _parent attributes may not survive the round-trip.
            from laktory.models.laktorycontext import LaktoryContext

            _context = LaktoryContext(
                node=self.parent_pipeline_node,
                pipeline=self.parent_pipeline,
                sink=self,
            )
            query = (
                df_native.writeStream.foreachBatch(
                    lambda batch_df, _: self.custom_writer.execute(
                        batch_df, context=_context
                    )
                )
                .trigger(availableNow=True)
                .options(checkpointLocation=self.checkpoint_path)
                .start()
            )
            query.awaitTermination()

        else:
            self.custom_writer.execute(df)

        logger.info("Write completed.")
        return

    if mode is None:
        mode = self.mode

    self._validate_mode(mode, df)
    self._validate_format()

    if mode and mode.lower() == "merge":
        self.merge_cdc_options.execute(source=df)
        logger.info("Write completed.")
        return

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        self._write_spark(df=df, mode=mode)
    elif self.dataframe_backend == DataFrameBackends.POLARS:
        self._write_polars(df=df, mode=mode)
    else:
        raise ValueError(
            f"DataFrame backend '{self.dataframe_backend}' is not supported"
        )

    logger.info("Write completed.")
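
The order in which the `write` listing above picks a code path can be summarized with a small routing function. `write_route` and its arguments are hypothetical; `mode` here stands for the effective mode, after the fallback to the sink's own `mode`, and the returned labels are illustrative names rather than Laktory identifiers.

```python
def write_route(
    table_type="TABLE", view_definition=None, has_custom_writer=False, mode=None
):
    """Hypothetical summary of the branch order in `write`."""
    # Views are handled first and require a view definition.
    if table_type == "VIEW":
        if view_definition is None:
            raise ValueError("`view_definition` is required for VIEW sinks")
        return "view"
    # A custom writer replaces the built-in write logic entirely.
    if has_custom_writer:
        return "custom_writer"
    # MERGE is delegated to the merge CDC options.
    if mode and mode.lower() == "merge":
        return "merge"
    # Everything else goes to the backend writer (PySpark or Polars).
    return "backend_writer"


print(write_route(mode="APPEND"))
print(write_route(table_type="VIEW", view_definition="SELECT 1"))
```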
