FileDataSink

`laktory.models.datasinks.FileDataSink`

Bases: BaseDataSink

Data sink writing to disk file(s) as csv, parquet or Delta Table.

Examples:

Write polars DataFrame as CSV

import polars as pl

import laktory as lk

df = pl.DataFrame({"x": [0, 1]})

sink = lk.models.FileDataSink(
    path="./dataframe.csv", format="CSV", writer_kwargs={"separator": ";"}
)
sink.write(df)

Write Spark Streaming DataFrame as Delta

```python
from laktory import models

df = spark.readStream(...)  # skip

sink = models.FileDataSink(
    path="./delta_table/",
    format="DELTA",
    mode="APPEND",
    checkpoint_path="./delta_table/checkpoint/",
)
sink.write(df)
```

References

Data Sources and Sinks

PARAMETER	DESCRIPTION
`as_stream`	If `True`, the output DataFrame is written as a Streaming DataFrame. If `None`, the write mode is derived from the DataFrame. TYPE: `bool \| None \| VariableType` DEFAULT: `None`
`checkpoint_path_`	Path to which the checkpoint file for a streaming dataframe should be written. TYPE: `str \| Path \| VariableType` DEFAULT: `None`
`custom_writer`	Custom writer that fully replaces Laktory's built-in write logic. Laktory manages the streaming query lifecycle (foreachBatch, trigger, checkpoint, start/await). Can be set as a plain string (func_name only) or a full CustomWriter object with func_name, func_args, and func_kwargs. Mutually exclusive with `mode` and `merge_cdc_options`. TYPE: `CustomWriter \| None \| VariableType` DEFAULT: `None`
`databricks_data_profiling_config`	Databricks Data Quality Monitor data profiling configuration TYPE: `Literal[None] \| VariableType` DEFAULT: `None`
`format`	Format of the data files. TYPE: `str \| VariableType`
`is_quarantine`	Sink used to store quarantined results from a pipeline node's expectations. TYPE: `bool \| VariableType` DEFAULT: `False`
`merge_cdc_options`	Merge options to handle input DataFrames that are Change Data Capture (CDC). Only used when `MERGE` mode is selected. TYPE: `DataSinkMergeCDCOptions \| VariableType` DEFAULT: `None`
`metadata`	Table and columns metadata. TYPE: `Literal[None] \| VariableType` DEFAULT: `None`
`mode`	Write mode. Spark: OVERWRITE (overwrite existing data), APPEND (append contents of this DataFrame to existing data), ERROR (throw an exception if data already exists), IGNORE (silently ignore this operation if data already exists). Spark Streaming: APPEND (only the new rows in the streaming DataFrame/Dataset are written to the sink), COMPLETE (all the rows in the streaming DataFrame/Dataset are written to the sink every time there are some updates), UPDATE (only the rows that were updated in the streaming DataFrame/Dataset are written to the sink every time there are some updates). Polars Delta: OVERWRITE (overwrite existing data), APPEND (append contents of this DataFrame to existing data), ERROR (throw an exception if data already exists), IGNORE (silently ignore this operation if data already exists). Laktory: MERGE (append, update and optionally delete records; only supported for DELTA format; requires CDC specification). TYPE: `Literal['COMPLETE', 'IGNORE', 'MERGE', 'ERRORIFEXISTS', 'OVERWRITE', 'UPDATE', 'ERROR', 'APPEND'] \| None \| VariableType` DEFAULT: `None`
`path`	File path on a local disk, remote storage or Databricks volume. TYPE: `str \| VariableType`
`schema_definition`	Explicit table schema used when creating the table. If not set, schema is inferred from the transformer output DataFrame. TYPE: `DataFrameSchema \| VariableType` DEFAULT: `None`
`type`	Sink type. TYPE: `Literal['FILE'] \| VariableType` DEFAULT: `'FILE'`
`writer_kwargs`	Keyword arguments passed directly to dataframe backend writer. Passed to `.options()` method when using PySpark. TYPE: `dict[str \| VariableType, Any \| VariableType] \| VariableType` DEFAULT: `{}`
`writer_methods`	DataFrame backend writer methods. TYPE: `list[ReaderWriterMethod \| VariableType] \| VariableType` DEFAULT: `[]`

METHOD	DESCRIPTION
`as_source`	Generate a file data source with the same path as the sink.
`create`	Creates an empty Delta table at `self.path` if the path does not already exist.
`is_streaming`	Return `True` if the write should use Spark Structured Streaming.
`purge`	Delete sink data and checkpoints
`read`	Read dataframe from sink.
`write`	Write dataframe into sink.

ATTRIBUTE	DESCRIPTION
`ldp_auto_cdc_flow_kwargs`	Keyword arguments for dp.create_auto_cdc_flow function TYPE: `dict[str, str]`
`sdp_pre_merge_view_name`	SDP view applying node transformer prior to applying CDC changes.

`ldp_auto_cdc_flow_kwargs` `property`

Keyword arguments for dp.create_auto_cdc_flow function

`sdp_pre_merge_view_name` `property`

SDP view applying node transformer prior to applying CDC changes.

`as_source(as_stream=None, reader_kwargs=None, reader_methods=None)`

Generate a file data source with the same path as the sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, sink will be read as stream. TYPE: `bool` DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`FileDataSource`	File Data Source
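The override behaviour of `as_source` can be illustrated with a minimal stand-alone sketch. It does not import Laktory: `SourceSketch` and `as_source_sketch` are hypothetical stand-ins, and the `as_stream=False` default on the source is an assumption. The point it shows is that arguments are applied only when truthy, so passing `as_stream=False` or an empty dict leaves the source defaults untouched.

```python
from dataclasses import dataclass, field


@dataclass
class SourceSketch:
    """Hypothetical stand-in for a file data source (not the Laktory class)."""

    path: str
    format: str
    as_stream: bool = False  # assumed default, for illustration only
    reader_kwargs: dict = field(default_factory=dict)
    reader_methods: list = field(default_factory=list)


def as_source_sketch(path, format, as_stream=None, reader_kwargs=None, reader_methods=None):
    # The source always inherits the sink's path and format.
    source = SourceSketch(path=path, format=format)

    # Each optional argument is applied only when it is truthy.
    if as_stream:
        source.as_stream = as_stream
    if reader_kwargs:
        source.reader_kwargs.update(reader_kwargs)
    if reader_methods:
        source.reader_methods.extend(reader_methods)
    return source


default_source = as_source_sketch("./dataframe.csv", "CSV")
custom_source = as_source_sketch(
    "./dataframe.csv", "CSV", as_stream=True, reader_kwargs={"separator": ";"}
)
```

In this sketch `default_source` keeps empty reader options, while `custom_source` carries the stream flag and the separator option.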

Source code in laktory/models/datasinks/filedatasink.py

def as_source(
    self, as_stream: bool = None, reader_kwargs=None, reader_methods=None
) -> FileDataSource:
    """
    Generate a file data source with the same path as the sink.

    Parameters
    ----------
    as_stream:
        If `True`, sink will be read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    :
        File Data Source
    """

    source = FileDataSource(
        path=self.path, format=self.format, dataframe_backend=self.dataframe_backend
    )

    if as_stream:
        source.as_stream = as_stream
    if reader_kwargs:
        source.reader_kwargs.update(reader_kwargs)
    if reader_methods:
        source.reader_methods.extend(reader_methods)

    # if self.dataframe_backend:
    #     source.dataframe_backend = self.dataframe_backend
    source.parent = self.parent

    return source

`create(df=None)`

Creates an empty Delta table at `self.path` if the path does not already exist.

Returns `True` if the table was created, `False` otherwise. Schema is taken from `schema_definition` if set, otherwise inferred from `df`.
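The conditions under which `create` returns `False` without writing anything can be summarized in a small stand-alone sketch. `should_create` and `CREATABLE_FORMATS` are hypothetical names, the format list mirrors the check in the source listing, and the merge CDC branch is deliberately left out.

```python
from typing import Optional

# Formats accepted by the format check in the source listing.
CREATABLE_FORMATS = {"delta", "parquet", "iceberg"}


def should_create(fmt: str, exists: bool, schema: Optional[dict]) -> bool:
    """Hypothetical helper: True only when an empty table would be written."""
    if fmt.lower() not in CREATABLE_FORMATS:
        return False  # e.g. CSV sinks are never pre-created
    if exists:
        return False  # an existing path is left untouched
    if schema is None:
        return False  # no schema_definition and no df to infer a schema from
    return True


created = should_create("DELTA", exists=False, schema={"x": "int"})
skipped_csv = should_create("CSV", exists=False, schema={"x": "int"})
skipped_existing = should_create("delta", exists=True, schema={"x": "int"})
skipped_no_schema = should_create("parquet", exists=False, schema=None)
```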

Source code in laktory/models/datasinks/filedatasink.py

def create(self, df=None) -> bool:
    """
    Creates an empty Delta table at `self.path` if the path does not already exist.

    Returns True if the table was created, False otherwise.
    Schema is taken from `schema_definition` if set, otherwise inferred from `df`.
    """

    # Create table only for formats that supports append/insert/merge/delete modes
    if self.format.lower() not in ["delta", "parquet", "iceberg"]:
        return False

    if self.exists():
        return False

    self._update_backend_from_df(df)

    # For merge CDC sinks, delegate to _init_target() which builds the correct
    # schema (including SCD2 extra columns). Using the raw df schema here would
    # produce a table without those columns, causing the first merge to fail.
    if self.merge_cdc_options is not None:
        if df is None:
            logger.info(
                f"Schema is empty and `df` is None. Skipping DataFrame '{self.path}' creation."
            )
            return False
        native_df = df.to_native()
        self.merge_cdc_options._source_schema = native_df.schema
        self.merge_cdc_options._init_target(native_df)
        return True

    schema = self._get_create_schema(df)

    if schema is None:
        logger.info(
            f"Schema is empty and `df` is None. Skipping DataFrame '{self.path}' creation."
        )
        return False

    # TODO: Add logging of schema
    logger.info(f"Creating empty DataFrame at '{self.path}'")

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        from laktory import get_spark_session

        spark = get_spark_session()

        df_empty = spark.createDataFrame(data=[], schema=schema)
        df_empty.write.format(self.format).mode("overwrite").save(self.path)

    elif self.dataframe_backend == DataFrameBackends.POLARS:
        import polars as pl

        df_empty = pl.DataFrame(schema=schema)

        if self.format.lower() == "delta":
            df_empty.write_delta(self.path, mode="overwrite")
        elif self.format.lower() == "parquet":
            df_empty.write_parquet(self.path)
        else:
            raise NotImplementedError(
                f"Format '{self.format}' is not support for {self.dataframe_backend} backend."
            )

    else:
        raise NotImplementedError(
            f"DataFrame creation is not implemented for '{self.dataframe_backend}' backend"
        )

    return True

`is_streaming(df=None)`

Return `True` if the write should use Spark Structured Streaming.

Resolution order:

1. If a Narwhals-wrapped PySpark DataFrame is provided, read its native `isStreaming` attribute.
2. Fall back to `self.as_stream` (explicit sink configuration).
3. Fall back to the parent node's source `as_stream` flag.
4. Default to `False` (static write).

If both the DataFrame state and the configuration are set and they disagree, a `TypeError` is raised to surface the misconfiguration early.

PARAMETER	DESCRIPTION
`df`	Optional Narwhals DataFrame or LazyFrame. Must be passed before calling `.to_native()` so that the Narwhals `implementation` attribute is still available. DEFAULT: `None`
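The resolution order described above can be sketched as a pure function. This is an illustration only: `resolve_is_streaming` is a hypothetical name, and its three arguments stand in for the DataFrame's `isStreaming` state, the sink's `as_stream` setting and the parent node's streaming-source flag, with `None` meaning "not available".

```python
from typing import Optional


def resolve_is_streaming(
    df_is_streaming: Optional[bool],
    sink_as_stream: Optional[bool],
    source_as_stream: Optional[bool],
) -> bool:
    """Hypothetical stand-alone version of the resolution order."""
    # Steps 2 and 3: the explicit sink setting takes precedence over the
    # parent node's source flag.
    configured = sink_as_stream if sink_as_stream is not None else source_as_stream

    # A DataFrame state that contradicts the configuration is an error.
    if df_is_streaming is not None and configured is not None:
        if df_is_streaming != configured:
            raise TypeError(
                "DataFrame streaming state disagrees with the sink configuration."
            )

    # Steps 1 and 4: use whichever value is set, defaulting to a static write.
    return bool(df_is_streaming or configured or False)


static_by_default = resolve_is_streaming(None, None, None)
from_dataframe = resolve_is_streaming(True, None, None)
from_source = resolve_is_streaming(None, None, True)
sink_overrides_source = resolve_is_streaming(None, False, True)
```

Calling `resolve_is_streaming(True, False, None)` raises `TypeError`, mirroring the "configured as static, but received dataframe is streaming" case.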

Source code in laktory/models/datasinks/basedatasink.py

def is_streaming(self, df=None) -> bool:
    """
    Return `True` if the write should use Spark Structured Streaming.

    Resolution order:
    1. If a Narwhals-wrapped PySpark DataFrame is provided, read its native
       ``isStreaming`` attribute.
    2. Fall back to ``self.as_stream`` (explicit sink configuration).
    3. Fall back to the parent node's source ``as_stream`` flag.
    4. Default to ``False`` (static write).

    If both the DataFrame state and the configuration are set and they
    disagree, a ``TypeError`` is raised to surface the misconfiguration
    early.

    Parameters
    ----------
    df:
        Optional Narwhals DataFrame or LazyFrame. Must be passed before
        calling ``.to_native()`` so that the Narwhals ``implementation``
        attribute is still available.
    """
    # Check if DataFrame is streaming
    df_is_streaming = None
    if df is not None:
        df = nw.from_native(df)
        dataframe_backend = DataFrameBackends(df.implementation)
        if dataframe_backend == DataFrameBackends.PYSPARK:
            df_is_streaming = df.to_native().isStreaming

    # Check if configured as stream from writer or source
    configured_as_stream = self.as_stream
    if configured_as_stream is None:
        node = self.parent_pipeline_node
        if node is not None and node.sources:
            configured_as_stream = node.has_streaming_source

    # Resolve conflict
    if df_is_streaming is not None and configured_as_stream is not None:
        if df_is_streaming != configured_as_stream:
            if df_is_streaming:
                raise TypeError(
                    "Sink configured as static, but received dataframe is streaming."
                )
            else:
                raise TypeError(
                    "Sink configured as stream, but received dataframe is not streaming."
                )

    is_streaming = df_is_streaming or configured_as_stream or False

    return is_streaming

`purge()`

Delete sink data and checkpoints
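The data-removal step distinguishes between a single file (for example a CSV) and a directory (for example a Delta table). A minimal sketch of that branch using only the standard library follows; `purge_path` is a hypothetical helper, the existence check is assumed to be a plain local-path check, and checkpoint removal is not covered.

```python
import os
import shutil
import tempfile


def purge_path(path: str) -> bool:
    """Delete a file or a directory tree; return True if something was removed."""
    if not os.path.exists(path):
        return False
    if os.path.isdir(path):
        shutil.rmtree(path)  # directory-based sink, e.g. a Delta table
    else:
        os.remove(path)  # single-file sink, e.g. a CSV
    return True


# Usage on throwaway paths inside a temporary directory.
root = tempfile.mkdtemp()
table_dir = os.path.join(root, "delta_table")
os.makedirs(os.path.join(table_dir, "_delta_log"))
csv_file = os.path.join(root, "dataframe.csv")
with open(csv_file, "w") as f:
    f.write("x\n0\n1\n")

removed_dir = purge_path(table_dir)
removed_file = purge_path(csv_file)
removed_again = purge_path(csv_file)  # nothing left to delete
shutil.rmtree(root)
```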

Source code in laktory/models/datasinks/filedatasink.py

def purge(self):
    """
    Delete sink data and checkpoints
    """
    # Remove Data
    if self.exists():
        is_dir = os.path.isdir(self.path)
        if is_dir:
            logger.info(f"Deleting data dir {self.path}")
            shutil.rmtree(self.path)
        else:
            logger.info(f"Deleting data file {self.path}")
            os.remove(self.path)

    # TODO: Add support for Databricks dbfs / workspace / Volume?

    # Remove Checkpoint
    self._purge_checkpoint()

`read(as_stream=None, reader_kwargs=None, reader_methods=None)`

Read dataframe from sink.

PARAMETER	DESCRIPTION
`as_stream`	If `True`, dataframe read as stream. DEFAULT: `None`
`reader_kwargs`	Keyword arguments passed to the dataframe backend reader. DEFAULT: `None`
`reader_methods`	DataFrame backend reader methods. DEFAULT: `None`

RETURNS	DESCRIPTION
`AnyFrame`	DataFrame

Source code in laktory/models/datasinks/basedatasink.py

def read(self, as_stream=None, reader_kwargs=None, reader_methods=None):
    """
    Read dataframe from sink.

    Parameters
    ----------
    as_stream:
        If `True`, dataframe read as stream.
    reader_kwargs:
        Keyword arguments passed to the dataframe backend reader.
    reader_methods:
        DataFrame backend reader methods.

    Returns
    -------
    AnyFrame
        DataFrame
    """
    return self.as_source(
        as_stream=as_stream,
        reader_kwargs=reader_kwargs,
        reader_methods=reader_methods,
    ).read()

`write(df=None, view_definition=None, mode=None)`

Write dataframe into sink.

PARAMETER	DESCRIPTION
`df`	Input dataframe. TYPE: `AnyFrame` DEFAULT: `None`
`mode`	Write mode overriding the sink's default mode. TYPE: `str` DEFAULT: `None`
`view_definition`	View definition for table data sinks of `VIEW` type TYPE: `str` DEFAULT: `None`
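The order in which `write` picks a code path (view, custom writer, merge, then backend writer) can be sketched as a small dispatch function. `select_write_path` and its string return values are hypothetical, and the backend names are passed as plain strings rather than Laktory enum members.

```python
from typing import Optional


def select_write_path(
    table_type: Optional[str],
    has_custom_writer: bool,
    mode: Optional[str],
    backend: str,
) -> str:
    """Hypothetical summary of which branch handles a write call."""
    if table_type == "VIEW":
        return "view"  # view sinks ignore the DataFrame entirely
    if has_custom_writer:
        return "custom_writer"  # replaces the built-in write logic
    if mode and mode.lower() == "merge":
        return "merge"  # delegated to the merge CDC options
    if backend in ("PYSPARK", "POLARS"):
        return backend.lower()  # built-in backend writer
    raise ValueError(f"DataFrame backend '{backend}' is not supported")


view_path = select_write_path("VIEW", True, "MERGE", "PYSPARK")
custom_path = select_write_path(None, True, "MERGE", "POLARS")
merge_path = select_write_path(None, False, "MERGE", "PYSPARK")
polars_path = select_write_path(None, False, "APPEND", "POLARS")
```

The sketch makes the precedence explicit: a custom writer is consulted before the mode, which is consistent with `custom_writer` being documented as mutually exclusive with `mode` and `merge_cdc_options`.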

Source code in laktory/models/datasinks/basedatasink.py

def write(
    self,
    df: AnyFrame = None,
    view_definition: str = None,
    mode: str = None,
) -> None:
    """
    Write dataframe into sink.

    Parameters
    ----------
    df:
        Input dataframe.
    mode:
        Write mode overwrite of the sink default mode.
    view_definition:
        View definition for table data sinks of `VIEW` type
    """

    logger.info("Write initiated.")

    if getattr(self, "table_type", None) == "VIEW":
        if view_definition is None:
            raise ValueError(f"`view_definition` for '{self._id}' is `None`")

        from laktory.models.dataframe.dataframeexpr import DataFrameExpr

        if not isinstance(view_definition, DataFrameExpr):
            view_definition = DataFrameExpr(expr=view_definition)

        if self.dataframe_backend == DataFrameBackends.PYSPARK:
            self._write_spark_view(view_definition)
        elif self.dataframe_backend == DataFrameBackends.POLARS:
            self._write_polars_view(view_definition)
        else:
            raise ValueError(
                f"DataFrame backend '{self.dataframe_backend}' is not supported"
            )
        return

    if not isinstance(df, (nw.DataFrame, nw.LazyFrame)):
        df = nw.from_native(df)
    self._update_backend_from_df(df)

    # Custom Writer
    if self.custom_writer:
        df_native = df.to_native()

        # Special Treatment for Spark Streaming
        if (
            self.dataframe_backend == DataFrameBackends.PYSPARK
            and self.is_streaming(df=df)
        ):
            if self.checkpoint_path is None:
                raise ValueError(
                    f"Checkpoint location not specified for sink '{self._id}'"
                )
            # Build context before the foreachBatch lambda so that _parent
            # references are captured while intact. Inside foreachBatch on
            # Databricks, the lambda closure is serialized via cloudpickle
            # and _parent attributes may not survive the round-trip.
            from laktory.models.laktorycontext import LaktoryContext

            _context = LaktoryContext(
                node=self.parent_pipeline_node,
                pipeline=self.parent_pipeline,
                sink=self,
            )
            query = (
                df_native.writeStream.foreachBatch(
                    lambda batch_df, _: self.custom_writer.execute(
                        batch_df, context=_context
                    )
                )
                .trigger(availableNow=True)
                .options(checkpointLocation=self.checkpoint_path)
                .start()
            )
            query.awaitTermination()

        else:
            self.custom_writer.execute(df)

        logger.info("Write completed.")
        return

    if mode is None:
        mode = self.mode

    self._validate_mode(mode, df)
    self._validate_format()

    if mode and mode.lower() == "merge":
        self.merge_cdc_options.execute(source=df)
        logger.info("Write completed.")
        return

    if self.dataframe_backend == DataFrameBackends.PYSPARK:
        self._write_spark(df=df, mode=mode)
    elif self.dataframe_backend == DataFrameBackends.POLARS:
        self._write_polars(df=df, mode=mode)
    else:
        raise ValueError(
            f"DataFrame backend '{self.dataframe_backend}' is not supported"
        )

    logger.info("Write completed.")
