Getting ValueError: ndarray is not supported by dataframe_to_mds when converting an Apache Spark DataFrame to MDS format using Mosaic Streaming

Append the data type of the array elements to ndarray in the columns field of mds_kwargs, for example 'ndarray:float64'.

Written by jessica.santos

Last published at: January 16th, 2025

Problem

You are attempting to convert an Apache Spark DataFrame that contains a 1D array column to MDS format using Mosaic Streaming. In your code, you call the dataframe_to_mds function and pass an mds_kwargs parameter whose columns field declares the array column's type as plain ndarray.

 

Example code

from streaming.base.converters import dataframe_to_mds

mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)

 

When you run this code, it raises a ValueError.

ValueError: ndarray is not supported by dataframe_to_mds

 

Cause

When you specify an array column as ndarray in the columns field of the mds_kwargs parameter, you must also append the data type of the array elements, in the form 'ndarray:<dtype>'. The supported Spark array types and their corresponding values are:

  • ArrayType(ShortType()): 'ndarray:int16'
  • ArrayType(IntegerType()): 'ndarray:int32'
  • ArrayType(LongType()): 'ndarray:int64'
  • ArrayType(FloatType()): 'ndarray:float32'
  • ArrayType(DoubleType()): 'ndarray:float64'
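The mapping above can be expressed as a small lookup, which may help when a DataFrame has several array columns. This is a minimal sketch and not part of Mosaic Streaming: the helper name ndarray_encoding is hypothetical, and it assumes you feed it the type strings that Spark reports in df.dtypes (for example 'array<double>'). It runs without a Spark session, so the dtypes list here is hard-coded for illustration.

```python
import re

# Mapping taken from the list above: Spark SQL element type name -> MDS ndarray value.
# The keys are the simple type names that Spark reports in df.dtypes.
SPARK_ELEMENT_TO_MDS = {
    "smallint": "ndarray:int16",   # ArrayType(ShortType())
    "int": "ndarray:int32",        # ArrayType(IntegerType())
    "bigint": "ndarray:int64",     # ArrayType(LongType())
    "float": "ndarray:float32",    # ArrayType(FloatType())
    "double": "ndarray:float64",   # ArrayType(DoubleType())
}


def ndarray_encoding(spark_dtype: str) -> str:
    """Hypothetical helper: map a df.dtypes string such as 'array<double>' to its value."""
    match = re.fullmatch(r"array<(\w+)>", spark_dtype.strip())
    if match is None or match.group(1) not in SPARK_ELEMENT_TO_MDS:
        # Nested arrays and element types not in the list above end up here.
        raise ValueError(f"No ndarray encoding listed for Spark type {spark_dtype!r}")
    return SPARK_ELEMENT_TO_MDS[match.group(1)]


# Illustrative input with the shape of df.dtypes: a list of (column name, type string) pairs.
dtypes = [("id_col", "double"), ("array_data", "array<double>")]

# Build the entries for array columns only; scalar columns are still declared by hand.
array_columns = {name: ndarray_encoding(t) for name, t in dtypes if t.startswith("array<")}
print(array_columns)  # {'array_data': 'ndarray:float64'}
```

The resulting dictionary can be merged into the columns field of mds_kwargs alongside your scalar column declarations.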

 

Solution

Append the data type of the array elements to ndarray in the columns field of mds_kwargs. The following example code specifies float64 as the element data type, which corresponds to an ArrayType(DoubleType()) column. This resolves the issue.

 

Example code

from streaming.base.converters import dataframe_to_mds

mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray:float64'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)

 

Note

Ensure you are using the mosaicml-streaming package version 0.7.6 or above. Support for ndarray was added in version 0.7.6.
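One way to confirm the installed version before running the conversion is sketched below. It uses only the Python standard library. The helper names parse_version and supports_ndarray are hypothetical, and the sketch assumes a three-part version string such as 0.7.6, ignoring any pre-release suffix.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (0, 7, 6)  # first version with ndarray support, per the note above


def parse_version(text: str) -> tuple:
    """Return the leading numeric components, e.g. '0.7.6' -> (0, 7, 6)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_ndarray(installed: str) -> bool:
    """True when the given version string is 0.7.6 or above."""
    return parse_version(installed) >= MINIMUM


try:
    installed = version("mosaicml-streaming")
    status = "OK" if supports_ndarray(installed) else "too old, upgrade to 0.7.6 or above"
    print(f"mosaicml-streaming {installed}: {status}")
except PackageNotFoundError:
    print("mosaicml-streaming is not installed in this environment")
```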