Problem
You are converting an Apache Spark DataFrame that contains a 1D array column to MDS format using Mosaic Streaming. Your code calls the dataframe_to_mds function and passes an mds_kwargs parameter whose columns field declares the array column's type as ndarray.
Example code
from streaming.base.converters import dataframe_to_mds
mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)
When you run this code, it raises a ValueError.
ValueError: ndarray is not supported by dataframe_to_mds
Cause
When specifying an array column as ndarray type in the columns field of the mds_kwargs parameter, it is necessary to append the data type of the elements of the array, which can be one of the following:
ArrayType(ShortType()): 'ndarray:int16'
ArrayType(IntegerType()): 'ndarray:int32'
ArrayType(LongType()): 'ndarray:int64'
ArrayType(FloatType()): 'ndarray:float32'
ArrayType(DoubleType()): 'ndarray:float64'
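The mapping above can be applied programmatically. The following is a minimal sketch, not part of the mosaicml-streaming package: mds_encoding_for_array is a hypothetical helper that takes the type string Spark reports for a column (for example, the second element of each pair in df.dtypes) and returns the matching ndarray encoding. It assumes Spark's short type names (smallint, int, bigint, float, double) and uses a literal list in place of a real DataFrame.

```python
# Hypothetical helper (not part of mosaicml-streaming): look up the MDS
# encoding for an array column from its Spark type string.

# Spark element type names -> MDS ndarray encodings, per the mapping above.
_ELEMENT_TO_MDS = {
    "smallint": "ndarray:int16",   # ShortType
    "int": "ndarray:int32",        # IntegerType
    "bigint": "ndarray:int64",     # LongType
    "float": "ndarray:float32",    # FloatType
    "double": "ndarray:float64",   # DoubleType
}


def mds_encoding_for_array(spark_type: str) -> str:
    """Return the MDS encoding for a type string such as 'array<double>'."""
    if not (spark_type.startswith("array<") and spark_type.endswith(">")):
        raise ValueError(f"not an array type: {spark_type!r}")
    element = spark_type[len("array<"):-1]
    try:
        return _ELEMENT_TO_MDS[element]
    except KeyError:
        # Covers nested arrays and element types outside the mapping.
        raise ValueError(f"unsupported array element type: {element!r}") from None


# df.dtypes yields (column name, type string) pairs; a literal stands in here.
dtypes = [("array_data", "array<double>")]
columns = {name: mds_encoding_for_array(t) for name, t in dtypes}
print(columns)
```

The resulting dictionary can be merged into the columns field of mds_kwargs alongside the non-array columns.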
Solution
Append the element data type to ndarray in the columns field of mds_kwargs. In this example, the element type is float64, so the array column is declared as ndarray:float64. This resolves the issue.
Example code
from streaming.base.converters import dataframe_to_mds
mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray:float64'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)
Note
Ensure you are using mosaicml-streaming version 0.7.6 or above. Support for ndarray was added in version 0.7.6.
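One way to check the installed version before running the conversion is sketched below. The helper names (parse_version, supports_ndarray, installed_streaming_version) are hypothetical, and the version parsing is deliberately simple: it compares only the leading numeric parts and ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '0.7.6' into (0, 7, 6), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def supports_ndarray(installed: str, minimum: str = "0.7.6") -> bool:
    """True if the installed version is at or above the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_streaming_version():
    """Return the installed mosaicml-streaming version, or None if absent."""
    try:
        return version("mosaicml-streaming")
    except PackageNotFoundError:
        return None
```

Calling supports_ndarray(installed_streaming_version()) on a cluster where the package is installed indicates whether an upgrade is needed first.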