Problem
You are attempting to convert an Apache Spark DataFrame that contains a 1D array column to MDS format using Mosaic Streaming. In your code, you are calling the dataframe_to_mds
function, specifying the mds_kwargs
parameter containing the columns field with your array column type as ndarray
.
Example code
from streaming.base.converters import dataframe_to_mds
mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)
When you run this code, you get a value error.
ValueError: ndarray is not supported by dataframe_to_mds
Cause
When specifying an array column as ndarray
type in the columns field of the mds_kwargs
parameter, it is necessary to append the data type of the elements of the array, which can be one of the following:
ArrayType(ShortType()): 'ndarray:int16'
ArrayType(IntegerType()): 'ndarray:int32’
ArrayType(LongType()): 'ndarray:int64'
ArrayType(FloatType()): 'ndarray:float32'
ArrayType(DoubleType()): ‘ndarray:float64’
Solution
Properly pass the data type of the elements of the array column in the mds_kwargs
. In this example code, float64
is specified as the data type. This resolves the issue.
Example code
from streaming.base.converters import dataframe_to_mds
mds_kwargs = {'out': "<dest-filepath>", 'columns': {'id_col':'float64', 'array_data': 'ndarray:float64'}}
dataframe_to_mds(df, merge_index=True, mds_kwargs=mds_kwargs)
Note
Ensure you are using the mosaicml-streaming
package version 0.7.6 or above. Support for ndarray
was added to in version 0.7.6.