Ho un DataFrame in Apache Spark con una matrice di numeri interi, la sorgente è un insieme di immagini. Alla fine voglio fare PCA su di esso, ma sto avendo problemi solo creando una matrice dai miei array. Come posso creare una matrice da un RDD?Apache Spark: come creare una matrice da un DataFrame?
> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
Traceback (most recent call last):
File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
values = self._convert_to_array(values, np.float64)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
return np.asarray(array_like, dtype=dtype)
File "/usr/local/python/conda/lib/python2.7/site- packages/numpy/core/numeric.py", line 462, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
che sto ottenendo lo stesso errore da ogni possibile accordo mi viene in mente:
imagerdd = traindf.map(lambda row: Vectors.dense(row.image))
imagerdd = traindf.map(lambda row: row.image)
imagerdd = traindf.map(lambda row: np.array(row.image))
Se provo
> imagedf = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
Traceback (chiamata più recente scorso):
File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
values = self._convert_to_array(values, np.float64)
File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
return np.asarray(array_like, dtype=dtype)
File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Sai come fare lo stesso in scala? https://stackoverflow.com/questions/47010126/calculate-cosine-similarity-spark-dataframe –