AttributeError: 'function' object has no attribute
Problem You are selecting columns from a DataFrame and you get an error message. ERROR: AttributeError: 'function' object has no attribute '_get_object_id' in job Cause The DataFrame API contains a small number of protected keywords. If a column in your DataFrame uses a protected keyword as the column name, you will get an error message. For example...
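A minimal sketch of how the collision can occur, assuming a hypothetical DataFrame with a column named count, which shadows the built-in DataFrame.count() method:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame whose "count" column collides with the
# DataFrame.count() method.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "count"])

# df.count resolves to the method, not the column, so passing it to
# select() raises:
# AttributeError: 'function' object has no attribute '_get_object_id'
# df.select(df.count).show()

# Bracket notation always resolves to the column and avoids the error.
df.select(df["count"]).show()
```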
Convert Python datetime object to string
There are multiple ways to display date and time values with Python; however, not all of them are easy to read. For example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common Y...
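For illustration, a short sketch showing the repr form you get back from a collected timestamp and one common strftime pattern (the format string is just an example):

```python
from datetime import datetime

# Collecting a timestamp column yields datetime objects, which display
# in repr form, e.g. datetime.datetime(2021, 8, 24, 14, 30, 5).
ts = datetime(2021, 8, 24, 14, 30, 5)
print(repr(ts))

# strftime() converts the object to a string in whatever format you choose.
print(ts.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-08-24 14:30:05
```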
Create a cluster with Conda
Conda is a popular open source package management system for the Anaconda repo. Databricks Runtime for Machine Learning (Databricks Runtime ML) uses Conda to manage Python library dependencies. If you want to use Conda, you should use Databricks Runtime ML. Attempting to install Anaconda or Conda for use with Databricks Runtime is not supported. Fol...
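On Databricks Runtime ML versions that support notebook-scoped magic commands, you can work with Conda directly from a notebook cell; a minimal illustration (the package name is arbitrary, and the magic typically runs alone in its own cell):

```python
%conda install matplotlib
```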
Display file and directory timestamp details
In this article we show you how to display detailed timestamps, including the date and time when a file was created or modified. Use ls command The simplest way to display file timestamps is to use the ls -lt <path> command in a bash shell. For example, this sample command displays basic timestamps for files and directories in the /dbfs/ folde...
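If you would rather stay in Python, a quick sketch using the standard library over the local /dbfs FUSE mount (the path is illustrative):

```python
import os
from datetime import datetime

# List entries under the DBFS FUSE mount with their modification times.
for entry in os.scandir("/dbfs/"):
    mtime = datetime.fromtimestamp(entry.stat().st_mtime)
    print(f"{mtime:%Y-%m-%d %H:%M:%S}  {entry.name}")
```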
Install and compile Cython
This document explains how to run Spark code with compiled Cython code. The steps are as follows: Create an example Cython module on DBFS (AWS | Azure). Add the file to the Spark session. Create a wrapper method to load the module on the executors. Run the mapper on a sample dataset. Generate a larger dataset and compare the performance with nat...
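A condensed sketch of the wrapper step, assuming a hypothetical Cython source file fib.pyx (containing a fib function) has already been added to the session with sc.addPyFile:

```python
# Wrapper that compiles and imports the Cython module on each executor.
def spark_cython(module, method):
    def wrapped(*args, **kwargs):
        import pyximport
        pyximport.install()  # compile .pyx sources on import
        mod = __import__(module)
        return getattr(mod, method)(*args, **kwargs)
    return wrapped

# Run the mapper on a sample dataset. fib.pyx was shipped earlier,
# e.g. with sc.addPyFile("dbfs:/path/to/fib.pyx").
rdd = sc.parallelize(range(1, 10))
print(rdd.map(spark_cython("fib", "fib")).collect())
```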
Reading large DBFS-mounted files using Python APIs
This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs. Problem If you mount a folder onto dbfs:// and read a file larger than 2GB in a Python API like pandas, you will see the following error: /databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextRead...
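The usual workaround is to copy the file to local driver storage and read it from there; a hedged sketch with illustrative paths, assuming dbutils is available in the notebook:

```python
import pandas as pd

# Copy from the DBFS mount to local disk, then read the local copy,
# which is not subject to the FUSE mount's file-size limit.
dbutils.fs.cp("dbfs:/mnt/data/large_file.csv", "file:/tmp/large_file.csv")
df = pd.read_csv("/tmp/large_file.csv")
```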
Use the HDFS API to read files in Python
There may be times when you want to read files directly without using third-party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts. AWS Use the following example code for S3 bucket storage. %python URI = sc._gateway.jvm.java.net.URI Path = sc._gateway.jvm.org.apa...
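The pattern in the excerpt continues roughly as follows; a hedged sketch with an illustrative bucket name:

```python
%python
# Access the JVM's Hadoop filesystem classes through the Py4J gateway.
URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# List objects in an illustrative S3 bucket via the HDFS API.
fs = FileSystem.get(URI("s3a://my-bucket"), Configuration())
for f in fs.listStatus(Path("/")):
    print(f.getPath())
```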
How to import a custom CA certificate
When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...
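One common approach is to point the standard certificate environment variables at your bundle before opening any connections; a hedged sketch with an illustrative DBFS path:

```python
import os

# Illustrative path to a custom CA bundle stored on DBFS.
ca_bundle = "/dbfs/certs/my_custom_ca.pem"

# requests honors REQUESTS_CA_BUNDLE; many other SSL clients honor
# SSL_CERT_FILE. Set both before making connections.
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
os.environ["SSL_CERT_FILE"] = ca_bundle

import requests
resp = requests.get("https://my_server_endpoint/endpoint")
print(resp.status_code)
```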
Job remains idle before starting
Problem You have an Apache Spark job that is triggered correctly, but remains idle for a long time before starting. Or you have a Spark job that ran well for a while, but goes idle for a long time before resuming. Symptoms include: Cluster downscales to the minimum number of worker nodes during idle time. Driver logs don’t show any Spark jobs during idl...
List all workspace objects
You can use the Databricks Workspace API (AWS | Azure | GCP) to recursively list all workspace objects under a given path. Common use cases for this include: Indexing all notebook names and types for all users in your workspace. Using the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks. Dynamically getting t...
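A minimal recursive sketch against the Workspace API, assuming DOMAIN and TOKEN hold your workspace URL and a personal access token:

```python
import requests

DOMAIN = "https://<databricks-instance>"   # illustrative
TOKEN = "<personal-access-token>"          # illustrative

def list_workspace(path="/"):
    """Recursively print every workspace object under the given path."""
    resp = requests.get(
        f"{DOMAIN}/api/2.0/workspace/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"path": path},
    )
    resp.raise_for_status()
    for obj in resp.json().get("objects", []):
        print(obj["object_type"], obj["path"])
        if obj["object_type"] == "DIRECTORY":
            list_workspace(obj["path"])

list_workspace("/")
```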
Load special characters with Spark-XML
Problem You have special characters in your source files and are using the OSS library Spark-XML. The special characters do not render correctly. For example, “CLU®” is rendered as “CLU�”. Cause Spark-XML supports the UTF-8 character set by default. You are using a different character set in your XML files. Solution You must specify the character se...
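For example, if your files are encoded as ISO-8859-1 (illustrative; match your actual encoding), pass the charset option when reading:

```python
# rowTag and the file path are illustrative.
df = (
    spark.read.format("xml")
    .option("charset", "ISO-8859-1")
    .option("rowTag", "book")
    .load("dbfs:/path/to/file.xml")
)
df.show()
```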
Python commands fail on high concurrency clusters
Problem You are attempting to run Python commands on a high concurrency cluster. All Python commands fail with a WARN error message. WARN PythonDriverWrapper: Failed to start repl ReplId-61bef-9fc33-1f8f6-2 ExitCodeException exitCode=1: chown: invalid user: 'spark-9fcdf4d2-045d-4f3b-9293-0f' Cause Both spark.databricks.pyspark.enableProcessIsolation...
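To check whether the setting named in the cause is active on your cluster, you can read it from the Spark conf; a minimal check (the fallback default shown is an assumption):

```python
# Prints the configured value, or "false" if the key is unset.
print(spark.conf.get("spark.databricks.pyspark.enableProcessIsolation", "false"))
```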
Cluster cancels Python command execution after installing Bokeh
Problem The cluster returns Cancelled in a Python notebook. Inspect the driver log (std.err) in the Cluster Configuration page for a stack trace and error message similar to the following: log4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$). log4j:WARN Please initialize the log4j system properly. log4j:WARN S...
Cluster cancels Python command execution due to library conflict
Problem The cluster returns Cancelled in a Python notebook. Notebooks in all other languages execute successfully on the same cluster. Cause When you install a conflicting version of a library, such as ipython, ipywidgets, numpy, scipy, or pandas, to the PYTHONPATH, the Python REPL can break, causing all commands to return Cancelled after 30 sec...
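A quick way to spot an unexpected install is to print the versions of the libraries the article names; a small sketch using the standard library (Python 3.8+):

```python
import importlib.metadata as md

# Libraries the article lists as common sources of REPL-breaking conflicts.
for pkg in ["ipython", "ipywidgets", "numpy", "scipy", "pandas"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")
```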
Python command execution fails with AttributeError
This article can help you resolve scenarios in which Python command execution fails with an AttributeError. Problem: 'tuple' object has no attribute 'type' When you run a notebook, Python command execution fails with the following error and stack trace: AttributeError: 'tuple' object has no attribute 'type' Traceback (most recent call last): File "/...
Python REPL fails to start in Docker
Problem When you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs. 20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7 java.io.IOException: Cannot run program "/local_disk0/...
How to run SQL queries from Python scripts
You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. You should have PyHive installed on the machine where you are running the Python script. Info Pytho...
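A minimal PyHive sketch against a Spark Thrift server; the host, port, and table name are illustrative, and your cluster's connection details (including authentication) will differ:

```python
from pyhive import hive

# Connect to an illustrative Thrift server endpoint and run a query.
conn = hive.connect(host="<thrift-server-host>", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
for row in cursor.fetchall():
    print(row)
```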
Python 2 sunset status
Python.org officially moved Python 2 into EoL (end-of-life) status on January 1, 2020. What does this mean for you? Databricks Runtime 6.0 and above Databricks Runtime 6.0 and above support only Python 3. You cannot create a cluster with Python 2 using these runtimes. Any clusters created with these runtimes use Python 3 by definition. Databricks Ru...