Jobs failing with shuffle fetch failures
Problem: You are seeing intermittent Apache Spark job failures caused by shuffle fetch failures.

21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Failed to conne...
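The truncated excerpt does not show the article's resolution. As a hedged starting point (these values are illustrative, not taken from the article), shuffle fetch failures are often mitigated by raising Spark's shuffle retry and network timeout settings in the cluster's Spark config:

```
spark.shuffle.io.maxRetries 10
spark.shuffle.io.retryWait 30s
spark.network.timeout 300s
```

These must be set at cluster startup; they cannot be changed on a running session.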
Enable retries in init script
Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script: This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
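The excerpt cuts off before the sample. A minimal sketch of the retry-a-copy pattern it describes, as a bash init script; the source and destination paths are hypothetical placeholders:

```shell
#!/bin/bash
# Sketch only: retry a basic copy operation inside an init script.
# SRC and DST are hypothetical paths -- replace with your own.
SRC=/tmp/retry-src.txt
DST=/tmp/retry-dst.txt

echo "hello" > "$SRC"   # sample source file so the example is self-contained

MAX_ATTEMPTS=5
attempt=1
until cp "$SRC" "$DST"; do
  if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
    echo "copy failed after $MAX_ATTEMPTS attempts" >&2
    exit 1
  fi
  attempt=$((attempt + 1))
  sleep 2          # back off before retrying
done
echo "copy succeeded on attempt $attempt"
```

The `until` loop retries the command until it exits 0, giving up after a fixed number of attempts so a persistent failure does not hang cluster startup.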
Replay Apache Spark events in a cluster
The Spark UI is commonly used as a debugging tool for Spark jobs. If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events. Warning: Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise there ...
How to import a custom CA certificate
When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...
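The excerpt stops before the fix. One common approach (an assumption, not necessarily the article's method) is to point Python TLS clients at your custom CA bundle via environment variables; the bundle path below is a hypothetical placeholder:

```python
import os

# Hypothetical path to a PEM bundle containing your organization's
# custom CA certificate, e.g. a file you uploaded to DBFS.
CA_BUNDLE = "/dbfs/usr/local/share/my_ca_bundle.pem"

# `requests` honors REQUESTS_CA_BUNDLE; many other TLS clients that use
# OpenSSL defaults honor SSL_CERT_FILE.
os.environ["REQUESTS_CA_BUNDLE"] = CA_BUNDLE
os.environ["SSL_CERT_FILE"] = CA_BUNDLE
```

Set these before the first HTTPS request is made (for example, in an init script or at the top of the notebook) so sessions pick them up.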
Set Apache Hadoop core-site.xml properties
You have a scenario that requires Apache Hadoop properties to be set. You would normally do this in the core-site.xml file. In this article, we explain how you can set core-site.xml in a cluster. Create the core-site.xml file in DBFS You need to create a core-site.xml file and save it to DBFS on your cluster. An easy way to create this file is via a...
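A sketch of generating the core-site.xml content in Python; the property shown is a hypothetical example, and the local path stands in for the DBFS destination you would use on a real cluster:

```python
import xml.etree.ElementTree as ET

def build_core_site(props):
    """Render a dict of Hadoop properties as core-site.xml content."""
    configuration = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")

# Hypothetical property -- substitute the Hadoop settings you need.
xml_body = build_core_site({"hadoop.proxyuser.hive.hosts": "*"})

# On Databricks you would save this to DBFS (for example with
# dbutils.fs.put); a local path is used here as a stand-in.
with open("/tmp/core-site.xml", "w") as f:
    f.write(xml_body)
```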
How to run SQL queries from Python scripts
You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. You should have PyHive installed on the machine where you are running the Python script. Info: Pytho...
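A minimal sketch of the PyHive pattern the teaser describes, wrapped in a helper. The host, port, and username are placeholders, and the connection options may need adjusting for your cluster's JDBC/ODBC endpoint:

```python
def run_hive_query(host, query, port=10000, username=None):
    """Run a SQL query over a PyHive connection and return all rows.

    Requires `pip install 'pyhive[hive]'` on the machine running the
    script. All connection parameters are placeholders -- use the
    details from your cluster's JDBC/ODBC configuration.
    """
    from pyhive import hive  # imported lazily so the dependency is optional

    conn = hive.connect(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()

# Example usage (needs a reachable Thrift server):
# rows = run_hive_query("my-cluster-host", "SELECT * FROM default.my_table LIMIT 10")
```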
Python REPL fails to start in Docker
Problem: When you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs. 20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7 java.io.IOException: Cannot run program "/local_disk0/...
AWS services fail with No region provided error
Problem: Your code snippets that use AWS services fail with a java.lang.IllegalArgumentException: No region provided error in Databricks Runtime 7.0 and above. The same code worked in Databricks Runtime 6.6 and below. You can verify the issue by running the example code snippet in a notebook. In Databricks Runtime 7.0 and above, it will return the ex...
GeoSpark undefined function error with DBConnect
Problem: You are trying to use the GeoSpark function st_geomfromwkt with DBConnect (AWS | Azure | GCP) and you get an Apache Spark error message. Error: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; T...
S3 connection reset error
Problem: Your Apache Spark job fails when attempting an S3 operation. The error message Caused by: java.net.SocketException: Connection reset appears in the stack trace. Example stack trace from an S3 read operation: Caused by: javax.net.ssl.SSLException: Connection reset; Request ID: XXXXX, Extended Request ID: XXXXX, Cloud Provider: AWS, Instance I...
PyPMML fails with Could not find py4j jar error
Problem: PyPMML is a Python PMML scoring library. After installing PyPMML in a Databricks cluster, it fails with a Py4JError: Could not find py4j jar error. %python from pypmml import Model modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml') Error: Py4JError: Could not find py4j jar at Cause: This error occurs due to a dependency on the defa...
Cluster slowdown due to Ganglia metrics filling root partition
Note: This article applies to Databricks Runtime 7.3 LTS and below. Problem: Clusters start slowing down and may show a combination of the following symptoms: Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...
Python commands fail on Machine Learning clusters
Problem: You are using a Databricks Runtime for Machine Learning cluster and Python notebooks are failing. You find an invalid syntax error in the logs. SyntaxError: invalid syntax File "/local_disk0/tmp/1593092990800-0/PythonShell.py", line 363 def __init__(self, *args, condaMagicHandler=None, **kwargs): Cause: Key values in the /etc/environmen...
Use the HDFS API to read files in Python
There may be times when you want to read files directly without using third party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts. AWS Use the following example code for S3 bucket storage. %python URI = sc._gateway.jvm.java.net.URI Path = sc._gateway.jvm.org.apa...
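The excerpt shows the start of the py4j-gateway pattern. A hedged sketch of how that pattern typically continues, wrapped in a function; it assumes an active SparkContext `sc` and that commons-io is on the cluster classpath (both assumptions about the environment, not shown in the excerpt):

```python
def read_file_via_hadoop_fs(sc, path):
    """Read a small file through the Hadoop FileSystem API on the driver JVM.

    `path` can be any URI the cluster can resolve (s3a://..., abfss://...,
    dbfs:/...). Sketch only -- requires an active SparkContext `sc`, and
    is intended for small files since the whole stream is materialized.
    """
    jvm = sc._gateway.jvm
    conf = sc._jsc.hadoopConfiguration()
    uri = jvm.java.net.URI(path)
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, conf)
    stream = fs.open(jvm.org.apache.hadoop.fs.Path(path))
    try:
        # commons-io IOUtils reads the input stream fully into a string.
        return jvm.org.apache.commons.io.IOUtils.toString(stream, "UTF-8")
    finally:
        stream.close()

# Example usage on a cluster:
# text = read_file_via_hadoop_fs(sc, "s3a://my-bucket/path/to/file.txt")
```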