Updated March 15th, 2022 by arjun.kaimaparambilrajan

S3 connection reset error

Problem: Your Apache Spark job fails when attempting an S3 operation. The error message Caused by: java.net.SocketException: Connection reset appears in the stack trace. Example stack trace from an S3 read operation: Caused by: javax.net.ssl.SSLException: Connection reset; Request ID: XXXXX, Extended Request ID: XXXXX, Cloud Provider: AWS, Instance I...

1 min reading time
Updated May 19th, 2022 by arjun.kaimaparambilrajan

How to run SQL queries from Python scripts

You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables from Python scripts. You can connect to a Spark cluster using PyHive, which talks to the cluster's Thrift server, and then run a script. PyHive must be installed on the machine where you run the Python script. Info: Pytho...
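As a hedged sketch only (the host, port, and helper names below are illustrative assumptions, not taken from the article), a PyHive connection plus a small result-shaping helper might look like this:

```python
def rows_as_dicts(cursor):
    """Turn a DB-API cursor's result set into a list of dicts,
    keyed by column name (cursor.description holds the names)."""
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

def run_query(host, sql, port=10000, username="user"):
    """Connect to a Thrift server with PyHive and run one query.
    Host, port, and username are placeholders -- substitute your
    cluster's actual connection details and auth settings."""
    from pyhive import hive  # requires: pip install pyhive thrift

    conn = hive.connect(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return rows_as_dicts(cursor)
    finally:
        conn.close()
```

The `rows_as_dicts` helper works with any DB-API cursor, so it can be tested without a live cluster.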

1 min reading time
Updated June 1st, 2022 by arjun.kaimaparambilrajan

GeoSpark undefined function error with DBConnect

Problem: You are trying to use the GeoSpark function st_geomfromwkt with DBConnect (AWS | Azure | GCP) and you get an Apache Spark error message. Error: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; T...

1 min reading time
Updated March 4th, 2022 by arjun.kaimaparambilrajan

Enable retries in init script

Init scripts are commonly used to configure Databricks clusters. There are some scenarios where you may want to implement retries in an init script. Example init script: This sample init script shows you how to implement a retry for a basic copy operation. You can use this sample code as a base for implementing retries in your own init script. %scala...
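The article's sample is Scala; as a language-neutral illustration of the same idea (the function name, attempt count, and delay below are assumptions, not part of the article), a retry around a basic copy operation can be sketched in Python:

```python
import shutil
import time

def copy_with_retries(src, dst, attempts=3, delay=1.0):
    """Try a basic copy operation up to `attempts` times, sleeping
    `delay` seconds between failed attempts. The copy operation and
    the parameter values are illustrative placeholders."""
    last_error = None
    for _ in range(attempts):
        try:
            return shutil.copy(src, dst)
        except OSError as err:
            last_error = err
            time.sleep(delay)
    # All attempts failed: surface the last error to the caller.
    raise last_error
```

The same shape applies to any flaky operation: loop a bounded number of times, sleep between failures, and re-raise the final error so the script still fails loudly when retries are exhausted.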

0 min reading time
Updated May 19th, 2022 by arjun.kaimaparambilrajan

Python REPL fails to start in Docker

Problem: When you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs. 20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7 java.io.IOException: Cannot run program "/local_disk0/...

1 min reading time
Updated February 23rd, 2023 by arjun.kaimaparambilrajan

Jobs failing with shuffle fetch failures

Problem: You are seeing intermittent Apache Spark job failures on jobs that perform shuffle fetches. 21/02/01 05:59:55 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, 10.79.1.45, executor 0): FetchFailed(BlockManagerId(1, 10.79.1.134, 4048, None), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Failed to conne...

1 min reading time
Updated May 19th, 2022 by arjun.kaimaparambilrajan

Use the HDFS API to read files in Python

There may be times when you want to read files directly, without using third-party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts. AWS: Use the following example code for S3 bucket storage. %python URI = sc._gateway.jvm.java.net.URI Path = sc._gateway.jvm.org.apa...
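The excerpt's code is truncated; as a sketch of the general py4j-gateway pattern it begins with (the bucket name and the function wrapper are assumptions, and this needs a live SparkContext `sc` to actually run):

```python
def list_s3_path(sc, path="s3a://my-bucket/"):
    """List files at `path` through the Hadoop FileSystem API,
    reached via the Spark JVM gateway. `sc` is a live SparkContext;
    the bucket name is a placeholder."""
    jvm = sc._gateway.jvm
    URI = jvm.java.net.URI
    Path = jvm.org.apache.hadoop.fs.Path
    FileSystem = jvm.org.apache.hadoop.fs.FileSystem
    Configuration = jvm.org.apache.hadoop.conf.Configuration

    # Resolve the filesystem for the URI, then enumerate its entries.
    fs = FileSystem.get(URI(path), Configuration())
    return [status.getPath().toString() for status in fs.listStatus(Path(path))]
```

Because everything goes through `sc._gateway.jvm`, no extra Python packages are needed, only the Hadoop classes already on the cluster's classpath.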

1 min reading time
Updated March 4th, 2022 by arjun.kaimaparambilrajan

Set Apache Hadoop core-site.xml properties

You have a scenario that requires Apache Hadoop properties to be set. You would normally do this in the core-site.xml file. In this article, we explain how you can set core-site.xml in a cluster. Create the core-site.xml file in DBFS: You need to create a core-site.xml file and save it to DBFS on your cluster. An easy way to create this file is via a...
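For reference, a minimal core-site.xml has the following shape (the property shown is only an illustrative example, not a recommendation from the article):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Each setting is a name/value property pair; this one is an example. -->
  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>
</configuration>
```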

1 min reading time
Updated May 16th, 2022 by arjun.kaimaparambilrajan

PyPMML fails with Could not find py4j jar error

Problem: PyPMML is a Python PMML scoring library. After installing PyPMML in a Databricks cluster, it fails with a Py4JError: Could not find py4j jar error. %python from pypmml import Model modelb = Model.fromFile('/dbfs/shyam/DecisionTreeIris.pmml') Error: Py4JError: Could not find py4j jar at Cause: This error occurs due to a dependency on the defa...

1 min reading time
Updated May 16th, 2022 by arjun.kaimaparambilrajan

Python commands fail on Machine Learning clusters

Problem: You are using a Databricks Runtime for Machine Learning cluster and Python notebooks are failing. You find an invalid syntax error in the logs. SyntaxError: invalid syntax   File "/local_disk0/tmp/1593092990800-0/PythonShell.py", line 363     def __init__(self, *args, condaMagicHandler=None, **kwargs): Cause: Key values in the /etc/environmen...

0 min reading time
Updated March 4th, 2022 by arjun.kaimaparambilrajan

Cluster slowdown due to Ganglia metrics filling root partition

Note: This article applies to Databricks Runtime 7.3 LTS and below. Problem: Clusters start slowing down and may show a combination of the following symptoms. Unhealthy cluster events are reported: Request timed out. Driver is temporarily unavailable. Metastore is down. DBFS is down. You do not see any high GC events or memory utilization associated w...

1 min reading time
Updated September 19th, 2023 by arjun.kaimaparambilrajan

AWS services fail with No region provided error

Problem: Your code snippets that use AWS services fail with a java.lang.IllegalArgumentException: No region provided error in Databricks Runtime 7.0 and above. The same code worked in Databricks Runtime 6.6 and below. You can verify the issue by running the example code snippet in a notebook. In Databricks Runtime 7.0 and above, it will return the ex...

0 min reading time
Updated February 10th, 2023 by arjun.kaimaparambilrajan

Replay Apache Spark events in a cluster

The Spark UI is commonly used as a debugging tool for Spark jobs. If the Spark UI is inaccessible, you can load the event logs in another cluster and use the Event Log Replay notebook to replay the Spark events. Warning: Cluster log delivery is not enabled by default. You must enable cluster log delivery before starting your cluster, otherwise there ...

1 min reading time
Updated February 29th, 2024 by arjun.kaimaparambilrajan

How to import a custom CA certificate

When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...
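One common approach, sketched here as an assumption rather than necessarily the article's full procedure (the bundle path is a placeholder), is to point Python's HTTP stacks at the custom CA bundle via environment variables:

```python
import os

# Path to the custom CA bundle -- a placeholder; use your certificate's path.
ca_bundle = "/dbfs/certs/myca.pem"

# requests honors REQUESTS_CA_BUNDLE; the ssl module and urllib honor SSL_CERT_FILE.
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
os.environ["SSL_CERT_FILE"] = ca_bundle
```

Setting these before any HTTPS request is made lets libraries that respect them validate the server against the custom certificate instead of only the system defaults.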

1 min reading time