Databricks Help Center

Python with Apache Spark (AWS)

These articles can help you to use Python with Apache Spark.

43 Articles in this category

AttributeError: ‘function’ object has no attribute

Using protected keywords from the DataFrame API as column names results in a "'function' object has no attribute" error message....

Last updated: May 19th, 2022 by noopur.nigam

Convert Python datetime object to string

Display date and time values in a column, as a datetime object, and as a string....

Last updated: May 19th, 2022 by Adam Pavlacka
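As a minimal sketch of the conversion this entry describes, the standard library's datetime.strftime() formats a datetime object as a string. The sample timestamp and the format pattern below are illustrative assumptions, not taken from the article.

```python
from datetime import datetime

# Hypothetical sample value; any datetime object works the same way.
ts = datetime(2022, 5, 19, 14, 30, 0)

# strftime() renders the datetime using an explicit format pattern.
as_text = ts.strftime("%Y-%m-%d %H:%M:%S")

# isoformat() is a pattern-free alternative that yields ISO 8601 text.
iso_text = ts.isoformat()

print(as_text)   # 2022-05-19 14:30:00
print(iso_text)  # 2022-05-19T14:30:00
```

The explicit pattern is useful when a downstream system expects a fixed layout; isoformat() is convenient when any unambiguous representation will do.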

Display file and directory timestamp details

Display file creation date and modification date using Python....

Last updated: May 19th, 2022 by rakesh.parija
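One way to read file timestamps with plain Python is os.stat(), sketched below under a few assumptions: the helper name describe_timestamps is hypothetical, the temporary file stands in for a real path, and the linked article may use a different API. Note that st_ctime is the creation time on Windows but the last metadata change time on most Unix systems.

```python
import os
import tempfile
from datetime import datetime


def describe_timestamps(path):
    """Return modification and change/creation times for a path (hypothetical helper)."""
    info = os.stat(path)
    return {
        # Last time the file contents were modified.
        "modified": datetime.fromtimestamp(info.st_mtime),
        # Creation time on Windows; metadata change time on most Unix systems.
        "changed_or_created": datetime.fromtimestamp(info.st_ctime),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example")
    sample_path = handle.name

details = describe_timestamps(sample_path)
print(details)

# Clean up the temporary file.
os.remove(sample_path)
```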

Install and compile Cython

Learn how to install and compile Cython with Databricks....

Last updated: May 19th, 2022 by Adam Pavlacka

Reading large DBFS-mounted files using Python APIs

Learn how to resolve errors when reading large DBFS-mounted files using Python APIs....

Last updated: May 19th, 2022 by Adam Pavlacka

Use the HDFS API to read files in Python

Learn how to read files directly by using the HDFS API in Python....

Last updated: June 22nd, 2023 by arjun.kaimaparambilrajan

How to import a custom CA certificate

Learn how to import a custom CA certificate into your Databricks cluster for Python use....

Last updated: February 29th, 2024 by arjun.kaimaparambilrajan

Job remains idle before starting

Apache Spark jobs remain idle for a long time before starting....

Last updated: May 19th, 2022 by ashish

List all workspace objects

List all Databricks workspace objects under a given path....

Last updated: May 19th, 2022 by Adam Pavlacka

Load special characters with Spark-XML

Special characters are not rendering correctly. Use charset with Spark-XML....

Last updated: May 19th, 2022 by annapurna.hiriyur

Python commands fail on high concurrency clusters

Python commands fail on high concurrency clusters with Apache Spark process isolation and shared session enabled. WARN error message....

Last updated: May 19th, 2022 by xin.wang

Cluster cancels Python command execution after installing Bokeh

Learn what to do when your Databricks cluster cancels Python command execution after you install Bokeh....

Last updated: May 19th, 2022 by Adam Pavlacka

Cluster cancels Python command execution due to library conflict

Learn what to do when your Databricks cluster cancels Python command execution due to a library conflict....

Last updated: May 19th, 2022 by Adam Pavlacka

Python command execution fails with AttributeError

Learn what to do when a Python command in your Databricks notebook fails with AttributeError....

Last updated: May 19th, 2022 by Adam Pavlacka

Python REPL fails to start in Docker

Learn how to fix a Python virtualenv error that prevents REPL from starting in a Docker container...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

How to run SQL queries from Python scripts

Learn how to run SQL queries using Python scripts....

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

Run C++ code in Python

Learn how to run C++ code in Python....

Last updated: May 19th, 2022 by Adam Pavlacka

Python 2 sunset status

Learn about the sunset status of Python 2 in Databricks....

Last updated: May 19th, 2022 by Adam Pavlacka

Job fails with Java IndexOutOfBoundsException error

When groupby() is used along with applyInPandas() it generates an exception due to an Apache Arrow buffer limitation....

Last updated: December 21st, 2022 by rakesh.parija

Job fails with NoSuchElementException error

NoSuchElementException errors can occur when using Apache Arrow....

Last updated: March 3rd, 2023 by ashish

Job fails with IndexOutOfBoundsException and ArrowBuf errors

When groupBy() is used with applyInPandas() it can result in Apache Arrow buffer size estimation errors....

Last updated: March 3rd, 2023 by ashish

Field name sorting changes in Apache Spark 3.x

Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically....

Last updated: April 21st, 2023 by sergios.lalas

Job fails with "not enough memory to build the hash map" error

You should use adaptive query execution instead of explicit broadcast hints to perform joins on Databricks Runtime 11.3 LTS and above....

Last updated: May 12th, 2023 by saritha.shivakumar

Create a DataFrame from a JSON string or Python dictionary

Create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary....

Last updated: October 9th, 2024 by ram.sankarasubramanian
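A hedged sketch of one possible approach: serialize each dictionary to a JSON string with the standard json module, then hand the strings to spark.read.json(), which accepts an RDD of JSON strings. The function names to_json_lines and dataframe_from_records are hypothetical, and the sketch assumes an existing SparkSession with a usable sparkContext (which is not available on every cluster access mode). The article itself may take a different route.

```python
import json


def to_json_lines(records):
    """Serialize each dictionary to a single-line JSON string (hypothetical helper)."""
    return [json.dumps(record) for record in records]


def dataframe_from_records(spark, records):
    """Build a DataFrame from dictionaries via JSON strings.

    Assumes `spark` is an existing SparkSession whose sparkContext is accessible.
    """
    json_rdd = spark.sparkContext.parallelize(to_json_lines(records))
    # DataFrameReader.json() accepts an RDD of JSON strings and infers the schema.
    return spark.read.json(json_rdd)


# The serialization step can be checked without a cluster.
print(to_json_lines([{"id": 1, "name": "example"}]))
```

In a notebook where a session named spark already exists, the call would look like dataframe_from_records(spark, [{"id": 1, "name": "example"}]).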

Apache Spark driver stops and restarts while reading JSON file data in an S3 bucket

Use a predefined schema to read the files, or read folder contents as text instead. ...

Last updated: December 24th, 2024 by G Yashwanth Kiran

Behavioral changes for the CHAR data type on Serverless

Pad your reads with spaces to match the declared length of the CHAR field or set the legacy charVarcharAsString config to true....

Last updated: October 18th, 2024 by shanmugavel.chandrakasu

Column value errors when connecting from Apache Spark to Databricks using Spark JDBC

Use overriding quote identifiers in the JdbcDialect class and register them under JDBCDialects in Java or Python....

Last updated: April 9th, 2025 by swetha.nandajan

Trying to decode a protocol buffer and getting error [PROTOBUF_DEPENDENCY_NOT_FOUND]

Use the option --include_imports while creating the protobuf descriptor file, and then use this descriptor file in the from_protobuf() function....

Last updated: November 4th, 2024 by saikrishna.pujari

Error java.io.FileNotFoundException when job attempts to read or write intermediary files

Cache the DataFrame before performing write operations and ensure you are using a compatible version of the com.crealytics.spark.excel library. ...

Last updated: November 14th, 2024 by John Benninghoff

Using PySpark testing library assertDataFrameEqual throws OutOfMemoryError

Verify schemas are equivalent, then either ensure sufficient driver memory or compare a subset of DataFrames. ...

Last updated: November 17th, 2024 by brock.baurer

Expensive transformation on DataFrame is recalculated even when cached

Understand how Apache Spark DataFrame caching works....

Last updated: December 6th, 2024 by jayant.sharma

Error java.lang.UnsupportedOperationException when trying to read datetime data files

Set spark.sql.legacy.parquet.datetimeRebaseModeInRead to LEGACY. ...

Last updated: December 23rd, 2024 by Vidhi Khaitan

Apache Spark PySpark job using a Python threading API function taking hours instead of minutes

Use the Databricks Spark connector and ensure your cluster configuration is optimized for the workload....

Last updated: January 10th, 2025 by John Benninghoff

PySparkValueError when working with UDFs in Apache Spark

Ensure that the Python UDF output matches the schema defined in the source code. ...

Last updated: January 16th, 2025 by raphael.balogo

Unable to parallelize the code using the apply API from Pandas on PySpark

Directly use the apply function from pyspark.pandas without wrapping it in a lambda function....

Last updated: January 29th, 2025 by Amruth Ashoka

Runtimes increase when using .loc() and assignment(=) operations

Use vectorized operations instead....

Last updated: March 11th, 2025 by vinay.mr

Unable to get Apache Spark SparkEnv settings via PySpark

To get the same output using PySpark, broadcast the “test” value to the executors so you can perform the map operation on the executors....

Last updated: March 18th, 2025 by Vidhi Khaitan

RESOURCES_EXHAUSTED error message when trying to perform self-joins with Spark Connect

Increase the max message size using the spark.sql.session.localRelationCacheThreshold config or use temporary views. ...

Last updated: March 18th, 2025 by Lucas Ribeiro

DataFrame in an interactive cluster still showing cached data after calling unpersist() function

Use unpersist(blocking=True) to ensure unpersist() is performed before proceeding with further actions....

Last updated: March 21st, 2025 by MuthuLakshmi.AN

Apache Spark Submit job clusters do not terminate after sc.stop()

Explicitly invoke System.exit(0) after SparkContext.stop()....

Last updated: March 28th, 2025 by Vidhi Khaitan

Error PySparkNotImplementedError when using an RDD to extract distinct values on a standard cluster

Use .collect() and list comprehension to extract distinct column values....

Last updated: April 14th, 2025 by anshuman.sahu
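The suggestion above can be sketched as follows. This is an illustration under assumptions, not the article's exact code: the helper name distinct_values is hypothetical, and df is assumed to be a PySpark DataFrame. Because collect() brings every distinct value to the driver, it only suits columns with a modest number of distinct values.

```python
def distinct_values(df, column):
    """Return the distinct values of one column as a Python list (hypothetical helper).

    Assumes `df` is a PySpark DataFrame; uses only DataFrame methods, no RDD API.
    """
    # select() narrows to one column, distinct() de-duplicates, collect() returns Row objects.
    rows = df.select(column).distinct().collect()
    # Row objects support lookup by field name, so a list comprehension extracts the values.
    return [row[column] for row in rows]
```

For a DataFrame with a column named country, distinct_values(df, "country") would return a plain list of the distinct country values, in no guaranteed order.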

Use snappy and zstd compression types in a Delta table without rewriting entire table

Test your compression type, generate and insert sample records using zstd, then write the zstd files to your Delta table....

Last updated: April 16th, 2025 by chandan.kumar

Using collect_list after transformations such as JOIN returns inconsistent counts even though the underlying data doesn’t change

Sort the collect_list output using array_sort before performing joins....

Last updated: April 29th, 2025 by manikandan.ganesan

© Databricks 2022-2025. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
