Databricks Knowledge Base


Python with Apache Spark (Azure)

These articles can help you use Python with Apache Spark.

19 Articles in this category


AttributeError: ‘function’ object has no attribute

Problem You are selecting columns from a DataFrame and you get an error message. ERROR: AttributeError: 'function' object has no attribute '_get_object_id' in job Cause The DataFrame API contains a small number of protected keywords. If a column in your DataFrame uses a protected keyword as the column name, you will get an error message. For example...

Last updated: May 19th, 2022 by noopur.nigam
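The collision described above can be illustrated without a Spark cluster. The sketch below uses pandas, which shares the same attribute-vs-column gotcha; in PySpark the equivalent fix is `df["count"]` or `pyspark.sql.functions.col("count")` instead of `df.count`:

```python
import pandas as pd

# "count" collides with the DataFrame's count() method.
# pandas is used here for illustration; PySpark has the same collision.
df = pd.DataFrame({"count": [1, 2, 3]})

# Attribute access resolves to the method, not the column:
print(callable(df.count))    # True

# Bracket notation reaches the column; in PySpark, df["count"] or
# col("count") works the same way.
print(list(df["count"]))     # [1, 2, 3]
```

Renaming the column to a non-reserved name also avoids the problem entirely.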

Convert Python datetime object to string

There are multiple ways to display date and time values with Python, however not all of them are easy to read. For example, when you collect a timestamp column from a DataFrame and save it as a Python variable, the value is stored as a datetime object. If you are not familiar with the datetime object format, it is not as easy to read as the common Y...

Last updated: May 19th, 2022 by Adam Pavlacka
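A minimal sketch of one common conversion, using `strftime` with an assumed format string (the timestamp value is a stand-in for one collected from a DataFrame):

```python
from datetime import datetime

# Stand-in for a value collected from a DataFrame timestamp column
ts = datetime(2022, 5, 19, 14, 30, 15)

# strftime renders the datetime object with an explicit format string
readable = ts.strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2022-05-19 14:30:15

# isoformat() is a zero-configuration alternative
print(ts.isoformat())  # 2022-05-19T14:30:15
```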

Create a cluster with Conda

Conda is a popular open source package management system for the Anaconda repo. Databricks Runtime for Machine Learning (Databricks Runtime ML) uses Conda to manage Python library dependencies. If you want to use Conda, you should use Databricks Runtime ML. Attempting to install Anaconda or Conda for use with Databricks Runtime is not supported. Fol...

Last updated: May 19th, 2022 by Adam Pavlacka

Display file and directory timestamp details

In this article we show you how to display detailed timestamps, including the date and time when a file was created or modified. Use ls command The simplest way to display file timestamps is to use the ls -lt <path> command in a bash shell. For example, this sample command displays basic timestamps for files and directories in the /dbfs/ folde...

Last updated: May 19th, 2022 by rakesh.parija
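Alongside `ls -lt`, the same details are available from Python's `os.stat`. A minimal sketch, using a temporary file as a stand-in for a `/dbfs/` path:

```python
import os
import tempfile
from datetime import datetime

# Temporary file as a stand-in for a path such as /dbfs/mnt/...
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"sample")
    path = f.name

st = os.stat(path)
modified = datetime.fromtimestamp(st.st_mtime)
print(f"{path}: {st.st_size} bytes, modified {modified:%Y-%m-%d %H:%M:%S}")
os.remove(path)
```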

Install and compile Cython

This document explains how to run Spark code with compiled Cython code. The steps are as follows: Create an example Cython module on DBFS (AWS | Azure). Add the file to the Spark session. Create a wrapper method to load the module on the executors. Run the mapper on a sample dataset. Generate a larger dataset and compare the performance with nat...

Last updated: May 19th, 2022 by Adam Pavlacka

Reading large DBFS-mounted files using Python APIs

This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs. Problem If you mount a folder onto dbfs:// and read a file larger than 2GB in a Python API like pandas, you will see the following error: /databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextRead...

Last updated: May 19th, 2022 by Adam Pavlacka
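One workaround pattern is to stream the file in fixed-size chunks with plain Python file APIs rather than loading it in a single call. A minimal sketch, using a small generated file as a stand-in for a large mounted file:

```python
import os
import tempfile

# Generate a sample file as a stand-in for a large DBFS-mounted file
path = os.path.join(tempfile.mkdtemp(), "big.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

# Stream fixed-size chunks instead of reading the whole file at once
chunk_size = 64 * 1024
total = 0
with open(path, "rb") as f:
    while chunk := f.read(chunk_size):
        total += len(chunk)

print(total)  # 1000000
```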

Use the HDFS API to read files in Python

There may be times when you want to read files directly without using third party libraries. This can be useful for reading small files when your regular storage blobs and buckets are not available as local DBFS mounts. AWS Use the following example code for S3 bucket storage. %python URI = sc._gateway.jvm.java.net.URI Path = sc._gateway.jvm.org.apa...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

How to import a custom CA certificate

When working with Python, you may want to import a custom CA certificate to avoid connection errors to your endpoints. ConnectionError: HTTPSConnectionPool(host='my_server_endpoint', port=443): Max retries exceeded with url: /endpoint (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb73dc3b3d0>: Failed t...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan
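One common approach is to point Python HTTP clients at the custom bundle via environment variables. A minimal sketch, with a hypothetical certificate path:

```python
import os

# Hypothetical path to a bundle containing your custom CA certificate
ca_bundle = "/dbfs/certs/custom-ca.pem"

# requests honors REQUESTS_CA_BUNDLE; many other clients honor SSL_CERT_FILE
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
os.environ["SSL_CERT_FILE"] = ca_bundle
```

Set these before the client library is first used so the bundle is picked up.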

Job remains idle before starting

Problem You have an Apache Spark job that is triggered correctly, but remains idle for a long time before starting. You have a Spark job that ran well for a while, but goes idle for a long time before resuming. Symptoms include: Cluster downscales to the minimum number of worker nodes during idle time. Driver logs don’t show any Spark jobs during idl...

Last updated: May 19th, 2022 by ashish

List all workspace objects

You can use the Databricks Workspace API (AWS | Azure | GCP) to recursively list all workspace objects under a given path. Common use cases for this include: Indexing all notebook names and types for all users in your workspace. Use the output, in conjunction with other API calls, to delete unused workspaces or to manage notebooks. Dynamically get t...

Last updated: May 19th, 2022 by Adam Pavlacka
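A sketch of the recursive traversal, using a hypothetical in-memory stand-in for `GET /api/2.0/workspace/list` responses so it runs without a workspace; a real implementation would fetch each path with an HTTP client and an access token:

```python
# In-memory stand-in for responses from GET /api/2.0/workspace/list
FAKE_LISTINGS = {
    "/": [{"path": "/Users", "object_type": "DIRECTORY"}],
    "/Users": [
        {"path": "/Users/alice", "object_type": "DIRECTORY"},
        {"path": "/Users/shared_job", "object_type": "NOTEBOOK"},
    ],
    "/Users/alice": [{"path": "/Users/alice/etl", "object_type": "NOTEBOOK"}],
}

def list_objects(path, listings=FAKE_LISTINGS):
    """Recursively collect every non-directory object under `path`."""
    found = []
    for obj in listings.get(path, []):
        if obj["object_type"] == "DIRECTORY":
            found.extend(list_objects(obj["path"], listings))
        else:
            found.append(obj["path"])
    return found

print(list_objects("/"))  # ['/Users/alice/etl', '/Users/shared_job']
```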

Load special characters with Spark-XML

Problem You have special characters in your source files and are using the OSS library Spark-XML. The special characters do not render correctly. For example, “CLU®” is rendered as “CLU�”. Cause Spark-XML supports the UTF-8 character set by default. You are using a different character set in your XML files. Solution You must specify the character se...

Last updated: May 19th, 2022 by annapurna.hiriyur
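The root cause can be demonstrated in plain Python: bytes written in one character set but decoded as UTF-8 yield the replacement character, exactly as in the rendering above:

```python
# "CLU®" written in ISO-8859-1, then (incorrectly) decoded as UTF-8
raw = "CLU®".encode("iso-8859-1")

print(raw.decode("utf-8", errors="replace"))  # CLU�  (replacement character)
print(raw.decode("iso-8859-1"))               # CLU®  (matching charset)
```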

Python commands fail on high concurrency clusters

Problem You are attempting to run Python commands on a high concurrency cluster. All Python commands fail with a WARN error message. WARN PythonDriverWrapper: Failed to start repl ReplId-61bef-9fc33-1f8f6-2 ExitCodeException exitCode=1: chown: invalid user: ‘spark-9fcdf4d2-045d-4f3b-9293-0f’ Cause Both spark.databricks.pyspark.enableProcessIsolation...

Last updated: May 19th, 2022 by xin.wang

Cluster cancels Python command execution after installing Bokeh

Problem The cluster returns Cancelled in a Python notebook. Inspect the driver log (std.err) in the Cluster Configuration page for a stack trace and error message similar to the following: log4j:WARN No appenders could be found for logger (com.databricks.conf.trusted.ProjectConf$). log4j:WARN Please initialize the log4j system properly. log4j:WARN S...

Last updated: May 19th, 2022 by Adam Pavlacka

Cluster cancels Python command execution due to library conflict

Problem The cluster returns Cancelled in a Python notebook. Notebooks in all other languages execute successfully on the same cluster. Cause When you install a conflicting version of a library, such as ipython, ipywidgets, numpy, scipy, or pandas to the PYTHONPATH, then the Python REPL can break, causing all commands to return Cancelled after 30 sec...

Last updated: May 19th, 2022 by Adam Pavlacka

Python command execution fails with AttributeError

This article can help you resolve scenarios in which Python command execution fails with an AttributeError. Problem: 'tuple' object has no attribute 'type' When you run a notebook, Python command execution fails with the following error and stack trace: AttributeError: 'tuple' object has no attribute 'type' Traceback (most recent call last): File "/...

Last updated: May 19th, 2022 by Adam Pavlacka

Python REPL fails to start in Docker

Problem When you use a Docker container that includes prebuilt Python libraries, Python commands fail and the virtual environment is not created. The following error message is visible in the driver logs. 20/02/29 16:38:35 WARN PythonDriverWrapper: Failed to start repl ReplId-5b591-0ce42-78ef3-7 java.io.IOException: Cannot run program "/local_disk0/...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

How to run SQL queries from Python scripts

You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. You should have PyHive installed on the machine where you are running the Python script. Info Pytho...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan
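PyHive follows the Python DB-API 2.0 interface, so the query pattern looks like the sketch below, which uses the standard-library `sqlite3` module as a stand-in so it runs without a cluster; with PyHive you would open the connection against the cluster's JDBC/Thrift endpoint instead:

```python
import sqlite3

# sqlite3 stands in for a PyHive connection; both expose the DB-API 2.0
# connection / cursor / execute / fetchall pattern.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
cursor.execute("SELECT name FROM users ORDER BY id")
rows = cursor.fetchall()
print(rows)  # [('a',), ('b',)]
conn.close()
```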

Run C++ code in Python

Run C++ from Python example notebook Review the Run C++ from Python notebook to learn how to compile C++ code and run it on a cluster....

Last updated: May 19th, 2022 by Adam Pavlacka

Python 2 sunset status

Python.org officially moved Python 2 into EoL (end-of-life) status on January 1, 2020. What does this mean for you? Databricks Runtime 6.0 and above Databricks Runtime 6.0 and above support only Python 3. You cannot create a cluster with Python 2 using these runtimes. Any clusters created with these runtimes use Python 3 by definition. Databricks Ru...

Last updated: May 19th, 2022 by Adam Pavlacka


© Databricks 2022. All rights reserved. Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation.
