Getting distinct values in PySpark SQL. Calling df.select("col").distinct().collect() works, but pulling every unique value back to the driver can be slow on large data; the notes below cover the main options and their trade-offs.


PySpark SQL provides a DataFrame API for manipulating data in a distributed and fault-tolerant manner, and its functions module (pyspark.sql.functions) covers most de-duplication needs. To fetch the distinct values of a column and then perform some transformation on top of them, first select() the column (or columns) you want and then apply distinct(); the result is a new DataFrame you can keep working with. To eyeball the result, df.select("colname").distinct().show(100, False) prints up to 100 distinct values of the column. For counting rather than listing, countDistinct() returns the number of unique values, and it combines naturally with groupBy().agg(). When you need the unique values per group as an array, collect_set() de-dupes the data and returns unique values, whereas collect_list() returns the values as-is without eliminating duplicates. All of these functions return pyspark.sql.Column objects, so they compose with the rest of the DataFrame API.
A few practical notes. When you write from pyspark.sql.functions import max, the max you get is the PySpark SQL library function, not Python's built-in max, so prefer importing the module under an alias (import pyspark.sql.functions as F) to avoid shadowing. countDistinct() accepts several columns at once, e.g. df.agg(F.countDistinct("a", "b", "c")), which counts distinct combinations of those columns. When an exact answer is not required, approx_count_distinct(col, rsd=None) returns an estimate of the distinct count and is far cheaper on very large columns; rsd is the maximum allowed relative standard deviation of the estimate. For array (ArrayType) columns, array_distinct() removes duplicate elements inside each array and returns a new array column; note that the order of elements in the output may differ from the order in the input. Finally, dropDuplicates() accepts a subset of columns and returns a new DataFrame with only the distinct rows based on the specified column(s).
distinct() on a DataFrame returns a new DataFrame containing only the distinct rows, i.e. rows that are unique across all columns, while dropDuplicates() lets you restrict the comparison to selected columns; the latter is more concise when only some columns define identity. To see the unique values of a single column, edf.select("x").distinct().show() displays the distinct values present in column x of the edf DataFrame. Related aggregates follow the same pattern: sum_distinct(col) returns the sum of the distinct values in the expression, and GroupedData.pivot(pivot_col, values=None) comes in two versions, one where the caller specifies the list of distinct values to pivot on (faster, since Spark can skip computing them) and one where Spark determines the distinct values itself. Null handling deserves attention: set-oriented functions such as array_union treat null as a distinct element, so arrays containing nulls carry them into the result, whereas countDistinct silently ignores nulls.
A common question: is there an efficient way to count the distinct values in every column of a DataFrame? The describe() method provides only the row count, not the distinct count, and looping over columns with df.select(c).distinct().count() launches one Spark job per column. The efficient approach is a single agg() that applies countDistinct() to every column at once.
Understanding the differences between distinct() and dropDuplicates() allows you to choose the right method for your use case: distinct() always compares whole rows, dropDuplicates() can compare a subset of columns. Two null-related behaviors are easy to trip over. First, countDistinct() ignores null values entirely; this often looks like a bug but is normal, since SQL's COUNT(DISTINCT ...) behaves the same way. If nulls should count as a value, use df.select(col).distinct().count() instead, which treats null as its own row. Second, max() likewise ignores null/None values in the column when computing the maximum. For completeness, any_value(col, ignoreNulls=None) returns some (arbitrary) value of the column for each group of rows.
To find the unique items in one column for each group of another column, combine groupBy() with collect_set(); collect_list() is the companion that keeps duplicates. In PySpark there are two ways to get the count of distinct values: chain distinct() and count() on the DataFrame, or use the countDistinct() aggregate inside agg(); both give the same answer. To check distinct values of multiple columns together, pass several columns to select() and apply distinct() to the result. One subtlety worth noting: count() as a column function counts only the non-null values in a column, unlike DataFrame.count(), which counts rows.
Distinct counts inside window functions are a known limitation: a countDistinct over a window raises AnalysisException: 'Distinct window functions are not supported: count(distinct color#1926)'. The standard workaround is size(collect_set(col).over(window)), since collect_set is allowed over a window and already de-dupes; approx_count_distinct also works over a window when an estimate is acceptable. DISTINCT and COLLECT_SET are related but different tools: DISTINCT de-duplicates the rows of a result set, while COLLECT_SET gathers the unique values of a column into an array during aggregation. Separately, if you want not just the distinct values but how many times each occurs (the equivalent of pandas value_counts), use df.groupBy(col).count().
An alternative to dropping duplicate rows outright is to count() the occurrences of each value and, where a value occurs more than once, keep only the first. collect_set() follows the same idea at the aggregate level: it operates on a column and collects all of its distinct values into an array-like structure, removing duplicates in the process. On string concatenation, pyspark.sql.functions provides both concat() and concat_ws() (concat with separator) for joining multiple columns into a single column; concat_ws() takes the separator as its first argument, concat() does not. A common real-world task ties these ideas together: aggregate students by year and count the total number of students per year while avoiding repetition of IDs.
The naive version, gr = df.groupby(['Year']) followed by gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year')), counts rows rather than students, so repeated IDs make the result wrong and inflated; swapping count for countDistinct fixes it. The same caution applies when collecting values: F.collect_list('values') keeps the duplicates (you get WrappedArray-style arrays with repeats), whereas at the RDD level rdd.distinct() returns only the unique elements.
The aggregate function family includes approx_count_distinct, avg, collect_list, collect_set, countDistinct, count, grouping, first, last, kurtosis, max, min, mean, skewness, stddev, and stddev_samp. To union two DataFrames and keep only the distinct rows, chain df1.union(df2).distinct(). If you are more comfortable with SQL, PySpark's SQL module lets you filter duplicates using familiar syntax: register the DataFrame as a temporary view, then use DISTINCT, or ROW_NUMBER() over a partition when you want to keep exactly one row per duplicate group. One more null note: if the input arrays contain null values, array functions such as array_distinct include null in the resulting array (a single null is retained).
Coming from an R/pandas background, a few PySpark idioms are worth internalizing. Whenever feasible, use the built-in functions and window functions rather than UDFs: built-ins offer better null handling and usually much better performance, so using a UDF for big data is slow and inefficient when a built-in exists. countDistinct() is an alias of count_distinct(), and using count_distinct() directly is encouraged in newer code. To select elements from each array by position, posexplode() creates a column representing the index of each element alongside the value. And when the collection of strings is already a small iterable on the driver, you can simply pass it to collections.Counter, which exists for the express purpose of counting distinct values.
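A sketch of the driver-side Counter approach, appropriate only after the values have been collected to a modest-sized Python list:

```python
from collections import Counter

# Pretend these strings were already collected from a DataFrame column;
# plain Python can now count distinct occurrences without another Spark job
values = ["a", "b", "a", "c", "a", "b"]

counts = Counter(values)
n_distinct = len(counts)

print(n_distinct)   # 3
print(counts["a"])  # 3
```

This is the right tool for small local collections; for data that still lives in Spark, stick with countDistinct() or groupBy().count().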
To inspect unique values column by column, you can loop: for c in df.columns: df.select(c).distinct().show(). This is simple but runs one job per column, so a single agg() applying countDistinct() to each column scales better. Another workable but clumsy approach is to check for duplicate values in a column and delete them by position/index; in practice dropDuplicates() expresses the same intent more directly. PySpark groups its aggregate functions under "agg_funcs", and the essential ones for de-duplication work are collect_list, collect_set, array_distinct, explode, pivot, and stack.
To do all of this in plain SQL: set up a SparkSession, read your file into a DataFrame, register the DataFrame as a temp view, query it directly using SQL syntax, then save the results as objects or output them to files. Keep the null caveat in mind here too: if nulls should be counted among the distinct values, count distinct rows rather than using count_distinct(), which skips them.