PySpark Interview questions and answers
Hello everyone, today we are providing PySpark interview questions and answers for all our web portal viewers. Read the complete article and succeed in your interview round by preparing with these job interview questions.
PySpark Interview Questions and Answers for Freshers
- What is PySpark?
PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing.
If you are already familiar with Python and libraries like pandas, then PySpark is a good language to learn for creating more efficient analyses and pipelines.
- What are the characteristics of PySpark?
The characteristics of PySpark are:
- It offers low latency
- Its compatibility makes it a preferable framework for processing data.
- It provides powerful caching
- What is YARN?
YARN (Yet Another Resource Negotiator) is a cluster-level operating system. It is a generic resource management framework for distributed workloads. A lot of varied compute frameworks are supported by YARN.
- Difference between PySpark and MapReduce
The difference between PySpark and MapReduce is that PySpark retains data in memory for subsequent steps, whereas MapReduce processes data on disk. For in-memory workloads, PySpark's data processing can be up to 100 times faster than MapReduce.
- Compare Hadoop and PySpark
The comparison between Hadoop and PySpark is given below:
- Hadoop is designed to handle batch processing efficiently, whereas PySpark is designed to handle real-time data efficiently.
- Hadoop is a high-latency computing framework, while PySpark is a low-latency computing framework.
- Hadoop does not have an interactive mode, whereas PySpark works interactively.
So PySpark would be better than Hadoop in most cases.
- What are the languages supported by PySpark and which one is more famous among these?
The languages supported by PySpark are Scala, Python, Java and R. Amongst these, Scala and Python have interactive shells for PySpark, and Python is the most popular language and the one mostly used with PySpark.
- Explain the concept of inheritance in Python
Inheritance refers to one class acquiring the properties of another class. It helps us to reuse code and establish a relationship between different classes. Mainly two types of classes are involved:
Parent class
It's the class whose properties are inherited.
Child class
A class which inherits the properties of the parent class.
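For example (a minimal sketch with hypothetical `Vehicle`/`Car` class names), a child class reuses the parent's code:

```python
class Vehicle:                      # parent class: its properties are inherited
    def __init__(self, wheels):
        self.wheels = wheels

    def describe(self):
        return f"vehicle with {self.wheels} wheels"


class Car(Vehicle):                 # child class: inherits from Vehicle
    def __init__(self):
        super().__init__(wheels=4)  # reuse the parent's initialiser


car = Car()
```

`Car` gets `describe` for free from `Vehicle`, which is exactly the code reuse inheritance provides.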
- What are the various algorithms supported in PySpark?
The algorithms supported in PySpark are provided by the spark.mllib package and include:
- mllib.classification
- mllib.clustering
- mllib.fpm (frequent pattern mining)
- mllib.linalg (linear algebra)
- mllib.recommendation
- mllib.regression
- Explain the purpose of serialisations in PySpark
PySpark supports custom serializers to transfer data and improve performance. There are two types of serializer that PySpark uses. These are:
MarshalSerializer
It supports only fewer data types, but it is faster than the PickleSerializer.
PickleSerializer
It is used for serialising objects by default. It supports nearly any Python object, but at a slower speed.
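The trade-off mirrors Python's own marshal and pickle modules, which back PySpark's MarshalSerializer and PickleSerializer. A small sketch of the difference:

```python
import marshal
import pickle

# marshal: fast, but limited to simple built-in types.
data = {"counts": [1, 2, 3]}
assert marshal.loads(marshal.dumps(data)) == data


# pickle: slower, but handles almost any Python object,
# including instances of user-defined classes.
class Point:
    def __init__(self, x):
        self.x = x


p = pickle.loads(pickle.dumps(Point(5)))
```

In PySpark the serializer is chosen when the context is created, e.g. `SparkContext(serializer=MarshalSerializer())`.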
- What is PySpark storage level?
The storage level controls the storage of an RDD. It tells Spark how to store the RDD: in memory, on disk, or sometimes both. It also controls whether to serialize the RDD partitions and whether to replicate them.
- What is PySpark SparkContext?
It is treated as the initial point for entering and using any Spark functionality. The SparkContext uses the Py4J library to launch the JVM, and then creates the JavaSparkContext. In the PySpark shell, the SparkContext is available as 'sc' by default.
- What are PySpark SparkFiles?
SparkFiles is the mechanism used to load our files onto the Apache Spark application. Files are added through the SparkContext (via sc.addFile), and SparkFiles.get can then be used to get the path of a loaded file.
- Explain spark execution engine
Apache Spark is a DAG (directed acyclic graph) execution engine which enables users to analyse massive data sets with high performance. To achieve this, Spark first holds data in memory, which improves performance drastically.
- What do you mean by SparkConf in PySpark?
SparkConf helps in setting a few parameters and configurations to run a Spark application locally or on a cluster. In other words, you can say that it provides the configurations to run a Spark application.
- Name a few attributes of SparkConf.
A few attributes of SparkConf are given below:
set(key, value)
This attribute is used to set a configuration property.
setSparkHome(value)
This attribute helps in setting the Spark installation path on worker nodes.
setAppName(value)
This attribute assists in setting the application name.
setMaster(value)
This attribute assists in setting the master URL.
get(key, defaultValue=None)
This attribute supports you in getting the configuration value of a key.
- What is RDD in PySpark?
In PySpark, RDD stands for Resilient Distributed Dataset. It is the core data structure of PySpark: a low-level object which is highly efficient at performing distributed tasks. RDDs are immutable elements; once you create an RDD, you cannot change it.
- What are the advantages of PySpark?
The advantages of PySpark are:
- PySpark is an easy language to learn. It can be learned and implemented easily if you know Python and Apache Spark. PySpark is easy to use: it lets you write parallelized code in a simple way.
- Error handling is simple in the PySpark framework. You can easily handle errors and much more.
- What are the disadvantages of PySpark?
The disadvantages of PySpark are:
Sometimes it becomes difficult to express problems, as PySpark is based on Hadoop's MapReduce model.
As Apache Spark was originally written in Scala, PySpark programs are not as efficient as their Scala counterparts.
As the nodes are abstracted in PySpark, and it uses an abstracted network, it cannot be used to modify the internal functioning of Spark. Scala is preferred in this case.
- Why are partitions immutable in PySpark?
Every transformation generates a new partition in PySpark. These partitions are aware of data locality. Immutable partitions are created using the HDFS API.
- What do you understand about data cleaning?
The process of preparing data by analysing the data and removing or modifying data if it is incorrect, incomplete, irrelevant, duplicated, or improperly formatted is called data cleaning.
- What is Spark core?
The general execution engine for the Spark platform, including all of its functionalities, is known as Spark Core. It offers in-memory computing capabilities in order to deliver good speed.
- What are the key functions of Spark core?
The key functions of Spark Core are:
- Perform all the basic I/O functions
- Monitoring jobs
- Job scheduling
- Memory management
- Interaction with storage systems
- What is PySpark array type?
A collection data type that extends PySpark's DataType class, which is the superclass for all types, is called PySpark ArrayType. The ArrayType accepts two arguments:
- elementType: the data type of the array's elements
- containsNull: whether the array can contain null values
- What do you mean by PySpark data frames?
The distributed collection of well-organised data is called a PySpark DataFrame. DataFrames are similar to relational database tables and are organised into named columns. The most important advantage of the PySpark DataFrame is that its data is distributed across different machines in the cluster, and operations performed on it run in parallel on all the machines. This helps in handling large collections of structured or semi-structured data, ranging up to petabytes.
- What are the advantages of a parquet file in PySpark?
The advantages of a parquet file in PySpark are:
- It is small and consumes less space.
- It helps us to fetch specific columns for access.
- It uses type-specific encoding.
- It gives better-summarised data.