Azure Databricks basics

Study notes

Azure Databricks
  • A Microsoft analytics service, part of the Microsoft Azure cloud platform (Databricks' Apache Spark-based analytics platform, natively integrated with Azure security and data services)
  • Runs on top of a proprietary data processing engine called the Databricks Runtime
  • Offers a fast, easy, and collaborative Spark-based analytics service
  • Key concepts:
    • workspaces
      Groups objects (like notebooks, libraries, experiments) into folders
      Provides access to your data
      Provides access to the compute resources used (clusters, jobs).
    • clusters
      set of compute resources on which you run your code
      Before we can use a cluster, we have to choose one of the available runtimes (see the sketch after this list):
      • Databricks Runtime
        includes Apache Spark, components and updates that optimize the usability, performance, and security for big data analytics.
      • Databricks Runtime for Machine Learning
        a variant that adds multiple machine learning libraries such as TensorFlow, Keras, and PyTorch.
      • Databricks Light
        for jobs that don’t need the advanced performance, reliability, or autoscaling of the Databricks Runtime.
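The runtime is chosen when the cluster is created. A minimal sketch of creating a cluster with a specific Databricks Runtime version through the Clusters REST API; the workspace URL, token, runtime version, and VM size below are placeholder assumptions:

# Sketch: create a cluster with a chosen Databricks Runtime via the Clusters API 2.0.
# The URL, token, and values below are hypothetical - adjust for your workspace.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "study-cluster",
        "spark_version": "13.3.x-scala2.12",  # the Databricks Runtime version
        "node_type_id": "Standard_DS3_v2",    # Azure VM size for the nodes
        "num_workers": 2,
        "autotermination_minutes": 60,
    },
)
print(resp.json())  # contains the new cluster_id on success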

Data in a Databricks workspace
To access our data:
  • Import our files to DBFS using the UI
    • Upload a local file and import the data.
    • Use data already existing under DBFS.
      Once the data is uploaded, it will be available as a table or as a mount point under the DBFS file system (/FileStore).
  • Mount and use supported data sources via DBFS
    • Mount external data sources, like Azure Storage, Azure Data Lake and more.
  • Read data on cluster nodes using the Spark APIs (see the sketch after this list)
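A minimal sketch of the last two options, meant to run in a Databricks notebook (where spark and dbutils are predefined); the storage account, container, secret scope, and file names are placeholders:

# Sketch: mount an Azure Blob Storage container into DBFS, then read it with the Spark API.
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

# Read the mounted data on the cluster nodes with the Spark DataFrame API
df = spark.read.csv("/mnt/mydata/nyc_taxi.csv", header=True, inferSchema=True)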

DBFS mounted data
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. It is an abstraction on top of scalable object storage.
  • Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
  • Allows you to interact with object storage using directory and file semantics instead of storage URLs.
  • Persists files to object storage, so you won’t lose data after you terminate a cluster.
The default storage location in DBFS is known as the DBFS root.

With DBFS you can access:
  • Local files (previously imported). For example, the tables you imported above are available under /FileStore
  • Remote files: objects kept in separate storage services, accessed as if they were on the local file system (see the sketch below)
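For example, both kinds are read through the same file-system semantics (notebook sketch; the file names are placeholders):

# List the files imported through the UI (they land under /FileStore in the DBFS root)
display(dbutils.fs.ls("/FileStore/tables"))

# A local (imported) file and a mounted remote object are read the same way
df_local = spark.read.csv("/FileStore/tables/nyc_taxi.csv", header=True)
df_remote = spark.read.csv("/mnt/mydata/nyc_taxi.csv", header=True)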

Notebooks in Databricks

What is special here is that you can:
  • choose the default language of the notebook's cells (Python, Scala, R, or SQL). You can override the default language by specifying the magic command %<language> at the beginning of a cell (see the example after this list); the supported magic commands are:
    • %python
    • %r
    • %scala
    • %sql
  • choose the cluster the notebook is attached to (where its cells will run)
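For example, in a notebook whose default language is Python, a single cell can switch to SQL (sketch; the table and column names are assumptions based on the earlier import):

Cell 1 (default language, Python):
df = spark.sql("SELECT * FROM nyc_taxi_csv")

Cell 2 (default overridden by a magic command):
%sql
SELECT passenger_count, COUNT(*) AS trips FROM nyc_taxi_csv GROUP BY passenger_count
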
DataFrames in Databricks
Spark offers three different APIs: Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. In Azure Databricks the most commonly used is the DataFrame.
DataFrames are the distributed collections of data, organized into rows and columns. Each column in a DataFrame has a name and an associated type.
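A quick notebook sketch showing that each column carries a name and a type (the schema here is hypothetical):

# Build a small DataFrame; every column has a name and an associated type
df_demo = spark.createDataFrame(
    [(1, "yellow", 12.5), (2, "green", 7.0)],
    schema="trip_id INT, taxi_type STRING, fare DOUBLE",
)
df_demo.printSchema()  # prints each column name with its type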

Load data into a DataFrame:
df = spark.sql("SELECT * FROM nyc_taxi_csv")
Other common statements (DataFrame API; a fuller chained example follows the list):
df = spark.read.format('json').load('/FileStore/tables/sample.json')  # example path
df.write.format('parquet').bucketBy(100, 'year', 'month').mode('overwrite').saveAsTable('table1')
df.select('*')
df.select(COLUMNS)
...
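Put together, a typical chain of DataFrame transformations might look like this (sketch; the column names are assumptions about the nyc_taxi_csv table):

from pyspark.sql import functions as F

# Transformations are lazy; nothing executes until an action such as show()
result = (
    df.select('passenger_count', 'trip_distance', 'fare_amount')
      .filter(F.col('trip_distance') > 0)
      .withColumn('fare_per_mile', F.col('fare_amount') / F.col('trip_distance'))
      .groupBy('passenger_count')
      .agg(F.avg('fare_per_mile').alias('avg_fare_per_mile'))
)
result.show()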

Available statistics (as produced by df.describe() / df.summary(); see the sketch after this list) are:
  • Count
  • Mean
  • Stddev
  • Min
  • Max
  • Arbitrary approximate percentiles specified as a percentage (for example, 75%).
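A sketch of both calls (column names assumed as above):

# describe() covers count, mean, stddev, min, max
df.describe('trip_distance', 'fare_amount').show()

# summary() additionally accepts arbitrary approximate percentiles
df.summary('count', 'mean', 'stddev', 'min', '75%', 'max').show()
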
Correlation between two specific columns:
df.corr('COLUMN1', 'COLUMN2')


Visualize data
  • show()
    Spark built-in (see the comparison sketch below)
  • display()
    Azure Databricks
  • displayHTML()
    Azure Databricks
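A quick notebook comparison (sketch):

df.show(5)        # plain-text output; works in any Spark environment

display(df)       # Databricks notebooks: interactive table with built-in plotting
displayHTML("<h1>NYC Taxi</h1>")  # Databricks notebooks: renders raw HTML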

Resources:
Get started with Azure Databricks - Training | Microsoft Learn
Work with data in Azure Databricks - Training | Microsoft Learn