How to Download Blobs from Azure Storage Using Python

Azure Blob Storage offers a cheap and reliable way to store large amounts of unstructured data (such as images). Blob storage has no hierarchical structure, but you can emulate folders by using slashes (/) in blob names. In this article, I will explore how to use the Azure Python SDK to bulk download blobs from an Azure storage account.
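To make the "virtual folder" idea concrete, here is a minimal sketch in plain Python with no Azure calls. The blob names and the two helper functions are made up for illustration; the point is only that a folder is whatever comes before the last slash in a blob name.

```python
import os
from collections import defaultdict

# Hypothetical blob names; the slashes are just characters in the name.
blob_names = [
    "logo.png",
    "photos/2021/beach.jpg",
    "photos/2021/city.jpg",
    "photos/2022/forest.jpg",
]

def group_by_virtual_folder(names):
    """Group blob names by everything before the last slash."""
    folders = defaultdict(list)
    for name in names:
        # rpartition returns ("", "", name) when there is no slash at all
        folder, _, file_name = name.rpartition("/")
        folders[folder].append(file_name)
    return dict(folders)

def to_local_path(root, blob_name):
    """Map a blob name to a path under root using the local OS separator."""
    return os.path.join(root, *blob_name.split("/"))

groups = group_by_virtual_folder(blob_names)
for folder, files in groups.items():
    print(folder or "(container root)", files)
```

The same mapping from blob name to local path is what lets a download script recreate the folder layout on disk.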

When it comes to the Python SDK for Azure storage services, there are two options: the older legacy SDK and the current v12 SDK.

The following code samples use the latest Azure Python SDK (v12).

Pre-requisites for Sample Python Programs

The samples below require Python 3.6 or above. On Windows, you can download it from the official Python website. On a Mac, use Homebrew to install Python 3:

brew install python3

Next, you will need the Azure Python SDK for blob storage access. Use pip to install it:

pip3 install azure-storage-blob --user

Now you are all set to run the following Python programs.
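If you want to confirm that the package is visible to your interpreter, a small check like the one below may help. This is only a sketch: the helper name installed_version is made up, and importlib.metadata is part of the standard library only from Python 3.8, which is newer than the 3.6 minimum of the samples. A version number starting with 12 indicates the v12 SDK.

```python
from importlib import metadata  # standard library from Python 3.8 onwards

def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("azure-storage-blob")
if version is None:
    print("azure-storage-blob is not installed")
else:
    print("azure-storage-blob", version)
```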

How to Bulk Download Files from Azure Blob Storage Using Python

The following Python program uses the Azure Python SDK for storage to download all blobs in a storage container to a specified local folder. It creates local folders for blobs that use virtual folder names (names containing slashes).

Before running the program, set proper values for MY_CONNECTION_STRING, MY_BLOB_CONTAINER and LOCAL_BLOB_PATH. The connection string can be obtained from the Azure portal; it contains the account URL and access key. Note that the program may take a while if your container holds a large number of blobs. The next program below shows how to speed this up using Python's ThreadPool class.

# download_blobs.py
# Python program to bulk download blob files from Azure storage
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires Python 3.6 or above
import os
from azure.storage.blob import BlobServiceClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS"

# Replace with blob container
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want files to be downloaded
LOCAL_BLOB_PATH = "REPLACE_THIS"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Initializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)


  def save_blob(self, file_name, file_content):
    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)

    # for nested blobs, create local path as well!
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(file_content)

  def download_all_blobs_in_container(self):
    my_blobs = self.my_container.list_blobs()
    for blob in my_blobs:
      print(blob.name)
      # Read the whole blob into memory (named blob_data to avoid shadowing the built-in bytes)
      blob_data = self.my_container.get_blob_client(blob).download_blob().readall()
      self.save_blob(blob.name, blob_data)

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()
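The connection string used in the program above is a semicolon-separated list of Key=Value pairs. The sketch below splits one apart so you can see what it contains. The helper name parse_connection_string and all the values are made up for illustration, the layout shown is the typical one rather than the only one, and the SDK does this parsing for you, so the helper is not needed to run the download program.

```python
def parse_connection_string(connection_string):
    """Split a 'Key=Value;Key=Value' string into a dict."""
    settings = {}
    for part in connection_string.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only: account keys are base64 and may end in '='
        key, _, value = part.partition("=")
        settings[key.strip()] = value.strip()
    return settings

# Made-up values for illustration; not a real account or key.
example = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=examplestorage;"
    "AccountKey=ZmFrZS1rZXk=;"
    "EndpointSuffix=core.windows.net"
)

settings = parse_connection_string(example)
print(settings["AccountName"])
# Never print or log the AccountKey of a real account.
print("AccountKey" in settings)
```

Because the string embeds the access key, treat it like a password: keep it out of source control and consider reading it from an environment variable instead of hard-coding it.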

Fast/Parallel File Downloads from Azure Blob Storage Using Python

The following program uses the ThreadPool class from Python's multiprocessing.pool module to download files in parallel from Azure storage. This can substantially speed up the download if you have good bandwidth. The program currently uses 10 threads; you can raise the number, although the gain is eventually limited by your bandwidth. Don't forget to change the MY_CONNECTION_STRING, LOCAL_BLOB_PATH and MY_BLOB_CONTAINER variables.

# download_blobs_parallel.py
# Python program to bulk download blobs from Azure storage in parallel
# Uses the latest Python SDK (v12) for Azure blob storage
# Requires Python 3.6 or above

import os
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

# IMPORTANT: Replace connection string with your storage account connection string
# Usually starts with DefaultEndpointsProtocol=https;...
MY_CONNECTION_STRING = "REPLACE_THIS_CONNECTION"

# Replace with blob container name
MY_BLOB_CONTAINER = "myimages"

# Replace with the local folder where you want downloaded files to be stored
LOCAL_BLOB_PATH = "REPLACE_THIS_PATH"

class AzureBlobFileDownloader:
  def __init__(self):
    print("Initializing AzureBlobFileDownloader")

    # Initialize the connection to Azure storage account
    self.blob_service_client =  BlobServiceClient.from_connection_string(MY_CONNECTION_STRING)
    self.my_container = self.blob_service_client.get_container_client(MY_BLOB_CONTAINER)

  def download_all_blobs_in_container(self):
    # get a list of blobs
    my_blobs = self.my_container.list_blobs()
    result = self.run(my_blobs)
    print(result)

  def run(self, blobs):
    # Download 10 files at a time!
    with ThreadPool(processes=10) as pool:
      return pool.map(self.save_blob_locally, blobs)

  def save_blob_locally(self, blob):
    file_name = blob.name
    print(file_name)
    # Read the whole blob into memory (named blob_data to avoid shadowing the built-in bytes)
    blob_data = self.my_container.get_blob_client(blob).download_blob().readall()

    # Get full path to the file
    download_file_path = os.path.join(LOCAL_BLOB_PATH, file_name)
    # For nested blobs, create the local folders as well
    os.makedirs(os.path.dirname(download_file_path), exist_ok=True)

    with open(download_file_path, "wb") as file:
      file.write(blob_data)
    return file_name

# Initialize class and download files
azure_blob_file_downloader = AzureBlobFileDownloader()
azure_blob_file_downloader.download_all_blobs_in_container()
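One thing to be aware of with pool.map is that if any single download raises an exception, the call as a whole raises and you lose the list of files that did succeed. The sketch below shows one possible alternative using concurrent.futures.ThreadPoolExecutor from the standard library, which is a different tool from the ThreadPool class used above. It is not tied to Azure: fake_download is a stand-in that you would replace with a function like save_blob_locally, and the names download_all and fake_download are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(blob_name):
    """Stand-in for a real download; fails for '.bad' names to show error handling."""
    if blob_name.endswith(".bad"):
        raise IOError("simulated failure for " + blob_name)
    return blob_name

def download_all(blob_names, download, max_workers=10):
    """Run download(name) for every name; return (succeeded, failed)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which name each future belongs to
        futures = {executor.submit(download, name): name for name in blob_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                succeeded.append(future.result())
            except Exception as error:
                # Record the failure and keep going with the other files
                failed[name] = str(error)
    return sorted(succeeded), failed

succeeded, failed = download_all(["a.jpg", "b.bad", "c.jpg"], fake_download)
print(succeeded)
print(failed)
```

With this shape, a failed file ends up in the failed dictionary together with its error message, so you can retry just those names instead of restarting the whole download.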