Friday, January 31, 2014

Refactoring Blob Store Access

My last post was a translation, from Java to Clojure, of Microsoft's examples of accessing Windows Azure blob storage. The post consisted of a series of interop calls without any context.

I thought it would be interesting to see how they looked in an application. The project I am building is a file backup program.

In the Microsoft examples, each one began by creating a connection string and a reference to the blob store container. In my translation, I just set up the reference to the container once at the top of the file.


I decided to hold the reference to the container in a closure. The container function takes a connection string and container name, uses the Azure SDK classes to build a reference to the container, creates the container in Azure if it doesn't already exist, and returns a map of functions that can be executed on the container.
(ns filer.blobstore
  (:import (java.io File FileInputStream FileOutputStream)
           (com.microsoft.windowsazure.services.core.storage CloudStorageAccount)
           (com.microsoft.windowsazure.services.blob.client CloudBlockBlob)))

(defn container [conn-str container-name]
  (let [ctr (-> (CloudStorageAccount/parse conn-str)
                (.createCloudBlobClient)
                (.getContainerReference container-name))]
    (.createIfNotExist ctr)

    {:upload (fn [{:keys [file target]}]
               (let [blob-ref (.getBlockBlobReference ctr target)]
                 (with-open [r (FileInputStream. file)]
                   (.upload blob-ref r (.length file)))))

     :download (fn [{:keys [blob target]}]
                 (.mkdirs (.getParentFile (File. target)))
                 (with-open [w (FileOutputStream. target)]
                   (.download blob w)))

     :find-blob (fn [blobname]
                  (.getBlockBlobReference ctr blobname))

     :delete (fn [blob]
               (.delete blob))

     :blob-seq (fn []
                 (filter #(instance? CloudBlockBlob %)
                         (tree-seq (fn [f] (not (instance? CloudBlockBlob f)))
                                   (fn [f] (.listBlobs f))
                                   ctr)))

     :delete-container (fn []
                         (.delete ctr))}))

There is one other change I want to call attention to. Testing the code in the REPL, I discovered that the FileOutputStream in the download function was keeping a handle to the file on the file system open. I assume the FileInputStream in the upload function behaves the same way. To fix this, I used the with-open macro to clean up the streams when I was done using them.

I created a new code file to hold this function. I wanted my core.clj file to make the decisions about what needed to be done, but to know nothing about how anything would be done.


My ns declaration looks like:
(ns filer.core
  (:require [filer.config :as config]
            [filer.blobstore :as store]
            [filer.filestore :as files]))
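
The filer.filestore namespace isn't shown in this post; all it needs to supply is all-files, which could be as small as a file-seq filter. This is a sketch, not necessarily the project's actual implementation:

(ns filer.filestore
  (:import (java.io File)))

;; Lazily walk the directory tree and keep only regular files,
;; skipping the directories themselves.
(defn all-files [root]
  (filter #(.isFile %) (file-seq (File. root))))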

Jumping to the bottom of the file, the -main function causes one of three actions to be taken. The default is for the file system folders specified in a configuration file to be backed up to the blob store; alternatively, a specified blob container can be downloaded to a restore folder named in the config file, or a blob container can be deleted.

(defn -main [& args]
  (cond
    (= "delete" (first args))
    (delete-blobs (store/container config/conn-str (second args)))

    (= "restore" (first args))
    (restore-folder config/restore-folder (store/container config/conn-str (second args)))

    :else
    (doseq [p config/back-folders]
      (backup-folder p (make-container p)))))

The backup function operates on a collection of folders to back up. Each root folder is stored in a separate container in the blob store. To create a naming system for my backup containers, I added a function to my config.clj file that returns a container name based on the file system folder and the date.
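
The naming function itself isn't shown here; a hypothetical sketch of what such a config.clj helper could look like (the date format and the sanitizing rules are my assumptions, not necessarily what the project uses):

(import '(java.io File) '(java.text SimpleDateFormat) '(java.util Date))
(require '[clojure.string :as str])

;; Hypothetical: lowercase the last path segment, replace anything an Azure
;; container name disallows with a dash, and append today's date.
(defn container-name [root-folder]
  (let [stamp (.format (SimpleDateFormat. "yyyyMMdd") (Date.))
        base  (-> (File. root-folder)
                  (.getName)
                  (str/lower-case)
                  (str/replace #"[^a-z0-9]" "-"))]
    (str base "-" stamp)))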

The call to subs in the upload-settings function strips off the part of the file name that refers to the root folder, which is already represented by the name of the container the file is being put into. Looking at it now, this definitely violates my goal of separating what to do from how to do it. I may want to move the whole upload-settings function into the blobstore.clj file, but the details of translating file system names to blob store names certainly do not belong here.

(defn make-container [root-folder]
  (store/container config/conn-str
    (config/container-name root-folder)))

(defn upload-settings [f root-folder]
  {:file f
   :target (subs (.toString f) (inc (count root-folder)))})

(defn upload-file [file container root-folder]
  ((:upload container) (upload-settings file root-folder)))

(defn backup-folder [folder container]
  (doseq [f (files/all-files folder)]
    (upload-file f container folder)))
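
To make the subs arithmetic concrete, here is what upload-settings produces for a made-up file and root folder:

;; "/home/me/docs" is 13 characters; (inc 13) skips past the trailing
;; separator, so :target is the path relative to the root folder.
(:target (upload-settings (java.io.File. "/home/me/docs/a/b.txt") "/home/me/docs"))
;; => "a/b.txt"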

The restore function is similar to the backup function, in that both walk through a sequence of files on a source system, determine their name on the destination system and then call the appropriate function on the container.

The primary difference is that backing up files is done for a collection of root folders, which each get their own container, so I need a function to execute for each folder. The program is set up to only restore a single container specified as a command line argument. The -main function gets the single reference to the container and passes it to restore-folder.

(defn get-destination [blob folder]
  (str folder "/" (.getName blob)))

(defn download-settings [blob folder]
  {:blob blob
   :target (get-destination blob folder)})

(defn restore-folder [folder container]
  (doseq [f ((:blob-seq container))]
    ((:download container) (download-settings f folder))))

The delete function is the simplest of all. Deleting a container also deletes all of the files in it. The delete container function could be called straight from -main, but for now it is its own function.

(defn delete-blobs [ctr]
  ((:delete-container ctr)))

Thoughts about this design

Creating the container in one place and then returning a map of functions that reference the container works pretty well. The one bit of awkwardness is that it means that all of the functions have to be invoked with double parentheses. The inner set is for the lookup on the map, the outer set invokes the returned function.

I can make the code look better by binding the function to a symbol in a let, and then using that symbol in the function call. For the restore-folder function it should also help performance some.

(defn restore-folder [folder container]
  (let [downfn (:download container)]
    (doseq [f ((:blob-seq container))]
      (downfn (download-settings f folder)))))

Now I look up the function only once, and then use the same function for every file I download. The cost of looking up a function compared to the cost of downloading a file is minimal, so I will think about it for a while, and keep the version I decide looks better.

Using a .clj file for my configuration file was a pretty obvious choice. Clojure is a superset of edn, so I could probably make use of tagged elements, but I just used def statements. The functions to provide standardized folder names seemed right at home here.
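
As an illustration only, such a config.clj might look like this (every value below is made up):

(ns filer.config)

;; Plain defs; no tagged elements needed.
(def conn-str
  (str "DefaultEndpointsProtocol=https;"
       "AccountName=myaccount;AccountKey=..."))

(def back-folders ["/home/me/docs" "/home/me/photos"])
(def restore-folder "/home/me/restore")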

All of the calls in this program are synchronous calls. In many applications it makes sense to make calls out to the file system or the cloud asynchronous. For this application, however, I don't think it would add much. This is a program that is meant to be called from the command line, with no user interface to block. At one point, I did have an asynchronous version of my upload function but I didn't think it added much besides complexity.
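
For reference, an asynchronous upload could be sketched with futures; this is an illustration of the idea, not the version I discarded:

;; Kick off every upload on its own future, then block until all complete.
(defn backup-folder-async [folder container]
  (let [uploads (doall (map #(future (upload-file % container folder))
                            (files/all-files folder)))]
    (doseq [u uploads] @u)))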


Writing this post I found several errors in my code, and a couple of ways that I could have expressed things differently. Adding let bindings for the function lookups seems obvious now, but I hadn't thought of it an hour ago.

Thank you to anyone who reads this post. I hope you have gotten some benefit from seeing my thought process. I will be doing more posts like this in the future. And if you don't benefit from these posts, sorry about the noise. :)
