BIDMach: Scaling Up Deep Learning on Clusters

Posted in Academic, Projects


  • Extended the state-of-the-art machine learning library BIDMach to clusters of cloud machines with GPUs
  • Improved BIDMach's distributed communication system, allowing clusters to run model-parallel algorithms

Promotional Video
Today, machine learning benefits many parts of our lives. However, in this big data age, there are some challenges for us…

(The video is on YouTube, might be unavailable in your region.)

Bill Clinton said:

The challenge of the 21st century is to find out what works and scale it up.

Bill Gates said:

A breakthrough in machine learning will be worth 10 Microsofts.

Disclaimer: BIDMach is a state-of-the-art machine learning framework led by Prof. John Canny. Scaling Up Deep Learning on Clusters is my capstone project for the Master of Engineering degree at UC Berkeley. I worked on it with Aleks Kamko and Jiaqi Xie. The article below is part of my degree thesis and mainly discusses my technical contributions to this project. For an overview of the whole BIDMach project, please visit


Capstone Members

Our capstone project is Scaling Up Deep Learning on Clusters, advised by Professor John F. Canny and read by Assistant Professor Joseph E. Gonzalez. The project focuses on a novel machine learning framework called BIDMach. BIDMach is a component of the BID Data Project by the Berkeley Institute of Design.

Our capstone team has three MEng students: Quanlai “Qualia” Li, Jiaqi “Max” Xie, and Aleks Kamko. An undergraduate, Kevin Peng, and a software engineer from Lawrence Lab, Abdelrahman Elbashandy, also work on the project. Our different backgrounds in Computer Science, Mathematics, and Software Engineering help us better understand and solve the problems in this project.

BID Data Project

BID Data is an open-source6 project initiated by Prof. John Canny. It provides fast tools for big data analysis and an interactive environment, which makes it easy to build and use machine learning models. The BID Data suite has three major elements[1][2]:

  • Underlying hardware (e.g. CPU, GPU7).
  • Software:
    • BIDMat, a matrix library8 that takes care of data management, calculation and hardware acceleration.
    • BIDMach, a machine learning framework that includes efficient algorithms[3].
  • Scaling Up:
    • Butterfly Mixing: an efficient inter-machine communication strategy.
    • Sparse AllReduce: a MapReduce9-like primitive for scalable communication[4].

One feature of BIDMach is its low energy cost. OpenChai wants to provide a machine learning solution that runs on local machines using mobile processors, rather than in the cloud. This solution will be more affordable, since mobile processors are cheaper. It will also be more secure for enterprises and individuals who do not want to upload their data to cloud services controlled by big companies.

BIDMach before the Capstone

Before our capstone project started, BIDMach could only run on a single machine. It supported several machine learning algorithms, such as K-Means, Random Forest, and the General Linear Model[5]. We wanted to scale these algorithms up to a cluster of machines. At that time, BIDMach did not fully support Neural Network models; for example, it did not support the Convolutional Layer and thus could not process images. We needed to implement the missing neural network layers ourselves first, and then scale them up[6][7].

Objectives of the Capstone Project

We have two main objectives. The first is to scale up BIDMach on clusters; Spark, a cluster computing framework, is used as the communication layer for this objective[8]. The second is to cooperate with our industry partner, OpenChai, to provide machine learning solutions to enterprises and individuals (Li, 2016). The objectives can be broken down into several parts, as listed below:

  1. Familiarize ourselves with the system and tools (e.g. the Spark framework, TX1 hardware, and the Scala language).
  2. Extend model-parallel algorithms (e.g. K-Means, Random Forest) to Spark.
  3. Extend data-parallel algorithms (e.g. General Linear Model) to Spark.
  4. Bring Neural Network algorithms to BIDMach (e.g. Convolutional Neural Network, Sequence-to-Sequence Model).
  5. Implement a mechanism to send worker progress to the master.
  6. Benchmark algorithms on OpenChai's hardware, the TX1.
  7. Optimize parallel algorithms on the TX1.
  8. Help market OpenChai's machine learning solution.

The purpose of this chapter is to demonstrate my technical contributions to this project and their importance. Specifically, I focused on objective 1, part of objectives 3 and 4, and objective 5.

Elastic Averaging Stochastic Gradient Descent

Elastic Averaging Stochastic Gradient Descent (EASGD) is a mechanism that enables the General Linear Model to run on a cluster of machines, which makes predictions and classifications much faster.

Three members of our team worked together on the first three tasks. We familiarized ourselves with the system and tools, and then started extending machine learning algorithms in BIDMach. Aleks Kamko first focused on data-parallel algorithms[8]. Max worked on a function called ParCall, a communication method for distributed machines that is useful for model-parallel workers[9].

A distributed system (a.k.a. a cluster) usually has one master machine and several worker machines. In our project, the master machine assigns tasks to the worker machines. In model-parallel algorithms, the worker machines also communicate with one another.

In a machine learning context, a matrix of weights is updated as the algorithm consumes more data. Most machine learning models seek a good representation of the weights, together with a direction in which to update the matrix so that training converges faster.

Under the General Linear Model, the master machine distributes a partition of the data to every worker machine and asks the workers to update the matrix using that data. The challenge is to construct a communication method for the worker machines. GLM support is indispensable for BIDMach because generalized linear models are widely applied and effective; BIDMach would not be a credible machine learning framework without them.

Elastic Averaging Stochastic Gradient Descent (EASGD) is an algorithm for updating the matrices on the worker machines[10]. A master machine periodically collects the matrices from the workers and calculates their average. After each pass, the master broadcasts the averaged matrix, and the worker machines update their matrices elastically based on it. With elasticity, a worker machine updates its matrix using a weighted average of its own previous matrix and the matrix received from the master.

We scrutinized the research by Zhang et al. and implemented the EASGD function in our models. EASGD deserves attention because it strikes a balance between keeping each worker machine independent and enabling the workers to communicate. In our experience, models trained with EASGD achieve better prediction accuracy and lower running time.
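As an illustration, one synchronization round of the elastic update rule from Zhang et al. can be sketched in Python; this is a minimal sketch, not BIDMach's actual code, and the hyper-parameter names `lr` and `alpha` are illustrative:

```python
import numpy as np

def easgd_round(workers, center, grads, lr=0.1, alpha=0.05):
    """One synchronization round of elastic averaging SGD.

    Each worker takes a gradient step and is pulled elastically
    toward the center (averaged) matrix; the center in turn moves
    toward the workers.
    """
    new_workers = []
    for x, g in zip(workers, grads):
        # gradient step, plus an elastic pull toward the center
        new_workers.append(x - lr * g - alpha * (x - center))
    # the center moves toward the average of the workers
    center = center + alpha * sum(x - center for x in workers)
    return new_workers, center
```

The elastic term `alpha * (x - center)` is what keeps each worker loosely tied to the consensus without forcing full synchronization on every step.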

Convolutional Neural Network


Apart from the General Linear Model, there are other effective machine learning models that we want to run on BIDMach. The Deep Neural Network, or deep learning model, is the most notable one.

We focused on one type of Neural Network, the Convolutional Neural Network (CNN). A CNN can process images, with the Convolutional Layer taking the pixels of an image as input. A CNN also contains other layers, such as the Pooling Layer, ReLU Layer, and Fully Connected Layer, which process information in different ways. By combining these layers in different orders, we can orchestrate CNN models for different image processing tasks[11][12].

Implementing Convolutional Neural Network models is necessary because they enable high-performance image processing.


Most of our work on the CNN went into implementing the Convolutional Layer, which requires a great deal of computation. We build on an underlying library, BIDMat, also designed by Prof. Canny, which provides efficient matrix operation APIs13.

I worked together with Jiaqi on this part. We first read through the code of BIDMat to gain a better comprehension of its underlying design and to find APIs that could help us build BIDMach. Meanwhile, we referred to the code of other Neural Network layers; for example, the Linear Layer was already implemented and shares some similarity with the Convolutional Layer. We then used these APIs to write the code for the Convolutional Layer.
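BIDMach's Convolutional Layer is built on BIDMat's optimized matrix APIs; as a language-neutral sketch of what the layer computes, a naive forward pass can be written as follows (a minimal illustration, not the optimized implementation):

```python
import numpy as np

def conv2d_forward(image, kernels, stride=1):
    """Naive valid-mode forward pass of a convolutional layer.

    image:   (H, W) single-channel input
    kernels: (K, kh, kw) one filter per output channel
    Returns: (K, out_h, out_w) feature maps.
    """
    K, kh, kw = kernels.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((K, out_h, out_w))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                # dot product of one image patch with the k-th filter
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[k, i, j] = np.sum(patch * kernels[k])
    return out
```

Production implementations avoid these explicit loops by reshaping patches into a matrix and using a single large matrix multiply, which is exactly the kind of operation BIDMat accelerates.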


To test the code written for the Convolutional Layer, I wrote a script14. With reference to other scripts, I transplanted a classification problem from TensorFlow to BIDMach, using the rcv1 dataset for this task. Below is the design of my neural network:

(design of neural network)

This script does more than just testing. On the one hand, it serves as a template for new users designing a CNN for a specific task: deploying another CNN only requires changing a few lines of the script.
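The idea of the script as a reusable template can be illustrated with a hypothetical layer specification; the layer names and options here are invented for illustration and do not reflect BIDMach's actual API:

```python
# Hypothetical network spec: editing this list is all that is
# needed to deploy a different CNN for another task.
network = [
    ("conv",    {"filters": 32, "kernel": 3}),
    ("relu",    {}),
    ("pool",    {"size": 2}),
    ("conv",    {"filters": 64, "kernel": 3}),
    ("relu",    {}),
    ("pool",    {"size": 2}),
    ("linear",  {"units": 10}),   # one output per class
    ("softmax", {}),
]

def describe(net):
    """Render a one-line-per-layer summary of the network spec."""
    return [(name + " " + ", ".join(f"{k}={v}" for k, v in cfg.items())).strip()
            for name, cfg in net]
```

Keeping the architecture as data rather than code is what makes this kind of script easy to adapt to a new task.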

On the other hand, since we use the same network design, configuration (a.k.a. hyper-parameters), and dataset, we can compare our performance with TensorFlow's.

Benchmarking of Convolutional Neural Network

Once the Convolutional Neural Network is finished, we can benchmark BIDMach on datasets such as CIFAR-10 or rcv1. CIFAR-10 is a standardized dataset of 60,000 labeled images. We can train BIDMach and other systems (e.g. MXNet, TensorFlow) on this dataset and compare prediction accuracy, time consumption, and energy consumption.
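A comparison of this kind only needs each framework to expose a train and a predict entry point; a minimal benchmarking harness might look like the following sketch, where `train_fn` and `predict_fn` are placeholders rather than any framework's real API (energy consumption would need an external meter):

```python
import time

def benchmark(train_fn, predict_fn, data, labels):
    """Time a training run and report accuracy and wall-clock seconds.

    train_fn/predict_fn stand in for any framework's entry points
    (BIDMach, TensorFlow, MXNet, ...); the harness only compares
    prediction accuracy and time consumption.
    """
    start = time.perf_counter()
    model = train_fn(data, labels)
    elapsed = time.perf_counter() - start
    preds = predict_fn(model, data)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return {"seconds": elapsed, "accuracy": accuracy}
```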

Other Neural Networks to Implement on BIDMach

Besides the Convolutional Neural Network, we implemented other types of Neural Networks, such as the Sequence-to-Sequence (Seq2Seq) model. This model has the same type of input and output; a typical application is translation, where both input and output are human language. Our team member Aleks put more effort into this part.

Worker Progress Sending

Users of BIDMach may also want better control over, and understanding of, how BIDMach is working. Sending worker progress to the master gives the user who controls the master more information about the whole system.

We want to provide basic information about worker progress (e.g. number of iterations, training and validation accuracy) and performance metrics (e.g. current calculation speed and data throughput). Calculation speed is measured in GFLOPS17; data throughput (a.k.a. bandwidth) is measured in MB/s.

Several components of BIDMach relate to data throughput: classes like Machine, Learner, Worker, and Master all touch it. I first needed to decide which class to put my measurement in. Machine is a class at a lower level of the system and has more interaction with the underlying communication layers (e.g. Spark, or Direct Memory Access, DMA). Thus, I decided to put my measurement there.

The Machine class in BIDMach runs asynchronously, meaning different Machine instances run at the same time and can influence each other. This design makes the whole system faster, but it made evaluating the throughput harder, since running time cannot be measured precisely from the outside. By gauging the number of nanoseconds spent during a socket data transmission, I obtained the overall time spent on one transmission; I could also get the number of bytes transmitted each time. Dividing the two yields the average bandwidth (a.k.a. throughput).
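The measurement described above can be sketched in Python; the real implementation lives inside BIDMach's Machine class, while this standalone version just times a single socket send:

```python
import time

def send_with_throughput(sock, payload):
    """Send a byte payload and return the measured throughput in MB/s.

    Mirrors the approach described above: gauge the nanoseconds
    spent in one socket transmission, then divide bytes sent by
    seconds elapsed.
    """
    start = time.perf_counter_ns()
    sock.sendall(payload)
    # guard against a zero-duration reading on very fast sends
    elapsed_ns = max(time.perf_counter_ns() - start, 1)
    return (len(payload) / 1e6) / (elapsed_ns / 1e9)  # bytes -> MB, per second
```

In an asynchronous system, averaging this per-transmission figure over many sends gives a usable throughput estimate even when overall running time cannot be measured precisely.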

Visualization of Worker Progress

We also want to visualize the throughput while the program is running. This enables end users to better understand what is going on across the different machines. By analyzing the throughput information, the designer of the machine learning algorithm can also make adjustments to fully utilize the computational resources of the distributed system.


In this chapter, I first introduced the tasks of our capstone project, then discussed my accomplishments and the problems I solved. Specifically, I implemented Elastic Averaging Stochastic Gradient Descent and a Convolutional Neural Network script, and built a mechanism to send worker progress to the master machine.

I also demonstrated the value of these contributions. The Convolutional Neural Network is an essential part of deep learning, able to process images. Elastic averaging gradient descent enables the General Linear Model to run on a cluster of machines, which speeds up classification and prediction. Sending worker progress to the master helps users analyze the status of the whole system, and they can further improve the design of their machine learning models accordingly.

As a next step, I would like to collaborate with Prof. Canny and my teammates to construct an algorithm, running on the master machine, that sends each worker machine a timeout value based on its progress. This would make the whole system more efficient.


Machine Learning and Deep Learning

Machine learning is a way to solve artificial intelligence problems. It usually involves a mathematical or statistical model and takes a large amount of data as input. A machine learning problem generally has two phases: a training phase and a predicting phase. Training fits the model to the data; in other words, it changes the numbers (a.k.a. weights) in the model according to the data so that the model better represents the real world. Predicting computes an output value given the weighted model and new data. While predicting is computationally trivial, training takes much longer and needs more human work to optimize[12][13][14].
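The two phases can be made concrete with a tiny example, here a least-squares linear model in Python (NumPy used purely for illustration):

```python
import numpy as np

def train(X, y):
    """Training phase: fit the weights to the data (least squares)."""
    # np.linalg.lstsq finds the weights minimizing ||X @ w - y||
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(X, w):
    """Predicting phase: a single matrix-vector product."""
    return X @ w
```

Even in this toy case the asymmetry is visible: training solves an optimization problem over all the data, while predicting is one cheap matrix operation.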

Distributed System and Scalability

A distributed system, or cluster, is a group of computers working on one task. With a large number of small machines, a distributed system can combine their computational power. Scalability describes how that power grows with the number of machines. Ideally, computational power would be proportional to the number of machines, but oftentimes this is not the case; for some tasks (e.g. machine learning algorithms), achieving good scalability is a research topic in itself.
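These notions can be quantified: speedup is the single-machine time divided by the cluster time, efficiency is speedup divided by machine count, and Amdahl's law (a standard result, stated here for illustration) bounds the speedup when part of the work is inherently serial:

```python
def speedup(t_single, t_cluster):
    """How many times faster the cluster is than a single machine."""
    return t_single / t_cluster

def efficiency(t_single, t_cluster, n_machines):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup(t_single, t_cluster) / n_machines

def amdahl_speedup(serial_fraction, n_machines):
    """Amdahl's law: the best possible speedup when a fixed
    fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_machines)
```

For example, with 10% of the work serial, no number of machines can give more than a 10x speedup, which is one reason scaling machine learning algorithms is a research topic.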


  6. Open-source projects are projects that share their code online, and are sometimes open for modification and redistribution
  7. Central Processing Unit and Graphics Processing Unit, GPUs are good at matrix computation and thus widely used in machine learning
  8. A software library is a reusable programming component
  9. MapReduce is a programming model that deals with big data operation
  13. Application Program Interface, used as standardized protocols and tools in programming
  14. A small piece of easily modifiable code
  17. Number of billion floating point operations per second


  1. C. Volkmann, K. Tokarski, and K. Ernst, “Social entrepreneurship and social business”, An Introduction and Discussion with Case Studies. Gabler. Wiesbaden, 2012.
  2. J. Canny, “Interactive machine learning”, University of California, Berkeley, 2014.
  3. J. Canny, H. Zhao, B. Jaros, Y. Chen, and J. Mao, “Machine learning at the limit”, in Big Data (Big Data), 2015 IEEE International Conference on, IEEE, 2015, pp. 233-242.
  4. J. Canny and H. Zhao, “Bidmach: Large-scale learning with zero memory allocation”, in BigLearn Workshop, NIPS, 2013.
  5. H. Zhao and J. Canny, “Kylix: A sparse allreduce for commodity clusters”, in Parallel Processing (ICPP), 2014 43rd International Conference on, IEEE, 2014, pp. 273-282.
  6. J. Jia, P. Kalipatnapu, R. Chiou, Y. Yang, and J. F. Canny, “Implementing a gpu-based machine learning library on apache spark”, 2016.
  7. R. Chiou, J. Jia, P. Kalipatnapu, Y. Yang, and J. F. Canny, “Building a distributed, gpu-based machine learning library”, 2016.
  8. A. Kamko, 2017.
  9. J. Xie, “Jiaqi Xie's final report”, 2017.
  10. S. Zhang, A. E. Choromanska, and Y. LeCun, “Deep learning with elastic averaging sgd,” in Advances in Neural Information Processing Systems, 2015, pp. 685-693.
  11. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks”, in Advances in neural information processing systems, 2012, pp. 1097-1105.
  12. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, vol. 521, no. 7553, pp. 436-444, 2015.
  13. V. N. Vapnik, “An overview of statistical learning theory”, IEEE transactions on neural networks, vol. 10, no. 5, pp. 988-999, 1999.
  14. J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural networks, vol. 61, pp. 85-117, 2015.