[1] Bottou, L., Curtis, R., Keskin, M., Krizhevsky, A., Lalande, A., Liu, Y., ... & Yu, L. (2018). Large-scale machine learning: Concepts and techniques. Foundations and Trends® in Machine Learning, 10(1-2), 1-200.
[2] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[3] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2125-2159.
[4] Zhang, Y., Zhou, Y., & Liu, Y. (2015). Allreduce: High-performance synchronous and asynchronous collective communication for deep learning. In Proceedings of the 23rd International Symposium on High-Performance Computer Architecture (HPCA '15).
[5] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[6] Li, H., Zhang, Y., Zhang, H., & Liu, Y. (2014). GraphLab: A System for Large-Scale Graph-Based Machine Learning. In Proceedings of the 18th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '14).
[7] Recht, B., & Hsu, D. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 27th International Conference on Machine Learning (ICML '10).