Using machine learning for intelligent shard sizing on the cloud

Narayanan Venkateswaran, Anurag Shekhar, Suvamoy Changder, Narayan C Debnath


Sharding implementations typically rely on conservative approximations to determine the number of cloud instances required and the size of the shard stored on each of them. These approximations are often inaccurate and result in overloaded deployments that need reactive refinement. Reactive refinement, in turn, demands additional resources from an already overloaded system and is therefore counterproductive.

This paper proposes an algorithm that eliminates the need for conservative approximations and reduces the need for reactive refinement. A machine learning algorithm based on multiple linear regression predicts the latency of requests for a given application deployed on a cloud machine. The predicted latency helps decide accurately whether the capacity of the cloud machine will satisfy the service level agreement required for effective operation of the application. Applying the proposed method to a popular database schema on the cloud resulted in highly accurate predictions. The results of the deployment, and of the tests performed to establish this accuracy, are presented in detail.
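As a rough illustration of the idea, the sketch below fits a multiple linear regression by ordinary least squares and compares the predicted latency against a service level agreement threshold. It is not the authors' implementation. The feature names (shard size in GB, requests per second), the sample values, the generating formula, and the 60 ms threshold are all hypothetical and chosen only to make the example self-contained.

```python
# Minimal sketch, not the paper's implementation. All names, sample values,
# and the SLA threshold below are hypothetical.

def fit_linear_regression(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by ordinary least squares."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    n = len(design[0])
    # Build the normal equations (X^T X) b = X^T y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(design, targets))
           for i in range(n)]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_latency(coeffs, features):
    """Predicted latency for one (shard_gb, requests_per_s) observation."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], features))


def meets_sla(coeffs, features, sla_ms):
    """True if the predicted latency is within the SLA threshold."""
    return predict_latency(coeffs, features) <= sla_ms


# Synthetic training data: (shard size in GB, requests per second).
samples = [(10, 100), (20, 150), (30, 400), (40, 250), (50, 600), (60, 300)]
# Synthetic latencies (ms) generated from an assumed linear relation.
latencies = [5.0 + 0.8 * gb + 0.02 * rps for gb, rps in samples]

coeffs = fit_linear_regression(samples, latencies)
fits_small = meets_sla(coeffs, (45, 500), sla_ms=60.0)
fits_large = meets_sla(coeffs, (80, 900), sla_ms=60.0)
```

In this toy setting a candidate shard is accepted for a machine only when `meets_sla` returns true. Real measurements would contain noise, so the fitted model's accuracy would need to be validated before relying on it for sizing decisions.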


Machine Learning; Sharding; Horizontal Partitioning; Cloud; Server Sizing; Deployment Planning; Resource Allocation; Data Sizing








Copyright (c) 2019 Narayanan Venkateswaran, Anurag Shekhar, Suvamoy Changder, Narayan C Debnath

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN: 2303-4521

Digital Object Identifier DOI: 10.21533/pen
