Always New Mistakes

July 20, 2009

Scalability issues for dummies

Filed under: Business, Technology — Alex Barrera @ 2:34 pm

Every once in a while people ask me what’s taking me so long to open my startup Inkzee to the public. They also ask what exactly I have been doing, since the site looks exactly the same. I normally answer that things aren’t easy and that they take time, especially if you are alone, like I am. After a while I end up explaining my problems with scalability, and that’s the point where people just can’t follow me. So here I’m going to explain what scalability problems are and how deep their repercussions run for a small company.

Most web applications, like Inkzee, Facebook, Twitter, … are made of two parts, which we tech nerds call the frontend and the backend. The frontend is the part of the application that’s exposed to the users: the user interface (UI), the emails, the information that is shown. All that UI is a mix of different languages, be it PHP, JavaScript, HTML, etc. The frontend is in charge of drawing the UI on the user’s screen and displaying all the information the user expects from the application. But this information has to come from somewhere, and that somewhere is the backend.


The backend is all the programs and software that run behind the scenes and are in charge of generating, maintaining and delivering the information the frontend displays to the user. The backend can be very homogeneous or very heterogeneous, but it normally consists of two parts: the database (where the data is stored) and the software that deals with that database, does the data crunching and connects it all to the frontend.

Now, some web applications have a bare-bones backend, very simple and lightweight: normally some software that takes what the user enters in the interface and stores it in the database and, vice versa, retrieves it from the database and shows it to the user. Other web applications have an extremely complex backend (e.g. Twitter, Facebook, …). These not only manage data retrieval, but also have to do really complex operations on the data, operations that are very expensive in terms of computational power. For example, each time a user uploads a picture to Facebook, the picture follows a path like this:

  • The picture is stored on a specific hard drive. The backend has to determine which hard drive corresponds to that user (yes, there are multiple hard drives, and each one is assigned to a bunch of users so the load is distributed).
  • Once stored, the picture is sent to a processing queue, where image processing software turns it into a thumbnail. This process is expensive, as it has to analyze the picture and reduce it to a smaller representation of the image while still keeping part of its quality.
  • After processing, the backend stores the newly created thumbnail in the database and also stores both the picture and the thumbnail in an intermediate in-memory “database” for faster access (a cache). This is because it’s faster to retrieve data from memory than from a hard drive.
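As a rough sketch of those three steps (this is not Facebook’s actual code; the four in-memory “drives”, the dictionaries standing in for the database and the cache, and every function name are assumptions made up for illustration), the path could look something like this in Python:

```python
import hashlib
import queue

# Everything below is a stand-in invented for illustration: four in-memory
# "drives", a dict as the database and a dict as the cache.
NUM_DRIVES = 4
drives = [dict() for _ in range(NUM_DRIVES)]  # user pictures, spread over drives
database = {}                                 # thumbnails (permanent store)
cache = {}                                    # in-memory copies for fast reads
thumbnail_queue = queue.Queue()               # pictures waiting to be processed


def drive_for(user_id: str) -> int:
    """Step 1: always map the same user to the same drive."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_DRIVES


def make_thumbnail(picture: bytes) -> bytes:
    """Placeholder for the expensive image processing: keep 1 byte in 10."""
    return picture[::10]


def upload(user_id: str, name: str, picture: bytes) -> None:
    """Store the picture on the user's drive and queue it for thumbnailing."""
    drives[drive_for(user_id)][(user_id, name)] = picture
    thumbnail_queue.put((user_id, name))


def process_queue() -> None:
    """Steps 2 and 3: build each thumbnail, store it and fill the cache."""
    while not thumbnail_queue.empty():
        user_id, name = thumbnail_queue.get()
        picture = drives[drive_for(user_id)][(user_id, name)]
        thumb = make_thumbnail(picture)
        database[(user_id, name)] = thumb
        cache[(user_id, name)] = (picture, thumb)


upload("alice", "beach.jpg", bytes(1000))
process_queue()
print(len(cache[("alice", "beach.jpg")][1]), "bytes in the cached thumbnail")
```

Even in this toy version, every single upload touches a drive, a queue, the database and the cache, which is why the cost adds up so quickly once many users upload at once.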

This is only an approximation of what happens to a picture when you upload it to a social network; I’m pretty sure it goes through a lot more steps. So, suppose 1% of a social network’s users are uploading pics at any given moment, and imagine ~20 photos per user from 2.5 million users at the same time (Facebook currently has around 250 million users). Trust me when I tell you, that’s a lot of data crunching.
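To put a number on it, here is the same back-of-the-envelope calculation in Python. The inputs are just the rough figures from the paragraph above, not measured data:

```python
total_users = 250_000_000   # rough Facebook user count quoted above
uploading_percent = 1       # share of users uploading at the same moment
photos_per_user = 20        # roughly 20 photos each

# Integer arithmetic keeps the result exact.
uploading_users = total_users * uploading_percent // 100
photos_in_flight = uploading_users * photos_per_user

print(f"{uploading_users:,} users uploading")              # 2,500,000 users uploading
print(f"{photos_in_flight:,} photos to store and process") # 50,000,000 photos to store and process
```

That is 50 million pictures, each of which has to be stored, thumbnailed and cached.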

The problem

The best user interfaces (frontends) are designed so that all the complexity behind the scenes is never shown to the end user. The problem is that the frontend depends heavily on the backend. If the backend is slow, the frontend won’t have the info the user is requesting or expecting, and it will seem SLOW to the end user. Not only slow, but in many cases inefficient or just not available at all (meet the Twitter Fail Whale :P).


So, now, what will cause the backend to be slow? Ohhhhhh don’t get me started!! There are so many reasons why the backend might be slow or broken! But most of them are triggered by growth. That is, as the web application is used by more and more users, the backend starts to fall apart. That’s what is known in the tech world as scalability problems: the backend can’t scale at the same speed as users pour into the application. And it’s not only a matter of more users, but also of users that interact more heavily with the site. For example, you might have 100,000 active users and never have experienced big scalability problems. Suddenly you release a feature that allows your users to share pictures more easily… BAM!! Your backend goes down in 10 minutes. Why!! Why?!! you might scream while you watch your servers go down in flames. After all, you have the same number of users, so what happened? Well, most probably the part of your backend that handles picture sharing was designed and tested with only a few users. Now it chokes on the real load.
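A crude way to see why the same 100,000 users can suddenly hurt: the work the backend does depends on how much each user does, not just on how many users there are. In the sketch below the per-user numbers and the “cost” units are invented purely for illustration; only the 100,000 users come from the example above.

```python
def daily_work(users: int, actions_per_user: int, cost_per_action: int) -> int:
    """Very rough load model: total work = users x actions x cost of each action."""
    return users * actions_per_user * cost_per_action


USERS = 100_000  # same user base before and after the new feature

# Hypothetical numbers: sharing used to be clumsy, so people shared 2 pictures a day.
before = daily_work(USERS, actions_per_user=2, cost_per_action=5)
# The new feature makes sharing easy, so the same people now share 30 a day.
after = daily_work(USERS, actions_per_user=30, cost_per_action=5)

print(f"load grew {after // before}x with zero new users")  # load grew 15x with zero new users
```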


The REAL problem

Once you have scalability problems, the next logical step is to find where the bottleneck is and why it is happening. This might seem very easy, but it isn’t at all. It’s like looking for a needle in a haystack. Big backends are normally VERY complex, with many parts coded in different programming languages by different people. Not only that, but sometimes problems arise in several parts of the backend at once. So after a couple of really stressful hours you find the bottlenecks and think of a solution to fix them. Ahh my friend, then you realize it’s not as easy to fix as you thought. First of all, you have no clue if the fixes your team has come up with are good enough. Why? Because you’re stepping into unexplored territory. Few people have had to tackle a similar problem, and even fewer have dealt with your data and systems. So even if you find someone else with the same problem, the solution might be slightly different depending on what systems you use for your backend or which architecture you have. This is the point where you realize that developers aren’t engineers but craftsmen, and that fixing these problems isn’t exactly a science but black voodoo magic.

So, here you are, with a bunch of possible fixes to a problem but with no clue if they will really work or will just be a patch that needs extra fixes in 2 weeks. Normally you try to benchmark the solutions, but that’s not an easy task, especially because you have no real load to test against except on your production servers and no, you don’t want to fuck up the production servers more than they already are.

Finally, after some black magic and some simple tests, you cross your fingers and try the fix on the production servers. After several hours of monitoring the backend for new “leaks”, you scream with happiness as the patch seems to work. Then you start to realize that the patch won’t hold forever and that you need a more radical solution to the problem.

You sit down with your tech team (or on your own, as in my case 😦 ) and you start drafting a new solution. Suddenly you realize that the best fix implies changing the way your backend works. And by change I mean you need to redevelop a big chunk of your backend to fix the problem. This implies a couple of things: you’ll need to invest a lot of time and resources, you’ll lose the stability your backend had (prior to the incident), you’ll walk into new unexplored territory for your team and, worst of all, you can’t just unplug your production servers and change the backend. You need to do it so that both backends coexist for a while, until you have switched all of your servers from the old one to the new one.
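One common way to let two backends coexist (a generic technique, not necessarily what I or anyone mentioned here actually does) is to route a growing percentage of users to the new one. In this minimal sketch, `old_backend`, `new_backend` and `rollout_percent` are hypothetical names, and the two backends are plain functions standing in for real systems:

```python
import zlib


def bucket(user_id: str) -> int:
    """Give every user a stable bucket from 0 to 99."""
    return zlib.crc32(user_id.encode("utf-8")) % 100


def old_backend(user_id: str) -> str:
    return f"old backend served {user_id}"


def new_backend(user_id: str) -> str:
    return f"new backend served {user_id}"


def handle(user_id: str, rollout_percent: int) -> str:
    """Send users whose bucket is below the rollout percentage to the new backend."""
    if bucket(user_id) < rollout_percent:
        return new_backend(user_id)
    return old_backend(user_id)


# Raise rollout_percent step by step (0 -> 10 -> 50 -> 100) as confidence grows.
users = [f"user{i}" for i in range(1000)]
on_new = sum(handle(u, 10).startswith("new") for u in users)
print(f"{on_new} of {len(users)} users are on the new backend at 10%")
```

Because the bucket is derived from the user id, a user who has been moved to the new backend stays there as the percentage goes up, and setting it back to 0 sends everyone to the old backend if things go wrong.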

Now, the REAL problem is that this change, this redesign, grinds the whole company to a halt. All resources, be it people or money, are invested in the redesign effort, so nothing new can be done. Most outsiders just don’t understand the depth of this change and will bash the company for not doing new things, for not releasing new features, for not fixing old bugs, etc. Not only that, investors will start to get anxious and will demand that things start moving. So the outside world only sees that you’ve stalled, while the inside teams are suffering the pressure. On top of that, developers inside the company will get extremely frustrated by the pace of things. They won’t be able to add new features, and even when fixing bugs they’ll need to fix them twice: once in the old backend and once in the new one.

So, in the end, you realize the shit hit the fan and you got all of it. It’s hard, very hard to be there. If you haven’t experienced it, you have no idea how hard it is. You’ll feel the pain not only as a developer but also as a founder, CEO or executive. You won’t be able to publicize your site, because more load might accelerate the old backend’s problems; you can’t give users new features because you have no resources; you will try to explain the problem to investors, but they won’t have a clue what you’re talking about… “backend what?”. Current customers will be pissed at you because the site is running slow and, as far as they can see, you are doing nothing to fix it. So, in the end, everything freezes until the new backend is in place.

How long does this take? It depends. It depends on the size of the redesign, the size of the tech team, the skills of the team and, especially, the skills of the management. During this phase, management must execute impeccably. Sadly, this is not the case in most places, so priorities get changed, mistakes are made and the redesign gets delayed over and over again.

It takes very good leadership to make it through this period: someone who knows where their priorities lie and who is able to foresee the future and the importance of the task ahead. Needless to say, such a figure is lacking in most companies. That’s the reason it took so long for Twitter to get their act together, for Facebook to speed up, etc.

I am there, I am suffering through the redesign phase (twice now). It’s hard, it’s lonely, it’s discouraging and frustrating, but it needs to be done. I just wrote this post so that outsiders can get a glimpse of what it is like to be there and how it affects the whole company, not just the tech department. Scalability problems aren’t something you can dismiss as being ONLY technical: their roots might be technical, but their effects will shake the whole company.

Let there be light 🙂

42 Comments »

  1. That’s why you should have repeatable performance tests.

    You should be able to have vmware simulating multiple machines and run the performance tests on them — it won’t be a perfect simulation, but it should be good enough to indicate if your solution is scalable.

    Also, using automated configuration management tools (e.g. Chef, Puppet, CFEngine, etc.) will let you test out system changes on virtual machines. Then when you are happy with them, you can push the changes out to the production systems.

    Comment by Joe Van Dyk — July 20, 2009 @ 8:18 pm

  2. Hi Joe!

    The vmware solution is pretty neat, I’ve used it in the past, but currently I’m working with just a bunch of development servers. The problem is that it’s sometimes very hard to reconstruct the load the production servers have. You can reproduce concurrent users with synthetic load, but sometimes it’s more than that: it’s the db performance, the state of the memcache servers, the state of the indexer, etc.

    So, even with a vmware server with multiple images, it’s hard to get the exact same conditions 😦

    About automated config tools you’re very right. I’ve played with puppet some time ago and I’ll probably end up using it. That or chef, haven’t decided yet.

    Some weeks ago I discovered Fabric (python) which, although it’s not like puppet, is a lot simpler for small deploys. Check out these python tools if you use/like python:
    http://clemesha.org/blog/2009/jul/05/modern-python-hacker-tools-virtualenv-fabric-pip/

    Thanks for the comment Joe! 🙂

    Comment by Alex Barrera — July 20, 2009 @ 8:30 pm

  3. You should be able to capture your production traffic via webserver logs and then play them back on your development machines.

    I use vmware because I’m cheap.

    Comment by Joe Van Dyk — July 20, 2009 @ 8:49 pm

  4. Isn’t a scalability problem a good problem to have?
    If you have lots of users, you probably have quite significant revenue as well.
    I’m looking forward to scalability problems for my web site PostJobFree.com

    Comment by Dennis Gorelik — July 20, 2009 @ 9:00 pm

  5. thanks for the post Alex! It really shows the importance of building to scale initially and knowing that there will be problems down the road regardless. I am currently in the process of building a site that mimics ebay in regard to features, but not quite as robust. Being the lone developer, and a newbie at that, I am trying to build it right the first time.

    Do you have any advice for a newbie programmer concerned about scaling?

    thanks again, very good post!

    Comment by mike — July 20, 2009 @ 9:03 pm

  6. @joe webserver logs only get you web hits. My backend (and that of the social network where I do consulting work) does a lot more than just serve web hits. With Inkzee, for example, I need to keep all feeds up to date, and it’s not easy to replicate the memcache state or the data stored in the database. In my case I’m still able to do that as my db isn’t huge, but with the terabytes of db we have on the social network, you simply can’t do that, it’s not cost effective. But hey, I do get your point, I’m also very cheap hehe the wonders of bootstrapping 😀

    Comment by Alex Barrera — July 20, 2009 @ 9:45 pm

  7. @dennis yeah indeed, theoretically it’s good. But don’t be mistaken, scalability problems don’t always mean more users. A few users who interact heavily with the site can also bring on problems.

    That’s exactly what I’m experiencing right now. I don’t have too many alpha users, but the amount of information I’m already processing is starting to be more than respectable. The same probably happens to startups that deal with Twitter streams. 1,000 users with 3 tweets per day are not the same as 10 users with 100 tweets per day and thousands of followers 😉

    So, don’t think that scalability problems come exclusively from more users (which to some extent is true, but not enough to be happy about it 😉 )

    Comment by Alex Barrera — July 20, 2009 @ 9:49 pm

  8. […] further reading; an interesting article about the importance of doing it right This entry was posted in web development and tagged […]

    Pingback by not scaling equals failing | mike sudyk — July 20, 2009 @ 9:55 pm

  9. @mike thanks a lot! My pleasure 😀

    My advice is to always think about what will happen with the parts you’re developing if you get 1M users hitting them.

    From what I’ve experienced, it’s hard to foresee the problems if you don’t have experience, and you’ll have to refactor your code many times.

    So, 6 tips:
    – Learn that you’ll be refactoring your code in the future so don’t stress too much about specifics of the development, you’ll probably change it in a month
    – Develop things in a clean way. Isolate everything as much as you can. That way when you have to change something you just have to touch a file and not 100 scattered files.
    – As I said before, always code with the scalability glasses on. Always try to think what will happen if that loop you’re coding is executed by 1M users. Will it be a resource hog? Will it just break due to memory consumption? Will it be fast enough? Is this or that query slow now? What will happen with 1M users? Will it bring the database to a halt?
    – As someone on HN commented, be sure to speed up the front-end first and rule out any slowness there. Sometimes it’s just a matter of slow javascript or too many .js or .css files being downloaded, etc.
    – If something is slow, be sure you know WHY it’s slow. If you think it should be that slow, research alternative methods and always open your mind to changes.
    – Beware of too many changes. Sometimes you have to stick with something, even though you just found something cooler. Tech moves VERY fast and your code stability can’t match that speed, so just learn to live with it and change your code only when it really makes sense

    Hope they help 🙂

    Thanks to you for your comment!

    Comment by Alex Barrera — July 20, 2009 @ 10:02 pm

  10. Alex, what company (web site) do you run, how many visitors does it have and how many page views?

    Comment by Dennis Gorelik — July 20, 2009 @ 10:37 pm

  11. Hi Alex,
    feel your pain.

    agree with nearly everything you say apart from your comment “Always try to think what will happen if that loop you’re coding is executed by 1M users.”

    I agree that one always needs to have the scalability glasses on when developing web apps. However, when developing new web apps, trying to build for big scale may not always be wise, for example if you are prototyping to test the market initially. Sometimes it’s better to get something working with a sensible approach and refactor later when needed, which is easier because the scale glasses were on when doing the original work.

    Comment by JA — July 21, 2009 @ 1:11 am

  12. Hadn’t heard of Inkzee before but it sounds like a nice idea. Hope it works.

    Comment by sashang — July 21, 2009 @ 10:56 am

  13. […] Alex Barrera has a very interesting post about how frustrating it is to figure out that you have a problem and how much trouble it is to fix it after the product is live. I am there, I am suffering the redesign phase (twice now). It’s hard, it’s lonely, it’s discouraging and frustrating, but it needs to be done. I just wrote this post so that outsiders can get a glimpse of what is it to be there and how it affects the whole company, not just the tech department. Scalability problems aren’t something you can discard as being ONLY technical, it’s roots might be technical but its effects will shake the whole company. […]

    Pingback by Scalability for dummies « Scalable web architectures — July 21, 2009 @ 3:23 pm

  14. @dennis I work as a consultant at Tuenti.com Can’t give u the exact numbers as they don’t allow us but let’s say that several million users and a couple of billion page views per month 😉

    @JA yeah I fully agree with you. I guess we could add a corollary, balance the first and third tips, think with the scalability glasses but don’t always try to crack the problems under those conditions, know when it’s going to be slow and it doesn’t matter cause you’ll change it very soon and when you must change it. (I guess this sounds very ethereal hehe)

    @sashang Thanks! It works, although it’s hard to keep up with users expectations. Most people think I’m in a place I haven’t reached yet, but I will eventually. It’s a tough problem as development takes a lot of time. Thanks for the comment! 😀

    Comment by Alex Barrera — July 21, 2009 @ 11:28 pm

    • Alex, what country are most of Tuenti.com’s visitors coming from?

      Quantcast shows spike up to ~20K US visitors per month in January 2009:
      http://www.quantcast.com/Tuenti.com

      However, this Quantcast graph illustrates your point of how minor growth in the number of users can go together with a huge spike in the number of page views.

      Comment by Dennis Gorelik — July 21, 2009 @ 11:51 pm

      • Tuenti is basically based on Spanish traffic. It’s all in Spanish so I’m surprised Quantcast has some numbers hehe

        Yeah, I see it at Tuenti and I see it too with my own startup Inkzee.com. In the latter case, I scale with the number of feeds in the system, not that much with users (which I do too, but not as fast).

        Comment by Alex Barrera — July 22, 2009 @ 12:05 am

  15. Alex,

    Does Inkzee.com compete with Google Reader?

    Comment by Dennis Gorelik — July 22, 2009 @ 12:11 am

    • I guess so, although it’s closer to techmeme.com You could describe it as a hybrid between both 🙂 Although I’m not even close to being a threat to either of them hehe

      It works? Yeah sure it does. Is it perfect? Not yet. Will it grow to be big enough and have cool features? Let’s hope so. First roadblock: damn scaling problems, way too much info to process, but hopefully I’ll have it fixed pretty soon so I can keep working on the features and not on the damn backend.

      Btw, check the blog if you want to read more tech stuff from Inkzee: blog.inkzee.com

      Comment by Alex Barrera — July 22, 2009 @ 12:15 am

  16. Alex,

    Are you running Inkzee.com alone or you have a co-founder(s)?

    Where are you geographically located?

    Comment by Dennis Gorelik — July 22, 2009 @ 12:20 am

    • All on my own, though not by choice. It’s hard to find a good cofounder here in Spain… I’m located in Madrid, Spain, although my plan is to move to the Bay Area (where I go back and forth every year hehe), but that’s still a while off. My biggest problem is the visa issue. I need an investor visa to move there and it’s expensive, like ~$100k expensive.

      Comment by Alex Barrera — July 22, 2009 @ 12:25 am

  17. Your co-founder doesn’t have to be geographically in the same place as you are. Internet connects everybody, even co-founders.

    Do you work full-time or part-time on inkzee?

    Comment by Dennis Gorelik — July 22, 2009 @ 12:31 am

    • I know, but co-founders aren’t freelance developers. I still like to meet up at a Starbucks and brainstorm stuff, you know? Although you’re right, it wouldn’t be the first time I work like that, but still… I like to meet with people at least once before joining forces. I had a bunch of guys I talked to in April when I was around the Bay Area, but as usual it always stays at that, talks 😛

      I’m working “full-time”. As I do consulting work for this other company, some days I can’t work full-time on it, but let’s say that most days I’m full-time with it.

      Comment by Alex Barrera — July 22, 2009 @ 8:28 am

  18. […] Scalability issues for dummies Every once in a while I get people asking me what’s taking me so long to open my startup Inkzee to the public. […] […]

    Pingback by Top Posts « WordPress.com — July 22, 2009 @ 1:26 am

  19. Very nice article. I even wonder if it would be possible to translate it. It would help explain to my Polish friends and family what I generally do.

    At booking.com we are using puppet with some additions. If anybody is interested in more info, check the presentation made by one of my workmates:
    http://blog.koehntopp.de/index.php?url=archives/2440-Kickstart-Puppet.html

    Another good idea is to test new functionality on a small subset of users.

    Comment by Wawrzek — July 22, 2009 @ 1:29 pm

  20. I like the points you have made, especially the ones about modifying the backend and support from management during these times. Management buy-in for these improvements is a must, along with someone who can see the big picture beyond the near-future chaos.

    Comment by Hrishikesh — July 22, 2009 @ 5:13 pm

  21. […] Alex Barrera wrote a nice little article about why “scalability issues” can prevent any visible progress on a web project for months at a time: Scalability Issues for Dummies. […]

    Pingback by Scalability for dummies like me - Digital Digressions by Stuart Sierra — July 23, 2009 @ 8:07 pm

  22. @Wawrzek thanks! Feel free to translate it 🙂 I wish I could help you but my Polish is a little bit rusty 😛

    @Hrishikesh yeap, I guess it has to be someone that has experienced this before and that’s hard

    Comment by Alex Barrera — July 23, 2009 @ 8:41 pm

    • The translation is nearly ready, my wife is proofreading it. I wonder if it is possible to get an account on Inkzee. It’s a bit strange to write about a website you don’t use ;).

      Comment by Wawrzek — August 26, 2009 @ 12:06 am

      • Awesome man! About the account, I’m currently finishing a new deploy. The thing is that the current beta is having a hard time, I can’t give any more accounts right now there cause it just won’t work for you, not that I don’t want to :((

        So I’m moving all the architecture to AWS and I’m migrating all the database from MySQL to a schemaless db (Tokyo Cabinet). I already finished the migration and I’m currently stabilizing the system in AWS so hopefully I’ll resume the invites process next week or so 🙂 But don’t worry, I’ll give you one asap.

        Again, thanks a ton for the translation 😀 Let me know the link so I can send it to my Polish friends 🙂

        Comment by Alex Barrera — August 26, 2009 @ 9:22 am

  23. […] Scalability issues for dummies […]

    Pingback by Sandbox » Blog Archive » Weekly Inspiration #3: our — July 27, 2009 @ 2:47 pm

  24. […] Scalability issues for dummies – “Every once in a while I get people asking me what’s taking me so long to open my startup Inkzee to the public. They also ask me what exactly have I been doing as the web seems exactly the same. I normally answer that things aren’t easy, that it takes time, specially if you are alone, like I am. After a while I end up explaining my problems with scalability and that’s the point where people just can’t follow you. I’m going to explain here what are scalability problems and how deep the repercussions are for a small company.” […]

    Pingback by Software Quality Digest - 2009-07-27 | No bug left behind — July 27, 2009 @ 6:10 pm

  25. So you read the story now watch the movie:

    And something leadership – think about this keep of …

    Comment by Wawrzek — August 27, 2009 @ 9:10 pm

    • Hahaha awesome clips! I had seen the one of the airplane long ago, but it’s just an awesome ad!

      I really feel like doing the cat herding haha

      Thanks for the clips!

      Comment by Alex Barrera — August 28, 2009 @ 12:43 pm

  26. So I’ve finished the translation.

    I made some changes in the style, so it’s a bit less informal. I had some discussion about it with my wife, but she has a Master’s degree in Polish Philology, so I decided she is right about the difference between written and spoken language.

    http://jogger.wawrzek.name/2009/09/10/skalowalnosc-nie-tylko-dla-orlow/#more

    Comment by Wawrzek — September 10, 2009 @ 10:48 pm

    • Hey Wawrzek, great job! Thanks a lot for the translation! No worries about the change in style, whatever fits your readers 🙂

      I just wish I could understand the comments 😛
      Thanks again for the translation effort 🙂

      Comment by Alex Barrera — September 12, 2009 @ 1:03 am

  27. I saw in Google Analytics that the post has been read over 200 times so far and seems to be the most popular entry on my blog since the restart in June. The comments are generally positive; one person complained that the article could go deeper, but another pointed out that the article is not aimed at professional readers. There were also some complaints about the quality of the translation 🙂

    Comment by Wawrzek — September 13, 2009 @ 8:27 pm

  28. Nice problem to have 🙂 (nice post Alex!)

    Comment by Gregor — December 29, 2009 @ 12:34 pm

  29. […] Scalability. Keep in mind the large number of users you are potentially opening up to. It’s not easy to get 100,000 users in the first week, but what if you do? Are you ready to handle the demand? You risk dying of success. (If this point sounds like Chinese to you, ask a technician or read Scalability issues for dummies). […]

    Pingback by 6 cosas a tener en cuenta al hacer una aplicación para facebook | Distrito Nube — December 10, 2010 @ 2:42 pm

  30. I normally answer that things are not easy, that it takes time, After a while 🙂

    Comment by Akilesh Singh — April 20, 2012 @ 6:53 am

  31. This is one of the best articles I have seen explaining issues around scalability in terms an entire organisation can understand. Thank you.

    Comment by mickpick — March 24, 2014 @ 11:41 pm

