Always New Mistakes

July 20, 2009

Scalability issues for dummies

Filed under: Business, Technology — Alex Barrera @ 2:34 pm

Every once in a while I get people asking me what’s taking me so long to open my startup Inkzee to the public. They also ask me what exactly I’ve been doing, as the web seems exactly the same. I normally answer that things aren’t easy, that it takes time, especially if you are alone, like I am. After a while I end up explaining my problems with scalability, and that’s the point where people just can’t follow me. I’m going to explain here what scalability problems are and how deep the repercussions are for a small company.

Most web applications, like Inkzee, Facebook, Twitter, … are made of two parts: what we, the tech nerds, call the frontend and the backend. The frontend is the part of the application that’s exposed to the users, that is, the user interface (UI), the emails, the information that is shown. All that UI is a mix of different languages, be it PHP, JavaScript, HTML, etc. The frontend is in charge of drawing the UI on the user’s screen and displaying all the information the user is expecting from the application. But that information has to come from somewhere, and that somewhere is the backend.


The backend is all the programs and software that run behind the scenes and are in charge of generating, maintaining and delivering the information the frontend displays to the user. The backend can be very homogeneous or very heterogeneous, but it’s normally made up of two parts: the database (where the information and data is stored) and the software that deals with that database, does the data crunching and connects it all to the frontend.
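To make the split a bit more concrete, here is a minimal, hypothetical sketch (the /articles endpoint, the table and the fields are invented for illustration): the frontend asks the backend for data over HTTP, and the backend pulls it from the database and hands it back.

```python
# A minimal, hypothetical frontend/backend split: the backend exposes an HTTP
# endpoint, reads rows from a database and returns them as JSON; the frontend
# (a browser page, not shown here) would fetch this URL and draw the UI from it.
# The table, route and data are all made up for illustration.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO articles (title) VALUES ('Hello'), ('World')")

class Backend(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/articles":
            rows = db.execute("SELECT id, title FROM articles").fetchall()
            body = json.dumps([{"id": r[0], "title": r[1]} for r in rows]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Backend).serve_forever()
```

That’s the whole dance in miniature: the frontend only ever sees the JSON, never the database behind it.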

Now, some web applications have a barebones backend, very simple and lightweight: normally some software that takes what the user inputs on the interface and stores it in the database, and vice versa, retrieves it from the database and shows it to the user. Other web applications have an extremely complex backend (e.g. Twitter, Facebook, …). These not only manage the data retrieval, but have to do really complex operations with the data. Not only complex, but very expensive operations in terms of computational power. For example, each time a user uploads a picture to Facebook it follows a path roughly like this:

  • The picture is stored on a specific hard drive. The backend has to determine which hard drive corresponds to that user (yes, there are multiple hard drives and each one is assigned to a bunch of users so the load is distributed).
  • Once stored, the picture is sent to a processing queue where it will be turned into a thumbnail by image processing software. This process is expensive, as it has to analyze the picture and reduce it to a smaller representation of the image while still maintaining part of its quality.
  • After processing it, the backend stores the newly created thumbnail in the database and puts both the picture and the thumbnail in an intermediate in-memory “database” (a cache) for faster access. This is because it’s faster to retrieve data from memory than from a hard drive.

This is an approximation of what happens to a picture when you upload it to a social network. I’m pretty sure it goes through a lot more processes though. So, supposing 1% of a social network’s users are uploading pics at any single moment, and that Facebook has around 250 million users currently, that’s 2.5 million users uploading at the same time; at ~20 photos per user, that’s on the order of 50 million pictures hitting that pipeline at once. Trust me when I tell you, that’s a lot of data crunching.
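To give a feel for the moving parts, here is a hypothetical, heavily simplified sketch of that kind of upload pipeline. The shard rule, the queue and the cache dictionary are toy stand-ins for illustration, not Facebook’s actual system.

```python
# Toy sketch of an upload pipeline: pick a storage shard for the user,
# queue the expensive thumbnail work, then keep hot copies in memory.
# Everything here (shard count, queue, cache dict, placeholder functions)
# is a stand-in for illustration only.
import queue

NUM_SHARDS = 8                      # pretend we have 8 storage volumes
thumbnail_jobs = queue.Queue()      # work queue for the image processors
cache = {}                          # stand-in for an in-memory cache

def shard_for(user_id: int) -> int:
    # Each user is pinned to one shard so load is spread across drives.
    return user_id % NUM_SHARDS

def handle_upload(user_id: int, photo_bytes: bytes) -> None:
    shard = shard_for(user_id)
    store_on_disk(shard, user_id, photo_bytes)     # 1. store the original
    thumbnail_jobs.put((user_id, photo_bytes))     # 2. defer the expensive work

def thumbnail_worker() -> None:
    while True:
        user_id, photo_bytes = thumbnail_jobs.get()
        thumb = make_thumbnail(photo_bytes)        # CPU-heavy image processing
        save_thumbnail(user_id, thumb)             # 3. persist the thumbnail
        cache[("photo", user_id)] = photo_bytes    # ...and keep hot copies
        cache[("thumb", user_id)] = thumb          #    in memory for fast reads

def store_on_disk(shard, user_id, data): ...       # placeholders: real storage,
def make_thumbnail(data): return data[:1024]       # image resizing and database
def save_thumbnail(user_id, thumb): ...            # writes would go here
```

Now multiply that queue by tens of millions of items and you start to see why the backend is the hard part.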

The problem

The best user interfaces (frontend) are designed so that all that complexity that goes on behind the scenes is never shown to the end user. The problem is that the frontend depends heavily on the backend. If the backend is slow, the frontend won’t have the info the user is requesting or expecting, and it will seem SLOW to the end user. Not only slow, but in many cases inefficient or just not available at all (meet the Twitter Fail Whale :P).


So, now, what will cause the backend to be slow? Ohhhhhh don’t get me started!! There are so many reasons why the backend might be slow or broken! But most of them are triggered by growth. That is, as the web application is used by more and more users, the backend starts to fall apart. That’s what, in the tech world, is known as scalability problems. That is, the backend can’t scale at the same speed the users pour into the application. And it’s not only a problem of more users, but of having users that interact more heavily with the site. For example, you might have 100,000 active users and never have experienced big scalability problems. Suddenly you release a feature that allows your users to share pictures more easily… BAM!! Your backend goes down in 10 minutes. Why!! Why?!! you might scream while you watch your servers go down in flames. After all you have the same amount of users, so what happened? Well, most probably the backend system that handles picture sharing was designed and tested only with a few users. Now it chokes under the real load.
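As a purely hypothetical illustration of how that happens, this is the kind of shortcut that works fine in testing but melts in production: doing the expensive image work synchronously inside every web request instead of deferring it to a queue like in the earlier sketch. None of this is any real site’s code; it just shows the failure mode.

```python
# Hypothetical "works on my machine" version of picture sharing: the thumbnail
# is generated inline, inside the request handler. With 10 test users this is
# instant; with thousands of simultaneous uploads every web worker is busy
# resizing images and the whole frontend appears to hang.
import time

def handle_upload_naive(user_id: int, photo_bytes: bytes) -> dict:
    thumb = make_thumbnail(photo_bytes)   # CPU-heavy work blocks this request...
    save_photo(user_id, photo_bytes)      # ...and so do two synchronous writes
    save_thumbnail(user_id, thumb)
    return {"status": "ok"}               # the user waits for all of the above

def make_thumbnail(data: bytes) -> bytes:
    time.sleep(0.5)                       # stand-in for real image processing time
    return data[:1024]

def save_photo(user_id, data): ...        # placeholders for real database/disk writes
def save_thumbnail(user_id, thumb): ...
```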


The REAL problem

Once you have scalability problems, the next logical step is to find where the bottleneck is and why it’s happening. This, which might seem very easy, isn’t at all. It’s like looking for a needle in a haystack. Big backends are normally VERY complex, with many parts coded in different programming languages by different people. Not only that, but sometimes problems arise in several parts of the backend at once. So after a couple of really stressful hours you find the bottlenecks and think of a solution to fix them. Ahh my friend, then you realize it’s not as easy to fix as you thought. First of all, you have no clue if the fixes your team has come up with are good enough. Why? Because you’re stepping into unexplored territory. Few people have had to tackle a similar problem and even fewer have dealt with your data and systems. So even if you find someone else with the same problem, the solution might be slightly different depending on what systems you use for your backend or which architecture you have. This is the point where you realize that developers aren’t engineers but craftsmen, and that fixing these problems isn’t exactly a science but black voodoo magic.

So, here you are, with a bunch of possible fixes to a problem but with no clue if they will really work or if they’re just a patch that will need extra fixes in 2 weeks. Normally you try to benchmark the solutions, but that’s not an easy task, especially because you have no real load to test against except on your production servers, and no, you don’t want to fuck the production servers up more than they already are.
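For what it’s worth, a first rough benchmark often looks something like this hypothetical sketch: hammer a staging copy of the endpoint with concurrent requests and look at the latencies. The URL and the numbers are made up, and this only approximates real traffic patterns, which is exactly why benchmarks like this can still mislead you.

```python
# Very rough, hypothetical load test: fire N concurrent requests at a staging
# endpoint and report latency percentiles. Real user traffic is burstier and
# messier than this, so treat the numbers as a sanity check, not a guarantee.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URL = "http://staging.example.com/articles"   # made-up staging endpoint
REQUESTS = 500
CONCURRENCY = 50

def timed_request(_):
    start = time.perf_counter()
    with urlopen(URL) as resp:
        resp.read()
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_request, range(REQUESTS)))
    print(f"median: {latencies[len(latencies) // 2]:.3f}s")
    print(f"p95:    {latencies[int(len(latencies) * 0.95)]:.3f}s")
```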

Finally, after some black magic and some simple tests, you cross your fingers and try the fix on the production servers. After several hours of monitoring the backend for new “leaks”, you scream with happiness as the patch seems to work. Then you start to realize that the patch won’t hold forever and that you need a more extreme solution to the problem.

You sit down with your tech team (or on your own, as in my case 😦 ) and you start drafting a new solution. Suddenly you realize that the best fix implies changing the way your backend works. And by change I mean you need to redevelop a big chunk of your backend to fix the problem. This implies a couple of things: you’ll need to invest a lot of time and resources, you’ll lose the stability your backend had (prior to the incident), you’ll walk into unexplored territory for your team and, worst of all, you can’t just unplug your production servers and change the backend; you need to do it so that both backends coexist for a while, until you switch all of your servers from the old one to the new one.
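One common way to keep both backends coexisting, sketched here purely hypothetically (OldBackend and NewBackend are stand-ins for whatever storage layers are being migrated between), is to hide them behind one interface, write to both, and keep reading from the old one until the new one has proven itself.

```python
# Hypothetical sketch of running an old and a new backend side by side:
# every write goes to both, reads come from the old one until a switch is
# flipped. The classes are toy stand-ins, not any particular company's code.
class OldBackend:
    def __init__(self): self.data = {}
    def save(self, key, value): self.data[key] = value
    def load(self, key): return self.data.get(key)

class NewBackend(OldBackend):       # same toy interface, "rewritten" internals
    pass

class MigratingBackend:
    def __init__(self, old, new, read_from_new=False):
        self.old, self.new = old, new
        self.read_from_new = read_from_new   # flip once the new backend is trusted

    def save(self, key, value):
        self.old.save(key, value)            # dual-write keeps both copies in sync...
        self.new.save(key, value)            # ...which is also why every fix lands twice

    def load(self, key):
        backend = self.new if self.read_from_new else self.old
        return backend.load(key)

backend = MigratingBackend(OldBackend(), NewBackend())
backend.save("user:42:avatar", b"...")
print(backend.load("user:42:avatar"))
```

The pain described below (fixing every bug twice, no room for new features) falls straight out of that dual-write step.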

Now, the REAL problem is that this change, this new redesign, grinds the whole company to a halt. All resources, be it people or money, are invested in the redesign effort, so nothing new can be done. Most outsiders just don’t understand the depth of this change and will bash the company for not doing new things, for not releasing new features, for not fixing old bugs, etc. Not only that, investors will start to get anxious and will demand that things start moving. So the outside world only sees that you’ve stalled, while the inside teams are suffering the pressure. On top of that, developers inside the company will get extremely frustrated by the pace of things. They won’t be able to add new features, and even when fixing bugs they’ll need to fix them twice, once in the old backend and once in the new backend.

So, in the end, you realize the shit hit the fan and you got all of it. It’s hard, very hard to be there. If you haven’t experienced it you have no idea how hard it is. Not only as a developer, but as a founder, CEO, or any other executive, you’ll feel the pain. You won’t be able to publicize your site because more load might accelerate the old backend’s problems, you can’t give users new features because you have no resources, and you’ll try to explain the problem to investors but they won’t understand a word of what you’re talking about… “backend what?”. Current customers will be pissed at you because the site is running slow and, as far as they can tell, you’re doing nothing to fix it. So, in the end, everything freezes until the new backend is in place.

How long does this take? It depends. It depends on the size of the redesign, the size of the tech team, the skills of the team and, especially, the skills of the management. During this phase, management must execute impeccably. Sadly, this is not the case in most places, and so priorities get changed, mistakes are made and the redesign gets delayed over and over again.

It takes very good leadership to make it through this period. Someone who knows where their priorities lie and who is able to look ahead and grasp the importance of the task at hand. Needless to say, such a figure is lacking in most companies. That’s the reason it took Twitter so long to pull their act together, Facebook so long to speed up, etc.

I am there, I am suffering the redesign phase (twice now). It’s hard, it’s lonely, it’s discouraging and frustrating, but it needs to be done. I just wrote this post so that outsiders can get a glimpse of what it is to be there and how it affects the whole company, not just the tech department. Scalability problems aren’t something you can dismiss as being ONLY technical; their roots might be technical but their effects will shake the whole company.

Let there be light 🙂

July 13, 2009

Anchoring an idea or product

Filed under: Business — Alex Barrera @ 6:00 pm

I’m currently reading Crossing the Chasm (Geoffrey A. Moore). A friend recommended it to me when I told him I was struggling with the idea of selling services from my startup, Inkzee, to the enterprise. It’s an old book (1991, revised in 1999; old in terms of the tech scene), but the ideas and tips are surprisingly valid nowadays.

One of the key ideas for a successful “chasm crossing”, or selling an idea to the mainstream markets, is to create a reference market to which your product/service can be compared in the heads of your clients. This principle is fairly easy to follow, but quite complex in nature. It taps into the way our brain works, and it’s not the first time I’ve stumbled onto it. Some months ago I finished reading another book, Predictably Irrational by Dan Ariely. It’s not an awesome book, but it holds some very interesting insights into how humans react to different situations. One of the examples was about getting the pricing right, or what the author calls the anchor price.


Both ideas are rooted in the same principle. In the first case, the author suggests that when selling something innovative, something that has no competition, the way to go is to create that competition. How do you create it? Easy: you introduce 2 new concepts, the alternative market and the alternative product. You need to position your product close to a market the customer already knows well. For example, if you’re selling an online word editor, you could position yourself close to the market of desktop word processors, aka Microsoft Word; that will be your alternative market. It’s a market the customer knows, buys from and, most importantly, already has a budget allocated for. By positioning next to that market, the client can make comparisons between your product and what they’re already using. In other words, you create an anchor they can use to compare you against. You set yourself into a preexisting category in the customer’s head. The problem is that you then need to differentiate your product from that preexisting market. The way to do this is by referencing your alternative product, that is, a product or service that is similar to yours and is a market leader, but in a different market niche. In this example, you could name something like Salesforce.com. So you end up with a punch line like, “our online word processor is like a Salesforce for word processing”. So, in conclusion, the idea is to create an anchor point and a differentiating value proposition.

Now, in Predictably Irrational the idea was very similar. Instead of focusing on a product sales proposition, the domain was the pricing of a product. The example the author gave was the pricing of a subscription. The offer goes as follows: an annual subscription to The Economist (online access) costs $59. An annual subscription to The Economist (print) costs $125. Finally, an annual subscription (print and online access) also costs $125. Which one would you choose? Chances are, the last one. Why is that? Truth is, humans can’t value things without any reference. We always draw conclusions from comparisons. Our mind works under a cause/effect paradigm, that is: if the paper edition (let’s symbolize the concept paper edition with the symbol A) costs $125 (B is the price), and online + paper (C) also costs $125 (B, as it’s the same price as before), and online + paper (C) is better than paper alone (A), then (here comes the effect) option C (paper + online edition for $125) is the best one.

  1. A -> B
  2. C -> B
  3. C > A

As we can see from the simple logic relations above, without [3] we can’t choose between the first 2 options. We need something to compare against. In the pricing example, we are creating an anchor price, $125, something we know the value of: the printed edition of a magazine (our alternative market). We then offer a new product, something innovative we aren’t familiar with: online access to a publication. By virtue of putting it next to the magazine realm (in this case by giving it the same price) we create a connection between both propositions. The catch is that we need a 3rd cornerstone to allow humans to see the difference, to be able to choose. In this example we are using a quantitative approach to choose: 2 things are better than 1, especially if that 1 thing is part of the other offer.


In the product example we use the alternative product to create a point of reference against which to compare our product. Here the equivalent of relation [3] isn’t as clear and powerful as in the subscription example, but it plays the same role: a way to quantify and compare your product. For example, Microsoft Word (A) is part of the desktop publishing tools market (B), and our product (C) is in a similar market (B). Salesforce.com (C’) is doing great and is similar to our product, but in a different market niche (D). Therefore, our product must be as good as Salesforce, but in the market of desktop publishing.

  1. A -> B
  2. C -> B
  3. C’ ~= C (similar products)
  4. C’ best in D
  5. C > A

As you see, the train of thought is slightly more complex, but it ends up with a similar conclusion. Granted, it’s not as straightforward as the pricing example and you need to prove both [3] and [4] to get the client to buy into your proposition, but it’s much easier to do that than to try and sell it blindly.

Ahh, the beauties of neuromarketing 😉

July 6, 2009

Is the cloud the beginning of Skynet?

Filed under: Technology — Alex Barrera @ 6:54 pm

I recently went to see the latest Terminator movie, Terminator Salvation. I have to say that I’ve always been a great fan of the saga, even though I don’t really believe in such a catastrophic future. Nevertheless, after watching the movie, which was pretty decent by the way (a little soft at the end though), I started thinking about how smart Skynet is depicted as being in the Terminator movies. I thought, hey, if you could just nuke the datacenter where Skynet is, you would eliminate it. But then I started thinking about cloud computing.


For those unfamiliar with cloud computing (those familiar can skip this paragraph), it’s basically a new way of using computational resources (I’m oversimplifying the idea here though). Instead of buying or renting servers to deploy a web application, you rent computational power from a provider and pay by the hour. Put simply, companies with spare computational capacity on their own servers will rent that time out for you to use. There is no need to buy expensive hardware or maintain it. Instead you use the computational units when you need them and as many as you need. That way you can take care of temporary spikes of usage in your applications by spinning up more computational units and switching them off after the spike. The cool thing about it is that you don’t need to care about the underlying hardware you are using, nor about the replication of your data. That is, the cloud system will maintain several copies of your data transparently, so that if you lose data, you’ll still be able to recover it.
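As a hypothetical sketch of that elasticity (the CloudProvider class here is completely made up, not any real vendor’s API), the logic behind “use more units during a spike and switch them off afterwards” boils down to something like this:

```python
# Hypothetical autoscaling loop: rent more computational units when load is
# high, release them when it drops. `CloudProvider` is an imaginary,
# pay-by-the-hour provider used only to illustrate the idea.
import time

class CloudProvider:
    def __init__(self): self.units = 1
    def add_unit(self): self.units += 1
    def remove_unit(self): self.units = max(1, self.units - 1)

def current_load() -> float:
    return 0.5                        # stand-in: requests per second, CPU %, etc.

def autoscale(provider: CloudProvider) -> None:
    while True:
        load_per_unit = current_load() / provider.units
        if load_per_unit > 0.8:       # spike: rent another unit
            provider.add_unit()
        elif load_per_unit < 0.2:     # spike over: stop paying for idle units
            provider.remove_unit()
        time.sleep(60)                # re-evaluate every minute
```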


So, back to Skynet. Most cloud computing systems are built to be extremely reliable, that is, if any of the servers being used fails, the system will switch to a new server transparently. The end user won’t even notice the underlying hardware had a problem. The same happens with the data too. Those advances are part of a field known as high availability and, although it’s not perfect, they are getting there. In the near future, few web applications will experience downtime because of faulty hardware or problems in the datacenter (like the recent lightning strike on an Amazon datacenter). That means systems will be so spread out that even if you nuked one of the cloud provider’s datacenters, they won’t go down. Most probably a bunch of other datacenters all over the world will take over and you, as a user, won’t notice anything.
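A toy sketch of what “switching to a new server transparently” can mean in practice (the replica URLs are invented for illustration): the client just retries against the next replica, so a dead machine, or a dead datacenter, never surfaces as an error.

```python
# Toy failover sketch: try each replica in turn and return the first answer
# that arrives. If one datacenter (or server) is gone, the caller never sees
# it. The replica URLs below are made up for illustration.
from urllib.request import urlopen

REPLICAS = [
    "http://dc-us-east.example.com/data",
    "http://dc-eu-west.example.com/data",
    "http://dc-asia.example.com/data",
]

def fetch_with_failover(path_suffix: str = "") -> bytes:
    last_error = None
    for url in REPLICAS:
        try:
            with urlopen(url + path_suffix, timeout=2) as resp:
                return resp.read()          # first healthy replica wins
        except OSError as err:              # dead or unreachable server: move on
            last_error = err
    raise RuntimeError(f"all replicas failed: {last_error}")
```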

Now, if you think about Skynet and strip out the AI, its backbone is just what cloud computing is trying to achieve right now. How many more years will we need to build a system that has no single point of failure? Scary thoughts…
