Correlation vs Causation


Back in February I read Rand Fishkin’s blog post arguing that correlation can be as good an indicator of search engine ranking signals as causation, if not better. He debunks the idea that correlation between a variable and search engine rank is irrelevant when directing a search campaign, and even goes so far as to say that correlation can be the better indicator.




This might be a controversial point of view for some. For me, having spent most of my waking hours (and some of my sleeping hours) over the past 3.5 years trying to solve many of the challenges of SEO, I completely agree. For those who aren’t familiar with the correlation versus causation argument, here are some hilarious and supportive arguments.

Having digested some of Rand’s views on the matter, I was inspired to share some of my own personal thoughts and solutions on the intricate, real world challenges of ranking well in search engines. 


1. Can I find out exactly how the algorithm works?

Short Answer: No, you can’t.


The algorithm is Google’s most closely guarded secret – the IP behind their ability to deliver the most relevant search results to users. No one – not even within Google – actually knows the whole algorithm.




There are many components to it, including assessing hundreds of data points. This means that many different teams work on many different components. But Google also employs machine learning techniques, meaning that it’s a computer system that self-learns.

Let’s suppose that you did know all of the data points Google uses. Even then, it would be technically and financially unfeasible to completely evaluate all of these data points in order to come up with a precise answer. Effectively, you would have to build a mirror Google system that evaluates every result, and then travel back in time to incorporate all the learnings they’ve found and used over the years.

The algorithm changes more than once every day and is constantly undergoing experiments for further enhancements. So even if you could get access to all of that data (which you can’t) you would then have to analyse it all over again, every single day, to determine what’s relevant to each search ranking result – which you would also have to get from Google somehow.

The search results also vary by device, locale, time of day, search history, whether you’re logged into your account or not, and more. So basically, you can’t predict the algorithm results.


Because you will never get an exact understanding of the algorithm, I believe any conclusion you reach will be far better supported if it includes both a correlation and a confidence score. The correlation measures the strength with which one independent variable (e.g. number of high-quality backlinks) could explain the dependent variable (rank). The confidence score indicates how likely it is that an insight is not a random occurrence. Between the two scores, you can judge how confident to be that a certain course of action will have an impact. To learn more about confidence scores, try this useful article on p-values.
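To make the pairing concrete, here is a minimal sketch of computing both numbers together. The backlink and rank figures are invented for illustration, and the p-value is estimated with a simple permutation test rather than any particular SEO tool’s method:

```python
import random
import statistics

# Hypothetical data: backlink counts and rank positions for ten pages.
backlinks = [120, 85, 300, 40, 210, 95, 150, 60, 250, 30]
rank      = [3,   7,   1,   9,  2,   6,   4,   8,  2,   10]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_obs = pearson(backlinks, rank)

# Permutation test: how often does randomly shuffled data produce a
# correlation at least this strong? A low p-value suggests the observed
# relationship is unlikely to be a random occurrence.
random.seed(0)
n_perm = 10_000
extreme = sum(
    1 for _ in range(n_perm)
    if abs(pearson(backlinks, random.sample(rank, len(rank)))) >= abs(r_obs)
)
p_value = extreme / n_perm

print(f"correlation = {r_obs:.2f}, p-value = {p_value:.4f}")
```

A strongly negative correlation here would mean more backlinks coincide with a better (numerically lower) rank position, and the small p-value is what lets you trust that the pattern is not noise.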




2. Can I use correlation rather than causation?

Short Answer: Yes, but…


As Rand’s article already covers, correlation – particularly at large scale – can help you understand a type of activity that works, rather than chasing an individual data signal with direct causality. The latter approach is in fact riskier, and I tend to agree for the long-term game.

Expanding the data signals you evaluate creates other data issues that you may not be aware of. Just one of these is the problem of multicollinearity. Let’s say that you used both ‘number of unique referring domains,’ and ‘total number of backlinks’ as accessible signals to determine a relationship with search rankings.

The problem could be that you should research both unique domains to backlink from, and a high number of link building opportunities within individual large websites. Multicollinearity issues occur when these two signals are also related to each other, thereby creating a three-way link between the two independent variables and the one dependent variable (the search engine ranking). Which task do you focus on first, and which one is driving the rank?

You need to consider the diminishing return of a data signal according to your particular website. If the number of links is correlated and 1,000 is the “right number,” it makes a difference if you currently have 999 links or just 1. If you are already close to the “goal” it might be that there is something else worth putting your time into first.





Solving the multicollinearity problem is relatively straightforward from a mathematics perspective. Two data signals are commonly considered multicollinear if the correlation between them is 0.7 or higher. In this case, ignore the data signal with the lower confidence score.
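That pruning rule can be sketched in a few lines. The signal names, correlations, and confidence scores below are all made up for illustration:

```python
# Hypothetical confidence scores per independent signal.
confidence = {
    "unique_referring_domains": 0.92,
    "total_backlinks": 0.78,
    "page_speed": 0.85,
}

# Hypothetical pairwise correlations between those signals.
pairwise_r = {
    ("unique_referring_domains", "total_backlinks"): 0.88,  # collinear pair
    ("unique_referring_domains", "page_speed"): 0.10,
    ("total_backlinks", "page_speed"): 0.15,
}

def prune_collinear(confidence, pairwise_r, threshold=0.7):
    """Drop the lower-confidence signal from each collinear pair."""
    dropped = set()
    for (a, b), r in pairwise_r.items():
        if abs(r) >= threshold:
            # Keep the signal whose insight is less likely to be noise.
            dropped.add(a if confidence[a] < confidence[b] else b)
    return [s for s in confidence if s not in dropped]

kept = prune_collinear(confidence, pairwise_r)
print(kept)  # total_backlinks is dropped: collinear and lower confidence
```

With these invented numbers, ‘unique referring domains’ and ‘total backlinks’ correlate at 0.88, so the one with the lower confidence score is discarded before any prioritisation happens.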

To solve for the diminishing return issue, you must design a mathematical formula that caters for your website’s current performance on the correlated data point, the expected return of moving that data point, confidence, and effort required to change each data point.
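One possible shape for such a formula, entirely illustrative and not any vendor’s actual model: score the remaining gap on the data point, weight it by correlation strength and confidence, and divide by effort.

```python
def priority_score(current, target, correlation, confidence, effort_hours):
    """Hypothetical prioritisation score: the closer you already are to
    the target value of a data point, the smaller the remaining payoff."""
    remaining = max(target - current, 0) / target   # fraction of gap left
    expected_return = abs(correlation) * remaining  # diminishing return
    return expected_return * confidence / effort_hours

# The 999-vs-1 links scenario from above, with invented weightings.
nearly_done = priority_score(current=999, target=1000,
                             correlation=0.8, confidence=0.9, effort_hours=40)
barely_started = priority_score(current=1, target=1000,
                                correlation=0.8, confidence=0.9, effort_hours=40)
print(nearly_done < barely_started)  # True: the bigger remaining gap wins
```

The exact weighting is a judgment call for your own site; the point is that the same correlated signal can deserve very different priority depending on where you already stand.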

3. Can I use the conclusion to plan my SEO strategy?

Short answer: No


If you simply added all of the highest payoff tasks to a plan and started working through them one by one, you would get very bogged down, very quickly. For example, some of the highest payoff tasks require expansion into government or education websites to gain authoritative links back to your website. These tasks can be extremely hard to complete, time consuming, and may not even be successful.

Best practice for some tasks changes over time. Take, for example, the use of schemas in articles. This means that the difficulty of completing a task within the current guidelines can also change over time. Does your content team need training on this before they start writing?

The time a task takes can also change over time. As CMS platforms evolve, changing meta data can become easier or harder. There are more tools becoming available all the time for updating redirects, improving page speed via content delivery networks, performing link clean-ups and more. To plan effectively, you have to stay on top of all these trends and adjust your priorities accordingly.

In some companies, certain tasks may be easier or harder to implement. A company that already employs a PR team may find editing and distributing articles easier, while a company with an archaic or custom CMS may find updating meta data very hard, or may even require a website re-launch. One size does not fit all.

Tasks have a diminishing return, too. It might be easy to squeeze out some additional page load time from a page with over-sized images, but how long would it take you to compress the code, implement CSS sprites, or install a content delivery network? And what’s the total cost, relative payoff, and improvement in ranking or traffic for each?

Beware of the dark practices that can cause more harm than good when it comes to SEO. I’m not just talking about clearly black hat practices such as keyword stuffing and the like. I’m talking grey hat practices, such as use of certain PR news distribution websites. SEO blogs are full of such changes on an ongoing basis, so if you’re not following the guidance you may easily be doing more harm than good.

The final issue is scale. Let’s say that you figure out some tasks that are working for you. How do you build a team to implement them, and what if the priorities are regularly changing? It’s impossible for one person to have all of the necessary knowledge. Even if you did find that person, it would be impossible for them to scale or replicate their knowledge in a cost-effective manner.


To solve for these changing environmental factors, you need to maintain a central, authoritative database of SEO activities that stays up to date with the latest trends and captures how long each activity takes you to complete. You must systematically collect and experiment with best practices from a variety of authoritative sources. You need to build a broad team of reliable experts that you can call on as and when required, and implement control and measurement processes to ensure they follow the procedures you have designed. To maximise scale, you also need to leverage project and resource management tools so that your expanded team produces work efficiently and to the quality standards you expect.


4. If the algorithm is updated that often, do I need to change my plan every day?

Short answer: No


In most cases, it’s simply impractical to change your plan on a daily basis. Sure, you should stay on top of changes to your website, or major changes to the algorithm, but planning an SEO campaign involves many different disciplines.

Some tasks simply take longer than a day, anyway. You would waste a lot of internal effort if you kept stopping and starting tasks in response to Google algorithm changes. Abandoning something after weeks of implementation because the algorithm changed makes no sense.

You need to balance implementation with planning time. Too often we observe SEO specialists spending too much time evaluating and planning what to do next. If you over-invest in this activity, you aren’t implementing. This also means you aren’t getting results.


I believe that the first step is to go Agile, and pick a sprint cycle that works for your business. Make the plan transparent, using a suitable project management tool such as Jira or Trello.




Secondly, I still believe in collecting data on a daily basis, to keep on top of the environment and algorithm. Avoid a knee-jerk reaction to every change, and keep implementing the items you feel most confident will have a long-term payoff.


5. Will it work and help my business? 

Short answer: Yes, but…


As many search specialists will tell you, the first thing you need to be clear on is the keywords your business is targeting. Get this part wrong and you’ll start moving keyword rankings up, but not traffic or sales – meaning that all of the effort you put in has been in vain. This is why keyword selection is such a critical component of the planning and over-arching strategy of your campaign.

With so many activities underway on a website as part of an “SEO campaign”, it’s near impossible to prove which actions actually led to improvements and which didn’t. SEO companies and specialists all seem to have magical stories of incredible results for their clients. The issue is that they can’t show they consistently gain results across multiple projects, or say exactly what will work for you.


You need a highly diligent methodology to manage and record all SEO activities, mapped to keywords and URLs (when relevant), with precise data about how long they took and when they were implemented. You then need to monitor ranking improvements and traffic to individual URLs, and aim to draw decisive insights that build confidence in the campaign. A variety of programming and mathematical solutions can assist, but this will require substantial up-front and ongoing investment to get right.
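The record-keeping half of that methodology can start very simply. Below is a minimal sketch, with entirely made-up tasks, URLs, dates, and rank figures, of logging activities against URLs and measuring the rank movement around each one:

```python
from datetime import date

# Hypothetical activity log: what was done, where, when, and the effort.
activities = [
    {"task": "rewrite title tags", "url": "/pricing",
     "done": date(2015, 3, 2), "hours": 3},
    {"task": "add schema markup", "url": "/blog/guide",
     "done": date(2015, 3, 9), "hours": 5},
]

# Rank snapshots per URL (lower number = better position).
rank_history = {
    "/pricing":    {date(2015, 3, 1): 14, date(2015, 3, 29): 9},
    "/blog/guide": {date(2015, 3, 1): 22, date(2015, 3, 29): 21},
}

def rank_delta(url, before, after):
    """Positions gained between two snapshots; positive = improvement."""
    return rank_history[url][before] - rank_history[url][after]

for a in activities:
    delta = rank_delta(a["url"], date(2015, 3, 1), date(2015, 3, 29))
    print(f'{a["task"]}: {delta:+d} positions for {a["hours"]}h effort')
```

Even this crude before/after view lets you compare positions gained per hour of effort across tasks, which is the raw material for the confidence-building insights described above.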


The argument between correlation and causation is sure to be debated for some time to come in certain corners of the internet. Those on each side of the fence are sure to defend and argue their opinion. From our perspective, we believe that finding ways to spend more time implementing and less time deliberating is the best solution for achieving results.

If you’d like to stay in touch with our opinions on using a scientific approach to planning and prioritising Search Engine Optimisation activities, visit the Glasshat blog. Otherwise, we’d love to hear your comments below.