As a digital agency, we’ve done lots of design and development work for clients all over the world and come across many different scenarios along the way. A typical project involves building a development environment, designing the client’s new website and adding 301 redirects to an .htaccess file to ensure that old pages redirect correctly to new pages when the new website goes live.
With most of our clients, we manage both their existing site and their new site, so we make a conscious effort most of the time to ensure that the URLs stay the same, either by duplicating their live site into a development environment and building the new theme / site from there as a base, or by mapping links out along the way.
Sometimes, clients only want the main pages redirected, or only the pages indexed in Google. However, some clients want every page redirected as a precaution, especially if their pages are regularly posted on forums / blogs / emails, where a page may not necessarily be indexed but still receives traffic.
One of our more recent clients has been working with another agency and approached us to take this work on. They run an online store that has in excess of 9000 products in stock and have been in the process of building their new website from scratch for quite some time. Over the past few months, we’ve built up a good relationship with the client and they asked us to handle migrating their development site over to their live website.
The client had made a conscious effort to try to keep many URLs the same, but wasn’t sure how many were different. We effectively needed to figure out exactly which URLs needed 301 redirects implementing.
As they had such a large product base, it was important for us to ensure that everything was done correctly and that no pages were left returning 404 errors, which would cause lots of issues for them moving forward.
I started doing some research on the internet, both on the Google Webmaster site and on Moz, to see if anybody had written a good guide on migrating large sites from a dev to a live environment. I couldn’t find anything that made it sound simple, so I’ve put together the following tutorial, which I believe is fairly straightforward and should hopefully help people in the future.
Unfortunately, there isn’t a quick way to carry out this work. It’s a very slow process, but hopefully this guide will help you speed things up and gather the links together in one place much more quickly. The aim of this guide is to help others, but also to refine the tutorial further based on the comments that will hopefully be left at the bottom.
I apologise in advance if this tutorial confuses anyone. There’s a lot of renaming and reorganising of data, and finding and replacing certain words. If anything is confusing, please leave a message at the bottom and I’ll try my best to explain it more clearly.
Take a deep breath and let’s get started!
Tools required
In order to carry out this work, we need to use the following tools:
- Integrity for Mac – As a Mac user (hooray!), I sometimes find the choice of SEO tools limited. Integrity allowed us to quickly crawl our client’s website and build a list of all the links on it.
OR
- Xenu Link Sleuth – If you’re a PC user, I think Xenu Link Sleuth will do exactly the same job as Integrity. Essentially, you can enter a URL and crawl every page on the site. You can then export the results and filter them using Excel to remove any pages that you don’t require in bulk.
- Microsoft Excel – I’ve never spent so much time in Excel over the years as I have this week. It’s an absolutely fantastic tool that enabled us to figure out which URLs were not currently available on the dev site, indicating that they needed 301 redirects implementing.
Step 1 – Where we started…
Thinking about this logically, we needed a list of all the URLs on the live site and all the URLs on the development site in order to know which ones didn’t exist on the new site. Integrity for Mac is a small application that allows you to crawl your website for every URL. It lets you set parameters and blacklist / whitelist rules, which is brilliant when working with ecommerce platforms such as Magento.
We started off crawling the live website with the following settings:
As we’re using Magento, we were basically telling Integrity not to show us URLs that contained the terms ‘review’, ‘product_compare’, ‘productalert’ or ‘cache’. When we originally ran the crawl without these blacklist rules added, we had over 35,000 links returned.
We then told Integrity to ignore query strings, not to check external URLs and to ignore trailing slashes.
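For reference, our crawl configuration looked something like this (a recap of the settings above – the exact field names in Integrity may differ slightly, and the blacklist terms are specific to our Magento store):

Blacklist terms: review, product_compare, productalert, cache
Ignore query strings: yes
Check external URLs: no
Ignore trailing slashes: yes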
By the time the report finished, we had a list of every URL that was on the live website. Simples!
Note: If you don’t know what your blacklist rules should be, you can always run the crawl with no rules added and load the full export into Microsoft Excel. From there, you can filter the information using advanced filters, entering keywords such as ‘product_compare’, then deleting the results in bulk. It’s not always obvious first time round which pages a crawl may pull in, so this could be a better option for most people.
Once we had the list of live URLs, we opened the file in Excel, removed all the other columns that contained irrelevant information (Status, Appears On, Link Text), and carried out a find and replace command (Ctrl + H) on the URLs to replace the live URL with the development site URL.
This was as easy as simply replacing:
http://www.yourdomain.com
with
http://dev.yourdomain.com
Once the find and replace command finished running, we had a list of development URLs which may / may not exist on the development site. We did a ‘Save As’ of this file and called it ‘liveurlsondevsite.csv’.
Step 2 – Scrapebox Link Checker
Note: I required a PC for this section. I’ve never found a Mac tool that works as well as the Scrapebox link checker does on PC. There will be something available, but we have both Windows and Mac systems in the office, so this was just preference and ease.
Once we had a CSV containing links that may / may not be on our development site, we copied them into a .txt file (yourbacklinks.txt) and loaded them into the Scrapebox Link Checker tool. We then created a second .txt file containing the text ‘http://www.yourdomain.com’ (saved as yoursites.txt) and added this in the ‘Your Sites’ section so that Scrapebox had something to search for on those pages.
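As an illustration, the two input files looked something like this (the URLs are placeholders, not our client’s real paths):

yourbacklinks.txt – one development URL per line:
http://dev.yourdomain.com/category/product-name.html
http://dev.yourdomain.com/cmspage.html

yoursites.txt – the string Scrapebox should look for on each page:
http://www.yourdomain.com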
The tool then sat there and checked every single link one by one to see if the URL string was found, not found or error 404. Really, we’re only interested in the pages that return a 404 error. This is because if the URL you entered in the ‘yoursites.txt’ was ‘found’ or ‘not found’, a page still existed on that URL. If it shows a 404, the page is actually dead and needs a redirect putting in place.
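To summarise how we interpreted the three possible results:

Found – the page loaded and contained the search string – no redirect needed
Not Found – the page loaded but didn’t contain the string – no redirect needed
Error 404 – the page is dead – needs a 301 redirect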
We then exported both the ‘Found’ entries and the ‘Not Found’ entries as .txt files. Unfortunately, Scrapebox does not allow you to export a list of 404 pages, which is where Excel comes in handy.
Step 3 – Excel
Note: This section was the part which caused the biggest headache. I’m sure there will be a more straightforward way of carrying this out, but I must stress that I’m not proficient with Excel. If anyone has any suggestions on how to streamline the following formula, please post them in the comments.
At this point, we have a list of URLs which ‘found’ and ‘didn’t find’ a particular URL on the page. We don’t, however, have a list of the 404 error pages, as this wasn’t possible to export. Because of this, we have to run a small formula in Excel to remove all the ‘Found’ and ‘Not Found’ URLs from the original list.
Open up the ‘liveurlsondevsite.csv’ and ensure that all the URLs are displayed in Column A. In here, you should have all the URLs that are pointing to your dev site. We know that many of these URLs will be broken, so at this stage, we’re trying to remove any links where a live page was found.
In column B, you should copy in both the ‘Found’ and ‘Not Found’ URLs lists that were exported from scrapebox. You can then highlight the whole of column B only, and go to ‘Data > Remove Duplicates’.
In column C, we can then add the following formula:
=IF(OR(NOT(ISERROR(SEARCH(INDIRECT("B2:B"&(COUNTA($B:$B))),$A2)))),"",$A2)
When entering the above formula, instead of pressing the usual ‘Enter’, you need to press:
- CTRL + SHIFT + ENTER (PC Users)
- CMD + ENTER (Mac Users)
This enters it as an array formula, which you can tell has worked if Excel wraps the formula in curly brackets like these -> { }. In plain terms, the formula searches the URL in cell A2 for every entry in column B; if any entry is found, the cell is left blank, otherwise the URL from A2 is output. Copy the formula down column C for every row of URLs in column A.
Thanks to the user ‘Peter L’ from Stack Overflow for this piece of code. I spent a long time trying to figure out exactly how to do this with no success.
Tip: This section took me a long time to figure out and I kept running into problems where results were not showing. The formula above uses a phrase match rather than an exact match. Because of this, you need to make sure that the bare domain URL doesn’t appear on its own in Column A. For example, if your domain is http://dev.yourdomain.com, you need to make sure it doesn’t appear on its own anywhere in the column (obviously every URL will have the prefix http://dev.yourdomain.com, but it’s important that it doesn’t appear with nothing after it). Anyone with knowledge of Excel should be able to correct the formula and make it an exact string match.
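For what it’s worth, here’s one possible exact-match alternative – a sketch rather than something I’ve battle-tested, so check it against a few rows first. It uses MATCH instead of SEARCH, doesn’t need to be entered as an array formula (a normal Enter will do), and sidesteps the bare-domain problem above because it only blanks out exact matches:

=IF(ISNUMBER(MATCH($A2,$B:$B,0)),"",$A2)

Bear in mind that exact matching means the URLs must be character-for-character identical between the two columns – watch out for things like trailing slashes.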
Once you’ve got this list compiled, you can sort it into alphabetical order to push all the blank cells that will appear to the bottom.
At this point, you have a list of URLs which will not work on the new website once the development store is transferred over. Basically, these are the pages that need 301 redirects creating in order to ensure that when the site does swap over, people don’t hit dead pages.
Creating the 301 redirects
Copy this new list into Column A in a new Excel file and give it a heading of ‘Old URL’. At this point, you should run a find and replace command again and replace the ‘http://dev.yourdomain.com’ with the ‘http://www.yourdomain.com’.
Tip: To check you’re on the right lines, the list of all the URLs in the file above should work if the prefix is the live URL, but fail if the prefix is the development URL. For example:
http://www.yourdomain.com/page1.html – should work
http://dev.yourdomain.com/page1.html – should 404
At this point, you can create a heading in column B called ‘New URL’ and go through the list manually to find the URL that each page should redirect to. This is a slow process and unfortunately, there isn’t a way to automate it.
Once completed, you’ll have two columns, your old website URLs (Column A) and your new website URLs (Column B).
To be on the safe side, highlight all of Column B and run a find and replace command again, replacing http://dev.yourdomain.com with http://www.yourdomain.com, to ensure that none of these links contain the development URL anymore – this will soon be your live website when you come to transfer over.
We can then start creating the rules for your .htaccess file. For the purpose of this tutorial, I’m going to assume that you’re on a Linux-based server that supports .htaccess files and mod_alias (the Apache module that provides the RedirectMatch directive used below). If this isn’t the case, then I would suggest searching Google for a way to create 301 redirects on your setup and amending the formulas below accordingly.
There is a brilliant guide on Moz that explains 301 redirects in more detail. This can be found here – https://moz.com/learn/seo/redirection – I’m going to assume you have a little bit of knowledge in altering .htaccess files and just tell you how to bulk create the rules.
In this particular excel file, you’ll now have the following columns:
Column A – Old URL
Column B – New URL
First of all, we need to remove the domain from the beginning of column A. Highlight all of column A and do a find and replace command, replacing http://www.yourdomain.com with nothing (leave the ‘Replace with’ box blank). Make sure you do not include the trailing / in your search – each path needs to keep its leading slash.
At this point, Column A should just be left with a list of URLs such as:
/category/product-name.html
/category/product-name-two.html
/cmspage.html
and so on…
Do not alter Column B; this will stay exactly the same as it currently is and should contain the full URL.
In Column C, you can enter the following text:
RedirectMatch 301
Column D will then combine the following together:
Column C + Column A + Column B
To do this, we will run a simple formula in the first data row of Column D (row 2 if you have headings), which is as follows:
=C2&" "&A2&" "&B2
You can then copy this all the way down your document until the end. Once done, you should have a list of .htaccess redirects that look something like the following:
RedirectMatch 301 /category/product-name.html http://www.yourdomain.com/category-new/newproducturl.html
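One thing worth bearing in mind (not something we ran into, but a quirk of Apache worth knowing): RedirectMatch treats the old path as a regular expression, so characters like the dots in ‘.html’ are technically wildcards, and an un-anchored pattern can match more than you intend. If you’re only mapping literal URLs, as we are here, the plain Redirect directive performs a simple prefix match and may be the safer choice – simply change the text in Column C accordingly:

Redirect 301 /category/product-name.html http://www.yourdomain.com/category-new/newproducturl.html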
Once you’re done, it’s simply a case of copying and pasting all the above lines into your .htaccess file (test one first!) and then handling the swap of your actual development to live store.
Tips:
- Keep an eye on your Google Webmaster Tools account; Google will notify you of any dead pages that appear or that may have been missed.
- If moving your actual domain name (for example: http://www.pinpointdesigns.co.uk to http://www.yournewdomain.co.uk), tell Google via your Webmaster Tools account that you’re moving the domain. They can then update their records and handle the change alongside your 301s.
- Test everything along the way to make sure you’re on the right track. When dealing with large numbers of URLs, it’s easy to make a mistake. Test your 301s, make sure they work and, if in doubt, crawl your website in full once the 301s are in place to check they all redirect correctly (a quick command-line spot-check is shown below).
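A quick way to spot-check an individual redirect from the command line, assuming you have curl available (the URL is a placeholder):

curl -s -o /dev/null -w "%{http_code} %{redirect_url}\n" http://www.yourdomain.com/category/product-name.html

A working 301 will print ‘301’ followed by the URL it now points to.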
I hope you’ve found this guide useful and I look forward to hearing all your comments below!