Feb 27

Before Google can index the pages of your website, its crawler needs to know how to find them. The best way to tell Google where your pages are is to submit a sitemap. For the script used here you will need shell access to your server and Python 2.2 or greater installed. Start by downloading sitemap_gen-1.5.tar.gz from http://code.google.com/p/sitemap-generators/downloads/list. Next, unzip and untar the file, then cd to the directory created by the untar command. The directory name should be something like sitemap_gen_1.53. Once in the directory, you will need to create the file yoursite_config.xml (I named mine mysite_config.xml, which is the name used in the commands below). Google gives you a sample config file in this directory. Here is the config file I made:

<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.mysite.com/"
  store_into="/home/me/public_html/sitemap.xml"
  verbose="1"
  sitemap_type="web"
>
  <directory
    path="/home/me/public_html"
    url="http://www.mysite.com"
    default_file="index.html" />

  <filter action="drop" type="wildcard" pattern="*/TEST/*" />
  <filter action="drop" type="wildcard" pattern="*/backup/*" />
  <filter action="drop" type="wildcard" pattern="*/.*" />
  <filter action="drop" type="wildcard" pattern="*/*.tar" />
  <filter action="drop" type="wildcard" pattern="*/blank/*" />
</site>
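
To get a feel for what drop filters like the ones in this config would exclude, here is a minimal Python sketch. It assumes the wildcard patterns behave like Python's fnmatch module, applied case-sensitively to the full URL. That is an assumption for illustration, not a description of how sitemap_gen.py works internally. The helper name is_dropped and the sample URLs are made up.

```python
from fnmatch import fnmatchcase

# Wildcard patterns copied from the <filter action="drop"> lines in the config.
DROP_PATTERNS = ["*/TEST/*", "*/backup/*", "*/.*", "*/*.tar", "*/blank/*"]


def is_dropped(url, patterns=DROP_PATTERNS):
    """Return True if the URL matches any drop pattern (assumed fnmatch-style)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical URLs, used only to illustrate the matching.
    samples = [
        "http://www.mysite.com/index.html",
        "http://www.mysite.com/backup/old.html",
        "http://www.mysite.com/.htaccess",
        "http://www.mysite.com/files/site.tar",
    ]
    for url in samples:
        print("drop" if is_dropped(url) else "keep", url)
```

Running a handful of your own URLs through something like this before generating the sitemap can make it easier to spot a pattern that is too broad or too narrow.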

Notice that I used filter action="drop" to exclude files I do not want to submit to Google. The patterns above are shell-style wildcards; the script also accepts regular expressions if you set type="regexp" instead. Now let's run the script to make the sitemap.
python sitemap_gen.py --config=mysite_config.xml --testing
Now have a look at sitemap.xml. It should be located at /home/me/public_html/sitemap.xml, as we specified in the config file. Check that the sitemap lists all the pages you would like to submit to Google. If you need to change your config file, rerun sitemap_gen.py and review sitemap.xml again until everything is correct. Notice that we are running sitemap_gen.py with --testing. Always use --testing until you are ready to submit your sitemap to Google. Then run
python sitemap_gen.py --config=mysite_config.xml
This will submit your sitemap to Google.
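
Before that final run, the review step can be made a little easier with a short script that prints every URL in the generated sitemap. The following is a rough Python 3 sketch, not part of sitemap_gen.py. It assumes the sitemap stores each page address in a loc element and ignores whatever XML namespace the file uses. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def list_sitemap_urls(xml_data):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_data)
    urls = []
    for elem in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls


def print_sitemap_urls(path):
    """Print one URL per line for the sitemap file at the given path."""
    with open(path, "rb") as handle:
        for url in list_sitemap_urls(handle.read()):
            print(url)
```

For the config above, calling print_sitemap_urls("/home/me/public_html/sitemap.xml") would print one page address per line, which is easier to scan than the raw XML.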
You can also resubmit your sitemap using an HTTP request to Google. Here is my HTTP request to resubmit my sitemap to Google.
www.google.com/webmasters/tools/ping?sitemap=http://www.mysite.com/sitemap.xml
Before we submit the request we must URL-encode the sitemap address that follows “sitemap=”. So my HTTP request now looks like
www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fwww.mysite.com%2Fsitemap.xml
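
If you would rather not encode the address by hand, the encoding step can be sketched in Python 3 with the standard library's urllib.parse.quote. The function name build_ping_url is my own, and the endpoint is simply the one used in this post.

```python
from urllib.parse import quote

# Ping endpoint as used in this post.
PING_ENDPOINT = "http://www.google.com/webmasters/tools/ping"


def build_ping_url(sitemap_url):
    """Return the ping URL with the sitemap address percent-encoded."""
    # safe="" makes quote() encode ":" and "/" as well.
    return PING_ENDPOINT + "?sitemap=" + quote(sitemap_url, safe="")


if __name__ == "__main__":
    print(build_ping_url("http://www.mysite.com/sitemap.xml"))
```

Passing safe="" matters here: by default quote() leaves "/" unencoded, which would not produce the fully encoded form shown above.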
Now issue the HTTP request with curl or wget. Quote the URL so the shell does not try to interpret the “?”.
wget "http://www.google.com/webmasters/tools/ping?sitemap=http%3A%2F%2Fwww.mysite.com%2Fsitemap.xml"
Lastly, add your sitemap to your robots.txt file.
Sitemap: http://www.mysite.com/sitemap.xml
You have now told Google how to find pages on your site that it might not normally have found.
