Tuesday, September 04, 2007

creating a Google sitemap for your website

As I wanted Google to better index the website I administer, I needed to create a sitemap. Here is a nice description of the benefits of having a sitemap and the Sitemap XML-based protocol in general:
https://www.google.com/webmasters/tools/docs/en/protocol.html
You'd think creating a sitemap would be a fairly simple, mundane task. It is and it isn't. You need to know how your website is organized on the web server's file system. And, to do a sitemap correctly for a large site with more than 500 pages of web-accessible content, you need to either spend a bit of money on some software to create a sitemap (see http://www.xml-sitemaps.com/ or http://www.auditmypc.com/free-sitemap-generator.asp) or get your hands dirty using Google's free Python script to generate a sitemap.

The second choice is the more complicated and time-consuming one, as it involves the following steps:
- install Python
- set up a config file to point at a log or your webroot to get a list of your content
- run a Python script to generate the sitemap
- tell Google you've got a new sitemap

Being the frugal masochist that I am, I chose the second option. The steps are described in great detail here:
https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html

I will try to add value by reviewing the stumbling blocks I encountered along the way to building my first sitemap. Here are the general steps I performed:
1) downloaded the sitemap generator files from Sourceforge
2) created a configuration file for my website
3) downloaded and installed Python from the official website
4) ran sitemap_gen.py
5) added the sitemap I generated to Google Webmaster tools

I suggest you run these steps from a server with a development instance of your current website running.

1) downloaded the sitemap generator files from Sourceforge
Since our website runs on Windows 2000 Server, I grabbed the ZIP version of the sitemap generator files from Sourceforge. I unzipped them to a temporary directory.

2) created a configuration file for my website
I chose access logs as the source
This step was time-consuming, as I first had to figure out whether to generate my sitemap from URL listings, directory listings, access logs, or another sitemap. As our website is made up mostly of dynamic pulls from our database via ASP/ASP.NET pages, I felt that a sample access log from one of the servers in our farm would give me the most reliable source of data for the sitemap.

I reduced the size of my access log
This part ended up being the most time-consuming, because our daily web logs are about 8GB each. Most requests to our website are for images, so I wrote a simple AWK script to extract only the 10-15% of requests that are ASP-related.

***
Be aware that you will get memory errors if you don't have enough RAM installed. I found that it took about 1.9GB of memory for the script to analyze a 1.3GB logfile. So for a logfile of x GB, you'll need about 1.5x GB of RAM (1.9/1.3 ≈ 1.5).
***
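
The rule of thumb in the note above can be written out as a small calculation. This is only a sketch of the arithmetic: the function name estimated_ram_gb is made up for illustration, and the 1.5 ratio is the rounded figure from the single measurement quoted above, not a guarantee.

```python
# Ratio taken from the note above: about 1.9GB of RAM for a 1.3GB logfile.
MEASURED_RATIO = 1.9 / 1.3  # roughly 1.46, rounded up to 1.5 in the note


def estimated_ram_gb(logfile_gb, ratio=1.5):
    """Return a rough RAM estimate in GB for a logfile of logfile_gb GB.

    The default ratio is the rounded rule of thumb, an assumption that
    may not hold for other logs or machines.
    """
    return logfile_gb * ratio


print("measured ratio: %.2f" % MEASURED_RATIO)
print("RAM estimate for a 1.3GB log: %.2f GB" % estimated_ram_gb(1.3))
```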

In my logfiles, the seventh column is the uri-stem:
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
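
If you don't want to count columns by hand, the column number can be read off the #Fields line itself. Here is a minimal Python sketch that does this for the header shown above; the function name awk_column is made up for illustration, and it assumes the field names are separated by whitespace as they are in my logs.

```python
FIELDS_LINE = (
    "#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version "
    "cs(User-Agent) cs(Cookie) cs(Referer)"
)


def awk_column(fields_line, name):
    """Return the 1-based column number (awk's $N) of a named log field."""
    names = fields_line.split()[1:]  # drop the leading "#Fields:" token
    return names.index(name) + 1     # raises ValueError if the field is absent


print(awk_column(FIELDS_LINE, "cs-uri-stem"))  # 7
```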

Here is the AWK command that keeps only the log lines whose seventh column (the code file path) contains "asp":
awk '{if ($7 ~ /asp/) print $0}' access.log > access2.log

I use the Cygwin Unix tools for NT. These tools are invaluable for munging through large logfiles.

The sitemap_gen.py script is going to look for the #Fields header in the log, so I modified my awk command to include it:
awk '{if ($7 ~ /asp/ || $1 ~ /#Fields/) print $0}' access.log > access2.log
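
If you don't have awk handy, roughly the same filter can be sketched in Python. This is an approximation of the awk one-liner above, not something I ran against my own logs: the names keep_line and filter_log are made up, and reading the log as UTF-8 with bad bytes replaced is an assumption.

```python
def keep_line(line):
    """Mirror the awk test: keep the #Fields header and any line whose
    seventh whitespace-separated column contains 'asp'."""
    cols = line.split()
    if not cols:
        return False
    if "#Fields" in cols[0]:
        return True
    return len(cols) >= 7 and "asp" in cols[6]


def filter_log(src_path, dst_path):
    """Copy matching lines from src_path to dst_path, one line at a time,
    so the whole log never has to fit in memory. Returns the lines kept."""
    kept = 0
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if keep_line(line):
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_log("access.log", "access2.log") would then play the same role as the awk command.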

OK! So that trimmed my weblog down to about 1.5GB. This should be enough data to make a valid sitemap file!

I also used some filters to reduce the amount of redundant data in the logfile
My awk script should already have removed any gif or jpeg requests, but just in case, I added drop filters for those to the filter section of config.xml, along with a filter for any requests to our store locator.

In the end, here is what my config.xml file looked like:
<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.mywebsite.com/"
  store_into="/Inetpub/development/sitemap.xml.gz"
  verbose="1"
  suppress_search_engine_notify="1"
>
  <accesslog path="/temp/access.log" encoding="UTF-8" />
  <filter action="drop" type="wildcard" pattern="*.jpg" />
  <filter action="drop" type="wildcard" pattern="*.gif" />
  <filter action="drop" type="regexp" pattern="/storeLocator/" />
</site>
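
To get a feel for which URLs these drop filters would remove, here is a rough Python sketch that reads the filters out of the config and tries them on a few URLs. It only approximates the idea: I am assuming shell-style matching for "wildcard" and a regular-expression search for "regexp", which may differ from what sitemap_gen.py does internally, and the function names and sample URLs are made up.

```python
import fnmatch
import re
import xml.etree.ElementTree as ET

CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://www.mywebsite.com/"
      store_into="/Inetpub/development/sitemap.xml.gz"
      verbose="1"
      suppress_search_engine_notify="1">
  <accesslog path="/temp/access.log" encoding="UTF-8" />
  <filter action="drop" type="wildcard" pattern="*.jpg" />
  <filter action="drop" type="wildcard" pattern="*.gif" />
  <filter action="drop" type="regexp" pattern="/storeLocator/" />
</site>"""


def load_drop_filters(config_bytes):
    """Return (type, pattern) pairs for every drop filter in the config."""
    site = ET.fromstring(config_bytes)
    return [(f.get("type"), f.get("pattern"))
            for f in site.findall("filter") if f.get("action") == "drop"]


def is_dropped(url, filters):
    """Approximate the filters: shell-style wildcard or regexp search."""
    for kind, pattern in filters:
        if kind == "wildcard" and fnmatch.fnmatchcase(url, pattern):
            return True
        if kind == "regexp" and re.search(pattern, url):
            return True
    return False


filters = load_drop_filters(CONFIG)
for url in ("http://www.mywebsite.com/home.asp",
            "http://www.mywebsite.com/images/logo.gif",
            "http://www.mywebsite.com/storeLocator/find.asp"):
    print(url, "->", "drop" if is_dropped(url, filters) else "keep")
```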


3) downloaded and installed Python from the official website
Since I didn't have Python installed on my development server, I followed this nice little instructional video on Installing Python from ShowMeDo.com. The process was very easy: point and click and accept the defaults. But the video made it even easier.

The one other thing you should do is add the Python executable to your PATH. In Windows, I added C:\Python25\ to my PATH system variable, which is found under My Computer -> Properties -> Advanced tab -> Environment Variables.

4) ran sitemap_gen.py
Alright! The moment of truth is upon me. I ran the python script from the command line:
bash-2.02$ python sitemap_gen.py --config=/temp/sitemap_gen-1.4/config.xml
Reading configuration file: /temp/sitemap_gen-1.4/config.xml
Opened ACCESSLOG file: /temp/access.log
[WARNING] Discarded URL for not starting with the base_url: http://ghome.asp?
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap.xml.gz" with 50000 URLs
C:\Program Files\Python25\lib\urlparse.py:188: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
cached = _parse_cache.get(key, None)
C:\Program Files\Python25\lib\urlparse.py:220: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
_parse_cache[key] = v
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap1.xml.gz" with 50000 URLs
[WARNING] Discarded URL for not starting with the base_url: http://ghome.asp?
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap2.xml.gz" with 10985 URLs
Writing index file "\Inetpub\development\sitemap_index.xml" with 3 Sitemaps
Search engine notification is suppressed.
Count of file extensions on URLs:
110962 .asp
23 .aspx
Number of errors: 0
Number of warnings: 525

Interpreting this output, we learn the following things:
1) there is a 50,000 URL limit to the sitemap files
2) some URLs will get bounced if they don't begin with your base_url
This is the "[WARNING] Discarded URL for not starting with the base_url" line.
3) some Unicode comparison warnings will occur (these are counted as warnings, not errors)
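
The first two points can be checked with a little arithmetic. The sketch below uses the counts from my output (110962 .asp plus 23 .aspx URLs) and the 50,000 limit to reproduce the three-file split, and shows the kind of prefix test that would reject the "http://ghome.asp?" URL. The function names are made up, and the prefix test is my guess at what the warning means, not code taken from sitemap_gen.py.

```python
MAX_URLS_PER_SITEMAP = 50000  # the limit seen in the output above


def plan_sitemaps(total_urls, limit=MAX_URLS_PER_SITEMAP):
    """Return how many URLs land in each sitemap file."""
    full, rest = divmod(total_urls, limit)
    return [limit] * full + ([rest] if rest else [])


def starts_with_base(url, base_url):
    """True if the URL begins with base_url (assumed meaning of the warning)."""
    return url.startswith(base_url)


print(plan_sitemaps(110962 + 23))  # [50000, 50000, 10985]
print(starts_with_base("http://ghome.asp?", "http://www.mywebsite.com/"))  # False
```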

In terms of speed, here are the performance stats of sitemap_gen.py on my dual Xeon3.2 workstation with 3.2GB of RAM:
- 10 minutes to parse a 2,000,000 record logfile (about 1.9GB in size)
- 1.3GB of RAM used

Troubleshooting
Be prepared to encounter failures like the one below if you don't have enough memory:
bash-2.02$ python sitemap_gen.py --config=/temp/sitemap_gen-1.4/config.xml
Reading configuration file: /temp/sitemap_gen-1.4/config.xml
Opened ACCESSLOG file: /temp/access.log
Traceback (most recent call last):
  File "sitemap_gen.py", line 2203, in <module>
    sitemap.Generate()
  File "sitemap_gen.py", line 1775, in Generate
    input.ProduceURLs(self.ConsumeURL)
  File "sitemap_gen.py", line 1115, in ProduceURLs
    for line in file.readlines():
MemoryError

If you do run out of memory, reduce the size of your logfile. In the timed run above, a logfile of about 2GB took about 1.3GB of physical memory to process, or roughly 66% of the logfile's size. My earlier note, however, recorded about 1.9GB of memory for a 1.3GB logfile, so to be safe, budget about 1.5 times the logfile's size in RAM (1.9/1.3 ≈ 1.5).
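
The traceback points at file.readlines(), which builds a list of every line in the file before the loop starts. For your own log-munging scripts, iterating over the file object reads one line at a time instead. The sketch below shows the pattern; count_matching is a made-up name, io.StringIO stands in for a real logfile, and this is not a tested patch for sitemap_gen.py, which may keep other data in memory as well.

```python
import io


def count_matching(lines, needle):
    """Count lines containing needle, consuming the iterable lazily."""
    return sum(1 for line in lines if needle in line)


# A file object is a lazy iterator over its lines, so
#     for line in f:              reads one line at a time, while
#     for line in f.readlines():  builds the full list in memory first.
sample = io.StringIO("GET /home.asp\nGET /logo.gif\nGET /store.asp\n")
print(count_matching(sample, ".asp"))  # 2
```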

Also, if you neglect to have a #Fields record header in your access log, you'll get a warning like this:
[WARNING] No URLs were recorded, writing an empty sitemap.

Finally, if you point to a missing logfile, you'll get an error like this:
[ERROR] Can not locate file: /temp/access.log

5) added the sitemap I generated to Google Webmaster tools
OK. This is the last step and a fairly simple one:
1) Upload the sitemap(s) to your website, and
2) Point the Google Webmaster tools at them.


Update 2009/11/17
Actually, the best workaround for memory errors is to split the logfile into pieces of no more than 50,000 lines each using the Unix "split" command:
http://www.worldwidecreations.com/google_sitemap_generator_memoryerror_workaround_fix.htm

This is the best solution because the Python script accepts the "*" wildcard for multiple input files.
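
If you don't have the Unix tools installed, the same splitting can be sketched in Python. One difference from plain "split" is my own assumption: since the script looks for the #Fields header, this version copies that header to the top of every piece. The names split_log and write_pieces and the output file naming are made up for illustration.

```python
def split_log(lines, chunk_size=50000):
    """Yield lists of at most chunk_size data lines, each list preceded by
    the most recent #Fields header seen (if any)."""
    header = None
    chunk = []
    for line in lines:
        if line.startswith("#Fields"):
            header = line  # remember it; it is repeated in every piece
            continue
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield ([header] if header else []) + chunk
            chunk = []
    if chunk:
        yield ([header] if header else []) + chunk


def write_pieces(src_path, prefix, chunk_size=50000):
    """Write the pieces to prefix000.log, prefix001.log, ... and return
    the list of paths written."""
    paths = []
    with open(src_path, "r", encoding="utf-8", errors="replace") as src:
        for n, piece in enumerate(split_log(src, chunk_size)):
            path = "%s%03d.log" % (prefix, n)
            with open(path, "w", encoding="utf-8") as dst:
                dst.writelines(piece)
            paths.append(path)
    return paths
```

With the pieces written under a common prefix, the accesslog path in config.xml can then point at them with the "*" wildcard.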

That's it!
'sodo

5 comments:

Ian Ozsvald said...

Hi Cacasodo, it is great to see that you find our series useful.
Just to note - the first 3 videos in my Python on XP series show the user how to configure Python 2.5 on Windows. Note that the rest of that series needs to be purchased to view but the first 3 are free.
I have another video which gives guidance on setting up the Windows Path for Python and, for first-timers, here I show a new user how they might start in their first 5 minutes with Python.
Cheers,
Ian (ShowMeDo co-founder)

Cacasodo said...

Thanks Ian,
I will check out these videos. I'm sure my readers will appreciate the links as well. Good luck!
scott

Jafo232 said...

If you are getting the MemoryError problem, there is a workaround. You can find it here:

http://www.worldwidecreations.com/google_sitemap_generator_memoryerror_workaround_fix.htm

Cacasodo said...

Thanks Jafo!
That's a good link. Their solution is essentially the same as mine: reduce the size of the logfile. They just use the Unix "split" command to do this. Great idea! They also give this hint:

"It can now handle any size log file. Now if you still get the MemoryError error, try reducing the size of the split files to less than 50000."

nice.
'sodo

Anonymous said...

hi. Thanks for article. If you need to create XML, HTML sitemap files for your sites or to notify search engines about updated sitemap files use Sitemap Writer Pro. I tried different software for creating the sitemap for my website. This is the most effective one among 5-6 programs I tried.
This powerful webtool generates the sitemap for your website in just seconds. Major search engines (Google, Yahoo, Ask, MSN, Moreover) are able to index and read your sitemaps without any problems.

Feel free to drop me a line or ask me a question.