As I wanted Google to better index the website I administer, I needed to create a sitemap. Here is a nice description of the benefits of having a sitemap and the Sitemap XML-based protocol in general:
https://www.google.com/webmasters/tools/docs/en/protocol.html
You'd think creating a sitemap would be a fairly simple, mundane task. It is and it isn't. You need to know how your website is organized on the web server's file system. And to do a sitemap correctly for a large site with more than 500 pages of web-accessible content, you need to either spend a bit of money on some software to create a sitemap (see
http://www.xml-sitemaps.com/ or
http://www.auditmypc.com/free-sitemap-generator.asp) or get your hands dirty using Google's free Python script to generate a sitemap.
The second choice is the more complicated and time-consuming of the two, as it involves the following steps:
- install Python
- setup a config file to point at a log or your webroot to get a list of your content
- run a Python script to generate the sitemap
- tell Google you've got a new sitemap
Being the frugal masochist that I am, I chose the second option. The steps are described in great detail here:
https://www.google.com/webmasters/tools/docs/en/sitemap-generator.html
I will try to add value by reviewing the stumbling blocks I encountered along the way to building my first sitemap. Here are the general steps I performed:
1) downloaded the sitemap generator files from Sourceforge
2) created a configuration file for my website
3) downloaded and installed Python from the official website
4) ran sitemap_gen.py
5) added the sitemap I generated to Google Webmaster tools
I suggest you run these steps from a server with a development instance of your current website running.
1) downloaded the sitemap generator files from Sourceforge
Since our website runs on Windows 2000 Server, I grabbed the ZIP version of the sitemap generator files from Sourceforge. I unzipped them to a temporary directory.
2) created a configuration file for my website
I chose access logs as the source
This was a time-consuming step, as I first had to figure out whether I wanted to generate my sitemap based on URL or directory listings, access logs, or another sitemap. As our website is made up mostly of dynamic pulls from our database via ASP/ASP.NET pages, I felt that using a sample access log from one of the servers in our farm would give me the most reliable source of data for the sitemap.
I reduced the size of my access log
This part ended up being the most time-consuming, because our daily web logs are about 8GB each. Most of the requests for our website are images, so I wrote a simple AWK script to extract only the 10-15% of requests that are ASP-related.
***
Be aware that you will get memory errors if you don't have enough RAM installed. I found that the script used about 1.3GB of memory to analyze a 1.9GB logfile, so plan on having free RAM equal to roughly two-thirds of your logfile's size (1.3/1.9 ≈ 0.7).
***
In my logfiles, the seventh column is the cs-uri-stem:
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
Here is the AWK command to keep only the lines whose seventh column (the code file path) is ASP-related:
awk '{if ($7 ~ /asp/) print $0}' access.log > access2.log
I use the Cygwin Unix tools for NT. These tools are invaluable for munging through large logfiles.
The sitemap_gen.py script is going to look for the #Fields header in the log, so I modified my awk command to include it:
awk '{if ($7 ~ /asp/ || $1 ~ /#Fields/) print $0}' access.log > access2.log
OK! So that trimmed my weblog down to about 1.5GB. This should be enough data to make a valid sitemap file!
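To see what that filter actually keeps, here is the same awk command run against a tiny invented sample log (the log lines below are made up; only the #Fields layout matches my real logs):

```shell
# Build a three-line sample log: a #Fields header, one .asp request,
# and one image request (all invented for illustration)
cat > sample.log <<'EOF'
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status
2009-11-17 12:00:01 1.2.3.4 - 5.6.7.8 GET /products/list.asp id=5 200
2009-11-17 12:00:02 1.2.3.4 - 5.6.7.8 GET /images/logo.gif - 200
EOF
# Keep lines whose 7th field contains "asp", plus the #Fields header
awk '{if ($7 ~ /asp/ || $1 ~ /#Fields/) print $0}' sample.log > sample2.log
wc -l < sample2.log   # 2 lines survive: the header and the .asp request
```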
I also used some filters to reduce the amount of redundant data in the logfile
My awk script should have removed any gif or jpeg requests, but just in case, I went ahead and added those to the filter section in config.xml, in addition to dropping any requests for our store locator.
In the end, here is what my config.xml file looked like:
<?xml version="1.0" encoding="UTF-8"?>
<site base_url="http://www.mywebsite.com/"
store_into="/Inetpub/development/sitemap.xml.gz" verbose="1"
suppress_search_engine_notify="1">
<accesslog path="/temp/access.log" encoding="UTF-8" />
<filter action="drop" type="wildcard" pattern="*.jpg" />
<filter action="drop" type="wildcard" pattern="*.gif" />
<filter action="drop" type="regexp" pattern="/storeLocator/" />
</site>
3) downloaded and installed Python from the official website
Since I didn't have Python installed on my development server, I went ahead and followed this nice little
instructional video on Installing Python from ShowMeDo.com. The process was very easy: point, click, and accept the defaults. But the video made it even easier.
The one other thing you should do is add the Python executable to your PATH. In Windows, I added C:\Python25\ to my PATH system variable. The PATH system variable is found in My Computer -> Properties -> Advanced tab -> Environment Variables.
4) ran sitemap_gen.py
Alright! The moment of truth is upon me. I ran the Python script from the command line:
bash-2.02$ python sitemap_gen.py --config=/temp/sitemap_gen-1.4/config.xml
Reading configuration file: /temp/sitemap_gen-1.4/config.xml
Opened ACCESSLOG file: /temp/access.log
[WARNING] Discarded URL for not starting with the base_url: http://ghome.asp?
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap.xml.gz" with 50000 URLs
C:\Program Files\Python25\lib\urlparse.py:188: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  cached = _parse_cache.get(key, None)
C:\Program Files\Python25\lib\urlparse.py:220: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  _parse_cache[key] = v
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap1.xml.gz" with 50000 URLs
[WARNING] Discarded URL for not starting with the base_url: http://ghome.asp?
Sorting and normalizing collected URLs.
Writing Sitemap file "\Inetpub\development\sitemap2.xml.gz" with 10985 URLs
Writing index file "\Inetpub\development\sitemap_index.xml" with 3 Sitemaps
Search engine notification is suppressed.
Count of file extensions on URLs:
 110962  .asp
     23  .aspx
Number of errors: 0
Number of warnings: 525

Interpreting this output, we learn the following things:
1) there is a 50,000 URL limit per sitemap file
2) some URLs will get bounced if they don't begin with your base_url
This is the "[WARNING] Discarded URL for not starting with the base_url" line.
3) some Unicode parsing warnings will occur
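The three output files follow directly from that 50,000-URL cap; a quick shell check of the arithmetic:

```shell
# 110,985 collected URLs (110,962 .asp + 23 .aspx), at most 50,000 per sitemap file
total=110985
limit=50000
echo "$(( total / limit )) full sitemaps + $(( total % limit )) URLs in the last file"
# prints: 2 full sitemaps + 10985 URLs in the last file
```

That matches the run above: sitemap.xml.gz and sitemap1.xml.gz with 50,000 URLs each, and sitemap2.xml.gz with the remaining 10,985.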
In terms of speed, here are the performance stats of sitemap_gen.py on my dual Xeon 3.2 workstation with 3.2GB of RAM:
- 10 minutes to parse a 2,000,000-record logfile (about 1.9GB in size)
- 1.3GB of RAM used
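Those two numbers imply the script needs RAM equal to roughly two-thirds of the logfile size:

```shell
# 1.3GB of RAM for a 1.9GB logfile works out to about 0.68
awk 'BEGIN { printf "%.2f\n", 1.3 / 1.9 }'   # prints 0.68
```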
Troubleshooting
Be prepared to encounter failures like the one below if you don't have enough memory:
bash-2.02$ python sitemap_gen.py --config=/temp/sitemap_gen-1.4/config.xml
Reading configuration file: /temp/sitemap_gen-1.4/config.xml
Opened ACCESSLOG file: /temp/access.log
Traceback (most recent call last):
  File "sitemap_gen.py", line 2203, in <module>
    sitemap.Generate()
  File "sitemap_gen.py", line 1775, in Generate
    input.ProduceURLs(self.ConsumeURL)
  File "sitemap_gen.py", line 1115, in ProduceURLs
    for line in file.readlines():
MemoryError
In case you do run out of memory, reduce the size of your logfile. As the performance stats above show, a 1.9GB logfile took about 1.3GB of physical memory to process, so sitemap_gen.py needs physical memory equal to roughly two-thirds (about 66%) of the total size of the logfile you are analyzing.
Also, if you neglect to have a #Fields record header in your access log, you'll get a warning like this:
[WARNING] No URLs were recorded, writing an empty sitemap.
Finally, if you point to a missing logfile, you'll get an error like this:
[ERROR] Can not locate file: /temp/access.log
5) added the sitemap I generated to Google Webmaster tools
OK. This is the last step and a fairly simple one:
1) Upload the sitemap(s) to your website and
2) Point the Google Webmaster tools at it:

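Before uploading, it's worth a quick sanity check that the gzip file is intact (the sitemap XML here is a minimal invented stand-in, not real sitemap_gen.py output):

```shell
# Write a tiny stand-in sitemap and gzip it, as sitemap_gen.py does
printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>' \
  '<urlset><url><loc>http://www.mywebsite.com/home.asp</loc></url></urlset>' > sitemap.xml
gzip -f sitemap.xml        # leaves sitemap.xml.gz
gzip -t sitemap.xml.gz     # exit status 0 means the archive is intact
gunzip -c sitemap.xml.gz | head -1
```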
Update 2009/11/17
Actually, the best workaround for memory errors is to limit each logfile to 50,000 lines by using the Unix "split" command:
http://www.worldwidecreations.com/google_sitemap_generator_memoryerror_workaround_fix.htm
This is the best solution because the Python script can accept the "*" wildcard for multiple file inputs.
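A sketch of that workaround (seq stands in for a real trimmed weblog here):

```shell
# Split a trimmed log into 50,000-line pieces: access2_aa, access2_ab, ...
seq 1 120000 > access2.log
split -l 50000 access2.log access2_
wc -l access2_a*   # two 50,000-line pieces plus a 20,000-line remainder
```

The accesslog path in config.xml can then point at all the pieces with a wildcard, e.g. path="/temp/access2_*".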
That's it!
'sodo